# Classification, Pt 2

Classification, a method popular in machine learning, determines whether and how a model can distinguish between sets of text.

In the previous lesson, we learned how to:
* prepare a dataframe with data and metadata for scikit-learn's classifier models
* perform leave-one-out classification with a logistic regression model
* begin to analyze results and calculate accuracy

In this lesson, we will learn how to:
* investigate the model's features in terms of p-values and logistic regression weights

## Review

We will quickly review what we learned from last time. For fuller explanations, see the previous notebook.

## Imports

As always, we begin with some imports.

In [None]:
import pandas as pd
import glob
from pathlib import Path
from pandas import Series, DataFrame
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import pearsonr, norm

## Corpus

For this notebook, we'll return to our corpus of _New York Times_ obituaries.

In [None]:
# For downloading large files from Google Drive
import gdown

# Download the zip file
gdown.download('https://drive.google.com/uc?export=download&id=1G0Aeg8dzZGPOCNFZ77U-s9CfEnR8efbB', quiet=False)

In [None]:
# unzip it
!unzip NYT-Obituaries.zip

In [None]:
# collect filepaths as files
directory = "./NYT-Obituaries/"
files = glob.glob(f"{directory}*.txt") # directory already ends with a slash

len(files)

In [None]:
# and collect obit titles, which are also the final section of the filepaths
obit_titles = [Path(file).stem for file in files]
obit_titles

## Create document-term matrix

### Initiate CountVectorizer as vectorizer

In [None]:
# another sklearn library to help load stopwords
from sklearn.feature_extraction import text

# Download the stopwords file
gdown.download('https://drive.google.com/uc?export=download&id=1BQ8zVSiG_WKpXNB81Y1P9yyi2L43UeJD', quiet=False)

# open it and combine its words with sklearn's built-in English stopwords
with open('./jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words))

# create dtm
corpus_path = './NYT-Obituaries/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')

### Make list of filepaths

In [None]:
corpus = []
for title in obit_titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)

### Get feature names and set as column titles

In [None]:
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))

Our dataframe has 378 rows, one for each document, or obituary, and 2985 columns, one for each word that's not in stopwords and appears in at least 20 documents in the corpus (`min_df=20` counts documents, not total occurrences).

## Import metadata

In [None]:
gdown.download('https://drive.google.com/uc?export=download&id=1Pca0n-vWTy_FcF0oKWt9iHwxuBrDnxfd', quiet=False)

meta = pd.read_csv("./NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta.rename(columns={'title': 'obit_title'})
meta = meta[["obit_title", "gender", "date"]]
meta

Our metadata is stored as a pandas dataframe with a row for each obituary and three columns: title, gender, and date.

## Concatenate metadata and doc-term dataframe

In [None]:
df_concat = pd.concat([meta, df], axis = 1)

In [None]:
df_concat.head()
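
`pd.concat(..., axis=1)` pairs rows by index position, so it relies on the metadata rows and the document-term rows being in the same order. If you are unsure whether they are, one option is to join on the title instead. The following is a minimal sketch on invented toy data (the `toy_` names and titles are hypothetical), not a step this notebook performs.

```python
import pandas as pd

# document-term rows in the (hypothetical) order the files were read
toy_dtm = pd.DataFrame({"music": [1, 0, 2], "war": [0, 3, 1]})
toy_dtm.insert(0, "obit_title", ["b_obit", "a_obit", "c_obit"])

# metadata rows in a different order
toy_meta = pd.DataFrame({"obit_title": ["a_obit", "b_obit", "c_obit"],
                         "gender": [1, 0, 1]})

# join on the shared title column; validate raises an error if a title is duplicated
aligned = toy_meta.merge(toy_dtm, on="obit_title", how="inner", validate="one_to_one")
print(aligned)
```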

## Equalize numbers of men and women

We want our dataframe to have equal numbers of men and women. How many women are there? Women are counted as 1 and men as 0, so if we sum the gender column, we'll have the number of women:

In [None]:
meta['gender'].sum()

Then we separate men and women into two dataframes and take a random sample of 93 obituaries about men.

In [None]:
df_men = df_concat[df_concat['gender'] == 0]
df_women = df_concat[df_concat['gender'] == 1]
df_men = df_men.sample(n=93)

We then concatenate the sampled men dataframe with the women dataframe and reset the index.

In [None]:
df_final = pd.concat([df_men, df_women])
df_final = df_final.reset_index(drop=True)
df_final

We now have 186 rows: 93 men, 93 women.
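
The balancing step can be sketched on toy data as follows. Two choices here are assumptions rather than part of the lesson: the sample size is computed from the data instead of typed in, and `random_state` is set so that the same men are sampled on every run.

```python
import pandas as pd

# toy data: 7 men (0) and 3 women (1)
toy = pd.DataFrame({"gender": [0] * 7 + [1] * 3, "word": range(10)})

n_women = int(toy["gender"].sum())  # women are coded as 1, so the sum is their count
toy_men = toy[toy["gender"] == 0].sample(n=n_women, random_state=0)
toy_women = toy[toy["gender"] == 1]

toy_balanced = pd.concat([toy_men, toy_women]).reset_index(drop=True)
print(toy_balanced["gender"].value_counts())
```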

### Match meta and data dataframes with subset of df_final

We'll continue to use meta and df, so we need to ensure they match our subsetted df_final.

In [None]:
meta = df_final[["obit_title", "gender", "date"]].copy() # copy so we can safely add columns to meta later
meta

In [None]:
df = df_final.loc[:,'000':]
df

## Let's run our classifier!

Once we have a dataframe with metadata and vocab counts we're ready to run our classifier!

### We add columns for probabilities and predicted class to our metadata

As we run the model, we are going to store its output with our metadata. This will allow us to easily examine the model's output.

In [None]:
meta['PROBS'] = ''
meta['PREDICTED'] = ''

### Load model

We will use scikit-learn's `LogisticRegression` model. There are many other options for classifier models; some are better for some tasks, others for others. Logistic regression is a common choice for classifying literary texts. We set the penalty to `l1` and the `C` value to 1.0, and we use the `liblinear` solver, which supports the l1 penalty. If you decide to specialize in classification, you can explore further the implications of these arguments.

In [None]:
model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')
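
To give a rough sense of what these arguments do, here is a sketch on synthetic data (everything prefixed `toy_`, and the helper `count_nonzero_weights`, is invented for illustration). With an l1 penalty, a smaller `C` means stronger regularization, which tends to push more feature weights to exactly zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: 200 "documents", 20 "features"; only the first two matter
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 20))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

def count_nonzero_weights(c_value):
    """Fit an l1-penalized model and count the features with a nonzero weight."""
    clf = LogisticRegression(penalty='l1', C=c_value, solver='liblinear')
    clf.fit(toy_X, toy_y)
    return int(np.count_nonzero(clf.coef_[0]))

n_strong_penalty = count_nonzero_weights(0.001)  # small C: strong penalty
n_weak_penalty = count_nonzero_weights(10.0)     # large C: weak penalty
print(n_strong_penalty, n_weak_penalty)
```

This sparsity is why many features end up with a weight of zero when we inspect the model's coefficients.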

### Run the model!

We run the model in the following for-loop.

Classification models need classes: they need the texts grouped into different sets. Our metadata has built-in classes: gender. Men are stored as 0; women as 1. We could, if we wanted, create a new 0/1 class based on year.

Each iteration trains on all the titles except one, then predicts which class the excluded title belongs to. We'll call this leave-one-out classification. There are other ways of dividing training and testing sets, which we won't explore today.

The first four indented lines simply track our progress by printing index, title, and class. The next four lines exclude a single title, and set the training data and the test data.

The final six lines fit the model, calculate the probabilities and predicted class of the test case, and add that information to our metadata dataframe.

In [None]:
for this_index in df_final.index.tolist():
    print(this_index) # keep track of where we are in the corpus
    title = meta.loc[meta.index[this_index], 'obit_title']
    CLASS = meta.loc[meta.index[this_index], 'gender']
    print(title, CLASS)

    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'gender'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a single value, not a one-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')
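
The loop above spells out leave-one-out classification by hand. scikit-learn also ships a `LeaveOneOut` splitter that can be combined with `cross_val_predict` to the same effect. The sketch below runs on synthetic data (the `toy_` names are invented), and it returns only the predicted classes, not the probabilities we store above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# synthetic word counts: 30 "documents", 5 "words", alternating classes
rng = np.random.default_rng(0)
toy_X = rng.poisson(1.0, size=(30, 5)).astype(float)
toy_y = np.array([0, 1] * 15)
toy_X[toy_y == 1, 0] += 3  # make the first word more frequent in class 1

toy_model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')

# each document is predicted by a model trained on the other 29
toy_pred = cross_val_predict(toy_model, toy_X, toy_y, cv=LeaveOneOut())
toy_accuracy = (toy_pred == toy_y).mean()
print(toy_accuracy)
```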

## Results

Remember, we've stored our results in our metadata dataframe. Let's take a look!

In [None]:
meta

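Let's make sure the PREDICTED column holds plain integers, with no brackets around them.

In [None]:
# each prediction may be stored as a one-element array such as [0]; unwrap it to a plain integer
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
meta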

### Result column

Now we can add a 'RESULT' column that is the result of subtracting the predicted gender from the actual gender.

* 0 means the model was correct.
* -1 means the model mistook a man for a woman.
* 1 means the model mistook a woman for a man.

In [None]:
sum_column = meta['gender'] - meta['PREDICTED']
meta['RESULT'] = sum_column
meta

Let's look at the accurate guesses.

In [None]:
meta_correct = meta[meta['RESULT'] == 0]
meta_correct

How many did the model get correct?

We can calculate its accuracy by dividing the correct number by the total.

In [None]:
len(meta_correct) / len(meta)
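
Accuracy is a single number, and it can help to see the two kinds of error side by side. Here is a small sketch on invented labels (the `toy_` names are hypothetical) that rebuilds the RESULT column and tabulates actual against predicted gender with `pd.crosstab`.

```python
import pandas as pd

# invented actual and predicted classes for six obituaries
toy_meta = pd.DataFrame({"gender":    [0, 0, 1, 1, 1, 0],
                         "PREDICTED": [0, 1, 1, 0, 1, 0]})

# 0 = correct, -1 = man mistaken for a woman, 1 = woman mistaken for a man
toy_meta["RESULT"] = toy_meta["gender"] - toy_meta["PREDICTED"]
toy_accuracy = (toy_meta["RESULT"] == 0).mean()

# rows are actual classes, columns are predicted classes
toy_confusion = pd.crosstab(toy_meta["gender"], toy_meta["PREDICTED"])
print(toy_accuracy)
print(toy_confusion)
```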

Hmm...

## P-values and weights

The model's accuracy, its classifications and misclassifications, tells us a lot.

But we can learn much more by exploring the features, that is, the words, that help the model make its classifications.

Which words are most likely to tip off the model that a given obituary is about a man or that another is about a woman?

### Z-test

We write a function to perform a Z-test and calculate p-values. We'll use these to determine the statistical significance of individual features.

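(Our null hypothesis here is that a word's mean count is the same in obituaries about men and in obituaries about women. The z-test assumes that the difference between the two sample means is approximately normally distributed.)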

**NB**: _As many of you might know, p-values have come under scrutiny in recent years as a measure of significance. The standard threshold for significance (0.05) is arbitrary. Its meaning is debated. But its authority for a long time incentivized what came to be called p-hacking: the practice of manipulating one's work so it would pass the 0.05 threshold and count as significant. Very recently statisticians and other scholars have argued for abandoning the term "statistical significance," arguing that it reduces the complexity of determining whether a given result is meaningful in context. **tl;dr: don't put too much stock in significance; consider the holistic context of results**_

In [None]:
canonic_c = 1.0

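def Ztest(vec1, vec2):
    """Two-sample z-test: return (z, two-sided p-value) for the difference in means."""

    X1, X2 = np.mean(vec1), np.mean(vec2) # group means
    sd1, sd2 = np.std(vec1), np.std(vec2) # group standard deviations
    n1, n2 = len(vec1), len(vec2) # group sizes

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2) # standard error of the difference in means
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z))) # two-sided p-value from the standard normal

    return z, pval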
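
For reference, the quantities computed by `Ztest` can be written out as follows, where $\bar{x}_k$, $s_k$, and $n_k$ are the mean, standard deviation, and size of group $k$, and $\Phi$ is the standard normal cumulative distribution function (`norm.sf` is $1 - \Phi$). Note that `np.std` defaults to the population standard deviation (`ddof=0`).

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$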

### P-values and logistic regression weights

We write a function that takes our metadata and our data as dataframes. It calculates p-values using the Z-test function. It reruns the logistic regression model with all the data. The model makes its predictions by giving each feature, each word, a weight that pulls a text toward the class of man or woman. After running the model, we can draw out the feature weights, or coefficients, with `clf.coef_[0]`.

Then we build a pandas dataframe whose rows are features and whose columns are p-values and logistic regression weights. This function returns that dataframe.

In [None]:
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['gender']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['gender']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
    clf.fit(dtm_df_, meta_df_['gender']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df

### Let's take it for a spin!

In [None]:
feat_df = feat_pval_weight(meta, df)
feat_df

We can't see much, here. It'll be easier to see what's happening if we sort.

In [None]:
feat_df.sort_values('LR_WEIGHT', ascending = True)

This is more interesting! Because the function fits the model on `gender == 0`, the positive class here is "man": negative LR weights pull an obituary toward the class of woman, and positive LR weights pull it toward the class of man.
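
The direction of the weights depends on how the target is encoded. In a binary scikit-learn model, `coef_[0]` describes the second entry of `classes_`. Here is a sketch with two invented "words" (`word_a`, `word_b`) and toy counts, fitted on `gender == 0` in the same way as our function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy counts: word_a is frequent for men (0), word_b is frequent for women (1)
toy_X = np.array([[3, 0], [4, 0], [5, 1],
                  [0, 3], [1, 4], [0, 5]], dtype=float)
toy_gender = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
toy_clf.fit(toy_X, toy_gender == 0)  # True means "man"

print(toy_clf.classes_)  # the second entry is the class that positive weights point to
toy_weights = dict(zip(["word_a", "word_b"], toy_clf.coef_[0]))
print(toy_weights)
```

Had we fitted on `gender` itself (where 1 means woman), the signs would be reversed.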

We can control how many rows we see by passing a number to the `.head()` method.

In [None]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

In [None]:
feat_df.sort_values('LR_WEIGHT', ascending = False).head(20)

### Exercise

What can you infer from the top ten features for each?

ANSWER
* *
* *
* *
* *
* *
* *

### More filtering

We can filter further by setting a p-value threshold. For our purposes here, we start from a deliberately lenient value (50) and, to account for the many features we are testing at once, divide it by the number of features.

**Remember** significance is a contested concept. What most matters is understanding the meaning of numbers in context. For us, any feature with a logistic regression weight gives us useful information, and p-values help us understand just how robust that feature is.

In [None]:
# to account for the many features / multiple hypothesis tests
sig_thresh = 50.00 / len(df.columns)
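
To put that threshold in perspective, the arithmetic can be worked through with the 2985 features reported earlier. The comparison with a conventional Bonferroni correction (0.05 divided by the number of tests) is added here for illustration and is not part of the lesson's method.

```python
n_features = 2985  # number of columns reported for our document-term matrix

lesson_thresh = 50.00 / n_features     # the lenient threshold used in this lesson
bonferroni_thresh = 0.05 / n_features  # conventional 0.05 level, Bonferroni-corrected

print(f"{lesson_thresh:.5f}")
print(f"{bonferroni_thresh:.2e}")
```

The lesson's threshold comes to roughly 0.017, a thousand times more permissive than the Bonferroni-corrected one, which suits exploratory reading of features rather than strict hypothesis testing.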

We then filter our data by p-values that pass that threshold, sorted by weights. We get rid of features whose weight is zero, and separate positive and negative weights.

In [None]:
out = feat_df[(feat_df['P_VALUE'] <= sig_thresh)].sort_values('LR_WEIGHT', ascending = True)
out = out[out['LR_WEIGHT'] != 0]
outM = out[out['LR_WEIGHT'] > 0] # positive weights point toward men
outW = out[out['LR_WEIGHT'] < 0] # negative weights point toward women

Then we pass the remaining features in each dataframe to a list and print them out.

In [None]:
outM = outM['FEAT'].tolist()
print("Here are significant words that distinguish men: " + str(outM))
outW = outW['FEAT'].tolist()
print("Here are significant words that distinguish women: " + str(outW))

It can be easier to explore features in a CSV outside of Colab. So we write out and move offline.

In [None]:
from google.colab import files as colab_files # aliased so it doesn't overwrite our earlier `files` list

feat_df.to_csv('features_obits.csv')

colab_files.download('features_obits.csv')