# Classification, Pt 1

Classification, a method popular in machine learning, determines whether and how a model can distinguish between sets of text.

It works like this. Everyone with email relies on classification to separate spam from legitimate emails. Email providers train classification models to recognize the difference by giving them emails they have labeled “spam” and “not spam.” They then ask the model to learn the features that most reliably distinguish the two types, which could include a preponderance of all caps or phrases like “free money” or “get paid.” They test the model by giving it unlabeled emails and asking it to classify them. If the model can do it accurately a high percentage of the time, that’s a good spam filter.
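
To make the spam example concrete, here is a minimal sketch of the same workflow on a handful of made-up emails. The emails, labels, and variable names are all invented for illustration; a real spam filter trains on vastly more data and likely uses a different model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training emails with hand-assigned labels (1 = spam, 0 = not spam)
train_emails = [
    "FREE MONEY get paid now",
    "get paid fast free money offer",
    "claim your free money today",
    "meeting moved to Thursday afternoon",
    "please review the attached report",
    "lunch on Thursday to discuss the report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn each email into a row of word counts (lowercased by default)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Learn which words distinguish the two classes
model = LogisticRegression()
model.fit(X_train, train_labels)

# Classify emails the model has not seen before
new_emails = [
    "free money if you get paid today",
    "can we review the report on Thursday",
]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
print(predictions)
```

The notebook below follows the same steps (vectorize, fit, predict), only with obituaries instead of emails and gender instead of spam as the class.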

We can take the underlying idea and apply it to many experiments.

## Today's experiment

We are going to use a corpus of obituaries from _The New York Times_ (Halloween / Dia de los Muertos appropriate!) in order to test whether our model can learn to distinguish between obituaries about men and women.

## Imports

As always, we begin with some imports.

In [None]:
import pandas as pd
import glob
from pathlib import Path
from pandas import Series, DataFrame
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import pearsonr, norm

## Corpus 

For this notebook, we'll return to our corpus of _New York Times_ obituaries.

In [None]:
# collect filepaths as files
directory = "../corpora/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")

In [None]:
# and collect obit titles, which are also the final section of the filepaths
obit_titles = [Path(file).stem for file in files]
obit_titles

## Create document-term matrix

### Initiate CountVectorizer as vectorizer

Remember document-term matrices, aka doc-term matrices, aka dtms? We learned about them in our sklearn and tf-idf notebooks. Our classifier uses a dtm as its input. We build it with scikit-learn's CountVectorizer, which we already imported up above. 

When we load our vectorizer, we include an argument to encode as utf-8 and we load our stopwords. In this case, we're using a custom stopwords list rather than the default sklearn one. You may end up using a custom stopwords list in your final projects. 

In addition, we can set the minimum number of documents a word must appear in for it to be included in the dtm (the `min_df` argument). In this case, I've set it at 20.
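
As a quick illustration of how `min_df` behaves, here is a small sketch on three invented "documents" (the texts and the threshold of 2 are assumptions chosen only to keep the example tiny). It shows that `min_df` counts the number of documents containing a word, not the word's total number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented documents
docs = [
    "music music music music",  # "music" occurs four times, but only in this document
    "poet and teacher",
    "poet and painter",
]

# Keep only words that appear in at least 2 documents
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(docs)

kept_words = sorted(vectorizer.vocabulary_)
print(kept_words)
```

Here "music" is dropped even though it occurs four times, because all four occurrences are in a single document.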

In [None]:
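# load stopwords
from sklearn.feature_extraction import text
with open('../corpora/jockers_stopwords.txt') as text_file:  # the file is closed automatically
    jockers_words = text_file.read().split()
# recent scikit-learn versions expect stop_words as a list rather than a frozenset
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words))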

# create dtm
corpus_path = '../corpora/NYT-Obituaries/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')

### Make list of filepaths

If you recall, CountVectorizer builds a dtm from a list of filepaths. So we will provide that:

In [None]:
corpus = []
for title in obit_titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)

### Get feature names and set as column titles

The columns store word counts. We want to name the columns with the words stored in each, and to transform the dtm into a pandas dataframe, as follows:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))

Our dataframe has 378 rows, one for each document, or obituary, and 2985 columns, one for each word that's not in stopwords and appears in at least 20 documents.

### Pandas interlude ###

I saw [this](https://twitter.com/mmitchell_ai/status/1454931443386228751) on Twitter the other night, posted by Dr. Margaret Mitchell, the former co-lead of Google's EthicalAI group and now Big Deal at HuggingFace (the folx behind the transformer libraries we'll be using next week):

<img src="http://lklein.com/wp-content/uploads/2021/11/Screen-Shot-2021-11-01-at-10.14.24-AM.png" width=500px>

In any case, now it's time to import our metadata:

## Import metadata

In [None]:
meta = pd.read_csv("../corpora/NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta.rename(columns={'title': 'obit_title'})
meta = meta[["obit_title", "gender", "date"]]
meta

Our metadata is stored as a pandas dataframe with a row for each obituary and three columns: title, gender, and date.

Let's now concatenate the dtm to it so that everything is in one place. 

## Concatenate metadata and doc-term dataframe

We'll use the pandas `concat` function, specifying that the data should be concatenated as additional columns (that's the `axis = 1` parameter). The default would be `0`, which concatenates as additional rows.
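
To see the difference between the two `axis` values, here is a small sketch using two invented dataframes (`meta_toy` and `counts_toy` are hypothetical stand-ins for our metadata and dtm, not the real data).

```python
import pandas as pd

# Hypothetical stand-ins for the metadata and the doc-term dataframe
meta_toy = pd.DataFrame({"obit_title": ["A", "B"], "gender": [0, 1]})
counts_toy = pd.DataFrame({"music": [3.0, 0.0], "poet": [0.0, 2.0]})

# axis=1: place the dataframes side by side, adding columns
side_by_side = pd.concat([meta_toy, counts_toy], axis=1)
print(side_by_side.shape)

# axis=0 (the default): stack the dataframes, adding rows;
# columns missing from one dataframe are filled with NaN
stacked = pd.concat([meta_toy, counts_toy], axis=0)
print(stacked.shape)
```

Note that with `axis=1` pandas matches rows by their index labels, so the result only lines up correctly when both dataframes list the obituaries in the same order.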

In [None]:
df_concat = pd.concat([meta, df], axis = 1)

In [None]:
df_concat.head()

## Equalize numbers of men and women

We want our dataframe to have equal numbers of men and women. How many women are there? Women are counted as 1 and men as 0, so if we sum the gender column, we'll have the number of women:

In [None]:
meta['gender'].sum()

Then we separate men and women into two dataframes and take a random sample of 93 obituaries about men.

In [None]:
df_men = df_concat[df_concat['gender'] == 0]
df_women = df_concat[df_concat['gender'] == 1]
df_men = df_men.sample(n=93)

We then concatenate the sampled men dataframe with the women dataframe and reset the index.

In [None]:
df_final = pd.concat([df_men, df_women])
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
df_final

We now have 186 rows: 93 men, 93 women.
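
The balancing steps above can be sketched on a toy dataframe as follows. The data is invented, and `random_state=42` is an assumption added here so that the random sample is the same on every run; the notebook's own cell does not set one, so its sample of men changes each time it is executed.

```python
import pandas as pd

# Invented metadata: 7 men (0) and 3 women (1)
toy = pd.DataFrame({
    "obit_title": [f"obit_{i}" for i in range(10)],
    "gender": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Count the smaller class rather than hard-coding the number
n_women = int(toy["gender"].sum())

toy_women = toy[toy["gender"] == 1]
# Sample as many men as there are women; random_state makes the sample repeatable
toy_men = toy[toy["gender"] == 0].sample(n=n_women, random_state=42)

# reset_index(drop=True) resets the index and discards the old one in a single step
balanced = pd.concat([toy_men, toy_women]).reset_index(drop=True)
print(balanced["gender"].value_counts())
```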

### Match meta and data dataframes with subset of df_final

We'll continue to use `meta` and `df`, so we need to ensure they match our subsetted `df_final`.

In [None]:
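# .copy() gives us an independent dataframe, so adding columns later won't trigger a SettingWithCopyWarning
meta = df_final[["obit_title", "gender", "date"]].copy()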
meta

In [None]:
df = df_final.loc[:,'000':]
df

## Let's run our classifier!

Once we have a dataframe with metadata and vocab counts we're ready to run our classifier!

### We add columns for probabilities and predicted class to our metadata

As we run the model, we are going to store its output with our metadata. This will allow us to easily examine the model's output.

In [None]:
meta['PROBS'] = ''
meta['PREDICTED'] = ''

### Load model

We will use scikit-learn's `LogisticRegression` model. There are many other options for classifier models. Some are better for some tasks, others for others. LogisticRegression is standard for classifying literature. We set the penalty as `l1` and the `C` value as 1.0. If you decide to specialize in classification, you can explore further the implications of these arguments.
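
For the curious, here is a rough sketch of what those two arguments do, using synthetic data from scikit-learn's `make_classification` (the dataset and the particular `C` values are assumptions for illustration, not part of our corpus). The `l1` penalty pushes the weights of unhelpful features to exactly zero, and `C` controls how strongly: smaller values of `C` mean a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 samples, 50 features, only 5 of them informative
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# Fit the same model with different penalty strengths and
# count how many features keep a nonzero weight
nonzero_counts = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X, y)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

In a text setting, the features are words, so a stronger penalty means the model relies on fewer words to make its decisions.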

In [None]:
model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')

### Run the model!

We run the model in the following for-loop.

Classification models need classes: they need the texts grouped into different sets. Our metadata has built-in classes: gender. Men are stored as 0; women as 1. We could, if we wanted, create a new 0/1 class based on something else.

Each iteration trains on all the titles except one, then predicts which class the excluded title belongs to. We'll call this leave-one-out classification. It's helpful when you're working with a small dataset. There are other ways of dividing training and testing sets, which we won't explore today.

The first four indented lines simply track our progress by printing index, title, and class. The next four lines exclude a single title, and set the training data and the test data.

The final six lines fit the model, calculate the probabilities and predicted class of the test case, and add that information to our metadata dataframe.

In [None]:
for this_index in df_final.index.tolist():
    print(this_index) # keep track of where we are in the corpus
    title = meta.loc[meta.index[this_index], 'obit_title'] 
    CLASS = meta.loc[meta.index[this_index], 'gender']
    print(title, CLASS) 
    
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'gender'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted # add predicted class to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')

How cool is this! For each obituary, we see who it's about, that person's gender (0 or 1), which gender the model thinks it's about, and with what probabilities. 
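
The loop above spells out leave-one-out classification by hand. As a point of comparison, scikit-learn also ships helpers that do the same bookkeeping: `LeaveOneOut` and `cross_val_predict`. The sketch below uses them on an invented toy matrix (`X_toy` and `y_toy` are made up, with one "word" deliberately made more frequent in class 1), so it is an alternative illustration rather than the approach this notebook takes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)

# Toy "dtm": 20 documents, 5 word-count columns
y_toy = np.array([0] * 10 + [1] * 10)
X_toy = rng.poisson(lam=2.0, size=(20, 5)).astype(float)
X_toy[y_toy == 1, 0] += 5  # class 1 documents use word 0 more often

model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# For each document: train on the other 19, then predict the one left out
predicted = cross_val_predict(model, X_toy, y_toy, cv=LeaveOneOut())
probs = cross_val_predict(
    model, X_toy, y_toy, cv=LeaveOneOut(), method="predict_proba"
)

print(predicted)
print(probs[:3])  # one row per document: [P(class 0), P(class 1)]
```

Writing the loop by hand, as we did, has the advantage of showing exactly what happens at each step.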

## Results

Remember, we've stored our results in our metadata dataframe. Let's take a look!

In [None]:
meta

There's lots to look at here. We could explore probabilities: which obituaries is the model most sure about? Which are closest to 50-50? Which does it get most right and most wrong? Is there a pattern to misclassified obituaries?

For now, we just want to calculate its accuracy. Let's get rid of those brackets in the PREDICTED column.

In [None]:
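# each prediction was stored as a one-item array like [0]; unwrap it into a plain integer
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))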
meta

### Result column

Now we can add a 'RESULT' column that is the result of subtracting the predicted gender from the actual gender.

- `0` means the model was correct.
- `-1` means the model mistook a man for a woman.
- `1` means the model mistook a woman for a man.

In [None]:
sum_column = meta['gender'] - meta['PREDICTED']
meta['RESULT'] = sum_column
meta

Let's look at the accurate guesses.

In [None]:
# note that we're not wanting to rewrite the "meta" df here, just look at it
# so we won't reassign it
meta[meta['RESULT'] == 0]

## Accuracy

How many did the model get correct?

We can calculate its accuracy by dividing the correct number by the total.

In [None]:
# remember our filter approach from last week's pandas class
accuracy_filter =  # complete the rest here

# apply the filter
accurate_results = # complete the rest here 

# do our division 
# hint: you can just use the len() method to give you the length of a dataframe 

Pretty good rate! At random, the model should guess correctly 50% of the time. It does **much** better than that!
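
If you get stuck on the exercise above, here is one possible way to approach it, sketched on an invented five-row dataframe (`toy_meta` and its values are made up, so the number it prints says nothing about our actual model).

```python
import pandas as pd

# Invented results for five obituaries
toy_meta = pd.DataFrame({
    "gender":    [0, 0, 1, 1, 1],
    "PREDICTED": [0, 1, 1, 1, 0],
})
toy_meta["RESULT"] = toy_meta["gender"] - toy_meta["PREDICTED"]

# Build a True/False filter: True wherever the model was correct
accuracy_filter = toy_meta["RESULT"] == 0

# Apply the filter to keep only the correct rows
accurate_results = toy_meta[accuracy_filter]

# Divide the number of correct rows by the total number of rows
accuracy = len(accurate_results) / len(toy_meta)
print(accuracy)
```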

## BONUS - Plotting a confusion matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
actual = meta['gender'].array   # remember that pandas columns are Series objects, so we need to convert them
predicted = meta['PREDICTED'].array

cm = confusion_matrix(actual, predicted) 

print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['male','female'])

disp.plot()
plt.show()

**What is this showing us?**
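
As a hint for reading the plot, here is a small sketch with invented labels (the lists below are made up, not our results). In scikit-learn's `confusion_matrix`, each row is an actual class and each column is a predicted class, so the diagonal holds the correct predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = man, 1 = woman
actual_toy    = [0, 0, 0, 1, 1, 1]
predicted_toy = [0, 0, 1, 1, 1, 0]

cm_toy = confusion_matrix(actual_toy, predicted_toy)
print(cm_toy)

# Unpack the four cells, reading row by row
men_as_men, men_as_women, women_as_men, women_as_women = cm_toy.ravel()
print("Men classified as women:", men_as_women)
print("Women classified as men:", women_as_men)
```

The two off-diagonal cells correspond to the `-1` and `1` values in our RESULT column.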

That's all for now!

In the next lesson, we'll explore _how_ the model made its calculations by learning which words matter.