In [None]:
%load_ext autoreload
%autoreload 2
import lib
from collections import Counter
from sklearn.model_selection import train_test_split
import pandas as pd
import itertools
import nltk
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import FunctionTransformer

# Classification

Our final task will be to use the tools that we have explored to predict a writer's gender from the text of their happy moments. Along the way, we will see how to split data to train and test classifiers and how text is represented as input in NLP.

<span style="color:red">TODO:</span> maybe we should have the students implement a simple classifier like NB, which is what the Stanford project does. We could do what we are doing here, using a classifier out-of-the-box, then have them implement their own?

## Splitting Data

Before we train any classifiers, we need to split our data into a train set, dev set, and test set.

Create three lists of writer IDs: train (80%), test (10%), and dev (10%). Make sure that these lists do not overlap and that together they contain all writers whose gender is labeled as male or female. As you saw in section 1, very few authors list their gender as other, so there is not enough data to train or evaluate a classifier for that class.

Scikit-learn has a function, `train_test_split`, that will split data for you. Note that it only does a single split; think about how you can use it to create three distinct datasets. If you do not want to use scikit-learn, you may implement this yourself. Either way, seed your random number generator so that you get the same split on every run, which makes debugging much easier.

<span style="color:red">TODO:</span> should we expect them to look up documentation on how to use these functions?

In [None]:
demographics = lib.load_demographics()
happy_moments = lib.load_happy_moments()

In [None]:
joined_data = pd.merge(demographics, happy_moments, on='wid')

# only keep relevant columns for simplicity
joined_data = joined_data[['gender', 'cleaned_hm']]

# drop where gender is not m or f
joined_data = joined_data[joined_data['gender'].isin(['m', 'f'])]


train, temp = train_test_split(joined_data, test_size=.2, random_state=10)
dev, test = train_test_split(temp, test_size=.5, random_state=10)
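
The split above partitions individual happy moments, so the same writer can end up in more than one set. The instructions ask for lists of writer IDs instead. Here is a minimal sketch of that variant on a small made-up DataFrame; the `toy` data and all variable names are assumptions for illustration, not the project's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the joined data: 20 hypothetical writers, 3 moments each.
toy = pd.DataFrame({
    'wid': [w for w in range(20) for _ in range(3)],
    'gender': ['m' if w % 2 else 'f' for w in range(20) for _ in range(3)],
    'cleaned_hm': ['moment text'] * 60,
})

# Split the unique writer IDs 80/10/10, then select rows by writer.
writer_ids = toy['wid'].unique()
train_ids, temp_ids = train_test_split(writer_ids, test_size=0.2, random_state=10)
dev_ids, test_ids = train_test_split(temp_ids, test_size=0.5, random_state=10)

train_rows = toy[toy['wid'].isin(train_ids)]
dev_rows = toy[toy['wid'].isin(dev_ids)]
test_rows = toy[toy['wid'].isin(test_ids)]

# Collect any writer that appears in more than one split (should be none).
overlap = (set(train_ids) & set(dev_ids)) | (set(train_ids) & set(test_ids)) | (set(dev_ids) & set(test_ids))
print(len(train_ids), len(dev_ids), len(test_ids), overlap)
```

Splitting by writer keeps all of one person's moments in a single set, so the dev and test scores measure how well the model generalizes to unseen writers.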

## Defining a Baseline
One good baseline is the _majority class_. In a classification problem, it is often the case that one class appears more frequently in the data than the other.

The simplest baseline is random, which would be 50% on a binary classification task like ours. However, with unbalanced data, that does not take into account the fact that guessing the most common class 100% of the time would yield a higher baseline. What is our majority class baseline? Print it out, and be sure to compare your results to the baseline!

In [None]:
# Majority-class baseline: accuracy from always guessing the most common gender in dev
class_counts = Counter(dev['gender'])
print(class_counts.most_common()[0][1] / sum(class_counts.values()))
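
As a cross-check, the same baseline can be computed by choosing the majority class from the training labels and scoring it on the dev labels. The sketch below does this by hand and with scikit-learn's `DummyClassifier`; the label lists are made up for illustration.

```python
from collections import Counter
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Hypothetical labels: 'm' is the majority class in both splits.
toy_train_labels = ['m', 'm', 'm', 'f', 'f']
toy_dev_labels = ['m', 'f', 'm', 'm']

# Pick the majority class from the training labels...
majority_class = Counter(toy_train_labels).most_common(1)[0][0]
# ...and score "always predict that class" on the dev labels.
manual_baseline = sum(label == majority_class for label in toy_dev_labels) / len(toy_dev_labels)

# DummyClassifier ignores the features, so placeholder inputs are enough.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit([[0]] * len(toy_train_labels), toy_train_labels)
dummy_predictions = dummy.predict([[0]] * len(toy_dev_labels))
dummy_baseline = metrics.accuracy_score(toy_dev_labels, dummy_predictions)

print(majority_class, manual_baseline, dummy_baseline)  # m 0.75 0.75
```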

## First Feature: Counts
We first train our model using sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). The count vectorizer represents a sentence by counting the number of times that each word appears. Each position in the vector represents one word.

You should
* Create a CountVectorizer
* Create the input and output variables that will be used in your classifier  
  Think about where you should use `transform`, `fit`, or `fit_transform`!
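
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on made-up sentences (the `toy_` names are placeholders, not part of the assignment). The vocabulary is learned from the training text only, so a word that appears only in the dev text is ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for training and dev happy moments.
toy_train_texts = ['the dog chased the cat', 'the cat slept']
toy_dev_texts = ['the bird chased the dog']

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text and counts it.
toy_train_counts = toy_vectorizer.fit_transform(toy_train_texts)
# transform reuses that vocabulary; 'bird' was never seen, so it is dropped.
toy_dev_counts = toy_vectorizer.transform(toy_dev_texts)

# Order the learned words by their column index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)                           # ['cat', 'chased', 'dog', 'slept', 'the']
print(toy_train_counts.toarray().tolist())  # [[1, 1, 1, 0, 2], [1, 0, 0, 1, 1]]
print(toy_dev_counts.toarray().tolist())    # [[0, 1, 1, 0, 2]]
```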

In [None]:
vectorizer = CountVectorizer()
train_input = vectorizer.fit_transform(train['cleaned_hm'])
train_output = train['gender']

dev_input = vectorizer.transform(dev['cleaned_hm'])
dev_output = dev['gender']

Now that you have created your features, you can train your classifier. For this exercise, use the LogisticRegression classifier.

In [None]:
# train the model
model = LogisticRegression()
model.fit(train_input, train_output)

# test the model on dev set
predictions = model.predict(dev_input)
print(metrics.accuracy_score(dev_output, predictions))

## Adding a New Feature: Length
We saw in section 2 that length of happiness reflections can differ for men and women. What happens if we add this feature in addition to counts? Does it help with our performance?

In [None]:
length_feature_train = np.array([len(nltk.word_tokenize(x)) for x in train['cleaned_hm']]).reshape(-1, 1)
length_feature_dev = np.array([len(nltk.word_tokenize(x)) for x in dev['cleaned_hm']]).reshape(-1, 1)

In [None]:
from scipy.sparse import hstack

# Append the length column to the sparse count matrix without densifying it.
# (todense() returns an np.matrix, which is slow here and which recent
# scikit-learn versions reject as input.)
combo_train = hstack([train_input, length_feature_train]).tocsr()
combo_dev = hstack([dev_input, length_feature_dev]).tocsr()

model = LogisticRegression()
model.fit(combo_train, train_output)
predictions = model.predict(combo_dev)
print(metrics.accuracy_score(dev_output, predictions))

## TF-IDF Vectorizer
TF-IDF stands for term frequency-inverse document frequency. It is a way of weighting words such that words have the highest weights if they are _common_ in a single document but _uncommon_ in the full set of documents. This means that words like "a" would have a lower weight, even if they appear frequently in a single document, because they are so common overall.
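
A small sketch on a made-up corpus shows the effect. It assumes `TfidfVectorizer`'s default settings, under which the inverse document frequency is ln((1 + number of documents) / (1 + document frequency)) + 1 and each row is then scaled to unit length.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'the' is in every document, 'cat' in only one.
toy_docs = ['the cat sat', 'the dog ran', 'the bird sang']

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()

the_idx = toy_tfidf.vocabulary_['the']
cat_idx = toy_tfidf.vocabulary_['cat']

# Expected idf values under the default (smoothed) formula.
expected_the_idf = math.log((1 + 3) / (1 + 3)) + 1  # 1.0
expected_cat_idf = math.log((1 + 3) / (1 + 1)) + 1  # about 1.69

print(toy_tfidf.idf_[the_idx], expected_the_idf)
print(toy_tfidf.idf_[cat_idx], expected_cat_idf)

# Both words occur once in the first document, but 'cat' gets the larger weight.
print(toy_weights[0, cat_idx] > toy_weights[0, the_idx])  # True
```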

Create your features again, this time using the TfidfVectorizer. Do you see any change in performance?

In [None]:
vectorizer = TfidfVectorizer()
train_input = vectorizer.fit_transform(train['cleaned_hm'])
train_output = train['gender']

dev_input = vectorizer.transform(dev['cleaned_hm'])
dev_output = dev['gender']

In [None]:
model = LogisticRegression()
model.fit(train_input, train_output)
predictions = model.predict(dev_input)
print(metrics.accuracy_score(dev_output, predictions))

## Examining Model Weights
Beyond measuring accuracy, we can inspect the _weights_ of our classifier. They tell us which words push a prediction most strongly toward one class or the other.

This gives us a hint about which words show up more in men's happy moments than in women's, and vice versa.

The model weights are stored as `model.coef_`. They line up with the feature names in your vectorizer, which you can find by running `vectorizer.get_feature_names_out()` (older scikit-learn versions called this `get_feature_names()`, which has since been removed).

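Once you have the weights for all features, you can sort by coefficient to find the largest and smallest values. Scikit-learn stores the classes in sorted order (`model.classes_` is `['f', 'm']` here), and in the binary case a positive coefficient pushes the prediction toward the second class. The largest coefficients therefore point toward men, and the smallest (most negative) toward women.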

Do you see any similarities between the coefficient lists and your word clouds?

In [None]:
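# get_feature_names() was removed from scikit-learn; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
# coef_ has a single row for binary classification
coefficients = model.coef_[0]
weight_df = pd.DataFrame({'Word': feature_names,
                          'Coeff': coefficients})
# largest coefficients first; ties broken alphabetically
weight_df = weight_df.sort_values(['Coeff', 'Word'], ascending=[False, True])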
weight_df.head(n=10)

In [None]:
weight_df.tail(n=10)

## Your Turn: Other Features?
Are there any other features that you think could help your classifier performance? If so, try adding them!

In [None]:
# TODO: add ngram features to CountVectorizer
# maybe ask them to create their own counts matrix?
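
One possible starting point is n-gram features, which `CountVectorizer` supports through its `ngram_range` parameter. The sketch below uses a made-up sentence to show which features are produced; whether they improve accuracy on this dataset is something to test on the dev set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs.
toy_bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_bigram_counts = toy_bigram_vectorizer.fit_transform(['my daughter graduated'])

features = sorted(toy_bigram_vectorizer.vocabulary_)
print(features)
# ['daughter', 'daughter graduated', 'graduated', 'my', 'my daughter']
```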