In [None]:
%load_ext autoreload
%autoreload 2
import lib
from collections import Counter
from sklearn.model_selection import train_test_split
import pandas as pd
import itertools
import nltk
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import FunctionTransformer

# Classification

Our final task will be to use the tools that we have explored to classify an author's gender from the text of their happy moments. Along the way, we will see how to split data to train and test classifiers and how data is represented as input in NLP.

<span style="color:red">TODO:</span> maybe we should have the students implement a simple classifier like NB, which is what the Stanford project does. We could do what we are doing here, using a classifier out-of-the-box, then have them implement their own?

## Splitting Data

Before we train any classifiers, we need to split our data into a train set, dev set, and test set.

Create three lists of writer IDs: train (80%), test (10%), and dev (10%). Make sure that these lists do not have any overlap, and contain all writers with their gender labeled as male or female. As you saw in section 1, we do not have very many authors whose gender is other, so it would be impossible to perform classification.

Scikit-learn has a function, `train_test_split`, that will split data for you. Note that it only does a single split; think about how you can use it to create three distinct datasets. If you do not want to use scikit-learn, you may implement this yourself. Either way, for debugging, you should seed your random number generator, which will cause it to give the same results each time you use it. You can see the [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
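If you choose to implement the split yourself, one possible approach is a seeded shuffle followed by slicing. The sketch below is only an illustration on toy data: the helper name `three_way_split`, the seed value, and the toy IDs are assumptions, not part of `lib` or scikit-learn.

```python
import random

def three_way_split(items, seed=10):
    """Shuffle a copy of items with a seeded RNG, then slice it 80/10/10."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order every run
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train_part = shuffled[:n_train]
    dev_part = shuffled[n_train:n_train + n_dev]
    test_part = shuffled[n_train + n_dev:]  # whatever is left over
    return train_part, dev_part, test_part

toy_ids = list(range(100))  # stand-in for real data
toy_train, toy_dev, toy_test = three_way_split(toy_ids)
print(len(toy_train), len(toy_dev), len(toy_test))
```

Because every item lands in exactly one slice of the shuffled list, the three parts cannot overlap.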

Load your data (use the function for _joined_ data!)

In [None]:
joined_data = lib.load_joined_data()

Create a new list, `joined_data_clean`, that only contains happy moments where the author identifies as male or female.

In [None]:
joined_data_clean = [hm for hm in joined_data if hm['gender'] in ['m', 'f']]

Split your data into three separate lists: `train`, `dev`, and `test`.

In [None]:
train, temp = train_test_split(joined_data_clean, test_size=.2, random_state=10)
dev, test = train_test_split(temp, test_size=.5, random_state=10)

## Defining a Baseline
One good baseline is the _majority class_. In a classification problem, it is often the case that one class appears more frequently in the data than the other.

The simplest baseline is random, which would be 50% on a binary classification task like ours. However, with unbalanced data, that does not take into account the fact that guessing the most common class 100% of the time would yield a higher baseline. What is our majority class baseline? Print it out, and be sure to compare your results to the baseline!
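As a small illustration of the idea, here is the majority class baseline on a made-up list of four labels (the labels are invented for the example and say nothing about the real data):

```python
from collections import Counter

toy_labels = ['f', 'f', 'f', 'm']  # invented labels, not the real dataset

# most_common(1) returns a list with one (label, count) pair
majority_label, majority_count = Counter(toy_labels).most_common(1)[0]
majority_baseline = majority_count / len(toy_labels)
print(majority_label, majority_baseline)
```

Always guessing the majority label here would be right 3 times out of 4, which beats the 50% random baseline.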

In [None]:
### YOUR WORK HERE
class_counts = Counter([hm['gender'] for hm in joined_data_clean])
print(class_counts.most_common()[0][1] / sum(class_counts.values()))
### END YOUR WORK

## First Feature: Counts
We will first train our model by using counts of words as features. You should create a feature matrix (using numpy) with the following properties:
* There is one row for each sentence
* Each column is a count of the number of times that each word appears in that sentence

You can think of this as a grid, where on the top you have words and on the side you have sentences.

You should
* Fill in the class `CountMatrix`. The two methods you will write, `fit_transform` and `transform`, are analogous to the methods of the same names in sklearn. `fit_transform` builds the word -> column mapping from the sentences it is given and returns their count matrix, while `transform` builds a matrix for new sentences using the mapping that was created when you called `fit_transform`. Make sure that `transform` can only be called if `fit_transform` has already been called!
* Think about what to do with unknown words. You can search online to see if you can find any solutions to this problem!
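To make the grid picture concrete, here is a toy sketch of the idea. It makes several simplifying assumptions: it splits on whitespace instead of using a real tokenizer, it uses plain lists instead of numpy, and it handles unknown words by simply ignoring them, which is only one of several possible choices. The names `vocab` and `to_counts` are made up for the example.

```python
toy_train_sents = ["i ate cake", "cake cake please"]
toy_new_sents = ["i ate pie"]  # "pie" never appears in the training sentences

# word -> column mapping, built from the training sentences only
vocab = {}
for sentence in toy_train_sents:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def to_counts(sentences, vocab):
    """One row per sentence, one column per known word."""
    rows = []
    for sentence in sentences:
        row = [0] * len(vocab)
        for token in sentence.split():
            if token in vocab:  # unknown words are skipped
                row[vocab[token]] += 1
        rows.append(row)
    return rows

toy_train_counts = to_counts(toy_train_sents, vocab)
toy_new_counts = to_counts(toy_new_sents, vocab)
print(vocab)
print(toy_train_counts)
print(toy_new_counts)
```

Note that the new sentence gets the same number of columns as the training sentences, which is what lets a trained classifier accept it.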

In [None]:
class CountMatrix:
    def __init__(self):
        ### YOUR WORK HERE
        self.word_to_int = {}
        ### END YOUR WORK
    
    def fit_transform(self, sentences):
        ### YOUR WORK HERE
        # create a word to column mapping
        col = 0
        for sentence in sentences:
            for token in nltk.word_tokenize(sentence):
                if token not in self.word_to_int:
                    self.word_to_int[token] = col
                    col += 1
        
        return self.transform(sentences)
        ### END YOUR WORK
    
    def transform(self, sentences):
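        ### YOUR WORK HERE
        if len(self.word_to_int) == 0:
            raise Exception('Must call fit_transform before transform!')

        # one row per sentence, one column per word seen in fit_transform
        count_matrix = np.zeros((len(sentences), len(self.word_to_int)))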
        for i, sentence in enumerate(sentences):
            for token in nltk.word_tokenize(sentence):
                if token in self.word_to_int:
                    count_matrix[i][self.word_to_int[token]] += 1
        return count_matrix
        ### END YOUR WORK

Use your CountMatrix to create input and output variables for your classifier

In [None]:
count_matrix = CountMatrix()
train_input = count_matrix.fit_transform([hm['hm_text'] for hm in train])
dev_input = count_matrix.transform([hm['hm_text'] for hm in dev])

train_output = [hm['gender'] for hm in train]
dev_output = [hm['gender'] for hm in dev]

Now that you have created your features, you can train your classifier. For this exercise, use the LogisticRegression classifier.

In [None]:
# train the model
model = LogisticRegression()
model.fit(train_input, train_output)

# test the model on dev set
predictions = model.predict(dev_input)
# accuracy_score expects the true labels first, then the predictions
print(metrics.accuracy_score(dev_output, predictions))

## Adding a new feature: length
We saw in section 2 that length of happiness reflections can differ for men and women. What happens if we add this feature in addition to counts? Does it help with our performance?

Create feature vectors that include only the length of the sequence

In [None]:
length_feature_train = np.array([len(nltk.word_tokenize(hm['hm_text'])) for hm in train]).reshape(-1, 1)
length_feature_dev = np.array([len(nltk.word_tokenize(hm['hm_text'])) for hm in dev]).reshape(-1, 1)

Next, use `np.concatenate` to combine them with your count features.
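As a quick reminder of how `np.concatenate` behaves with `axis=1`, here is a toy sketch with invented numbers: a 2x3 count matrix gains one extra column holding the lengths.

```python
import numpy as np

toy_counts = np.array([[1, 0, 2],
                       [0, 1, 0]])              # 2 sentences, 3 words
toy_lengths = np.array([3, 1]).reshape(-1, 1)   # column vector, shape (2, 1)

# axis=1 glues the arrays side by side, so the row counts must match
toy_combo = np.concatenate((toy_counts, toy_lengths), axis=1)
print(toy_combo.shape)
print(toy_combo)
```

The `reshape(-1, 1)` step matters: a flat array of lengths has no second axis to join along.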

In [None]:
combo_train = np.concatenate((train_input, length_feature_train), axis=1)
combo_dev = np.concatenate((dev_input, length_feature_dev), axis=1)

Finally, train the model again with the new features to see if the results change

In [None]:
model = LogisticRegression()
model.fit(combo_train, train_output)
predictions = model.predict(combo_dev)
print(metrics.accuracy_score(dev_output, predictions))

## TF-IDF Counts
TF-IDF stands for term frequency-inverse document frequency. It is a way of weighting words such that words have the highest weights if they are _common_ in a single document but _uncommon_ in the full set of documents. This means that words like "a" would have a lower weight, even if they appear frequently in a single document, because they are so common overall. You can think of a document as a happy moment sentence in our case!

[Wikipedia gives a very complete description of how TF-IDF is calculated](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition), and you should refer to this when implementing the method. If you have questions about notation, please ask an instructor or a neighbor, as it is a bit tricky!
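Before implementing the class, it may help to work through the arithmetic on a tiny example. The sketch below uses three invented "documents" and the plain variant tf × log(N / df), where tf is a word's count divided by the document length; other variants add smoothing terms, so your numbers may differ if you choose one of those. The helper names are made up for the example.

```python
import math

toy_docs = [["a", "dog"], ["a", "cat"], ["a", "a", "bird"]]
n_docs = len(toy_docs)

def term_frequency(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term):
    """log(N / number of documents containing the term)."""
    df = sum(1 for doc in toy_docs if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc):
    return term_frequency(term, doc) * inverse_document_frequency(term)

# "a" is in every document, so its idf is log(3/3) = 0
tfidf_a = tfidf("a", toy_docs[2])
# "dog" is in one document out of three, so its idf is log(3)
tfidf_dog = tfidf("dog", toy_docs[0])
print(tfidf_a, tfidf_dog)
```

Even though "a" makes up two thirds of the last document, its weight is zero because it tells us nothing about which document we are looking at.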

Fill in the class `TFIDFMatrix`, which will contain TF-IDF values instead of raw counts. Please feel free to add additional helper methods to this class as you calculate TF-IDF!

In [None]:
class TFIDFMatrix:
    def __init__(self):
        self.word_to_int = {}
        # statistics learned from the training sentences in fit_transform
        self.document_frequency = {}
        self.n_documents = 0
    
    def fit_transform(self, sentences):
        ### YOUR WORK HERE
        # create a word to column mapping and count how many
        # training documents contain each word
        self.word_to_int = {}
        self.document_frequency = {}
        self.n_documents = len(sentences)
        col = 0
        for sentence in sentences:
            tokens = nltk.word_tokenize(sentence)
            for token in tokens:
                if token not in self.word_to_int:
                    self.word_to_int[token] = col
                    col += 1
            for token in set(tokens):
                self.document_frequency[token] = self.document_frequency.get(token, 0) + 1
        
        return self.transform(sentences)
        ### END YOUR WORK
    
    def transform(self, sentences):
        ### YOUR WORK HERE
        if len(self.word_to_int) == 0:
            raise Exception('Must call fit_transform before transform!')
        
        # calculate tf-idf, reusing the document frequencies from training
        # so that train and dev features are weighted the same way
        tfidf_matrix = np.zeros((len(sentences), len(self.word_to_int)))
        for i, sentence in enumerate(sentences):
            # count known tokens only; unknown words are ignored
            token_counts = Counter(
                token for token in nltk.word_tokenize(sentence)
                if token in self.word_to_int
            )
            n_tokens = sum(token_counts.values())
            for token, count in token_counts.items():
                tf = count / n_tokens
                # the 1 + in the denominator smooths the idf
                idf = np.log(self.n_documents / (1 + self.document_frequency[token]))
                tfidf_matrix[i][self.word_to_int[token]] = tf * idf
        
        return tfidf_matrix
        ### END YOUR WORK
    
    def get_feature_names(self):
        # words ordered by their column index
        return [x[0] for x in sorted(self.word_to_int.items(), key=lambda x: x[1])]

Use your TFIDFMatrix to create input and output variables for your classifier

In [None]:
tfidf_matrix = TFIDFMatrix()
train_input = tfidf_matrix.fit_transform([hm['hm_text'] for hm in train])

dev_input = tfidf_matrix.transform([hm['hm_text'] for hm in dev])

Finally, train your classifier

In [None]:
model = LogisticRegression()
model.fit(train_input, train_output)
predictions = model.predict(dev_input)
print(metrics.accuracy_score(dev_output, predictions))

## Examining Model Weights
In addition to measuring classification accuracy, we can look at the _weights_ of our classifier. These tell us which words most strongly push the model's predictions toward one class or the other.

This gives us a hint about which words appear more in men's happy moments than in women's, and vice versa.

The model weights are stored as `model.coef_`. They line up with the feature names in your `TFIDFMatrix`, which you can get by running `tfidf_matrix.get_feature_names()`.

Once you have the weights for all features, you can sort by coefficient to find the largest and smallest values. For a binary problem, positive coefficients push predictions toward the second class in `model.classes_` and negative coefficients toward the first.

Do you see any similarities between the coefficient lists and your word clouds?
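To see how coefficient signs relate to classes, here is a toy sketch with two invented features and four invented examples. The feature names `word_a` and `word_b` are placeholders; nothing here comes from the real data.

```python
from sklearn.linear_model import LogisticRegression

toy_features = ["word_a", "word_b"]        # placeholder feature names
toy_X = [[1, 0], [1, 0], [0, 1], [0, 1]]   # word_a only with 'f', word_b only with 'm'
toy_y = ['f', 'f', 'm', 'm']

toy_model = LogisticRegression().fit(toy_X, toy_y)

# pair each feature with its weight and sort from most negative to most positive
toy_weights = sorted(zip(toy_features, toy_model.coef_[0]), key=lambda x: x[1])
print(toy_model.classes_)
print(toy_weights)
```

Here `classes_` is ordered as `['f', 'm']`, so the feature with the negative weight points toward `'f'` and the feature with the positive weight points toward `'m'`.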

In [None]:
feature_names = tfidf_matrix.get_feature_names()
coefficients = model.coef_.tolist()[0]

# pair each word with its coefficient
combo = list(zip(feature_names, coefficients))

# sort from most negative to most positive coefficient
sorted_combos = sorted(combo, key=lambda x: x[1])

In [None]:
sorted_combos[:10]

In [None]:
sorted_combos[-10:]

## Modifying your Features
After seeing the results of top weights, is there anything that you would change with how you created your features? Is there any additional pre-processing that you might do?

If so, try making these modifications in your CountMatrix and TFIDFMatrix, and see if it improves your results.

## Your Turn: Other Features?
Are there any other features that you think could help your classifier performance? If so, try adding them!

## Testing!

Once you're done playing around with different features, you can test your best classifier on the test set!

## Reflection
These results could tell us that different things make men and women happy. What else could they tell us?