In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

    import nltk
    nltk.download('punkt')

In [None]:
%load_ext autoreload
%autoreload 2
import lib
from sklearn.model_selection import train_test_split
import pandas as pd
import itertools
import nltk
import numpy as np
from numpy.random import rand


from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from scipy.sparse import lil_matrix, hstack

# Classification

Our final task will be to use the tools that we have explored to classify a writer's gender based on the text of their happy moments. Along the way, we will see how to split data to train and test classifiers and how data is represented as input in NLP.

## Splitting Data

Before we train any classifiers, we need to split our data into a train set, dev set, and test set.

Create three lists of writer IDs: train (80%), test (10%), and dev (10%). Make sure that these lists do not overlap and that together they contain all writers with their gender labeled as male or female. As you saw in section 1, we have very few authors whose gender is other, which is too little data to perform classification on that class.

Scikit-learn has a function, `train_test_split`, that will split data for you (see the [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). Note that it only does a single split; think about how you can use it to create three distinct datasets. If you do not want to use scikit-learn, you may implement this yourself. Either way, for debugging, you should seed your random number generator, which will cause it to give the same results each time you use it.

There are two ways that you should consider splitting the data:

1. Split by happy moment: create one list of happy moments, then split them into train, dev, and test
1. Split by worker (more complex, but better): splitting by worker is better because you won't be training on workers who are in the test set. If, for instance, a father constantly mentions his son "Gregory," the classifier might learn that "Gregory" is more commonly said by men, even though it is really just Gregory's father. If Gregory's father is in the test set as well as the train set, you will have higher accuracy than you should.  
To prevent this, you can split __writers__ into train, dev, and test sets. Then, create a list of the corresponding happy moments for train, dev, and test.
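
As a rough illustration of the split-by-writer idea, here is a minimal sketch that takes the "implement it yourself" route with a seeded random number generator. The toy `moments` list and its `text` field are made up for the example; only the `wid` key mirrors the real data.

```python
import random

# Hypothetical toy data: 20 writers with 3 happy moments each.
moments = [{"wid": w, "text": f"moment {i} by writer {w}"}
           for w in range(20) for i in range(3)]

# Shuffle the unique writer IDs with a fixed seed so the split is reproducible.
writers = sorted({hm["wid"] for hm in moments})
rng = random.Random(42)
rng.shuffle(writers)

# 80% train, 10% dev, remainder test (integer arithmetic avoids rounding surprises).
n_train = len(writers) * 8 // 10
n_dev = len(writers) // 10
train_w = set(writers[:n_train])
dev_w = set(writers[n_train:n_train + n_dev])
test_w = set(writers[n_train + n_dev:])

# Map each happy moment to the split that its writer belongs to.
train = [hm for hm in moments if hm["wid"] in train_w]
dev = [hm for hm in moments if hm["wid"] in dev_w]
test = [hm for hm in moments if hm["wid"] in test_w]

print(len(train_w), len(dev_w), len(test_w))
```

Because the writers are split first, no writer's moments can end up in more than one of the three lists.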

Load your data (use the function for _joined_ data!)

In [None]:
joined_data = lib.load_joined_data()

Create a new list, `joined_data_clean`, that only contains happy moments where the author identifies as male or female.

In [None]:
joined_data_clean = [hm for hm in joined_data if hm['gender'] in ['m', 'f']]
all_writers = list(set([writer['wid'] for writer in lib.load_demographics() if writer['gender'] in ['m', 'f']]))

Split your data into three separate lists: `train`, `dev`, and `test`.

In [None]:
### YOUR WORK HERE



If you are splitting by happy moment, you are done with this section. If you are splitting by worker, use this cell to make train, test, and dev lists of _happy moments_ based on the splits of workers.

In [None]:
### YOUR WORK HERE



## Defining a Baseline
One good baseline is the _majority class_. This is defined as the percentage of data that comes from the most common class. In a classification problem, it is often the case that one class appears more frequently in the data than the other.

The simplest baseline is random, which would be 50% on a binary classification task like ours. However, with unbalanced data, that does not take into account the fact that guessing the most common class 100% of the time would yield a higher baseline. What is our majority class baseline? Print it out, and be sure to compare your results to the baseline!
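
To make the definition concrete, here is a small sketch on a made-up list of labels (the `toy_labels` values are hypothetical, not taken from the dataset). The baseline is simply the share of the most common label.

```python
from collections import Counter

# Hypothetical gender labels for ten happy moments.
toy_labels = ["f", "m", "f", "f", "m", "f", "f", "m", "f", "m"]

# Count each class and pick the most frequent one.
label_counts = Counter(toy_labels)
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy you would get by always guessing the majority class.
majority_baseline = majority_count / len(toy_labels)
print(majority_label, majority_baseline)
```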

In [None]:
### YOUR WORK HERE


### END YOUR WORK

## First Feature: Counts
We will first train our model by using counts of words as features. You should create a feature matrix (using numpy) with the following properties:
* There is one row for each sentence
* Each column is a count of the number of times that each word appears in that sentence

You can think of this as a grid, where on the top you have words and on the side you have sentences.

You should
* Fill in the class `CountMatrix`. The two methods you will write, `fit_transform` and `transform`, are analogous to the methods of the same names in sklearn. `fit_transform` will build a word -> column mapping from your sentences and create a new matrix from it, while `transform` will create a matrix with the same word -> column mapping that was built when you called `fit_transform`! Make sure that `transform` can only be called if `fit_transform` has already been called!
* Think about what to do with unknown words. You can search online to see if you can find any solutions to this problem!
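
One possible way to picture the grid, and one possible answer to the unknown-word question, is sketched below with plain Python lists instead of a sparse matrix. The `<UNK>` token and the helper names `build_vocab` and `count_rows` are assumptions made for this illustration; they are not part of `lib` or of the `CountMatrix` you will write.

```python
UNK = "<UNK>"  # hypothetical placeholder column for words never seen in training

def build_vocab(tokenized_sentences):
    # Assign each new word the next free column; column 0 is reserved for UNK.
    word_to_int = {UNK: 0}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in word_to_int:
                word_to_int[word] = len(word_to_int)
    return word_to_int

def count_rows(tokenized_sentences, word_to_int):
    # One row per sentence, one column per known word (plus the UNK column).
    rows = []
    for sentence in tokenized_sentences:
        row = [0] * len(word_to_int)
        for word in sentence:
            row[word_to_int.get(word, word_to_int[UNK])] += 1
        rows.append(row)
    return rows

toy_train = [["i", "ate", "cake"], ["i", "saw", "my", "son"]]
toy_dev = [["i", "ate", "pizza"]]

vocab = build_vocab(toy_train)
train_rows = count_rows(toy_train, vocab)
dev_rows = count_rows(toy_dev, vocab)  # "pizza" is counted in the UNK column
print(dev_rows)
```

Note that the UNK column is always zero in the training rows here, so the classifier learns nothing about it; simply skipping unseen words is another reasonable choice.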

You will need to use a sparse matrix from scipy to accomplish this without creating a data structure that is too big for colab. I would recommend using scipy's [lil_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html). Compared to some of the other sparse matrices, it is easy to construct in a similar way to how you would construct a matrix in numpy. There is a warning that "to construct a matrix efficiently, make sure the items are pre-sorted by index, per row" but in our case this does not seem to matter very much in terms of time, so do not worry about sorting if you don't want to.

*List of Lists Format (LIL)*

Examples:






In [None]:
#create an empty LIL matrix of 4 rows and 5 columns
mtx = lil_matrix((4,5))
print(mtx)

In [None]:
#create a random array data 
data = np.round(rand(2,3))
print(data)


In [None]:
#initialize the LIL matrix with the random array and then print it. Do you notice something interesting? How does the lil_matrix differ from the original matrix?
mtx[:2,[1,2,3]] = data
print("Lil matrix:")
print(mtx)
print("Original matrix:")
print(mtx.todense())


Now, create your `CountMatrix`

Note: you can insert into your lil_matrix using the following:
`matrix[x, y] = z`

Hint (highlight text to see):<font color='white'>this is how you should initialize your count matrix: count_matrix = lil_matrix((len(sentences), len(self.word_to_int)), dtype=np.int64)</font>

In [None]:
class CountMatrix:
    def __init__(self):
        self.word_to_int = {}
    
    def fit_transform(self, sentences):
        # this function should create a map from each unique token to a column number
        # then it should convert this list of sentences into a matrix, and return that matrix
        ### YOUR WORK HERE

        
        
        ### END YOUR WORK
    
    def transform(self, sentences):
        # this should convert a list of sentences into a matrix, then return that matrix
        
        ### YOUR WORK HERE

        
        ### END YOUR WORK

Use your CountMatrix to create input and output variables for your classifier

In [None]:
# here, define variables for the following: text for train/dev happy moments
# use as many lines of code as you need, the following are just placeholders
train_text = ???

dev_text = ???



In [None]:
count_matrix = CountMatrix()
train_input = count_matrix.fit_transform(train_text)
dev_input = count_matrix.transform(dev_text)

Define train/dev output. This is just the gender variable for each HM in train and dev

In [None]:
### YOUR WORK HERE




Now that you have created your features, you can train your classifier. For this exercise, use the LogisticRegression classifier.
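
If the fit/predict/score pattern is new to you, the sketch below shows it on a tiny made-up feature matrix. All `toy_` names and values are hypothetical; your real inputs will be the sparse matrices from `CountMatrix`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hypothetical features: two count columns per example, with made-up labels.
toy_train_input = [[2, 0], [3, 0], [0, 2], [0, 3], [1, 0], [0, 1]]
toy_train_output = ["f", "f", "m", "m", "f", "m"]
toy_dev_input = [[4, 0], [0, 4]]
toy_dev_output = ["f", "m"]

# Fit on the training data only.
toy_model = LogisticRegression()
toy_model.fit(toy_train_input, toy_train_output)

# Predict on the dev data and compare against the true labels.
toy_dev_pred = toy_model.predict(toy_dev_input)
toy_accuracy = metrics.accuracy_score(toy_dev_output, toy_dev_pred)
print(toy_accuracy)
```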

In [None]:
# train the model
model = LogisticRegression()

# call fit on the model


# test the model on dev set
# call predict



# use metrics.accuracy_score to calculate accuracy. usage is: metrics.accuracy_score(y_true, y_pred)



## Adding a new feature: length
We saw in section 2 that length of happiness reflections can differ for men and women. What happens if we add this feature in addition to counts? Does it help with our performance?

Create feature vectors that include only the length of the sequence. Create them as a list. The cell below the next one will convert them to a lil_matrix.

In [None]:
# count up length features. use nltk tokenizer
def count_lengths(text):
    lengths = []
    ### YOUR WORK HERE
    
    
    ### END YOUR WORK
    return lengths

train_lengths = count_lengths(train_text)
dev_lengths = count_lengths(dev_text)

In [None]:
length_feature_train = lil_matrix(train_lengths).reshape(-1, 1)
length_feature_dev = lil_matrix(dev_lengths).reshape(-1, 1)

Next, use [`hstack` from scipy.sparse](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html) to combine them with your count features

Note: you'll need to make the length features a sparse matrix as well!
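
As a small sketch of what `hstack` does, the example below glues a length column onto a toy count matrix. The `toy_` values are made up; the point is only that both inputs must have the same number of rows.

```python
from scipy.sparse import lil_matrix, hstack

# Hypothetical count features: 2 sentences, 2 vocabulary words.
toy_counts = lil_matrix([[1, 0], [0, 2]])

# Hypothetical sentence lengths, stored as a sparse column (one row per sentence).
toy_lengths = [3, 5]
toy_length_col = lil_matrix([[n] for n in toy_lengths])

# Stack side by side: the result has the original columns plus one length column.
toy_combined = hstack([toy_counts, toy_length_col]).tocsr()
print(toy_combined.shape)  # (2, 3)
print(toy_combined.toarray())
```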

In [None]:
# combine the features together into one matrix for dev and train



Finally, train the model again with the new features to see if the results change

In [None]:
# follow what you did above to train your model




## TF-IDF Counts
TF-IDF stands for term frequency-inverse document frequency. It is a way of weighting words such that words have the highest weights if they are _common_ in a single document but _uncommon_ in the full set of documents. This means that words like "a" would have a lower weight, even if they appear frequently in a single document, because they are so common overall. You can think of a document as a happy moment sentence in our case!

[Wikipedia gives a very complete description of how TF-IDF is calculated](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition), and you should refer to this when implementing the method. If you have questions about notation, please ask an instructor or a neighbor, as it is a bit tricky!
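
To make the notation less abstract, here is a worked sketch on three tiny documents. It assumes one common variant from the Wikipedia page: term frequency is the word's count divided by the document length, and inverse document frequency is log(N / number of documents containing the word). Other variants exist, so treat this as one option rather than the required formula.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [["a", "dog"], ["a", "cat"], ["a", "dog", "dog"]]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc))

# Inverse document frequency: words that appear in every document get log(1) = 0.
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# TF-IDF per document: (count / document length) * idf.
toy_tfidf = []
for doc in toy_docs:
    word_counts = Counter(doc)
    toy_tfidf.append({word: (count / len(doc)) * idf[word]
                      for word, count in word_counts.items()})

for row in toy_tfidf:
    print(row)
```

The word "a" appears in every document, so its weight is zero everywhere, while "cat", which appears in only one document, gets the largest weight.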

Fill in the class `TFIDFMatrix`, which will contain TF-IDF values instead of raw counts. Please feel free to add additional helper methods to this class as you calculate TF-IDF!

In [None]:
class TFIDFMatrix:
    def __init__(self):
        self.word_to_int = {}
    
    def fit_transform(self, sentences):
        ### YOUR WORK HERE
        # create a word to column mapping
        # this should be a lot like what you did before!

        
        
        ### END YOUR WORK
    
    def transform(self, sentences):
        ### YOUR WORK HERE
        if len(self.word_to_int) == 0:
            raise Exception('Must call fit_transform before transform!')
        
        # calculate document frequency

        
        
        # calculate tf-idf

        
        
        # return tfidf-matrix at the end
        
        
        
        ### END YOUR WORK
    
    def get_feature_names(self):
        # this function will be used later, don't worry too much about it!
        return [x[0] for x in sorted(self.word_to_int.items(), key=lambda x: x[1])]
            


Use your TFIDFMatrix to create input and output variables for your classifier

In [None]:
tfidf_matrix = TFIDFMatrix()
train_input = tfidf_matrix.fit_transform(train_text)

dev_input = tfidf_matrix.transform(dev_text)

Finally, train your classifier

In [None]:
# do the same thing that you have done before



## Examining Model Weights
In addition to succeeding at classification, we can look at the _weights_ of our classifier. This will tell us which words are most influential in the model's predictions!

This helps us to see which words are more strongly associated with happy moments written by men than by women, and vice versa.

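The model weights are stored as `model.coef_`. They will line up with the feature names in your vectorizer (here, `tfidf_matrix`), which you can find by running `tfidf_matrix.get_feature_names()`.

Once you have the weights for all features, you can sort by coefficient to find the largest and smallest coefficients. In a binary model, large positive coefficients push predictions toward the second class in `model.classes_` and large negative coefficients push toward the first, so the two ends of the sorted list correspond to the two genders.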

Do you see any similarities between the coefficient lists and your word clouds?

In [None]:
feature_names = tfidf_matrix.get_feature_names()
coefficients = model.coef_.tolist()[0]

combo = []
for i in range(len(feature_names)):
    combo.append((feature_names[i], coefficients[i]))
    
sorted_combos = sorted(combo, key=lambda x: x[1])

In [None]:
sorted_combos[:10]

In [None]:
sorted_combos[-10:]

## Modifying your Features
After seeing the results of top weights, is there anything that you would change with how you created your features? Is there any additional pre-processing that you might do?

If so, try making these modifications in your CountMatrix and TFIDFMatrix, and see if it improves your results.

## Your Turn: Other Features?
Are there any other features that you think could help your classifier performance? If so, try adding them!

## Testing!

Once you're done playing around with different features, you can test your best classifier on the test set!

## Reflection
These results could tell us that different things make men and women happy. What else could they tell us?