# Exercise 2.2 - Train a Nationality Classifier (Word Embedding)

# Working with word vectors
Word vectors aren't just used for cool word analogies! They're frequently used for downstream applications in NLP when we need to know something about the meaning of words. For example, if we're training a classifier that uses a term-per-dimension representation (like a term-document matrix) and we've only seen the word "cat" in our training data but not "kitty", then we'll be unable to reason about a new document that has "kitty" instead of "cat". However, if we use word vectors trained on a much larger corpus (not just the training data), the vectors hopefully encode that "cat" and "kitty" mean similar things and our model might be able to generalize.
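
To make the "cat"/"kitty" intuition concrete, here is a minimal sketch that compares words by cosine similarity. The three-dimensional vectors are invented for illustration (they do not come from any trained model), and `toy_vectors` and `cosine_similarity` are hypothetical names.

```python
import numpy as np

# Toy, hand-made vectors (hypothetical values, not from a trained model).
toy_vectors = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "kitty": np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cat_kitty = cosine_similarity(toy_vectors["cat"], toy_vectors["kitty"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# "cat" should sit much closer to "kitty" than to "car".
print(sim_cat_kitty > sim_cat_car)
```

In a term-per-dimension representation, "cat" and "kitty" would occupy different dimensions and have a similarity of exactly zero, which is why a model trained only on "cat" cannot transfer what it learned.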

In this notebook, we'll try one simple application of word vectors, building on the code you developed in the very first notebook. Here again, we'll train a nationality classifier. However, instead of training on a term-document matrix representation, we'll use the _average word vector_ of the words in the document. This representation has a substantial advantage of being compact - 100 dimensions! - instead of the size of our vocabulary. This dense representation makes for very efficient learning. However, does it make for _effective_ learning? In this notebook you'll get a sense of the performance through different types of tests.

The exercises we'll do for 2.2 will be _very_ similar initially to those from week one and will use the same data. However, when we get to creating features, you'll build a wholly different set using dense vectors!

In [1]:
import gensim
import gzip
import json
import matplotlib.pyplot as plt
import numpy as np
import re
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from sklearn.dummy import DummyClassifier

It's good practice to manually set your random seed when performing machine learning experiments so that they are reproducible for others. Here, we set our seed to 655 to ensure your models and experiments get the expected results when evaluating your homework.

In [2]:
RANDOM_SEED = 655

### Data Loading

Let's first read in the corpus file as a `pd.DataFrame`, where each row contains a cleaned-up Wikipedia biography for a person, the name of the person and the nationality of the person. 

In [3]:
nationality_df = pd.read_csv('assets/nationality.tsv.gz', sep='\t', compression='gzip')
nationality_df.head()

| Unnamed: 0 | bio | name | nationality |
|---|---|---|---|
| 0 | Alain Connes (born 1 April 1947) is a French m... | Alain Connes | french |
| 1 | Life\n=== Early life ===\nSchopenhauer's birth... | Arthur Schopenhauer | german |
| 2 | Life and career\nAlfred Nobel at a young age i... | Alfred Nobel | swedish |
| 3 | Early life\nAlfred Vogt (both "Elton" and "van... | A. E. van Vogt | canadian |
| 4 | Alfons Maria Jakob (2 July 1884 in Aschaffenbu... | Alfons Maria Jakob | german |


In [4]:
#hidden tests are within this cell

To reduce the burden on memory, let's grab only the first 75,000 biographies/rows of the data frame.

In [5]:
nationality_df = nationality_df[:75000]

### Task 2.2.1: Fix the nationality labels
Just as in Assignment 1, we'll fix the nationality labels of the data to clean up noise in Wikipedia's manually labeled entries. Use python's `split()` function to divide these labels when they have a comma and take the last word, which we'll treat as the official national label. 

*Important note:* Remember that `split` matches exactly what you put in, but there might be variable whitespace around the final token. Use `strip()` to ensure that no nationality has leading or trailing white space.
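
As a quick illustration of why `strip()` matters, here is a small sketch on made-up labels. The strings in `raw_labels` are hypothetical and only mimic the kind of noise described above.

```python
# Hypothetical raw labels that mimic comma-separated, whitespace-padded entries.
raw_labels = ["american", "french, american", "german ,  swiss  "]

# Take the last comma-separated piece, then trim surrounding whitespace.
cleaned = [label.split(",")[-1].strip() for label in raw_labels]
without_strip = [label.split(",")[-1] for label in raw_labels]

print(cleaned)
print(without_strip)
```

Without `strip()`, the last two labels would keep their padding (`' american'` and `'  swiss  '`) and would be counted as different nationalities from `'american'` and `'swiss'`.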

In [6]:
nationality_df['nationality'] = nationality_df['nationality'].apply(lambda x: x.split(',')[-1].strip())

In [7]:
len(set(nationality_df.nationality))

3197

In [8]:
#hidden tests are within this cell

Looks pretty good!

### Task 2.2.2: Filter dataset to only those nationalities with at least 500 occurrences
When training any classifier, you need enough examples to learn features that reliably predict the labels. For this homework, let's restrict ourselves to working with only nationalities that have at least 500 occurrences. Create a set called `final_nationalities` that contains only those with at least 500 occurrences. Then, from this restricted label set, let's take the subset of `nationality_df` that use these labels and extract a subset called `cleaned_nationality_df` that holds our final dataset that we'll use for train, test, and development.

*Side note:* Often, removing rare labels is another good way of getting rid of noise in our dataset. However, in practice, it's important to check these labels to make sure there are no (or few) systematic errors that would bias your model. Sometimes these biases can have significant real-world impact (e.g., underrepresenting people) and as an ethical data scientist, it's your job to combat the introduction of them.
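
One way to act on the side note is to keep track of what gets dropped as well as what gets kept. The sketch below does this on toy labels; `MIN_COUNT = 2` is a hypothetical stand-in for the threshold of 500 used in this exercise.

```python
from collections import Counter

# Toy labels; the threshold of 2 stands in for the 500 used in this exercise.
toy_labels = ["american", "british", "american", "french", "british", "samoan"]
MIN_COUNT = 2  # hypothetical small threshold for the toy data

counts = Counter(toy_labels)
kept = {label for label, n in counts.items() if n >= MIN_COUNT}
dropped = {label: n for label, n in counts.items() if n < MIN_COUNT}

print(sorted(kept))  # labels that survive the filter
print(dropped)       # labels worth inspecting before discarding them
```

Scanning `dropped` is a cheap check for systematic gaps before committing to the reduced label set.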

In [9]:
final_nationalities = set()
for item, count in Counter(nationality_df.nationality).items():
    if count >= 500:
        final_nationalities.add(item)
cleaned_nationality_df = nationality_df[nationality_df['nationality'].isin(final_nationalities)]

In [10]:
len(final_nationalities), len(cleaned_nationality_df)

(19, 51931)

In [11]:
#hidden tests are within this cell

### Task 2.2.3: Split dataset into test, train and dev
We have a large enough dataset that we can effectively split it into train, development, and test sets, using the standard ratio of 80%, 10%, 10% for each, respectively. We'll use `split` from `numpy` to split the data into train, dev, and test separately. We'll call these `train_df`, `dev_df`, and `test_df`.  Note that `split` does not shuffle, so we'll use `DataFrame.sample()` and randomly resample our entire dataset to get a random shuffle before the split.

*Important note*: Remember to set  `random_state` in `DataFrame.sample()` to our seed so that you end up with the same (random) ordering.
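
To see how the two cut points produce an 80/10/10 split, here is a sketch that works on shuffled row indices rather than the DataFrame itself. It uses a NumPy random generator purely for illustration, so the ordering will not match `DataFrame.sample()`; only the sizes are comparable.

```python
import numpy as np

n_rows = 51931  # size of cleaned_nationality_df reported above

# Shuffle row positions (illustrative only; not the same order as DataFrame.sample).
rng = np.random.default_rng(655)
shuffled_idx = rng.permutation(n_rows)

# np.split cuts *before* each listed position: [0, a), [a, b), [b, n_rows).
cut_points = [int(0.8 * n_rows), int(0.9 * n_rows)]
train_idx, dev_idx, test_idx = np.split(shuffled_idx, cut_points)

print(len(train_idx), len(dev_idx), len(test_idx))  # 41544 5193 5194
```

Because `int()` truncates, the three pieces are not exactly 80%, 10%, and 10%, but they always add up to the full dataset with no row used twice.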

In [12]:
cleaned_nationality_df = cleaned_nationality_df.sample(frac=1,random_state=RANDOM_SEED)
# Cut after 80% and 90% of the shuffled rows to get train, dev, and test.
n_total = len(cleaned_nationality_df)
train_df, dev_df, test_df = np.split(cleaned_nationality_df, [int(0.8 * n_total), int(0.9 * n_total)])

### Task 2.2.4: print the first instance of your training and your test sets

In [13]:
print(train_df.iloc[0,:]['bio'])
print(test_df.iloc[0,:]['bio'])
#hidden tests are within this cell

Early life and World War II
Lasky was born in The Bronx of New York City and schooled at City College of New York, where he wrote for the student newspaper, ''The Campus.'' He continued his education at University of Michigan and Columbia University. He briefly considered himself a Trotskyist but at 22 moved away from communism entirely because of disgust with Stalin. He began working for the ''New Leader'' in New York and was editor from 1942–1943. Lasky wrote an editorial during this time criticizing the Allies for failing to address The Holocaust directly in their World War II efforts.

He served in World War II as a combat historian for the 7th Army. Lasky remained in Germany after the war, making his home in Berlin, where he worked for American military governor Lucius D. Clay. During this time, Lasky was an outspoken critic of the United States' earlier reluctance to intervene to stop the genocide of European Jews.
Other activities and private life
Lasky's grave in Berlin

Lasky 

# Classifying with dense vectors

In this exercise we'll be working with dense representations of documents instead of the bag-of-words representations we used earlier. To do this, we'll use the _average_ (or *mean*) word vector of a document and classify from those representations.

As a first step, let's tokenize the biographies here using regular expressions like we did in Exercise 2.1. However, since we're going to be computing an average word vector, let's remove stop words. Here, we'll use NLTK's list of English stop words. Since these words shouldn't affect our classification decision, we can remove them to avoid adding any noise they might cause. Note that all of the stopwords in NLTK's list are lower-cased, but it's possible that some stopwords in your documents are not entirely lower-cased, so they may not match without some further processing.

Your task is to generate a list that contains the list of all non-stopword tokens in each bio of your training set. Call this `tokenized_train_items`. Use the same regular expression you used for Exercise 2.1 to determine tokens.
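
The case-sensitivity caveat is easy to miss, so here is a small sketch of it. The stop-word set below is a tiny hypothetical stand-in for NLTK's list, and `\w+` is assumed as the token pattern.

```python
import re

# A tiny hypothetical stop-word set standing in for NLTK's English list.
toy_stop_words = {"the", "was", "in", "of"}

bio = "The physicist was born in Paris. THE rest of the story..."

# Filtering without lower-casing lets "The" and "THE" slip through.
tokens_cased = [t for t in re.findall(r"\w+", bio) if t not in toy_stop_words]

# Lower-casing first makes every stop word match the lower-cased list.
tokens_lower = [t for t in re.findall(r"\w+", bio.lower()) if t not in toy_stop_words]

print(tokens_cased)
print(tokens_lower)
```

Lower-casing also matters later on: a token can only be looked up in the word vectors if its casing matches the vocabulary the vectors were trained with.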

In [14]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# tokenize and remove stop words from training data
tokenized_train_items = []
for bio in train_df['bio']:
    # tokenize
    tokens = re.findall(r'\w+', bio.lower())  # raw string avoids an invalid-escape warning
    # remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    tokenized_train_items.append(tokens)

[nltk_data] Downloading package stopwords to /opt/conda/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
# len(tokenized_train_items)

In [16]:
len(set(tokenized_train_items[0]))

116

Let's create dense representations from the word2vec models trained on our own wikipedia corpus (`assets/wikipedia.100.word-vecs.kv`). Load it using gensim as `KeyedVectors` and call it `model_wp`.

In [17]:
from gensim.models import KeyedVectors

model_wp = KeyedVectors.load("assets/wikipedia.100.word-vecs.kv")

Complete the function below that takes in a list of lists of tokens (i.e., those tokenized bios you just made) and a set of word vectors to use. Then for each tokenized document in `tokenized_texts`, it computes the mean word vector of all words in the document. Skip those words that don't exist in the vocabulary of your `word_vectors`. These mean word vectors should be returned as a `numpy array` (i.e., a matrix). If a document has no tokens left after filtering (rare, but happens!), use a vector of all zeros that is equal in length with the vector size of your `word_vectors` as the representation of the document.

For more information on numpy’s functions visit: https://numpy.org/doc/stable/reference/generated/numpy.mean.html
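
Before working with the real vectors, it can help to trace the averaging on a toy case. In this sketch a plain dict stands in for the `KeyedVectors` object, and the two-dimensional vectors and the helper name `mean_vector` are hypothetical.

```python
import numpy as np

# Hypothetical 2-dimensional "word vectors"; a plain dict stands in for KeyedVectors.
toy_word_vectors = {
    "born":  np.array([1.0, 0.0]),
    "paris": np.array([0.0, 1.0]),
}
VECTOR_SIZE = 2

def mean_vector(tokens, vectors, size):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)  # axis=0 averages across tokens, per dimension

doc_a = mean_vector(["born", "paris", "zzz"], toy_word_vectors, VECTOR_SIZE)  # "zzz" is skipped
doc_b = mean_vector(["zzz"], toy_word_vectors, VECTOR_SIZE)                   # nothing known

features = np.vstack([doc_a, doc_b])
print(features)
```

The first row is `[0.5, 0.5]` (the average of the two known words) and the second is all zeros, giving one fixed-length row per document regardless of document length.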


In [18]:
import numpy as np

def generate_dense_features(tokenized_texts, word_vectors):
    """
    Generates dense features from tokenized texts using pre-trained word vectors.
    
    Args:
    - tokenized_texts: A list of lists of tokens.
    - word_vectors: A set of word vectors to use.
    
    Returns:
    - A numpy array of dense features with shape (len(tokenized_texts), word_vectors.vector_size).
    """
    results = []
    for tokens in tokenized_texts:
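        # Membership test on the KeyedVectors object itself; the `.vocab` attribute is not available in gensim 4+.
        valid_tokens = [token for token in tokens if token in word_vectors]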
        if len(valid_tokens) > 0:
            mean_vec = np.mean([word_vectors[token] for token in valid_tokens], axis=0)
            results.append(mean_vec)
        else:
            results.append(np.zeros(word_vectors.vector_size))
    return np.array(results)


Finally, let's create the dense vector representations by calling `generate_dense_features` on the tokenized training data. Let's generate a representation using `model_wp` (i.e., our wikipedia vectors) and call it `X_train_wp`.

In [19]:
# tokenized_train_items[:10]

In [20]:
X_train_wp = generate_dense_features(tokenized_train_items, model_wp)

### Task 2.2.5: Sanity Check: print the shape of X_train_wp
Let's ensure that we featurized everything as expected. You should have 100 word features in your Wikipedia training data.

In [21]:
len(tokenized_train_items)

41544

In [22]:
print(X_train_wp.shape)
#hidden tests are within this cell

(41544, 100)


### Task 2.2.6: Get the list of labels
We need to get the final list of labels in a python `list` for sklearn to use. Create this list from `train_df` and let's call it `y_train`. `y` (lower case!) is normally used to refer to the label of the classifier (or value in  a regressor) in machine learning. We use the lower case here to indicate it's a vector, whereas `X` is upper case because it's a matrix.

In [23]:
y_train = list(train_df.nationality)

### Task 2.2.7: Fit a classifier on a subset of the data

Finally, let's fit a classifier on our dense data. For a start we'll use `LogisticRegression`. Don't forget to set the `random_state` to use our `RANDOM_SEED` so you get deterministic (but random) results. To train your classifier, create a `LogisticRegression` object called `clf_wp` and call `fit` with `X_train_wp` and `y_train`. This classifier will get trained on the dense representations we just made.

For this cell, let's just use the first 10,000 rows of `X_train_wp` and `y_train` to fit the classifier. In general, when you have a large dataset, it's useful to go end-to-end and train one of these half-baked classifiers to verify that your model works as expected. You can even do some analyses if the performance is good enough to get a sense of how things are working. Then you can train on the full data.

*Notes:*
1. You should use the `lbfgs` solver, as this generally Just Works™ and is fast.
2. Since we have more than two nationalities, we'll set `multi_class='auto'` so that the classifier isn't binary.
3. `X_train_wp` is a numpy array, so you'll need to use array indexing operations to get the first 10,000 rows.
4. Since we have many classes, we'll increase the maximum number of iterations to 10,000 to ensure convergence.

In [24]:
from sklearn.linear_model import LogisticRegression

clf_wp = LogisticRegression(random_state=RANDOM_SEED, solver='lbfgs', max_iter=10000, multi_class='auto')
clf_wp.fit(X_train_wp[:10000], y_train[:10000])

LogisticRegression(max_iter=10000, random_state=655)

### Task 2.2.8: Generate dev data
Let's tokenize the dev data so we can create a dense representation. The list `tokenized_dev_items` should contain the list of tokens but exclude all tokens in the nltk stopwords list (just like you did for tokenizing the training data).

In [25]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

tokenized_dev_items = []
for bio in dev_df['bio']:
    # tokenize
    tokens = re.findall(r'\w+', bio.lower())  # raw string avoids an invalid-escape warning
    # remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    tokenized_dev_items.append(tokens)

[nltk_data] Downloading package stopwords to /opt/conda/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now let's generate the dense vector for the dev data.

In [26]:
X_dev_wp = generate_dense_features(tokenized_dev_items, model_wp)

In [27]:
print(X_dev_wp.shape)
#hidden tests are within this cell

(5193, 100)


### Task 2.2.9: Create Dummy classifiers
It's always important to contextualize your results by comparing them with naive classifiers. If these classifiers do well, then your task is easy! If not, then you can see how much better your system does at first. We'll use two different strategies using the [Dummy Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) class. Create two `DummyClassifier` instances that use the `uniform` (guess randomly) and `most_frequent` strategies and fit these on the training data so we can compare them with our classifier that was trained on 10K instances. In general, you probably always want to at least compare with these two baselines in a classification task.

*NOTE:* Be sure to set the `random_state` of the `DummyClassifier` to be `RANDOM_SEED` so your scores match.
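
To show what the two strategies do conceptually, here is a hand-rolled sketch on toy labels. It is not how `DummyClassifier` is implemented and will not reproduce its random draws; all names and labels are hypothetical.

```python
import random
from collections import Counter

toy_train_labels = ["american"] * 6 + ["british"] * 3 + ["french"] * 1
toy_dev_labels = ["american", "british", "american", "french", "american"]

# most_frequent: always predict the commonest training label.
majority_label = Counter(toy_train_labels).most_common(1)[0][0]
mf_preds = [majority_label] * len(toy_dev_labels)

# uniform: pick any known label with equal probability, ignoring the input.
rng = random.Random(655)
classes = sorted(set(toy_train_labels))
uniform_preds = [rng.choice(classes) for _ in toy_dev_labels]

mf_accuracy = sum(p == g for p, g in zip(mf_preds, toy_dev_labels)) / len(toy_dev_labels)
print(majority_label, mf_accuracy)
```

Neither baseline looks at the features at all, which is why the feature matrix passed to `fit` and `predict` only matters for its number of rows.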

In [28]:
from sklearn.dummy import DummyClassifier
# Create a Dummy Classifier with uniform strategy
clf_dummy_uniform = DummyClassifier(strategy='uniform', random_state=RANDOM_SEED)
clf_dummy_uniform.fit(X_train_wp[:10000], y_train[:10000])

# Create a Dummy Classifier with most_frequent strategy
clf_dummy_most_frequent = DummyClassifier(strategy='most_frequent', random_state=RANDOM_SEED)
clf_dummy_most_frequent.fit(X_train_wp[:10000], y_train[:10000])

DummyClassifier(random_state=655, strategy='most_frequent')

### Task 2.2.10: Create logistic regression classifier as in exercise 1.1 that does not use word embedding for comparison

As a comparison to see how well our dense-vector-based classifier stacks up, let's create a comparison model that uses words as features (not the vectors). Use the same [TfIdfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) setup as in Exercise 1.1 and the same arguments `min_df=500` and `stop_words='english'`. Create the `TfIdfVectorizer` and call `fit_transform` on the training data to create `X_train`. 
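
The `min_df` argument filters the vocabulary by document frequency, that is, by how many documents contain a term at least once. The sketch below mimics only that filtering step on toy documents; it skips stop-word removal and TF-IDF weighting, and `MIN_DF = 2` is a hypothetical stand-in for `min_df=500`.

```python
import re
from collections import Counter

toy_docs = [
    "born in paris france",
    "born in berlin germany",
    "born in paris",
]
MIN_DF = 2  # toy stand-in for min_df=500

# Document frequency: number of documents containing each term at least once.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(re.findall(r"\w+", doc)))

# Keep only terms that appear in at least MIN_DF documents.
vocabulary = sorted(term for term, df in doc_freq.items() if df >= MIN_DF)
print(vocabulary)  # ['born', 'in', 'paris']
```

With a real vectorizer and `stop_words='english'`, a word like "in" would also be removed. The point here is only that rare terms such as "berlin" never become features.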

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=500, stop_words='english')
X_train = vectorizer.fit_transform(train_df['bio'])

Fit the bag-of-words classifier on the first 10,000 instances in `X_train` and its labels in `y_train`

In [30]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='auto', random_state=RANDOM_SEED,max_iter=10000)
clf.fit(X_train[:10000], y_train[:10000])

LogisticRegression(max_iter=10000, random_state=655)

Transform the development data in `dev_df` using the vectorizer and call this `X_dev`

In [31]:
X_dev = vectorizer.transform(dev_df.bio)
y_dev = list(dev_df.nationality)

### Task 2.2.11: Generate all the predictions

Generate predictions for the dense-vector classifier, the bag-of-words classifier and the two dummy classifiers, and store your predictions in the following variables:
* `lr_wp_tiny_dev_preds` (dense vector)
* `lr_tiny_dev_preds` (bag of words)
* `rand_dev_preds` (random baseline)
* `mf_dev_preds` (most frequent baseline)

In [32]:
# Generate predictions for the dense-vector classifier
lr_wp_tiny_dev_preds = clf_wp.predict(X_dev_wp)

# Generate predictions for the bag-of-words classifier
lr_tiny_dev_preds = clf.predict(X_dev)

# Generate predictions for the random baseline
rand_dev_preds = clf_dummy_uniform.predict(X_dev_wp)

# Generate predictions for the most frequent baseline
mf_dev_preds = clf_dummy_most_frequent.predict(X_dev_wp)

### Task 2.2.12: Score our predictions
Now, let's score the models. Here, we'll use F1 to score and use a _macro_ average so that the score reflects the average F1 performance across all classes, with every class counted equally regardless of its size. You'll want to define the list of gold-standard answers from `dev_df` and call this `y_dev`. Call your F1 scores:
* `lr_wp_f1` (dense vector)
* `lr_f1` (bag of words)
* `rand_f1` (random baseline)
* `mf_f1` (most frequent baseline)
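
To see what the macro average does, here is a hand-written sketch of macro F1 on toy labels. The function `macro_f1` is a hypothetical helper written for illustration. In the exercise itself you would still use `f1_score` with `average='macro'`.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    f1_values = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the class never appears.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)

toy_gold = ["american", "american", "american", "british", "french"]
always_american = ["american"] * 5

score = macro_f1(toy_gold, always_american)
print(round(score, 3))  # 0.25
```

Always guessing the majority class is right 60% of the time here, yet its macro F1 is only 0.25, because the two classes it never predicts each contribute an F1 of zero to the average.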

In [33]:
lr_wp_tiny_dev_preds

array(['japanese', 'russian', 'british', ..., 'american', 'american',
       'british'], dtype='<U13')

In [34]:
lr_wp_f1 = f1_score(y_dev, lr_wp_tiny_dev_preds, average='macro')
lr_f1 = f1_score(y_dev, lr_tiny_dev_preds, average='macro')
rand_f1 = f1_score(y_dev, rand_dev_preds, average='macro')
mf_f1 = f1_score(y_dev, mf_dev_preds, average='macro')

In [35]:
print(lr_wp_f1)
print(lr_f1)
print(rand_f1)
print(mf_f1)
#hidden tests are within this cell

0.34071854283097364
0.6799816930655451
0.042630466943191815
0.028832567997174142


Wow, looking pretty promising. The dense vectors certainly contain some useful information, but the bag of words representation still seems pretty powerful. How about if we trained on all the data?

### Task 2.2.13: Fit a classifier on the full data
Train the following classifiers on all the data:
* The dense-vector classifier trained on Wikipedia data (assigned to `clf_wp`)
* The bag-of-words classifier (assigned to `clf`)

In [36]:
clf_wp = LogisticRegression(random_state=RANDOM_SEED, solver='lbfgs', max_iter=10000, multi_class='auto')
clf_wp.fit(X_train_wp, y_train)

LogisticRegression(max_iter=10000, random_state=655)

In [37]:
clf = LogisticRegression(solver='lbfgs', multi_class='auto', random_state=RANDOM_SEED,max_iter=10000)
clf.fit(X_train, y_train)

LogisticRegression(max_iter=10000, random_state=655)

### Task 2.2.14: Generate all the predictions for the final model and score them

We'll use the same naming scheme here for reporting F1 scores. 

In [38]:
# Generate predictions for the dense-vector classifier
lr_wp_tiny_dev_preds = clf_wp.predict(X_dev_wp)

# Generate predictions for the bag-of-words classifier
lr_tiny_dev_preds = clf.predict(X_dev)

# Calculate the F1 score for the dense-vector classifier
lr_wp_f1 = f1_score(y_dev, lr_wp_tiny_dev_preds, average='macro')

# Calculate the F1 score for the bag-of-words classifier
lr_f1 = f1_score(y_dev, lr_tiny_dev_preds, average='macro')

In [39]:
print(lr_wp_f1)
print(lr_f1)
#hidden tests are within this cell

0.3864132452517425
0.7442532812711844


Some performance improvement! It looks like even a simple model and a simple dense representation are able to capture a lot of information (especially when compared with the baselines). That said, why might the bag of words model be doing so well? Think about what kinds of features we see in text. Do we need dense representations for these (which help with generalization)?

# Exploration
So far, we've only used very simple representations and simple classifiers. If you're interested, you can try some of the following:
* Try a classifier that can model combinations of features, such as a Multi-layer Perceptron or a Random Forest. Since the dense vectors have far fewer features, these models can be _much_ more efficient to train than on a bag-of-words representation.
* Try training your own vectors (or finding other vectors online!) and see if you can get higher performance
* So far, we've represented a biography as just an average word vector. What if we wanted to up-weight certain words? One idea is to use TF-IDF to decide how to combine word vectors, so more important/rare words are more heavily weighted. Try adding in this weighting to see if it improves performance.
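
As a starting point for the weighting idea in the last bullet, here is a minimal sketch of a weighted average of word vectors. The vectors and the per-word weights in `toy_idf` are made up. In practice the weights might come from a fitted `TfidfVectorizer`, but that wiring is left out here.

```python
import numpy as np

# Hypothetical 2-dimensional vectors and per-word weights (e.g., IDF-like scores).
toy_word_vectors = {
    "born":     np.array([1.0, 0.0]),
    "sorbonne": np.array([0.0, 1.0]),
}
toy_idf = {"born": 1.0, "sorbonne": 3.0}  # rarer word gets the larger weight

def weighted_mean_vector(tokens, vectors, weights, size):
    """Weighted average of known tokens' vectors; all zeros if none are known."""
    known = [t for t in tokens if t in vectors and t in weights]
    if not known:
        return np.zeros(size)
    vecs = np.array([vectors[t] for t in known])
    w = np.array([weights[t] for t in known])
    return np.average(vecs, axis=0, weights=w)

doc_vec = weighted_mean_vector(["born", "sorbonne", "zzz"], toy_word_vectors, toy_idf, 2)
print(doc_vec)  # [0.25 0.75]
```

Compared with the plain mean of `[0.5, 0.5]`, the weighted version pulls the document vector toward the rarer, more informative word.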

If you try any of these, feel free to discuss them on the class's Slack.