# NLP Workshop (SOLUTIONS)

_Hendrik Erz, Institute for Analytical Sociology | <hendrik.erz@liu.se> | Twitter: @sahiralsaid_

Welcome to the practical part of the NLP Workshop! In this notebook, you will try out some of the methods covered in the theoretical section. In particular, the methods covered will be:

* tf-idf scores
* topic modeling
* Word2Vec

Below, you will see several exercises that cover most of the steps from an unprocessed text corpus to a final trained model and, lastly, the analysis step.

You will work on these examples in smaller groups with guidance from me.

## Preliminaries: Loading the Corpus

Your task in this exercise is to write a function that returns parts of the corpus in a format that the models we are using can work with. Here, we will be working with a corpus of the **[State of the Union (SOTU)](https://en.wikipedia.org/wiki/State_of_the_Union) addresses of the U.S. presidents**.

***

The first step is always to load the corpus. We will use a **generator** for this, since a generator yields one speech at a time, which keeps the memory footprint small and helps keep model training times low.
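
To illustrate the difference, here is a minimal sketch that is not part of the workshop code. The generator `count_up()` is a hypothetical stand-in: it produces its values one at a time, whereas the list holds all of them in memory at once.

```python
import sys

def count_up(n):
    """Hypothetical generator: yields 0..n-1 one value at a time."""
    for i in range(n):
        yield i

as_list = list(range(100_000))   # all values are held in memory at once
as_gen = count_up(100_000)       # nothing is computed until we iterate

# The generator object stays tiny regardless of n; the list grows with n.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# A generator can only be consumed once; a second pass yields nothing.
first_pass = sum(1 for _ in as_gen)
second_pass = sum(1 for _ in as_gen)
print(first_pass, second_pass)
```

The second pass being empty is the reason why the code in this notebook calls the generator function anew every time it needs the corpus.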

Normally, you would have the corpus downloaded to your computer, but since we're on Google Colab, we'll have to retrieve it from the web first. Since I provide the corpus, you can find a ready-made function below that automatically yields the corpus in the following format:

```python
corpus = [
    ('This is a speech from a republican', 'R'),
    ('This is a speech from a democrat', 'D'),
    # ...
]
```

As you can see, the corpus is conceptually a list of **tuples**, although the generator function yields them one at a time. The first element is always a speech, the second element is a short code indicating the president's party. The party codes are as follows:

* R: Republican
* D: Democrat
* W: Whig
* F: Federalist
* DR: Democratic-Republican
* na: No party
* NU: National Union

***

**Whenever you need the speeches, just call `speeches()` in your code**

In [1]:
import os
import urllib.request as request
import shutil

def maybe_download_file():
    """This function downloads the corpus to the VM and saves it to sotu.csv"""
    outfile = "sotu.csv"
    # The file is about 12MB large and contains 251 speeches.
    file_link = "https://gist.githubusercontent.com/nathanlesage/241cecdbd9a2f97146784abdb063d566/raw/26c17e63889575900cf0140eadcb84056193c78e/sotu.csv"
    if not os.path.exists (outfile):
        with request.urlopen(file_link) as response, open(outfile, 'wb') as fp:
            shutil.copyfileobj(response, fp)

def speeches():
    """A generator that yields (speech, party) tuples"""
    maybe_download_file()

    with open("sotu.csv", "r", encoding="utf-8") as fp:
        for line in fp:
            # Strip the trailing newline so that the party code is e.g. 'R', not 'R\n'
            speech, party = line.rstrip('\n').split('\t')
            yield (speech, party)

In [2]:
# Make sure we have all 251 speeches in our generator
sum([1 for x in speeches()])

251

## Computing tf-idf scores

The simplest way to begin an analysis is by calculating tf-idf scores. Here we will do this "manually" so that you get a sense of what this means. For practical usage, there are libraries that already do this for you.

Calculating tf-idf scores consists of two steps:

1. Define a function that preprocesses the speeches and returns individual tokens
2. Call that function, count the words and calculate the tf-idf scores.

Remember, tf-idf is defined as:

$$
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)
$$

where

$$
{\displaystyle \mathrm {tf} (t,d)={\frac {f_{t,d}}{\sum _{t'\in d}{f_{t',d}}}}}
$$

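with $t$ = the term in question, $t'$ ranging over all terms in document $d$, and $f_{t,d}$ = the number of times term $t$ occurs in document $d$. In other words, $\mathrm{tf}$ is the relative frequency of the term within the document. And:

$$
 \mathrm{idf}(t, D) =  \log \frac{N}{1 + n_t}
$$

with $D$ = the corpus, $N$ = total number of documents in the corpus, and $n_t$ = number of documents that contain term $t$.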
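
As a worked example of the formulas above, the following sketch computes the score of one term in one document by hand. The three-document toy corpus is made up for illustration and has nothing to do with the SOTU data.

```python
import math

# Made-up toy corpus of three already-tokenized documents (an assumption for
# illustration, not taken from the SOTU data).
docs = [
    ["war", "peace", "war"],
    ["peace", "economy"],
    ["economy", "jobs", "economy", "jobs"],
]

term, doc = "war", docs[0]

# tf: occurrences of the term in the document / total tokens in the document
tf = doc.count(term) / len(doc)

# idf: log(N / (1 + number of documents that contain the term))
N = len(docs)
docs_with_term = sum(1 for d in docs if term in d)
idf = math.log(N / (1 + docs_with_term))

tf_idf = tf * idf
print(f"tf={tf:.3f}, idf={idf:.3f}, tf-idf={tf_idf:.3f}")
```

For these numbers, tf = 2/3 and idf = log(3/2) ≈ 0.405, which gives a tf-idf score of roughly 0.270.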

### Exercise 1: Preprocess the text for tf-idf

Below, write a function that takes speeches as returned from the function above and returns a list of tokens. You should remove digits, punctuation marks, and other symbols that do not comprise regular, English words.

> TIP: The NLTK package offers a lot of useful functions for working with natural language. It includes a list of so-called stopwords and functions to tokenize a text. Python's built-in `str` class also provides handy methods you can use.
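
To give an idea of what the `str` methods alone can do, here is a minimal sketch without NLTK. The helper `simple_tokens()` and the tiny stopword set `toy_stops` are hypothetical names made up for this example; NLTK's tokenizer and stopword list cover many more cases.

```python
import string

# Tiny made-up stopword set for illustration; NLTK's English list is much longer.
toy_stops = {"the", "of", "and", "is", "a"}

def simple_tokens(text):
    """Hypothetical helper: lowercase, strip punctuation, keep alphabetic non-stopwords."""
    # str.translate with this table deletes every ASCII punctuation character
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w.isalpha() and w not in toy_stops]

print(simple_tokens("The State of the Union, 1790: peace and commerce!"))
```

Note that splitting on whitespace is cruder than a real tokenizer, for example when it comes to contractions or hyphenated words.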

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load a common English stopword list; a set makes the membership test below fast
stops = set(stopwords.words('english'))

def preprocess_speeches():
    for speech, _ in speeches():
        # The following first converts the speech to lowercase and then tokenizes
        # it with NLTK's built-in tokenizer. It then also removes tokens that are
        # not purely alphabetic (i.e. punctuation and numbers) as well as stopwords.
        yield [t for t in word_tokenize(speech.lower()) if t.isalpha() and t not in stops]

In [4]:
# Run this cell to see how the preprocessor processes the first speech.
next(preprocess_speeches())

['good',
 'afternoon',
 'beginning',
 'new',
 'year',
 'reflect',
 'state',
 'american',
 'union',
 'seek',
 'definition',
 'america',
 'means',
 'carl',
 'sandburg',
 'came',
 'close',
 'capturing',
 'real',
 'meaning',
 'three',
 'simple',
 'words',
 'became',
 'title',
 'one',
 'greatest',
 'poems',
 'people',
 'yes',
 'america',
 'risen',
 'greatness',
 'chips',
 'american',
 'people',
 'said',
 'yes',
 'yes',
 'challenge',
 'freedom',
 'yes',
 'dare',
 'progress',
 'yes',
 'hope',
 'peace',
 'defending',
 'peace',
 'meant',
 'paying',
 'price',
 'war',
 'america',
 'greatness',
 'endure',
 'future',
 'institutions',
 'continually',
 'rededicate',
 'saying',
 'yes',
 'people',
 'yes',
 'human',
 'needs',
 'aspirations',
 'yes',
 'democracy',
 'consent',
 'governed',
 'yes',
 'equal',
 'opportunity',
 'unlimited',
 'horizons',
 'achievement',
 'every',
 'american',
 'spirit',
 'rededication',
 'send',
 'congress',
 'next',
 'days',
 'fourth',
 'section',
 'state',
 'union',
 'report

### Exercise 2: Calculate tf-idf scores

Below, write a function that takes the lists of tokens returned by the `preprocess_speeches()` function and returns a dictionary of tf-idf scores for each word.

> Remember that you will have to make several passes over the words, since you do not just need to calculate the relative frequencies of terms within a single document, but also which other documents contain a term. To index the documents, it suffices to use indices from 0 to the number of documents - 1.
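
The idea of the separate passes can be sketched on a made-up mini corpus as follows. The names `toy_docs`, `doc_freq`, and `term_counts` are chosen for this illustration only.

```python
from collections import Counter

# Made-up tokenized documents, indexed 0..2 by their position in the list.
toy_docs = [
    ["union", "state", "union"],
    ["state", "peace"],
    ["peace", "peace", "war"],
]

# Pass 1: document frequency. Wrapping each document in set() ensures that a
# word is counted at most once per document, however often it occurs there.
doc_freq = Counter()
for tokens in toy_docs:
    doc_freq.update(set(tokens))

# Pass 2: term counts within each document, keyed by the document's index.
term_counts = {idx: Counter(tokens) for idx, tokens in enumerate(toy_docs)}

print(doc_freq["union"], term_counts[0]["union"])
```

Here, `union` occurs twice in document 0 but only counts once towards the document frequency.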

In [5]:
import numpy as np
from collections import Counter

def calculate_tf_idf ():
    tfidf = dict()
    # The dictionary should look like this:
    # {
    #   0: {
    #     'word1': 0.4325,
    #     'word2': 0.9512,
    #     ...
    #   },
    #   1: {
    #     'word2': 0.124,
    #     ...
    #   },
    #  ...
    # }

    # Get the total number of documents
    N = sum([1 for x in speeches()])

    # Determine the number of documents in which each word occurs
    doc_counter = Counter()
    for tokens in preprocess_speeches():
        doc_counter.update(set(tokens))

    # Calculate the idf-scores for every word
    idf = dict()
    for word, count in doc_counter.items():
        idf[word] = np.log(N / (1 + count))

    # Now we can calculate tf-idf. We `enumerate` the speeches to have an index
    # to refer to the documents, because each word has a different tf-idf score
    # for every document
    for idx, tokens in enumerate(preprocess_speeches()):
        # Create the tf-idf dictionary
        tfidf[idx] = dict()

        # Count all the words in this specific document
        local_frequency = Counter(tokens)
        local_sum = len(tokens) # Number of all words

        # Calculate tf-idf once per distinct word in the document
        for token, count in local_frequency.items():
            tf = count / local_sum
            tfidf[idx][token] = tf * idf[token]

    return tfidf

In [6]:
# Call the function and calculate the tf-idf scores
tfidf = calculate_tf_idf()

### Exercise 3: Analysis of tf-idf scores

Below, write code that first prints the highest-scoring word for each speech and afterwards the lowest-scoring word for each speech.

Explain what makes the words important or unimportant, and what this means in the context of the SOTU corpus.
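
One detail that may help with the explanation: with the idf definition used in this notebook, a word that occurs in every speech gets a slightly negative idf, because $N / (1 + N) < 1$. The sketch below uses the corpus size of 251 from above; the two document counts are made-up cases.

```python
import math

N = 251  # number of speeches in the corpus

def idf(n_docs_with_term, n_docs=N):
    """idf as defined in this notebook: log(N / (1 + document count))."""
    return math.log(n_docs / (1 + n_docs_with_term))

everywhere = idf(251)   # hypothetical word found in every speech
rare = idf(1)           # hypothetical word found in a single speech

print(f"in all speeches: {everywhere:.4f}, in one speech: {rare:.4f}")
```

Words that appear in (almost) every speech therefore end up with scores at or below zero, no matter how often they occur within a single speech.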

In [10]:
# tfidf = calculate_tf_idf()

print("Most important words (according to tf-idf)")
words = []
for d in tfidf.values():
    max_score = max(d.values())
    words.extend(k for k, score in d.items() if score == max_score)
print(", ".join(set(words)))

print("")
print("Least important words (according to tf-idf)")
words = []
for d in tfidf.values():
    min_score = min(d.values())
    words.extend(k for k, score in d.items() if score == min_score)
# Do you see how, by calling set() we see that we only have three least important words?
print(", ".join(set(words)))

Most important words (according to tf-idf)
post, urban, colonies, tonight, nontaxable, california, heroin, submarines, ca, manila, constitution, complement, vietnam, hired, circuit, eighty, challenge, chickamaugas, texas, tell, gentlemen, applause, tile, attitudes, unrest, outline, percents, exhibition, croix, marijuana, seventies, warmest, nevada, chambers, axis, acquisition, missiles, balize, billion, freedmen, currency, cable, anarchist, florida, tons, german, disavowal, bank, hussein, poverty, slaves, objects, clarke, ports, fy, rebekah, contracts, violences, delawares, embargo, defeatism, cory, forest, economic, likewise, canal, alliance, regulatory, pennsylvania, augmentation, purchasing, bosnia, door, program, thank, afghanistan, banks, coinage, depression, hitler, tasks, iraqi, specifications, indemnity, observatory, savages, jobs, going, survey, removals, eighties, indians, percent, nitrates, isil, intimating, exclusions, cent, collective, smyrna, programs, spain, inflation, n

## Running a Topic Model

The next step to see what is inside our corpus is to run a topic model. The most common model is Latent Dirichlet Allocation (LDA), and the library `sklearn` already provides an implementation. Again, we have to preprocess the speeches first, but this time we have to do it differently.

Running an LDA model requires a so-called Document-Term Matrix (DTM). In the binary variant we build here, every document is a row vector of zeros and ones (similar to a "one hot"-vector, but with one entry set for each distinct word in the document). The matrix has the shape `(number of documents, number of words)`, and each cell is set to `0` if the document does not contain the word, and `1` if it does.

With `preprocess_speeches()` from above, we already have a function that spits out our tokens. We now just need to build the DTM based on that. Building a DTM normally consists of these steps:

1. Create a vocabulary that contains every token within the whole corpus
2. Optionally, remove the most frequent and the least frequent terms to reduce the number of words
3. Go over the corpus and set the corresponding cell in the matrix to `1` if the document contains a word in the vocabulary.
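
The three steps above can be sketched on a made-up mini corpus as follows. The thresholds for step 2 are arbitrary assumptions, and the matrix is a plain nested list here, whereas the solution below uses a NumPy array.

```python
from collections import Counter

# Made-up tokenized corpus (an assumption for illustration).
toy_docs = [
    ["state", "union", "war"],
    ["state", "peace", "war"],
    ["state", "peace", "trade"],
    ["state", "union", "treaty"],
]

# Steps 1 and 2: count in how many documents each word occurs, then keep only
# words that are neither in every document nor in just one (made-up thresholds).
doc_freq = Counter(word for tokens in toy_docs for word in set(tokens))
kept = sorted(w for w, n in doc_freq.items() if 1 < n < len(toy_docs))
vocab = {word: idx for idx, word in enumerate(kept)}

# Step 3: one row per document, one column per vocabulary word.
dtm = [[0] * len(vocab) for _ in toy_docs]
for row, tokens in enumerate(toy_docs):
    for token in tokens:
        if token in vocab:
            dtm[row][vocab[token]] = 1

print(vocab)
print(dtm)
```

In this toy corpus, `state` is dropped because it occurs everywhere, and `trade` and `treaty` are dropped because they occur only once.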

### Exercise 4: Build a Vocabulary

We will need the vocabulary several times, so it makes sense to write a dedicated function for it. The easiest form of a vocabulary is a dictionary that maps words to indices:

```python
vocab = {
    'word': 0,
    'word2': 1,
    # ...
}
```

Since we also need to figure out words by their indices, we should also create a so-called `i2w`-dictionary. The `i2w` performs the reverse lookup and maps indices to words:

```python
i2w = {
    0: 'word',
    1: 'word2',
    # ...
}
```

Below, write a function that returns both a vocab and an i2w.
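
A compact way to express both mappings, shown here on a made-up token list, is `dict.setdefault` for the forward mapping and a dictionary comprehension for the reverse one. This is only a sketch of the idea, not the only way to do it.

```python
# Made-up token stream (an assumption for illustration).
tokens = ["state", "union", "state", "congress", "union"]

vocab = {}
for token in tokens:
    # setdefault only assigns an index the first time a word is seen
    vocab.setdefault(token, len(vocab))

# Reverse lookup: swap keys and values
i2w = {idx: word for word, idx in vocab.items()}

print(vocab)
print(i2w)
```

Because every word receives a unique index, looking a word up in `vocab` and then in `i2w` always leads back to the same word.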

In [11]:
def build_vocab ():
    # First generate the word -> index mapping
    vocab = {}
    for tokens in preprocess_speeches():
        for token in tokens:
            if token not in vocab:
                # This basically adds unseen words to the end of the vocabulary
                vocab[token] = len(vocab)

    # Now reverse it (important later), i.e. index -> word mapping.
    # i2w = index2word
    i2w = {}
    for token in vocab:
        i2w[vocab[token]] = token

    return vocab, i2w

### Exercise 5: Build a DTM

Below, write a function that creates a DTM. We have already provided a matrix that is set to all zeros and can be fed into the LDA function.

In [12]:
import numpy as np

def build_dtm ():
    # Retrieve the vocabulary
    vocab, i2w = build_vocab()

    # Instantiate the DTM with all zeros, meaning: At first, no document contains
    # any word.
    n_documents = sum([1 for x in speeches()])
    n_words = len(vocab)
    dtm = np.zeros((n_documents, n_words), dtype=np.longlong)

    # Then, iterate over every document and every word, and set those cells to
    # 1 where a word is contained in a document.
    for idx, tokens in enumerate(preprocess_speeches()):
        for token in tokens:
            if token in vocab:
                dtm[idx][vocab[token]] = 1

    return dtm

### Exercise 6: Run the Topic model

Below, write a function that trains a topic model. I have already added the correct function import for you. One thing that you will need to do, however, is figure out three hyperparameters: K, alpha, and beta.

Since we are dealing with a small corpus, let us just set $K = 10$. However, you still need to figure out a good alpha and a good beta. Beta should normally be larger than alpha, and both should be smaller than 0.5. Feel free to run the model several times while doing Exercise 7 to figure out good values.
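
To get an intuition for what such a prior does, the following sketch draws samples from a symmetric Dirichlet distribution with a small and with a large concentration value and compares how much probability mass the largest component receives on average. It uses only the standard library; the function names and the values 0.1 and 10 are assumptions chosen for illustration, and the sketch says nothing about which values work best for the SOTU corpus.

```python
import random

def sample_dirichlet(concentration, k, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
K = 10

def mean_largest_share(concentration, n_samples=500):
    """Average size of the largest component across many samples."""
    return sum(max(sample_dirichlet(concentration, K, rng))
               for _ in range(n_samples)) / n_samples

sparse = mean_largest_share(0.1)   # small prior: mass concentrates on few topics
dense = mean_largest_share(10.0)   # large prior: mass spreads across all topics

print(f"alpha=0.1: {sparse:.2f}, alpha=10: {dense:.2f}")
```

Read as a document-topic prior, a small alpha thus favors documents that are dominated by a few topics, while a large alpha favors documents that mix many topics.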

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda_model ():
    # First, retrieve the DTM and set the hyperparameters for the model.
    dtm = build_dtm()
    K = 10
    alpha = 0.001
    beta = 0.01

    # Then instantiate the model, and fit it to our data
    model = LatentDirichletAllocation(
        n_components=K,
        doc_topic_prior=alpha,
        topic_word_prior=beta
    )
    model.fit(dtm)

    return model

In [14]:
# Train a model
model = fit_lda_model()

### Exercise 7: Analyze the topic model

The last step in this exercise is to analyze the topic model. The most common method is to simply output the most important words (here, let us use 10 words) for each topic and see if you can make out any semantic topics.

> Below, write a function that prints the ten most important words for each topic. TIP: In order to sort the words correctly, you can use the functions `np.argsort` and, afterwards, `np.fliplr`, to reverse the order of the top words. Additionally, the topic-term-matrix is accessible with the property `components_` of the trained model. The shape of this matrix is `(n_topics, n_words)`.
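
How `np.argsort` and `np.fliplr` work together can be seen on a small, made-up topic-term matrix. The weights and the mapping `toy_i2w` are invented for this illustration.

```python
import numpy as np

# Made-up topic-term matrix: 2 topics (rows) x 4 words (columns).
components = np.array([
    [0.1, 0.7, 0.2, 0.4],
    [0.9, 0.3, 0.5, 0.05],
])
toy_i2w = {0: "war", 1: "peace", 2: "trade", 3: "bank"}

ascending = np.argsort(components, axis=1)   # column indices, smallest weight first
descending = np.fliplr(ascending)            # reverse each row: largest weight first
top2 = descending[:, :2]                     # keep the two highest-weighted words

for topic_idx, word_ids in enumerate(top2):
    print(f"Topic {topic_idx + 1}:", ", ".join(toy_i2w[int(i)] for i in word_ids))
```

Note that `np.argsort` returns column indices rather than the weights themselves, which is why an index-to-word mapping is needed to print the actual words.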

In [18]:
def print_top_words (model):
    # Since we don't want to output numbers, we need an index->word mapping
    _, i2w = build_vocab()

    # How many words do we want to output?
    L = 10

    # This line first sorts every row ascending, i.e. the least important words
    # for each topic are at the beginning, the most important words at the end.
    # We then flip every row around so that the most important words are at the
    # beginning. Lastly, we keep only the first ten words.
    topic_list = np.fliplr(np.argsort(model.components_, axis=1))[:, :L] # Sort each row
    for idx, words in enumerate(topic_list):
        # Here we transform the indices from the topic_list to the actual words
        w = [i2w[wd] for wd in words]
        # Then print it as a comma-separated list
        print(f"Topic {idx + 1}: " + ", ".join(w))

In [19]:
# Call the function
print_top_words(model)

Topic 1: representatives, session, citizens, war, made, general, government, country, united, states
Topic 2: imprudence, unabating, uncommon, fore, reenlisted, guthrie, certifications, barrow, encamping, hering
Topic 3: policies, projects, activities, economic, problem, areas, problems, assistance, needs, needed
Topic 4: consideration, commerce, duty, navy, might, representatives, due, opinion, treaty, within
Topic 5: challenge, belief, achievements, construction, proposal, alone, proposals, half, general, governments
Topic 6: economic, today, program, help, need, come, must, great, people, peace
Topic 7: concentration, giving, putting, readily, individual, meeting, regard, extending, lines, respect
Topic 8: help, sure, allies, ensure, pay, thank, defend, million, said, want
Topic 9: thank, bless, tonight, laughter, ca, funding, reform, troops, medicare, nuclear
Topic 10: absolutely, germany, otherwise, stands, ordered, judgment, seeking, seat, detention, perfectly


## Running Word2Vec

The most advanced NLP method we will cover today is a Word2Vec model. Such a model encodes co-occurrence patterns of words in so-called word embeddings, i.e. vectors of numbers with typically 50, 100, 200, or 300 dimensions.

Here, you will write the least code, since we will be using the gensim library to run Word2Vec. However, due to the requirements of the Word2Vec algorithm, we need to write a simple class that the Word2Vec model can use. Since engineering is not part of this workshop, I have already provided this class:

In [20]:
class RestartableGenerator:
    def __init__ (self, func):
        self.func = func

    def __iter__ (self):
        return self.func()

# Create a new instance of this class by calling RestartableGenerator(preprocess_speeches) and pass that to Word2Vec
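
The following sketch shows why the wrapper class is needed. As far as I know, Word2Vec iterates over the corpus more than once (to build its vocabulary and then for each training epoch), and a plain generator is empty after the first pass. The function `toy_sentences()` is a hypothetical stand-in for `preprocess_speeches()`.

```python
def toy_sentences():
    """Hypothetical stand-in for preprocess_speeches()."""
    yield ["state", "union"]
    yield ["war", "peace"]

class RestartableGenerator:
    """Same idea as the class above: __iter__ creates a fresh generator each time."""
    def __init__(self, func):
        self.func = func

    def __iter__(self):
        return self.func()

plain = toy_sentences()
wrapped = RestartableGenerator(toy_sentences)

# A plain generator is exhausted after the first pass ...
plain_passes = [len(list(plain)), len(list(plain))]
# ... whereas the wrapper can be iterated as often as needed.
wrapped_passes = [len(list(wrapped)), len(list(wrapped))]

print(plain_passes, wrapped_passes)
```

The wrapper still reads the corpus lazily on every pass, so the memory advantage of the generator is kept.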

### Exercise 8: Run Word2Vec

Below, write code that imports gensim's Word2Vec model and runs it on our corpus, utilizing the `RestartableGenerator` class so that Word2Vec can work with your preprocessing generator.

Train two models, one with a window size (`window`) of 5 and one with a window size of 30.

In [21]:
from gensim.models import Word2Vec

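gen = RestartableGenerator(preprocess_speeches)

# The most complex model, but the least required code!
# window=5 is also gensim's default window size
w2v = Word2Vec(gen, window=5)
# Second model with the larger context window
w2v_wide = Word2Vec(gen, window=30)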

### Exercise 9: Analyze the word embeddings

As a last exercise for today, we analyze the word embeddings. Word embeddings are very good for finding out which words are related to each other. With a trained gensim model, you can check this by calling `model.wv.most_similar('word')`.

> Below, print out the most similar words for `america`, `government`, `bank`, and `war` for both models.
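
The similarity scores that `most_similar` reports are, to my knowledge, cosine similarities between word vectors. The following sketch computes them by hand for three made-up, three-dimensional "embeddings"; the vectors and the helper `cosine_similarity()` are assumptions for illustration and not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings"; real ones have many more dimensions.
toy_vectors = {
    "bank":  [0.9, 0.1, 0.0],
    "banks": [0.8, 0.2, 0.1],
    "war":   [0.0, 0.2, 0.9],
}

query = "bank"
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Vectors that point in a similar direction score close to 1, while vectors that are nearly orthogonal score close to 0.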

In [22]:
print("Most similar words to 'america'")
print(w2v.wv.most_similar('america'))
print("")
print("Most similar words to 'government'")
print(w2v.wv.most_similar('government'))
print("")
print("Most similar words to 'bank'")
print(w2v.wv.most_similar('bank'))
print("")
print("Most similar words to 'war'")
print(w2v.wv.most_similar('war'))

Most similar words to 'america'
[('allies', 0.8926410675048828), ('iraq', 0.8665159940719604), ('friends', 0.8628460168838501), ('democracy', 0.8561817407608032), ('proud', 0.8558210134506226), ('leadership', 0.8535337448120117), ('historic', 0.8531145453453064), ('iran', 0.8491712808609009), ('afghanistan', 0.8468578457832336), ('today', 0.8390277624130249)]

Most similar words to 'government'
[('courts', 0.6342569589614868), ('constitution', 0.600544810295105), ('authorities', 0.5990092754364014), ('form', 0.5988146066665649), ('jurisdiction', 0.5941332578659058), ('functions', 0.5925008058547974), ('judiciary', 0.5816866755485535), ('authority', 0.5756994485855103), ('legislature', 0.5613027811050415), ('executive', 0.557809591293335)]

Most similar words to 'bank'
[('banks', 0.8765361309051514), ('deposit', 0.8706388473510742), ('notes', 0.8547083735466003), ('circulating', 0.8458355069160461), ('deposits', 0.8433732986450195), ('redemption', 0.8371770977973938), ('circulation', 0.