# NLP Workshop (05.06.2022)

_Hendrik Erz, Institute for Analytical Sociology | <hendrik.erz@liu.se> | Twitter: @sahiralsaid_

Welcome to the practical part of the NLP Workshop! In this notebook, you will try out some of the methods covered in the theoretical section. In particular, the methods covered will be:

* tf-idf scores
* topic modeling
* Word2Vec

Below, you will find several exercises that cover most of the steps from an unprocessed text corpus to a trained model and, finally, the analysis.

You will work on these examples in smaller groups with guidance from me.

## Preliminaries: Loading the Corpus

Your task in this exercise is to write a function that returns parts of the corpus in a form that the models we are using can work with. Here, we will be working with a corpus of the **[State of the Union (SOTU)](https://en.wikipedia.org/wiki/State_of_the_Union) addresses of the U.S. presidents**.

***

The first step is always to load the corpus. We will use a **generator** for this, since a generator hands over one speech at a time instead of holding the whole corpus in memory, which keeps the memory footprint small.

Normally, you would have the corpus downloaded to your computer, but since we're on Google Colab, we'll have to retrieve it from the web first. Since I provide the corpus, below you can find a ready-made function that will automatically return the corpus in the following format:

```python
corpus = [
    ('This is a speech from a republican', 'R'),
    ('This is a speech from a democrat', 'D'),
    # ...
]
```

As you can see, this generator function yields **tuples**, one per speech. The first element is always a speech, the second element is a code indicating the president's party. The party codes are as follows:

* R: Republican
* D: Democrat
* W: Whig
* F: Federalist
* DR: Democratic-Republican
* na: No party
* NU: National Union
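
As a small illustration of how such a generator of tuples can be consumed, the following sketch counts speeches per party. The function `toy_speeches()` and its three entries are made up for this example and only stand in for the real corpus.

```python
from collections import Counter

def toy_speeches():
    """Hypothetical stand-in for the corpus generator: yields (speech, party) tuples."""
    data = [
        ("This is a speech from a republican", "R"),
        ("This is a speech from a democrat", "D"),
        ("This is another speech from a republican", "R"),
    ]
    for speech, party in data:
        # Each tuple is only produced when the consumer asks for the next item
        yield (speech, party)

# Count how many speeches each party contributed, one tuple at a time
party_counts = Counter(party for _, party in toy_speeches())
print(party_counts)
```

Because the tuples are unpacked one by one, the full list of speeches never has to exist in memory at the same time.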

***

**Whenever you need the speeches, just call `speeches()` in your code**

In [17]:
import os
import urllib.request as request
import shutil

def maybe_download_file():
    """This function downloads the corpus to the VM"""
    outfile = "sotu.csv"
    # The file is about 12MB large and contains 251 speeches.
    file_link = "https://gist.githubusercontent.com/nathanlesage/241cecdbd9a2f97146784abdb063d566/raw/26c17e63889575900cf0140eadcb84056193c78e/sotu.csv"
    if not os.path.exists (outfile):
        with request.urlopen(file_link) as response, open(outfile, 'wb') as fp:
            shutil.copyfileobj(response, fp)

def speeches ():
    """A generator that yields (speech, party) tuples"""
    maybe_download_file()

    with open("sotu.csv", "r") as fp:
        for line in fp:
            # Strip the trailing newline so the party code is 'R' and not 'R\n'
            speech, party = line.rstrip("\n").split("\t")
            yield (speech, party)

## Computing tf-idf scores

The simplest way to begin an analysis is to calculate tf-idf scores. Here we will do this "manually" so that you get a sense of what the scores mean. In practice, there are libraries that already do this for you, for example the `TfidfVectorizer` in `sklearn`.

Calculating tf-idf scores consists of two steps:

1. Define a function that preprocesses the speeches and returns individual tokens
2. Call that function, count the words and calculate the tf-idf scores.

Remember, tf-idf is defined as:

$$
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)
$$

where

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

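with $t$ = the term in question, $t'$ = every term that occurs in document $d$, and $f_{t,d}$ = the number of times term $t$ occurs in document $d$, so that $\mathrm{tf}(t, d)$ is the relative frequency of $t$ in $d$. And: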

$$
\mathrm{idf}(t, D) = \log \frac{N}{1 + n_t}
$$

with $N$ = total number of documents in the corpus $D$ and $n_t$ = number of documents that contain term $t$.
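
To make the formulas concrete, here is a minimal sketch that evaluates them on a toy corpus of three tiny "documents". The corpus, the variable names, and the helper `tf_idf()` are invented for illustration. The sketch uses the natural logarithm and the `1 +` term in the denominator exactly as in the idf definition above.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens
docs = [
    ["the", "union", "is", "strong"],
    ["the", "economy", "is", "growing"],
    ["war", "and", "peace"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for tokens in docs:
    doc_freq.update(set(tokens))  # set() so each document counts only once

def tf_idf(term, tokens):
    """tf-idf of `term` within the document `tokens`, following the formulas above."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / (1 + doc_freq[term]))
    return tf * idf

print(tf_idf("union", docs[0]))  # occurs in one document only
print(tf_idf("the", docs[0]))    # occurs in two of the three documents
```

Here "union" receives a positive score, while "the" scores exactly zero because $N / (1 + n_t) = 3 / 3 = 1$. With this variant of the formula, a term that occurs in every document would even receive a negative idf.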

### Exercise 1: Preprocess the text for tf-idf

Below, write a function that takes speeches as returned from the function above and returns a list of tokens. You should remove digits, punctuation marks, and other symbols that do not comprise regular, English words.

> TIP: The NLTK package offers many useful functions for working with natural language, including a tokenizer and lists of so-called stopwords that you can remove. Python's built-in `str` type and the `string` module also provide simple helpers you can use.
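
One possible way to approach the cleaning step, sketched here with the standard library only: the helper name `tokenize()` is hypothetical, and the sketch deliberately skips stopword removal and only handles ASCII punctuation, so typographic quotes or dashes in the real corpus would need extra care.

```python
import string

# Translation table that deletes all ASCII punctuation marks and digits
_REMOVE = str.maketrans("", "", string.punctuation + string.digits)

def tokenize(text):
    """Lowercase the text, strip punctuation and digits, and split on whitespace."""
    cleaned = text.lower().translate(_REMOVE)
    return cleaned.split()

tokens = tokenize("In 1790, the Union was young; today it is strong!")
print(tokens)
```

Splitting on whitespace is the crudest tokenizer there is. A dedicated tokenizer such as the one NLTK ships will treat contractions and hyphenated words more carefully.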

In [None]:
def preprocess_speeches():
    for speech, _ in speeches():
        # TODO: Replace the following line with your preprocessing code. You should
        # return a list of tokens, e.g., ['the', 'cat', 'sat', 'on', 'the', 'mat']
        yield speech

### Exercise 2: Calculate tf-idf scores

Below, write a function that takes the lists of tokens returned by the `preprocess_speeches()` function and returns a dictionary that maps each document index to the tf-idf scores of the words in that document.

> Remember that you will have to make several passes over the words, since you do not just need to calculate the relative frequencies of terms within a single document, but also which other documents contain a term. To index the documents, it suffices to use indices from 0 to the number of documents - 1.

In [3]:
def calculate_tf_idf ():
    tfidf = dict()
    # The dictionary should look like this:
    # {
    #   0: {
    #     'word1': 0.4325,
    #     'word2': 0.9512,
    #     ...
    #   },
    #   1: {
    #     'word2': 0.124,
    #     ...
    #   },
    #  ...
    # }

    # TODO: Add your code here

    return tfidf

### Exercise 3: Analysis of tf-idf scores

Below, write code that prints the highest-scoring word for each speech and, afterwards, the lowest-scoring word.

Explain what makes the words important or unimportant, and what this means in the context of the SOTU corpus.
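
As a hint for the mechanics only, the following sketch picks the highest- and lowest-scoring word from a nested dictionary of the shape described in Exercise 2. The dictionary `toy_tfidf` and all its values are invented for illustration.

```python
# Toy scores in the same nested shape as the tf-idf dictionary (values invented)
toy_tfidf = {
    0: {"union": 0.10, "strong": 0.07, "the": 0.0},
    1: {"economy": 0.12, "growing": 0.09, "the": 0.0},
}

for doc_id, scores in toy_tfidf.items():
    # max/min iterate over the words and compare them by their score
    top = max(scores, key=scores.get)
    bottom = min(scores, key=scores.get)
    print(f"Document {doc_id}: highest = {top!r}, lowest = {bottom!r}")
```

If several words share the same score, `max` and `min` return the first one they encounter, so ties are broken by insertion order.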

In [2]:
tfidf = calculate_tf_idf()

# TODO: Add your code here

## Running a Topic Model

The next step in exploring our corpus is to run a topic model. The most common model is Latent Dirichlet Allocation (LDA), and the library `sklearn` already provides an implementation. Again, we have to preprocess the speeches first, but this time in a different way.

Running an LDA model requires a so-called Document-Term Matrix (DTM). In it, each document is represented as a binary vector over the vocabulary. The matrix has the shape `(number of documents, number of words)`, and each cell is set to `0` if the document does not contain the word, and to `1` if it does.

With `preprocess_speeches()` from above, we already have a function that spits out our tokens. We now just need to build the DTM based on that. Building a DTM normally consists of these steps:

1. Create a vocabulary that contains every token within the whole corpus
2. Optionally, remove the most frequent and the least frequent terms to reduce the number of words
3. Go over the corpus and, for every document, set the cells of all vocabulary words it contains to `1`.
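
The optional pruning in step 2 can be sketched as a filter on document frequencies. The toy corpus and the thresholds `min_df` and `max_df` are hypothetical names and values chosen for this example, not recommendations for the SOTU corpus.

```python
from collections import Counter

# Toy corpus (made up): each document is a list of tokens
docs = [
    ["the", "union", "is", "strong"],
    ["the", "economy", "is", "growing"],
    ["the", "war", "is", "over"],
]

# Count in how many documents each term appears
doc_freq = Counter()
for tokens in docs:
    doc_freq.update(set(tokens))

# Hypothetical thresholds: keep terms that occur in at least min_df
# and at most max_df documents
min_df, max_df = 1, 2
kept = sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)
print(kept)
```

In this toy corpus, "the" and "is" appear in all three documents and are therefore dropped, which is the intended effect on very common words.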

### Exercise 4: Build a Vocabulary

We will need the vocabulary several times, so it makes sense to write a dedicated function for it. The easiest form of a vocabulary is a dictionary that maps words to indices:

```python
vocab = {
    'word': 0,
    'word2': 1,
    # ...
}
```

Since we also need to look up words by their indices, we should also create a so-called `i2w`-dictionary. The `i2w` performs the reverse lookup and maps indices to words:

```python
i2w = {
    0: 'word',
    1: 'word2',
    # ...
}
```

Below, write a function that returns both a vocab and an i2w.
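
To illustrate the two mappings, here is a minimal sketch on two toy documents. The function name `build_toy_vocab()` and its `docs` argument are assumptions made for this example. In the exercise, the tokens would come from your preprocessing generator instead.

```python
def build_toy_vocab(docs):
    """Assign each distinct token the next free index, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    # Reverse mapping: index -> word
    i2w = {index: word for word, index in vocab.items()}
    return vocab, i2w

vocab, i2w = build_toy_vocab([["the", "cat", "sat"], ["the", "mat"]])
print(vocab)
print(i2w)
```

Using `len(vocab)` as the next index guarantees that the indices run from 0 to the vocabulary size minus 1 without gaps, which is what a matrix column index requires.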

In [1]:
def build_vocab ():
    vocab = {}
    i2w = {}

    # TODO: Add your code here

    return vocab, i2w

### Exercise 5: Build a DTM

Below, write a function that creates a DTM. I have already provided a matrix that is initialized with zeros and can be passed to the LDA model.
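
The filling step can be sketched as follows on a toy example. The two documents and the hand-written `vocab` are invented for illustration. Only the matrix construction mirrors the cell below.

```python
import numpy as np

# Toy data (made up): two tokenized documents and a matching vocabulary
docs = [["the", "cat", "sat"], ["the", "mat"]]
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

# One row per document, one column per vocabulary word
dtm = np.zeros((len(docs), len(vocab)), dtype=np.longlong)

for doc_id, tokens in enumerate(docs):
    for token in tokens:
        # Skip tokens that were pruned from the vocabulary
        if token in vocab:
            dtm[doc_id, vocab[token]] = 1

print(dtm)
```

Each row now marks which vocabulary words occur in the corresponding document. Repeated words do not change the result, since the cell is simply set to `1` again.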

In [None]:
import numpy as np

def build_dtm ():
    # Here we already provide the DTM in a format suitable for the LDA function.
    # Just adapt the shape of the matrix with the correct number of documents
    # and words in the vocabulary below.
    n_documents = 1 # TODO: Adapt
    n_words = 1 # TODO: Adapt
    dtm = np.zeros((n_documents, n_words), dtype=np.longlong)

    # TODO: Add your own code here

    return dtm

### Exercise 6: Run the Topic model

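Below, write a function that trains a topic model. I have already added the correct import for you. What you still need to do, however, is choose three hyperparameters: $K$ (the number of topics), alpha, and beta. In `sklearn`, these are the parameters `n_components`, `doc_topic_prior`, and `topic_word_prior`, respectively.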

Since we are dealing with a small corpus, let us just set $K = 1$. However, you still need to figure out a good alpha and a good beta. Beta should normally be larger than alpha, and both should be smaller than 0.5. Feel free to run the model several times while working on Exercises 6 and 7 to find good values.
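
For orientation, this is roughly what fitting looks like on a random toy matrix. In `sklearn`, $K$ is passed as `n_components`, alpha as `doc_topic_prior`, and beta as `topic_word_prior`. All concrete values below, including the number of topics, are arbitrary choices for the toy data and not suggestions for the exercise.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Random binary toy DTM (made up): 20 documents, 12 vocabulary words
rng = np.random.default_rng(0)
toy_dtm = rng.integers(0, 2, size=(20, 12))

model = LatentDirichletAllocation(
    n_components=3,        # K, arbitrary value for this toy example
    doc_topic_prior=0.1,   # alpha, arbitrary value
    topic_word_prior=0.2,  # beta, arbitrary value
    random_state=42,       # makes repeated runs comparable
)

# fit_transform trains the model and returns the document-topic matrix
doc_topics = model.fit_transform(toy_dtm)

print(model.components_.shape)  # (n_topics, n_words)
print(doc_topics.shape)         # (n_documents, n_topics)
```

Fixing `random_state` is helpful when comparing different alpha and beta values, because otherwise differences between runs may simply stem from the random initialization.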

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda_model ():
    # TODO: Adapt the parameters for the LDA model
    model = LatentDirichletAllocation()

    # TODO: Train the LDA model on your DTM.

    return model

### Exercise 7: Analyze the topic model

The last step in this exercise is to analyze the topic model. The most common method is to simply output the most important words (here, let us use 10 words) for each topic and see whether they form recognizable semantic themes.

> Below, write a function that prints the ten most important words for each topic. TIP: In order to sort the words correctly, you can use the functions `np.argsort` and, afterwards, `np.fliplr`, to reverse the order of the top words. Additionally, the topic-term-matrix is accessible with the property `components_` of the trained model. The shape of this matrix is `(n_topics, n_words)`.
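
The sorting trick from the tip can be tried out on a small hand-made matrix first. The matrix `components` and the `i2w` mapping below are invented and merely stand in for a trained model's `components_` and your own vocabulary.

```python
import numpy as np

# Hand-made stand-ins (values invented): 2 topics, 5 vocabulary words
i2w = {0: "war", 1: "peace", 2: "bank", 3: "tax", 4: "union"}
components = np.array([
    [5.0, 4.0, 0.1, 0.2, 1.0],
    [0.1, 0.3, 6.0, 5.0, 2.0],
])

n_top = 3
# argsort sorts ascending within each row; fliplr reverses every row,
# so the highest-weighted word indices come first
order = np.fliplr(np.argsort(components, axis=1))[:, :n_top]

for topic_id, word_ids in enumerate(order):
    print(topic_id, [i2w[int(i)] for i in word_ids])
```

Note that `np.fliplr` expects at least a two-dimensional array, which is why the sort is applied to the whole matrix with `axis=1` rather than to a single row.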

In [3]:
def print_top_words (model):
    # TODO: Add your code here
    pass

## Running Word2Vec

The most advanced NLP method we will cover today is a Word2Vec model. Such a model encodes co-occurrence patterns of words in so-called word embeddings, vectors of numbers that typically have between 50 and 300 dimensions.

Here, you will write the least code, since we will be using the `gensim` library to run Word2Vec. However, Word2Vec needs to iterate over the corpus several times, whereas a plain generator can only be consumed once. We therefore need a simple class that restarts the generator on every pass. Since engineering is not part of this workshop, I have already provided this class:

In [2]:
class RestartableGenerator:
    def __init__ (self, func):
        self.func = func

    def __iter__ (self):
        return self.func()

# Create a new instance of this class by calling RestartableGenerator(preprocess_speeches) and pass that to Word2Vec
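
To see what the class changes, the following sketch compares a plain generator with the wrapped version. The class is repeated so that the snippet runs on its own. The generator `toy_tokens()` is made up for this example.

```python
class RestartableGenerator:
    def __init__(self, func):
        self.func = func

    def __iter__(self):
        # Calling the generator function again creates a fresh generator
        return self.func()

def toy_tokens():
    """Hypothetical stand-in for the preprocessing generator."""
    yield ["the", "union"]
    yield ["the", "economy"]

# A plain generator is exhausted after the first pass
plain = toy_tokens()
first_pass = list(plain)
second_pass = list(plain)
print(len(first_pass), len(second_pass))

# The wrapper starts over every time it is iterated
restartable = RestartableGenerator(toy_tokens)
pass_one = list(restartable)
pass_two = list(restartable)
print(len(pass_one), len(pass_two))
```

The important detail is that the class receives the generator function itself, not the result of calling it.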

### Exercise 8: Run Word2Vec

Below, write code that imports gensim's Word2Vec model and runs it on our corpus, using the `RestartableGenerator` class so that Word2Vec can work with your preprocessing generator.

Train two models, one with a window size (parameter `window`) of 5 and one with a window size of 30.

In [None]:
# TODO: Add your code here

### Exercise 9: Analyze the word embeddings

As the last exercise for today, we analyze the word embeddings. Word embeddings are well suited to finding out which words are related to each other. With a trained gensim model, you can check this by calling `model.wv.most_similar('word')`.
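
Under the hood, `most_similar` ranks words by the cosine similarity of their vectors. The following sketch reproduces that idea on three hand-made, three-dimensional "embeddings". The vectors and the helper `most_similar_toy()` are invented for illustration and are not part of gensim.

```python
import numpy as np

# Hand-made toy embeddings (values invented)
embeddings = {
    "war": np.array([0.9, 0.1, 0.0]),
    "army": np.array([0.8, 0.2, 0.1]),
    "bank": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_toy(word, topn=2):
    """Rank all other words by their cosine similarity to `word`."""
    scores = [
        (other, cosine(embeddings[word], vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_toy("war"))
```

In this toy setup, "army" ends up far closer to "war" than "bank" does, because its vector points in almost the same direction.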

> Below, print out the most similar words for `america`, `government`, `bank`, and `war` for both models.

In [None]:
# TODO: Add your code here