<a href="https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/T4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guide

Check the guide carefully before starting.

## ILOs

After completing this lab you should be able to 

* implement the skip-gram model in PyTorch
* implement applications of word embeddings
* recognise biases and stereotypes that skip-gram models carry over from the data used to train them.


## General notes

* In this notebook you are expected to use $\LaTeX$. 
* Use Python 3.
* Use PyTorch.
* To have GPU support run this notebook on Google Colab (you will find more instructions later).

We will use a set of standard libraries that are often used in machine learning projects. If you are running this notebook on Google Colab, all libraries should be pre-installed. If you are running this notebook locally, you will need to install some additional packages; ask your TA for help if you have problems setting up.


If you need a short introduction to PyTorch [check this tutorial](https://github.com/probabll/ntmi-tutorials/blob/main/PyTorch.ipynb).


## Table of contents

* Neural Networks
* From GLMs to NNs
* SkipGram in PyTorch
* Bias in embeddings


## Table of graded exercises

Exercises have equal weights.

* [Applications of word embeddings](#applications)
* [Bias in embeddings](#bias)

## How to use this notebook

Check the entire notebook before you get started; this gives you an idea of what lies ahead.

Note that, as always, the notebook recaps theory, and contains solved quizzes. While you should probably make use of this theory recap, be careful not to spend disproportionately more time on this than you should. 


## Setting up

Here we set up the packages that you will need to install for this tutorial.

In [None]:
!pip install tqdm
!pip install seaborn
!pip install torch
!pip install scikit-learn
!pip install gensim
!pip install nltk

In [None]:
## Standard libraries
import os
import math
import numpy as np 
import time

## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline 
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf') # For export
from matplotlib.colors import to_rgba
import seaborn as sns
sns.set()

## Progress bar
from tqdm.auto import tqdm

# Neural Networks


A neural network is a very flexible real-valued function: it maps some input to some output by means of a composition of differentiable parametric transformations.

Being parametric means that these transformations are specified by a set of real-valued parameters, whose values we can adjust/optimise towards a certain goal (e.g., maximum likelihood given a statistical model and a dataset of observations). Being differentiable means that we can use gradient-based search for parameter estimation.

Remember the GLM for text analysis? Given a document $x \in \mathcal X$ and a feature function $\mathbf h: \mathcal X \to \mathbb R^D$, the GLM uses linear models and nonlinear activation functions to parameterise a conditional distribution over the possible values of a response random variable $Y$ taking on values in $\mathcal Y$. Consider, for example, a GLM for a binary response variable:

\begin{align}
Y|X=x &\sim \mathrm{Bernoulli}(g(x; \theta)) \\
s &= \mathbf w^\top \mathbf h(x) + b \\
g(x; \mathbf w, b) &= \mathrm{sigmoid}(s)\\
\theta &= \{\mathbf w, b\}\\
&\quad \mathbf w \in \mathbb R^D, b \in \mathbb R
\end{align}

The output $s$ of the linear transformation is called a *linear predictor* (it maps the feature vector $\mathbf h(x)$ to the dimensionality of the Bernoulli parameter), the $\mathrm{sigmoid}$ function after that is called an *activation function* (it maps the linear predictor to the correct parameter space for the Bernoulli distribution). 
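
To make the two steps concrete, here is a minimal NumPy sketch of the Bernoulli GLM above. The feature vector, weights and bias are hypothetical values chosen only for illustration, and the helper names `sigmoid` and `bernoulli_glm` are ours.

```python
import numpy as np

def sigmoid(s):
    """Map a real-valued linear predictor to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def bernoulli_glm(h, w, b):
    """Return the Bernoulli parameter g(x; w, b) for a feature vector h = h(x)."""
    s = w @ h + b        # linear predictor: a single real number
    return sigmoid(s)    # activation: a valid Bernoulli parameter

# Hypothetical values, for illustration only
h = np.array([1.0, 0.0, 2.0])     # feature vector h(x), with D = 3
w = np.array([0.5, -1.0, 0.25])   # weights
b = -1.0                          # bias

p = bernoulli_glm(h, w, b)
print(p)  # 0.5, because the linear predictor s is exactly 0 here
```

Both steps are differentiable with respect to `w` and `b`, which is what allows gradient-based estimation.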

As it turns out the GLM is a very shallow neural network (NN)! It is made of a composition of two functions (the linear transformation and the activation), which are differentiable with respect to the trainable parameters. In a GLM, the data point, represented by its feature vector $\mathbf h(x)$, and the parameters interact linearly. In a neural network more generally, we would allow that interaction to be non-linear. 

We had mentioned that one of the limitations of GLMs is the need for a pre-specified feature function. Now, with NNs, we are going to *parameterise* the feature function as well!

Before we can do this, you need [an introduction to pytorch](https://github.com/probabll/ntmi-tutorials/blob/main/PyTorch.ipynb).

<details>
<summary>Why another package when we already know some JAX?</summary>

JAX is a good didactic tool: it gives you an understanding of the role of automatic differentiation and introduces you to gradient-based optimisation. In the long run, however, we need a software package that offers more ready-to-go code, so that you can count on certain important functionalities. PyTorch is one of the best options out there: it is highly regarded amongst academics and in industry, and it is also the choice in the UvA's MSc AI (in case you decide to join that programme later on).

---

</details>



**Roadmap**

* After taking the introduction to PyTorch you can work through the implementation of the Skip-Gram model.
* A correct implementation is already provided but we prepared a couple of ungraded exercises for you. Once you are ready to move on to the graded exercises, you can use the solution we provided to make sure that your skip-gram model is implemented correctly.
* In the final part, you will conduct an investigation of skip-gram embeddings using a pre-trained model.


## From GLMs to NNs

**We will now make a transition from GLMs to NNs, so you can see what really changes.**

Let's take a Bernoulli GLM for a special type of binary classification as a running example. 

**Task.** We have to classify a pair of words $(t, c)$, each from the same vocabulary $\mathcal W$ of $V$ words, into $y=1$ if they typically co-occur in text or $y=0$ if they typically do not co-occur in text. For us, $t$ is the "target word", and $c$ is the candidate context word. The response variable $y$ indicates whether $c$ is indeed a typical word within the context of $t$. The notion of context we will use is a small window around $t$ in sentences from a corpus.

**Model.** We want a conditional model:

\begin{align}
Y|T=t, C=c &\sim \mathrm{Bernoulli}(g(t, c; \theta)) 
\end{align}

*In a GLM*, we might have designed a feature function $\mathbf h: \mathcal W \times \mathcal W \to \mathbb R^D$ and predicted the Bernoulli parameter as follows:
\begin{align}
s &=\mathbf w^\top \mathbf h(t, c) + b\\
g(t, c; \theta) &= \mathrm{sigmoid}(s) \\
\theta &= \{\mathbf w, b\}\\
&\quad \mathbf w \in \mathbb R^D, b \in \mathbb R
\end{align}


The main reasons for having a feature function were two:
1. to map a pair $(t, c)$ to the real coordinate space $\mathbb R^D$, so that we can use a parametric function to predict the parameters of our statistical model;
2. to capture in the coordinates of $\mathbf h(t, c)$ patterns or features (or *predictors*, as they are called in statistics) that are useful for the task.

As it turns out, the main reason for *manually* developing our feature functions is the second one. There are plenty of ways to map documents to *a real coordinate space*, but these ways are too naive and not task-driven. But now that we have a general mechanism (i.e., neural networks) to let the data and the parameters of the model interact nonlinearly, we can start with some simple strategy for step 1, and learn complex transformations of it to realise step 2 automatically.


*In a neural network model*, we replace our hand-crafted feature function by a trainable parametric function.

For that, let's start with mapping words from the vocabulary $\mathcal W$ to a relatively simple real coordinate space. If our vocabulary has $V = |\mathcal W|$ words in it, a fixed-dimensional representation of any word that treats words as categorical outcomes is the so-called **one-hot encoding**. The $\mathrm{onehot}_V: \mathcal W \to \mathbb R^V$ encoding function maps elements from a finite set of size $V$ to points in $\mathbb R^V$ and it is such that the mapping is one-to-one. For any word $w \in \mathcal W$ the output of the function is a vector with $V$ coordinates, all of which are set to $0$ except a single coordinate that uniquely identifies $w$:
\begin{align}
\mathbf v &= \mathrm{onehot}_V(w) \\
v_i &\in \{0, 1\}\\
\sum_{i=1}^V v_i &=1
\end{align}

For example, if we have a vocabulary $\{\text{cat}, \text{dog}, \text{rabbit}\}$, a possible one-hot encoding function is:

\begin{align}
\mathrm{onehot}_3(w) &=
\begin{cases}
(1, 0, 0)^\top &\text{if }w = \text{cat}\\
(0, 1, 0)^\top &\text{if }w = \text{dog}\\
(0, 0, 1)^\top &\text{if }w = \text{rabbit}\\
\end{cases}
\end{align}

If we apply this to both the target and the context word, we will end up with two $V$-dimensional vectors. 
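
As a small sketch of this encoding for the toy vocabulary above (the helper `onehot` and the dictionary `word2id` are illustrative names, not part of the lab code):

```python
import numpy as np

words = ["cat", "dog", "rabbit"]                  # toy vocabulary, V = 3
word2id = {w: i for i, w in enumerate(words)}     # each word gets a unique index

def onehot(word, word2id):
    """Return a V-dimensional vector with a single 1 at the index of the word."""
    v = np.zeros(len(word2id))
    v[word2id[word]] = 1.0
    return v

t_vec = onehot("dog", word2id)      # target word
c_vec = onehot("rabbit", word2id)   # candidate context word
print(t_vec, c_vec)
```

Each vector has exactly one non-zero coordinate, so the vectors of two different words never coincide.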

We can now use parametric functions to map these two $V$-dimensional elements to other spaces (for example, the space of logits, and then the space of probability values).

Suppose we have a matrix $\mathbf E \in \mathbb R^{V\times D}$ of parameters, with one row per word in the vocabulary, each row containing $D$ columns. We then matrix-multiply it with the onehot encoding of the target and context words:
\begin{align}
\mathbf u &= \mathbf E^\top \mathrm{onehot}_V(t) \\
\mathbf v &= \mathbf E^\top \mathrm{onehot}_V(c) 
\end{align}
for each word, this will give us a $D$-dimensional vector that we refer to as **word embedding** (because it "embeds" words in a $D$-dimensional space). Look how this operation is essentially turning words into $D$-dimensional feature vectors, but we do not have to hand-design the feature values, instead those are parameters of the model and they are stored in the matrix $\mathbf E$, which we call the **embedding matrix**. Moreover, we get to choose the dimensionality $D$.
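
The following NumPy sketch, with a randomly initialised and purely illustrative matrix `E`, checks that multiplying $\mathbf E^\top$ by a one-hot vector simply selects one row of $\mathbf E$. This is why embedding layers are usually implemented as a table lookup rather than as an actual matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 3, 4
E = rng.normal(size=(V, D))    # illustrative embedding matrix, one row per word

word_id = 1                    # index of some word in the vocabulary
onehot = np.zeros(V)
onehot[word_id] = 1.0

u_matmul = E.T @ onehot        # [D], the matrix product from the formula
u_lookup = E[word_id]          # [D], plain row indexing

assert np.allclose(u_matmul, u_lookup)
```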

Now we could get a step closer towards predicting a Bernoulli parameter by computing a dot product $\mathbf u^\top \mathbf v \in \mathbb R$, and we can finally get a Bernoulli parameter by constraining this dot product with the sigmoid activation function. 
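
Here is a minimal sketch of this last step, assuming a small hand-picked embedding matrix (the values and the function name `cooccurrence_prob` are hypothetical):

```python
import numpy as np

# Hypothetical embedding matrix with V = 3 words and D = 2 dimensions
E = np.array([[ 1.0, 0.0],
              [ 0.5, 0.5],
              [-1.0, 2.0]])

def cooccurrence_prob(t_id, c_id, E):
    """Return sigmoid(u^T v), where u and v are the embeddings of the two words."""
    u = E[t_id]                        # [D] embedding of the target word
    v = E[c_id]                        # [D] embedding of the context word
    s = u @ v                          # dot product: a real-valued logit
    return 1.0 / (1.0 + np.exp(-s))    # sigmoid: a valid Bernoulli parameter

p = cooccurrence_prob(0, 1, E)
print(p)
```

Because both words are embedded with the same matrix, the prediction is symmetric: swapping the target and the context word gives the same probability.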

All in all, the following is a perfectly valid model for our task:
\begin{align}
Y|T=t, C=c &\sim \mathrm{Bernoulli}(g(t, c; \theta)) \\
\mathbf u &= \mathbf E^\top \mathrm{onehot}_V(t) \\
\mathbf v &= \mathbf E^\top \mathrm{onehot}_V(c) \\
s &= \mathbf u^\top \mathbf v \\
g(t, c; \theta) &= \mathrm{sigmoid}(s) \\
\theta &= \{\mathbf E\}\\
&\quad \mathbf E \in \mathbb R^{V\times D}
\end{align}

What we just described above is in fact the skip-gram model of word representation. Next, we will implement it in pytorch.

# SkipGram in PyTorch

Now we will apply our pytorch knowledge to develop a model of word representation, the skip-gram model.

Given two words $(t, c)$ in a vocabulary $\mathcal W$ of $V$ words, the skip-gram model predicts the probability with which the two words co-occur within a window of fixed size, thus specifying a Bernoulli distribution:

\begin{align}
    Y \mid T=t, C=c &\sim \mathrm{Bernoulli}(g(t, c; \theta)) \\    
    \mathbf u &= \mathrm{embed}_D(t; \theta_{\text{emb}}) \\
    \mathbf v &= \mathrm{embed}_D(c; \theta_{\text{emb}}) \\
    s &= \mathbf u^\top \mathbf v \\
    g(t, c; \theta) &= \sigma(s) \\
    \theta &= \theta_{\text{emb}} = \{\mathbf E\} \\
    &\quad\text{ with } \mathbf E \in \mathbb R^{V \times D}
\end{align}

where we use the notation $\mathrm{embed}_D(t; \theta_{\text{emb}})$ as a shortcut for $\mathbf E^\top \mathrm{onehot}_V(t)$. Note that we use the *same embedding layer* twice, once to encode the target word, once to encode the candidate context word. From the description, we can tell it is the same embedding layer that gets used twice because both times the parameters were the same (i.e., $\theta_{\text{emb}}$).

The skip-gram model stores in its embedding matrix $D$ feature values for each known word, but these features are parameters themselves, rather than hand-crafted. This means we will be able to learn them via maximum likelihood estimation. The skip-gram model creates an artificial task (namely, predicting whether words belong to the same context window) in order to learn a representation of words that is useful for detecting distributional similarity. For that reason, the skip-gram model can be seen as a way to operationalise the **distributional hypothesis** from linguistics. If you are not familiar with the distributional hypothesis, check [Chapter 6 of the textbook](https://web.stanford.edu/~jurafsky/slp3/6.pdf).


## Corpus

Let's start by preprocessing a corpus of English text.

In [None]:
import nltk

In [None]:
nltk.download("treebank")
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

In [None]:
from nltk.corpus import treebank

In [None]:
treebank.sents()[0]

In [None]:
import re
from collections import Counter, OrderedDict
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm.auto import tqdm

def preprocess(docs, stopwords=frozenset(stopwords.words('english'))):
    lemmatizer = WordNetLemmatizer()
    new_docs = []
    for doc in tqdm(docs):
        new_doc = []
        for w in doc:
            w = w.lower()
            if w in stopwords:
                continue            
            w = re.sub(r'[^\w\s]', '', w)
            if not w:
                continue
            w = lemmatizer.lemmatize(w)            
            new_doc.append(w)
        new_docs.append(new_doc)
    return new_docs

In [None]:
corpus = preprocess(treebank.sents())

In [None]:
corpus[0][:10]

In [None]:
len(corpus), sum(len(doc) for doc in corpus)

## Vocabulary

Then let's create a data structure to manage our word-to-integer correspondences, our vocabulary.

In [None]:
import numpy as np
from itertools import chain
from collections import Counter, OrderedDict

class Vocab:

    def __init__(self, corpus: list, min_freq=1):        
        """
        corpus: list of documents, each document is a list of tokens, each token is a string
        min_freq: words that occur fewer than min_freq times are discarded
        """
        # Make the vocabulary of known words

        # Count word occurrences
        counter = Counter(chain(*corpus))
        # sort them by frequency
        sorted_by_freq_tuples = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
        
        # Special tokens
        self.pad_token = "-PAD-"
        self.unk_token = "-UNK-"
        self.pad_id = 0
        self.unk_id = 1
        self.known_words = [self.pad_token, self.unk_token]
        self.counts = [0, 0]
        
        # Vocabulary
        self.word2id = OrderedDict()                
        self.word2id[self.pad_token] = self.pad_id
        self.word2id[self.unk_token] = self.unk_id
        self.min_freq = min_freq
        for w, n in sorted_by_freq_tuples: 
            if n >= min_freq: # discard infrequent words
                self.word2id[w] = len(self.known_words)
                self.known_words.append(w)
                self.counts.append(n)
        
        # store the counts for later
        self.counts = np.array(self.counts)

    def __len__(self):
        return len(self.known_words)

    def __getitem__(self, word): # return the id (int) of a word (str)
        return self.word2id.get(word, self.unk_id)

    def encode(self, doc: list, pad_left=0, pad_right=0):
        """
        Transform a document into a numpy array of integer token identifiers.
        doc: list of tokens, each token is a string
        pad_left: number of prefix padding tokens 
        pad_right: number of suffix padding tokens
        
        Return: numpy array with shape [pad_left + len(doc) + pad_right]
        """
        return np.array([self.word2id.get(w, self.unk_id) for w in chain([self.pad_token] * pad_left, doc, [self.pad_token] * pad_right)])

    def batch_encode(self, docs: list, pad_left=0, pad_right=0):
        """
        Transform a batch of documents into a numpy array of integer token identifiers.
        This will pad the shorter documents to the length of the longest document.
        docs: a list of documents
        pad_left: number of prefix padding tokens
        pad_right: number of suffix padding tokens

        Return: numpy array with shape [len(docs), longest_doc + pad_left + pad_right]
        """
        max_len = max(len(doc) for doc in docs)
        return np.stack([self.encode(doc, pad_left=pad_left, pad_right=pad_right + max_len-len(doc)) for doc in docs])

    def decode(self, ids, strip_pad=False):
        """
        Transform a np.array document into a list of tokens.
        ids: np.array with shape [num_tokens] 
        strip_pad: whether PAD tokens should be deleted from the output

        Return: list of strings with size [num_tokens - num_padding]
        """
        if strip_pad:
            return [self.known_words[id] for id in ids if id != self.pad_id]
        else:
            return [self.known_words[id] for id in ids]

    def batch_decode(self, docs, strip_pad=False):
        """
        Transform a np.array collection of documents into a collection of lists of tokens.
        docs: np.array with shape [num_docs, max_length] 
        strip_pad: whether PAD tokens should be deleted from the output

        Return: list of documents, each a list of tokens, each token a string
        """
        return [self.decode(doc, strip_pad=strip_pad) for doc in docs]

    def sample(self, size, p=None, alpha=1., rng=None):
        """
        Sample words from the vocabulary. Word w is sampled with probability proportional to power(count(w), alpha).

        size: shape of sample (it can be a number e.g., size=10, or a shape size=(5, 10))
        Return: numpy array of samples.
        """
        if rng is None:
            rng = np.random
        if p is None:
            p = np.power(self.counts, alpha)
            p = p / p.sum()
        return rng.choice(len(self), p=p, size=size)    

In [None]:
vocab = Vocab(corpus, min_freq=1)
len(vocab), vocab.known_words[:10]

In [None]:
vocab.decode(vocab.encode("this is really AWESOME".split(), pad_left=2, pad_right=3), strip_pad=True)

In [None]:
vocab.batch_decode(vocab.batch_encode(["this is a real day".split(), "a nice day".split()], pad_left=2, pad_right=3), strip_pad=True)

In [None]:
vocab.batch_decode(vocab.sample((3, 5)))

## Dataset

Now we can create a PyTorch dataset, a collection of data points the model learns from.

A data point in the skip-gram model is a triple $(t, c, y)$ where $t$ is a target word, $c$ is a context word and $y$ indicates whether $c$ occurred in a window centered around $t$. We will create this data set by mining positive examples of co-occurrence from the corpus and artificially creating negative examples with so-called "negative sampling" (or sampling context words from the vocabulary, rather than from the context window).
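
Before looking at the full implementation, here is a toy sketch of the two ingredients: mining positive pairs from a window, and building a smoothed distribution to draw negative context words from. The sentence, the counts and the helper names are made up for illustration; the exponent `alpha` plays the same role as in the function defined below.

```python
import numpy as np

def positive_pairs(doc, radius):
    """Yield (target, context) pairs for words at most `radius` positions apart."""
    for i, t in enumerate(doc):
        lo = max(0, i - radius)
        hi = min(len(doc), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                yield t, doc[j]

def smoothed_distribution(counts, alpha):
    """Return probabilities proportional to counts ** alpha."""
    p = np.power(np.asarray(counts, dtype=float), alpha)
    return p / p.sum()

doc = ["stock", "price", "fell", "sharply"]     # made-up sentence
pairs = list(positive_pairs(doc, radius=1))
print(pairs)

counts = [100, 10, 1]                           # made-up word counts
for alpha in (0.0, 0.75, 1.0):
    print(alpha, smoothed_distribution(counts, alpha).round(3))
```

With `alpha=0` every word is equally likely to be drawn as a negative example, while with `alpha=1` the most frequent word dominates. Values in between temper the influence of very frequent words.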

In [None]:
def make_triples(corpus, vocab, radius=2, num_neg=2, alpha=0.75, rng=None):
    """
    corpus: a list of documents, each document is a list of tokens, each token is a string
    vocab: a Vocabulary object initialised with the corpus
    radius: how many words are in the left or right context of the target word
    num_neg: number of words to be sampled as negative examples for each word found within the context window
    alpha: exponent applied to the word counts before normalising them into the negative-sampling distribution:
     closer to 0 makes negative samples more uniform, 1 reproduces the corpus frequencies,
     larger than 1 concentrates even more on the most frequent words
    rng: random number generator (np.random.RandomState)

    Return a generator for triples (t,c,y) where t is a target word, c is either a word within context (then y is 1), or a word sampled
     at random from the vocabulary (then y is 0)
    """
    if rng is None:
        rng = np.random
    N = len(corpus)    
    p = np.power(vocab.counts, alpha)
    p /= p.sum()
    
    for d in tqdm(rng.permutation(np.arange(N))):
        doc = corpus[d]
        positions = rng.permutation(np.arange(len(doc)))
        neg_samples = rng.choice(len(vocab.known_words), p=p, size=len(doc) * 2 * radius * num_neg)
        n = 0
        for i in positions:
            if doc[i] not in vocab.word2id:
                continue
            num_pos = 0
            # positive samples from the left
            for j in range(max(0, i - radius), i):
                yield doc[i], doc[j], 1
                num_pos += 1
            # positive samples from the right
            for j in range(i + 1, min(len(doc), i + 1 + radius)):
                yield doc[i], doc[j], 1
                num_pos += 1
            # negative samples
            for j in range(num_pos * num_neg):
                yield doc[i], vocab.known_words[neg_samples[n + j]], 0
            n += num_pos * num_neg

Let's visualise this

In [None]:
for n, (t, c, y) in zip(range(15), make_triples(corpus, vocab, radius=4, num_neg=1, rng=np.random.RandomState(42))):
    print(f"n={n} t={t} c={c} y={y}")

Now we can create a Dataset:

In [None]:
import torch 


class SkipGramDataset(torch.utils.data.Dataset):
    
    def __init__(self, vocab, triple_generator):
        """
        vocab: the Vocab object used to convert words (str) to ids (int)
        triple_generator: an iterable of triples (t, c, y), for example the output of make_triples
        """
        super().__init__()
        self.vocab = vocab
        self.triples = list(triple_generator)        
        
    def __len__(self):
        # Number of data points (triples) we have
        return len(self.triples)
    
    def __getitem__(self, idx):
        # Return the idx-th data point of the dataset
        t, c, y = self.triples[idx]
        # our dataset class already converts strings to integers using the vocabulary
        return self.vocab[t], self.vocab[c], y

In [None]:
skipgram_data = SkipGramDataset(vocab, make_triples(corpus, vocab, radius=4, num_neg=1, rng=np.random.RandomState(42)))

This is how many triples we mined from the training corpus:

In [None]:
len(skipgram_data)

In [None]:
for n, (t, c, y) in zip(range(15), skipgram_data):
    print(f"n={n} t={t} c={c} y={y}")

## Data Loader

Finally, we use a PyTorch data loader to obtain batches of data points.

In [None]:
from torch.utils.data import DataLoader

skipgram_data_loader = DataLoader(skipgram_data, batch_size=100, shuffle=True)

Let's have a look at 2 batches of 3 data points each, both with and without shuffling the dataset:

In [None]:
for n, batch in zip(range(2), DataLoader(skipgram_data, batch_size=3, shuffle=False)):
    print(f"n={n} batch={batch}")

for n, batch in zip(range(2), DataLoader(skipgram_data, batch_size=3, shuffle=True)):
    print(f"n={n} batch={batch}")    

## Module

We will now develop a pytorch module for the skip-gram model.

To better match the theory, the forward pass of our module will return a batch of Bernoulli distributions, one per target-context pair in the input.

**Ungraded exercise.** Study the SkipGram module and complete its design. It is only missing the mapping from $(t, c)$ to a logit, and the mapping from the Bernoulli pmf and $y$ to a loss. The solution is presented a few cells below.

In [None]:
import torch
import torch.nn as nn
import torch.distributions as td

class SkipGram(nn.Module):

    def __init__(self, vocab: Vocab, embed_dim: int, pad_id=0):
        """
        vocab: the Vocab object for our corpus and model
        embed_dim: the dimensionality we want for our embedding space
        pad_id: the id of the PAD token (0)
        """
        super().__init__()
        # we use the torch auxiliary module nn.Embedding
        # which is a very efficient implementation of an embedding layer
        self.vocab = vocab
        self.embed = nn.Embedding(
            num_embeddings=len(vocab), 
            embedding_dim=embed_dim, 
            padding_idx=pad_id
        )        # this is the embed_D layer of the theory

    def predict_logit(self, target, context):
        """
        target: a batch of word ids (one per target word in the batch)
        context: a batch of word ids (one per candidate context word in the batch)

        Return a batch of logits (one per target-context pair in the batch)
        """
        # This is a quiz with solution, you will find the solution in the next cell, 
        # but you can try to implement it yourself first
        raise NotImplementedError("Implement me!")

    def forward(self, target, context):
        """
        target: a batch of word ids (one per target word in the batch)
        context: a batch of word ids (one per candidate context word in the batch)

        Return a batch of Bernoulli distributions (one per target-context pair in the batch)
        """
        # [batch_size]
        logits = self.predict_logit(target, context)
        return td.Bernoulli(logits=logits)

    def loss(self, t, c, y):        
        """
        t: a batch of word ids (one per target word in the batch)
        c: a batch of word ids (one per candidate context word in the batch)
        y: a batch of binary labels (one per target-context pair in the batch)

        Return a single scalar value that is the negative of the log-likelihood of the 
        current model given the observations in the batch.
        """
        # one Bernoulli distribution per (t, c) pair in the batch
        py_xc = self(target=t, context=c)

        # Now, knowing that the observed binary outcome is y, you should compute the loss.
        #  This is a quiz with solution, you will find the solution in the next cell, 
        #  but you can try to implement it yourself first
        raise NotImplementedError("Implement me!")

    def np_embedding_matrix(self):
        """
        Converts the embedding matrix to a numpy array
        """
        return self.embed.weight.detach().cpu().numpy()        

This is how you construct the skipgram model:

In [None]:
device = torch.device("cpu")
skipgram = SkipGram(vocab, 32).to(device) # by default it's on CPU, you can change it to cuda:0 if you like
skipgram

And here you should be able to visualise the Bernoulli distributions predicted for a few data points. 

Note that we have not yet trained the model.

In [None]:
for n, (batch_t, batch_c, batch_y) in zip(range(2), DataLoader(skipgram_data, batch_size=3, shuffle=False)):
    pY_given_t_and_c = skipgram(batch_t.to(device), batch_c.to(device))
    bern_params = pY_given_t_and_c.probs
    prob_observations = torch.exp(pY_given_t_and_c.log_prob(batch_y.to(device).float()))
    for i, (t, c, y, bern_param, prob_y) in enumerate(zip(vocab.decode(batch_t), vocab.decode(batch_c), batch_y, bern_params, prob_observations)):
        print(f"batch={n} instance={i} g(t={t}, c={c})={bern_param:.2} hence Bernoulli({y}|{bern_param:.2})={prob_y:.2}")
    print("Loss for this batch:", skipgram.loss(batch_t.to(device), batch_c.to(device), batch_y.to(device)))
    print()

<details>

<summary> <b> Solution: python code for the SkipGram model.</b> </summary>

```python

import torch
import torch.nn as nn
import torch.distributions as td

class SkipGram(nn.Module):

    def __init__(self, vocab, embed_dim, pad_id=0):
        """
        vocab: the Vocab object for our corpus and model
        embed_dim: the dimensionality we want for our embedding space
        pad_id: the id of the PAD token (0)
        """
        super().__init__()
        # we use the torch auxiliary module nn.Embedding
        # which is a very efficient implementation of an embedding layer
        self.vocab = vocab
        self.embed = nn.Embedding(
            num_embeddings=len(vocab), 
            embedding_dim=embed_dim, 
            padding_idx=pad_id
        )        # this is the embed_D layer of the theory

    def predict_logit(self, target, context):
        """
        target: a batch of word ids (one per target word in the batch)
        context: a batch of word ids (one per candidate context word in the batch)

        Return a batch of logits (one per target-context pair in the batch)
        """
        # [batch_size, embed_dim]
        w = self.embed(target)
        # [batch_size, embed_dim]
        c = self.embed(context)        
        # Dot product
        # [batch_size]
        logits = torch.sum(w * c, dim=-1)
        return logits

    def forward(self, target, context):
        """
        target: a batch of word ids (one per target word in the batch)
        context: a batch of word ids (one per candidate context word in the batch)

        Return a batch of Bernoulli distributions (one per target-context pair in the batch)
        """
        # [batch_size]
        logits = self.predict_logit(target, context)
        return td.Bernoulli(logits=logits)

    def loss(self, t, c, y):        
        """
        t: a batch of word ids (one per target word in the batch)
        c: a batch of word ids (one per candidate context word in the batch)
        y: a batch of binary labels (one per target-context pair in the batch)

        Return a single scalar value that is the negative of the log-likelihood of the 
        current model given the observations in the batch.
        """
        # one Bernoulli distribution per (t, c) pair in the batch
        py_xc = self(target=t, context=c)
        return - py_xc.log_prob(y.float()).mean()

    def np_embedding_matrix(self):
        """
        Converts the embedding matrix to a numpy array
        """
        return self.embed.weight.detach().cpu().numpy()        
```        

---

</details>
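
To see what the loss computes without any PyTorch machinery, here is a NumPy sketch of the mean negative log-likelihood of binary labels under Bernoulli distributions parameterised by logits. The function name `bernoulli_nll` and the example values are ours. The sketch relies on the identities $-\log \sigma(s) = \log(1 + e^{-s})$ and $-\log(1 - \sigma(s)) = \log(1 + e^{s})$.

```python
import numpy as np

def bernoulli_nll(logits, y):
    """Mean negative log-likelihood of labels y under Bernoulli(sigmoid(logits))."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way
    nll_pos = np.logaddexp(0.0, -logits)   # -log sigmoid(s), used where y = 1
    nll_neg = np.logaddexp(0.0, logits)    # -log (1 - sigmoid(s)), used where y = 0
    return np.mean(y * nll_pos + (1.0 - y) * nll_neg)

logits = np.array([2.0, -1.0, 0.0])   # made-up logits for three (t, c) pairs
y = np.array([1, 0, 1])               # made-up labels
nll = bernoulli_nll(logits, y)
print(nll)
```

A logit of 0 corresponds to probability 0.5, so an untrained model whose logits are all close to 0 has a loss close to $\log 2 \approx 0.693$.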

## Optimisation

For optimisation, we will use something a bit more sophisticated than the standard SGD algorithm: the Adam optimiser. Adam is a variant of SGD that typically converges faster and more reliably because it adapts the step size of each parameter using running averages of the gradient and of its square.

We still need to pick an initial learning rate (`lr`). Whereas in JAX we implemented L2 regularisation by hand, in pytorch the optimiser class can take care of that for us (which is very convenient!), all we need to do is to set some positive weight to the argument `weight_decay` which is the importance of the L2 regulariser (0 means no regularisation).
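
For intuition, here is a NumPy sketch of a single Adam update as the algorithm is usually described, with the L2 regulariser entering as an extra `weight_decay * w` term in the gradient. The function `adam_step`, the default values of `beta1`, `beta2` and `eps`, and the example numbers are assumptions made for illustration; in the lab we rely on `torch.optim.Adam` rather than on this code.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=5e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """Perform one Adam update and return the new parameters and moment estimates."""
    grad = grad + weight_decay * w               # gradient contribution of the L2 regulariser
    m = beta1 * m + (1 - beta1) * grad           # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t counts steps from 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v

w = np.array([1.0, -2.0])        # made-up parameters
grad = np.array([0.1, -10.0])    # made-up gradient with very different magnitudes
w_new, m, v = adam_step(w, grad, np.zeros(2), np.zeros(2), t=1)
print(w_new)
```

Although the two gradient coordinates differ by a factor of 100, both parameters move by roughly `lr` in the first step: the update is normalised per parameter, which is the main difference from plain SGD.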

In [None]:
from torch.optim import Adam

sg_optimizer = Adam(skipgram.parameters(), lr=5e-3, weight_decay=1e-6)

On CPU this will take 5 to 10 minutes:

In [None]:
from collections import defaultdict

# reset your choice of device: use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# reset model parameters
skipgram = SkipGram(vocab, 32).to(device)
# reset optimiser
sg_optimizer = Adam(skipgram.parameters(), lr=5e-3, weight_decay=1e-6)
# reset data loader 
skipgram_data_loader = DataLoader(
    skipgram_data, 
    batch_size=100, # adjust the batch size, on GPU you can have bigger batches (eg 1000)
    shuffle=True
)
# adjust the number of epochs, the more, the better
num_epochs = 5 # with larger batches you will probably need more epochs
total_steps = num_epochs * len(skipgram_data_loader)
step = 0
log = defaultdict(list)

with tqdm(range(total_steps), desc='MLE') as bar:  
    # we pass over the entire data a number of times
    for epoch in range(num_epochs):            
        skipgram.train()
        for t, c, y in skipgram_data_loader:
            t = t.to(device)
            c = c.to(device)
            y = y.to(device)

            sg_optimizer.zero_grad()
                                    
            # Negative log likelihood of model given batch of observations            
            loss = skipgram.loss(t=t, c=c, y=y)
            loss.backward()
            sg_optimizer.step()

            log['loss'].append(loss.item())       
            bar.set_postfix({'epoch': epoch + 1, 'step': f"{step:4d}", 'loss': f"{loss.item():.4f}"}) 
            bar.update()
            step += 1

Let's plot the training loss recorded at each step:

In [None]:
_ = plt.plot(np.arange(len(log['loss'])), log['loss'], '.')
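
The per-step loss is computed on a single mini-batch, so the curve can look noisy. One optional way to read the trend more easily is to smooth it with a moving average. The sketch below is only an illustration: `moving_average` and its `window` argument are hypothetical names that are not part of this notebook, and the input is a toy list rather than a real training log.

```python
import numpy as np

def moving_average(values, window=100):
    """Return the running mean of `values` over a sliding window of size `window`."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the window fits entirely inside the data
    return np.convolve(values, kernel, mode="valid")

# toy input, just to show the shape of the result
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
```

With the training log from the loop above, something like `plt.plot(moving_average(log['loss'], 100))` would draw the smoothed curve.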

**Helper functions** Below we provide you with a few helper functions that you can use to inspect the embedding of a word, to compare two words in terms of the cosine similarity of their embeddings, and to find the top-k elements of a numpy array. Study them and play with them for a bit; next you will work on an exercise for which they will be useful.



First, let's get the embedding matrix from the trained model in numpy format:

In [None]:
E = skipgram.np_embedding_matrix()
E.shape

We can find the embedding for a word by indexing the matrix using the word id in the vocabulary:

In [None]:
E[vocab['person']]

Here is a helper function to make this easier:

In [None]:
def embed(word, vocab, E):
    """
    word: a word (str)
    vocab: the Vocabulary object for our corpus and model
    E: our model's embedding matrix as a numpy array of shape [V, D]

    Return the D-dimensional embedding of the word as np.array object.
    """
    if word not in vocab.word2id:
        raise ValueError(f"{word} is OOV")
    wid = vocab.word2id[word]
    return E[wid]

# np.alltrue was removed in NumPy 2.0; np.all is the equivalent
assert np.all(embed('person', vocab, E) == E[vocab['person']])

Finally, we can compare words using cosine similarity of their embeddings. Here is how we do that:

In [None]:
def cos_similarity(word1, word2, vocab, E):  
    """
    word1: a word (str)
    word2: another word (str)
    vocab:  the Vocabulary object for our corpus and model
    E: our model's embedding matrix as a numpy array of shape [V, D]

    Return the cosine similarity (a real number) of the two words in the embedding space of our model.
    """  
    # [D]    
    u = embed(word1, vocab, E)
    # [D]
    v = embed(word2, vocab, E)
    return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))

In [None]:
cos_similarity('car', 'truck', vocab, E), cos_similarity('car', 'automobile', vocab, E)
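
To build some intuition for the values that cosine similarity returns, here is a small sketch on hand-made vectors. The helper `cosine` and the toy vectors are made up for this illustration. It computes the same quantity as `cos_similarity` above, but takes vectors directly instead of words.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])

# rescaling a vector does not change its direction, so the similarity is 1
same_direction = cosine(u, 3.0 * u)
# vectors with no shared direction have similarity 0
orthogonal = cosine(u, np.array([0.0, 0.0, 5.0]))
# vectors pointing in opposite directions have similarity -1
opposite = cosine(u, -u)
```

Cosine similarity depends only on the angle between two embeddings, not on their lengths, and always lies between -1 and 1.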

This is a function to help you whenever you need to find the top-k values in a numpy array. It will be useful in an exercise later on.

In [None]:
def np_topk(array, k=10):
    """
    array: a list or a numpy np.array
    k: number of top elements to be returned

    Return the top-k elements and their indices in the array.
    """
    array = np.array(array)
    ids = np.argsort(-array)  # argsort finds the lowest values, so we use -array to find the highest values
    # return top-k values, return the indices of the top-k values
    return array[ids][:k], ids[:k]

# np.alltrue was removed in NumPy 2.0; np.all is the equivalent
assert np.all(np_topk([10, 20, 30, 40], 2)[0] == np.array([40, 30]))
assert np.all(np_topk([10, 20, 30, 40], 2)[1] == np.array([3, 2]))
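
The indices returned alongside the values are positions in the input array, so they can be used to look up whatever each position stands for. The sketch below repeats the same `argsort` trick on toy data; `labels` and `scores` are hypothetical and only serve as an illustration.

```python
import numpy as np

labels = ["red", "green", "blue", "yellow"]  # hypothetical labels, one per position
scores = np.array([0.1, 0.7, 0.3, 0.9])      # hypothetical scores, aligned with labels

order = np.argsort(-scores)                  # positions sorted from highest to lowest score
top_ids = order[:2]                          # positions of the 2 highest scores
top_labels = [labels[i] for i in top_ids]    # map positions back to labels
top_scores = scores[top_ids]
```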

<a name="applications"></a> **Graded Exercise - Applications of word embeddings**

In this exercise you will develop a few nice applications of word embeddings. 

1. `topk_words`: given a *word* you will find the words that are closest to it in embedding space using cosine similarity.

Below you will find a list of words to test your function with. For each word in the list, display the 10 words that are nearest in embedding space. Make sure to display the information in a way that's convenient for grading (e.g., using a table from `tabulate` or something similar).

2. `doesnt_match`: given a *list of words* you will find the odd word in the list. This "odd" word is the one that is, on average, the least cosine-similar to the other words in the list.

Again, there is a list of test cases for you. Make sure to display the information in a convenient format for grading.

3. `word_analogy`: given two lists of words, make a representation $\mathbf v$ where the words in the *positive* list contribute positively to $\mathbf v$ and the words in the *negative* list contribute negatively to $\mathbf v$, and then return the 10 words that are cosine-closest to $\mathbf v$ in embedding space. 

Again, there is a list of test cases for you. Make sure to print the information in a readable way for grading.
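
In symbols, the quantities described in items 2 and 3 can be sketched as follows. The notation is introduced here for convenience and is an assumption rather than part of the exercise: $\mathbf e(w)$ is the embedding of word $w$, $\cos(\cdot, \cdot)$ is cosine similarity, $w_1, \ldots, w_n$ is the input list of `doesnt_match`, and $P$ and $N$ are the *positive* and *negative* lists of `word_analogy`.

$$
\begin{aligned}
\text{odd word} &= \arg\min_{i} \; \frac{1}{n-1} \sum_{j \neq i} \cos\big(\mathbf e(w_i), \mathbf e(w_j)\big) \\
\mathbf v &= \sum_{w \in P} \mathbf e(w) - \sum_{w \in N} \mathbf e(w)
\end{aligned}
$$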


**Guidelines** The grade will depend mostly on the correctness of your implementation but also on whether you displayed the requested information in a human-readable way (so that the grader does not necessarily have to interact with your code). The grade *will not* depend on whether your model captures meaningful similarities in its embedding space. Unfortunately, the dataset you are using is too small for you to train a good model, and training a very good model would also require training it for much longer.


In [None]:
def topk_words(word, vocab, E, k=10):    
    """
    word: a word (str)
    vocab:  the Vocabulary object for our corpus and model
    E: our model's embedding matrix as a numpy array of shape [V, D]
    k: how many nearest neighbours we want to find

    Return a python list with the k words (each a string) that are nearest to the input word
     in embedding space according to cosine similarity. You can use any of the functions provided earlier, and you
     can also create additional ones if you need them.
    """
    raise NotImplementedError("Implement me!")

In [None]:
for word in ['car', 'person', 'woman', 'man']:
    print("Make sure to test topk_words using:", word)

In [None]:
def doesnt_match(words, vocab, E):
    """
    words: a list of words (each a str)
    vocab:  the Vocabulary object for our corpus and model
    E: our model's embedding matrix as a numpy array of shape [V, D]
    
    Return the word in the list that is least cosine-similar to every other word in the list on average.
    """
    raise NotImplementedError("Implement me!")

In [None]:
for word_list in [['car', 'automobile', 'wall'], ['car', 'bridge', 'wall']]:
    print("Make sure to test doesnt_match using:", word_list)

In [None]:
def word_analogy(positive: list, negative: list, vocab: Vocab, E, k=10):
    r"""
    positive: a list of words (each a str) that contribute positively to the similarity
    negative: a list of words (each a str) that contribute negatively to the similarity
    vocab:  the Vocabulary object for our corpus and model
    E: our model's embedding matrix as a numpy array of shape [V, D]
    k: number of nearest neighbours
    
    Return the top-k words in terms of cosine similarity with the embedding you obtain by
     summing the embedding of the words in the positive list 
     and subtracting the embedding of the words in the negative list. 
    That is, you will be retrieving the neighbours of the vector:
        \sum_{w in positive} emb(w) - \sum_{w in negative} emb(w)

    The return value is a list of neighbouring words (each a str).
    """
    raise NotImplementedError("Implement me!")

In [None]:
for pos_list, neg_list in [(['woman', 'president'], ['man']), (['street', 'person'], ['bridge'])]:
    print(f"Make sure to make analogies for: positive={pos_list} negative={neg_list}")

# Bias in embeddings 


In this section you will experiment with a strong pretrained embedding model called GloVe, which is very similar to skipgram. We are not using skipgram because the available pretrained models are much too large for this notebook, but GloVe is a **very strong** competitor. 

Instead of training GloVe, which would be too demanding, we will download a trained model and interact with it using `gensim`, a very robust Python package for word embeddings. We will experiment with the same applications that you coded above, but this time you will use gensim code. This way, if you made mistakes earlier, they won't affect the quality of this experiment.

The goal of this exercise is for you to visualise biases that embedding models carry over from their training data. A statistical objective (like MLE) is *all about statistics* and not at all about *core human values*. When we download text from the web, it may contain all sorts of stereotypes that are inadequate in many situations. For example, if we download text with gender bias and train our models on it, the harmful statistics present in the data will most likely also be present in our models. There's no statistical incentive in their training objective to get rid of correlations that we think are inadequate, outdated, or simply harmful. For now, we will not work on fixing the models; we will just investigate them and see that the problem is real. If you are curious to see ways to address the problem, you can check this [excellent paper](https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf), though note that checking it is optional at this point.

First of all, make sure to install [gensim](https://radimrehurek.com/gensim/) by running the cell below:

In [None]:
!pip install gensim

Next, we use gensim's downloader to obtain a trained model:

In [None]:
import gensim.downloader as api

This shouldn't take too long:

In [None]:
word_vectors = api.load("glove-wiki-gigaword-50")

We can use the model in many ways. For example, we can embed a word:

In [None]:
word_vectors.get_vector('person')

We can retrieve similar words:

In [None]:
word_vectors.similar_by_word('person', 10)

We can find the word that does not match the others in a list:

In [None]:
word_vectors.doesnt_match(['car', 'wall', 'automobile'])

We can make word analogies:

In [None]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

And more (you can see some examples [here](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format) if you like).

This is a list of occupations that we got from one of the [research papers that initiated this whole investigation](https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf). These occupations are *not* gender-marked words in English, they are gender-neutral. Yet, the statistics of English data used to train GloVe are biased towards making harmful stereotypical associations with words like `woman` and `man`.

In [None]:
occupations = [
    "homemaker",
    "nurse",
    "receptionist",
    "librarian",
    "socialite",
    "hairdresser",
    "nanny",
    "bookkeeper",
    "stylist",
    "housekeeper",
    "maestro",
    "skipper",
    "protege",
    "philosopher",
    "captain",
    "architect",
    "financier",
    "warrior",
    "broadcaster",
    "magician"
]

<a name="bias"></a> **Graded Exercise - Bias in embeddings**

Using gensim functionality (code and model):

1. plot the similarity of each occupation word in the list to both `woman` and `man`
2. also, plot the difference between each occupation word's similarity to `woman` and its similarity to `man`, and order the occupation words by this difference. 

Make remarks about what you see in (1) and (2).

3. Use algorithms such as `most_similar` (for word analogies), `doesnt_match` and `similar_by_word` to discover additional harmful associations in embedding space. If you want, you can investigate a different type of bias. Be **very** careful and **very conscious** here, as you will likely encounter terrible associations. The goal is not to ridicule the victims of these associations. The goal is that you grow worried about careless use of NLP, and that you join responsible researchers in a) making careful use of NLP, and b) developing NLP that pushes back against and overcomes sources of harm.  

**Guideline** We will grade parts 1 and 2 in terms of the quality of your plots and the remarks you make. We will not grade part 3 as a function of how many biases you uncovered, nor as a function of whether we agree with them or not. Instead, we will use the information you display and the remarks you make as a way to assess the effort you put into it.
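
For reference, the quantity to plot in part 2 can be written as below. The notation is an assumption introduced here: $\mathbf e(w)$ is the GloVe embedding of word $w$ and $o$ is an occupation word. The sign convention (similarity to `woman` minus similarity to `man`) is also just one possible choice, so state which one you use in your plot.

$$
\Delta(o) = \cos\big(\mathbf e(o), \mathbf e(\text{woman})\big) - \cos\big(\mathbf e(o), \mathbf e(\text{man})\big)
$$

Under this convention, a positive $\Delta(o)$ means the occupation word is closer to `woman` than to `man` in embedding space, and a negative value means the opposite.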
