# Distributional Semantics

Distributional semantics models the "meaning" of words relative to other words that typically share the same context.

**Tips:**

* Read all the code. We don't ask you to write the training loops, evaluation loops, and generation loops, but it is often instructive to see how the models are trained and evaluated.

In [None]:
# start time - notebook execution
import time
start_nb = time.time()

# Set up

In [None]:
!pip install datasets

In [None]:
import gensim.downloader
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from torchtext.data import get_tokenizer

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

# Initialize the Autograder

In [None]:
import hw4_tests as ag

# GLOVE

We will first work with a pre-trained set of word embeddings called [GLOVE](https://nlp.stanford.edu/projects/glove/). We will download it and set up a few basic global variables.

In [None]:
GLOVE_MODEL = gensim.downloader.load('glove-wiki-gigaword-100')
GLOVE_VOCAB_SIZE = len(GLOVE_MODEL.key_to_index)
GLOVE_EMBEDDING_SIZE = 100

# Analogies

You must complete the code to compute analogies based on GLOVE embeddings.

An analogy is of the form ``a:b :: c:d``.

For example:

``
america : hamburger :: canada : ?
``

In this case we want to know what the `?` will be.

To compute an analogy, first convert `a`, `b`, and `c` into vectors using GLOVE: ``glove[word]``.
This will give you three vectors $\overrightarrow{a}$, $\overrightarrow{b}$, and $\overrightarrow{c}$. Next compute $\overrightarrow{d}=(\overrightarrow{b}-\overrightarrow{a})+\overrightarrow{c}$.

Unfortunately, $\overrightarrow{d}$ might not correspond to any one word. Instead, find the `k` vectors that are most similar to $\overrightarrow{d}$, and return the words that correspond to those vectors.
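The arithmetic above can be illustrated on a tiny example. This is only a sketch: the two-dimensional vectors in `toy` are made up for illustration and are not real GLOVE values, and `cosine` is a hypothetical helper that assumes cosine similarity is the similarity measure.

```python
import numpy as np

# Toy 2-D embeddings, invented for illustration only (not real GLOVE vectors).
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# man : woman :: king : ?
a, b, c = toy["man"], toy["woman"], toy["king"]
d = (b - a) + c

# Rank every word in the toy vocabulary by its similarity to d.
ranked = sorted(toy, key=lambda w: cosine(toy[w], d), reverse=True)
print(ranked[:3])
```

With the real embeddings, the same ranking has to be computed over the full vocabulary, so a vectorized matrix product is far faster than a Python-level sort over individual similarities.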


In [None]:
# analogy is a:b :: c:d
# america:hamburger :: canada:?
# DO NOT USE most_similar()
def glove_analogy(glove, a, b, c, k):
  d_list = None
  ### BEGIN SOLUTION
  ### END SOLUTION
  return d_list

In [None]:
d = glove_analogy(GLOVE_MODEL, 'driver', 'car', 'pilot', k=10)
print(d)

<!-- **TODO:** grading. we can look to see if specific words are returned within the top k return results. Create a test list and a set of potential answers. If all (or any) are in the returned list then success. Depending on how variable the results can be. -->
Test: Check if the `glove_analogy` function works properly

In [None]:
# student check - Test A (5 points)
ag.test_glove_analogy(GLOVE_MODEL, glove_analogy_fn=glove_analogy)

# Retrieval

In this part of the assignment, we will use word vectors to perform document retrieval. Given a query term, retrieve the `k` most related documents.

To do this, we will need to embed all the documents in a dataset into a document vector that can be compared to the query term vector.

## Download dataset

The WikiText-2 dataset is a collection of high-quality documents from Wikipedia. We will load them into pandas data frames.

In [None]:
wiki_data_train = load_dataset("wikitext", 'wikitext-2-v1', split="train").shuffle()
wiki_data_test = load_dataset("wikitext", 'wikitext-2-v1', split="test").shuffle()
WIKI_TRAIN = pd.DataFrame(wiki_data_train)
WIKI_TEST = pd.DataFrame(wiki_data_test)
WIKI_ALL = pd.concat([WIKI_TRAIN, WIKI_TEST])

## Tokenizer

This is a default tokenizer that comes with the `torchtext` package.

In [None]:
TOKENIZER = get_tokenizer("basic_english")

**Optional:** If you wish to change or modify the tokenization of a string, you can add your own code to the following function.

We will use `my_tokenizer` for tokenization tasks from this point forward. It will work even if you do not modify it.

In [None]:
def my_tokenizer(string):
  tokens = TOKENIZER(string)
  ### BEGIN SOLUTION
  ### END SOLUTION
  return tokens

In [None]:
RETRIEVAL_MAX_LENGTH = 200

## Embed Dataset

Complete the code below. The `embed_dataset()` function converts a pandas data frame into a numpy matrix of size `len(dataframe) x embedding_size`.

Your code must iterate through all documents in `dataframe['text']`, tokenize each document, convert each token into a GLOVE vector, and take the average of the token embeddings in a document as the embedding representation of that document.

The numpy matrix is set up for you, so you must splice your vectors into the appropriate places in the matrix.

**Hint:** create a numpy array for a document and use multi-dimensional numpy array slicing to insert it into the appropriate position in the matrix.
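As a rough illustration of the averaging and slicing steps, here is a sketch for a single document. The three-dimensional vectors in `toy_vectors` are invented, and skipping out-of-vocabulary tokens is an assumption (the assignment does not say how to handle them).

```python
import numpy as np

# Invented 3-D vectors standing in for GLOVE lookups.
toy_vectors = {
    "red":    np.array([1.0, 0.0, 2.0]),
    "planet": np.array([3.0, 4.0, 0.0]),
}
tokens = ["red", "planet", "xyzzy"]  # "xyzzy" is out of vocabulary

# Assumption: tokens without an embedding are skipped.
known = [toy_vectors[t] for t in tokens if t in toy_vectors]
doc_vector = np.mean(known, axis=0) if known else np.zeros(3)

# Splice the document vector into row 0 of a (num_docs x embed_size) matrix.
matrix = np.zeros((2, 3))
matrix[0, :] = doc_vector
print(matrix)
```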

In [None]:
def embed_dataset(dataframe, glove, tokenizer_fn=my_tokenizer, embed_size=GLOVE_EMBEDDING_SIZE, max_length=RETRIEVAL_MAX_LENGTH):
  embedded_data = np.zeros((len(dataframe), max_length, embed_size))
  ### BEGIN SOLUTION
  ### END SOLUTION
  return embedded_data

<!-- Unit test. Hard code some words in a small custom dataframe and hard-code the glove embeddings, just need to do a simple accuracy check. -->
Test: Check if the `embed_dataset` function works properly

In [None]:
# student check - Test B (10 points)
ag.unit_test_embed_dataset(GLOVE_MODEL, embed_dataset_fn=embed_dataset)

In [None]:
embedded_data = embed_dataset(WIKI_TRAIN, GLOVE_MODEL)
print(embedded_data.shape)

Complete the code below. `retrieve_top_k` takes a word and finds the top `k` documents in `embedded_data`, a matrix of size `num_docs x max_doc_length x embed_size`. Return the *indexes* of the top `k` most similar documents to the input word.

**Hint:** you should not need to write a loop. You should be able to do everything through numpy matrix manipulation.
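The loop-free ranking idea can be sketched as follows. This assumes each document has already been reduced to a single vector (so `doc_vectors` is two-dimensional) and that cosine similarity is the ranking score. The numbers are made up for illustration.

```python
import numpy as np

# Four toy document vectors and one toy query vector (illustrative values).
doc_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.1])

# Cosine similarity of every document against the query in one shot.
# Note: an all-zero document vector would cause a division by zero here.
norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
scores = (doc_vectors @ query) / norms

# Indexes of the k highest-scoring documents, best first.
k = 2
top_k = np.argsort(-scores)[:k]
print(top_k)
```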

In [None]:
def retrieve_top_k(word, glove, embedded_data, k=10):
  top_k_docs = []
  ### BEGIN SOLUTION
  ### END SOLUTION
  return top_k_docs

In [None]:
word = 'mars'
# Retrieve indexes of top k most similar documents to the above word
top_k = retrieve_top_k(word, GLOVE_MODEL, embedded_data, k=10)
print("indexes:", top_k)
# Get the dataframe for the top k
WIKI_TRAIN.iloc[top_k]['text']

In [None]:
# student check - Test C (5 points)
ag.unit_test_retrieve_top_k(GLOVE_MODEL, embed_dataset_fn=embed_dataset, retrieve_top_k_fn=retrieve_top_k, k=10)

# Word2Vec

In this section, you will re-implement and train Word2Vec from scratch. There are two versions of Word2Vec. The first uses a continuous bag of words (CBOW) representation and the second uses skip grams.

## Create Vocabulary

The following is a standard class that stores a vocabulary. The vocabulary object can:
* Tell you all the words: `get_words()`
* Tell you how many words there are: `num_words()`
* Map a word to an index: `word2index()`
* Map an index to a word: `index2word()`

Additionally, it has two helper functions used during set up:
* `add_word()` adds a word to the vocabulary.
* `add_sentence()` adds all the previously unknown words in a sentence to the vocabulary (simply splitting the sentence by blank spaces).

In [None]:
# RUN THIS CELL BUT DO NOT EDIT IT
UNK_token = 0   # Unknown '<unk>'
UNK_symbol = '<unk>'

class Vocab:
  def __init__(self, name=''):
    self.name = name
    self._word2index = {UNK_symbol: UNK_token}
    self._word2count = {UNK_symbol: 0}
    self._index2word = {UNK_token: UNK_symbol}
    self._n_words = 1

  def get_words(self):
    return list(self._word2count.keys())

  def num_words(self):
    return self._n_words

  def word2index(self, word):
    if word in self._word2index:
      return self._word2index[word]
    else:
      return self._word2index[UNK_symbol]

  def index2word(self, word):
    return self._index2word[word]

  def word2count(self, word):
    return self._word2count[word]

  def add_sentence(self, sentence):
    for word in sentence.split(' '):
      self.add_word(word)

  def add_word(self, word):
    if word not in self._word2index:
      self._word2index[word] = self._n_words
      self._word2count[word] = 1
      self._index2word[self._n_words] = word
      self._n_words += 1
    else:
      self._word2count[word] += 1

## CBOW

The continuous bag of words model

### Data preparation

In [None]:
# Hyperparameters; feel free to change them
CBOW_EMBED_DIMENSIONS = 100
CBOW_WINDOW = 4
CBOW_MAX_LENGTH = 50
CBOW_BATCH_SIZE = 1024
CBOW_NUM_EPOCHS = 2
CBOW_LEARNING_RATE = 5e-4

Before training the CBOW model, we must prepare the data for training. The CBOW model learns to predict a word based on the words to the left and the words to the right.

This function takes a pandas data frame and converts it into a regular Python list consisting of `(x, y)` pairs where:
* `y` is the index of a word in the corpus.
* `x` is a list of indexes of words to the left of `y` and to the right of `y`.

For example, consider the sentence "The quick brown fox jumped over the lazy dog". For a window of size two, we would create the following data:
1. `x=[the, quick, fox, jumped]`, `y=brown`
2. `x=[quick, brown, jumped, over]`, `y=fox`
3. `x=[brown, fox, over, the]`, `y=jumped`
4. `x=[fox, jumped, the, lazy]`, `y=over`
5. `x=[jumped, over, lazy, dog]`, `y=the`

(Except instead of words, there would be the indices for each word in the vocabulary)

This is done for every document in the data frame.

`prep_cbow_data()` (below) will also simultaneously create the Vocab object.

Thus `prep_cbow_data()` should return two values:
* the `[(x1, y1) ... (xn, yn)]` data
* the Vocab object. The vocab object is initialized for you but not populated.

Complete the `prep_cbow_data()` function. It takes a data frame, a tokenizer (`my_tokenizer()`), a window size that applies to either side of each word, and a max document length. The function should return two values as described above.
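The windowing in the example sentence can be reproduced with a short sketch. `cbow_pairs` is a hypothetical helper that works on words rather than vocabulary indices and ignores the vocabulary and max-length handling that `prep_cbow_data()` also needs.

```python
def cbow_pairs(tokens, window):
    # Build (context, center) pairs for every position that has a full
    # window of tokens on both sides.
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
pairs = cbow_pairs(sentence, window=2)
for x, y in pairs:
    print(x, y)
```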

In [None]:
def prep_cbow_data(data_frame, tokenizer_fn, window=2, max_length=50):
  data_out = []
  vocab = Vocab()
  ### BEGIN SOLUTION
  ### END SOLUTION
  return data_out, vocab

In [None]:
CBOW_DATA, CBOW_VOCAB = prep_cbow_data(WIKI_TRAIN, tokenizer_fn=my_tokenizer, window=CBOW_WINDOW, max_length=CBOW_MAX_LENGTH)
print("len dataframe=", len(WIKI_TRAIN), "len data=", len(CBOW_DATA))

 <!-- Unit test: Do something along the lines of figuring out how many words are in lines with greater than window*2+1 words. What I have below isn't quite matching what my solution above is producing. I'm not sure if my solution above has a bug or if my computation below is incorrect, or if it is just an approximation and we should allow some variance. -->
Test: Check the size of the dataset and vocabulary

In [None]:
# student check - Test D (10 points)
ag.check_data_size_d(WIKI_TRAIN, CBOW_WINDOW, CBOW_DATA, CBOW_VOCAB, max_length=CBOW_MAX_LENGTH, tokenizer_fn=my_tokenizer)

### Get Batch

Complete the following function. `get_batch()` will return a batch of data of the given size, starting at the given index.

The function should return two values:
1. A batch of `x` components of the data as a tensor of size `batch_size x window*2`.
2. A batch of `y` components of the data as a tensor of length `batch_size`.

Both tensors should be moved to the GPU, if available, before being returned (Note: Gradescope will not have a GPU available).

**Hint:** You should not need to write a loop. You can achieve what you need using numpy slicing.
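A minimal sketch of the slicing step, written in NumPy only so that it runs without a GPU or PyTorch. The toy `data` list is invented, and treating `index` as the position of the first example in the batch is an assumption based on the description above. In this sketch each row of `x` holds one example's context indices and `y` holds one target per example.

```python
import numpy as np

# Toy (x, y) pairs with a window of 2, so each x has 4 context indices.
data = [
    ([1, 2, 4, 5], 3),
    ([2, 3, 5, 6], 4),
    ([3, 4, 6, 7], 5),
]
index, batch_size = 1, 2

# Slice out the batch, then separate the x and y components.
batch = data[index:index + batch_size]
xs, ys = zip(*batch)
x = np.array(xs)   # one row per example
y = np.array(ys)   # one target per example
print(x.shape, y.shape)
```

The arrays can then be converted with `torch.tensor(...)` and moved with `.to(DEVICE)`.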

In [None]:
def get_batch(data, index, batch_size=10):
  ### BEGIN SOLUTION
  ### END SOLUTION
  return x, y

<!-- Unit test: make up some synthetic data, check if you get the right stuff out for a given idx and batch size. -->
Test: Check if `get_batch` works properly

In [None]:
# student check - Test E (10 points)
ag.unit_test_get_batch(CBOW_DATA, CBOW_WINDOW, 10, get_batch)

### The CBOW Model

Complete the CBOW model specification.

The CBOW model should contain:
* An embedding layer `nn.Embedding`
* A linear layer that transforms the embedding to the vocabulary

The forward function will take the `x` component of the data (a list of `window*2` indices) and produce a log softmax distribution over the vocabulary.
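To make the tensor shapes concrete, here is a NumPy-only sketch of one possible forward computation. It is not the required PyTorch implementation. Averaging the context embeddings is an assumption (summing is another common choice), and `E`, `W`, and `b` are hypothetical stand-ins for the embedding and linear layer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size, batch_size, window = 10, 4, 3, 2

E = rng.normal(size=(vocab_size, embed_size))   # stand-in for the embedding table
W = rng.normal(size=(embed_size, vocab_size))   # stand-in for the linear weights
b = np.zeros(vocab_size)                        # stand-in for the linear bias

# A batch of context indices: one row of window*2 indices per example.
x = rng.integers(0, vocab_size, size=(batch_size, 2 * window))

embedded = E[x]                  # (batch_size, window*2, embed_size)
pooled = embedded.mean(axis=1)   # (batch_size, embed_size), assumption: average
logits = pooled @ W + b          # (batch_size, vocab_size)

# Numerically stable log softmax over the vocabulary dimension.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
print(log_probs.shape)
```

Log probabilities are the input that `nn.NLLLoss` expects, which is why the model ends in a log softmax rather than a plain softmax.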

In [None]:
class CBOW(nn.Module):
  def __init__(self, vocab_size, embed_size):
    super(CBOW, self).__init__()
    ### BEGIN SOLUTION
    ### END SOLUTION

  def forward(self, x):
    probs = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    return probs

Create the model.

In [None]:
import traceback
cbow_model = CBOW(CBOW_VOCAB.num_words(), CBOW_EMBED_DIMENSIONS)
cbow_model.to(DEVICE)
CBOW_CRITERION = nn.NLLLoss()
try:
  CBOW_OPTIMIZER = torch.optim.AdamW(cbow_model.parameters(), lr=CBOW_LEARNING_RATE)
except:
  print(traceback.format_exc())

Test: Check the structure of CBOW model

In [None]:
# student check - Test F (10 points)
ag.test_cbow_structure(cbow_model)

### Train the CBOW Model

Training loop

In [None]:
def train_cbow(model, data, num_epochs, batch_size, criterion, optimizer):
  for epoch in range(num_epochs):
    losses = []
    for i in range(len(data)//batch_size):
      x, y = get_batch(data, i, batch_size)
      y_hat = model(x)
      loss = criterion(y_hat, y)
      optimizer.zero_grad()
      loss.backward()
      losses.append(loss.item())
      optimizer.step()
      if i % 100 == 0:
        print('iter', i, 'loss', np.array(losses).mean())
    print('epoch', epoch, 'loss', np.array(losses).mean())

Train the model.

In [None]:
try:
  train_cbow(cbow_model, CBOW_DATA, num_epochs=CBOW_NUM_EPOCHS, batch_size=CBOW_BATCH_SIZE, criterion=CBOW_CRITERION, optimizer=CBOW_OPTIMIZER)
except:
    print(traceback.format_exc())

Test: Now that we have trained the CBOW model, we will be testing it on the `WIKI_TEST` dataset. Your CBOW model will need to achieve an accuracy of at least 30% to pass the test.

In [None]:
def prep_test_data(data_frame, vocab, tokenizer_fn, window=2, max_length=50):
  data_out = []
  for row in data_frame['text']:
    tokens = tokenizer_fn(row)
    token_ids = [vocab.word2index(w) for w in tokens]
    if len(token_ids) >= (window*2)+1:
      token_ids = token_ids[0:min(len(token_ids), max_length)]
      for i in range(window, len(token_ids)-window):
        x = token_ids[i-window:i] + token_ids[i+1:i+window+1]
        y = token_ids[i]
        data_out.append((x, y))
  return data_out

TEST_DATA = prep_test_data(WIKI_TEST, CBOW_VOCAB, tokenizer_fn=my_tokenizer, window=CBOW_WINDOW, max_length=CBOW_MAX_LENGTH)

In [None]:
# student check - Test G (20 points)
ag.test_cbow_performance(cbow_model, TEST_DATA, 512, get_batch_fn=get_batch)

## Skip Grams

The Skip Gram model.

In [None]:
# Hyperparameters; feel free to change
SKIP_EMBED_DIMENSIONS = 100
SKIP_WINDOW = 4
SKIP_MAX_LENGTH = 50
SKIP_BATCH_SIZE = 1024
SKIP_NUM_EPOCHS = 2
SKIP_LEARNING_RATE = 5e-4

Before training the Skip Gram model, we must prepare the data for training. The Skip Gram model learns to predict words to the left and right of a given word.

This function takes a pandas data frame and converts it into a regular Python list consisting of `(x, y)` pairs where:
* `x` is the index of a word in the corpus.
* `y` is a list of indexes of words to the left of `x` or to the right of `x`.
(Note the organization of the data is the opposite of the CBOW model)

For example, consider the sentence "The quick brown fox jumped over the lazy dog". For a window of size two, we would create the following data:
1. `x=brown`, `y=[the, quick, fox, jumped]`
2. `x=fox`, `y=[quick, brown, jumped, over]`
3. `x=jumped`, `y=[brown, fox, over, the]`
4. `x=over`, `y=[fox, jumped, the, lazy]`
5. `x=the`, `y=[jumped, over, lazy, dog]`

(Except instead of words, there would be the indices for each word in the vocabulary)

This is done for every document in the data frame.

`prep_skip_gram_data()` (below) will also simultaneously create the Vocab object.

Thus `prep_skip_gram_data()` should return two values:
* the `[(x1, y1) ... (xn, yn)]` data, where each `y` is a list of word indices
* the Vocab object. The vocab object is initialized for you but not populated.
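Because the skip-gram pairs are the CBOW pairs with the roles of `x` and `y` exchanged, the relationship can be shown in a couple of lines. This sketch uses words instead of indices and hard-codes two CBOW-style pairs from the example sentence. The variable names are illustrative only.

```python
# Two CBOW-style (context, center) pairs from the example sentence.
cbow_pairs = [
    (["the", "quick", "fox", "jumped"], "brown"),
    (["quick", "brown", "jumped", "over"], "fox"),
]

# Skip-gram swaps the roles: the center word is x, the context list is y.
skip_pairs = [(center, context) for context, center in cbow_pairs]
print(skip_pairs[0])
```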

In [None]:
def prep_skip_gram_data(data_frame, tokenizer_fn, window=2, max_length=50):
  data_out = []
  vocab = Vocab()
  ### BEGIN SOLUTION
  ### END SOLUTION
  return data_out, vocab

In [None]:
SKIP_DATA, SKIP_VOCAB = prep_skip_gram_data(WIKI_TRAIN, my_tokenizer, window=SKIP_WINDOW, max_length=SKIP_MAX_LENGTH)

In [None]:
try:
  SKIP_DATA[0]
except:
  print(traceback.format_exc())

Unit test: compute the number of data points that should be in SKIP_DATA and check the vocab size

In [None]:
# student check - Test H (5 points)
ag.check_data_size_h(WIKI_TRAIN, SKIP_WINDOW, SKIP_DATA, SKIP_VOCAB, max_length=SKIP_MAX_LENGTH, tokenizer_fn=my_tokenizer)

### The Skip Gram Model

Complete the Skip Gram model specification.

The Skip Gram model should contain:
* An embedding layer `nn.Embedding`
* A linear layer that transforms the embedding to the vocabulary

The forward function will take the `x` component of the data (a single token index) and produce a log softmax distribution over the vocabulary.

In [None]:
class SkipGram(nn.Module):
  def __init__(self, vocab_size, embed_size):
    super(SkipGram, self).__init__()
    ### BEGIN SOLUTION
    ### END SOLUTION

  def forward(self, x):
    probs = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    return probs

Unit test: check the layers and layer ordering

In [None]:
# initialize the model
skip_model = SkipGram(SKIP_VOCAB.num_words(), SKIP_EMBED_DIMENSIONS)

In [None]:
# student check - Test I (5 points)
ag.test_skipgram_structure(skip_model)

### Train the Skip Gram Model

In [None]:
try:
  SKIP_CRITERION = nn.NLLLoss()
  SKIP_OPTIMIZER = torch.optim.AdamW(skip_model.parameters(), lr=SKIP_LEARNING_RATE)
except:
    print(traceback.format_exc())

In [None]:
def train_skipgram(model, data, num_epochs, batch_size, criterion, optimizer):
  for epoch in range(num_epochs):
    losses = []
    for i in range(len(data)//batch_size):
      x, y = get_batch(data, i, batch_size)
      y_hat = model(x)
      loss = None
      # Calculate loss for every word in the context
      for word in y.T:
        if loss is None:
          loss = criterion(y_hat, word)
        else:
          loss += criterion(y_hat, word)
      optimizer.zero_grad()
      loss.backward()
      losses.append(loss.item() / y.shape[1])
      optimizer.step()
      if i % 100 == 0:
        print('iter', i, 'loss', np.array(losses).mean())
    print('epoch', epoch, 'loss', np.array(losses).mean())

In [None]:
try:
  train_skipgram(skip_model, SKIP_DATA, num_epochs=SKIP_NUM_EPOCHS, batch_size=SKIP_BATCH_SIZE, criterion=SKIP_CRITERION, optimizer=SKIP_OPTIMIZER)
except:
    print(traceback.format_exc())

Now that we have trained the Skip Gram model, we will use the `WIKI_TEST` dataset again for evaluation. Your Skip Gram model must achieve an accuracy of at least 30% to pass the test.

In [None]:
def prep_skip_gram_test_data(data_frame, vocab, tokenizer_fn, window=2, max_length=50):
  data_out = []
  for row in data_frame['text']:
    tokens = tokenizer_fn(row)
    token_ids = [vocab.word2index(w) for w in tokens]
    if len(token_ids) >= (window*2)+1:
      token_ids = token_ids[0:min(len(token_ids), max_length)]
      for i in range(window, len(token_ids)-window):
        x = token_ids[i]
        y = token_ids[i-window:i]
        y.extend(token_ids[i+1:i+1+window])
        data_out.append((x, y))
  return data_out

TEST_DATA = prep_skip_gram_test_data(WIKI_TEST, SKIP_VOCAB, tokenizer_fn=my_tokenizer, window=SKIP_WINDOW, max_length=SKIP_MAX_LENGTH)

In [None]:
# student check - Test J (20 points)
ag.test_skip_performance(skip_model, TEST_DATA, 512, get_batch_fn=get_batch)

# Grading
Please submit this .ipynb file to Gradescope for grading.

## Final Grade

In [None]:
# student check
ag.final_grade()

# Notebook Runtime

In [None]:
# end time - notebook execution
end_nb = time.time()
# print notebook execution time in minutes
print("Notebook execution time in minutes =", (end_nb - start_nb)/60)
# warn student if notebook execution time is greater than 30 minutes
if (end_nb - start_nb)/60 > 30:
  print("WARNING: Notebook execution time is greater than 30 minutes. Your submission may not complete auto-grading on Gradescope. Please optimize your code to reduce the notebook execution time.")