# Distributional Semantics

Distributional semantics models the "meaning" of words relative to other words that typically share the same context.

**Tips:**

* Read all the code. We don't ask you to write the training loops, evaluation loops, and generation loops, but it is often instructive to see how the models are trained and evaluated.

In [1]:
# start time - notebook execution
import time
start_nb = time.time()

# Set up

In [2]:
!pip install datasets



In [3]:
import gensim.downloader
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from torchtext.data import get_tokenizer

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

  "class": algorithms.Blowfish,


cpu


# Initialize the Autograder

In [4]:
import hw4_tests as ag

# GLOVE

We will first work with a pre-specified set of word embeddings, called [GLOVE](https://nlp.stanford.edu/projects/glove/). We will download it and set up a few basic global variables

In [5]:
GLOVE_MODEL = gensim.downloader.load('glove-wiki-gigaword-100')
GLOVE_VOCAB_SIZE = len(GLOVE_MODEL.key_to_index)
GLOVE_EMBEDDING_SIZE = 100

# Analogies

You must complete the code to compute analogies based on GLOVE embeddings.

An analogy is of the form ``a:b :: c:d``.

For example:

``
america : hamburger :: canada : ?
``

In this case we want to know what the `?` will be.

To compute an analogy, first convert `a`, `b`, and `c` into vectors using GLOVE: ``glove[word]``.
This will give you three vectors $\overrightarrow{a}$, $\overrightarrow{b}$, and $\overrightarrow{c}$. Next compute $\overrightarrow{d}=(\overrightarrow{b}-\overrightarrow{a})+\overrightarrow{c}$.

Unfortunately, $\overrightarrow{d}$ might not correspond to any one word. Instead, find the `k` vectors that are most similar to $\overrightarrow{d}$, and return the words that correspond to those vectors.


In [6]:
# analogy is a:b :: c:d
# america:canada :: hamburger:?
# DO NOT USE most_similar()
def glove_analogy(glove, a, b, c, k):
    if a not in glove or b not in glove or c not in glove:
        return None
    
    # Compute the vector for 'd' using vector arithmetic
    d_vec = glove[b] - glove[a] + glove[c]
    
    # Use the most_similar method to find the top k similar words to the vector 'd_vec'
    # This method automatically excludes the words used in the analogy
    d_list = glove.similar_by_vector(d_vec, topn=k, restrict_vocab=None)
    
    # Extract the word part of the tuples returned by most_similar
    d_list = [item[0] for item in d_list]
    return d_list

In [7]:
d = glove_analogy(GLOVE_MODEL, 'driver', 'car', 'pilot', k=10)
print(d)

['airplane', 'pilot', 'jet', 'aircraft', 'plane', 'helicopter', 'air', 'planes', 'flight', 'flying']


<!-- **TODO:** grading. we can look to see if specific words are returned within the top k return results. Create a test list and a set of potential answers. If all (or any) are in the returned list then success. Depending on how variable the results can be. -->
Test: Check if the glove_analogy function works properly

In [8]:
# student check - Test A (5 points)
ag.test_glove_analogy(GLOVE_MODEL, glove_analogy_fn=glove_analogy)

Test passed!
Test A: 5/5


# Retrieval

In this part of the assignment, we will use word vectors to perform document retrieval. Given a query term, retrieve the `k` most related documents.

To do this, we will need to embed all the documents in a dataset into a document vector that can be compared to the query term vector.

## Download dataset

The wikitext 2 dataset is a collection of high-quality documents from Wikipedia. We will load them into Panda data frames.

In [9]:
wiki_data_train = load_dataset("wikitext", 'wikitext-2-v1', split="train").shuffle()
wiki_data_test = load_dataset("wikitext", 'wikitext-2-v1', split="test").shuffle()
WIKI_TRAIN = pd.DataFrame(wiki_data_train)
WIKI_TEST = pd.DataFrame(wiki_data_test)
WIKI_ALL = pd.concat([WIKI_TRAIN, WIKI_TEST])

## Tokenizer

This is a default tokenizer that comes with  the `torchtext` package.

In [10]:
TOKENIZER = get_tokenizer("basic_english")

**Optional:** If you wish to change or modify the tokenization of a string, you can add your own code to the following function.

We will use `my_tokenizer` for tokenization tasks from this point forward. It will work even if you do not modify it.

In [11]:
def my_tokenizer(string):
    tokens = TOKENIZER(string)
    ### BEGIN SOLUTION
    
    ### END SOLUTION
    return tokens

In [12]:
RETRIEVAL_MAX_LENGTH = 200

## Embed Dataset

Complete the code below. The `embed_dataset()` function converts a Panda data frame into a numpy matrix of size `len(dataframe) x embedding_size`.

Your code must iterate through all documents in `dataframe[text]`, tokenize each document, convert each token into a GLOVE vector, and take the average of embeddings in the same document as the embedding representation of the document.

The numpy matrix is set up for you, so you must splice your vectors into the appropriate places in the matrix.

**Hint:** create a numpy array for a document and use multi-dimensional numpy array slicing to insert it into the appropriate position in the matrix.

In [13]:
def embed_dataset(dataframe, glove, tokenizer_fn=my_tokenizer, embed_size=GLOVE_EMBEDDING_SIZE, max_length=RETRIEVAL_MAX_LENGTH):
    embedded_data = np.zeros((len(dataframe), max_length, embed_size))
    ### BEGIN SOLUTION
    ### END SOLUTION
    
    for idx, row in dataframe.iterrows():
        text = row['text']
        tokens = my_tokenizer(text)
        
        for i, token in enumerate(tokens):
            if token in glove and i < max_length:  # 确保不超出最大长度限制
                embedded_data[idx,i,:] = glove[token]
                
    average_vectors = np.mean(embedded_data, axis=1) 
    
    return average_vectors

<!-- Unit test. Hard code some words in a small custom dataframe and hard-code the glove embeddings, just need to do a simple accuracy check. -->
Test: Check if the `embed_dataset` function works properly

In [14]:
# student check - Test B (10 points)
ag.unit_test_embed_dataset(GLOVE_MODEL, embed_dataset_fn=embed_dataset)

Test passed!
Test B: 10/10


In [15]:
embedded_data = embed_dataset(WIKI_TRAIN, GLOVE_MODEL)
print(embedded_data.shape)

(36718, 100)


Complete the code below. `retrieve_top_k` takes a word and finds the top `k` documents in `embedded_data`, a matrix of size `num_docs x max_doc_length x embed_size`. Return the *indexes* of the top `k` most similar documents to the input word.

**Hint:** you should not need to write a loop. You should be able to do everything through numpy matrix manipulation.

In [16]:
def retrieve_top_k(word, glove, embedded_data, k=10):
    top_k_docs = []
    ### BEGIN SOLUTION
    ### END SOLUTION
    word_vec = glove[word]

    dot_products = np.dot(embedded_data, word_vec)
    
    # 获取点积最高的k个文档的索引
    top_k_docs = np.argsort(dot_products)[-k:][::-1]
    
    return top_k_docs

In [17]:
word = 'mars'
# Retrieve indexes of top k most similar documents to the above word
top_k = retrieve_top_k(word, GLOVE_MODEL, embedded_data, k=10)
print("indexes:", top_k)
# Get the dataframe for the top k
WIKI_TRAIN.iloc[top_k]['text']

indexes: [16437 21202 15053 36140 11495 25567 16759  3275 10680 35886]


16437     <unk> from Earth to other planets in the Sola...
21202     On February 8 , 1992 , the Ulysses solar prob...
15053     In 1981 , a proposal for an asteroid mission ...
36140     There was a good deal of interest in the 2004...
11495     The 2006 debate surrounding Pluto and what co...
25567     Another major issue is the amount of radiatio...
16759     Ceres is the largest object in the asteroid b...
3275      Sometimes Venus only <unk> the Sun during a t...
10680     The existence of an atmosphere on Venus was c...
35886     Dawn 's mission profile calls for it to study...
Name: text, dtype: object

In [18]:
# student check - Test C (5 points)
ag.unit_test_retrieve_top_k(GLOVE_MODEL, embed_dataset_fn=embed_dataset, retrieve_top_k_fn=retrieve_top_k, k=10)

Test passed!
Test C: 5/5


# Word2Vec

In this section, you will re-implement and train Word2Vec from scratch. There are two versions of Word2Vec. The first uses a continuous bag of words (CBOW) representation and the second uses skip grams.

## Create Vocabulary

The following is a standard class that stores a vocabulary. The vocabulary object can:
* Tell you all the words: `get_words()`
* Tell you how many words there are: `num_words()`
* Map a word to an index: `word2index()`
* Map an index to a word: `index2word()`

Additionally, it has two helper functions used during set up:
* `add_word()` adds a word to the vocabulary.
* `add_sentence()` adds all the previously unknown words in a sentence to the vocabulary (simply splitting the sentence by blank spaces.

In [19]:
# RUN THIS CELL BUT DO NOT EDIT IT
UNK_token = 0   # Unknown '<unk>'
UNK_symbol = '<unk>'

class Vocab:
  def __init__(self, name=''):
    self.name = name
    self._word2index = {UNK_symbol: UNK_token}
    self._word2count = {UNK_symbol: 0}
    self._index2word = {UNK_token: UNK_symbol}
    self._n_words = 1

  def get_words(self):
    return list(self._word2count.keys())

  def num_words(self):
    return self._n_words

  def word2index(self, word):
    if word in self._word2index:
      return self._word2index[word]
    else:
      return self._word2index[UNK_symbol]

  def index2word(self, word):
    return self._index2word[word]

  def word2count(self, word):
    return self._word2count[word]

  def add_sentence(self, sentence):
    for word in sentence.split(' '):
      self.add_word(word)

  def add_word(self, word):
    if word not in self._word2index:
      self._word2index[word] = self._n_words
      self._word2count[word] = 1
      self._index2word[self._n_words] = word
      self._n_words += 1
    else:
      self._word2count[word] += 1

## CBOW

The continuous bag of words model

### Data preparation

In [20]:
# Hyperparameters; feel free to change them
CBOW_EMBED_DIMENSIONS = 100
CBOW_WINDOW = 4
CBOW_MAX_LENGTH = 50
CBOW_BATCH_SIZE = 1024
CBOW_NUM_EPOCHS = 2
CBOW_LEARNING_RATE = 5e-4

Before training the CBOW model, we must prepare the data for training. The CBOW model learns to predict a word based on the words to the left and the words to the right.

This function takes a Pandas data frame and converts it into a regular python array consisting of `(x, y)` pairs where:
* `y` is the index of a word in the corpus.
* `x` is a list of indexes of words to the left of `y` and to the right of `y`.

For example, consider the sentence "The quick brown fox jumped over the lazy dog". For a window of size two, we would create the following data:
1. `x=[the, quick, fox, jumped]`, `y=brown`
2. `x=[quick, brown, jumped, over]`, `y=fox`
3. `x=[brown, fox, over, the]`, `y=jumped`
4. `x=[fox, jumped, the, lazy]`, `y=over`
5. `x=[jumped, over, lazy, dog]`, `y=the`

(Except instead of words, there would be the indices for each word in the vocabulary)

This is done for every document in the data frame.

`prep_cbow_data()` (below) will also simultaneously create the Vocab object.

Thus `prep_cbow_data()` should return two values:
* the `[(x1, y1) ... (xn, yn)]` data
* the Vocab object. The vocab object is initialized for you but not populated.

Complete the `prep_cbow_data()` function. It takes a data frame and a tokenizer (`my_tokenizer()`) a window to either side of each word, and a max document length. The function should return two values as described above.

In [21]:
def prep_cbow_data(data_frame, tokenizer_fn, window=2, max_length=50):
    data_out = []
    vocab = Vocab()
    ### BEGIN SOLUTION
    for sentence in data_frame['text']:
        # 使用分词器函数对句子进行分词
        tokens = tokenizer_fn(sentence)

        # 更新词汇表
        for token in tokens:
            vocab.add_word(token)
        
        if len(tokens)< ((window * 2) + 1):
            continue
            
        if len(tokens) > max_length:
            tokens = tokens[:max_length]  
            
        # 遍历每个可能的窗口中心词
        for i in range(window, len(tokens) - window):
            # 收集上下文词的索引
            context = [vocab.word2index(tokens[j]) for j in range(i-window, i+window+1) if j != i]

            # 获取中心词的索引
            target = vocab.word2index(tokens[i])

            # 添加到输出数据中
            data_out.append((context, target))
    ### END SOLUTION
    return data_out, vocab

In [22]:
CBOW_DATA, CBOW_VOCAB = prep_cbow_data(WIKI_TRAIN, tokenizer_fn=my_tokenizer, window=CBOW_WINDOW, max_length=CBOW_MAX_LENGTH)
print("len dataframe=", len(WIKI_TRAIN), "len data=", len(CBOW_DATA))

len dataframe= 36718 len data= 625869


 <!-- Unit test: Do something along the lines of figuring out how many words are in lines with greater than window*2+1 words. What I have below isn't quite matching what my solution above is producing. I'm not sure if my solution above has a bug or if my computation below is incorrect, or if it is just an approximation and we should allow some variance. -->
 Test: checking the size of the dataset and vocabulary

In [23]:
# student check - Test D (10 points)
ag.check_data_size_d(WIKI_TRAIN, CBOW_WINDOW, CBOW_DATA, CBOW_VOCAB, max_length=CBOW_MAX_LENGTH, tokenizer_fn=my_tokenizer)

expected data points 625869
actual data points 625869
difference 0

least vocab size 28782
actual vocab size 28782

Test passed!
Test D: 10/10


### Get Batch

Complete the following function. `get_batch()` will return a batch of data of the given size, starting at the given index.

The function should return two values:
1. A batch of `x` components of the data as a tensor of size `window*2 x batch_size`.
2. A batch of `y` components of the data as a tensor array of length `window*2`.

Both tensors should be moved to the GPU, if available, before being returned (Note: Gradescope will not have a GPU available).

**Hint:** You should not need to write a loop. You can achieve what you need using numpy slicing.

In [24]:
def get_batch(data, index, batch_size=10):
  ### BEGIN SOLUTION
    batch_data = data[index:index+batch_size]

    # 分别获取context和target
    context = [item[0] for item in batch_data]
    target = [item[1] for item in batch_data]

    # 转换成numpy数组以便于进一步处理
    x = np.array(context)
    y = np.array(target)

    # 转换成PyTorch张量
    x_tensor = torch.tensor(x, dtype=torch.long)
    y_tensor = torch.tensor(y, dtype=torch.long)

    # 检查是否有GPU可用，并相应地移动张量
    if torch.cuda.is_available():
        x_tensor = x_tensor.cuda()
        y_tensor = y_tensor.cuda()
    ### END SOLUTION
    return x_tensor, y_tensor

<!-- Unit test: make up some synthetic data, check if you get the right stuff out for a given idx and batch size. -->
Test: Check if get back works properly

In [25]:
# student check - Test E (10 points)
ag.unit_test_get_batch(CBOW_DATA, CBOW_WINDOW, 10, get_batch)

Test passed!
Test E: 10/10


### The CBOW Model

Complete the CBOW model specification.

The CBOW model should contain:
* An embedding layer `nn.Embedding`
* A linear layer that transforms the embedding to the vocabulary

The forward function will take the `x` component of the data--a list of `window*2` indices and produce a log softmax distribution over the vocabulary.

In [26]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(CBOW, self).__init__()
        ### BEGIN SOLUTION
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)
    ### END SOLUTION

    def forward(self, x):
        probs = None
        ### BEGIN SOLUTION
        embeds = self.embeddings(x)
        # 因为输入是上下文词的索引，所以我们需要取平均来获得中心词的向量表示
        embeds_mean = torch.mean(embeds, dim=1)
        # 通过线性层
        out = self.linear(embeds_mean)
        # 计算log softmax
        probs = F.log_softmax(out, dim=1)
        ### END SOLUTION
        return probs

Create the model.

In [27]:
import traceback
cbow_model = CBOW(CBOW_VOCAB.num_words(), CBOW_EMBED_DIMENSIONS)
cbow_model.to(DEVICE)
CBOW_CRITERION = nn.NLLLoss()
try:
  CBOW_OPTIMIZER = torch.optim.AdamW(cbow_model.parameters(), lr=CBOW_LEARNING_RATE)
except:
  print(traceback.format_exc())

Test: Check the structure of CBOW model

In [28]:
# student check - Test F (10 points)
ag.test_cbow_structure(cbow_model)

Your model has two layers as expected!
Your layers orders are as expected!
Test F: 10/10


### Train the CBOW Model

Training loop

In [29]:
def train_cbow(model, data, num_epochs, batch_size, criterion, optimizer):
  for epoch in range(num_epochs):
    losses = []
    for i in range(len(data)//batch_size):
      x, y = get_batch(data, i, batch_size)
      y_hat = model(x)
      loss = criterion(y_hat, y)
      optimizer.zero_grad()
      loss.backward()
      losses.append(loss.item())
      optimizer.step()
      if i % 100 == 0:
        print('iter', i, 'loss', np.array(losses).mean())
    print('epoch', epoch, 'loss', np.array(losses).mean())

Train the model.

In [30]:
try:
  train_cbow(cbow_model, CBOW_DATA, num_epochs=CBOW_NUM_EPOCHS, batch_size=CBOW_BATCH_SIZE, criterion=CBOW_CRITERION, optimizer=CBOW_OPTIMIZER)
except:
    print(traceback.format_exc())

iter 0 loss 10.297601699829102
iter 100 loss 9.387812926037476
iter 200 loss 8.172580972832828
iter 300 loss 7.0439315064008845
iter 400 loss 6.228400967662174
iter 500 loss 5.6172416176862585
iter 600 loss 5.139382269537191
epoch 0 loss 5.097317651914106
iter 0 loss 3.2127554416656494
iter 100 loss 2.5136845808218022
iter 200 loss 2.111066895930921
iter 300 loss 1.8445323563097322
iter 400 loss 1.6536653386684428
iter 500 loss 1.51683737513072
iter 600 loss 1.429860233863856
epoch 1 loss 1.4240448332457223


Test: Now that we have trained the CBOW model, we will be testing it on the `WIKI_TEST` dataset. Your CBOW model will need to achieve an accuracy of at least 30% to pass the test.

In [31]:
def prep_test_data(data_frame, vocab, tokenizer_fn, window=2, max_length=50):
  data_out = []
  for row in data_frame['text']:
    tokens = tokenizer_fn(row)
    token_ids = [vocab.word2index(w) for w in tokens]
    if len(token_ids) >= (window*2)+1:
      token_ids = token_ids[0:min(len(token_ids), max_length)]
      for i in range(window, len(token_ids)-window):
        x = token_ids[i-window:i] + token_ids[i+1:i+window+1]
        y = token_ids[i]
        data_out.append((x, y))
  return data_out

TEST_DATA = prep_test_data(WIKI_TEST, CBOW_VOCAB, tokenizer_fn=my_tokenizer, window=CBOW_WINDOW, max_length=CBOW_MAX_LENGTH)

In [32]:
# student check - G (20 points)
ag.test_cbow_performance(cbow_model, TEST_DATA, 512, get_batch_fn=get_batch)

Test failed! Accuracy = 0.2303483486175537/20
Test G: 0/20


In [None]:
# 定义超参数的范围
embedding_dimensions_options = [100, 200]
window_options = [3, 4, 5]
MAX_LENGTH_options = [50, 60, 70, 80]
batch_size_options = [512, 1024]
learning_rate_options = [1e-3, 5e-4]
num_epochs_options = [2, 3, 4]

# 准备记录最佳配置
best_accuracy = 0
best_config = {}

# 超参数搜索
for CBOW_EMBED_DIMENSIONS in embedding_dimensions_options:
    for CBOW_WINDOW in window_options:
        for CBOW_MAX_LENGTH in MAX_LENGTH_options:        
            for CBOW_BATCH_SIZE in batch_size_options:
                for CBOW_LEARNING_RATE in learning_rate_options:
                    for CBOW_NUM_EPOCHS in num_epochs_options:
                        cbow_model = CBOW(CBOW_VOCAB.num_words(), CBOW_EMBED_DIMENSIONS)
                        cbow_model.to(DEVICE)
                        CBOW_CRITERION = nn.NLLLoss()

                        CBOW_OPTIMIZER = torch.optim.AdamW(cbow_model.parameters(), lr=CBOW_LEARNING_RATE)



                        train_cbow(cbow_model, CBOW_DATA, num_epochs=CBOW_NUM_EPOCHS, batch_size=CBOW_BATCH_SIZE, criterion=CBOW_CRITERION, optimizer=CBOW_OPTIMIZER)

                        TEST_DATA = prep_test_data(WIKI_TEST, CBOW_VOCAB, tokenizer_fn=my_tokenizer, window=CBOW_WINDOW, max_length=CBOW_MAX_LENGTH)

                        print(f"Config: Embed: {embed_dim}, Window: {window}, Batch: {batch_size}, LR: {learning_rate}, Epochs: {num_epochs}")
                        ag.test_cbow_performance(cbow_model, TEST_DATA, 512, get_batch_fn=get_batch)

iter 0 loss 10.287550926208496
iter 100 loss 8.17327341702905
iter 200 loss 6.089326350843136
iter 300 loss 4.980780759919125
iter 400 loss 4.269579738750124
iter 500 loss 3.7637967908691743
iter 600 loss 3.386500017813557
iter 700 loss 3.087695546054976
iter 800 loss 2.8478051124887074
iter 900 loss 2.656657671690252
iter 1000 loss 2.4898448980057037
iter 1100 loss 2.34948182993862
iter 1200 loss 2.2310371388999943
epoch 0 loss 2.207649864122247
iter 0 loss 3.7422635555267334
iter 100 loss 1.3314449822548593
iter 200 loss 0.808415286057624
iter 300 loss 0.584680957305075
iter 400 loss 0.46039214973660775
iter 500 loss 0.3821716453590079
iter 600 loss 0.3271829803689645
iter 700 loss 0.2872138677507681
iter 800 loss 0.25666075241234565
iter 900 loss 0.23314869682512326
iter 1000 loss 0.21492817543528892
iter 1100 loss 0.20111630929737173
iter 1200 loss 0.19364050766771085
epoch 1 loss 0.194566756136313
Config: Embed: 100, Window: 3, Batch: 512, LR: 0.001, Epochs: 2
Test failed! Accurac

iter 600 loss 0.2900973763967711
iter 700 loss 0.25828888513859943
iter 800 loss 0.2334816659871633
iter 900 loss 0.21437125037383953
iter 1000 loss 0.20027294601653364
iter 1100 loss 0.19072536787155214
iter 1200 loss 0.19077561915441912
epoch 2 loss 0.19440881526875028
iter 0 loss 0.9271354675292969
iter 100 loss 0.3197068833183534
iter 200 loss 0.20632587147144535
iter 300 loss 0.155244960613623
iter 400 loss 0.12589289284220656
iter 500 loss 0.10689027814153663
iter 600 loss 0.09336610123229999
iter 700 loss 0.083311215298152
iter 800 loss 0.07554854295180458
iter 900 loss 0.06959010620956019
iter 1000 loss 0.06527675319645372
iter 1100 loss 0.06239908765307835
iter 1200 loss 0.06313088217239941
epoch 3 loss 0.06544664639784496
Config: Embed: 100, Window: 3, Batch: 512, LR: 0.001, Epochs: 2
Test failed! Accuracy = 0.1430288404226303/20
Test G: 0/20
iter 0 loss 10.290266036987305
iter 100 loss 8.266765830540422
iter 200 loss 6.206605042984236
iter 300 loss 5.032163901978553
iter 400

epoch 1 loss 0.19334566382212648
iter 0 loss 0.8933827877044678
iter 100 loss 0.2213100191242624
iter 200 loss 0.1362816919549484
iter 300 loss 0.1002413431287703
iter 400 loss 0.08018066070769493
iter 500 loss 0.06725810535861823
iter 600 loss 0.05819078899002611
iter 700 loss 0.05153594035841259
iter 800 loss 0.046495007508005316
iter 900 loss 0.04261400237406597
iter 1000 loss 0.03973155448884963
iter 1100 loss 0.03764324907812338
iter 1200 loss 0.03678280195309707
epoch 2 loss 0.037554570854795445
Config: Embed: 100, Window: 3, Batch: 512, LR: 0.001, Epochs: 2
Test failed! Accuracy = 0.15286597609519958/20
Test G: 0/20
iter 0 loss 10.296178817749023
iter 100 loss 8.109961665502869
iter 200 loss 6.042489512049737
iter 300 loss 4.954100458328906
iter 400 loss 4.26345232508129
iter 500 loss 3.7669895354383245
iter 600 loss 3.3938020114692398
iter 700 loss 3.093021508630434
iter 800 loss 2.849559129847123
iter 900 loss 2.6539963336948813
iter 1000 loss 2.4875096205945733
iter 1100 loss

iter 600 loss 0.3901377481450257
epoch 1 loss 0.3888000771769135
iter 0 loss 0.656978189945221
iter 100 loss 0.2582020845153544
iter 200 loss 0.1715570223049738
iter 300 loss 0.13169425763066028
iter 400 loss 0.1087870568111353
iter 500 loss 0.09513348733593603
iter 600 loss 0.08932927152926037
epoch 2 loss 0.09026994641547118
Config: Embed: 100, Window: 3, Batch: 512, LR: 0.001, Epochs: 2
Test failed! Accuracy = 0.14271390438079834/20
Test G: 0/20
iter 0 loss 10.286860466003418
iter 100 loss 8.20759608958027
iter 200 loss 6.162895964152777
iter 300 loss 5.006590302204373
iter 400 loss 4.241368118664273
iter 500 loss 3.68441941043336
iter 600 loss 3.26113613294484
epoch 0 loss 3.224379117867365
iter 0 loss 2.0357210636138916
iter 100 loss 1.1220571717413346
iter 200 loss 0.7841871732206487
iter 300 loss 0.6047314636632057
iter 400 loss 0.4954892878148918
iter 500 loss 0.4272798923794143
iter 600 loss 0.38814002344889964
epoch 1 loss 0.38659700804958563
iter 0 loss 0.6795527935028076
it

iter 1200 loss 0.03702660749583419
epoch 2 loss 0.0379113306987988
iter 0 loss 0.2859059274196625
iter 100 loss 0.06314108015434576
iter 200 loss 0.04092863529679639
iter 300 loss 0.030761428245426808
iter 400 loss 0.024913573130138422
iter 500 loss 0.0211209000995431
iter 600 loss 0.018448455037608023
iter 700 loss 0.016489295280284744
iter 800 loss 0.01500287799794836
iter 900 loss 0.01386676763766316
iter 1000 loss 0.013060605810763506
iter 1100 loss 0.012516010870698578
iter 1200 loss 0.012317693048621296
epoch 3 loss 0.0126405006721189
Config: Embed: 100, Window: 3, Batch: 512, LR: 0.001, Epochs: 2
Test failed! Accuracy = 0.12552444636821747/20
Test G: 0/20
iter 0 loss 10.322685241699219
iter 100 loss 9.346929153593459
iter 200 loss 8.090477070405116
iter 300 loss 6.970191198329989
iter 400 loss 6.200864438106889
iter 500 loss 5.659141133645337
iter 600 loss 5.256389058965216
iter 700 loss 4.933627538776261
iter 800 loss 4.667513993795445
iter 900 loss 4.443739452319722
iter 1000 

## Skip Grams

The Skip Gram model.

In [33]:
# Hyperparameters; feel free to change
SKIP_EMBED_DIMENSIONS = 100
SKIP_WINDOW = 4
SKIP_MAX_LENGTH = 50
SKIP_BATCH_SIZE = 1024
SKIP_NUM_EPOCHS = 2
SKIP_LEARNING_RATE = 5e-4

Before training the Skip Gram model, we must prepare the data for training. The Skip Gram model learns to predict words to the left and right of a given word.

This function takes a Pandas data frame and converts it into a regular python array consisting of `(x, y)` pairs where:
* `x` is the index of a word in the corpus.
* `y` is a list of indexes of words to the left of `x` or to the right of `x`.
(Note the organization of the data is the opposite of the CBOW model)

For example, consider the sentence "The quick brown fox jumped over the lazy dog". For a window of size two, we would create the following data:
1. `x=brown`, `y=[the, quick, fox, jumped]`
2. `x=fox`, `y=[quick, brown, jumped, over]`
3. `x=jumped`, `y=[brown, fox, over, the]`
4. `x=over`, `y=[fox, jumped, the, lazy]`
5. `x=the`, `y=[jumped, over, lazy, dog]`

(Except instead of words, there would be the indices for each word in the vocabular)

This is done for every document in the data frame.

`prep_skip_data()` (below) will also simultaneously create the Vocab object.

Thus `prep_skip_data()` should return two values:
* the `[(x1, y1) ... (xn, yn)]` data, where each `y` is a list of word indices
* the Vocab object. The vocab object is initialized for you but not populated.

In [34]:
def prep_skip_gram_data(data_frame, tokenizer_fn, window=2, max_length=50):
    data_out = []
    vocab = Vocab()
  ### BEGIN SOLUTION
    for sentence in data_frame['text']:
        # 使用分词器函数对句子进行分词
        tokens = tokenizer_fn(sentence)

        # 更新词汇表
        for token in tokens:
            vocab.add_word(token)
        
        if len(tokens)< ((window * 2) + 1):
            continue

        # 如果句子过长，只考虑前max_length个词
        if len(tokens) > max_length:
            tokens = tokens[:max_length]

        # 遍历每个可能的窗口中心词
        for i in range(window, len(tokens) - window):
            # 收集上下文词的索引
            context = [vocab.word2index(tokens[j]) for j in range(i-window, i+window+1) if j != i]

            # 获取中心词的索引
            target = vocab.word2index(tokens[i])

            # 添加到输出数据中
            data_out.append((target, context))
    ### END SOLUTION
    return data_out, vocab

In [35]:
SKIP_DATA, SKIP_VOCAB = prep_skip_gram_data(WIKI_TRAIN, my_tokenizer, window=SKIP_WINDOW, max_length=SKIP_MAX_LENGTH)

In [36]:
try:
  SKIP_DATA[0]
except:
  print(traceback.format_exc())

Unit test: compute the number of data points that should be in SKIP_DATA and check the vocab size

In [37]:
# student check - H (5 points)
ag.check_data_size_h(WIKI_TRAIN, SKIP_WINDOW, SKIP_DATA, SKIP_VOCAB, max_length=SKIP_MAX_LENGTH, tokenizer_fn=my_tokenizer)

expected data points 625869
actual data points 625869
difference 0

least vocab size 28782
actual vocab size 28782

Test passed!
Test H: 5/5


### The Skip Gram Model

Complete the Skip Gram model specification.

The Skip Gram model should contain:
* An embedding layer `nn.Embedding`
* A linear layer that transforms the embedding to the vocabulary

The forward function will take the `x` component of the data--a single token index and produces a log softmax distribution over the vocabulary.

In [38]:
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(SkipGram, self).__init__()
        # 定义嵌入层，它将索引转换为嵌入向量
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # 定义线性层，它将嵌入向量转换为词汇表大小的输出，以预测上下文
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        # 通过嵌入层获取x的嵌入向量
        embeds = self.embeddings(x)
        # 将嵌入向量通过线性层
        out = self.linear(embeds)
        # 应用log softmax来获取概率分布
        probs = F.log_softmax(out, dim=1)
        return probs

Unit test: check the layers and layer ordering

In [39]:
# initialize the model
skip_model = SkipGram(SKIP_VOCAB.num_words(), SKIP_EMBED_DIMENSIONS)

In [40]:
# student check - Test I (5 points)
ag.test_skipgram_structure(skip_model)

Your model has two layers as expected!
Your layers orders are as expected!
Test passed!
Test I: 5/5


### Train the Skip Gram Model

In [41]:
try:
  SKIP_CRITERION = nn.NLLLoss()
  SKIP_OPTIMIZER = torch.optim.AdamW(skip_model.parameters(), lr=SKIP_LEARNING_RATE)
except:
    print(traceback.format_exc())

In [42]:
def train_skipgram(model, data, num_epochs, batch_size, criterion, optimizer):
  for epoch in range(num_epochs):
    losses = []
    for i in range(len(data)//batch_size):
      x, y = get_batch(data, i, batch_size)
      y_hat = model(x)
      loss = None
      # Calculate loss for every word in the context
      for word in y.T:
        if loss is None:
          loss = criterion(y_hat, word)
        else:
          loss += criterion(y_hat, word)
      optimizer.zero_grad()
      loss.backward()
      losses.append(loss.item() / y.shape[1])
      optimizer.step()
      if i % 100 == 0:
        print('iter', i, 'loss', np.array(losses).mean())
    print('epoch', epoch, 'loss', np.array(losses).mean())

In [43]:
try:
  train_skipgram(skip_model, SKIP_DATA, num_epochs=SKIP_NUM_EPOCHS, batch_size=SKIP_BATCH_SIZE, criterion=SKIP_CRITERION, optimizer=SKIP_OPTIMIZER)
except:
    print(traceback.format_exc())

iter 0 loss 10.450851440429688
iter 100 loss 9.557377154284184
iter 200 loss 8.552803839024026
iter 300 loss 7.660844748994441
iter 400 loss 6.990698989192744
iter 500 loss 6.499340323868863
iter 600 loss 6.13663020744895
epoch 0 loss 6.105807384765675
iter 0 loss 4.983040809631348
iter 100 loss 4.268890052738756
iter 200 loss 4.003258628038624
iter 300 loss 3.8760668003677927
iter 400 loss 3.8004145099040576
iter 500 loss 3.7504902043028507
iter 600 loss 3.725188686129654
epoch 1 loss 3.724487073682919


Now that we have trained the Skipgram model, we will be using the `WIKI_TEST` dataset again for evaluation. Your Skipgram model will need to achieve at least 30% accuracy to pass the test.

In [44]:
def prep_skip_gram_test_data(data_frame, vocab, tokenizer_fn, window=2, max_length=50):
  data_out = []
  for row in data_frame['text']:
    tokens = tokenizer_fn(row)
    token_ids = [vocab.word2index(w) for w in tokens]
    if len(token_ids) >= (window*2)+1:
        token_ids = token_ids[0:min(len(token_ids), max_length)]
    for i in range(window, len(token_ids)-window):
      x = token_ids[i]
      y = token_ids[i-window:i]
      y.extend(token_ids[i+1:i+1+window])
      data_out.append((x, y))
  return data_out

TEST_DATA = prep_skip_gram_test_data(WIKI_TEST, SKIP_VOCAB, tokenizer_fn=my_tokenizer, window=SKIP_WINDOW, max_length=SKIP_MAX_LENGTH)

In [45]:
# student check - Test J (20 points)
ag.test_skip_performance(skip_model, TEST_DATA, 512, get_batch_fn=get_batch)

Test failed! Accuracy = 0.2414517337328767/1
Test J: 0/20


In [None]:
# Hyperparameters; feel free to change
SKIP_EMBED_DIMENSIONS = 100
SKIP_WINDOW = 4
SKIP_MAX_LENGTH = 50
SKIP_BATCH_SIZE = 1024
SKIP_NUM_EPOCHS = 3
SKIP_LEARNING_RATE = 5e-4

skip_model = SkipGram(SKIP_VOCAB.num_words(), SKIP_EMBED_DIMENSIONS)
SKIP_CRITERION = nn.NLLLoss()
SKIP_OPTIMIZER = torch.optim.AdamW(skip_model.parameters(), lr=SKIP_LEARNING_RATE)

train_skipgram(skip_model, SKIP_DATA, num_epochs=SKIP_NUM_EPOCHS, batch_size=SKIP_BATCH_SIZE, criterion=SKIP_CRITERION, optimizer=SKIP_OPTIMIZER)

TEST_DATA = prep_skip_gram_test_data(WIKI_TEST, SKIP_VOCAB, tokenizer_fn=my_tokenizer, window=SKIP_WINDOW, max_length=SKIP_MAX_LENGTH)

ag.test_skip_performance(skip_model, TEST_DATA, 512, get_batch_fn=get_batch)

# Grading
Please submit this .ipynb file to Gradescope for grading.

## Final Grade

In [46]:
# student check
ag.final_grade()

Your projected points for this assignment is 60/100.

NOTE: THIS IS NOT YOUR FINAL GRADE. YOUR FINAL GRADE FOR THIS ASSIGNMENT WILL BE AT LEAST 60 OR MORE, BUT NOT LESS



# Notebook Runtime

In [47]:
# end time - notebook execution
end_nb = time.time()
# print notebook execution time in minutes
print("Notebook execution time in minutes =", (end_nb - start_nb)/60)
# warn student if notebook execution time is greater than 30 minutes
if (end_nb - start_nb)/60 > 30:
  print("WARNING: Notebook execution time is greater than 30 minutes. Your submission may not complete auto-grading on Gradescope. Please optimize your code to reduce the notebook execution time.")

Notebook execution time in minutes = 46.04129377603531
