# Homework 9 - Part 2 Notebook

In this homework notebook, we will create and train our own SkipGram embedding, using a short synopsis of the movie The Lion King, written for kids and stored in the text.txt file.

Get familiar with the code and write a short report (about 2 to 3 pages) answering the questions listed at the end of the notebook.

**The report must be submitted in PDF format, before April 8th, 11.59pm!**

Do not forget to write your name and student ID on the report.

### Imports needed

Note: we strongly advise using a CUDA/GPU machine for this notebook.

Technically, this can be done on CPU only, but it will be very slow!

If you decide to run it on CPU, you might also have to change some of the .cuda() calls used on torch tensors and models in this notebook!

In [1]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import functools
import matplotlib.pyplot as plt
CUDA = torch.cuda.is_available()

### Testing for CUDA

We advise running on GPU and setting up CUDA on your machine, as it can drastically reduce the running time of this notebook!

In [2]:
# Define device for torch
use_cuda = True
print("CUDA is available:", torch.cuda.is_available())
device = torch.device("cuda" if (use_cuda and torch.cuda.is_available()) else "cpu")

CUDA is available: False


### Step 1. Produce some data based on a given text for training our SkipGram model    

The functions below will be used to produce our dataset for training the SkipGram model.

The dataset text consists of a short description of the story behind the movie The Lion King, explained in simple terms to kids.

In [3]:
def text_to_train(text, context_window):
    """
    This function receives the text as a list of lowercase words.
    It returns data, a list of all possible (x, y) pairs, with
    - x being the middle word of a window of length 2*context_window + 1,
    - y being a list of 2k words (k = context_window), containing the k
    preceding words and the k following words.
    """
    
    # Get data from list of words in text, using a context window of size k = context_window
    data = []
    for i in range(context_window, len(text) - context_window):
        target = [text[i+e] for e in range(-context_window, context_window+1) if e != 0]
        input_word = text[i]
        data.append((input_word, target))
        
    return data
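
To make the windowing concrete, here is a minimal sketch that reruns the same logic on a seven-word toy list. The names `toy_text` and `toy_pairs` are illustrative and not part of the notebook; the function body is copied from the cell above so the block runs on its own.

```python
def text_to_train(text, context_window):
    # Same logic as the notebook cell: one (center, context) pair per valid position
    data = []
    for i in range(context_window, len(text) - context_window):
        target = [text[i + e] for e in range(-context_window, context_window + 1) if e != 0]
        data.append((text[i], target))
    return data

# Toy input (illustrative only): the first seven words of the corpus
toy_text = ["the", "lion", "king", "is", "an", "animated", "movie"]
toy_pairs = text_to_train(toy_text, context_window=2)

for center, context in toy_pairs:
    print(center, "->", context)
```

With a window of size k, the first and last k words never appear as center words, so a text of N words yields N - 2k pairs (here 7 - 4 = 3).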

In [4]:
def create_text():
    """
    This function loads the string of text from the text.txt file,
    and produces a list of words in string format, as variable text.
    """
    
    # Load corpus from file
    with open("./text.txt", 'r', encoding="utf8") as f:
        corpus = f.readlines()
    
    # Join corpus into a single string
    text = ""
    for s in corpus:
        l = s.split()
        for s2 in l:
            # Removes all special characters from string
            s2 = ''.join(filter(str.isalnum, s2))
            s2 += ' '
            text += s2.lower()
    text = text.split()
    
    return text
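
The cleaning step keeps only alphanumeric characters, which has a side effect worth knowing about: hyphenated words and possessives are merged into a single token. The sketch below isolates that step in a hypothetical helper, `clean_token`, which is not defined in the notebook; the sample strings are made up for illustration.

```python
def clean_token(token):
    # Hypothetical helper mirroring the per-word cleaning done in create_text:
    # drop every non-alphanumeric character, then lowercase
    return "".join(filter(str.isalnum, token)).lower()

# Illustrative samples, not read from text.txt
samples = ["Full-length", "well-known", "Mufasa's", "1994.", "--"]
cleaned = [clean_token(s) for s in samples]
print(cleaned)

# Tokens made only of punctuation become empty strings;
# joining and re-splitting on whitespace drops them, as create_text does
tokens = " ".join(cleaned).split()
print(tokens)
```

This is why tokens such as 'fulllength' and 'wellknown' show up in the word list printed by the notebook.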

In [5]:
text = create_text()
print(text)

['the', 'lion', 'king', 'is', 'an', 'animated', 'movie', 'made', 'by', 'walt', 'disney', 'in', '1994', 'it', 'was', 'the', 'most', 'successful', 'animated', 'movie', 'of', 'the', '1990s', 'the', 'movie', 'is', 'about', 'a', 'young', 'lion', 'prince', 'who', 'learns', 'about', 'his', 'role', 'as', 'prince', 'and', 'in', 'the', 'circle', 'of', 'life', 'it', 'is', 'dedicated', 'to', 'frank', 'wells', 'who', 'was', 'the', 'president', 'of', 'the', 'walt', 'disney', 'company', 'and', 'died', 'shortly', 'before', 'the', 'movie', 'was', 'released', 'into', 'theaters', 'on', 'june', '15', '1994', 'it', 'was', 'the', 'first', 'fulllength', 'disney', 'movie', 'to', 'feature', 'no', 'human', 'characters', 'since', 'bambi', 'much', 'of', 'the', 'voice', 'acting', 'work', 'was', 'done', 'by', 'wellknown', 'actors', 'including', 'james', 'earl', 'jones', 'jeremy', 'irons', 'matthew', 'broderick', 'whoopi', 'goldberg', 'rowan', 'atkinson', 'jonathan', 'taylor', 'thomas', 'and', 'nathan', 'lane', 'the

In [6]:
def generate_data(text, context_window):
    """
    This function receives the text and context window size.
    It produces four outputs:
    - vocab, a set containing the words found in text.txt,
    without any duplicates,
    - data, containing our (x,y) pairs for training,
    - word2index, a dictionary to convert words to their integer index,
    - index2word, a dictionary to convert integer indices to their respective words.
    """
    
    # Create vocabulary set V
    vocab = set(text)
    
    # Word-to-index and index-to-word converters
    word2index = {w:i for i,w in enumerate(vocab)}
    index2word = {i:w for i,w in enumerate(vocab)}
    
    # Generate data
    data = text_to_train(text, context_window)
    
    return vocab, data, word2index, index2word
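
One caveat: the iteration order of a Python set of strings can change from one interpreter session to the next, so the indices assigned above may differ between runs. If reproducible indices matter (for example, when reloading saved embeddings), one possible variation is to sort the vocabulary first. This is a suggested alternative on a toy list, not what the notebook does.

```python
# Toy word list (illustrative only)
toy_text = ["the", "lion", "king", "the", "lion", "prince"]

# Sorting the set gives a stable order, hence stable indices across runs
toy_vocab = sorted(set(toy_text))
toy_word2index = {w: i for i, w in enumerate(toy_vocab)}
toy_index2word = {i: w for w, i in toy_word2index.items()}

print(toy_word2index)
print(toy_index2word)
```

Building `toy_index2word` by inverting `toy_word2index` also guarantees that the two dictionaries stay consistent with each other.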

In [7]:
vocab, data, word2index, index2word = generate_data(text, context_window = 2)

In [8]:
print("The dataset contains", len(vocab), "different words.")

The dataset contains 389 different words.


In [9]:
print("The dataset contains the following words:")
print(vocab)

The dataset contains the following words:
{'hyenas', 'like', 'who', 'lane', 'november', 'john', 'also', 'wants', 'while', 'popular', 'going', 'thought', 'find', 'trouble', 'trick', 'messenger', 'city', 'musical', 'new', 'ground', 'sky', 'scar', '1997', 'out', 'would', 'appears', 'look', '1994', 'characters', 'at', 'happens', 'gone', 'gump', 'puts', 'learns', 'is', 'work', 'uks', 'atkinson', 'rain', 'young', 'save', 'runs', 'walks', 'taylor', 'acting', 'wonderful', 'desert', 'done', 'dedicated', 'land', 'animals', 'daughter', 'go', 'life', 'picks', 'circle', 'movies', 'roars', 'truth', 'died', 'even', 'many', 'tells', 'water', 'nemo', 'three', 'success', 'looking', 'prince', 'lot', 'gorge', 'kill', 'york', 'know', 'upset', 'elephant', 'others', 'it', 'i', 'place', 'finds', 'worldwide', 'them', 'top', 'can', 'forbidden', 'finding', 'past', 'time', 'shows', 'hakuna', 'lionesses', 'more', 'be', 'earl', 'zazu', 'speaks', 'each', 'before', 'used', 'tim', 'than', 'her', 'mufasas', 'of', 'bec

In [10]:
print(word2index)

{'hyenas': 0, 'like': 1, 'who': 2, 'lane': 3, 'november': 4, 'john': 5, 'also': 6, 'wants': 7, 'while': 8, 'popular': 9, 'going': 10, 'thought': 11, 'find': 12, 'trouble': 13, 'trick': 14, 'messenger': 15, 'city': 16, 'musical': 17, 'new': 18, 'ground': 19, 'sky': 20, 'scar': 21, '1997': 22, 'out': 23, 'would': 24, 'appears': 25, 'look': 26, '1994': 27, 'characters': 28, 'at': 29, 'happens': 30, 'gone': 31, 'gump': 32, 'puts': 33, 'learns': 34, 'is': 35, 'work': 36, 'uks': 37, 'atkinson': 38, 'rain': 39, 'young': 40, 'save': 41, 'runs': 42, 'walks': 43, 'taylor': 44, 'acting': 45, 'wonderful': 46, 'desert': 47, 'done': 48, 'dedicated': 49, 'land': 50, 'animals': 51, 'daughter': 52, 'go': 53, 'life': 54, 'picks': 55, 'circle': 56, 'movies': 57, 'roars': 58, 'truth': 59, 'died': 60, 'even': 61, 'many': 62, 'tells': 63, 'water': 64, 'nemo': 65, 'three': 66, 'success': 67, 'looking': 68, 'prince': 69, 'lot': 70, 'gorge': 71, 'kill': 72, 'york': 73, 'know': 74, 'upset': 75, 'elephant': 76, 

In [11]:
print(index2word)

{0: 'hyenas', 1: 'like', 2: 'who', 3: 'lane', 4: 'november', 5: 'john', 6: 'also', 7: 'wants', 8: 'while', 9: 'popular', 10: 'going', 11: 'thought', 12: 'find', 13: 'trouble', 14: 'trick', 15: 'messenger', 16: 'city', 17: 'musical', 18: 'new', 19: 'ground', 20: 'sky', 21: 'scar', 22: '1997', 23: 'out', 24: 'would', 25: 'appears', 26: 'look', 27: '1994', 28: 'characters', 29: 'at', 30: 'happens', 31: 'gone', 32: 'gump', 33: 'puts', 34: 'learns', 35: 'is', 36: 'work', 37: 'uks', 38: 'atkinson', 39: 'rain', 40: 'young', 41: 'save', 42: 'runs', 43: 'walks', 44: 'taylor', 45: 'acting', 46: 'wonderful', 47: 'desert', 48: 'done', 49: 'dedicated', 50: 'land', 51: 'animals', 52: 'daughter', 53: 'go', 54: 'life', 55: 'picks', 56: 'circle', 57: 'movies', 58: 'roars', 59: 'truth', 60: 'died', 61: 'even', 62: 'many', 63: 'tells', 64: 'water', 65: 'nemo', 66: 'three', 67: 'success', 68: 'looking', 69: 'prince', 70: 'lot', 71: 'gorge', 72: 'kill', 73: 'york', 74: 'know', 75: 'upset', 76: 'elephant', 

In [12]:
print(data)

[('king', ['the', 'lion', 'is', 'an']), ('is', ['lion', 'king', 'an', 'animated']), ('an', ['king', 'is', 'animated', 'movie']), ('animated', ['is', 'an', 'movie', 'made']), ('movie', ['an', 'animated', 'made', 'by']), ('made', ['animated', 'movie', 'by', 'walt']), ('by', ['movie', 'made', 'walt', 'disney']), ('walt', ['made', 'by', 'disney', 'in']), ('disney', ['by', 'walt', 'in', '1994']), ('in', ['walt', 'disney', '1994', 'it']), ('1994', ['disney', 'in', 'it', 'was']), ('it', ['in', '1994', 'was', 'the']), ('was', ['1994', 'it', 'the', 'most']), ('the', ['it', 'was', 'most', 'successful']), ('most', ['was', 'the', 'successful', 'animated']), ('successful', ['the', 'most', 'animated', 'movie']), ('animated', ['most', 'successful', 'movie', 'of']), ('movie', ['successful', 'animated', 'of', 'the']), ('of', ['animated', 'movie', 'the', '1990s']), ('the', ['movie', 'of', '1990s', 'the']), ('1990s', ['of', 'the', 'the', 'movie']), ('the', ['the', '1990s', 'movie', 'is']), ('movie', ['19

In [None]:
def words_to_tensor(words: list, word2index: dict, dtype = torch.FloatTensor):
    """
    This function converts a list of words into a torch tensor of word
    indices, with the requested dtype, and moves it to the selected device.
    It reuses the word2index dictionary.
    """
    
    tensor =  dtype([word2index[word] for word in words])
    tensor = tensor.to(device)
    
    return Variable(tensor)
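
Note that the default dtype above is a float tensor, whereas `nn.Embedding` lookups and `nn.NLLLoss` targets expect integer (long) indices. Callers will therefore likely need to pass `dtype = torch.LongTensor`. The sketch below shows the idea on a toy dictionary; the dictionary and the embedding size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Toy dictionary (illustrative only)
toy_word2index = {"king": 0, "lion": 1, "prince": 2, "the": 3}

# Word indices must be integers to be used for an embedding lookup
indices = torch.tensor([toy_word2index[w] for w in ["the", "lion"]], dtype=torch.long)

# Each index selects one row of the embedding matrix
toy_embedding = nn.Embedding(num_embeddings=len(toy_word2index), embedding_dim=5)
vectors = toy_embedding(indices)

print(indices.dtype, vectors.shape)
```

With the notebook's own helper, the equivalent call would be `words_to_tensor(["the", "lion"], word2index, dtype = torch.LongTensor)`.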

### Step 2. Create a SkipGram model and train

Task #1: Write your own model for the SkipGram model below.

In [None]:
class SkipGram(nn.Module):
    """
    Your skipgram model here!
    """
    
    def __init__(self, context_size, embedding_dim, vocab_size):
        pass

    def forward(self, inputs):
        pass
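
As a starting point only, here is a rough sketch of one possible architecture matching the constructor signature above: an embedding layer followed by a linear layer that produces one score vector over the vocabulary for each of the 2k context positions. The class name `SkipGramSketch`, the layer layout, and the toy sizes are assumptions; other designs (for example, a single shared output distribution) are equally valid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramSketch(nn.Module):
    """One possible layout, not the expected answer."""

    def __init__(self, context_size, embedding_dim, vocab_size):
        super().__init__()
        self.context_size = context_size
        self.vocab_size = vocab_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # One score per (context position, vocabulary word)
        self.linear = nn.Linear(embedding_dim, 2 * context_size * vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of center-word indices, shape (batch,)
        embedded = self.embeddings(inputs)                 # (batch, embedding_dim)
        scores = self.linear(embedded)                     # (batch, 2k * vocab)
        scores = scores.view(-1, 2 * self.context_size, self.vocab_size)
        # Log-probabilities over the vocabulary for each context position
        return F.log_softmax(scores, dim=-1)

# Toy sizes (illustrative only)
sketch_model = SkipGramSketch(context_size=2, embedding_dim=20, vocab_size=50)
out = sketch_model(torch.tensor([3, 7]))
print(out.shape)
```

The output holds log-probabilities, which is the input format `nn.NLLLoss` expects; the result would need to be reshaped to (batch * 2k, vocab) before being compared with the 2k target indices of each sample.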

In [None]:
# Create model and pass to CUDA if available.
model = SkipGram(context_size = 2, embedding_dim = 20, vocab_size = len(vocab))
model = model.to(device)
model.train()

In [None]:
# Define training parameters
learning_rate = 0.001
epochs = 50
torch.manual_seed(28)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr = learning_rate)
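
For reference, `nn.NLLLoss` expects log-probabilities as input (for instance the output of `log_softmax`) and integer class indices as targets. The sketch below checks this on random scores and also shows one simple way to compute an accuracy; all sizes and values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setting (illustrative only): 4 context positions, vocabulary of 6 words
num_context, vocab_size = 4, 6
scores = torch.randn(num_context, vocab_size)
log_probs = F.log_softmax(scores, dim=1)   # NLLLoss expects log-probabilities
targets = torch.tensor([0, 2, 3, 5])       # long indices of the true context words

loss = nn.NLLLoss()(log_probs, targets)

# Same quantity by hand: mean negative log-probability of the true classes
manual = -log_probs[torch.arange(num_context), targets].mean()

# Accuracy: fraction of positions where the most likely word is the true one
predicted = log_probs.argmax(dim=1)
accuracy = (predicted == targets).float().mean().item()

print(loss.item(), manual.item(), accuracy)
```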

Task #2: Write your own training function for the SkipGram model in the cell below. It should return a list of losses and accuracies for display later on, along with your trained model. You may also write a helper function for computing the accuracy of your model during training.

In [None]:
def get_prediction(target, model, word2index, index2word):
    """
    This is a helper function to get predictions from our model.
    """
    
    return None

In [None]:
def train(data, word2index, model, epochs, loss_func, optimizer):
    """
    This is a trainer function to train our SkipGram model.
    """
    losses = []
    accuracies = []
    pass
    return losses, accuracies, model

losses, accuracies, model = train(data, word2index, model, epochs, loss_function, optimizer)

### Step 3. Visualization

In [None]:
# Display losses over time
plt.figure()
plt.plot(losses)
plt.show()

In [None]:
# Display accuracies over time
plt.figure()
plt.plot(accuracies)
plt.show()

### Step 4. Extract the embedding and play with it (Optional)
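
One way to explore a trained embedding is to look up the nearest neighbours of a word by cosine similarity. The sketch below uses an untrained `nn.Embedding` as a stand-in and a four-word toy dictionary, so the neighbours it prints are meaningless; the helper name `nearest_words` and the assumption that the model stores its vectors in an `nn.Embedding` layer are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy dictionary and untrained embedding (stand-ins for the trained model)
toy_index2word = {0: "scar", 1: "zazu", 2: "hyenas", 3: "prince"}
toy_embedding = nn.Embedding(len(toy_index2word), 8)

def nearest_words(query_index, embedding, index2word, top_k=2):
    # Compare the query vector against every row of the embedding matrix
    weights = embedding.weight.detach()
    sims = F.cosine_similarity(weights[query_index].unsqueeze(0), weights, dim=1)
    sims[query_index] = float("-inf")  # exclude the query word itself
    best = torch.topk(sims, k=top_k).indices.tolist()
    return [index2word[i] for i in best]

neighbours = nearest_words(0, toy_embedding, toy_index2word)
print(neighbours)
```

On a trained model, one would pass the model's own embedding layer and the notebook's `index2word` dictionary instead of the toy stand-ins.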

### Questions and expected answers for the report

The questions listed below are related to Part 2.

P2-QA. Copy and paste your SkipGram class code (Task #1 in this notebook).

P2-QB. Copy and paste your train function (Task #2 in the notebook), along with any helper functions you might have used (e.g. a function to compute the accuracy of your model after each iteration).
Please also copy and paste the function call with the parameters you used for the train() function.

P2-QC. Why is the SkipGram model much more difficult to train than the CBoW model?
Is it problematic if it does not reach 100% accuracy on the task it is being trained on?

P2-QD. If we were to evaluate this model using intrinsic methods, what would be a possible approach?

P2-QE. (Optional) Please submit any additional code you might have that demonstrates the performance/problems of the word embedding you have trained!