# Week 2 Lesson Notebook: Word2Vec Embeddings & GPT-2 Predictions

In this notebook, we play with some classic word embeddings (using Word2Vec) and then use an older language model, GPT-2, to make a few next-word predictions. The purpose is to start building some intuition for the entities and concepts we are working with. We use "embedding" vectors to represent the words of a language as we process them in neural networks. Embeddings are a fuzzy representation of words: words that are used in similar contexts end up with similar vectors. We use decoder transformers to predict the next word based on the previous sequence of words. We'll see the mechanics of feeding a sequence of words into a transformer to predict the next word. We'll use this process throughout the rest of the class.

**Note:** In this and other lesson notebooks we will also pose questions for you to think about and solve, if you are interested. Look for '**Additional Question**'.

## 1. Setup

This notebook requires the tensorflow dataset and other prerequisites that you must download and then store locally.

In [1]:
!pip install gensim --quiet
!pip install pydot --quiet

Ready to do the imports.

In [2]:
import sklearn as sk
import os
import nltk
from nltk.corpus import reuters
from nltk.data import find

import matplotlib.pyplot as plt

import re

import gensim

import numpy as np

Below is a helper function for similarity evaluation:

In [3]:
# We are using cosine similarity

def cos_sim(a, b):

    """
    Computes the cosine similarity
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
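
To get a feeling for what the helper returns, here is a minimal sketch on toy 2-D vectors. The vectors are made up for illustration and are not real embeddings; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors (illustrative values only)
v = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but pointing the same way
orthogonal = np.array([0.0, 2.0])       # at a right angle to v
opposite = np.array([-1.0, 0.0])        # pointing the other way

sim_same = cos_sim(v, same_direction)
sim_orth = cos_sim(v, orthogonal)
sim_opp = cos_sim(v, opposite)
print(sim_same, sim_orth, sim_opp)
```

The length of the vectors does not matter, only the angle between them: same direction gives 1, a right angle gives 0, and opposite directions give -1.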


## 2. Word Embeddings

Next, we get the word2vec model from nltk.

In [None]:
nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

How many words are in the vocabulary?

In [None]:
len(model.key_to_index)

What do the word vectors look like? As expected:

In [None]:
model['school']

Let's vectorize a few words and look at the cosine similarities:

In [None]:
vec_car = model['car']
vec_vehicle = model['vehicle']
vec_school = model['school']

In [None]:
cos_sim(vec_car, vec_school)

In [None]:
cos_sim(vec_car, vec_vehicle)

In [None]:
cos_sim(vec_school, vec_vehicle)

Let's play with a few more examples...

In [None]:
vec_related = model['automotive']
cos_sim(vec_car, vec_related)

In [None]:
vec_unrelated = model['aardvark']
cos_sim(vec_car, vec_unrelated)

Oops! Out-of-vocabulary words used to be a real issue for classic word embeddings.
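
One simple way to avoid the error is to check the vocabulary before looking a word up. Below is a minimal sketch of that idea. It uses a plain dict with made-up vectors as a stand-in for the real vocabulary, and `lookup` is a hypothetical helper name. With the loaded word2vec model, the analogous check would be a membership test against `model.key_to_index`.

```python
import numpy as np

# Stand-in for the real vocabulary: a plain dict mapping words to vectors.
# The values are made up for illustration.
toy_vectors = {
    "car": np.array([0.9, 0.1]),
    "vehicle": np.array([0.8, 0.2]),
}

def lookup(vectors, word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return None

found = lookup(toy_vectors, "car")
missing = lookup(toy_vectors, "aardvark")
print(found is not None, missing is None)
```

Returning `None` only sidesteps the crash. The word still has no representation, which is the actual limitation.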

**Additional Question 1:** Can you verify that the word vectors represent interesting syntactic and semantic relationships well, like '*run* is to *running* as *swim* is to *swimming*'? How could you approach that? (Hint: conceptually, 'ing' ~ 'running' - 'run'.)
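
One possible approach, sketched on toy data: compute 'running' - 'run' + 'swim' and look for the nearest remaining word by cosine similarity. The 3-D vectors below are made up so that the arithmetic is easy to follow, and `analogy` is a hypothetical helper name. With real embeddings, gensim's `KeyedVectors.most_similar` accepts `positive` and `negative` word lists for the same kind of query.

```python
import numpy as np

# Toy "embeddings" (made-up values): the third axis plays the role of '-ing'.
toy = {
    "run":      np.array([1.0, 0.0, 0.0]),
    "running":  np.array([1.0, 0.0, 1.0]),
    "swim":     np.array([0.0, 1.0, 0.0]),
    "swimming": np.array([0.0, 1.0, 1.0]),
    "school":   np.array([0.5, 0.5, 0.0]),
}

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' by ranking words near b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three query words, otherwise one of them often wins.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos_sim(vectors[w], target))

answer = analogy(toy, "run", "running", "swim")
print(answer)
```

On real vectors the match will not be exact, so it is worth looking at the top few candidates rather than only the best one.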

## 3. Simple Next-Word Predictions with GPT-2

We will now download the GPT-2 model from Hugging Face and use it to get a feeling for these next-word predictions.


In [None]:
#!pip install transformers  --quiet

In [None]:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

In [None]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

The model requires tokenized input. That is, the text is split into tokens (one word can consist of one or more tokens), and the token ids are used as the input to the model:

In [None]:
inputs = tokenizer("Today is a very nice", return_tensors="pt")

In [None]:
inputs

We see the five input ids and the corresponding 'attention_mask' (~ 'should the model pay attention to this position?').
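
For a single input the mask is all ones. It becomes interesting when several inputs of different lengths are batched and the shorter ones are padded. The sketch below illustrates the idea with NumPy only. The token ids and `PAD_ID` are made up, and `pad_batch` is a hypothetical helper, not part of the tokenizer API.

```python
import numpy as np

PAD_ID = 0  # hypothetical padding id, chosen only for this illustration

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token ids for two inputs of different length
ids, mask = pad_batch([[11, 12, 13, 14, 15], [21, 22, 23]])
print(ids)
print(mask)
```

The second row of the mask ends in zeros, marking the padded positions that the model should ignore.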

Now we apply the model to the input:

In [None]:
output = model(**inputs)

In [None]:
len(output)

Why '2'? The [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) is very helpful.

In [None]:
output.keys()

In [None]:
output.logits.shape

What could be the meaning of these dimensions?

OK, let's sort the vocabulary positions (token ids) by their logits at the last input position:

In [None]:
logits_last_position = (output.logits.detach()[0, -1])
np.argsort(logits_last_position)
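
Note that `np.argsort` sorts in ascending order, so the id with the highest logit is the last entry of the result. Here is a minimal sketch on random stand-in logits (a toy vocabulary of 8 tokens instead of GPT-2's) that also spells out the indexing: the logits have shape (batch size, sequence length, vocabulary size), and `[0, -1]` selects the first batch element and the last position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for output.logits: (batch_size, sequence_length, vocab_size).
# Random values and a toy vocabulary of 8 tokens, for illustration only.
toy_logits = rng.normal(size=(1, 5, 8))

# First batch element, last position: scores for the token that comes next.
last = toy_logits[0, -1]

order = np.argsort(last)    # ascending: lowest logit first
top3 = order[::-1][:3]      # reverse and keep the 3 highest-scoring ids
print(top3, last[top3])
```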

What is the token corresponding to the highest logit?

In [None]:
tokenizer.decode([1110])

Does this look right? It does...

What are the corresponding *relative* probabilities of the two most likely tokens?

In [None]:
np.exp(logits_last_position[1110])/ np.exp(logits_last_position[640])

The model is substantially more likely to pick the top token (id 1110).  What was the runner-up (id 640)?

In [None]:
tokenizer.decode([640])

'Today is a very nice **day**' vs 'Today is a very nice **time**'. Makes sense...
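
Why was it enough to divide the two exponentiated logits above? The softmax normalizer is the same for every token, so it cancels in the ratio, and the ratio only depends on the difference of the two logits. A minimal sketch on made-up logits:

```python
import numpy as np

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Made-up logits over a 4-token toy vocabulary
toy_logits = np.array([2.0, 1.0, 0.5, -1.0])
probs = softmax(toy_logits)

ratio_from_probs = probs[0] / probs[1]
ratio_from_logits = np.exp(toy_logits[0] - toy_logits[1])
print(ratio_from_probs, ratio_from_logits)
```

Both ratios agree. Using the difference of logits inside a single `exp` is also a bit safer numerically when the logits are large.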

**Additional Question 2:** How could you possibly use a language model to determine whether 'This was fun' has *positive* or *negative* sentiment? (Note: GPT-2 isn't that great, to say the least, but the principle is instructive.)
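
One possible direction, offered as a sketch rather than the answer: extend the text with a prompt whose natural next word names the sentiment, then compare the next-token scores the model assigns to a 'positive' token and a 'negative' token. The snippet below only shows the comparison step. The logits and token ids are made up, and `pick_label` is a hypothetical helper. With GPT-2 you would take the logits at the last position and look up the real ids with the tokenizer.

```python
import numpy as np

def pick_label(next_token_logits, label_token_ids):
    """Return the label whose token has the highest next-token logit."""
    return max(
        label_token_ids,
        key=lambda label: next_token_logits[label_token_ids[label]],
    )

# Made-up next-token logits over a 6-token toy vocabulary
toy_next_token_logits = np.array([0.1, -0.3, 1.7, 0.0, 0.9, -1.2])

# Hypothetical token ids for the two label words
label_ids = {"positive": 2, "negative": 4}

label = pick_label(toy_next_token_logits, label_ids)
print(label)
```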
