# Week 2 Lesson Notebook: Word2Vec_Embeddings & GPT-2 Predictions

In this notebook, we play with some classic word embeddings (using Word2Vec) and then use an old Language Model, GPT-2, to make a few next-word predictions. The purpose is start building up some intuition for the entities and concepts we are working with.  We use "embedding" vectors to represent the words in language as we process them in neural networks.  Embeddings are a fuzzy representation of words.  We use decoder transformers to predict the next word based on the previous sequence of words.  We'll see the mechanics of feeding a sequence of words into a transformer to predict the next word.  We'll use this process through out the rest of the class.<br>

**Note:** In this and other lesson notebooks we will also pose questions for you to think about and solve, if you are interested. Look for '**Additional Question**'.

## 1. Setup

This notebook requires the tensorflow dataset and other prerequisites that you must download and then store locally.

In [1]:
!pip uninstall numpy scikit-learn scip -y

Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Found existing installation: scikit-learn 1.6.0
Uninstalling scikit-learn-1.6.0:
  Successfully uninstalled scikit-learn-1.6.0
[0m

In [2]:
!pip install numpy==1.26.4 --quiet

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, which is not installed.
mlxtend 0.23.4 requires scikit-learn>=1.3.1, which is not installed.
sklearn-pandas 2.2.0 requires scikit-learn>=0.23.0, which is not installed.
libpysal 4.13.0 requires scikit-learn>=1.1, which is not installed.
umap-learn 0.5.7 requires scikit-learn>=0.22, which is not installed.
imbalanced-learn 0.13.0 requires scikit-learn<2,>=1.3.2, which is not installed.
sentence-transformers 4.1.0 requires scikit-learn, which is not installed.
librosa 0.11.0 requires scikit-learn>=1.1.0, which is not installed.
shap 0.47.2 requires scikit-learn, which is not installed.
hdbscan 0.8.40 requires scikit-learn>=0.20, which is not installed.
fastai 2.7.19 requires scikit-learn, which is not installed.
pynndescent 0.5.13 requires scikit-learn>=0.18, which 

In [3]:
!pip install -U scikit-learn==1.6 scipy --quiet

In [1]:
!pip install gensim --quiet
!pip install pydot --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m957.5 kB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.6/60.6 kB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.7/26.7 MB[0m [31m21.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.3/18.3 MB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.6/38.6 MB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.[0m[31m
[0

Ready to do the imports.

In [4]:
import sklearn as sk
import os
import nltk
from nltk.corpus import reuters
from nltk.data import find

import matplotlib.pyplot as plt

import re

import gensim

import numpy as np

Below is a helper function for similarity evaluation:

In [5]:
# We are using cosine similarity

def cos_sim(a, b):

    """
    Computes the cosine similarity
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)


## 2. Word Embeddings

Next, we get the word2vec model from nltk.

In [6]:
nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


How many words are in the vocabulary?

In [7]:
len(model.key_to_index)

43981

How do the word vectors look like? As expected:

In [12]:
model['school']

array([ 3.70471e-02,  1.14410e-02,  1.49575e-02,  8.87547e-02,
        3.96226e-02, -2.67453e-02,  6.33962e-02, -1.90189e-02,
       -1.89446e-03, -3.68490e-02,  1.01038e-01,  1.85236e-02,
        2.69434e-02, -4.00188e-02, -4.29905e-02,  4.31887e-02,
       -8.12264e-02,  5.72052e-03,  5.54717e-02, -3.56604e-02,
        8.32075e-02,  6.93396e-02,  4.72995e-03,  6.97358e-02,
        1.96875e-03, -1.41849e-01,  9.22464e-04,  7.48867e-02,
        4.85377e-02, -1.02028e-02,  4.14056e-02, -4.33868e-02,
        1.62453e-02,  3.04599e-03, -6.61698e-02, -6.06226e-02,
        9.27169e-02, -2.04056e-02,  1.88207e-02,  5.07170e-02,
        5.29953e-03,  5.19056e-02,  4.47736e-02, -2.05047e-02,
        1.39670e-02,  5.86415e-02,  6.97358e-02, -1.12924e-02,
       -4.49717e-02,  9.31132e-02, -4.75471e-02, -4.95283e-02,
       -1.44251e-03, -4.61604e-02,  8.59811e-02, -8.47924e-02,
       -4.23962e-02,  1.78302e-02, -5.00236e-03, -6.45849e-02,
       -3.58585e-02, -1.62453e-02,  4.31887e-02, -2.060

Let's vectorize at a few words and look at the cosine similarities:

In [13]:
vec_car = model['car']
vec_vehicle = model['vehicle']
vec_school = model['school']

In [14]:
cos_sim(vec_car, vec_school)

0.109525874

In [15]:
cos_sim(vec_car, vec_vehicle)

0.7821095

In [16]:
cos_sim(vec_school, vec_vehicle)

0.09002902

Let's play with a few more examples...

In [17]:
vec_related = model['automotive']
cos_sim(vec_car, vec_related)

0.37437758

In [18]:
vec_unrelated = model['aardvark']
cos_sim(vec_car, vec_unrelated)

KeyError: "Key 'aardvark' not present"

Oops! Out of vocabulary used to be a real issue for classic word embeddings.

**Additional Question 1:** Can you verify that the word vectors represent interesting syntactic and semantic relationships well, like '*run* is to *running* as *swim* is *swimming*'. How could you approach that? (Hint: conceptually, 'ing' ~ 'running' - 'run ).

In [20]:
vec_run = model["run"]
vec_running = model["running"]
vec_swim = model["swim"]
vec_swimming = model["swimming"]

In [21]:
vec_ing = vec_running - vec_run
vec_swim_plus_ing = vec_swim + vec_ing


In [22]:
cos_sim(vec_swimming, vec_swim_plus_ing)

0.7389302

In [23]:
model.most_similar(positive=[vec_swim_plus_ing])

[('swim', 0.7550984621047974),
 ('swimming', 0.7389301657676697),
 ('swam', 0.6208618879318237),
 ('swimmers', 0.597092866897583),
 ('swum', 0.5865792632102966),
 ('Swim', 0.5549381971359253),
 ('running', 0.49041056632995605),
 ('Running', 0.41739922761917114),
 ('diver', 0.40837565064430237),
 ('rowed', 0.4043257534503937)]

## 3. Simple Next-Word Predictions with GPT-2

We will now download the GPT2 model from Huggingface and use it to get a feeling for these next-word predictions


In [None]:
#!pip install transformers  --quiet

In [24]:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

In [25]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The model requires tokenized input. I.e., each word is split into tokens (one word can be comprised of one or more tokens) and the token id is used as the input to the model:

In [26]:
inputs = tokenizer("Today is a very nice", return_tensors="pt")

In [27]:
inputs

{'input_ids': tensor([[8888,  318,  257,  845, 3621]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

We see the five input ids and the corresponding 'attention_masks' (~'should' the model pay attention to the position?').

Now we apply the model to the input:

In [28]:
output = model(**inputs)

In [29]:
len(output)

2

Why '2'? The [Huggingface documentation ](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel)is very helpful.

In [32]:
output.keys()

odict_keys(['logits', 'past_key_values'])

In [34]:
output.logits.shape

torch.Size([1, 5, 50257])

What could be the meaning of these dimensions?

Ok, let's the positions by the logits:

In [35]:
logits_last_position = (output.logits.detach()[0, -1])
np.argsort(logits_last_position)

tensor([31573,   208,   214,  ...,   290,   640,  1110])

What is the token corresponding to the highest logit?

In [38]:
tokenizer.decode([290, 640, 1110])

' and time day'

Does this look right? It does...

What are the corresponding *relative* probabilities of the 2 most common words?

In [43]:
logits_last_position[[640, 110]]

tensor([-101.2198, -120.0501])

In [44]:
np.exp(logits_last_position[1110])/ np.exp(logits_last_position[640])

tensor(8.6250)

Substantially more likely to pick token 1.  What was token 2?

In [45]:
tokenizer.decode([640])

' time'

'Today is a very nice **day**' vs 'Today is a very nice **time**'. Makes sense...

**Additional Question 2:** How could you possibly use a language model to determine whether 'This was fun' has *positive* or *negative* sentiment? (Note, GPT-2 isn't that great to say the least, but the principle is instructive.)


In [46]:
inputs = tokenizer("A data cloud strategy requires a crucial component that allows", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits.detach()

In [59]:
last_logit = logits[0, -1]

In [60]:
np.argsort(last_logit)

tensor([18945, 37574,   200,  ...,   262,   514,   345])

In [61]:
tokenizer.decode([345])

' you'