# Embeddings

In this lab we will use both sparse vectors and dense word2vec embeddings to obtain
vector representations of words and documents.

First, we reload the tweets dataset from lab 3. Next, we build a term-document matrix and compute cosine similarities between its vectors. We then use the Gensim library to train a word2vec model and to download a pretrained model. Finally, we use the Gensim embeddings to perform the analogy task.

# 1. Preparing the Data

In [3]:
from datasets import load_dataset
from tqdm import tqdm

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Test dataset with {len(test_dataset)} instances loaded")

# Put the data into lists ready for the next steps...
train_texts = []
train_labels = []
for i in tqdm(range(len(train_dataset))):
    train_texts.append(train_dataset[i]['text'])
    train_labels.append(train_dataset[i]['label'])


Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 45615 instances loaded


Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 12284 instances loaded


100%|██████████| 45615/45615 [00:05<00:00, 8565.39it/s]


# 2. Term-Document Matrix

TODO: Use the CountVectorizer, as in week 3, to obtain a term-document matrix for the training set.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(train_texts)
X_train = vectorizer.transform(train_texts)

TODO: Print out the term vector for the word 'license'. Use the vocabulary_ attribute to look up the word's index. Hint: the CountVectorizer stores a term-document matrix in a sparse format to save memory. You can convert this to a standard numpy array using the method '.toarray()'.

In [31]:
word = 'license'
word_ind = vectorizer.vocabulary_[word]
# slice the column first so that only one term vector is converted to a dense array
word_count_train = X_train[:, word_ind].toarray()

TODO: Print out the document vector for the first text (i.e., tweet) in the training set. Hint: you can use the method '.flatten()' to convert a 1xN matrix to a vector.

In [26]:
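# Note: this prints the term vector for 'license' from the previous cell, not a
# document vector. Its sum is the total count of 'license' in the training set.
flat_count = word_count_train.flatten()
print(flat_count)
print(flat_count.sum())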

[0 0 0 ... 0 0 0]
13
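
To make the difference between a term vector and a document vector concrete, here is a minimal sketch on a hand-built count matrix. The three-sentence corpus and the names `docs`, `vocab` and `X` are made up for illustration and only stand in for the output of CountVectorizer. Rows are documents and columns are terms, so a document vector is a row and a term vector is a column.

```python
import numpy as np

# Toy corpus (made up for illustration, not taken from the tweets dataset).
docs = ["happy happy day", "sad day", "happy license"]

# Map each word to a column index, in alphabetical order.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

# Rows are documents, columns are terms.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for w in doc.split():
        X[row, vocab[w]] += 1

doc_vector = X[0]                    # counts of every term in the first document
term_vector = X[:, vocab["happy"]]   # counts of 'happy' in every document

print(doc_vector)
print(term_vector)
```

With the real sparse matrix, the same row and column slices apply, followed by '.toarray()' and '.flatten()' as described in the hints.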


# 3. Cosine Similarity

TODO: write a function that computes cosine similarity between two vectors. Hint: you might find numpy's linalg library useful.

In [34]:
import numpy as np

def similarity(a, b):
    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim
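
A quick sanity check on small vectors can help confirm that a cosine similarity function behaves as expected. The sketch below uses a separate, hypothetical function name (`cosine_similarity`) and adds one assumption that is not part of the exercise: it returns 0.0 when either vector is all zeros, to avoid dividing by zero.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1, 0], [1, 1]))   # 45 degrees apart
print(cosine_similarity([1, 0], [0, 3]))   # orthogonal vectors
print(cosine_similarity([2, 4], [1, 2]))   # same direction, different length
```

The three calls should give roughly 0.707, 0 and 1, which shows that the measure depends on direction and not on vector length.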


TODO: Use the function to find the five most similar words to 'happy'. Hint: the vocab_inverted dictionary lets you look up a word given its index.

In [38]:
# invert the vocabulary dictionary so we can look up word types given an index
vocab_inverted = {ind: word for word, ind in vectorizer.vocabulary_.items()}

# get count vector given word ind
def get_word_count_vec(X_array, word_ind):
    word_count_X = X_array[:, word_ind]
    return word_count_X.flatten()

# example how to get similarity
X_train_array = X_train.toarray()
sim = similarity(get_word_count_vec(X_train_array, vectorizer.vocabulary_['license']), 
                 get_word_count_vec(X_train_array, vectorizer.vocabulary_['happy']))

target_word = 'happy'
target_count = get_word_count_vec(X_train_array, vectorizer.vocabulary_[target_word])

# Map each word to its similarity with the target word. Keying by word rather
# than by similarity keeps words that share the same similarity value.
sim_to_target = {}
for key in tqdm(vectorizer.vocabulary_.keys()):
    sim_to_target[key] = similarity(target_count,
                                    get_word_count_vec(X_train_array, vectorizer.vocabulary_[key]))


100%|██████████| 43358/43358 [01:31<00:00, 473.68it/s]


In [47]:
# find the most similar words
def n_max_elements(a, num):
    # indices of the num largest values in a, from most to least similar
    ind = np.argpartition(a, -num)[-num:]
    return ind[np.argsort(a[ind])[::-1]]

words = list(sim_to_target.keys())
sims = np.array(list(sim_to_target.values()))

# request six candidates because the target word matches itself with similarity 1
top_5 = [m for m in n_max_elements(sims, 6) if words[m] != target_word][:5]

for m in top_5:
    print(words[m], ': ', sims[m])
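
The loop in this section calls the similarity function once per vocabulary entry, which takes about a minute and a half on the full vocabulary. As a possible alternative, the sketch below computes all similarities in one matrix-vector product. The function name `top_k_similar_terms` and the toy matrix are hypothetical, and the code assumes a dense array with documents as rows and terms as columns.

```python
import numpy as np

def top_k_similar_terms(X, target_col, k):
    """Return (column index, similarity) pairs for the k terms closest to target_col."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)   # one norm per term (column)
    norms[norms == 0] = 1.0             # avoid division by zero for unused terms
    # Dot product of every term column with the target column, then normalise.
    sims = (X.T @ X[:, target_col]) / (norms * norms[target_col])
    sims[target_col] = -np.inf          # exclude the target term itself
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy matrix: 3 documents (rows) by 4 terms (columns), made up for illustration.
X_toy = [[1, 2, 0, 0],
         [1, 0, 0, 1],
         [0, 1, 1, 0]]

print(top_k_similar_terms(X_toy, target_col=1, k=2))
```

The returned indices can be turned back into words with the vocab_inverted dictionary.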

# 4. Word2Vec

For this part, we will need the gensim library. The code below tokenizes the training texts, then runs word2vec (the skipgram model) to learn a set of embeddings.

In [None]:
from gensim.models import word2vec
from gensim.utils import tokenize

tokenized_texts = [list(tokenize(text)) for text in train_texts]
# vector_size is the Gensim 4.x name for this parameter; Gensim 3.x calls it size
emb_model = word2vec.Word2Vec(tokenized_texts, sg=1, min_count=1, window=3, vector_size=100)

We can look up the embedding for any given word like this:

In [None]:
emb_model.wv['happy']

Now, use your cosine similarity method again to find the five most similar words to 'happy' according to your word2vec model.

In [None]:
# WRITE YOUR OWN CODE HERE
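
One possible approach to this exercise is sketched below on made-up vectors rather than on the trained model. The function name `most_similar_words` and the toy data are hypothetical. The sketch assumes the embeddings are available as a list of words plus a matrix with one row per word; with Gensim 4.x these would correspond to `emb_model.wv.index_to_key` and `emb_model.wv.vectors`.

```python
import numpy as np

def most_similar_words(query, words, vectors, topn=5):
    """Rank words by cosine similarity between their embedding and the query's."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise every row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    ranked = np.argsort(sims)[::-1]
    return [(words[i], float(sims[i])) for i in ranked if words[i] != query][:topn]

# Made-up 2-D embeddings, purely for illustration.
toy_words = ["happy", "glad", "sad", "license"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [0.0, 1.0]]

print(most_similar_words("happy", toy_words, toy_vectors, topn=2))
```

Gensim also provides a built-in `most_similar` method on its keyed vectors, which can be used to cross-check a hand-written version.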


Which embeddings do you think have been more effective for finding similar words?



# 5. Downloading Pretrained Models

Above, we trained our own model using the skipgram method. We can also download a pretrained model that has already been trained on a large corpus. A list of available models is [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models). Let's try out GloVe embeddings (an alternative to the skipgram model for learning embeddings) trained on a corpus of tweets:

In [None]:
import gensim.downloader

new_emb_model = gensim.downloader.load('glove-twitter-25')

# show the vector for 'happy':
print(new_emb_model['happy'])

TODO: Repeat the exercise above to find the five most similar words to 'happy' with the downloaded model.

In [None]:
# WRITE YOUR CODE HERE

# 6. Analogy Task

An analogy can be formalised as:

A is to B as A* is to B*.

The analogy task is to find B* given A, B and A*.

TODO: Define a function that can find the closest B* for any given A, B and A*, using the Gensim embeddings.

In [None]:
def analogy(A, B, Astar, embeddings):
    # WRITE YOUR OWN CODE HERE

    ####

    return closest_word

print(analogy('man', 'programmer', 'woman', new_emb_model))
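
A common way to approach the analogy task is vector arithmetic: look for the word whose embedding is closest to B - A + A*. The sketch below illustrates that idea under stated assumptions. The name `analogy_sketch` and the toy vectors are hypothetical, and `embeddings` is assumed to be a plain dictionary from word to numpy vector rather than a Gensim object.

```python
import numpy as np

def analogy_sketch(A, B, Astar, embeddings):
    """Return the word whose vector is closest (by cosine) to B - A + Astar."""
    target = embeddings[B] - embeddings[A] + embeddings[Astar]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (A, B, Astar):
            continue  # skip the query words themselves
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Made-up 2-D embeddings, purely for illustration.
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
}

print(analogy_sketch("man", "king", "woman", toy))
```

With Gensim embeddings, a call such as `most_similar(positive=[B, Astar], negative=[A], topn=1)` performs a similar search and may serve as a reference point for your own function.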