# HW05: Word Embeddings

Remember that these homework assignments are graded for completion. **You can <span style="color:red">not</span> skip a section in this homework.**

**Essay Feedback**

Please provide feedback on two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of plain text from a Wikipedia dump from 2006 and is a common benchmark for evaluating language models.

In [1]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [2]:
dataset = api.load("text8")

In [5]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

##TODO train a word2vec model on this dataset, keeping only words that appear at least 10 times in the corpus

# https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec

# Train model (min_count=10 drops words that occur fewer than 10 times)
model = Word2Vec(dataset, min_count=10)

# Get word vectors
word_vectors = model.wv

# Store word vectors to disk
word_vectors.save('vectors.kv')

**Word Similarities**

gensim models provide almost all the utilities you might want for standard word similarity tasks. They are available through the model's .wv (word vectors) attribute; more details can be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).
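
To make the idea behind these utilities concrete, here is a minimal sketch of a cosine-similarity lookup in plain Python. The three-dimensional `toy_vectors` and the helper names are made up for illustration and are not part of gensim; the real `most_similar` works on the trained vectors and is far more efficient.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for this example only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

ranking = toy_most_similar("king", toy_vectors)
print(ranking)
```

The result has the same shape as the output of `most_similar`: a list of `(word, cosine similarity)` tuples sorted from most to least similar.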

In [6]:
##TODO find the closest words to king

# Ref 1: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar
from gensim.models import KeyedVectors

# Load word vectors from disk
word_vectors = KeyedVectors.load('vectors.kv')

print(word_vectors)

# Result is a list of tuples: (most_similar_key, cosine_similarity)
result = word_vectors.most_similar(positive=['king'])

print(result)

KeyedVectors<vector_size=100, 47134 keys>
[('prince', 0.741571843624115), ('emperor', 0.7393261790275574), ('queen', 0.7222601175308228), ('throne', 0.7127467393875122), ('regent', 0.683900773525238), ('vii', 0.6831018924713135), ('aragon', 0.679315447807312), ('sultan', 0.6753013730049133), ('kings', 0.6689374446868896), ('viii', 0.6606752276420593)]


Man is to king as woman is to X

In [7]:
##TODO find the closest word for the vector "woman" + "king" - "man"

# Ref 1: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar

result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

print(result)

[('queen', 0.6665955185890198), ('throne', 0.6042353510856628), ('prince', 0.6033719778060913), ('empress', 0.6009417176246643), ('princess', 0.5853360295295715), ('isabella', 0.5832987427711487), ('son', 0.5782522559165955), ('emperor', 0.5712879300117493), ('aragon', 0.5683165192604065), ('elizabeth', 0.5633431673049927)]
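
The arithmetic behind this query can be sketched in plain Python. The toy vectors and the `toy_analogy` helper below are hypothetical and only illustrate the idea: unit-normalize the input vectors, add the positive ones, subtract the negative ones, and pick the closest remaining word by cosine similarity while leaving out the query words themselves.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

# Hypothetical embeddings: dimension 0 ~ "male", 1 ~ "female", 2 ~ "royalty"
toy = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [0.0, 1.0, 0.9],
    "apple": [0.5, 0.5, 0.0],
}

def toy_analogy(positive, negative, vectors):
    """Return the (word, score) closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[word]))]
    # The query words are excluded from the candidates
    exclude = set(positive) | set(negative)
    candidates = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return max(candidates, key=lambda item: item[1])

best = toy_analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(best)
```

Excluding the query words matters: without it, the closest vector to "woman" + "king" - "man" is often one of the inputs themselves.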


**Evaluate Word Similarities** 

One common way to evaluate word2vec models is with word similarity tasks. Let's check how well our model does on one of them. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, where the task is to determine how similar two words are.

In [8]:
!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in X contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1])) # each entry in y is the annotation of how similar the two words are, e.g. y[0] = 7.35
    return X, y

X, y = load_data(path)
print(X[:3], y[:3])

--2023-03-31 14:32:23--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz.8’


2023-03-31 14:32:24 (179 MB/s) - ‘ws353simrel.tar.gz.8’ saved [5460/5460]

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [6]:
##TODO compute how similar the pairs in WordSim353 are according to our model
# if a word is not present in our model, we assign similarity 0 to the respective word pair

# Ref: https://radimrehurek.com/gensim/models/keyedvectors.html#what-can-i-do-with-word-vectors

# Format each gold-standard entry as one CSV row: word1,word2,score
fn = lambda pair, value: f"{pair[0]},{pair[1]},{value}"

data = list(map(fn, X, y))

# Store the data as a CSV file so we can pass it to the function below,
# which reads the word pairs from a file rather than from memory
with open('data.csv', 'w') as f:
    f.write("\n".join(data))

# Apply our model to the gold standard (the human annotations);
# dummy4unknown=True assigns similarity 0 to pairs that contain an unknown word
similarities = word_vectors.evaluate_word_pairs('data.csv', delimiter=",", dummy4unknown=True)

print(similarities)

(PearsonRResult(statistic=0.6820281339003748, pvalue=3.979966014111241e-29), SignificanceResult(statistic=0.666789919894723, pvalue=1.8004414358901796e-27), 0.9852216748768473)


In [46]:
from scipy.stats import spearmanr

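##TODO compute Spearman's rank correlation between our predictions and the human annotations

# evaluate_word_pairs already returns (Pearson result, Spearman result, out-of-vocabulary ratio),
# so the Spearman correlation is the second element of the tuple
print(similarities)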

(PearsonRResult(statistic=0.6869470526204524, pvalue=1.1061160166189108e-29), SignificanceResult(statistic=0.6695084153849175, pvalue=9.27314443825835e-28), 0.9852216748768473)
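
For intuition about what the Spearman statistic measures, here is a minimal pure-Python sketch: rank both lists (ties share their average rank) and compute the Pearson correlation of the ranks. The scores in `human` and `model_scores` are made up for illustration; in the notebook you would normally call `scipy.stats.spearmanr` on the real predictions instead.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up scores for four word pairs (not taken from WordSim353)
human = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.80, 0.85, 0.30, 0.10]

rho = spearman(human, model_scores)
print(rho)
```

Because only the ranks matter, the human scores (0 to 10) and the cosine similarities (-1 to 1) do not need to be on the same scale.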


In [None]:
import spacy
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment

**PyTorch Embeddings**

In [None]:
# Import the AG News dataset (same as hw01)
# Download it from here:
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk

# The CSV has no header row, so we name the columns ourselves
df = pd.read_csv('train.csv', header=None, names=["label", "title", "lead"])

label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000)  # only use 10K datapoints
df.head()

In [None]:
vocab = 200
##TODO tokenize the text, only keep the 200 most frequent words
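
One possible way to approach this step, sketched on a toy corpus: count tokens with `collections.Counter` and keep the most frequent ones. The regex tokenizer, the `build_vocab` helper, and the example sentences are assumptions made for illustration; an nltk tokenizer applied to `df["text"]` would slot in the same way.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts, vocab_size):
    """Return the `vocab_size` most frequent tokens across all texts."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return [word for word, _ in counts.most_common(vocab_size)]

# Toy headlines, invented for this example
toy_texts = [
    "Stocks rise as markets rally",
    "Markets fall as stocks slide",
    "Team wins as fans cheer",
]

toy_vocab = build_vocab(toy_texts, vocab_size=3)
print(toy_vocab)
```

On the real data, `vocab_size` would be the `vocab` variable defined above, and every token outside the vocabulary would later be mapped to a shared "unknown" entry.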

In [None]:
length = 100
#TODO create a one_hot representation for each word and truncate/pad the sequences such that they are all of the same length (here we use 100)
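
A minimal sketch of the encoding step, assuming a small hypothetical vocabulary: map each token to an integer index, truncate or pad to a fixed length, and (if needed) expand the indices into one-hot rows. The reserved `PAD` and `UNK` indices and the helper names are illustrative choices, not something the notebook prescribes.

```python
PAD, UNK = 0, 1  # hypothetical reserved indices for padding and unknown words

def encode(tokens, word_to_index, length):
    """Map tokens to indices, then truncate or right-pad to `length`."""
    ids = [word_to_index.get(token, UNK) for token in tokens][:length]
    return ids + [PAD] * (length - len(ids))

def one_hot(ids, num_classes):
    """Turn each index into a one-hot row of size `num_classes`."""
    return [[1 if i == idx else 0 for i in range(num_classes)] for idx in ids]

# Toy vocabulary; real indices start after the two reserved ones
toy_vocab = ["as", "stocks", "markets"]
word_to_index = {word: i + 2 for i, word in enumerate(toy_vocab)}

long_ids = encode("stocks rise as markets rally today".split(), word_to_index, length=5)
short_ids = encode(["as"], word_to_index, length=3)
rows = one_hot(long_ids, num_classes=len(toy_vocab) + 2)

print(long_ids)
print(short_ids)
```

An embedding layer such as `torch.nn.Embedding` takes the integer indices directly, so the explicit one-hot rows are mainly useful for inspection.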

In [None]:

##TODO create your torch embedding like we did in notebook 5! (hint: predicting labels: world, sport, business, and sci/tech)