# HW05: Word Embeddings

Remember that these homework assignments are graded for completion. **You can <span style="color:red">not</span> skip any section of this homework.**

**Essay Feedback**

Please provide feedback to two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of cleaned plain text from a 2006 Wikipedia dump and is a common benchmark for evaluating language models.

In [1]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [2]:
dataset = api.load("text8")



In [19]:
from gensim.models import Word2Vec

##TODO train a word2vec model on this dataset, keeping only words which appear at least 10 times in the corpus
w2v = Word2Vec(dataset,
               workers = 8,
               vector_size = 300,
               min_count = 10,
               window = 5,
               sample = 1e-3
               )

**Word Similarities**

gensim models provide almost all the utilities you might wish for to perform standard word similarity tasks. They are available in the .wv (word vectors) attribute of the model; more details can be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [20]:
# model.wv

##TODO find the closest words to king
w2v.wv.most_similar('king')

[('prince', 0.6579781174659729),
 ('kings', 0.6461048722267151),
 ('throne', 0.616422176361084),
 ('emperor', 0.6078433394432068),
 ('queen', 0.6054559350013733),
 ('aragon', 0.6049193143844604),
 ('pharaoh', 0.5987396836280823),
 ('sultan', 0.5826488733291626),
 ('vii', 0.5816540718078613),
 ('sigismund', 0.5784335732460022)]
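
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the block below uses three made-up three-dimensional vectors (they are not taken from the trained model), and the helper names `cosine` and `rank_neighbours` are hypothetical, introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "prince": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def rank_neighbours(word, vectors):
    """Return all other words sorted by decreasing cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(rank_neighbours("king", toy_vectors))
```

Because cosine similarity ignores vector length, only the direction of the vectors decides the ranking.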

Man is to king as woman is to X

In [29]:
##TODO find the closest word for the vector "woman" + "king" - "man"

result_vector = w2v.wv['king'] - w2v.wv['man'] + w2v.wv['woman'] 

w2v.wv.most_similar(result_vector)

## Answer: The closest word is still "king", with "queen" being the second-closest. Querying with a raw vector does not exclude the input words from the results.

[('king', 0.8074710369110107),
 ('queen', 0.6103578209877014),
 ('throne', 0.5557685494422913),
 ('empress', 0.5511813759803772),
 ('prince', 0.5508004426956177),
 ('elizabeth', 0.5499032139778137),
 ('kings', 0.5362768769264221),
 ('daughter', 0.5321873426437378),
 ('reigning', 0.5302295088768005),
 ('emperor', 0.5291579365730286)]
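
One likely reason "king" ranks first is that a raw query vector gives the search no way of knowing which words were used to build it. To my understanding, calling gensim's `most_similar` with `positive=` and `negative=` word lists leaves the input words out of the results. The sketch below illustrates the effect with made-up three-dimensional vectors (not the trained model's); `analogy` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors, invented for illustration only
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 0.3],
    "king": [1.0, 1.0, 0.0],
    "queen": [0.6, 1.0, 0.9],
}

def analogy(a, b, c, vectors, exclude_inputs):
    """Rank words by cosine similarity to vectors[b] - vectors[a] + vectors[c]."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skip = {a, b, c} if exclude_inputs else set()
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(analogy("man", "king", "woman", toy, exclude_inputs=False)[0][0])  # king
print(analogy("man", "king", "woman", toy, exclude_inputs=True)[0][0])   # queen
```

Since "man" and "woman" are close to each other, the offset barely moves the query away from "king", so "king" wins unless it is filtered out.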

**Evaluate Word Similarities**

One common way to evaluate word2vec models is with word similarity tasks. Let's check how good our model is on one of those. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, where the task is to determine how similar two words are.

In [32]:
!curl -O http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in X contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1])) # each entry in y is the annotation how similar two words are, e.g. y[0] = 7.35
    return X, y

X, y = load_data(path)
print(X[:3], y[:3])

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [61]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
# if a word is not present in our model, we assign similarity 0 for the respective word pair

def compare_w2v(X):
    y = []
    for (w1, w2) in X: # for each pair in X
        if w1 in w2v.wv and w2 in w2v.wv: # ensure both words are in our model
            y.append(w2v.wv.similarity(w1, w2))
        else:
            print('{}: {} --- {}: {}'.format(w1, w1 in w2v.wv, w2, w2 in w2v.wv))
            y.append(0.0)
    return y

y_w2v = compare_w2v(X)

Arafat: False --- Jackson: False
asylum: True --- madhouse: False
cup: True --- tableware: False
Japanese: False --- American: False
Harvard: False --- Yale: False
Mexico: False --- Brazil: False
Mars: False --- water: True
Wednesday: False --- news: True
stock: True --- CD: False
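
Most of the misses above are capitalised words. text8 is entirely lowercase, so forms such as "Arafat" or "Harvard" cannot match. One possible refinement is to retry the lowercase form before falling back to 0. The sketch below uses a stand-in vocabulary (a plain set, not the trained model), and `lookup_key` is a hypothetical helper. Applying this idea would change the correlation reported below.

```python
def lookup_key(word, vocabulary):
    """Return the form of `word` found in `vocabulary`, trying lowercase as a fallback.

    Returns None if neither form is present.
    """
    if word in vocabulary:
        return word
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    return None

# Stand-in for the model vocabulary, invented for illustration
toy_vocabulary = {"harvard", "yale", "water", "mars", "cup"}

for w1, w2 in [("Harvard", "Yale"), ("Mars", "water"), ("cup", "tableware")]:
    print(w1, w2, lookup_key(w1, toy_vocabulary), lookup_key(w2, toy_vocabulary))
```

Words that are truly absent, like "tableware" here, still come back as `None` and would keep the similarity of 0.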


In [68]:
from scipy.stats import spearmanr

##TODO compute spearman's rank correlation between our prediction and the human annotations
spearman_w2v = spearmanr(y, y_w2v)
spearman_w2v.statistic

0.6483163976197102
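
Spearman's rank correlation is the Pearson correlation of the ranks, so only the ordering of the scores matters. That is why cosine similarities can be compared directly with human ratings on a 0 to 10 scale. The following is a minimal pure-Python sketch of the idea, not a replacement for `spearmanr`; the helper names and the four data points are made up for illustration.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human = [7.35, 10.0, 5.77, 1.2]   # illustrative ratings on a 0-10 scale
model = [0.61, 0.99, 0.40, 0.05]  # illustrative cosine similarities
print(spearman(human, model))     # identical ordering, so rho is 1 up to rounding
```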

In [69]:
import spacy
# note: the small model ships without static word vectors, so similarity() has little to work with
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings

def compare_spacy(X):
    y = []
    for (w1, w2) in X: # for each pair in X
        w1_, w2_ = en(w1), en(w2)
        y.append(w1_.similarity(w2_))
    return y

y_spacy = compare_spacy(X)

##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment

spearman_spacy = spearmanr(y, y_spacy)
spearman_spacy.statistic


0.0917488312498204

**PyTorch Embeddings**

In [33]:
# Import the AG News dataset (same as hw01)
# Download it from here:
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv', header=None) # the CSV has no header row

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K datapoints
df.head()

| | label | title | lead | text |
|---|---|---|---|---|
| 69791 | sport | Historic teams meet in World Series | CBC SPORTS ONLINE - It?s been 86 years since t... | Historic teams meet in World Series CBC SPORTS... |
| 83269 | business | Former Merrill Lynch, Enron Employees Convicte... | Description: A jury in Houston finds four form... | Former Merrill Lynch, Enron Employees Convicte... |
| 49165 | sci/tech | Translation Device Assists Minn. Police (AP) | AP - Burnsville police have begun using an ele... | Translation Device Assists Minn. Police (AP) A... |
| 96597 | business | Big Dig Leaks Worse Than Thought | Leaks in the Big Dig highway tunnel system und... | Big Dig Leaks Worse Than Thought Leaks in the ... |
| 27894 | sport | METS 7, BRAVES 0 Benson Gets 4-Hit Shutout | ris Benson, whose quot;tired shoulder quot; b... | METS 7, BRAVES 0 Benson Gets 4-Hit Shutout ris... |


In [83]:
vocab = 200
##TODO tokenize the text, only keep 200 most frequent words

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# tokenize and drop every token that is not purely alphanumeric (this removes punctuation)
df['tokens'] = df['text'].apply(lambda x: [tok for tok in word_tokenize(x) if tok.isalnum()])
flat_toks = [tok for tok_list in df['tokens'] for tok in tok_list]

# the 200 most common words come from the frequency distribution of the flattened token list
fdist = FreqDist(flat_toks).most_common(vocab)
most_common = [word for word, _ in fdist]

# keep only the tokens that belong to the 200 most frequent words
keep = set(most_common)
df['tokens'] = df['tokens'].apply(lambda toks: [tok for tok in toks if tok in keep])

In [97]:
length = 100
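#TODO create a one_hot representation for each word and truncate/pad the sequences such that they are all of the same length (here we use 100)

import torch
import torch.nn.functional as F

w_id = {word: index for index, word in enumerate(most_common)}
pad_id = vocab # extra id reserved for padding

# one-hot vector for each word id (plus the padding id), stacked as the rows of a matrix
one_hot_vec = F.one_hot(torch.arange(vocab + 1), num_classes=vocab + 1)

def encode(tokens):
    ids = [w_id[tok] for tok in tokens if tok in w_id][:length] # drop unknown words, then truncate
    return ids + [pad_id] * (length - len(ids)) # pad up to `length`

X_ids = torch.tensor([encode(toks) for toks in df['tokens']]) # shape: (number of texts, length)
# one_hot_vec[X_ids[i]] is the (length, vocab + 1) one-hot representation of text i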

In [102]:
##TODO create your torch embedding like we did in notebook 5! (hint: predicting labels: world, sport, business, and sci/tech)

import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

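class EmbeddingNet(nn.Module):
  # vocab_size: number of distinct word ids (including any padding id)
  def __init__(self, vocab_size, seq_len=100, emb_dim=2, num_classes=4):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, emb_dim) # one emb_dim-dimensional vector per word id
    self.flatten = nn.Flatten()
    self.fc1 = nn.Linear(seq_len * emb_dim, emb_dim)
    self.fc2 = nn.Linear(emb_dim, num_classes) # one logit per label: world, sport, business, sci/tech

  def forward(self, x): # x: (batch, seq_len) tensor of word ids
    x = self.embedding(x)
    x = self.flatten(x)
    x = self.fc1(x)
    x = self.fc2(x)
    return x # raw logits, to be paired with nn.CrossEntropyLoss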

## I'm not very sure what the exact task is here or how to go about it
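
For intuition on how the one-hot vectors relate to `nn.Embedding`: looking up row `i` of an embedding matrix gives the same result as multiplying the one-hot vector for `i` by that matrix. The block below is a minimal pure-Python sketch with an invented 3-word, 2-dimensional weight matrix. No PyTorch is involved, and `one_hot` and `vec_times_matrix` are hypothetical helpers.

```python
def one_hot(index, num_classes):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_times_matrix(vec, matrix):
    """Multiply a row vector of length n by an n x d matrix given as a list of rows."""
    dims = len(matrix[0])
    return [sum(vec[i] * matrix[i][d] for i in range(len(vec))) for d in range(dims)]

# Invented embedding weights: one row per word id
weights = [
    [0.1, 0.2],  # word id 0
    [0.3, 0.4],  # word id 1
    [0.5, 0.6],  # word id 2
]

word_id = 2
via_one_hot = vec_times_matrix(one_hot(word_id, len(weights)), weights)
via_lookup = weights[word_id]
print(via_one_hot, via_lookup)
```

This is why an embedding layer can take integer word ids directly: the lookup skips the multiplication by mostly-zero vectors while producing the same row.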

In [103]:
import os

os.system('jupyter nbconvert --to html homework_05.ipynb')

0