# HW05: Word Embeddings

Remember that these homeworks are graded for completion. **You can <span style="color:red">not</span> skip a section in this homework.**

**Essay Feedback**

Please provide feedback on two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of plain text from a 2006 Wikipedia dump and is a common benchmark for evaluating language models.

In [1]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [2]:
dataset = api.load("text8")



In [5]:
from gensim.models import Word2Vec

##TODO train a word2vec model on this dataset, keeping only words which appear at least 10 times in the corpus

model = Word2Vec(dataset, min_count=10)
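To make the effect of `min_count` concrete, here is a minimal sketch of frequency-based vocabulary pruning on a toy corpus. It uses plain Python rather than gensim, and the helper name `build_vocab` and the corpus are made up for illustration; gensim's internal implementation may differ in detail.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Return the tokens that occur at least `min_count` times (hypothetical helper)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, freq in counts.items() if freq >= min_count}

# Toy corpus: a list of tokenized sentences, the same shape of input Word2Vec takes.
toy_corpus = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "too"],
]

vocab_kept = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab_kept))  # ['rules', 'the']
```

Words below the threshold get no vector at all, which is why looking them up in the trained model later raises a `KeyError`.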

**Word Similarities**

gensim models provide almost all the utilities you might wish for to perform standard word similarity tasks. They are available in the .wv (word vectors) attribute of the model; more details can be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [8]:
model.wv

##TODO find the closest words to king

close_words = model.wv.most_similar('king')

for word in close_words:
    print(word)

('prince', 0.7743529081344604)
('queen', 0.7292153239250183)
('kings', 0.7031580209732056)
('throne', 0.6981783509254456)
('emperor', 0.6948561072349548)
('regent', 0.6773486137390137)
('sultan', 0.6719624400138855)
('vii', 0.6605079174041748)
('aragon', 0.6547995209693909)
('elector', 0.6542447209358215)
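The scores above are cosine similarities between word vectors. As a rough sketch of the ranking that `most_similar` performs, the following plain-Python example computes cosine similarity over three hand-made vectors. The vectors and the helper names `cosine` and `nearest` are assumptions for illustration, not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors; real word2vec vectors have 100 dimensions by default.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

for neighbour, score in nearest("king", toy_vectors):
    print(neighbour, round(score, 3))
```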


Man is to king as woman is to X

In [9]:
##TODO find the closest word for the vector "woman" + "king" - "man"

closest_word = model.wv.most_similar(positive=['woman', 'king'], negative=['man'])[0]

print(closest_word)

('queen', 0.6954731941223145)
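The analogy query boils down to vector arithmetic followed by a nearest-neighbour search that ignores the input words. Below is a simplified sketch with hand-made 2-dimensional vectors; the helper `analogy` and the vectors are hypothetical, and gensim additionally normalizes and averages the input vectors, which this sketch skips.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the (word, score) closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return max(candidates, key=lambda pair: pair[1])

# Made-up vectors: the first axis loosely encodes "royalty", the second "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.5, 0.1],
}

best_word, best_score = analogy(["woman", "king"], ["man"], toy_vectors)
print(best_word)  # queen
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs themselves.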


**Evaluate Word Similarities** 

Word similarity benchmarks are one common way to evaluate word2vec models. Let's check how well our model does on one of them. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, where the task is to determine how similar two words are.

In [10]:
!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in X contains two words, e.g. X[0] = ('tiger', 'cat')
            y.append(float(line[-1])) # each entry in y is the human rating of how similar the two words are, e.g. y[0] = 7.35
    return X, y

X, y = load_data(path)
print (X[:3], y[:3])

--2023-03-30 13:34:53--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz’


2023-03-30 13:34:54 (486 MB/s) - ‘ws353simrel.tar.gz’ saved [5460/5460]

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [14]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
# if a word is not present in our model, we assign similarity 0 to the respective word pair

sim_scores = []

for pair in X:
    try:
        score = model.wv.similarity(pair[0], pair[1])
    except KeyError:
        score = 0.0
    
    sim_scores.append(score)

print(sim_scores[:3])

[0.61210567, 1.0, 0.4190455]


In [17]:
from scipy.stats import spearmanr

##TODO compute spearman's rank correlation between our prediction and the human annotations

correlation, _ = spearmanr(y, sim_scores)

print(correlation)

0.6463631577377009
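Spearman's rank correlation only compares orderings: it is the Pearson correlation of the two rank sequences, so the different scales of the human ratings (0 to 10) and the cosine scores do not matter. The following plain-Python sketch spells this out on four made-up pairs; the function names are illustrative and `scipy.stats.spearmanr` remains the tool to use in practice.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Made-up scores for four word pairs (not taken from WordSim353).
human = [7.0, 10.0, 5.0, 1.0]
predicted = [0.6, 1.0, 0.4, 0.5]

rho = spearman(human, predicted)
print(round(rho, 3))  # 0.8
```

Only the last two pairs are ordered differently by the two lists, which is what pulls the correlation below 1.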


In [19]:
import spacy
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings

sim_scores = []

for pair in X:
    word1 = en(pair[0])
    word2 = en(pair[1])

    score = word1.similarity(word2)
    sim_scores.append(score)

print(sim_scores[:3])

##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if the results are not too convincing for this experiment: en_core_web_sm ships without static word vectors

correlation, _ = spearmanr(y, sim_scores)

print(correlation)

[0.6628428894742415, 1.0, 0.7974166052761725]
0.0917488312498204


**PyTorch Embeddings**

In [20]:
# Load the AG News dataset (same as hw01)
# Download it from here:
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000)  # only use 10K datapoints
df.head()

--2023-03-30 14:01:47--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘train.csv’


2023-03-30 14:01:47 (166 MB/s) - ‘train.csv’ saved [29470338/29470338]



| | label | title | lead | text |
|---|---|---|---|---|
| 17149 | world | Israel Hints at Military Action Vs. Syria (AP) | AP - Israel ratcheted up its rhetoric against ... | Israel Hints at Military Action Vs. Syria (AP)... |
| 115124 | sci/tech | Geminid Meteor Shower Peaks Monday Night | Viewing conditions could be just right to try ... | Geminid Meteor Shower Peaks Monday Night Viewi... |
| 78660 | world | Iran votes to resume nuclear programme | The Iranian parliament, which is dominated by ... | Iran votes to resume nuclear programme The Ira... |
| 7091 | sci/tech | 2 Vietnamese die of unidentified virus | Among three recent human deaths in Vietnam #39... | 2 Vietnamese die of unidentified virus Among t... |
| 17259 | business | Lucent Says It May Get #36;816M Tax Refund (AP) | AP - Telecommunications gear maker Lucent Tech... | Lucent Says It May Get #36;816M Tax Refund (A... |


In [22]:
vocab = 200
##TODO tokenize the text, only keep 200 most frequent words 

from gensim.utils import tokenize
from collections import Counter

tokenized_text = [list(tokenize(t, lowercase=True)) for t in df['text']]

# flatten all documents into a single token list for counting
whole_corpus = []
for t in tokenized_text:
    whole_corpus.extend(t)

# keep only the `vocab` most frequent words
common_words = {word for word, _ in Counter(whole_corpus).most_common(vocab)}

print(common_words)

{'more', 'world', 'to', 'he', 'president', 'former', 'first', 'court', 'space', 'n', 'as', 'microsoft', 'gt', 'no', 'and', 'would', 'years', 'b', 'technology', 'into', 'a', 'u', 'had', 'you', 'international', 'american', 'get', 'next', 'reported', 'what', 'south', 'oil', 'bush', 'this', 'google', 'their', 'co', 'made', 'day', 's', 'make', 'afp', 'officials', 'the', 'reuters', 'yesterday', 'been', 'third', 'year', 'after', 'week', 'software', 'company', 'they', 'it', 'york', 'deal', 'victory', 'quarter', 'just', 'computer', 'market', 'killed', 'may', 'expected', 'news', 'of', 'billion', 'said', 'san', 'night', 'washington', 'wednesday', 'could', 'announced', 'profit', 'high', 'when', 'end', 'help', 'will', 'home', 'by', 'says', 'plans', 'about', 'against', 't', 'last', 'lead', 'season', 'four', 'out', 'tuesday', 'three', 'be', 'coach', 'bank', 'its', 'an', 'thursday', 'million', 'chief', 'but', 'largest', 'cup', 'service', 'some', 'one', 'second', 'report', 'won', 'two', 'open', 'games'
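Once the most frequent words are known, each document can be reduced to those words and mapped to integer ids. The sketch below shows one way to do this in plain Python; the helpers `build_index` and `encode` and the toy documents are assumptions for illustration, and reserving id 0 for padding is a common convention rather than something the notebook prescribes.

```python
from collections import Counter

def build_index(tokenized_docs, vocab_size):
    """Map the `vocab_size` most frequent tokens to ids 1..vocab_size (0 is left for padding)."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    most_common = counts.most_common(vocab_size)
    return {token: i for i, (token, _) in enumerate(most_common, start=1)}

def encode(doc, index):
    """Replace tokens by their ids and drop tokens outside the vocabulary."""
    return [index[token] for token in doc if token in index]

# Toy documents standing in for the tokenized news texts.
docs = [
    ["oil", "prices", "rise"],
    ["oil", "prices", "fall"],
    ["oil", "deal"],
]

index = build_index(docs, vocab_size=2)
print(index)                   # {'oil': 1, 'prices': 2}
print(encode(docs[0], index))  # [1, 2]
```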

In [30]:
length = 100
#TODO create a one_hot representation for each word and truncate/pad the sequences so that they all have the same length (here we use 100)

# !pip install Keras-Preprocessing
from keras_preprocessing.text import one_hot
from keras_preprocessing.sequence import pad_sequences

# one_hot hashes each word to an integer in [1, n); n is the size of the hashing space
X_one_hot = [one_hot(text, n=len(df)) for text in df['text']]
print(X_one_hot[0][:50])

# pad short sequences with zeros and cut long ones, both at the end
X_one_hot_padded = pad_sequences(X_one_hot, padding='post', maxlen=length, truncating='post')
X_one_hot_padded.shape

[8172, 7716, 9245, 4905, 5407, 7037, 276, 7261, 7261, 8172, 3774, 702, 8015, 1977, 6987, 276, 4960, 9534, 6505, 1323, 1325, 4905, 5407, 5296, 4381, 4105, 2063, 8850, 9590, 4371, 995, 8172]


(10000, 100)
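To illustrate what `padding='post'` and `truncating='post'` do to a single sequence, here is a minimal plain-Python sketch. The helper `pad_or_truncate` is hypothetical and only mirrors the behaviour for one list at a time; `pad_sequences` itself returns a NumPy array for the whole batch.

```python
def pad_or_truncate(seq, length, pad_value=0):
    """Cut `seq` at `length` items, or fill it up at the end with `pad_value`."""
    trimmed = seq[:length]
    return trimmed + [pad_value] * (length - len(trimmed))

short_doc = [8172, 7716, 9245]
long_doc = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_doc, 5))  # [8172, 7716, 9245, 0, 0]
print(pad_or_truncate(long_doc, 5))   # [1, 2, 3, 4, 5]
```

Because the hashed word ids start at 1, the value 0 can serve as a padding marker without colliding with a real word.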

In [33]:
##TODO create your torch embedding like we did in notebook 5! (hint: predicting labels: world, sport, business, and sci/tech)

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

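class EmbeddingNet(nn.Module):
    def __init__(self, vocab_size, seq_len=100, embed_dim=2, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.flatten = nn.Flatten()
        # flattening (batch, seq_len, embed_dim) yields seq_len * embed_dim features
        self.fc1 = nn.Linear(seq_len * embed_dim, 2)
        # one output per label: world, sport, business, sci/tech
        self.fc2 = nn.Linear(2, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        # return raw logits; nn.CrossEntropyLoss applies the softmax itself
        return x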

class GenericDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return self.X[index], self.y[index]
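A dataset class only has to answer two questions: how many items there are and what item `i` is. As a conceptual sketch of how a loader uses these two methods to form mini-batches, consider the plain-Python example below. `ToyDataset` and `iterate_batches` are made-up stand-ins; the real `DataLoader` additionally shuffles, collates items into tensors, and can load in parallel.

```python
class ToyDataset:
    """Stand-in exposing the same two methods as the dataset class above."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

def iterate_batches(dataset, batch_size):
    """Yield (inputs, labels) lists with up to `batch_size` items each, in order."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        inputs, labels = zip(*items)
        yield list(inputs), list(labels)

# Three padded toy sequences with integer class labels.
toy = ToyDataset(X=[[1, 2, 0], [3, 0, 0], [4, 5, 6]], y=[0, 1, 2])

batches = list(iterate_batches(toy, batch_size=2))
for inputs, labels in batches:
    print(inputs, labels)
```

The last batch is smaller than `batch_size` whenever the dataset size is not a multiple of it.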