**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of cleaned text from a 2006 Wikipedia dump and is a common benchmark for evaluating language models.


In [None]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [None]:
dataset = api.load("text8")



In [None]:
dataset

<text8.Dataset at 0x7fccf8599430>

In [None]:
# LineSentence is a class that ships with gensim, not a separate PyPI package,
# so it is imported rather than installed with pip.
from gensim.models.word2vec import LineSentence

In [None]:
!pip install --upgrade gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from gensim.models import Word2Vec
import gensim.downloader as api

# Load the "text8" dataset from the gensim downloader
dataset = api.load("text8")

# Train a Word2Vec model; with no extra arguments, gensim's default hyperparameters are used
model = Word2Vec(dataset)



**Word Similarities**

gensim models provide almost all the utilities you need for standard word similarity tasks. They are available through the .wv (word vectors) attribute of the model; more details can be found in the gensim documentation.


In [None]:
##TODO find the closest words to king
model.wv.most_similar("king")

[('prince', 0.7440363764762878),
 ('queen', 0.7238773703575134),
 ('emperor', 0.7014837265014648),
 ('throne', 0.6960609555244446),
 ('kings', 0.6862594485282898),
 ('vii', 0.6750237345695496),
 ('aragon', 0.6732887625694275),
 ('regent', 0.664305567741394),
 ('sultan', 0.6592472195625305),
 ('pope', 0.6572885513305664)]
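
The scores in the list above are cosine similarities between word vectors. The ranking can be sketched without gensim as follows. This is a minimal illustration, not gensim's implementation: the three-dimensional `toy_vectors` and the helper names `cosine` and `most_similar` are made up for the example.

```python
import math

# Hypothetical 3-dimensional toy vectors; real word2vec vectors are learned
# from data and have many more dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

ranking = most_similar("king", toy_vectors)
print(ranking)
```

In this toy setup "queen" ranks above "apple" because its vector points in almost the same direction as the one for "king".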



Man is to king as woman is to X


In [None]:
##TODO find the closest word for the vector "woman" + "king" - "man"
model.wv.most_similar(positive=["woman", "king"], negative=["man"])

[('queen', 0.6874383687973022),
 ('prince', 0.6202050447463989),
 ('throne', 0.6170430779457092),
 ('princess', 0.6168161034584045),
 ('sigismund', 0.6153447031974792),
 ('isabella', 0.6132460832595825),
 ('empress', 0.6125864386558533),
 ('son', 0.6124827265739441),
 ('matilda', 0.6055944561958313),
 ('elizabeth', 0.5988599061965942)]
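
The analogy query above boils down to vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on made-up two-dimensional vectors (axis 0 loosely stands for "royalty", axis 1 for "gender"). The `toy_vectors` and the `analogy` helper are hypothetical. gensim additionally unit-normalizes the input vectors before combining them, which this sketch skips.

```python
import math

# Hypothetical 2-dimensional toy vectors, chosen by hand for illustration.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return max(candidates, key=lambda item: item[1])

best_word, best_score = analogy(["woman", "king"], ["man"], toy_vectors)
print(best_word, round(best_score, 3))
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs themselves.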



**Evaluate Word Similarities**

One common way to evaluate word2vec models is with word similarity tasks. Let's check how good our model is on one of those. We use the WordSim353 benchmark, where the task is to determine how similar two words are; the model's scores are then compared with human annotations.


In [None]:
!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in X contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1])) # each entry in y is the annotated similarity of the two words, e.g. y[0] = 7.35
    return X, y

X, y = load_data(path)
print(X[:3], y[:3])

--2023-03-23 10:10:04--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz’


2023-03-23 10:10:04 (548 MB/s) - ‘ws353simrel.tar.gz’ saved [5460/5460]

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [None]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
# if a word is not present in our model, we assign similarity 0 to that pair

import pandas as pd
import numpy as np

similarities = []

for pair in X:
    # key_to_index is a dict, so the membership test does not rebuild a vocabulary list for every pair
    if pair[0] in model.wv.key_to_index and pair[1] in model.wv.key_to_index:
        similarities.append(model.wv.similarity(pair[0], pair[1]))
    else:
        similarities.append(0)

similarities_summary = pd.DataFrame(list(zip(X, y, similarities)), columns=['pairs', 'similarities_wordsim', 'similarities_w2v'])
similarities_summary.head()

Unnamed: 0,pairs,similarities_wordsim,similarities_w2v
0,"(tiger, cat)",7.35,0.598704
1,"(tiger, tiger)",10.0,1.0
2,"(plane, car)",5.77,0.451379
3,"(train, car)",6.31,0.555301
4,"(television, radio)",6.77,0.717568


In [None]:
from scipy.stats import spearmanr

spearmanr(y, similarities)

SignificanceResult(statistic=0.6373903244607537, pvalue=1.5426954517594767e-24)
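
Spearman's rank correlation compares only the ordering of the two score lists, which is why human ratings on a 0 to 10 scale and cosine similarities can be compared directly. A minimal pure-Python sketch of the statistic follows, applied to the five pairs shown in the table above. The helper names `average_ranks`, `pearson` and `spearman` are made up for the example; `scipy.stats.spearmanr` remains the practical choice and also reports a p-value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# The first five WordSim353 pairs from the table above.
human = [7.35, 10.0, 5.77, 6.31, 6.77]
model_scores = [0.598704, 1.0, 0.451379, 0.555301, 0.717568]

rho = spearman(human, model_scores)
print(round(rho, 3))
```

On these five pairs only two neighbouring items swap places, so the correlation is high; the value for the full benchmark is the one reported by `spearmanr` above.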

In [None]:
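import spacy
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment:
# en_core_web_sm ships without static word vectors, so similarity() falls back to context tensors

sim_spacy = []

for pair in X:
    # Doc.similarity compares the two spaCy documents; the word2vec model is not involved here
    sim_spacy.append(en(pair[0]).similarity(en(pair[1])))

similarities_summary['spacy'] = sim_spacy

# spearmanr was imported from scipy.stats in the previous cell
print(spearmanr(y, sim_spacy))

similarities_summary.head()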


**PyTorch Embeddings**

In [None]:
# Import the AG News dataset (same as hw01)
# Download it from the URL below
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
# The CSV has no header row, so name the columns explicitly instead of losing the first record
df = pd.read_csv('train.csv', header=None, names=["label", "title", "lead"])

label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K datapoints
df.head()

--2023-03-23 10:36:42--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘train.csv’


2023-03-23 10:36:45 (130 MB/s) - ‘train.csv’ saved [29470338/29470338]



Unnamed: 0,label,title,lead,text
52916,sci/tech,Monti: Courts must rule on MS anti-trust,"Mario Monti, the outgoing European competition...",Monti: Courts must rule on MS anti-trust Mario...
50946,sci/tech,AT T looks into closing Windows,A team of researchers is evaluating how Linux ...,AT T looks into closing Windows A team of rese...
15836,sport,Citadel Postpones Opener Over Hurricane (AP),AP - The Citadel has postponed its home opener...,Citadel Postpones Opener Over Hurricane (AP) A...
75279,world,Two Moderate Earthquakes Hit Taiwan (AP),AP - Two moderate earthquakes hit eastern Taiw...,Two Moderate Earthquakes Hit Taiwan (AP) AP - ...
36427,sci/tech,PeopleSoft devotees in denial?,With software tycoon Larry Ellison poised to d...,PeopleSoft devotees in denial? With software t...


In [None]:
vocab = 200
##TODO tokenize the text, only keep 200 most frequent words 
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
# Define the CountVectorizer so that it keeps only the `vocab` most frequent words
vectorizer = CountVectorizer(max_features=vocab)

# Fit the CountVectorizer to the text data in the 'text' column of the DataFrame
vectorizer.fit(df['text'])

# Transform the 'text' column into a sparse matrix of word counts
X = vectorizer.transform(df['text'])

# Convert the sparse matrix into a dense matrix for further processing
X_dense = X.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())


In [None]:
#TODO create a one_hot representation for each word and truncate/p
# Create a one-hot encoder object
onehot_encoder = OneHotEncoder()

# Fit the one-hot encoder to the CountVectorizer output
onehot_encoder.fit(X_dense)
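
The TODO above asks for a one-hot vector per word, with each text truncated or padded. One way this could look in plain Python is sketched below. Everything here is an assumption made for illustration: the helper names `build_vocab` and `one_hot_sequence`, the `seq_len` parameter (the source does not state a target length), whitespace tokenization, and the choice to drop out-of-vocabulary tokens rather than map them to a dedicated "unknown" index.

```python
from collections import Counter

def build_vocab(texts, max_words):
    """Map the `max_words` most frequent lower-cased tokens to integer ids."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}

def one_hot_sequence(text, vocab, seq_len):
    """Return exactly `seq_len` rows: one one-hot row per in-vocabulary token,
    truncated if the text is too long and zero-padded if it is too short."""
    ids = [vocab[token] for token in text.lower().split() if token in vocab][:seq_len]
    rows = []
    for idx in ids:
        row = [0] * len(vocab)
        row[idx] = 1
        rows.append(row)
    # Pad with all-zero rows so every text has the same shape.
    rows.extend([0] * len(vocab) for _ in range(seq_len - len(rows)))
    return rows

# Tiny hypothetical corpus standing in for df["text"].
texts = ["the cat sat", "the cat ran", "the dog"]
toy_vocab = build_vocab(texts, max_words=2)

padded = one_hot_sequence("the dog chased the cat", toy_vocab, seq_len=4)
truncated = one_hot_sequence("the the the cat", toy_vocab, seq_len=2)
print(toy_vocab)
print(padded)
print(truncated)
```

Each text then becomes a `seq_len` x vocabulary-size matrix, which is the shape a word-level embedding layer or linear layer would expect as input.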