# Word2Vec Embeddings

In [1]:
import pandas as pd

from gensim.models import Word2Vec
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

## Loading the data used to train the word embeddings

The full training file is too large to push to GitHub, and I ran into several problems with Git LFS, so the GitHub repository holds only a sample of the training set. The file 'training_datasets_access' in the Datasets folder contains a Google Drive link to the full training set and to a folder with the trained Word2Vec embeddings used in the machine learning models.

For this demonstration, we train Word2Vec embeddings on a sample of the training tweets. The resulting embeddings are less accurate than those trained on the entire training dataset.

In [37]:
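col_list = ['SpellCheckTweets']
# Read only the spell-checked tweet column, as strings
data = pd.read_csv('Datasets/train_tweets_clean.csv', encoding='utf-8', usecols=col_list, dtype=str)
# Missing tweets are read as NaN; cast every value to str so the tokenizer below does not fail
data['SpellCheckTweets'] = data['SpellCheckTweets'].apply(str)
data.head()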

In [26]:
# Tokenizer that keeps runs of word characters and drops punctuation
rem_tok_punc = RegexpTokenizer(r'\w+')

# One list of word tokens per tweet
indv_lines = data['SpellCheckTweets'].values.tolist()
tweet_data_list = [rem_tok_punc.tokenize(line) for line in indv_lines]
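
To see what the `\w+` pattern keeps and drops, the following minimal sketch applies the same regular expression with Python's built-in `re` module, which for this pattern should behave like `RegexpTokenizer(r'\w+').tokenize`. The `tokenize` helper and the sample tweet are made up for illustration and are not part of the dataset.

```python
import re

# Same pattern as the RegexpTokenizer above: runs of letters, digits, and underscores.
WORD_PATTERN = re.compile(r'\w+')


def tokenize(text):
    """Return the word tokens of `text`, dropping punctuation and whitespace."""
    return WORD_PATTERN.findall(text)


# Hypothetical tweet, not taken from the dataset.
sample_tweet = "so sad... can't believe it's over :( #finale"
print(tokenize(sample_tweet))
# ['so', 'sad', 'can', 't', 'believe', 'it', 's', 'over', 'finale']
```

Note that contractions are split at the apostrophe and the `#` of a hashtag is dropped, so `can't` becomes the two tokens `can` and `t`.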

In [27]:
print(len(tweet_data_list))

199998


## Training the Word2Vec model

In [28]:
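# Dimension of the word embedding
embed_dim = 100

# Train Word2Vec model (gensim 3.x API; gensim 4 renamed `size` to `vector_size`)
# min_count = 1 keeps every token, including words that appear only once
model = Word2Vec(sentences = tweet_data_list, size = embed_dim, workers = 4, min_count = 1)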

In [29]:
# Save only the word vectors in plain-text word2vec format
model_file = 'Datasets/Word2Vec_embedding.txt'
model.wv.save_word2vec_format(model_file, binary=False)

# Save the full model so it can be reloaded later with Word2Vec.load
model.save("Datasets/Word2Vec.model")
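
The text file written by `save_word2vec_format(..., binary=False)` starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word: the word, then its vector components, separated by spaces. A minimal sketch of reading that layout with the standard library is shown here. The helper name `read_word2vec_text` and the two-word sample are hypothetical; in practice gensim's `KeyedVectors.load_word2vec_format` does this job.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format into (dimension, {word: [floats]})."""
    # Header line: "<vocabulary size> <vector dimension>"
    vocab_size, dim = (int(n) for n in handle.readline().split())
    vectors = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue  # tolerate a trailing blank line
        parts = line.split(' ')
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file")
    return dim, vectors


# Made-up two-word, three-dimensional sample in the same layout.
sample = io.StringIO("2 3\nsad 0.1 -0.2 0.3\nhappy 0.4 0.5 -0.6\n")
dim, vectors = read_word2vec_text(sample)
print(dim, vectors['sad'])
```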

To save time, you can skip the two training cells above and load a pre-trained Word2Vec model by running the following cell instead.

Here, we load the Word2Vec embeddings that were trained on the entire training dataset. To access this Word2Vec model, follow the Google Drive link located in 'Datasets\training_datasets_access', open the 'Embeddings' folder on the Drive, and download the following files, keeping them together in the notebook's working directory:

Word2Vec.model

Word2Vec.model.trainables.syn1neg.npy

Word2Vec.model.wv.vectors.npy

In [38]:
Word2Vec_model = 'Word2Vec.model'

# Load trained Word2Vec model
model = Word2Vec.load(Word2Vec_model)

## Exploring the vectors in the Word2Vec embeddings

In [39]:
# Finding similar words
model.wv.most_similar('sad')

[('upset', 0.7530679106712341),
 ('depressed', 0.7351791262626648),
 ('heartbreaking', 0.6865273714065552),
 ('disappointed', 0.6809518337249756),
 ('depressing', 0.6282151341438293),
 ('emotional', 0.6175241470336914),
 ('angry', 0.5949232578277588),
 ('bad', 0.592631995677948),
 ('cry', 0.586841344833374),
 ('mad', 0.5740446448326111)]
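
The scores returned by `most_similar` are cosine similarities between the query word's vector and each candidate's vector. As a rough illustration of that measure, the sketch below computes it for made-up three-dimensional vectors. The vectors and the helpers `cosine_similarity` and `rank_neighbours` are assumptions for the example and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(query, vectors):
    """Rank every other word by cosine similarity to `query`, best first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors chosen by hand for illustration only.
toy_vectors = {
    'sad': [1.0, 0.0, 0.0],
    'upset': [0.8, 0.6, 0.0],
    'pizza': [0.0, 0.0, 1.0],
}

ranked = rank_neighbours('sad', toy_vectors)
print(ranked)  # 'upset' ranks above 'pizza'
```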

In [40]:
# Vector arithmetic on word vectors: queen + man - woman = ?
model.wv.most_similar_cosmul(positive=['queen','man'], negative=['woman'])

[('king', 0.8694459199905396),
 ('starplus', 0.8644324541091919),
 ('jiggaboo', 0.8369128704071045),
 ('boxspring', 0.8362486958503723),
 ('queens', 0.8203683495521545),
 ('cabino', 0.8177290558815002),
 ('iiked', 0.8144375681877136),
 ('dannngbulls', 0.8141029477119446),
 ('newtrack', 0.8045833110809326),
 ('ibetshewouldwin', 0.8024565577507019)]
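
`most_similar_cosmul` does not add and subtract vectors directly. It ranks candidates with the multiplicative combination objective proposed by Levy and Goldberg: each cosine is shifted into the range [0, 1], the values for the positive words are multiplied together, and the product is divided by the product for the negative words plus a small constant. A minimal sketch with made-up 2-d vectors follows; the vectors, the `cosmul_score` helper, and the epsilon value are illustrative assumptions.

```python
import math


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def shifted_cosine(u, v):
    """Cosine of two unit vectors, mapped from [-1, 1] to [0, 1]."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) / 2.0


def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of positive similarities divided by product of negative ones."""
    numerator = math.prod(shifted_cosine(candidate, p) for p in positive)
    denominator = math.prod(shifted_cosine(candidate, n) for n in negative) + eps
    return numerator / denominator


# Hand-made toy vectors: first axis ~ "royalty", second axis ~ "male vs female".
toy = {
    'queen': unit([1.0, -1.0]),
    'king': unit([1.0, 1.0]),
    'man': unit([0.0, 1.0]),
    'woman': unit([0.0, -1.0]),
    'apple': unit([-1.0, 0.0]),
}

positive = [toy['queen'], toy['man']]
negative = [toy['woman']]
candidates = ['king', 'apple']  # the query words themselves are excluded

scores = {w: cosmul_score(toy[w], positive, negative) for w in candidates}
best = max(scores, key=scores.get)
print(best)  # king
```

Because the trained model keeps every token (`min_count = 1`), rare and misspelled words can also score highly in the real output above.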

In [41]:
#Finding the odd word out from the list of words given
print(model.wv.doesnt_match("france switzerland germany england usa".split()))

usa
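
One way to picture `doesnt_match`: normalise each word's vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below reproduces this idea on made-up 2-d vectors; the vectors and the `odd_one_out` helper are assumptions for illustration, not the model's actual values.

```python
import math


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Dot product of unit vectors equals their cosine similarity.
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Hand-made toy vectors: three words point one way, one points elsewhere.
toy_vectors = {
    'france': [1.0, 0.1],
    'germany': [0.9, 0.2],
    'england': [1.0, 0.0],
    'usa': [0.2, 1.0],
}

print(odd_one_out(['france', 'germany', 'england', 'usa'], toy_vectors))  # usa
```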




## Visualising the word embedding vectors

In [42]:
# Importing bokeh libraries for showing how words of similar context are grouped together
import bokeh.plotting as bp

from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

In [43]:
# Vocabulary words in the model (gensim 3.x API)
VocabKeys = list(model.wv.vocab.keys())


#Defining the chart
output_notebook()
plot_chart = bp.figure(plot_width=700, plot_height=600, title="A Plot of 5000 Word Vectors",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

#Extracting the list of word vectors, limiting to the first 5000 vocabulary entries
word_vectors = [model.wv[w] for w in VocabKeys[:5000]]



## Reducing dimensionality by projecting the vectors to 2D

## TSNE

In [44]:
from sklearn.manifold import TSNE

In [45]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_w2v = tsne_model.fit_transform(word_vectors)

# Storing data in a dataframe
tsne_df = pd.DataFrame(tsne_w2v, columns=['x', 'y'])
tsne_df['words'] = VocabKeys[:5000]

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5000 samples in 0.001s...
[t-SNE] Computed neighbors for 5000 samples in 0.719s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5000
[t-SNE] Computed conditional probabilities for sample 2000 / 5000
[t-SNE] Computed conditional probabilities for sample 3000 / 5000
[t-SNE] Computed conditional probabilities for sample 4000 / 5000
[t-SNE] Computed conditional probabilities for sample 5000 / 5000
[t-SNE] Mean sigma: 0.149540
[t-SNE] KL divergence after 250 iterations with early exaggeration: 83.059326
[t-SNE] KL divergence after 1000 iterations: 2.212246


In [46]:
# Corresponding word appears when you hover on the data point.
plot_chart.scatter(x='x', y='y', source=tsne_df)
hover = plot_chart.select(dict(type=HoverTool))
hover.tooltips={"word": "@words"}
show(plot_chart)

## PCA

In [47]:
from sklearn.decomposition import PCA

In [48]:
pca_model = PCA(n_components=2, random_state=0)
pca_w2v = pca_model.fit_transform(word_vectors)

# Storing data in a dataframe
pca_df = pd.DataFrame(pca_w2v, columns=['x', 'y'])
pca_df['words'] = VocabKeys[:5000]

In [49]:
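# Use a fresh figure so the PCA points are not drawn on top of the t-SNE scatter above.
pca_chart = bp.figure(plot_width=700, plot_height=600, title="A Plot of 5000 Word Vectors (PCA)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)
# Corresponding word appears when you hover on the data point.
pca_chart.scatter(x='x', y='y', source=pca_df)
hover = pca_chart.select(dict(type=HoverTool))
hover.tooltips={"word": "@words"}
show(pca_chart)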