# Word2Vec

Word2Vec feels like an incantation. In this notebook we are going to try out `gensim`'s word2vec implementation.

In [25]:
import numpy as np

In [1]:
# Imports, Functions, Stopwords
import pandas as pd, re
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import sent_tokenize
import gensim
import operator

stopwords = set(stopwords.words('english'))

# Load the Data
df = pd.read_csv('../output/TEDall.csv')

# Grab the text of the talks
talks = df.text.tolist()

## Building a Word Embedding Model

In his discussion of the `gensim` implementation, Radim Řehůřek notes that it expects a list of sentences as input, with each sentence itself a list of words.[[1](https://rare-technologies.com/word2vec-tutorial/)] Because the boundaries between individual texts do not matter to `word2vec`, we are going to bundle all our talks into a single lowercased string, then break it into sentences using the NLTK sentence tokenizer. We will then preprocess each sentence by tokenizing it, removing stop words, and dropping tokens of two characters or fewer.

In [2]:
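def sentencer(sentence):
    # Tokenize one sentence, then drop stop words and tokens of two characters or fewer.
    # `stopwords` is only read here, so no `global` declaration is needed.
    tokens = word_tokenize(sentence)
    sentenced = [token for token in tokens if token not in stopwords and len(token) > 2]
    return sentenced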
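
To make the target shape concrete, here is a minimal, dependency-free sketch of the same filtering step. The regex tokenizer and the tiny `TOY_STOPWORDS` set are stand-ins invented for illustration. The notebook itself relies on NLTK's `word_tokenize` and its English stop word list, which will split some text differently.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list.
TOY_STOPWORDS = {"are", "these", "into", "when", "is", "what", "with"}

def toy_sentencer(sentence):
    """Lowercase, tokenize with a simple regex, then drop stop words and short tokens."""
    # Keep hyphenated words such as "land-use" together.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 2]

# The result is a list of token lists, the input shape described above.
toy_sentences = [
    toy_sentencer("Are these factors taken into consideration when land-use policy is decided?"),
    toy_sentencer("What costs are associated with these decisions?"),
]
print(toy_sentences)
```

Each inner list is one sentence and each element is one surviving token.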

In [3]:
all_talks = ' '.join(talks).lower()

In [4]:
raw = sent_tokenize(all_talks)
# Check our work: print the number of sentences and three sample sentences
print(len(raw), raw[500:503])

220118 ['antiquated zoning and land-use regulations are still used to this day to continue putting polluting facilities in my neighborhood.', 'are these factors taken into consideration when land-use policy is decided?', 'what costs are associated with these decisions?']


In [5]:
sentences = [sentencer(sentence) for sentence in raw]
print(len(sentences), sentences[500:503])

220118 [['antiquated', 'zoning', 'land-use', 'regulations', 'still', 'used', 'day', 'continue', 'putting', 'polluting', 'facilities', 'neighborhood'], ['factors', 'taken', 'consideration', 'land-use', 'policy', 'decided'], ['costs', 'associated', 'decisions']]


In [6]:
w2v_model = gensim.models.Word2Vec(min_count=5, window=10, workers=3)
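
A note on the parameters: `min_count=5` tells the model to ignore any word that occurs fewer than five times in the whole corpus, `window=10` sets the maximum distance between a target word and the context words around it, and `workers=3` sets the number of training threads. The sketch below imitates only the frequency threshold, using a made-up toy corpus and a hypothetical helper named `surviving_vocab`. It is not gensim's own vocabulary-building code.

```python
from collections import Counter

# Hypothetical toy corpus in the same list-of-token-lists shape as `sentences`.
toy_corpus = [
    ["climate", "warming", "policy"],
    ["climate", "policy", "costs"],
    ["climate", "biodiversity"],
]

def surviving_vocab(corpus, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# With a threshold of 2, only words seen at least twice are kept.
print(sorted(surviving_vocab(toy_corpus, min_count=2)))
```

Raising `min_count` shrinks the vocabulary and removes words for which there is too little evidence to learn a useful vector.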

In [8]:
w2v_model.build_vocab(sentences)

In [9]:
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(45279594, 49749630)
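
As we understand the `gensim` API, `train` returns a pair: the number of words actually used for training (after rare words are dropped and very frequent words are randomly downsampled) and the raw number of words seen, both summed over all epochs. Under that reading, the output above works out as follows. The variable names are ours, chosen for illustration.

```python
# Values copied from the training output above.
effective_words, raw_words = 45279594, 49749630
epochs = 30

# Raw tokens seen in a single pass over the corpus.
raw_per_epoch = raw_words // epochs

# Share of the raw tokens that were actually used for training.
kept_share = effective_words / raw_words

print(raw_per_epoch, round(kept_share, 3))
```

The raw count divides evenly by the 30 epochs, which is consistent with the totals being accumulated across passes.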

In [10]:
w2v_model.wv.most_similar(positive=["climate"])

[('warming', 0.5628418922424316),
 ('deforestation', 0.5012110471725464),
 ('environmental', 0.4848070740699768),
 ('greenhouse', 0.4760082960128784),
 ('pollutants', 0.4671592116355896),
 ('biodiversity', 0.466927707195282),
 ('impacts', 0.46144992113113403),
 ('brink', 0.4613882303237915),
 ('disasters', 0.4569864273071289),
 ('two-degree', 0.4561028480529785)]
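
The scores returned by `most_similar` are cosine similarities between word vectors, so they fall between -1 and 1, and higher values mean the vectors point in more similar directions. Here is a minimal sketch of that computation. The three-dimensional vectors are made up for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from w2v_model.
toy_vectors = {
    "climate": [1.0, 2.0, 0.0],
    "warming": [2.0, 3.0, 1.0],
    "banana": [-2.0, 1.0, 3.0],
}

for word in ("warming", "banana"):
    score = cosine_similarity(toy_vectors["climate"], toy_vectors[word])
    print(word, round(score, 3))
```

In this toy setup "warming" scores close to 1 against "climate", while "banana" is orthogonal to it and scores 0.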

In [14]:
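# The vocabulary lives on the model's KeyedVectors (gensim 4.x; on 3.x use len(w2v_model.wv.vocab))
# print( f"Model has {len(w2v_model.wv)} terms." )

# w2v_model.save("../output/w2v_model.bin")

# To re-load this model, run
# w2v_model = gensim.models.Word2Vec.load("../output/w2v_model.bin")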