# Word2Vec

Word2Vec feels like an incantation. In this notebook we are going to try out `gensim`'s word2vec implementation.

In [1]:
# Imports, Functions, Stopwords
import pandas as pd, re
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import sent_tokenize
import gensim

stopwords = set(stopwords.words('english'))

parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)",
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)",
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)",
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)",
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)",
                  r"\(drum sounds\)" ]

def remove_parens(text):
    # Lowercase once, then replace each parenthetical stage direction with a space
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

# Load the Data
df = pd.read_csv('../output/TEDall.csv')

# Grab the text of the talks
talks = df.text.tolist()

# Create some labels we can use later but remove the redundant parts of the URL
labels = [re.sub(r'https://www\.ted\.com/talks/', '', item) for item in df.public_url.tolist()]

## Building a Word Embedding Model

In his discussion of the `gensim` implementation, Radim Řehůřek notes that it expects a sequence of sentences as input [[1](https://rare-technologies.com/word2vec-tutorial/)]. Each sentence in that sequence is itself a list of tokens, not a raw string. We therefore join all the talks into a single lowercase string, split it into sentences with the NLTK sentence tokenizer, and then tokenize each sentence, dropping stop words and tokens of two characters or fewer.
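
As a rough sketch of that input shape (a list of token lists, one per sentence), the example below processes two made-up strings. The regex-based `split_sentences` and `tokenize` helpers and the tiny `STOPWORDS` set are hypothetical stand-ins for NLTK's `sent_tokenize`, `word_tokenize`, and English stop word list, used only so the example runs without any downloads.

```python
import re

# Hypothetical stand-ins for NLTK's tokenizers and stop word list.
STOPWORDS = {"the", "and", "are", "this", "that", "for"}

def split_sentences(text):
    # Naive split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lowercase alphabetic runs, allowing internal hyphens or apostrophes
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower())

def to_token_lists(texts):
    # Join all texts, split into sentences, then filter tokens per sentence
    joined = " ".join(texts)
    return [
        [tok for tok in tokenize(sent) if tok not in STOPWORDS and len(tok) > 2]
        for sent in split_sentences(joined)
    ]

sample_talks = [
    "Zoning rules are old. Are these factors considered?",
    "What costs are associated with this?",
]
token_lists = to_token_lists(sample_talks)
print(token_lists)
```

Each inner list is one sentence, which is the unit a Word2Vec context window slides over.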

In [2]:
def sentencer(sentence):
    # Tokenize one sentence, dropping stop words and tokens of 1-2 characters
    tokens = word_tokenize(sentence)
    sentenced = [token for token in tokens if token not in stopwords and len(token) > 2]
    return sentenced

In [3]:
all_talks = ' '.join(talks).lower()

In [4]:
raw = sent_tokenize(all_talks)

In [5]:
print(len(raw), raw[500:503])

220118 ['antiquated zoning and land-use regulations are still used to this day to continue putting polluting facilities in my neighborhood.', 'are these factors taken into consideration when land-use policy is decided?', 'what costs are associated with these decisions?']


In [None]:
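# Cleaned tokens are re-joined into strings here; Word2Vec itself
# needs the token lists, i.e. sentencer(sentence) without the join.
sentences = [' '.join(sentencer(sentence)) for sentence in raw]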

In [None]:
print(len(sentences), sentences[500:503])

In [6]:
# Caution: `raw` holds strings, not token lists, so gensim treats each character as a token.
w2v_raw = gensim.models.Word2Vec(raw, min_count=10)

In [7]:
# `wv.vocab` exists in gensim < 4.0; gensim 4+ uses `wv.key_to_index` instead
print(f"Model has {len(w2v_raw.wv.vocab)} terms.")

Model has 76 terms.
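
A vocabulary this small is consistent with training on raw strings: Word2Vec iterates over each sentence item by item, and iterating over a Python string yields single characters. The sketch below imitates that vocabulary count with `collections.Counter` rather than gensim itself. The `count_vocab` helper and the sample sentences are hypothetical, and its `min_count` argument only mirrors the meaning of the gensim parameter of the same name.

```python
from collections import Counter

def count_vocab(sentences, min_count=1):
    # Count every item yielded by iterating each sentence, then keep
    # the items that occur at least min_count times.
    counts = Counter(item for sentence in sentences for item in sentence)
    return {item for item, n in counts.items() if n >= min_count}

raw_sentences = ["the cat sat", "the dog sat"]
tokenized = [s.split() for s in raw_sentences]

char_vocab = count_vocab(raw_sentences)   # strings iterate as characters
word_vocab = count_vocab(tokenized)       # lists iterate as whole tokens

print(sorted(char_vocab))
print(sorted(word_vocab))
```

Passing token lists instead of strings is what turns the vocabulary from characters into words.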


In [9]:
w2v_raw.save("../output/w2v_raw_model.bin")

In [None]:
# To re-load this model, run (same path as the save call above)
#w2v_model = gensim.models.Word2Vec.load("../output/w2v_raw_model.bin")

## Calculating K

In [None]:
from itertools import combinations
import numpy as np

def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations( term_rankings[topic_index], 2 ):
            # similarity lives on the KeyedVectors object (`.wv`)
            pair_scores.append( w2v_model.wv.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

def get_descriptor( all_terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( all_terms[term_index] )
    return top_terms
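
Read as a formula, `calculate_coherence` averages pairwise similarities twice: first over all term pairs within a topic, then over topics. The notation below is introduced here for illustration and does not appear in the notebook: $K$ is the number of topics, $t_{k,i}$ is the $i$-th of the $N$ top terms for topic $k$, and $\operatorname{sim}$ is the Word2Vec similarity between two terms.

$$
\text{coherence} = \frac{1}{K} \sum_{k=1}^{K} \left[ \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \operatorname{sim}\left(t_{k,i},\, t_{k,j}\right) \right]
$$

With 10 terms per topic, each topic contributes $\binom{10}{2} = 45$ pairs to its inner average.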

In [None]:
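# `topic_models` (a list of (k, W, H) tuples) and `terms` are not defined in this
# notebook; they are assumed to come from an earlier topic-modeling step.
k_values = []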
coherences = []
for (k,W,H) in topic_models:
    # Get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append( get_descriptor( terms, H, topic_index, 10 ) )
    # Now calculate the coherence based on our Word2vec model
    k_values.append( k )
    coherences.append( calculate_coherence( w2v_model, term_rankings ) )
    print(f"K={k}: Coherence={coherences[-1]:.4f}")
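
To make the scoring concrete without a trained model or the `topic_models` list, the sketch below applies the same mean-pairwise-similarity idea to hand-made two-dimensional vectors and then picks the candidate with the highest score. Everything in it is hypothetical: the `toy_vectors`, the `cosine` stand-in for Word2Vec similarity, the candidate term rankings, and the choice of the maximum-coherence `k`, which is one common selection rule rather than a step shown in this notebook.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 2-D "embeddings"; a real run would use the Word2Vec vectors.
toy_vectors = {
    "city": (1.0, 0.0), "zoning": (0.9, 0.1), "land": (0.8, 0.2),
    "music": (0.0, 1.0), "guitar": (0.1, 0.9), "drum": (0.2, 0.8),
}

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def toy_coherence(term_rankings):
    # Mean over topics of the mean pairwise similarity within each topic
    topic_scores = []
    for terms_in_topic in term_rankings:
        pairs = list(combinations(terms_in_topic, 2))
        topic_scores.append(
            sum(cosine(toy_vectors[a], toy_vectors[b]) for a, b in pairs) / len(pairs)
        )
    return sum(topic_scores) / len(topic_scores)

# Hypothetical term rankings for two candidate values of k
candidates = {
    2: [["city", "zoning", "land"], ["music", "guitar", "drum"]],
    3: [["city", "music", "land"], ["zoning", "guitar"], ["drum", "land"]],
}
scores = {k: toy_coherence(rankings) for k, rankings in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The candidate whose topics group similar terms together scores higher, which is the behavior the coherence loop relies on when comparing values of `k`.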