### Word embeddings
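Word2Vec was introduced by Google researchers in 2013, with a reference implementation in C. This notebook uses the gensim implementation.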

Intuition: "Show me your friends and I'll tell you who you are." A word is characterized by the words that appear around it.

In [1]:
import pandas as pd

#### Load Simpsons dataset

In [2]:
df = pd.read_csv('../../../data/simpsons/simpsons_script_lines.csv')


In [3]:
df[['raw_character_text', 'spoken_words']].head()

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...


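- Make sure every text document is of type 'str': any missing values are read as float NaN, which the tokenizer cannot process. Note that the cast turns such values into the literal string 'nan'.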

In [4]:
df['spoken_words'] = df.spoken_words.astype('str')

#### Data preprocessing (tokenization)

In [5]:
from gensim.utils import simple_preprocess

In [6]:
preprocessed_data = df.spoken_words.apply(simple_preprocess)
preprocessed_data.head(3)

0    [no, actually, it, was, little, of, both, some...
1                               [where, mr, bergstrom]
2    [don, know, although, sure, like, to, talk, to...
Name: spoken_words, dtype: object
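The output above shows that short fragments such as "I", "'s" and "'t" disappear. A rough sketch of why, assuming `simple_preprocess` lowercases the text, extracts runs of non-digit word characters, and keeps only tokens of 2 to 15 characters (`approx_simple_preprocess` is a hypothetical stand-in, not the gensim function itself):

```python
import re

# Runs of word characters that are not digits (letters and underscores).
_ALPHABETIC = re.compile(r"[^\W\d]+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Hypothetical approximation of gensim.utils.simple_preprocess."""
    tokens = _ALPHABETIC.findall(text.lower())
    # Drop very short and very long tokens, and tokens starting with '_'.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_simple_preprocess("Where's Mr. Bergstrom?"))
# ['where', 'mr', 'bergstrom']
```

The apostrophe splits "Where's" into "where" and "s", and the one-letter token is then filtered out by the length check.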

#### Word2Vec model

In [7]:
from gensim.models.word2vec import Word2Vec

In [8]:
w2v = Word2Vec(min_count = 20, window = 2, sample = 6e-5, alpha = 0.03, negative = 20, workers = 4)

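- vector_size: dimensionality of the word vectors (not set above, so the gensim default of 100 applies)
- window: max. distance between the current word and a context word within a sentence
- min_count: ignore words with a total frequency lower than this
- sample: threshold for randomly downsampling high-frequency words
- alpha: initial learning rate
- negative: number of "noise" words drawn per training example for negative sampling
- workers: number of worker threads used for training
- max_vocab_size: cap on unique words while building the vocabulary (about 1 GB of RAM per 10M unique words; infrequent words are pruned if the cap is exceeded; not set above, so there is no limit)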
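To make the `window` parameter concrete, here is a minimal sketch that lists which words count as context for each position in a sentence. `context_pairs` is a hypothetical helper; it ignores the random window shrinking and frequent-word subsampling that gensim applies during actual training.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs(["where", "mr", "bergstrom"], window=1))
# [('where', 'mr'), ('mr', 'where'), ('mr', 'bergstrom'), ('bergstrom', 'mr')]
```

With `window = 2`, as in the model above, every word in this three-word sentence would be context for every other word.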

#### Build vocabulary from the corpus

In [9]:
w2v.build_vocab(preprocessed_data, progress_per = 1000)
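Conceptually, building the vocabulary means counting every token and discarding the rare ones. The sketch below illustrates the `min_count` cut-off on a toy corpus; `build_vocab_sketch` is a hypothetical helper, and the real `build_vocab` does more (for example, it also prepares the subsampling and negative-sampling tables).

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count=20):
    """Count tokens across all sentences and keep those seen >= min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_corpus = [["homer", "bart"], ["homer", "lisa"], ["homer"]]
print(build_vocab_sketch(toy_corpus, min_count=2))
# {'homer': 3}
```

Words below the cut-off never receive a vector, so looking them up in the trained model raises a `KeyError`.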

#### Train model

In [10]:
%%time
w2v.train(preprocessed_data, total_examples = w2v.corpus_count, epochs = 30)

CPU times: user 1min 2s, sys: 262 ms, total: 1min 2s
Wall time: 18.7 s


(12159262, 38288880)
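The tuple returned by `train` can be read with a little arithmetic. This assumes the two numbers are, respectively, the words actually used for training (after dropping out-of-vocabulary words and subsampling frequent ones) and the raw words processed, both summed over all epochs:

```python
trained_words, raw_words = 12159262, 38288880  # values from the output above
epochs = 30

raw_per_epoch = raw_words // epochs         # raw tokens seen in one pass
kept_fraction = trained_words / raw_words   # share of tokens actually trained on

print(raw_per_epoch)            # 1276296
print(round(kept_fraction, 2))  # 0.32
```

Under that reading, the corpus holds about 1.28 million tokens, and roughly a third of them survive the `min_count` and `sample` filters.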

#### Embedded word vectors

In [11]:
w2v.wv

<gensim.models.keyedvectors.KeyedVectors at 0x7ff3a48e39d0>

In [12]:
w2v.wv['homer'], w2v.wv['bart']

(array([ 0.17007735,  0.05130703,  0.2449152 , -0.34791893, -0.34790993,
        -0.1582444 ,  0.27166662,  0.24572502, -0.18833262, -0.07470448,
         0.32395992, -0.3827506 , -0.30402797,  0.5117364 , -0.25202638,
        -0.18041186, -0.23169003,  0.07758235,  0.30381194, -0.3615166 ,
         0.4894907 ,  0.1689966 ,  0.5835631 , -0.2762664 , -0.15182684,
         0.05075058, -0.4095239 , -0.34264362, -0.36777773,  0.09464496,
        -0.37031403,  0.02066873,  0.05076038,  0.02864877, -0.21446376,
        -0.0954689 , -0.01066179, -0.03046482,  0.1248414 , -0.73715675,
        -0.00761789,  0.2902718 ,  0.02610861,  0.53760153,  0.16781178,
        -0.40810835,  0.17140816, -0.14743644,  0.20496991,  0.42047876,
         0.16198042, -0.08699372,  0.0990838 , -0.11170958, -0.27884436,
         0.06706273,  0.4690118 ,  0.04659254,  0.7466157 ,  0.3043252 ,
        -0.18438147,  0.16471702,  0.51746774, -0.18928978, -0.0459165 ,
        -0.23764385,  0.24147789,  0.75318056, -0.1

#### Word similarity (closeness in the embedding space)

In [13]:
w2v.wv.similarity('homer', 'bart')

0.61756635
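The similarity score is the cosine of the angle between the two word vectors. A minimal sketch on toy 2-d vectors (the vectors here are made up for illustration, not taken from the model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

The score depends only on direction, not on vector length, and ranges from -1 to 1.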

#### Find most similar words

In [14]:
w2v.wv.most_similar(positive = ['homer'])

[('marge', 0.7098295092582703),
 ('bart', 0.6175663471221924),
 ('lisa', 0.5872983336448669),
 ('homie', 0.5258287191390991),
 ('mr', 0.5219753384590149),
 ('becky', 0.509057879447937),
 ('you', 0.5053243041038513),
 ('simpson', 0.4968818426132202),
 ('abe', 0.45856770873069763),
 ('son', 0.45351937413215637)]

#### Odd-one-out (find the most dissimilar word within a list)

In [15]:
w2v.wv.doesnt_match(['nelson', 'bart', 'milhouse'])

'nelson'
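One way to find the odd one out, and as far as I know the approach gensim takes, is to average the unit-length vectors of all words and return the word that is least similar to that average. A sketch with made-up 2-d vectors (`odd_one_out` and the toy values are hypothetical):

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {word: unit(v) for word, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # Dot product of unit vectors equals cosine similarity.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

toy = {"bart": [1.0, 0.1], "milhouse": [0.9, 0.2], "nelson": [0.2, 1.0]}
print(odd_one_out(toy))  # nelson
```

In the toy data "bart" and "milhouse" point in nearly the same direction, so they dominate the mean and "nelson" ends up farthest from it.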

#### Analogies (vector differences)

- Which word is to 'woman' as 'homer' is to 'marge'? (vector arithmetic: woman + homer - marge)

In [17]:
w2v.wv.most_similar(positive = ["woman", "homer"], negative = ["marge"], topn = 3)

[('man', 0.5492452383041382),
 ('guy', 0.5060333609580994),
 ('person', 0.4851493239402771)]
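The analogy query above amounts to vector arithmetic followed by a nearest-neighbour search. A sketch under simplifying assumptions: made-up 2-d vectors, a plain sum of unit vectors as the query (gensim averages them, which gives the same direction), and the input words excluded from the results. `analogy` and the toy values are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)   # never return the input words
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy axes: first = male (+) / female (-), second = adult (+) / child (-).
toy = {"homer": [1, 1], "marge": [-1, 1], "man": [1, 0.8],
       "woman": [-1, 0.8], "bart": [1, -1], "lisa": [-1, -1]}
print(analogy(toy, ["woman", "homer"], ["marge"], topn=1)[0][0])  # man
```

Subtracting "marge" from "homer" leaves roughly the male-minus-female direction, and adding it to "woman" lands near "man".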

- Which word is to 'woman' as 'bart' is to 'man'? (vector arithmetic: woman + bart - man)

In [18]:
w2v.wv.most_similar(positive = ["woman", "bart"], negative = ["man"], topn = 3)

[('lisa', 0.6292288303375244),
 ('mom', 0.5806299448013306),
 ('maggie', 0.5580624938011169)]