This notebook shows how we build a Word2Vec model and use it to compute word similarities.

In [4]:
import gensim
import pandas as pd
from gensim import corpora, models

Read in the processed dataset with the full vocabulary, meaning that less important words have not been filtered out.

In [3]:
dat = pd.read_csv('brunch_processed.csv')
reviews=dat["text"].tolist()
stars=dat["stars"].tolist()
dat.shape

(505696, 3)

Build a word dictionary with unique word ids as the keys and the corresponding words as the values.

In [5]:
rev=[]
for text in reviews:
    if not isinstance(text, str):  # skip missing reviews (pandas reads NaN as a float)
        continue
    rev.append(text.split(" "))
dic=gensim.corpora.Dictionary(rev)
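
To make the mapping concrete, here is a minimal pure-Python sketch of the id-to-word idea on a toy corpus. The names `toy_docs`, `token2id` and `id2token` are hypothetical and only for illustration; gensim's `Dictionary` maintains a comparable mapping, but the order in which it assigns ids may differ from this sketch.

```python
# Toy tokenized corpus (hypothetical data, not taken from brunch_processed.csv).
toy_docs = [["egg", "toast"], ["egg", "coffee"]]

# Assign an integer id to each word in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Invert the mapping so that ids are the keys and words are the values.
id2token = {idx: token for token, idx in token2id.items()}
print(id2token)
```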

In [6]:
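# Train a Word2Vec model on the tokenized reviews.
W2Vmodel = gensim.models.Word2Vec(
        rev,
        window=10,    # maximum distance between the target word and a context word
        min_count=5,  # ignore words that occur fewer than 5 times in total
        workers=3)    # number of worker threads used for training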
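
The `min_count=5` argument drops rare words before training. The sketch below illustrates that kind of frequency cutoff on a toy corpus using only the standard library; `toy_docs` and `toy_min_count` are hypothetical names, and this is an illustration of the idea rather than gensim's own vocabulary-building code.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data).
toy_docs = [["egg", "toast", "egg"], ["egg", "coffee"], ["toast", "latte"]]
toy_min_count = 2  # the notebook uses min_count=5 on the real corpus

# Count how often each word occurs across all documents.
counts = Counter(token for doc in toy_docs for token in doc)

# Keep only the words that reach the frequency threshold.
vocab = sorted(word for word, n in counts.items() if n >= toy_min_count)
print(vocab)
```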

Below are some uses of the model. We can find the words most similar to a given word or to a list of words, compute the similarity between two given words, and pick out the word that does not match the others.

In [7]:
W2Vmodel.wv.most_similar('taco')

[('tacos', 0.9071621298789978),
 ('street_taco', 0.8789133429527283),
 ('al_pastor', 0.8185325860977173),
 ('carnitas_taco', 0.7979053258895874),
 ('taco_carne', 0.7923363447189331),
 ('shrimp_taco', 0.7880537509918213),
 ('carne_asada', 0.7851074934005737),
 ('bean_rice', 0.77857506275177),
 ('burro', 0.7780328989028931),
 ('taquitos', 0.7740388512611389)]

In [9]:
W2Vmodel.wv.most_similar(positive=["pancake","egg","omelet","waffle"])

[('buttermilk_pancake', 0.8076778650283813),
 ('blueberry_pancake', 0.8038700819015503),
 ('scramble_egg', 0.802808403968811),
 ('scrambled_egg', 0.7944670915603638),
 ('omelette', 0.7737205028533936),
 ('french_toast', 0.7620259523391724),
 ('belgian_waffle', 0.7568501234054565),
 ('hashbrowns', 0.741797149181366),
 ('pancakes', 0.732605516910553),
 ('belgium_waffle', 0.7280862331390381)]

In [10]:
W2Vmodel.wv.similarity(w1='staff',w2='server')

0.6786361
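
The score returned by `similarity` is the cosine similarity between the two word vectors. As a minimal sketch of that formula, the example below computes it by hand for two made-up 2-dimensional vectors; `staff_vec` and `server_vec` are hypothetical values, not the vectors learned by the model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen so the arithmetic is easy to follow:
# dot = 3*4 + 4*3 = 24, both lengths are 5, so the result is 24 / 25.
staff_vec = [3.0, 4.0]
server_vec = [4.0, 3.0]
print(cosine_similarity(staff_vec, server_vec))
```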

In [11]:
W2Vmodel.wv.doesnt_match(["pancake","egg","omelet","waffle","coffee"])

'coffee'
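
One way to think about picking the odd word out is to average the normalized vectors of all the words and return the word that is least similar to that average. The sketch below follows that idea on made-up 2-dimensional vectors; `toy_vectors` and `toy_doesnt_match` are hypothetical, and this is an illustration of the concept, not a copy of gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical embeddings: three breakfast dishes point one way, coffee another.
toy_vectors = {
    "pancake": [1.0, 0.1],
    "waffle": [0.9, 0.2],
    "omelet": [0.8, 0.3],
    "coffee": [0.1, 1.0],
}

def toy_doesnt_match(words, vectors):
    units = {w: unit(vectors[w]) for w in words}
    # Centroid of the unit-length vectors, rescaled to length 1.
    centroid = unit([sum(col) / len(words) for col in zip(*units.values())])
    # Return the word with the lowest cosine similarity to the centroid.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], centroid)))

odd_one = toy_doesnt_match(["pancake", "waffle", "omelet", "coffee"], toy_vectors)
print(odd_one)
```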

I also define a function that selects the words with a similarity of at least 0.70, for future use.

In [12]:
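def above_70(w2vmodel,wordvector):
    # Return the words among the 50 nearest neighbours with similarity >= 0.70.
    result=w2vmodel.wv.most_similar(positive=wordvector,topn=50)
    word_select=[]
    for word,score in result:  # results are sorted by descending similarity
        if score<0.70:
            break
        word_select.append(word)
    return word_select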
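
The thresholding step can be tried without a trained model. The sketch below applies the same 0.70 cutoff to a hand-written list of `(word, score)` pairs using `itertools.takewhile`, which stops at the first score below the cutoff and cannot run past the end of the list. The names `toy_result` and `select_above` and all of the scores are hypothetical.

```python
from itertools import takewhile

# Hypothetical (word, similarity) pairs, sorted by descending similarity,
# in the same shape that most_similar returns.
toy_result = [("tacos", 0.91), ("street_taco", 0.88), ("burro", 0.78), ("salsa", 0.65)]

def select_above(result, threshold=0.70):
    """Keep the leading words whose similarity is at least `threshold`."""
    return [word for word, score in takewhile(lambda pair: pair[1] >= threshold, result)]

selected = select_above(toy_result)
print(selected)
```

Because the input is sorted, stopping at the first low score gives the same result as filtering the whole list.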