Using Gensim Word2Vec to create a food-related word embedding. 
See [https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial)

Connect to my Google Drive so I don't have to upload reviews.csv every time. 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import re # For preprocessing
import pandas as pd # For data handling
from time import time # To time our operations
from collections import defaultdict # For word frequency

import spacy # For preprocessing

import logging # Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", 
                    datefmt = '%H%M%S', level=logging.INFO)
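
Log lines that begin with the literal text `5(levelnames)s` are what the `logging` module prints when the `%(levelname)s` placeholder is mistyped: without the leading `%`, the token is treated as ordinary text. The sketch below renders one record with each format string so the difference is visible. The logger name `format_demo` and the helper `render` are made up for this illustration.

```python
import io
import logging

def render(fmt: str, message: str) -> str:
    """Format one INFO record with the given format string and return the text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt, datefmt="%H%M%S"))
    logger = logging.getLogger("format_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep the demo out of the root logger's output
    logger.handlers = [handler]  # replace any handler left from a previous call
    logger.info(message)
    return stream.getvalue().strip()

good = render("%(levelname)s - %(asctime)s: %(message)s", "collecting all words")
typo = render("5(levelnames)s - %(asctime)s: %(message)s", "collecting all words")
print(good)  # starts with "INFO - "
print(typo)  # starts with the literal text "5(levelnames)s - "
```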

We are using the Food.com dataset prepared by Alvin on Kaggle: [https://www.kaggle.com/irkaal/foodcom-recipes-and-reviews](https://www.kaggle.com/irkaal/foodcom-recipes-and-reviews)

In [None]:
df = pd.read_csv('/content/drive/MyDrive/reviews.csv') # read in the review dataset
df.head() #check the column names

|   | ReviewId | RecipeId | AuthorId | AuthorName | Rating | Review | DateSubmitted | DateModified |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 992 | 2008 | gayg msft | 5 | better than any you can get at a restaurant! | 2000-01-25T21:44:00Z | 2000-01-25T21:44:00Z |
| 1 | 7 | 4384 | 1634 | Bill Hilbrich | 4 | I cut back on the mayo, and made up the differ... | 2001-10-17T16:49:59Z | 2001-10-17T16:49:59Z |
| 2 | 9 | 4523 | 2046 | Gay Gilmore ckpt | 2 | i think i did something wrong because i could ... | 2000-02-25T09:00:00Z | 2000-02-25T09:00:00Z |
| 3 | 13 | 7435 | 1773 | Malarkey Test | 5 | easily the best i have ever had. juicy flavor... | 2000-03-13T21:15:00Z | 2000-03-13T21:15:00Z |
| 4 | 14 | 44 | 2085 | Tony Small | 5 | An excellent dish. | 2000-03-28T12:51:00Z | 2000-03-28T12:51:00Z |


In [None]:
# Drop rows with missing values (e.g. reviews without text), then confirm none remain
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

5(levelnames)s - 212753: NumExpr defaulting to 2 threads.


ReviewId         0
RecipeId         0
AuthorId         0
AuthorName       0
Rating           0
Review           0
DateSubmitted    0
DateModified     0
dtype: int64

In [None]:
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser']) # disable NER and the parser for speed

def cleaning(doc): 
  # Lemmatize and remove stop words, e.g. "running", "ran" and "run" all become "run".
  # doc must be a spaCy Doc object.

  # Lemmatize each token in doc that is not a stop word
  txt = [token.lemma_ for token in doc if not token.is_stop]

  # Keep only reviews with more than two remaining tokens; shorter ones return None
  if len(txt) > 2:
    # Join the lemmas back into a single space-separated string,
    # e.g. "the cat ran" -> ["the", "cat", "ran"] -> "cat run"
    return ' '.join(txt)
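
One detail of `cleaning` is easy to miss: a review with two or fewer tokens left after stop-word removal falls through the `if` and comes back as `None`. Here is a minimal stand-in that mirrors only the filter-and-join logic. It does not use spaCy and does not lemmatize, and both `STOP_WORDS` and `clean_tokens` are hypothetical names invented for this sketch.

```python
from typing import List, Optional

# Hypothetical, tiny stop-word list; spaCy ships a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "it", "i", "was", "and", "this"}

def clean_tokens(tokens: List[str]) -> Optional[str]:
    """Mirror the filter-and-join logic of cleaning(), without lemmatization."""
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    if len(kept) > 2:
        return " ".join(kept)
    return None  # two or fewer tokens left, so the review is discarded

reviews = [
    "this was an excellent dish".split(),
    "i added extra garlic and lemon juice".split(),
]
cleaned = [clean_tokens(review) for review in reviews]
print(cleaned)  # [None, 'added extra garlic lemon juice']
```

Any step that collects these results needs to drop the `None` entries before splitting the text into tokens.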


In [None]:
# Replace non-alphabetic characters (except apostrophes) with spaces and convert everything to lower case.
# brief_cleaning is a generator: each item it yields is one cleaned review, and it can be consumed only once.

brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['Review'])
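
To see what this regular expression keeps and drops, here is a small sketch on two made-up review strings. The helper name `brief_clean` and the sample texts are assumptions for illustration only. It also shows that a generator expression is exhausted after one pass.

```python
import re

def brief_clean(text: str) -> str:
    """Replace each run of characters other than letters/apostrophes with one space, then lowercase."""
    return re.sub("[^A-Za-z']+", " ", str(text)).lower()

samples = ["I cut back on the mayo, and it's GREAT!!", "Baked at 350F for 25 min."]

gen = (brief_clean(s) for s in samples)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

for cleaned in first_pass:
    print(repr(cleaned))
# "i cut back on the mayo and it's great "
# 'baked at f for min '
print(second_pass)  # []
```

Digits and punctuation disappear, apostrophes survive, and stray single letters such as the `f` from `350F` can remain.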


In [None]:
# Load previously cleaned reviews from Drive; the 'clean' column holds the cleaned text used below
df_clean = pd.read_csv('/content/drive/MyDrive/cleaned_reviews.csv')

In [None]:
from gensim.models.phrases import Phrases, Phraser

In [None]:
list_of_lists = [row.split() for row in df_clean['clean']]

In [None]:
phrases = Phrases(list_of_lists, progress_per = 10000)

5(levelnames)s - 213152: collecting all words and their counts
5(levelnames)s - 213152: PROGRESS: at sentence #0, processed 0 words and 0 word types
5(levelnames)s - 213152: PROGRESS: at sentence #10000, processed 171802 words and 104901 word types
5(levelnames)s - 213152: PROGRESS: at sentence #20000, processed 381565 words and 194204 word types
5(levelnames)s - 213153: PROGRESS: at sentence #30000, processed 600679 words and 273256 word types
5(levelnames)s - 213153: PROGRESS: at sentence #40000, processed 820978 words and 342611 word types
5(levelnames)s - 213153: PROGRESS: at sentence #50000, processed 1044448 words and 408364 word types
5(levelnames)s - 213154: PROGRESS: at sentence #60000, processed 1263537 words and 469121 word types
5(levelnames)s - 213154: PROGRESS: at sentence #70000, processed 1480796 words and 524683 word types
5(levelnames)s - 213154: PROGRESS: at sentence #80000, processed 1695036 words and 576187 word types
5(levelnames)s - 213155: PROGRESS: at sentence 

In [None]:
bigram = Phraser(phrases)

5(levelnames)s - 213427: source_vocab length 3929266
5(levelnames)s - 213508: Phraser built with 16928 phrasegrams


In [None]:
sentences = bigram[list_of_lists]
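
For intuition about what `Phrases` and `Phraser` do, here is a simplified pure-Python sketch of bigram detection. As far as I understand gensim's default scorer, a pair (a, b) is scored as `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size` and merged when the score exceeds `threshold` (gensim's defaults are `min_count=5` and `threshold=10.0`). The toy corpus, the much smaller `min_count` and `threshold`, and the function names are assumptions, and the vocabulary size here counts only single words.

```python
from collections import Counter
from typing import List, Set, Tuple

def find_bigrams(corpus: List[List[str]], min_count: int, threshold: float) -> Set[Tuple[str, str]]:
    """Return adjacent word pairs whose co-occurrence score exceeds the threshold."""
    unigrams = Counter(word for sent in corpus for word in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: counts unigram types only
    accepted = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if n_ab >= min_count and score > threshold:
            accepted.add((a, b))
    return accepted

def merge_bigrams(sentence: List[str], accepted: Set[Tuple[str, str]]) -> List[str]:
    """Join accepted pairs with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in accepted:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

corpus = [
    ["add", "brown", "sugar", "butter"],
    ["brown", "sugar", "melt", "fast"],
    ["use", "brown", "sugar", "here"],
    ["add", "butter", "first"],
]
accepted = find_bigrams(corpus, min_count=2, threshold=0.5)  # toy-sized settings
merged = merge_bigrams(corpus[0], accepted)
print(accepted)  # {('brown', 'sugar')}
print(merged)    # ['add', 'brown_sugar', 'butter']
```

Tokens such as `dark_chocolate` in the similarity results further down are the product of this kind of merging.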

In [None]:
word_freq = defaultdict(int)
for sent in sentences:
  for word in sent:
    word_freq[word] += 1
len(word_freq)

138866

In [None]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['recipe',
 'thank',
 'add',
 'good',
 'great',
 'love',
 'time',
 'like',
 'easy',
 'taste']
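
The same count-and-rank step can be written with `collections.Counter`, which provides `most_common` directly. This is just an alternative sketch on a made-up two-sentence corpus; `sentences_demo` and `freq_demo` are hypothetical names.

```python
from collections import Counter

# Made-up stand-in for the bigram-transformed sentences
sentences_demo = [
    ["recipe", "great", "recipe"],
    ["thank", "recipe", "great"],
]

# Count every token across all sentences in one pass
freq_demo = Counter(word for sent in sentences_demo for word in sent)

top_two = [word for word, count in freq_demo.most_common(2)]
print(len(freq_demo))  # 3 distinct tokens
print(top_two)         # ['recipe', 'great']
```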

In [None]:
import multiprocessing
from gensim.models import Word2Vec

In [None]:
cores = multiprocessing.cpu_count() #Count the number of cores in a computer

In [None]:
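w2v_model = Word2Vec(min_count = 100,     # ignore tokens seen fewer than 100 times
                     window = 2,          # context words on each side of the target
                     size = 300,          # embedding dimensionality (renamed vector_size in gensim 4)
                     sample = 6e-5,       # down-sampling threshold for frequent words
                     alpha = 0.03,        # initial learning rate
                     min_alpha = 0.0007,  # learning rate at the end of training
                     negative = 20,       # negative samples per positive example
                     workers = cores-1)   # worker threads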
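
The `sample` parameter is the least intuitive of these settings. To the best of my understanding of the word2vec down-sampling rule that gensim follows, a word whose share of the corpus is `f` is kept with probability `(sqrt(f / sample) + 1) * sample / f`, capped at 1. The sketch below evaluates that formula for a few made-up word counts; `keep_probability` is a hypothetical helper, not a gensim function.

```python
import math

def keep_probability(word_count: int, total_words: int, sample: float = 6e-5) -> float:
    """Probability that one occurrence of a word survives down-sampling."""
    f = word_count / total_words  # the word's share of all tokens
    return min(1.0, (math.sqrt(f / sample) + 1) * sample / f)

total = 1_000_000  # made-up corpus size
for count in (50, 1_000, 10_000, 50_000):
    print(f"count={count:>6}: keep with p = {keep_probability(count, total):.3f}")
```

Rare words are always kept, while a word that makes up 1% of the corpus keeps only about 8% of its occurrences under `sample = 6e-5`.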

In [None]:
t = time()
w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time()-t)/60, 2)))

5(levelnames)s - 214750: collecting all words and their counts
5(levelnames)s - 214750: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
5(levelnames)s - 214751: PROGRESS: at sentence #10000, processed 161663 words, keeping 11339 word types
5(levelnames)s - 214752: PROGRESS: at sentence #20000, processed 358885 words, keeping 17018 word types
5(levelnames)s - 214752: PROGRESS: at sentence #30000, processed 564703 words, keeping 21196 word types
5(levelnames)s - 214753: PROGRESS: at sentence #40000, processed 771435 words, keeping 24533 word types
5(levelnames)s - 214753: PROGRESS: at sentence #50000, processed 981267 words, keeping 27528 word types
5(levelnames)s - 214754: PROGRESS: at sentence #60000, processed 1187241 words, keeping 30262 word types
5(levelnames)s - 214755: PROGRESS: at sentence #70000, processed 1391284 words, keeping 32680 word types
5(levelnames)s - 214755: PROGRESS: at sentence #80000, processed 1592336 words, keeping 34813 word types
5(levelname

Time to build vocab: 1.55 mins
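
The frequency count earlier found 138866 distinct tokens, while the training log reports a vocabulary of 8561. The gap comes from `min_count = 100`, which discards every token seen fewer than 100 times. A minimal sketch of that pruning step, with made-up counts and a hypothetical helper `prune_vocab`:

```python
from collections import Counter
from typing import Dict

def prune_vocab(freq: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Keep only tokens that occur at least min_count times."""
    return {word: count for word, count in freq.items() if count >= min_count}

# Made-up counts for illustration
toy_freq = Counter({"recipe": 250, "chocolate": 120, "choclate": 100, "hummous": 40, "zaatar": 3})

kept = prune_vocab(toy_freq, min_count=100)
print(sorted(kept))  # ['choclate', 'chocolate', 'recipe']
```

A frequent misspelling survives the cut just like any other common token, which is consistent with `choclate` showing up among the neighbours of `chocolate` below.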


In [None]:
t = time()
w2v_model.train(sentences, total_examples = w2v_model.corpus_count, epochs = 5, report_delay=1)
print("Time to train the model: {} mins".format(round((time()-t)/60,2)))

5(levelnames)s - 215138: training model with 1 workers on 8561 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
5(levelnames)s - 215139: EPOCH 1 - PROGRESS: at 0.84% examples, 69845 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215141: EPOCH 1 - PROGRESS: at 1.57% examples, 72108 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215142: EPOCH 1 - PROGRESS: at 2.24% examples, 72295 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215143: EPOCH 1 - PROGRESS: at 2.91% examples, 72135 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215144: EPOCH 1 - PROGRESS: at 3.58% examples, 71954 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215145: EPOCH 1 - PROGRESS: at 4.24% examples, 72073 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215146: EPOCH 1 - PROGRESS: at 4.92% examples, 71810 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215147: EPOCH 1 - PROGRESS: at 5.61% examples, 71656 words/s, in_qsize 0, out_qsize 0
5(levelnames)s - 215148: EPOCH 1 -

Time to train the model: 12.55 mins
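
During training the learning rate drops from `alpha` to `min_alpha`. The sketch below assumes a plain linear schedule over the fraction of training completed, which is how I read gensim's documentation; the real implementation updates the rate per batch of work, so treat this as an approximation. `learning_rate` is a hypothetical helper.

```python
def learning_rate(progress: float, alpha: float = 0.03, min_alpha: float = 0.0007) -> float:
    """Linear interpolation from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return alpha - (alpha - min_alpha) * progress

for pct in (0, 25, 50, 75, 100):
    print(f"{pct:3d}% of training: alpha = {learning_rate(pct / 100):.5f}")
```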


In [None]:
w2v_model.save('/content/drive/MyDrive/recipes.model')

5(levelnames)s - 220540: saving Word2Vec object under /content/drive/MyDrive/recipes.model, separately None
5(levelnames)s - 220540: not storing attribute vectors_norm
5(levelnames)s - 220540: not storing attribute cum_table
5(levelnames)s - 220541: saved /content/drive/MyDrive/recipes.model


In [None]:
w2v_model.init_sims(replace=True)

5(levelnames)s - 220639: precomputing L2-norms of word weight vectors
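
`init_sims(replace=True)` overwrites each word vector with its unit-length (L2-normalized) version, which saves memory but, per the gensim documentation, means the model cannot be trained further. The model was saved before this call, so the file on Drive still holds the unnormalized vectors. Here is a small sketch of the normalization itself on made-up 2-d vectors, showing that the dot product of unit vectors equals the cosine similarity of the originals.

```python
import math
from typing import List

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

raw_a = [3.0, 4.0]  # made-up vectors
raw_b = [4.0, 3.0]

unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

# For unit vectors the dot product is the cosine of the angle between them
dot = sum(x * y for x, y in zip(unit_a, unit_b))
print(unit_a)
print(round(dot, 6))  # 0.96
```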


In [None]:
w2v_model.wv.most_similar(positive=['chocolate'])

[('choclate', 0.8654007911682129),
 ('choc', 0.8297751545906067),
 ('choco', 0.8141728043556213),
 ('dark_chocolate', 0.8139923810958862),
 ('bittersweet_chocolate', 0.7554681301116943),
 ('bittersweet', 0.7355612516403198),
 ('semisweet_chocolate', 0.7344757914543152),
 ('butterscotch', 0.7306553721427917),
 ('hershey', 0.7225421071052551),
 ('chocolate_ganache', 0.7222815155982971)]
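
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of the ranking on a toy vocabulary follows. The 3-d vectors are invented for illustration and the function below is a stand-in, not gensim's implementation.

```python
import math
from typing import Dict, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word: str, vectors: Dict[str, List[float]], topn: int = 10) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {  # made-up embeddings
    "chocolate": [0.9, 0.1, 0.0],
    "cocoa": [0.8, 0.2, 0.1],
    "vanilla": [0.5, 0.5, 0.1],
    "garlic": [0.0, 0.1, 0.9],
}

ranking = most_similar("chocolate", toy_vectors, topn=3)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```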

In [None]:
w2v_model.wv.similarity('brown', 'sugar')

0.15413524

In [None]:
w2v_model.wv.most_similar(positive = ['tahini'])

[('hummus', 0.6505939364433289),
 ('hummous', 0.5677351951599121),
 ('sumac', 0.5272275805473328),
 ('sesame', 0.5063362121582031),
 ('miso', 0.5032118558883667),
 ('sambal_oelek', 0.5031461119651794),
 ('can_chickpea', 0.49904292821884155),
 ('chickpeas', 0.4927641749382019),
 ('harissa', 0.48904722929000854),
 ('sesame_oil', 0.4848523736000061)]

In [None]:
w2v_model.wv.most_similar(positive = ['milk', 'protein'], negative=['tofu'], topn=5)

[('skim_milk', 0.6056066751480103),
 ('nonfat_milk', 0.5720887184143066),
 ('nonfat', 0.5487066507339478),
 ('percent_milk', 0.5205460786819458),
 ('non_fat', 0.5107308626174927)]
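
With both `positive` and `negative` words, `most_similar` builds a query vector by adding the unit vectors of the positive words and subtracting those of the negative words, then ranks the remaining vocabulary by cosine similarity to that query, leaving out the input words. The sketch below reproduces that idea on invented 2-d vectors. It is a simplified stand-in under those assumptions, not gensim's code.

```python
import math
from typing import Dict, List, Tuple

def unit(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def analogy(positive: List[str], negative: List[str],
            vectors: Dict[str, List[float]], topn: int = 5) -> List[Tuple[str, float]]:
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [(w, dot(query, unit(v))) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {  # made-up embeddings
    "milk": [1.0, 0.2],
    "protein": [0.1, 1.0],
    "tofu": [0.0, 1.0],
    "skim_milk": [0.9, 0.15],
    "lentil": [0.05, 0.9],
}

result = analogy(["milk", "protein"], ["tofu"], toy_vectors, topn=2)
for word, score in result:
    print(f"{word}: {score:.3f}")
```

Subtracting `tofu` cancels most of the direction it shares with `protein`, so the query ends up close to `milk` and its near neighbours.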