## Word2Vec Tutorial

In [113]:
import pandas as pd
import numpy as np
import re

from nltk.corpus import stopwords

from gensim.models import Word2Vec
from gensim.models import Phrases
from gensim.models.phrases import Phraser

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Exploring the Data

In [2]:
data = pd.read_csv('data/corpus.csv')

We are interested in the `Description` column, which holds the review text.

In [3]:
data.head()

| | User_ID | Description | Browser_Used | Device_Used | Is_Response |
|---|---|---|---|---|---|
| 0 | id10326 | The room was kind of clean but had a VERY stro... | Edge | Mobile | not happy |
| 1 | id10327 | I stayed at the Crown Plaza April -- - April -... | Internet Explorer | Mobile | not happy |
| 2 | id10328 | I booked this hotel through Hotwire at the low... | Mozilla | Tablet | not happy |
| 3 | id10329 | Stayed here with husband and sons on the way t... | InternetExplorer | Desktop | happy |
| 4 | id10330 | My girlfriends and I stayed here to celebrate ... | Edge | Tablet | not happy |


In [4]:
data.describe()

| | User_ID | Description | Browser_Used | Device_Used | Is_Response |
|---|---|---|---|---|---|
| count | 38932 | 38932 | 38932 | 38932 | 38932 |
| unique | 38932 | 38932 | 11 | 3 | 2 |
| top | id47031 | Recently stayed at the Conrad for - nights dur... | Firefox | Desktop | happy |
| freq | 1 | 1 | 7367 | 15026 | 26521 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38932 entries, 0 to 38931
Data columns (total 5 columns):
User_ID         38932 non-null object
Description     38932 non-null object
Browser_Used    38932 non-null object
Device_Used     38932 non-null object
Is_Response     38932 non-null object
dtypes: object(5)
memory usage: 1.5+ MB


In [7]:
# get stop words from nltk (a set makes the membership test below fast)
stopWords = set(stopwords.words('english'))

# pre-process a single review
def cleanData(sentence):
    # lowercase, then drop every character that is not alphanumeric,
    # whitespace or a full stop (full stops are kept to split sentences later)
    sentence = re.sub(r'[^A-Za-z0-9\s.]', r'', str(sentence).lower())
    sentence = re.sub(r'\n', r' ', sentence)

    # remove stop words; a token with a trailing full stop (e.g. 'here.')
    # does not match the stop word list and is kept
    sentence = " ".join([word for word in sentence.split() if word not in stopWords])

    return sentence

Sample processed review:

In [9]:
cleanData(data['Description'][2])

'booked hotel hotwire lowest price could find. got front desk manager gave us smoking room. argued little baby would booked room known smoking. manager would hear anything told hotwire books cheapest rooms available. get go unhappy. great deal persuasion discussion got nonsmoking room. thereafter room minimal amenities. besides great location near dupont circle much say overpriced hotel. room small ok condition. bathroom small tub. bathroom amenities also minimal. fridge microwave rent fridge staff keep baby things. parking costs per day best drive here. also breakfast included. lobby small feels old. thing available coffee lobby decent. poor service minimal amenities small rooms small bathrooms view great location. distance metro either mcpherson station dupont station. try look better available.'
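The cleaning steps can be tried on a toy input without downloading the nltk stop word list. The sketch below mirrors `cleanData` but uses a tiny hand-picked stop word set (`STOP` is an assumption for illustration, not the nltk list). It also shows why `here.` survives in the sample review above: the full stop stays attached to the token, so the token no longer matches the stop word.

```python
import re

# Hypothetical, hand-picked subset of stop words (not the nltk list).
STOP = {"the", "was", "of", "but", "had", "a", "very", "we", "will", "here"}

def clean(sentence):
    # Lowercase, then keep only letters, digits, whitespace and full stops.
    s = re.sub(r'[^A-Za-z0-9\s.]', '', str(sentence).lower())
    s = re.sub(r'\n', ' ', s)
    # Drop tokens that exactly match a stop word.
    return " ".join(w for w in s.split() if w not in STOP)

first = clean("The room was kind of clean, but had a VERY strong smell!")
second = clean("We will drive here.")
print(first)   # stop words and punctuation removed
print(second)  # 'here.' is kept because of the attached full stop
```
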

Process the whole dataset

In [10]:
# clean every review in the Description column
data['Description'] = data['Description'].map(cleanData)

In [12]:
data.head()

| | User_ID | Description | Browser_Used | Device_Used | Is_Response |
|---|---|---|---|---|---|
| 0 | id10326 | room kind clean strong smell dogs. generally a... | Edge | Mobile | not happy |
| 1 | id10327 | stayed crown plaza april april . staff friendl... | Internet Explorer | Mobile | not happy |
| 2 | id10328 | booked hotel hotwire lowest price could find. ... | Mozilla | Tablet | not happy |
| 3 | id10329 | stayed husband sons way alaska cruise. loved h... | InternetExplorer | Desktop | happy |
| 4 | id10330 | girlfriends stayed celebrate th birthdays. pla... | Edge | Tablet | not happy |


## Word2Vec

We'll use the [gensim](https://radimrehurek.com/gensim/models/word2vec.html) library to train the word2vec model. Gensim accepts input as a list of lists, where each inner list holds the tokens of one sentence.

Each review in our data may contain more than one sentence, so we split every review on full stops and collect the resulting sentences into one list to pass to gensim.

In [14]:
tmp_corpus = data['Description'].map(lambda x: x.split('.'))

In [15]:
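# corpus [[w1,w2,w3..],[..]]
corpus = []
for review in tmp_corpus:
    for line in review:
        # a trailing full stop yields an empty string here, which becomes
        # an empty token list; it is kept and counted as a sentence below
        words = line.split()
        corpus.append(words)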
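A small, self-contained sketch of the same splitting logic on two made-up reviews (the `reviews` list is an assumption for illustration). It shows the list-of-token-lists shape that gensim receives, and that a review ending in a full stop contributes one empty sentence.

```python
# Two made-up, already-cleaned reviews.
reviews = ["room small. bathroom small tub.", "great location"]

toy_corpus = []
for review in reviews:
    for line in review.split('.'):
        # Each sentence becomes a list of whitespace-separated tokens.
        toy_corpus.append(line.split())

# Empty sentences can be filtered out if they are not wanted.
non_empty = [sentence for sentence in toy_corpus if sentence]

print(toy_corpus)
print(len(toy_corpus), len(non_empty))
```

Filtering the empty lists would lower the sentence count but not the word count, since empty sentences contain no tokens.
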

Our data contains roughly **444k** sentences and **3.1M** words:

In [16]:
num_of_sentences = len(corpus)
num_of_words = 0
for line in corpus:
    num_of_words += len(line)

print('Num of sentences - %s'%(num_of_sentences))
print('Num of words - %s'%(num_of_words))

Num of sentences - 444271
Num of words - 3111121


We'll do a little more preprocessing by detecting common phrases in the corpus and joining them into single tokens, for example `new york` -> `new_york`. We'll limit this to bigrams.

In [46]:
phrases = Phrases(sentences=corpus,min_count=25,threshold=50)
bigram = Phraser(phrases)

2017-10-21 09:25:44,635 : INFO : collecting all words and their counts
2017-10-21 09:25:44,651 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-10-21 09:25:44,815 : INFO : PROGRESS: at sentence #10000, processed 62189 words and 55905 word types
2017-10-21 09:25:44,946 : INFO : PROGRESS: at sentence #20000, processed 123930 words and 98827 word types
2017-10-21 09:25:45,080 : INFO : PROGRESS: at sentence #30000, processed 186473 words and 137942 word types
2017-10-21 09:25:45,197 : INFO : PROGRESS: at sentence #40000, processed 247478 words and 173306 word types
2017-10-21 09:25:45,347 : INFO : PROGRESS: at sentence #50000, processed 308086 words and 206238 word types
2017-10-21 09:25:45,481 : INFO : PROGRESS: at sentence #60000, processed 369423 words and 238626 word types
2017-10-21 09:25:45,629 : INFO : PROGRESS: at sentence #70000, processed 431308 words and 270170 word types
2017-10-21 09:25:45,759 : INFO : PROGRESS: at sentence #80000, processed 494375 wo
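To give a feel for what `min_count` and `threshold` control, here is a minimal sketch of the default bigram score that gensim's `Phrases` is based on: `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size`. A bigram is joined when its score exceeds `threshold`. The helper `bigram_scores` and the toy sentences are assumptions for illustration, and the vocabulary size here counts unigram types only, which simplifies what the library does.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score adjacent word pairs; higher means 'more phrase-like'."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: unigram types only
    return {
        pair: (n_ab - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * vocab_size
        for pair, n_ab in bigrams.items()
    }

toy_sentences = [
    ["new", "york", "hotel"],
    ["new", "york", "city"],
    ["nice", "hotel"],
]
scores = bigram_scores(toy_sentences, min_count=1)

# Keep only pairs whose score exceeds a (toy) threshold of 1.
phrases_found = [pair for pair, score in scores.items() if score > 1]
print(scores[("new", "york")], phrases_found)
```

Raising `min_count` or `threshold` makes the detector more conservative, so fewer pairs are joined.
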

In [49]:
for index,sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]

In [52]:
import random

# return a shuffled copy of the corpus (the original list is left untouched)
def shuffle_corpus(sentences):
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

In [54]:
# sg=1 selects skip-gram | window = context window size | size = vector dimension
# note: gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`
size = 100
window_size = 2 # sentences weren't too long, so a small window suffices
epochs = 100
min_count = 2
workers = 4

# train word2vec model using gensim
model = Word2Vec(corpus, sg=1, window=window_size, size=size,
                 min_count=min_count, workers=workers, iter=epochs, sample=0.01)

2017-10-21 09:55:00,674 : INFO : collecting all words and their counts
2017-10-21 09:55:00,682 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-21 09:55:00,707 : INFO : PROGRESS: at sentence #10000, processed 68251 words, keeping 8812 word types
2017-10-21 09:55:00,734 : INFO : PROGRESS: at sentence #20000, processed 136136 words, keeping 12451 word types
2017-10-21 09:55:00,769 : INFO : PROGRESS: at sentence #30000, processed 204641 words, keeping 15230 word types
2017-10-21 09:55:00,802 : INFO : PROGRESS: at sentence #40000, processed 271576 words, keeping 17587 word types
2017-10-21 09:55:00,834 : INFO : PROGRESS: at sentence #50000, processed 338157 words, keeping 19539 word types
2017-10-21 09:55:00,872 : INFO : PROGRESS: at sentence #60000, processed 405603 words, keeping 21363 word types
2017-10-21 09:55:00,907 : INFO : PROGRESS: at sentence #70000, processed 473392 words, keeping 23191 word types
2017-10-21 09:55:00,944 : INFO : PROGRESS: at se

2017-10-21 09:55:22,534 : INFO : PROGRESS: at 3.21% examples, 490621 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:55:23,535 : INFO : PROGRESS: at 3.39% examples, 492057 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:55:24,543 : INFO : PROGRESS: at 3.56% examples, 492894 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:55:25,568 : INFO : PROGRESS: at 3.75% examples, 494738 words/s, in_qsize 16, out_qsize 1
2017-10-21 09:55:26,595 : INFO : PROGRESS: at 3.94% examples, 497782 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:55:27,644 : INFO : PROGRESS: at 4.12% examples, 497649 words/s, in_qsize 14, out_qsize 1
2017-10-21 09:55:28,641 : INFO : PROGRESS: at 4.29% examples, 498096 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:55:29,710 : INFO : PROGRESS: at 4.48% examples, 499539 words/s, in_qsize 14, out_qsize 1
2017-10-21 09:55:30,709 : INFO : PROGRESS: at 4.66% examples, 501026 words/s, in_qsize 14, out_qsize 1
2017-10-21 09:55:31,743 : INFO : PROGRESS: at 4.85% examples, 502055 word

2017-10-21 09:56:43,438 : INFO : PROGRESS: at 16.78% examples, 495723 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:44,502 : INFO : PROGRESS: at 16.95% examples, 495587 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:45,503 : INFO : PROGRESS: at 17.12% examples, 495572 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:46,535 : INFO : PROGRESS: at 17.29% examples, 495605 words/s, in_qsize 14, out_qsize 3
2017-10-21 09:56:47,533 : INFO : PROGRESS: at 17.48% examples, 496125 words/s, in_qsize 16, out_qsize 0
2017-10-21 09:56:48,584 : INFO : PROGRESS: at 17.65% examples, 496197 words/s, in_qsize 13, out_qsize 2
2017-10-21 09:56:49,578 : INFO : PROGRESS: at 17.84% examples, 496717 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:50,594 : INFO : PROGRESS: at 18.01% examples, 496703 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:51,610 : INFO : PROGRESS: at 18.20% examples, 497184 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:56:52,625 : INFO : PROGRESS: at 18.37% examples, 4

2017-10-21 09:58:03,183 : INFO : PROGRESS: at 30.61% examples, 503862 words/s, in_qsize 13, out_qsize 2
2017-10-21 09:58:04,207 : INFO : PROGRESS: at 30.80% examples, 504123 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:05,259 : INFO : PROGRESS: at 30.97% examples, 504031 words/s, in_qsize 14, out_qsize 1
2017-10-21 09:58:06,268 : INFO : PROGRESS: at 31.14% examples, 503906 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:07,294 : INFO : PROGRESS: at 31.33% examples, 504229 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:08,361 : INFO : PROGRESS: at 31.52% examples, 504387 words/s, in_qsize 14, out_qsize 1
2017-10-21 09:58:09,416 : INFO : PROGRESS: at 31.72% examples, 504678 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:10,436 : INFO : PROGRESS: at 31.92% examples, 505019 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:11,458 : INFO : PROGRESS: at 32.07% examples, 504770 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:58:12,475 : INFO : PROGRESS: at 32.22% examples, 5

2017-10-21 09:59:23,041 : INFO : PROGRESS: at 43.94% examples, 501036 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:24,061 : INFO : PROGRESS: at 44.08% examples, 500706 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:25,081 : INFO : PROGRESS: at 44.22% examples, 500381 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:26,141 : INFO : PROGRESS: at 44.38% examples, 500166 words/s, in_qsize 16, out_qsize 0
2017-10-21 09:59:27,141 : INFO : PROGRESS: at 44.52% examples, 499843 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:28,161 : INFO : PROGRESS: at 44.68% examples, 499713 words/s, in_qsize 16, out_qsize 0
2017-10-21 09:59:29,196 : INFO : PROGRESS: at 44.83% examples, 499441 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:30,199 : INFO : PROGRESS: at 44.98% examples, 499193 words/s, in_qsize 15, out_qsize 0
2017-10-21 09:59:31,301 : INFO : PROGRESS: at 45.12% examples, 498655 words/s, in_qsize 13, out_qsize 2
2017-10-21 09:59:32,321 : INFO : PROGRESS: at 45.25% examples, 4

2017-10-21 10:00:43,368 : INFO : PROGRESS: at 55.20% examples, 480772 words/s, in_qsize 16, out_qsize 3
2017-10-21 10:00:44,403 : INFO : PROGRESS: at 55.32% examples, 480386 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:45,449 : INFO : PROGRESS: at 55.41% examples, 479727 words/s, in_qsize 14, out_qsize 1
2017-10-21 10:00:46,473 : INFO : PROGRESS: at 55.51% examples, 479102 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:47,485 : INFO : PROGRESS: at 55.60% examples, 478527 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:48,495 : INFO : PROGRESS: at 55.70% examples, 477931 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:49,535 : INFO : PROGRESS: at 55.81% examples, 477438 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:50,541 : INFO : PROGRESS: at 55.91% examples, 476937 words/s, in_qsize 12, out_qsize 1
2017-10-21 10:00:51,570 : INFO : PROGRESS: at 56.03% examples, 476579 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:00:52,585 : INFO : PROGRESS: at 56.14% examples, 4

2017-10-21 10:02:03,506 : INFO : PROGRESS: at 65.84% examples, 464135 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:02:04,526 : INFO : PROGRESS: at 66.00% examples, 464161 words/s, in_qsize 14, out_qsize 1
2017-10-21 10:02:05,532 : INFO : PROGRESS: at 66.14% examples, 464086 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:02:06,585 : INFO : PROGRESS: at 66.30% examples, 464029 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:02:07,592 : INFO : PROGRESS: at 66.46% examples, 464046 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:02:08,600 : INFO : PROGRESS: at 66.62% examples, 464039 words/s, in_qsize 16, out_qsize 0
2017-10-21 10:02:09,680 : INFO : PROGRESS: at 66.78% examples, 464000 words/s, in_qsize 16, out_qsize 3
2017-10-21 10:02:10,700 : INFO : PROGRESS: at 66.93% examples, 463933 words/s, in_qsize 14, out_qsize 1
2017-10-21 10:02:11,715 : INFO : PROGRESS: at 67.10% examples, 464011 words/s, in_qsize 14, out_qsize 1
2017-10-21 10:02:12,812 : INFO : PROGRESS: at 67.25% examples, 4

2017-10-21 10:03:23,389 : INFO : PROGRESS: at 78.02% examples, 462225 words/s, in_qsize 16, out_qsize 1
2017-10-21 10:03:24,393 : INFO : PROGRESS: at 78.16% examples, 462148 words/s, in_qsize 16, out_qsize 0
2017-10-21 10:03:25,406 : INFO : PROGRESS: at 78.32% examples, 462122 words/s, in_qsize 16, out_qsize 1
2017-10-21 10:03:26,406 : INFO : PROGRESS: at 78.48% examples, 462166 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:03:27,424 : INFO : PROGRESS: at 78.64% examples, 462154 words/s, in_qsize 16, out_qsize 0
2017-10-21 10:03:28,426 : INFO : PROGRESS: at 78.79% examples, 462118 words/s, in_qsize 16, out_qsize 0
2017-10-21 10:03:29,462 : INFO : PROGRESS: at 78.95% examples, 462091 words/s, in_qsize 16, out_qsize 4
2017-10-21 10:03:30,472 : INFO : PROGRESS: at 79.10% examples, 462087 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:03:31,532 : INFO : PROGRESS: at 79.26% examples, 462038 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:03:32,557 : INFO : PROGRESS: at 79.42% examples, 4

2017-10-21 10:04:43,231 : INFO : PROGRESS: at 90.17% examples, 460670 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:44,232 : INFO : PROGRESS: at 90.32% examples, 460659 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:45,259 : INFO : PROGRESS: at 90.49% examples, 460713 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:46,347 : INFO : PROGRESS: at 90.64% examples, 460617 words/s, in_qsize 12, out_qsize 3
2017-10-21 10:04:47,360 : INFO : PROGRESS: at 90.79% examples, 460562 words/s, in_qsize 15, out_qsize 3
2017-10-21 10:04:48,395 : INFO : PROGRESS: at 90.96% examples, 460626 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:49,448 : INFO : PROGRESS: at 91.10% examples, 460491 words/s, in_qsize 13, out_qsize 2
2017-10-21 10:04:50,479 : INFO : PROGRESS: at 91.27% examples, 460540 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:51,493 : INFO : PROGRESS: at 91.42% examples, 460503 words/s, in_qsize 15, out_qsize 0
2017-10-21 10:04:52,519 : INFO : PROGRESS: at 91.57% examples, 4
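To make the `sg=1` and `window=2` settings used for training more concrete, the sketch below lists the (center, context) pairs that a skip-gram model would learn from for one short sentence. The helper `skipgram_pairs` is hypothetical and uses a fixed window; the reference word2vec implementation also shrinks the window at random for each center word, which this sketch ignores.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["room", "small", "ok", "condition"], window=2)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 2, `room` is paired with `small` and `ok` but not with `condition`, which sits three positions away.
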

To continue training an existing model (sometimes called online word2vec training), update the vocabulary and call `train` again:

In [18]:
model.build_vocab(sentences=shuffle_corpus(corpus),update=True)

for i in range(5):
    model.train(sentences=shuffle_corpus(corpus),epochs=50,total_examples=model.corpus_count)

Save the word2vec model so it can be loaded and used later:

In [56]:
# save model
model.save('w2v_model')

2017-10-21 10:06:19,395 : INFO : saving Word2Vec object under w2v_model, separately None
2017-10-21 10:06:19,417 : INFO : not storing attribute syn0norm
2017-10-21 10:06:19,423 : INFO : not storing attribute cum_table
2017-10-21 10:06:19,919 : INFO : saved w2v_model


Or load an already saved word2vec model:

In [58]:
# load word2vec model
model = Word2Vec.load('w2v_model')

2017-10-21 10:06:31,821 : INFO : loading Word2Vec object from w2v_model
2017-10-21 10:06:32,086 : INFO : loading wv recursively from w2v_model.wv.* with mmap=None
2017-10-21 10:06:32,086 : INFO : setting ignored attribute syn0norm to None
2017-10-21 10:06:32,103 : INFO : setting ignored attribute cum_table to None
2017-10-21 10:06:32,107 : INFO : loaded w2v_model


## Using the model

Word2vec training is an unsupervised task, so there is no single objective way to evaluate the result; how good the vectors are depends on your end application. A few spot checks still give a useful impression.

In [92]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7119250893592834)]

In [95]:
model.wv.most_similar(positive=['hotel', 'room'], negative=['cafe'], topn=1)

[('property', 0.5173135995864868)]

In [96]:
model.wv.most_similar(positive=['happy', 'sad'], negative=['annoy'], topn=2)

[('delighted', 0.4968428611755371), ('pleased', 0.4967663884162903)]
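The `positive`/`negative` queries above amount to vector arithmetic followed by a cosine-similarity ranking. The following is a minimal sketch of that idea, assuming made-up three-dimensional vectors (`toy`) rather than the trained model; the helper names are hypothetical, and the library's own implementation differs in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, positive, negative, topn=1):
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    # Add unit vectors of positive words, subtract those of negative words.
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += sign * x
    query = unit(query)
    # Rank all other words by cosine similarity to the query vector.
    inputs = set(positive) | set(negative)
    scored = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items() if w not in inputs
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Made-up vectors; the axes loosely mean (royalty, male, female).
toy = {
    "king": [1, 1, 0], "queen": [1, 0, 1],
    "man": [0, 1, 0], "woman": [0, 0, 1],
    "apple": [0.2, 0.5, 0.5],
}
result = most_similar(toy, positive=["woman", "king"], negative=["man"])
print(result)
```

The query words themselves are excluded from the ranking, which is why `king` is never returned as its own nearest neighbour.
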

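The results look promising: the model recovers the classic king/queen analogy and returns sensible near-synonyms for sentiment words, which suggests it has picked up a good deal of the semantics in the reviews.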

In [80]:
model.wv.doesnt_match('hotel nice clean holiday'.split())

'holiday'

In [99]:
model.wv.doesnt_match('breakfast dinner continental lunch'.split())

'continental'
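One way to think about `doesnt_match` is: average the unit vectors of all the words, then pick the word that is least similar to that average. The sketch below follows that idea with made-up two-dimensional vectors (`toy_vectors` and the helper `odd_one_out` are assumptions for illustration, not the trained model or the library's code).

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    # Mean of the unit vectors of all given words.
    mean = [sum(u[k] for u in units.values()) / len(units) for k in range(dims)]
    # The odd word has the smallest dot product with the mean.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: three 'meal' words point one way, 'continental' another.
toy_vectors = {
    "breakfast": [1.0, 0.1],
    "dinner": [0.9, 0.2],
    "lunch": [1.0, 0.15],
    "continental": [0.1, 1.0],
}
odd = odd_one_out(toy_vectors, ["breakfast", "dinner", "continental", "lunch"])
print(odd)
```
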

In [105]:
model.wv.similarity('dinner','lunch')

0.77515746728803325

In [106]:
model.wv.most_similar('great')

[('fantastic', 0.8479016423225403),
 ('excellent', 0.8270981311798096),
 ('wonderful', 0.7989335060119629),
 ('awesome', 0.7726607918739319),
 ('amazing', 0.7459379434585571),
 ('good', 0.7419600486755371),
 ('terrific', 0.7366918921470642),
 ('fabulous', 0.7290831804275513),
 ('incredible', 0.6833620071411133),
 ('perfect', 0.6754406094551086)]

In [108]:
model.wv.most_similar('comfortable')

[('comfy', 0.846420168876648),
 ('comftorable', 0.6946347951889038),
 ('confy', 0.6907272338867188),
 ('goodsize', 0.6500658988952637),
 ('cozy', 0.6455215215682983),
 ('comfi', 0.6436315774917603),
 ('spacious', 0.6404236555099487),
 ('modernly', 0.6295812129974365),
 ('ultracomfortable', 0.6291707158088684),
 ('comforatble', 0.6264324188232422)]

In [109]:
model.wv.most_similar('recommend')

[('highly_recommend', 0.8297330141067505),
 ('recomend', 0.8017210960388184),
 ('recommed', 0.7851127982139587),
 ('reccomend', 0.7745145559310913),
 ('suggest', 0.7575714588165283),
 ('reccommend', 0.7544815540313721),
 ('recommned', 0.7037972211837769),
 ('recommand', 0.6724419593811035),
 ('edited', 0.6539892554283142),
 ('hesistate', 0.63225257396698)]

In [111]:
model.wv.most_similar('boston')

[('dc', 0.7707709670066833),
 ('seattle', 0.7631182670593262),
 ('chicago', 0.736762523651123),
 ('san_fran', 0.7224879264831543),
 ('san_francisco', 0.7217927575111389),
 ('philly', 0.7175568342208862),
 ('san_diego', 0.7092623710632324),
 ('sf', 0.6956394910812378),
 ('charlotte', 0.6925041675567627),
 ('nyc', 0.6882145404815674)]