# Objective
***
When using Word2Vec, you rely on the distributional hypothesis: **the meaning of a word can be inferred from the company it keeps**.

In this sense, if you have two words with very similar neighbors, then these words are probably similar in meaning. 
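
To make the idea of "similar neighbors" concrete, here is a minimal sketch that collects the context words around each token and compares two words by the overlap of their contexts. The toy sentences, the helper names `context_sets` and `jaccard`, and the window of 2 are assumptions for illustration only; Word2Vec learns dense vectors rather than comparing sets, but the underlying intuition is the same.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word].add(tokens[j])
    return contexts

def jaccard(a, b):
    """Set overlap: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy corpus, made up for illustration (not taken from the dataset)
toy_sentences = [
    "the room was dirty and small".split(),
    "the room was filthy and small".split(),
    "the staff was polite and helpful".split(),
]

ctx = context_sets(toy_sentences)
overlap_dirty_filthy = jaccard(ctx["dirty"], ctx["filthy"])
overlap_dirty_polite = jaccard(ctx["dirty"], ctx["polite"])
print(overlap_dirty_filthy, overlap_dirty_polite)
```

In this toy corpus "dirty" and "filthy" share all of their neighbors, while "dirty" and "polite" share only the function words around them.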

In this tutorial, we will use the Gensim implementation of Word2Vec and get it working end to end.

We will use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/#.XUkA6nVKhhE) dataset, which contains car and hotel reviews. The hotel reviews are compressed into a single file, which we use to train our model.

In [4]:
import gzip
import logging
import gensim

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# Dataset
***
The secret to getting Word2Vec working for you is to have lots of text data relevant to the task at hand. In this case we will be using a hotel reviews dataset.

The total file is about 228MB in size. 

You can pass a sequence of sentences to gensim as input, so we can pass a whole review as a sentence (a large chunk of text). 

We will read the data into a list.

In [5]:
data_file = "data/hotel_reviews.tgz"

The `gensim.utils.simple_preprocess` function tokenizes the text, lowercases it, and returns a list of tokens.
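
To show roughly what this preprocessing step does, here is a standard-library approximation. It is a sketch, not gensim's implementation: the helper name `rough_preprocess` is made up, and the length limits of 2 and 15 characters are assumed to mirror the library defaults.

```python
import re

# A token is a run of word characters that are not digits (letters and underscores).
_TOKEN = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, tokenize, drop very short/long tokens."""
    if isinstance(text, bytes):
        # Lines read from a binary file handle arrive as bytes
        text = text.decode("utf8", errors="ignore")
    tokens = _TOKEN.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

sample_tokens = rough_preprocess(b"The room was DIRTY, 2 towels missing & a broken TV!")
print(sample_tokens)
# ['the', 'room', 'was', 'dirty', 'towels', 'missing', 'broken', 'tv']
```

Punctuation, digits, and the one-letter token "a" are dropped, and everything else is lowercased.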

In [6]:
def read_input(file_path):
    """
    Read the input file
    """
    logging.info("reading file {}...this may take a while".format(file_path))
    
    with gzip.open(file_path, 'rb') as filehandle:
        for i, line in enumerate(filehandle):
            if (i%10000==0):
                logging.info("read {} reviews".format(i))
            # do some preprocessing and return list of tokens (list of strings) for each review
            yield gensim.utils.simple_preprocess(line)

Now create a list of lists. Each element of this list is a list of tokens from that review.

In [7]:
documents = list(read_input(data_file))
logging.info("Done reading data file")

2019-08-10 23:21:16,971 : INFO : reading file data/hotel_reviews.tgz...this may take a while
2019-08-10 23:21:16,973 : INFO : read 0 reviews
2019-08-10 23:21:17,772 : INFO : read 10000 reviews
2019-08-10 23:21:18,608 : INFO : read 20000 reviews
2019-08-10 23:21:19,438 : INFO : read 30000 reviews
2019-08-10 23:21:20,272 : INFO : read 40000 reviews
2019-08-10 23:21:21,113 : INFO : read 50000 reviews
2019-08-10 23:21:22,010 : INFO : read 60000 reviews
2019-08-10 23:21:22,870 : INFO : read 70000 reviews
2019-08-10 23:21:23,723 : INFO : read 80000 reviews
2019-08-10 23:21:24,552 : INFO : read 90000 reviews
2019-08-10 23:21:25,401 : INFO : read 100000 reviews
2019-08-10 23:21:26,238 : INFO : read 110000 reviews
2019-08-10 23:21:27,101 : INFO : read 120000 reviews
2019-08-10 23:21:27,951 : INFO : read 130000 reviews
2019-08-10 23:21:28,797 : INFO : read 140000 reviews
2019-08-10 23:21:29,814 : INFO : read 150000 reviews
2019-08-10 23:21:30,643 : INFO : read 160000 reviews
2019-08-10 23:21:31,

In [12]:
print(type(documents))
print(type(documents[0]))
print(documents[0])
print(type(documents[0][0]))

<class 'list'>
<class 'list'>
['hotel_reviews', 'csv', 'ustar', 'joseph', 'joseph', 'hotel_address', 'review_date', 'average_score', 'hotel_name', 'negative_review', 'positive_review', 'reviewer_score', 'tags', 'lat', 'lng']
<class 'str'>
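
The first "document" printed above contains tokens such as `hotel_reviews`, `csv`, and `ustar`, which look like tar header fields followed by a CSV header row rather than review text. That is what one would expect when `gzip.open` is applied to a `.tgz` file, since it only undoes the gzip layer and leaves the tar container in place. A hedged sketch of reading the members through `tarfile` instead is shown below; the helper name `iter_tar_lines` and the demo archive are made up for illustration, and the layout of the real archive is an assumption.

```python
import io
import os
import tarfile
import tempfile

def iter_tar_lines(archive_path):
    """Yield raw lines from every regular file inside a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other special entries
            handle = archive.extractfile(member)
            for line in handle:
                yield line

# Demo on a tiny throwaway archive (file name and contents are made up)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_archive = os.path.join(tmp_dir, "demo_reviews.tgz")
    payload = b"great hotel, friendly staff\nroom was dirty\n"
    with tarfile.open(demo_archive, "w:gz") as archive:
        info = tarfile.TarInfo(name="demo_reviews.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    demo_lines = list(iter_tar_lines(demo_archive))

print(demo_lines)
```

Each yielded line could then be passed to `gensim.utils.simple_preprocess` in the same way as in `read_input`.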


# Training the Word2Vec Model
***
Now we train the Word2Vec model. To do this with gensim, we simply instantiate `Word2Vec` and pass the list of reviews we just read.

Word2Vec will use this list of lists to create an internal vocabulary.

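Because the documents are passed to the constructor, gensim builds the vocabulary and already runs an initial training pass with its default number of epochs. The explicit call to `train(...)` then continues training for 10 more epochs.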

We train on the OpinRank dataset, which should take around 10 minutes.

Under the hood, this trains a neural network with a single hidden layer. We do not use the resulting network itself as a model; instead, we take the weights of the hidden layer, which are essentially the word vectors we are trying to learn.

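
To see why the hidden-layer weights can serve as word vectors, consider a one-hot input: multiplying it by the weight matrix simply selects the row that belongs to that word. The sketch below uses a made-up three-word vocabulary and made-up weights; the names `hidden_weights` and `hidden_activation` are illustrative and are not part of gensim.

```python
# Toy vocabulary and a made-up weight matrix with one 3-dimensional row per word
vocab = ["bed", "pillow", "shower"]
hidden_weights = [
    [0.9, 0.1, 0.0],   # bed
    [0.8, 0.2, 0.1],   # pillow
    [0.1, 0.7, 0.6],   # shower
]

def one_hot(word):
    """Encode a word as a vector with a single 1.0 at its vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Multiply the one-hot input by the weight matrix (no nonlinearity)."""
    x = one_hot(word)
    n_dims = len(hidden_weights[0])
    return [sum(x[i] * hidden_weights[i][d] for i in range(len(vocab)))
            for d in range(n_dims)]

pillow_vector = hidden_activation("pillow")
print(pillow_vector)  # same values as the "pillow" row of hidden_weights
```

The matrix product reduces to a row lookup, so after training each row of the weight matrix is the vector for one vocabulary word.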
## Parameters
***
The constructor for our `Word2Vec` model takes several parameters:
```python
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=8)
```
* **size** - dimensionality of the dense vector that represents each token or word (this argument is named `vector_size` in gensim 4.0 and later). If you have little data, choose a smaller size.
* **window** - the maximum distance between a target word and its neighboring context words.
* **min_count** - minimum frequency count of words. Words that occur fewer times than this are ignored; infrequent words are usually unimportant, so this removes them.
* **workers** - number of threads to use for training.
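
The effect of `window` and `min_count` can be sketched in plain Python. This is an illustration under stated assumptions, not gensim's internals: the helper `prune_and_pair` is made up, it applies a fixed window, and it assumes that rare words are dropped before context windows are formed.

```python
from collections import Counter

def prune_and_pair(sentences, window=2, min_count=2):
    """Drop words rarer than min_count, then list (target, context) pairs within window."""
    counts = Counter(token for tokens in sentences for token in tokens)
    pairs = []
    for tokens in sentences:
        kept = [t for t in tokens if counts[t] >= min_count]
        for i, target in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, kept[j]))
    return counts, pairs

# Toy reviews, made up for illustration
toy_reviews = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
]
word_counts, training_pairs = prune_and_pair(toy_reviews, window=1, min_count=2)
print(training_pairs)
# [('room', 'staff'), ('staff', 'room'), ('room', 'staff'), ('staff', 'room')]
```

Only "room" and "staff" appear twice, so with `min_count=2` every other word disappears from the vocabulary and contributes no training pairs.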

In [10]:
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=8)
model.train(documents, total_examples=len(documents), epochs=10)

2019-08-06 17:34:02,740 : INFO : collecting all words and their counts
2019-08-06 17:34:02,741 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-08-06 17:34:02,848 : INFO : PROGRESS: at sentence #10000, processed 631538 words, keeping 10659 word types
2019-08-06 17:34:02,932 : INFO : PROGRESS: at sentence #20000, processed 1188643 words, keeping 14156 word types
2019-08-06 17:34:03,021 : INFO : PROGRESS: at sentence #30000, processed 1772531 words, keeping 17164 word types
2019-08-06 17:34:03,110 : INFO : PROGRESS: at sentence #40000, processed 2362096 words, keeping 19706 word types
2019-08-06 17:34:03,199 : INFO : PROGRESS: at sentence #50000, processed 2946890 words, keeping 21949 word types
2019-08-06 17:34:03,287 : INFO : PROGRESS: at sentence #60000, processed 3522800 words, keeping 23966 word types
2019-08-06 17:34:03,381 : INFO : PROGRESS: at sentence #70000, processed 4121671 words, keeping 26098 word types
2019-08-06 17:34:03,473 : INFO : PROGRES

2019-08-06 17:34:19,215 : INFO : EPOCH 1 - PROGRESS: at 78.44% examples, 1517304 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:34:20,219 : INFO : EPOCH 1 - PROGRESS: at 85.49% examples, 1518734 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:34:21,226 : INFO : EPOCH 1 - PROGRESS: at 92.91% examples, 1520712 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:34:22,171 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-08-06 17:34:22,173 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-08-06 17:34:22,174 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-08-06 17:34:22,176 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-08-06 17:34:22,181 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-08-06 17:34:22,185 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-06 17:34:22,192 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-06 17:34:2

2019-08-06 17:35:05,041 : INFO : EPOCH 5 - PROGRESS: at 7.09% examples, 1504368 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:06,045 : INFO : EPOCH 5 - PROGRESS: at 14.36% examples, 1517225 words/s, in_qsize 13, out_qsize 2
2019-08-06 17:35:07,052 : INFO : EPOCH 5 - PROGRESS: at 21.73% examples, 1537370 words/s, in_qsize 15, out_qsize 1
2019-08-06 17:35:08,056 : INFO : EPOCH 5 - PROGRESS: at 28.69% examples, 1529102 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:35:09,061 : INFO : EPOCH 5 - PROGRESS: at 35.98% examples, 1528013 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:10,071 : INFO : EPOCH 5 - PROGRESS: at 43.26% examples, 1530786 words/s, in_qsize 16, out_qsize 1
2019-08-06 17:35:11,079 : INFO : EPOCH 5 - PROGRESS: at 50.60% examples, 1535618 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:12,083 : INFO : EPOCH 5 - PROGRESS: at 57.90% examples, 1538633 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:13,091 : INFO : EPOCH 5 - PROGRESS: at 65.36% examples, 1543530 

2019-08-06 17:35:53,563 : INFO : EPOCH 3 - PROGRESS: at 57.77% examples, 1539098 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:54,577 : INFO : EPOCH 3 - PROGRESS: at 65.05% examples, 1539764 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:35:55,583 : INFO : EPOCH 3 - PROGRESS: at 72.42% examples, 1540925 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:35:56,590 : INFO : EPOCH 3 - PROGRESS: at 79.87% examples, 1544654 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:57,590 : INFO : EPOCH 3 - PROGRESS: at 87.05% examples, 1544979 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:35:58,600 : INFO : EPOCH 3 - PROGRESS: at 94.47% examples, 1544987 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:35:59,342 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-08-06 17:35:59,345 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-08-06 17:35:59,346 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-08-06 17:35:59,349 : INFO : work

2019-08-06 17:36:40,953 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-06 17:36:40,958 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-06 17:36:40,959 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-06 17:36:40,960 : INFO : EPOCH - 6 : training on 30031149 raw words (21371192 effective words) took 13.9s, 1539034 effective words/s
2019-08-06 17:36:41,973 : INFO : EPOCH 7 - PROGRESS: at 7.03% examples, 1488824 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:36:42,977 : INFO : EPOCH 7 - PROGRESS: at 14.28% examples, 1509675 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:36:43,980 : INFO : EPOCH 7 - PROGRESS: at 21.38% examples, 1515759 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:36:44,983 : INFO : EPOCH 7 - PROGRESS: at 28.71% examples, 1530491 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:36:45,984 : INFO : EPOCH 7 - PROGRESS: at 35.98% examples, 1530817 words/s, in_qsize 16, out_qsize 0
2019-08-06 1

2019-08-06 17:37:29,778 : INFO : EPOCH 10 - PROGRESS: at 50.69% examples, 1538175 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:37:30,779 : INFO : EPOCH 10 - PROGRESS: at 57.87% examples, 1537620 words/s, in_qsize 15, out_qsize 1
2019-08-06 17:37:31,779 : INFO : EPOCH 10 - PROGRESS: at 65.05% examples, 1537546 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:37:32,784 : INFO : EPOCH 10 - PROGRESS: at 72.40% examples, 1538653 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:37:33,786 : INFO : EPOCH 10 - PROGRESS: at 79.75% examples, 1541036 words/s, in_qsize 14, out_qsize 1
2019-08-06 17:37:34,796 : INFO : EPOCH 10 - PROGRESS: at 86.89% examples, 1540544 words/s, in_qsize 15, out_qsize 0
2019-08-06 17:37:35,802 : INFO : EPOCH 10 - PROGRESS: at 94.23% examples, 1539716 words/s, in_qsize 16, out_qsize 0
2019-08-06 17:37:36,572 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-08-06 17:37:36,573 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-0

(213694388, 300311490)

# Using the Word Vectors
***
Now we can do fun things with our learned word vectors!
## Look up Words Similar to "dirty"
***

In [11]:
target_word = "dirty"
model.wv.most_similar( positive=target_word )

2019-08-06 17:43:16,863 : INFO : precomputing L2-norms of word weight vectors


[('filthy', 0.8175210952758789),
 ('stained', 0.8072519898414612),
 ('unclean', 0.7810685634613037),
 ('dusty', 0.7645260095596313),
 ('scratched', 0.6971113681793213),
 ('smelly', 0.692798912525177),
 ('damaged', 0.6831164956092834),
 ('torn', 0.6799750924110413),
 ('sticky', 0.672492265701294),
 ('threadbare', 0.6706531643867493)]

In [15]:
target_word = "polite"
model.wv.most_similar( positive=target_word )

[('courteous', 0.8741446733474731),
 ('cheerful', 0.7976503968238831),
 ('welcoming', 0.7975963950157166),
 ('attentive', 0.7734375596046448),
 ('professional', 0.7597354650497437),
 ('freindly', 0.7273843288421631),
 ('kind', 0.726075291633606),
 ('pleasant', 0.6902655363082886),
 ('friendly', 0.682566225528717),
 ('accommodating', 0.6726747751235962)]

In [16]:
target_word = "france"
model.wv.most_similar( positive=target_word )

[('italy', 0.5549201369285583),
 ('netherlands', 0.5487757921218872),
 ('austria', 0.4729192554950714),
 ('spain', 0.4533565044403076),
 ('iraq', 0.39420443773269653),
 ('morocco', 0.38039430975914),
 ('algeria', 0.37401291728019714),
 ('lisle', 0.37372440099716187),
 ('slovenia', 0.3700857162475586),
 ('th', 0.35027825832366943)]

In [17]:
target_word = "shocked"
model.wv.most_similar( positive=target_word )

[('surprised', 0.7705385088920593),
 ('appalled', 0.705019474029541),
 ('annoyed', 0.6508231163024902),
 ('upset', 0.6485998034477234),
 ('disgusted', 0.6214147210121155),
 ('suprised', 0.6194778680801392),
 ('amazed', 0.6169865131378174),
 ('disappointed', 0.5991095900535583),
 ('dissapointed', 0.588792085647583),
 ('excited', 0.5861113667488098)]

You can also provide negative examples to filter the similar words more selectively:

In [18]:
# get everything related to stuff on the bed
positive_examples = ["bed", "sheet", "pillow"]
negative_examples = ["couch"]
model.wv.most_similar(positive=positive_examples, negative=negative_examples, topn=10)

[('duvet', 0.7301138043403625),
 ('comforter', 0.7002500891685486),
 ('mattress', 0.6890642046928406),
 ('blanket', 0.6781498789787292),
 ('quilt', 0.6764400601387024),
 ('pillows', 0.6585404276847839),
 ('bedding', 0.6240412592887878),
 ('matress', 0.6225305199623108),
 ('duvets', 0.6080532073974609),
 ('protector', 0.5990015864372253)]
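
A rough picture of how positive and negative examples combine: the positive vectors are added, the negative ones are subtracted, and the remaining vocabulary is ranked by cosine similarity to the resulting query vector. The sketch below follows that idea on made-up two-dimensional vectors; `toy_most_similar` is a hypothetical helper and only approximates what gensim does internally.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to mean(positive) minus mean(negative)."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word, weight in signed:
        for d, x in enumerate(unit(vectors[word])):
            query[d] += weight * x / len(signed)
    excluded = set(positive) | set(negative)  # never return the input words
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors for illustration
toy_vectors = {
    "bed": [1.0, 0.1], "pillow": [0.9, 0.2], "duvet": [0.95, 0.0],
    "couch": [0.3, 1.0], "sofa": [0.2, 0.9],
}
ranked = toy_most_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"])
print(ranked)
```

Subtracting "couch" pushes the query away from seating furniture, so "duvet" ranks above "sofa" in this toy example.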

## Similarity between two words in the vocabulary
***
The following snippets compute the cosine similarity between the word vectors of the two input words.

In [19]:
model.wv.similarity( w1="dirty", w2="smelly" )

0.6927988990782136

In [21]:
model.wv.similarity(w1="dirty",w2="dirty")

1.0000000000000002

In [22]:
model.wv.similarity(w1="dirty",w2="clean")

0.2088930286059534
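
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch on made-up three-dimensional vectors (the values are illustrative and are not taken from the trained model):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration
v_dirty = [0.8, 0.1, 0.3]
v_smelly = [0.7, 0.2, 0.4]
v_clean = [-0.2, 0.9, 0.1]

sim_same = cosine_similarity(v_dirty, v_dirty)
sim_close = cosine_similarity(v_dirty, v_smelly)
sim_far = cosine_similarity(v_dirty, v_clean)
print(sim_same, sim_close, sim_far)
```

A vector compared with itself has similarity 1; a result such as `1.0000000000000002` above is floating-point rounding, not a real deviation.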

## Find the Odd One Out
***

In [23]:
model.wv.doesnt_match(["cat","dog","france"])


'france'

In [24]:
model.wv.doesnt_match(["bed","pillow","duvet","shower"])

'shower'
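
One way to picture how the odd one out is chosen: normalize the vectors of the given words, average them, and return the word whose vector is least similar to that average. The sketch below implements this idea on made-up vectors; `toy_doesnt_match` is a hypothetical helper and should be read as an approximation of the library's behavior.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(vectors, words):
    """Return the word whose unit vector has the lowest dot product with the mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][d] for w in words) / len(words) for d in range(dims)])
    scores = {w: sum(x * m for x, m in zip(normed[w], mean)) for w in words}
    return min(scores, key=scores.get)

# Made-up 2-dimensional vectors for illustration
toy_vectors = {
    "bed": [1.0, 0.1], "pillow": [0.9, 0.2],
    "duvet": [0.95, 0.0], "shower": [0.1, 1.0],
}
odd_one = toy_doesnt_match(toy_vectors, ["bed", "pillow", "duvet", "shower"])
print(odd_one)  # shower
```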

# Applications
***
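You can use Word2Vec to build a sentiment lexicon. A model trained on large amounts of user reviews helps you develop this lexicon: the words most similar to a seed word such as "dirty" (for example "filthy" and "unclean" above) are natural candidates to add.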

If you had tags for a million Stack Overflow questions and answers, you could find the tags related to a given tag and recommend them for exploration. You do this by treating each set of co-occurring tags as a sentence and training a Word2Vec model on this data.
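
The data preparation for the tag use case can be sketched as follows. The question records are made up, and the `related_tags` helper uses plain co-occurrence counts as a stand-in for a trained model; with gensim available, `tag_sentences` is the kind of list of token lists you would pass to `Word2Vec` instead.

```python
from collections import Counter, defaultdict

# Hypothetical records: each question carries a set of tags
questions = [
    {"id": 1, "tags": ["python", "pandas", "dataframe"]},
    {"id": 2, "tags": ["python", "numpy"]},
    {"id": 3, "tags": ["pandas", "dataframe", "csv"]},
    {"id": 4, "tags": ["java", "spring"]},
]

# Each tag set becomes one "sentence", that is, a list of tokens
tag_sentences = [q["tags"] for q in questions]

def related_tags(sentences, tag, topn=3):
    """Stand-in for a trained model: rank tags by how often they co-occur with `tag`."""
    co_counts = defaultdict(Counter)
    for tags in sentences:
        for a in tags:
            for b in tags:
                if a != b:
                    co_counts[a][b] += 1
    return co_counts[tag].most_common(topn)

pandas_related = related_tags(tag_sentences, "pandas")
print(pandas_related)
```

Unlike raw co-occurrence counts, a Word2Vec model trained on such sentences can also relate tags that never appear together but share similar companions.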