# Word2Vec

In [None]:
# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd

In [None]:
# Load the GloVe vectors (Stanford NLP, glove.6B)
! mkdir -p ~/downloads
! wget -nc -P ~/downloads http://nlp.stanford.edu/data/glove.6B.zip
! unzip -d ~/downloads/glove -o ~/downloads/glove.6B.zip

In [None]:
import os

from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors

# Same location the download cell above unzips into (~/downloads/glove)
glove_file = os.path.expanduser('~/downloads/glove/glove.6B.50d.txt')
tmp_file = get_tmpfile("glove_word2vec.txt")

# call glove2word2vec script
# default way (through CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_file, tmp_file)

model = KeyedVectors.load_word2vec_format(tmp_file)

This will take about a minute.
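
The conversion step above is needed because the two text formats differ only in that the word2vec format starts with a header line giving the number of vectors and their dimension. A minimal sketch of that idea follows; `glove_to_word2vec` and the toy file names are hypothetical helpers for illustration, not part of gensim, and the sketch reads the whole file into memory.

```python
import os
import tempfile

def glove_to_word2vec(glove_path, w2v_path):
    """Copy a GloVe text file to w2v_path, prepending a '<count> <dim>' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    num_vectors = len(lines)
    # Each line is "<word> <v1> ... <vd>", so the dimension is the field count minus one.
    dim = len(lines[0].split()) - 1
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {dim}\n")
        dst.writelines(lines)
    return num_vectors, dim

# Tiny made-up GloVe-style file (two 3-dimensional vectors) for demonstration only.
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")

shape = glove_to_word2vec(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```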

In [None]:
# `vocab` is the gensim 3.x attribute; gensim 4.x replaced it with `key_to_index`
type(model.vocab)

In [None]:
# Number of Vectors
len(model.vocab.keys())

In [None]:
# Size of the Vectors
model.vector_size

## Question 2
**Exploring Word2Vec Vectors**
* Print out a few word vectors from the GloVe set
* Print out the similarity between the following pairs (feel free to experiment with more if you like):
  * baseball, bat
  * baseball, ocean
  * bat, fly
* What sorts of patterns do you notice?  Where does it succeed?  Where does it fail?  How might one improve it?
* Print out the most similar words to the following words:
  * baseball
  * president
* Print out words similar to the positive words and dissimilar to the negative words for the following positive/negative groups:
* Print out the words that don't match the others in each of the following groups:

In [None]:
# Word Vectors
model.word_vec('baseball')

In [None]:
# Pairwise Similarity
model.similarity('obama', 'clinton')

In [None]:
model.similarity('obama', 'reagan')

* Word Sense Disambiguation

In [None]:
# Most similar words
model.similar_by_word('obama')

In [None]:
# Positive Negative Similar Words
model.most_similar(positive=['obama', 'clinton'], negative=['president'])

In [None]:
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=10)

In [None]:
model.most_similar("brooklyn")

In [None]:
# Words that don't match
model.doesnt_match(['breakfast', 'lunch', 'dinner', 'baseball'])

## Question 3


**Word Similarities**

In [None]:
# Comparing via Cosine Similarity
model.n_similarity(['obama', 'president'], ['clinton', 'president'])
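
As a rough illustration of what a set-to-set cosine comparison like `n_similarity` involves: average the vectors in each word list, then take the cosine of the two averages. The sketch below assumes made-up 3-dimensional vectors and a hypothetical helper `set_similarity`; it does not use the real GloVe values.

```python
import numpy as np

def set_similarity(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word lists."""
    mean_a = np.mean([vectors[w] for w in words_a], axis=0)
    mean_b = np.mean([vectors[w] for w in words_b], axis=0)
    return float(np.dot(mean_a, mean_b) / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

# Made-up toy vectors, for illustration only.
toy = {
    "obama": np.array([1.0, 0.2, 0.0]),
    "clinton": np.array([0.8, 0.4, 0.0]),
    "president": np.array([0.5, 0.5, 0.1]),
    "ocean": np.array([0.0, 0.1, 1.0]),
}

same_topic = set_similarity(toy, ["obama", "president"], ["clinton", "president"])
off_topic = set_similarity(toy, ["obama", "president"], ["ocean"])
print(same_topic, off_topic)
```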

In [None]:
# Comparing via Word Mover's Distance
model.wmdistance(['obama', 'president'], ['clinton', 'president'])

In [None]:
# Comparing via Word Mover's Distance
model.wmdistance(['obama', 'president'], ['huckabee', 'president'])

The distance is higher because Huckabee was never president.
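
Word Mover's Distance asks how far the words of one document must "travel" in vector space to become the words of the other. In the special case where both documents have the same number of distinct words, each occurring once, it reduces to the cheapest one-to-one pairing of words, averaged over the pairs. The sketch below covers only that special case, with made-up vectors and a hypothetical helper `wmd_equal_length`; it brute-forces all pairings, so it is only practical for very short documents.

```python
from itertools import permutations

import numpy as np

def wmd_equal_length(vectors, doc_a, doc_b):
    """WMD for two equal-length documents with no repeated words (special case)."""
    if (len(doc_a) != len(doc_b)
            or len(set(doc_a)) != len(doc_a)
            or len(set(doc_b)) != len(doc_b)):
        raise ValueError("sketch only handles equal-length documents without repeats")
    n = len(doc_a)
    # Euclidean distance between every word in doc_a and every word in doc_b.
    cost = np.array([[np.linalg.norm(vectors[a] - vectors[b]) for b in doc_b]
                     for a in doc_a])
    # Try every one-to-one pairing and keep the cheapest total.
    best = min(sum(cost[i, p[i]] for i in range(n)) for p in permutations(range(n)))
    return float(best / n)

# Made-up toy vectors, for illustration only.
toy = {
    "obama": np.array([1.0, 0.2, 0.0]),
    "clinton": np.array([0.8, 0.4, 0.0]),
    "huckabee": np.array([0.2, 1.0, 0.0]),
    "president": np.array([0.5, 0.5, 0.1]),
}

near = wmd_equal_length(toy, ["obama", "president"], ["clinton", "president"])
far = wmd_equal_length(toy, ["obama", "president"], ["huckabee", "president"])
print(near, far)
```

Because "president" appears in both documents, it pairs with itself at zero cost, and the distance comes entirely from the remaining pair of names.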

## Vocabulary Features

Each word is represented by an array of 50 features (the `glove.6B.50d` file loaded above).

In [None]:
len(model.word_vec('cat'))

In [None]:
model.word_vec('cat')[:20]

The cosine similarity between words can be computed and produces intuitive trends.

In [None]:
print(model.similarity('cat', 'cat'))
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'car'))

In [None]:
print(model.similarity('car', 'truck'))
print(model.similarity('car', 'drive'))
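
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, assuming made-up 3-dimensional vectors rather than the real GloVe values:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors, for illustration only.
cat = np.array([1.0, 0.9, 0.1])
dog = np.array([0.9, 1.0, 0.2])
car = np.array([0.1, 0.2, 1.0])

sim_cat_cat = cosine_similarity(cat, cat)
sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_car = cosine_similarity(cat, car)
print(sim_cat_cat, sim_cat_dog, sim_cat_car)
```

A word compared with itself scores 1 (up to floating-point rounding), and vectors pointing in similar directions score close to 1.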

Word vectors capture some interesting relationships between words, such as the analogy between **man --> king** and **woman --> queen**.

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)
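
The analogy query can be pictured as vector arithmetic followed by a nearest-neighbour search. The sketch below is a simplified version under stated assumptions: the 4-dimensional vectors are made up (dimensions loosely meaning male, female, royal, fruit), `analogy` is a hypothetical helper, and the real `most_similar` may differ in its details.

```python
import numpy as np

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to mean(positive) - mean(negative) style query."""
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    exclude = set(positive) | set(negative)  # never return the input words
    scores = {w: float(np.dot(query, unit(v)))
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Made-up toy vectors, for illustration only.
toy = {
    "man": np.array([1.0, 0.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0, 0.0]),
    "queen": np.array([0.0, 1.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

result = analogy(toy, positive=["woman", "king"], negative=["man"], topn=2)
print(result)
```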

In [None]:
# Paris is to France as ? is to Australia (ideally the Australian capital)
# most_similar computes [positive] - [negative]: paris + australia - france

model.most_similar(positive=['paris', 'australia'], negative=['france'], topn=10)

In [None]:
# Spots are to leopards as ? is to zebras (ideally "stripes")
# most_similar computes [positive] - [negative]: spots + zebras - leopards

model.most_similar(positive=['spots', 'zebras'], negative=['leopards'], topn=10)

##### Question: Why does the previous command take so much longer than others?
Because it has to generate a new vector (for example **woman** + **king** - **man**), compare that vector to all 400,000 vectors in the GloVe vocabulary, then sort to find the closest `topn` (10 in the cells above). The stored vectors are kept in a form that can be compared quickly, but a newly built vector is not.
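
The cost of such a query can be sketched as one dot product per vocabulary word followed by a sort. The example below uses a random matrix as a stand-in for the embedding matrix (10,000 rows here to keep it quick; the sizes and variable names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
num_words, dim, topn = 10_000, 50, 10  # stand-in sizes, not the real vocabulary

# Random unit vectors standing in for the normalised embedding matrix.
matrix = rng.normal(size=(num_words, dim))
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

# A new query vector built from three rows, like positive/negative arithmetic.
query = matrix[0] + matrix[1] - matrix[2]
query /= np.linalg.norm(query)

scores = matrix @ query            # one dot product per vocabulary word
top = np.argsort(-scores)[:topn]   # sort all scores, keep the best topn
print(top, scores[top])
```

Both the matrix product and the sort scale with the vocabulary size, which is why larger vocabularies make each query slower.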

In [None]:
model.wmdistance("Obama is the president of the United States".lower().split(),
                 "Bush was the president of the United States".lower().split())

It can also detect words that don't belong in a sequence:

In [None]:
model.doesnt_match("breakfast cereal dinner lunch".split())
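
One way to picture how an odd word out can be found: average the unit vectors of all the words, then pick the word least similar to that average. The sketch below assumes made-up 2-dimensional vectors and a hypothetical helper `odd_one_out`; the real `doesnt_match` may differ in detail, and the toy result says nothing about what the GloVe model returns.

```python
import numpy as np

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vecs = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit_vecs.mean(axis=0)
    centre /= np.linalg.norm(centre)
    sims = unit_vecs @ centre  # cosine similarity of each word to the centre
    return words[int(np.argmin(sims))]

# Made-up toy vectors, for illustration only.
toy = {
    "breakfast": np.array([0.9, 0.1]),
    "lunch": np.array([1.0, 0.2]),
    "dinner": np.array([0.8, 0.1]),
    "baseball": np.array([0.1, 1.0]),
}

outlier = odd_one_out(toy, ["breakfast", "lunch", "dinner", "baseball"])
print(outlier)
```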