## Additional Exercises for 04.10: Word2Vec

## Question 0
We'll use corpora from the `nltk` package to work through more examples of `Word2Vec`.
1. Import the `movie_reviews` corpus from `nltk.corpus`.
    - You may need to download the corpus first if you have never used it before.
2. Fit a model to the contents of `movie_reviews`.
    - To retrieve the content as tokenized sentences, call `movie_reviews.sents()`.

In [None]:
import nltk
nltk.download('movie_reviews')

In [60]:
from gensim.models import Word2Vec
from nltk.corpus import movie_reviews

content = movie_reviews.sents()
content

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]

In [61]:
model = Word2Vec(content)

## Question 1
1. Find the similarities between different words that might appear in movie reviews.
    - e.g. `speeding` and `cars`.
2. For any word you like, find the words most similar to it.
    - e.g. `battle`.

In [54]:
# Query through model.wv; calling similarity() on the model itself was removed in gensim 4.0.
model.wv.similarity('speeding', 'cars')

0.79489191769859613

In [56]:
model.wv.most_similar('battle')

[('fight', 0.8269006013870239),
 ('chase', 0.8190931081771851),
 ('thrown', 0.8165372610092163),
 ('shots', 0.8126580119132996),
 ('showdown', 0.8015738129615784),
 ('confrontation', 0.7930744886398315),
 ('players', 0.7896624803543091),
 ('battles', 0.7890263199806213),
 ('closing', 0.7828884720802307),
 ('flare', 0.7728803157806396)]
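
Both calls above rest on cosine similarity between word vectors: `similarity` returns the cosine of the angle between two vectors, and `most_similar` ranks the rest of the vocabulary by that score. Here is a minimal pure-Python sketch of the idea. The `toy_vectors` values and the helper names are made up for illustration and are not taken from the trained model.

```python
import math

# Hypothetical toy embeddings; real Word2Vec vectors have 100 dimensions by default.
toy_vectors = {
    "battle": [0.9, 0.1, 0.0],
    "fight": [0.8, 0.2, 0.1],
    "romance": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("battle", toy_vectors)
print(ranking)
```

In this toy vocabulary `fight` ranks above `romance` because its vector points in nearly the same direction as the one for `battle`.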

## Question 2
1. Experiment with the parameters used to build the model and see how the results change.

### Word2Vec Parameters
<ul>
<li>size: Number of dimensions of the word embeddings (renamed vector_size in gensim 4.0)</li>
<li>window: Maximum distance, in words, between the target word and a context word, in each direction</li>
<li>min_count: Minimum corpus frequency for a word to be included in the model</li>
<li>sg (Skip-Gram): 0 selects the CBOW model; 1 selects Skip-Gram</li>
<li>alpha: Initial learning rate; a smaller value keeps the model from over-correcting and enables finer tuning</li>
<li>iter: Number of passes through the dataset (renamed epochs in gensim 4.0)</li>
<li>batch_words: Target number of words in each batch of examples handed to a worker thread</li>
<li>workers: Number of worker threads; use a single worker (workers=1) to remove thread-scheduling jitter when reproducibility matters</li>
</ul>
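
To make `min_count` and `window` concrete, here is a simplified pure-Python sketch of how a corpus might be turned into (target, context) training pairs. The function names and toy sentences are illustrative assumptions. gensim's real implementation differs in details; for example, it also samples a smaller effective window at random.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop words that occur fewer than `min_count` times in the whole corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """Pair each word with the words up to `window` positions on either side."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


# Hypothetical two-sentence corpus in the same shape as movie_reviews.sents().
toy_sentences = [
    ["they", "get", "into", "an", "accident"],
    ["they", "get", "into", "a", "fight"],
]

kept = filter_rare_words(toy_sentences, min_count=2)
pairs = context_pairs(kept[0], window=1)
print(kept[0])
print(pairs)
```

Raising `min_count` shrinks the vocabulary before any pairs are formed, while widening `window` increases the number of context words paired with each target.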

In [62]:
# Parameter names follow gensim < 4.0; gensim 4.0 renamed size -> vector_size and iter -> epochs.
model2 = Word2Vec(content, size=100, window=5, min_count=1, sg=1,
                  alpha=0.025, iter=5, batch_words=10000, workers=1)
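
The `alpha` value passed to `Word2Vec` is only the starting learning rate. gensim's documentation describes the rate dropping linearly toward `min_alpha` as training progresses, and the sketch below illustrates such a schedule. The function name and the `progress` argument are illustrative, and `min_alpha=0.0001` is assumed to match gensim's default.

```python
def linear_learning_rate(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training is done."""
    rate = alpha - (alpha - min_alpha) * progress
    # Never let the rate fall below the floor, even if progress overshoots 1.0.
    return max(min_alpha, rate)


for progress in (0.0, 0.5, 1.0):
    print(progress, linear_learning_rate(progress))
```

A larger starting `alpha` makes early updates more aggressive, and the decay toward `min_alpha` is what allows the finer tuning late in training.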

In [63]:
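# Note: this queries the original `model`, which produced the recorded output; use `model2` to inspect the retrained model.
model.wv.similarity('speeding', 'cars')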

0.79800621595734733

In [64]:
# Note: this queries the original `model`; swap in `model2` to inspect the retrained model.
model.wv.most_similar('battle')

[('fight', 0.8245350122451782),
 ('chase', 0.8221967220306396),
 ('shots', 0.819613516330719),
 ('thrown', 0.8159714937210083),
 ('showdown', 0.8012252449989319),
 ('confrontation', 0.7929671406745911),
 ('battles', 0.7927889823913574),
 ('players', 0.7824958562850952),
 ('closing', 0.7824680805206299),
 ('sexual', 0.7723293304443359)]