<a href="https://colab.research.google.com/github/olinml2024/notebooks/blob/main/ML24_Assignment11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on With Word2Vec

In this notebook you are going to take word2vec embeddings for a spin.  These embeddings were introduced in [a very influential paper in the field](https://arxiv.org/abs/1301.3781).  The field of natural language processing (NLP) has certainly progressed far beyond word2vec, but it remains a useful example to learn about.

We'll start out by loading a commonly used set of pre-trained word embeddings that were trained on the Google News dataset.

In [2]:
import gdown

gdown.download(id='0B7XkCwpI5KDYNlNUTTlSS21pQmM')
!gunzip -k GoogleNews-vectors-negative300.bin.gz

Downloading...
From (original): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=d737beb6-68e2-442e-a046-55cb7613d446
To: /content/GoogleNews-vectors-negative300.bin.gz
100%|██████████| 1.65G/1.65G [00:26<00:00, 62.9MB/s]


'GoogleNews-vectors-negative300.bin.gz'

Next, we can parse the vectors and calculate words that are most similar to a test word.

In [4]:
from gensim.models import KeyedVectors

# Load model from local path
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [34]:
model.most_similar('cat')

[('cats', 0.8099379539489746),
 ('dog', 0.760945737361908),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326234579086304),
 ('beagle', 0.7150582671165466),
 ('puppy', 0.7075453400611877),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931973457336),
 ('chihuahua', 0.6709762215614319)]
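
The scores above are cosine similarities between embedding vectors. As a minimal sketch of that idea, the cell below ranks neighbors by cosine similarity using a few hypothetical 3-dimensional vectors (the `toy_vectors` values are invented for illustration and are not taken from the model; `cosine_similarity` and `rank_neighbors` are helper names made up here).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "cat": np.array([1.0, 0.9, 0.1]),
    "dog": np.array([0.9, 1.0, 0.2]),
    "computer": np.array([0.1, 0.0, 1.0]),
}

def rank_neighbors(word, vectors):
    """Return the other words sorted by decreasing cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

neighbors = rank_neighbors("cat", toy_vectors)
print(neighbors)
```

With the real embeddings loaded, `model.similarity('cat', 'dog')` should report the same kind of score for a single pair of words.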

It's also possible to explore how the word embeddings themselves behave as a vector space (e.g., what happens when you add or subtract embeddings from each other).  Below is an example of the famous analogy from the original paper.

In [19]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
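
To make the vector arithmetic behind this query more concrete, here is a rough sketch under stated assumptions: each input vector is scaled to unit length, the positive vectors are added and the negative ones subtracted, and the remaining words are ranked by cosine similarity to the result, skipping the input words. This mirrors our understanding of what `most_similar` does, but it is not gensim's actual code. The `toy` vectors and the `analogy` helper are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical toy embeddings; the dimensions loosely stand for
# (royalty, male, female) and are invented for illustration only.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.3, 0.3]),
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.sum([unit(vectors[w]) for w in positive], axis=0)
    query = query - np.sum([unit(vectors[w]) for w in negative], axis=0)
    query = unit(query)
    excluded = set(positive) | set(negative)
    scores = [
        (word, float(np.dot(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in excluded  # the input words are not allowed as answers
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

result = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(result)
```

Excluding the input words matters in practice: without that step, the nearest neighbor of the combined vector is often one of the query words themselves.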

You can also look for outliers using the ``doesnt_match`` function.

In [38]:
model.doesnt_match(['orange', 'banana', 'computer'])

'computer'
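
One plausible way to find an outlier like this is to compare each word with the average of the group. The sketch below assumes that approach: normalize each vector, average them, and return the word least similar to that average. We believe gensim works along these lines, but treat this as an illustration rather than the library's implementation. The vectors and the `odd_one_out` helper are hypothetical.

```python
import numpy as np

# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "orange":   np.array([0.9, 0.8, 0.1]),
    "banana":   np.array([0.8, 0.9, 0.0]),
    "computer": np.array([0.1, 0.0, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    normed = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centroid = normed.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    similarities = normed @ centroid  # cosine similarity of each word to the centroid
    return words[int(np.argmin(similarities))]

outlier = odd_one_out(["orange", "banana", "computer"], toy_vectors)
print(outlier)
```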

## Notebook Exercise 1

Play around with ``model.most_similar`` or ``model.doesnt_match``.  Summarize your observations.

*Warning:* it's possible you may uncover some disturbing associations here.  We want you to share what you found, and we want you to be prepared that some of what you find may be shocking.  The word2vec model is known to contain gender and racial bias, which you'll read about later in this assignment.

We can access the word embeddings directly using the following syntax.

In [20]:
print(f"word embedding shape is {model['boy'].shape}")
print(model['boy'])

word embedding shape is (300,)
[ 2.35351562e-01  1.65039062e-01  9.32617188e-02 -1.28906250e-01
  1.59912109e-02  3.61328125e-02 -1.16699219e-01 -7.32421875e-02
  1.38671875e-01  1.15356445e-02  1.87500000e-01 -2.91015625e-01
  1.70898438e-02 -1.84570312e-01 -2.87109375e-01  2.54821777e-03
 -2.19726562e-01  1.77734375e-01 -1.20605469e-01  5.39550781e-02
  3.78417969e-02  2.49023438e-01  1.76757812e-01  2.69775391e-02
  1.21093750e-01 -3.51562500e-01 -5.83496094e-02  1.22070312e-01
  5.97656250e-01 -1.60156250e-01  1.08398438e-01 -2.40478516e-02
 -1.16699219e-01  3.58886719e-02 -2.37304688e-01  1.15234375e-01
  5.27343750e-01 -2.18750000e-01 -4.54101562e-02  3.30078125e-01
  3.75976562e-02 -5.51757812e-02  3.26171875e-01  6.74438477e-03
  3.71093750e-01  3.68652344e-02  6.68945312e-02  5.17578125e-02
 -4.76074219e-02 -7.91015625e-02  4.46777344e-02  1.67968750e-01
  5.51757812e-02 -2.91015625e-01  2.59765625e-01 -1.00097656e-01
 -1.09863281e-01 -9.15527344e-03  2.63671875e-02 -3.4423828

We leave it up to you to decide whether to explore these embeddings further.  They can be used, e.g., as a way to encode words as input features for a supervised learning task (e.g., sentiment classification).  If you're interested in combining word embeddings with supervised learning, you can check out [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) on the Kaggle website.
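
As one simple (and by no means the only) way to turn word embeddings into features for a classifier, you could average the embeddings of the words in a piece of text. The sketch below shows this with hypothetical 2-dimensional vectors; `average_embedding` is a helper name made up here, and words missing from the vocabulary are simply skipped.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    """Average the embeddings of the known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

features = average_embedding("a great movie".split(), toy_vectors, dim=2)
print(features)  # [0.5 0.5]
```

With the real model, the same function should work by passing `model` as `vectors` and `dim=300`, since the loaded `KeyedVectors` object supports both `word in model` and `model[word]`. The resulting fixed-length vectors could then be fed to any standard classifier.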