# Word Embedding Models: word2vec #

![oprah vector](http://lklein.com/wp-content/uploads/2021/10/oprah-everyone-3.png)

### Everything gets a vector! ###

We've already been exploring vectors involving words: consider scikit-learn's `CountVectorizer()`, for example, which we used to create the document-term matrix for our tf-idf calculations. That looked at words in relation to the documents in which they appeared.

Today, however, we're going to look at words in relation to all other words in a corpus. The vectors that describe these types of relations are called, appropriately enough, *word vectors*. (And sometimes also *word embeddings*).

### What is a word vector? ###

A *word vector* or *word embedding* is a numerical representation of a word within a corpus, based on its co-occurrence with other words. Linguists have found that much of the meaning of a word can be derived from looking at the contexts in which it appears. (In linguistics, this is known as the theory of *distributional semantics*.)
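To make "based on co-occurrence" a little more concrete, here is a minimal sketch that simply counts which words appear near which other words. The three-sentence corpus, the `cooccurrence_counts` and `context_vector` helpers, and the window of 2 are all made up for illustration; this is raw counting, not what word2vec itself does.

```python
from collections import Counter

# Hypothetical mini-corpus, already lower-cased and tokenized
corpus = [
    ["the", "biscuit", "was", "delicious"],
    ["the", "gravy", "was", "delicious"],
    ["the", "service", "was", "slow"],
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair falls within `window` tokens."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

counts = cooccurrence_counts(corpus)
vocab = sorted({w for sent in corpus for w in sent})

def context_vector(word):
    """One number per vocabulary word: how often it appeared near `word`."""
    return [counts[(word, other)] for other in vocab]

print(vocab)
print("biscuit:", context_vector("biscuit"))
print("gravy:  ", context_vector("gravy"))
print("slow:   ", context_vector("slow"))
```

In this toy corpus "biscuit" and "gravy" keep exactly the same company, so they end up with identical count vectors, while "slow" looks different. That is the distributional idea in miniature.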

### What is Word2Vec? ###

Word2vec is one popular approach to representing words in this numerical format. Conveniently, word2vec is implemented in a library called `gensim`, which we will also use later in the semester for topic modeling.

Word2vec is a *neural-network* or *deep learning* based approach to generating word vectors.

There are many resources out there that will go into the heavy details of deep learning in general or deep learning for NLP, such as Yoav Goldberg's *Neural Network Methods in Natural Language Processing* (Morgan & Claypool Publishers, 2017). Today, you'll get a high-level overview -- just enough for you to understand what word2vec is doing.

### A Picture for Reference ###

Before we get into the details of neural networks and deep learning, let's take a quick look at an image that may help anchor some of the more heady concepts we're about to discuss. This shows us the word pairs for a tiny corpus, consisting of a single sentence, "The quick brown fox jumps over the lazy dog."

The word pairs from this sentence will constitute our training data: what we will use to generate our word vectors. I've used a small window size of 2 just for the example. Most of the time the window size will be somewhat larger, like 5. In any case, the word highlighted in blue is the input word.

![skip-grams](http://mccormickml.com/assets/word2vec/training_data.png)
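The word pairs pictured above can be generated with a few lines of Python. This is a minimal sketch: the `skipgram_pairs` function is a hypothetical helper written for this example (it is not part of gensim), and it assumes the sentence has already been lower-cased and split on whitespace.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, nearby word) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # don't run off the start
        hi = min(len(tokens), i + window + 1)   # or the end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own neighbor
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

print(len(pairs), "training pairs")
for pair in pairs[:6]:
    print(pair)
```

Changing `window` changes how many pairs each word contributes, which is one reason the window size matters so much for what the model ends up learning.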

The neural network is going to learn the statistics from the number of times each pairing shows up. So, for example, the network is probably going to get many more training samples of (“brown”, “fox”) than it is of (“brown”, “unicorn”). When the training is finished, if you give it the word “brown” as input, then it will output a much higher probability for “fox” or “bear” than it will for “unicorn”.

But how are these probabilities generated, and what is a neural network anyway? Let's take a minute to talk through these ideas.

### What is a neural network? ###

Here's a great explainer adapted from Jer Thorp's recent book, [Living in Data](https://www.jerthorp.com/):

"A neural network is a mathematical model that traces its roots back to the 1940s, when logicians and neuroscientists and cyberneticians were trying to explain how the human brain learns. Taking a page from how actual neuron cells work, neural net pioneers conceived of a system of nodes, each a kind of simplified neuron. These nodes hold a number, called an activation potential, above which a node will "fire," sending a signal to one or more other nodes; or, if there is no node to talk to, spitting out a result. Stitched together into collections (networks), these nodes showed the ability to recognize pattern; that is, to take a specific set of numeric inputs and to turn it, consistently, into an expected result."

"Imagine a group of thirteen children in a classroom, sitting in three neat rows of four, with one sitting alone in the back row. Each child can be given either a cookie or a nap. A cookie increases the kid's energy level by one; a nap reduces it by one. If a child's energy goes above a level ten, they have a tantrum, exhausting their excitement but also passing some of it on to any kids they may be connected to. Neural networks tend to be "feed-forward" meaning that the signal can only go in one direction from node to node. In our classroom, we can take this to mean that kids can pass energy back only to those sitting behind them."

"If we feed a plate of cookies to the kids in the front row, we can expect a wave of hysterics to pass from the front to the back of the class, ending with our lonely back-row student in tears. If every child in the front row got the same amount of sugar, and if they all had the same tolerance for it, this wave would be uniform, starting and ending with crying kids. Neural networks function the way they do, though, because the nodes aren't uniform; they are weighted. This means every kid in our class has a different tolerance for cookies, a different level at which they'll break into a conniption. The wave of tears won't flow evenly from front to back, and the signal that we pass into the front won't be the same as the one that comes out the back."

"Assuming the kids in the class are randomly weighted, each with a unique combination of patience and metabolism, feeding different numbers of cookies to the four kids in the front row would result in times when the back-row student loses their temper and times when they don't. Importantly, feeding the same pattern of cookies to the front of the class will always result in the same outcome in the back. This means that the classroom acts together as a pattern-recognition machine. Anything we might be able to translate into "cookie language," a set of four numbers, can be fed into the machine to get a tantrum-based yes or no."

"If the teacher wanted to make sure they got a specific reaction from a specific piece of cookie code, they could reseat or replace the students, feeding the cookies to the front until the teacher saw the answer they wanted from the back. The teacher might train the class to recognize their birth year--2015--or the first four notes in "Baby Shark," or a binary representation for the number twelve. In a school assembly, with many more students, this same system could be arranged to recognize bigger sets of numbers, digitized words, or pixelated faces. More than that, a large network might be trained to recognize signals that are similar: faces that are smiling or words that rhyme with "cheese." A crucial point here is that the kids in the network don't need to know anything about the signal or the desired output. They just eat cookies, cry, nap, and compute."
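If it helps to see the classroom as code, here is a loose sketch of it. Everything in it is an assumption made for illustration: the random weights and thresholds, the fixed seed, the function names, and the simplification that every kid listens to all four kids (or cookie counts) in front of them. It only shows the "same cookies in, same tantrum out" behavior; it does no training.

```python
import random

rng = random.Random(2015)  # fixed seed, so we get the same "class" on every run

def make_row(n_kids, n_inputs):
    """Each kid gets one weight per incoming signal and a personal tantrum threshold."""
    return [
        {
            "weights": [rng.uniform(0.0, 1.0) for _ in range(n_inputs)],
            "threshold": rng.uniform(0.5, 1.5),
        }
        for _ in range(n_kids)
    ]

def run_row(row, signals):
    """A kid outputs 1 (tantrum) if their weighted input exceeds their threshold, else 0."""
    outputs = []
    for kid in row:
        energy = sum(w * s for w, s in zip(kid["weights"], signals))
        outputs.append(1 if energy > kid["threshold"] else 0)
    return outputs

# Three rows of four kids, then one kid alone at the back.
classroom = [make_row(4, 4), make_row(4, 4), make_row(4, 4), make_row(1, 4)]

def feed(cookies):
    """Pass a pattern of four cookie counts from the front of the class to the back."""
    signals = list(cookies)
    for row in classroom:      # feed-forward: signals only move toward the back
        signals = run_row(row, signals)
    return signals[0]          # 1 = the back-row student has a tantrum, 0 = they don't

for pattern in [(2, 0, 1, 5), (0, 0, 0, 0), (3, 3, 3, 3)]:
    print(pattern, "->", feed(pattern))
```

"Reseating or replacing the students" corresponds to changing the numbers stored in `classroom`, which is exactly what a training algorithm does.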

Of course, we're not training a kids-and-cookies neural network. We're training a computational one with a far greater number of nodes.

The important takeaway here, aside from how neural networks go from input to output, is that neural networks are not learning algorithms in themselves. On their own, they just carry a signal from start to finish. To make the output useful, they are usually paired with algorithms that *train* the network, adjusting its weights to improve its performance over multiple iterations.

Which brings us back to the word2vec algorithm and how it trains the neural network at its core.

### Training the neural network ###

Here, we’re going to train our neural network to do something more complicated than predict whether a kid at the back of the room will cry or not. Our task is this: given a specific word in the middle of a sentence (the input word--like "brown," as in the image above), look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.
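Here is a rough sketch of what "the probability for every word in our vocabulary" looks like in code. The weights below are random and untrained, and the names `W_in`, `W_out`, and `hidden_size` are invented for this example. It also computes a full softmax over the vocabulary, whereas real implementations usually rely on faster approximations such as negative sampling.

```python
import math
import random

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_index = {w: i for i, w in enumerate(vocab)}

hidden_size = 3
rng = random.Random(0)
# input-to-hidden weights: one row of `hidden_size` numbers per vocabulary word
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in vocab]
# hidden-to-output weights: one column per vocabulary word
W_out = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(hidden_size)]

def softmax(scores):
    """Turn arbitrary scores into positive numbers that sum to 1."""
    biggest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_nearby(word):
    """Probability, for each vocabulary word, of being a 'nearby word' of `word`."""
    hidden = W_in[word_to_index[word]]
    scores = [
        sum(hidden[k] * W_out[k][j] for k in range(hidden_size))
        for j in range(len(vocab))
    ]
    return dict(zip(vocab, softmax(scores)))

probs = predict_nearby("brown")
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word:6s} {p:.3f}")
```

Because the weights are random, these probabilities are meaningless for now. Training nudges the weights until words that really do appear near "brown" get the higher probabilities.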

Let's look at our image again:

![skip-grams](http://mccormickml.com/assets/word2vec/training_data.png)

So "nearby" is actually defined by the "window size" parameter to the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total). But in the image the window size is 2.

We train the neural network to do this by feeding it these word pairs, and the neural network is going to learn the statistics from the number of times each pairing shows up.

**NOTE:** I've described something called the skip-gram method of generating word vectors. There's also another popular method called CBOW (continuous bag of words). The main difference is that while skip-gram learns vectors by predicting the context words that come before and after our given word $w$, CBOW predicts the center word $w$ given its context words.
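The difference between the two methods is easiest to see in the shape of their training examples. The sketch below is only an illustration: `skipgram_examples` and `cbow_examples` are hypothetical helper names, and a real implementation builds these examples internally rather than as Python lists.

```python
def context_of(tokens, i, window=2):
    """The words up to `window` positions before and after position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the input is the center word, the target is ONE context word."""
    return [
        (center, neighbor)
        for i, center in enumerate(tokens)
        for neighbor in context_of(tokens, i, window)
    ]

def cbow_examples(tokens, window=2):
    """CBOW: the input is ALL the context words, the target is the center word."""
    return [
        (context_of(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]

tokens = "the quick brown fox jumps".split()

print("skip-gram examples for 'brown':")
print([ex for ex in skipgram_examples(tokens) if ex[0] == "brown"])

print("CBOW example for 'brown':")
print(cbow_examples(tokens)[2])
```

So skip-gram turns each position into several small prediction problems, while CBOW turns it into one problem with a bundled-up context.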

### One more wrinkle / bonus  ###

One last wrinkle in the word2vec process is that, in the end, we’re not actually interested in the predictions generated by the model. What we're interested in is the weights of the nodes of the network itself. These are the actual "word vectors" that we want to work with.

We can access them fairly easily because word2vec has only a single hidden (or "projection") layer, as displayed in the image below.

![neural network](http://lklein.com/wp-content/uploads/2019/10/mikolov.png)
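Why do the weights double as word vectors? A small sketch may help, assuming a made-up three-word vocabulary and a hand-written weight matrix `W` with two hidden nodes. Feeding a one-hot input through the hidden layer just picks out one row of `W`, so each row is that word's vector.

```python
vocab = ["brown", "fox", "dog"]

# Hypothetical input-to-hidden weights: one row per word, one column per hidden node
W = [
    [0.2, -0.4],   # brown
    [0.7,  0.1],   # fox
    [-0.3, 0.9],   # dog
]

def one_hot(word):
    """A vector of zeros with a single 1 in the position of `word`."""
    return [1 if w == word else 0 for w in vocab]

def hidden_layer(x):
    """Multiply the input vector by the weight matrix W."""
    n_hidden = len(W[0])
    return [sum(x[i] * W[i][k] for i in range(len(vocab))) for k in range(n_hidden)]

x = one_hot("fox")
hidden = hidden_layer(x)

print("one-hot input:   ", x)
print("hidden layer:    ", hidden)
print("row of W for fox:", W[vocab.index("fox")])
```

The hidden layer's output and the "fox" row of `W` are the same numbers, which is why reading off the rows after training gives us one vector per word.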

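Strictly speaking, a neural network with only one hidden layer is usually called "shallow" rather than "deep." But word2vec comes out of the same family of methods as deep learning and is usually taught alongside it. So we're doing (almost) deep learning! Fancy!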

# Let's try it out!

## Import gensim, nltk tokenizers, glob, and Path

In [None]:
import gensim # remember this from last class?

# and some other stuff
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # newer NLTK releases look for this resource instead of 'punkt'
import glob
from pathlib import Path

## Load in our corpus

In [None]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# then download the zip files
# atlanta
gdown.download('https://drive.google.com/uc?export=download&id=1gIm9NcoeY1gn9EQjRr2MojGRJ-fpBSqz', quiet=False)

# unzip it
!unzip Atlanta-random.jsonl.zip

### Process the docs

As a first step, we'll need to create a list of all the reviews in the `Atlanta-random.jsonl` file, with each review stored as a single string. This is the same exact code we used last class, but condensed a little.

In [None]:
# import libraries
import os             # for directory/file manipulation
import json           # for json
import pandas as pd   # for dataframes
import textwrap       # for nice formatting

# read in the file
atlanta_reviews_df = pd.read_json(path_or_buf="./Atlanta-random.jsonl", lines=True)

# first extract the 'comment' values from the dataframe
comments = atlanta_reviews_df['comment'].tolist()

# create list to store reviews
all_reviews = []

# iterate through the comments and append the reviews to the list
for comment in comments:
  all_reviews.append(comment['text'])

# print out the length just to check that everything got in
len(all_reviews)

We learned last class that there were some extra HTML tags embedded in the review text, and we also saw some hex codes. Let's see if we can clean things up a little before we move further.

In [None]:
from bs4 import BeautifulSoup

# new array w/ clean text
all_reviews_clean = []

for review in all_reviews:
    soup = BeautifulSoup(review, "html.parser")
    text = soup.get_text(separator=' ')

    all_reviews_clean.append(text)

# print out first one just to check
print(textwrap.fill(all_reviews_clean[0], 100))

One last thing before we get started. Let's create some legible IDs for each review using some other info that's in the dataframe.

In [None]:
# extract the 'business' values from the dataframe
businesses = atlanta_reviews_df['business'].tolist()

# create list to store business aliases
aliases = []

# iterate through the business and append the alias to the list
for business in businesses:
  aliases.append(business['alias'])

# extract ratings
ratings = atlanta_reviews_df['rating'].tolist()

# create list to store IDs <-- we'll use this list going forward
ids = []

# now put them all together into IDs
for i, alias in enumerate(aliases):
  review_id = alias + "-review" + str(i) + "-" + str(ratings[i]) + "stars"  # avoid shadowing the built-in id()
  ids.append(review_id)

# print out the first one to check
ids[0]

Next, we need to get each of the docs in our `all_reviews_clean` list into the format required by gensim's implementation of word2vec.

We know from the gensim documentation (and also common sense) that the input to `word2vec` is sentences. So let's define a function that takes a list of texts (e.g. our `all_reviews_clean` list) and converts it into sentences for gensim word2vec to use. The function will lower-case the text and tokenize it by sentence and by word. It will also print out a count of the sentences in each doc, so that we get some sort of status indicator that it's parsing the sentences correctly.

In [None]:
# need our handy nltk tokenizer
tokenizer = TreebankWordTokenizer()

# and the function
def make_sentences(list_txt):
    all_txt = []
    counter = 0
    for txt in list_txt:
        lower_txt = txt.lower()
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
        print(ids[counter]) # let's print the ID of the review (from the `ids` list we built above)
        print("Sentences: " + str(len(sentences)))  # let's check how many sentences there are per review
        counter += 1
    return all_txt

In [None]:
# now let's run it
sentences = make_sentences(all_reviews_clean)

## Train model

Now that we have our corpus ready for gensim, we can train the model. To do so, we call `gensim.models.Word2Vec()`, which builds and trains a model in one step. It accepts a couple dozen parameters, some of which are more important than others.

Here are a few major ones. Strictly speaking, gensim supplies a default for every parameter, but there are two that you will almost always want to set yourself; these are marked with an asterisk:

1. `sentences*`: This is where you provide your data. It must be an iterable of iterables: for example, a list of sentences, where each sentence is a list of word tokens.
2. `sg`: Your choice of training algorithm. There are two standard ways of training W2V vectors -- 'skipgram' and 'CBOW'. If you enter 1 here, skip-gram is applied; otherwise, the default (0) is CBOW.
3. `vector_size*`: This is the length of your resulting word vectors. The default is 100. If you have a large corpus (more than a few billion tokens) you can go up to around 300 dimensions. More dimensions can capture finer distinctions, but they also need more data and take longer to train.
4. `window`: This is the window of context words you are training on. In other words, how many words before and after your given word count as context. The default is 5, and 4 or 5 is a good starting point, but this can vary depending on what you are interested in. As a rough rule of thumb, smaller windows tend to pick up words that behave alike grammatically (words that could substitute for one another), while larger windows tend to pick up words that are topically related.
5. `alpha`: The learning rate of your model. If you are interested in machine learning experimentation with your vectors you may experiment with this parameter.
6. `seed` (int): This is the random seed for your random initialization. All deep learning models initialize the weights with random floats before training. This is a useful field if you want to replicate your experiments because giving this a seed will initialize 'randomly' deterministically. (For a fully reproducible run in gensim, you also need to train with a single worker thread, `workers=1`.)
7. `min_count`: This is the minimum frequency threshold. If a given word appears fewer times than this, it will be ignored. The default is 5. This is here because words with very low frequency are hard to train.
8. `epochs`: This is the number of iterations (entire runs) over the corpus, also known as epochs. Default is 5. Usually anything between 1-10 is ok. The trade-offs are that with more iterations, the model will take longer to train and may overfit on your dataset. However, longer training will allow your vectors to perform better on tasks relevant to your dataset.

Most of these settings will not concern us. As you'll see below, we are only going to use four arguments.

**NOTE:** In gensim 3.x and earlier, `vector_size` was called `size` (and `epochs` was called `iter`). The code below uses the gensim 4 names.

In [None]:
# let's train our model!
atl_reviews_model = gensim.models.Word2Vec(
    sentences,
    min_count=2, # default is 5; this trims the corpus for words only used once;
    vector_size=100,
    workers=5) # parallel processing; needs Cython

Hooray! We have a trained word2vec model: `atl_reviews_model`!

### Save model — and load it

It's often useful to save your trained model to disk so that you can reload it as needed. This is very similar syntax to saving topic models.

In [None]:
# how to store the above files to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

atl_reviews_model.save('/content/gdrive/My Drive/atl_reviews_model')

And you can load an old model in the same way:

In [None]:
# how you would load an old model from your own google drive
old_model = gensim.models.Word2Vec.load('/content/gdrive/My Drive/atl_reviews_model')

## Let's play!

### Similarity

word2vec can tell us which words, according to its model, are most similar to any other. We call `model.wv.most_similar("word", topn=number of similar words)`. Let's try "delicious."

In [None]:
# testing some basic functions

# basic similarity w/ adjectives
atl_reviews_model.wv.most_similar("delicious", topn=10)

In [None]:
# basic similarity w/ nouns
atl_reviews_model.wv.most_similar("biscuit", topn=10)

## Exercise 1:

**Copy the code above and test out some words until you find one that has some interesting (or problematic) similar words.**

In [None]:
# your answer here

## Similarity between two words

We can choose specific words to compare with `model.wv.similarity(w1="word_one",w2="word_two")`

In [None]:
# similarity b/t two words

print(atl_reviews_model.wv.similarity(w1="meatballs",w2="delicious"))
print(atl_reviews_model.wv.similarity(w1="meatballs",w2="disgusting"))

As expected (meatball stan here), more reviews in the model find meatballs delicious than disgusting.
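Under the hood, the number that `similarity()` reports is the cosine similarity between the two word vectors: roughly, how closely the two vectors point in the same direction. Here is a minimal sketch of that calculation. The three-dimensional vectors are invented for illustration and are not taken from our trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v, divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

# Hypothetical toy vectors
meatballs = [0.9, 0.8, 0.1]
delicious = [0.8, 0.9, 0.2]
disgusting = [-0.7, 0.1, 0.9]

print(cosine_similarity(meatballs, delicious))
print(cosine_similarity(meatballs, disgusting))
```

Scores range from -1 (pointing in opposite directions) to 1 (pointing in the same direction), so a higher score means the model treats the two words as more alike.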

### Analogy

We can also play with analogy tasks. The commonly seen task is:

'Man is to King as Woman is to ____?'

The general structure is:
`A is to A* as B is to B*`
                         
gensim provides two different ways of implementing this task. You may be familiar with the additive version, also called the 3CosAdd method:

$$\underset{b^* \in V}{\arg\max} \left( \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*) \right)$$

This reflects the abstraction of Woman - Man + King. In this maximization, we are searching which word vector will allow us to produce the highest value in this equation.
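The maximization can be sketched by hand with a few invented vectors. In the sketch below, $a$ is "man", $a^*$ is "king", and $b$ is "woman"; the three-dimensional `vectors`, the `three_cos_add` function, and the extra candidate words are all assumptions made for illustration, not values from our model.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hypothetical toy vectors
vectors = {
    "man":     [1.0, 0.0, 0.1],
    "woman":   [0.0, 1.0, 0.1],
    "king":    [1.0, 0.0, 1.0],
    "queen":   [0.0, 1.0, 1.0],
    "prince":  [1.0, 0.0, 0.8],
    "biscuit": [0.1, 0.1, 0.0],
}

def three_cos_add(a, a_star, b):
    """Score every other word as a candidate b* and return them best-first."""
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in (a, a_star, b):
            continue  # the query words themselves are not allowed as answers
        scores[candidate] = (
            cos(vec, vectors[b]) - cos(vec, vectors[a]) + cos(vec, vectors[a_star])
        )
    return sorted(scores.items(), key=lambda item: -item[1])

ranking = three_cos_add(a="man", a_star="king", b="woman")
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With these hand-picked vectors "queen" comes out on top. With vectors learned from real text, the winner depends entirely on what the corpus says, which is where things get interesting (or problematic).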

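We can implement this method with a built-in function, `most_similar()`. Here, `positive` refers to the words whose vectors are added (they contribute positively to the similarity), and `negative` refers to the words whose vectors are subtracted (they contribute negatively). In gensim's other, multiplicative version (`most_similar_cosmul()`, also called 3CosMul), the positive words end up in the numerator and the negative words in the denominator.

Or, in simpler language, you can also think of this as: "start at the 'professor' vector, add the 'she' vector, subtract the 'he' vector; from where you wind up, report the top-ranked word vectors closest to that point (not including any of the three query vectors)."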

Here it is:

In [None]:
# analogies
# format is: "woman is to knowledgable as man is to ???"

result = atl_reviews_model.wv.most_similar(positive=['knowledgable', 'man'], negative=['woman'])

print("{}: {:.4f}".format(*result[0])) # this prints the top result

In [None]:
# analogies
# format is: "chinese is to authentic as italian is to ???"

result = atl_reviews_model.wv.most_similar(positive=['authentic', 'italian'], negative=['chinese'])

print("{}: {:.4f}".format(*result[0])) # this prints the top result

## Exercise 2:

**Copy the code above and test out some analogies until you find one that gets you some interesting (or problematic) results.**

In [None]:
# your code here

## There's so much more!

gensim has quite a few built-in tools, and it's worth taking some time to see what's available. Check the documentation here: [https://radimrehurek.com/gensim/models/keyedvectors.html](https://radimrehurek.com/gensim/models/keyedvectors.html)


## BONUS: Visualization!

Find below some code you can use to make visualizations from your word2vec model. We can't visualize all the many dimensions in our model, so we need to reduce them to two dimensions for our meager human brains. We do that with something called principal component analysis (PCA).

Don't worry about the details for now. This is just a fun way to take a look at the output of our model.

**Remember**: Our visualization reduces MANY dimensions to two, so a lot of information is lost.
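If you're curious just how much is lost, here is a small sketch that does a bare-bones version of PCA by hand with NumPy and reports the share of the variance that survives in two dimensions. The vectors are random stand-ins (six hypothetical "words" with 50 dimensions each), not our trained embeddings, and the variable names are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 50))  # hypothetical: 6 "words", 50 dimensions each

# PCA by hand: center the data, then project onto the top two principal directions
centered = toy_vectors - toy_vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
two_dim = centered @ components[:2].T

# share of the total variance that survives in the 2-D picture
variance = singular_values ** 2
kept = variance[:2].sum() / variance.sum()

print(two_dim.shape)
print(f"variance kept in two dimensions: {kept:.0%}")
```

Real word vectors are more structured than random noise, so the share kept will differ, but it is worth checking before reading too much into a two-dimensional plot.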

In [None]:
### Let's do some visualization ###

import numpy as np

# Get the interactive Tools for Matplotlib
# %matplotlib notebook # doesn't work for colab
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [None]:
def display_pca_scatterplot(model, words=None, sample=0):
    # if no words are given, plot a random sample of the vocabulary (or all of it)
    if words is None:
        if sample > 0:
            # gensim 4 keeps the vocabulary in `index_to_key` (gensim 3 used `wv.vocab`)
            words = np.random.choice(model.wv.index_to_key, sample)
        else:
            words = list(model.wv.index_to_key)

#    word_vectors = np.array([model[w] for w in words]) <-- gensim 3 version
    word_vectors = np.array([model.wv[w] for w in words]) # gensim 4 version

    # run PCA and keep only the first two components
    twodim = PCA().fit_transform(word_vectors)[:,:2]

    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(atl_reviews_model, ['italian','french','american','korean','japanese','mexican','chinese'])

# display_pca_scatterplot(atl_reviews_model, sample=20)

## Exercise 3a:

**Copy the code above and plot some words that you think might be similar or different from each other.**

In [None]:
# your plot here

## Exercise 3b:

**What do you think the plot shows you about the words? Did they confirm or contradict what you thought they would show?**

In [None]:
# your answer here


*Lauren F. Klein wrote version 1.0 of this notebook in 2019 based on the [Advanced Topics in Word Vectors workshop](https://dh2018.adho.org/en/machine-reading-part-ii-advanced-topics-in-word-vectors/) at DH 2018 as well as tutorials by [Radim Rehurek](https://rare-technologies.com/word2vec-tutorial/) and [Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/). It was updated again in 2021, 2022, and 2024.*