# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Word Vectors

> Creator: Matt Brems (DC)

In [None]:
# Import packages.
import numpy as np
import time

In [None]:
# Install/upgrade Gensim.
!pip install gensim --upgrade

In [None]:
# Import Gensim.
import gensim

In [None]:
# Start timer.
t0 = time.time()

# Import word vectors into "model."
model = gensim.models.KeyedVectors.load_word2vec_format('change_filepath')

# Print results of timer.
print(time.time() - t0)

### What is a vector?

There are lots of ways to think about a vector.

![](./images/vector.png)

In **physics**, vectors are arrows.

![](./images/vector.jpg)

In **computer science** and **statistics**, vectors are columns of values, like one numeric Series in a DataFrame.

#### It turns out that these are equivalent.

![](./images/vector_on_graph.png)

[This video](https://www.youtube.com/watch?v=fNk_zzaMoSs) does an exceptional job explaining vectors.

### So... what is a word vector?

A word vector, simply, is a way for us to represent words with vectors.

<details><summary>How have we technically already done this?</summary>
    
- `CountVectorizer` and `TfidfVectorizer`. By representing each word as a new column in our DataFrame, we have represented words with vectors.

![](./images/countvectorizer.jpeg)
</details>

To be more precise, we can think of each word as its own dimension or axis. In the example below, we have represented the horizontal axis with a vector for `cat` and the vertical axis with a vector for `hat`.

![](./images/cat_hat.png)

This is exactly what CountVectorization and TFIDFVectorization have done; we are now just representing it geometrically/visually! Each column in our DataFrame corresponds to a new axis.
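To make "each column is a new axis" concrete, here is a hand-rolled sketch that counts words in two made-up documents. It only illustrates the idea behind count vectorization and is not how `CountVectorizer` is implemented; the documents and variable names are assumptions for this example.

```python
import numpy as np

# Two tiny, made-up documents (assumed for illustration).
docs = ["the cat sat", "the cat wore the hat"]

# Build a sorted vocabulary: one column (axis) per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count how often each vocabulary word appears in each document.
counts = np.array([[doc.split().count(word) for word in vocab] for doc in docs])

print(vocab)
print(counts)
```

Each row is a document and each column is one word's axis, which is the same layout as the DataFrame we get after vectorizing text.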

## A little math: the dot product

One thing we have spent lots of time talking about is whether or not two things are dependent or independent.
- We assume that our $Y$ variable depends on the $X$ variables in our models.
- We assume that our $X$ variables are independent of one another in linear models.
- We frequently assume that our observations are independent of one another.

<details><summary>Thus far, how have we detected if two columns/vectors are dependent or independent?</summary>
    
- The most common way for us to detect dependence/independence is **correlation**.
    - If the correlation between two columns is far from zero, we say the two are dependent.
    - If the correlation between two columns is close to zero, we say that the two are (linearly) independent.    
</details>

Geometrically, we say two vectors are independent if they are [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) (perpendicular) to one another.

<details><summary>Are the cat and hat vectors independent of one another?</summary>
    
- Yes! 
    - They are orthogonal to one another.
    - They are perpendicular to one another. 
    - They form right angles. 
    - The three preceding bullet points are all equivalent: if one of them is true, then they will all be true and if one of them is false, then they will all be false.    
</details>

How can we detect this mathematically? **The dot product.**

The [dot product of two vectors](https://en.wikipedia.org/wiki/Dot_product) $\mathbf{a} = [a_1, a_2, \ldots, a_p]$ and $\mathbf{b} = [b_1, b_2, \ldots, b_p]$ is given by:

$$
\begin{eqnarray*}
\mathbf{a} \cdot \mathbf{b} &=& \sum_{i=1}^p a_i \times b_i \\
&=& (a_1 \times b_1) + (a_2 \times b_2) + \cdots + (a_p \times b_p) \\
\end{eqnarray*}
$$
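As a quick sanity check on the formula, here is a minimal sketch that computes the dot product straight from the definition and compares it to `np.dot`. The helper name `dot_product` and the example vectors are assumptions for illustration.

```python
import numpy as np

def dot_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

a = [1, 2, 3]
b = [4, 5, 6]

manual = dot_product(a, b)   # (1*4) + (2*5) + (3*6)
with_numpy = np.dot(a, b)    # NumPy computes the same quantity

print(manual, with_numpy)
```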

<details><summary>Calculate the dot product of the vectors cat and hat. What is the value?</summary>
    
$$
\begin{eqnarray*}
\mathbf{cat} \cdot \mathbf{hat} &=& \sum_{i=1}^2 cat_i \times hat_i \\
&=& (cat_1 \times hat_1) + (cat_2 \times hat_2) \\
&=& (1 \times 0) + (0 \times 1) \\
&=& 0
\end{eqnarray*}
$$

- The dot product of the vectors `cat` and `hat` is 0.
- [We could have also written](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dot.html)
```python
np.dot([1,0],[0,1])
```
</details>

In [None]:
# Calculate the dot product using Python.
np.dot([1, 0], [0, 1])

### When the dot product between two vectors is zero, that means the vectors are (linearly) independent of one another!

### When we say "geometrically independent" or "mathematically independent" or "statistically linearly independent," these all mean the same thing! We just use different ways to detect them. (For example, if I have 100 vectors, I probably can't visually look at them and conclude independence. I probably have to use the dot product to determine independence.)
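When there are too many vectors to eyeball, one option is to compute every pairwise dot product at once with a matrix product. The sketch below uses three made-up vectors; the variable names are assumptions for illustration.

```python
import numpy as np

# Each row is one vector (made-up values for illustration).
vectors = np.array([
    [1, 0, 0],
    [0, 2, 0],
    [3, 0, 1],
])

# Entry (i, j) of this matrix is the dot product of vector i and vector j.
gram = vectors @ vectors.T

# Keep only the off-diagonal entries: each vector paired with a different vector.
off_diagonal = gram[~np.eye(len(vectors), dtype=bool)]

# True only if every pair of distinct vectors has a dot product of zero.
all_orthogonal = bool(np.all(off_diagonal == 0))

print(gram)
print(all_orthogonal)
```

Here the first and third vectors have a nonzero dot product, so the check reports that the set is not pairwise orthogonal.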

This type of vectorization of words (turning each word into its own column) is known as "1-of-N encoding."

![](./images/word2vec-one-hot.png)

For example:
- the vector for the word `king` would be [1, 0, 0, 0, 0].
- the vector for the word `queen` would be [0, 1, 0, 0, 0].
- the vector for the word `man` would be [0, 0, 1, 0, 0].
- the vector for the word `woman` would be [0, 0, 0, 1, 0].
- the vector for the word `child` would be [0, 0, 0, 0, 1].

<details><summary>Which pairs of the above words are independent of one another?</summary>
    
- They all are! If you calculate the dot product for any of these pairs of words, you will get a value of zero. 
- When vectorizing words in this way, we treat words as independent of one another.
</details>

In [None]:
# Calculate the dot product using Python.
np.dot([1, 0, 0, 0, 0], [0, 1, 0, 0, 0])  # king and queen

Thinking purely about language and the way we use it, **should** king and queen be independent of one another? **Should** man and woman be independent of one another?

<details><summary>What do you think?</summary>
    
- Probably not!
- King and queen have similar meanings. (Really, only the sex is different.)
- Man and woman have similar meanings. (i.e. I know that "man" and "woman" are more similar than "man" and "book" or "woman" and "car.")
- Our current data science strategy for NLP (CountVectorization, TFIDFVectorization) is good in that it allows us to get computers to understand natural language in a way similar to how humans do... but our current strategy has its limitations!
</details>

Rather than creating a whole new dimension each time we encounter a new word and treating it as independent of all other words, can we instead come up with "new axes" that allow us to better understand meanings and relationships among words?
- YES.

**Word embedding** is a term used to describe representing words in mathematical space.
- One word embedding technique is CountVectorization.
- A more advanced word embedding technique is `Word2Vec`.

### Word2Vec
- Word2Vec is an approach that takes in a corpus of text (sentences, tweets, books) and uses a neural network to map each word into some other space.

![](./images/word2vec-one-hot.png)

In this example, you can "think" of a five-dimensional space. 
- The horizontal axis corresponds to `king`.
- The vertical axis corresponds to `queen`.
- The axis extending out toward you corresponds to `man`.
- Given that we live in 3D space, we can't really visualize higher dimensions.

Instead of giving each word its own axis, the `Word2Vec` algorithm will take all of our words and map them to another set of axes that accounts for these relationships.

![](./images/word2vec-king-queen-vectors.png)

<details><summary>How can I tell that these vectors are not independent just by looking at them?</summary>
    
- They are not perpendicular to each other!
</details>
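We can also check "not perpendicular" numerically by computing the angle between two vectors from their dot product. This is a minimal sketch; the function name `angle_in_degrees` and the dense vectors are made up for illustration and are not real Word2Vec output.

```python
import numpy as np

def angle_in_degrees(a, b):
    """Angle between two vectors, from the cosine of the angle."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny floating-point overshoot outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

# One-hot style vectors: perpendicular (about 90 degrees).
print(angle_in_degrees([1, 0], [0, 1]))

# Made-up dense vectors: not perpendicular (about 45 degrees).
print(angle_in_degrees([1, 1], [1, 0]))
```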

### Why do we care?
The structure of language has a lot of valuable information in it! The way we organize our text/speech tells us a lot about what things mean.

By using machine learning to "learn" about the structure and content of language, our models can now organize concepts and learn the relationships among them.
- Above, we did not explicitly tell the computer what "king" or "queen" or "man" or "woman" actually mean. But by learning from the data, our model can quantify the relationship among these entities!

![](./images/word2vec-king-queen-composition.png)

If we represent words with vectors, then we can define "distances" among words and do operations on them!
- For example, if I take the "king" vector and subtract the "man" vector, what's left over might be the idea of "royalty."
- If I take "royalty" and add "woman" to it, then I get "queen!"
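Here is a toy sketch of that arithmetic. The two-dimensional vectors are hand-made so the result is easy to follow; they are not learned by Word2Vec, and the axis labels (royalty, femininity) are purely hypothetical.

```python
import numpy as np

# Hand-made 2-D vectors, NOT learned by Word2Vec.
# Hypothetical axes: [royalty, femininity].
toy_vectors = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
}

# "king" minus "man" leaves something like "royalty".
royalty = toy_vectors["king"] - toy_vectors["man"]

# Adding "woman" should land near "queen".
target = royalty + toy_vectors["woman"]

# Pick the word whose vector is closest (Euclidean distance) to the target.
closest = min(toy_vectors, key=lambda word: np.linalg.norm(toy_vectors[word] - target))

print(royalty, target, closest)
```

Real word vectors have hundreds of dimensions and no labeled axes, so the arithmetic only lands approximately on the expected word.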

### How does Word2Vec work?

#### Basic Answer:
The idea is that we can use the position of words in sentences (i.e. see which words were commonly used together) to understand their relationships.
- If "king" and "queen" are used near one another a lot, then it suggests that there may be some sort of relationship between them.
- If "king" and "queen" are used near similar words a lot (e.g. "throne," "royal," "princess," "prince," "heir"), then it suggests that there may be some sort of relationship between them.
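To see what "used near similar words" looks like in data, here is a small sketch that counts the words appearing within a fixed window of a target word. The sentences, the window size, and the helper name `context_counts` are assumptions for illustration; Word2Vec itself learns from these contexts with a neural network rather than storing raw counts.

```python
from collections import Counter

# Made-up sentences (assumed for illustration).
sentences = [
    "the king sat on the throne",
    "the queen sat on the throne",
]

window = 2  # how many words on each side count as "near"

def context_counts(target, sentences, window):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                start = max(0, i - window)
                neighbors = words[start:i] + words[i + 1:i + 1 + window]
                counts.update(neighbors)
    return counts

king_context = context_counts("king", sentences, window)
queen_context = context_counts("queen", sentences, window)

print(king_context)
print(queen_context)
```

In this tiny corpus "king" and "queen" never appear together, yet they share identical contexts, which is the kind of signal that pulls their vectors close.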

#### More Advanced Answer:
Two algorithms use neural networks to learn these relationships: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram.

![](./images/cbow.png)

**CBOW (BONUS)**

A Continuous Bag-of-Words model is a two-layer neural network that:
- takes the surrounding "context words" as an input.
- generates the "focus word" as the output.

![](./images/word2vec-cbow.png)

**Skip-Gram (BONUS)**

A Continuous Skip-gram model is a two-layer neural network that:
- takes the "focus word" as an input.
- generates the surrounding "context words" as the output.

![](./images/skipgram.png)
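The difference between the two setups is easiest to see in the training examples each one would use. The sketch below only builds the input/output pairs from one made-up sentence; it does not train a network, and the helper name `training_pairs` and the window size are assumptions for illustration.

```python
def training_pairs(words, window):
    """Yield (focus_word, context_words) for every position in `words`."""
    for i, focus in enumerate(words):
        start = max(0, i - window)
        context = words[start:i] + words[i + 1:i + 1 + window]
        yield focus, context

words = "the king sat on the throne".split()
window = 1

# CBOW-style examples: context words in, focus word out.
cbow_examples = [(context, focus) for focus, context in training_pairs(words, window)]

# Skip-gram-style examples: focus word in, one context word out.
skipgram_examples = [
    (focus, context_word)
    for focus, context in training_pairs(words, window)
    for context_word in context
]

print(cbow_examples[1])
print(skipgram_examples[:3])
```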

## Neat application 1: Which of these is not like the other?

In [None]:
model.doesnt_match(['man', 'woman', 'king', 'queen', 'dog'])

In [None]:
model.doesnt_match()

In [None]:
model.doesnt_match()

In [None]:
model.doesnt_match()

Try your own and share the most mind-blowing one in a thread.

**Real-world application of this**: Suppose you're attempting to automatically detect spam emails or detect plagiarism based on words that don't belong.
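One way to think about "which of these is not like the other" is to find the word whose vector points least in the average direction of the group. The sketch below does this with hand-made vectors; it is roughly the idea as we understand it, and Gensim's own implementation may differ in detail. The vectors and the function name `odd_one_out` are assumptions for illustration.

```python
import numpy as np

# Hand-made vectors (assumed for illustration, not learned by Word2Vec).
toy_vectors = {
    "man":   np.array([1.0, 0.2]),
    "woman": np.array([0.9, 0.3]),
    "king":  np.array([1.0, 0.4]),
    "dog":   np.array([0.1, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word least similar to the average direction of the group."""
    # Scale every vector to length 1 so only direction matters.
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    # Cosine similarity of each word with the group mean.
    similarities = unit @ (mean / np.linalg.norm(mean))
    return words[int(np.argmin(similarities))]

print(odd_one_out(["man", "woman", "king", "dog"], toy_vectors))
```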

## Neat application 2: What is most alike?

In [None]:
model.most_similar("paris")

In [None]:
model.most_similar()

In [None]:
model.most_similar()

In [None]:
model.most_similar()

**Real-world application of this**: Suppose you're building out a process to detect when people are tweeting about an emergency. They may not just use the word "emergency." Rather than manually creating a list of words people could use, you may want to learn from a much larger corpus of data than just your personal experience!
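Under the hood, "most alike" comes down to ranking words by how similar their vectors are. Here is a minimal sketch using cosine similarity on hand-made vectors; the words, the numbers, and the function name `most_similar_toy` are assumptions for illustration and not part of Gensim.

```python
import numpy as np

# Hand-made vectors (assumed for illustration, not learned by Word2Vec).
words = ["paris", "london", "berlin", "banana"]
matrix = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.7, 0.9, 0.2],
    [0.1, 0.0, 0.9],
])

def most_similar_toy(query, words, matrix, topn=2):
    """Rank the other words by cosine similarity to `query`."""
    # Scale every row to length 1 so a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = unit @ unit[words.index(query)]
    ranked = np.argsort(-similarities)  # highest similarity first
    return [(words[i], float(similarities[i])) for i in ranked if words[i] != query][:topn]

print(most_similar_toy("paris", words, matrix))
```

The result is a list of `(word, similarity)` pairs, which mirrors the shape of what `model.most_similar` returns.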

## Neat application 3: Analogies

In physics, we can add/subtract vectors to understand how two forces might act on an object. With word vectors, we can do the same thing!

![](./images/word2vec-king-queen-composition.png)

$$
\begin{eqnarray*}
\text{king - man} &=& \text{queen - woman} \\
x_1 - x_2 &=& y_1 - y_2 \\
x_1 - x_2 - y_1 &=& -y_2 \\
-x_1 + x_2 + y_1 &=& y_2 \\
\text{king is to man as queen is to...} &&
\end{eqnarray*}
$$

In [None]:
# Define analogy function.
def analogy(x1, x2, y1):
    
    # Find the vector y2 that is closest to -x1 + x2 + y1.
    y2 = model.most_similar(positive=[y1, x2], negative=[x1])
    
    # Return the result.
    return y2[0][0]

In [None]:
analogy()

In [None]:
analogy()

In [None]:
analogy()

Many of the images in this lesson were pulled from [this amazing resource](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/).

## Create word vectors from your own corpus! (BONUS)

### NOTE: This will usually take a *long* time!

In [None]:
# Import Word2Vec
from gensim.models.word2vec import Word2Vec

# If you want to use gensim's data, import their downloader
# and load it.
import gensim.downloader as api
corpus = api.load('text8')

# If you have your own iterable corpus of cleaned data:

# Train a model! 
model = Word2Vec(corpus,           # Corpus of data.
                 vector_size=100,  # How many dimensions do you want in your word vector? (Called "size" before Gensim 4.)
                 window=5,         # How many "context words" do you want?
                 min_count=1,      # Ignores words below this threshold.
                 sg=0,             # SG = 1 uses SkipGram, SG = 0 uses CBOW (default).
                 workers=4)        # Number of "worker threads" to use (parallelizes process).

# Do what you'd like to do with your data!
# In Gensim 4+, the trained word vectors live on model.wv.
model.wv.most_similar("car")

Check out the documentation for Gensim's implementation of [Word2Vec here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).