<h1>What is word2vec?</h1>
<p>Word2vec is a machine learning method to efficiently create `word embeddings`.</p>

Every word is transformed into a numeric sequence, that is, an array of numbers:
`multi-dimensional meaning representations of a word`

The vector data for a single word such as `banana` can look like:
```
    array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
    
```
For a visual explanation of how w2v works, see [illustrated word2vec](http://jalammar.github.io/illustrated-word2vec/).

Once we know the vector data of words, we can do many interesting things:
1. Similarity among words
1. Linear algebra among words 
1. Recommendation engines
1. ...

<h1> What are we doing today?</h1>

1. Use the Seinfeld wordlist we just prepared
1. Use spaCy's word2vec model to vectorize a single word, then many words
1. Explore semantic similarity among words using math
1. Create a fun method that transforms any song title or lyrics into a Seinfeld-themed word combination



# Take a look at our words

In [10]:
# Import the prepared word list, then split it into a list of tokens
with open('../data_prep/seinfeld_tokens.txt', 'r') as file:
    seinfeld_tokens = file.read().lower().split('\n')
print("There are %d unique word tokens in our Seinfeld word collection." % len(seinfeld_tokens))

# Let's print a slice of 10 of them
print(seinfeld_tokens[12000:12010])

There are 18209 unique word tokens in our Seinfeld word collection.
['plotting', 'ploughing', 'ploy', 'pluck', 'plug', 'plugged', 'pluggin', 'plugola', 'plugs', 'plum']


# Vectorize words with spaCy

We will use the w2v model from spaCy, a Natural Language Processing library whose models contain pre-trained vector data for a common vocabulary.

> spaCy w2v documentation [here](https://spacy.io/usage/vectors-similarity)

Before we start, if you haven't installed spaCy yet, remember to pip install it. Also download the medium-size spaCy model, which contains word vector data.

```python
!pip install spacy
!python -m spacy download en_core_web_md
```

In [12]:
# Import spaCy and load spaCy medium model
import spacy
nlp = spacy.load("en_core_web_md")

In [15]:
# Look up vector data of words
nlp.vocab["banana"].vector
# Define a function for getting word vector data
def vec(word):
    return nlp.vocab[word].vector

In [24]:
# Pick a random word from seinfeld_tokens (uncomment the last line to look up its vector)
from random import choice
random_seinfeld_word = choice(seinfeld_tokens)
print(random_seinfeld_word)
#vec(random_seinfeld_word)

fleas


# Vector Math

If words are represented as vectors, can we use vector math on them?

Imagine:
1. "King" - "Man"
1. "Apple" + "Purple"
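
As a minimal sketch of what such arithmetic means, the toy example below combines vectors element by element. The 3-dimensional values and the names `toy_vectors`, `add`, and `subtract` are made up for illustration; they are not real spaCy vectors.

```python
# Toy 3-dimensional "word vectors". The numbers are invented for illustration
# and are NOT taken from spaCy or any trained model.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "man":    [0.1, 0.9, 0.0],
    "apple":  [0.2, 0.1, 0.7],
    "purple": [0.5, 0.0, 0.3],
}

def add(v1, v2):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(v1, v2)]

def subtract(v1, v2):
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(v1, v2)]

# "King" - "Man": what is left of "king" once the "man" part is taken away
king_minus_man = subtract(toy_vectors["king"], toy_vectors["man"])

# "Apple" + "Purple": a new point that mixes both words
apple_plus_purple = add(toy_vectors["apple"], toy_vectors["purple"])

# Round only for display, to hide floating-point noise
print([round(x, 2) for x in king_minus_man])
print([round(x, 2) for x in apple_plus_purple])
```

The result of each operation is just another vector in the same space, so it can be compared against the vectors of real words.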

# Word Similarity

Since we have vector data of words, we also have the positions of these words in a multi-dimensional space (typically 100-300 dimensions).
All these words are located in relation to each other in a larger word context. In a nutshell, when the model is trained to assign positional data to each word, it looks at which other words this word usually appears with.
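
To make "which other words this word usually appears with" more concrete, here is a minimal sketch that lists the neighbors of each word inside a small sliding window. The example sentence, the window size, and the helper name `context_pairs` are illustrative assumptions; an actual word2vec model learns from pairs like these during training rather than just listing them.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and the neighbors
    found within `window` positions on either side of it."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical example sentence, split on whitespace
sentence = "no soup for you".split()
pairs = context_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

Words that keep showing up with similar neighbors end up with similar vectors, which is why distance in the vector space can stand in for similarity in meaning.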

If all these words were two-dimensional, with only an x and a y coordinate, we could imagine them scattered on a 2-D plane.
Anna and I made a little game a while ago where we used the same Seinfeld raw text and plotted everything in a two-dimensional vector space. You can find that dimensionality reduction process [here](https://github.com/lanzhang76/toast/blob/master/word-processing/w2v_SNE_graph/w2v_t-SNE_Seinfeld.ipynb).

We can also compare similarity among words by comparing their distances from each other.

Popular methods for comparing vectors are:
1. Euclidean distance (straight-line distance; easy to picture in two dimensions)
1. Cosine similarity (compares direction rather than length; a common choice for multi-dimensional vectors)
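
The difference between the two measures can be sketched with plain Python on a few hand-picked 2-D vectors. The vectors and the function names `euclidean` and `cosine_similarity` are hypothetical and chosen only for illustration.

```python
from math import sqrt

def euclidean(v1, v2):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

base = [1.0, 2.0]
scaled = [3.0, 6.0]   # same direction as base, three times longer
other = [2.0, 1.0]    # similar length to base, different direction

# Euclidean distance: smaller means closer
print(round(euclidean(base, scaled), 3), round(euclidean(base, other), 3))

# Cosine similarity: larger means more similar
print(round(cosine_similarity(base, scaled), 3), round(cosine_similarity(base, other), 3))
```

By Euclidean distance `base` is closer to `other`, while by cosine similarity it is closer to `scaled`, because cosine ignores vector length and only looks at direction.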


<img src="../w2v/img.jpeg" alt="drawing" width="400" style="float: left;"/>

In [29]:
# Turn the cosine similarity equation into a function
# We need numpy installed and imported in order to use the following:

from numpy import dot
from numpy.linalg import norm

# Define a function that outputs cosine similarity of any two given vectors
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

In [38]:
# Higher number means more similar
print(cosine(vec("milk"),vec("water")))
print(cosine(vec("milk"),vec("car")))
cosine(vec("milk"),vec("water")) > cosine(vec("milk"),vec("car"))

0.50557566
0.15468664


True

In [39]:
# Create a function that iterates through token_list
def spacy_closest(token_list, # Given token list 
                  target_word_vec, # Any word vector
                  n=10 # by default 10 closest words
                 ):
    
    # Compare every word to the target word and output a sorted list of the n closest words.
    return sorted(token_list,
                  key=lambda x: cosine(target_word_vec, vec(x)), # a lambda is a short anonymous function
                  reverse=True # True sorts in descending order (most similar first)
                 )[:n]
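
Sorting the whole token list just to keep the first few entries works, but the standard library's `heapq.nlargest` can express the same "top n by score" idea directly. The sketch below is self-contained and runs on a tiny made-up vocabulary; `toy_vectors`, `toy_cosine`, and `toy_closest` are hypothetical stand-ins, not the spaCy-backed functions used in this notebook.

```python
from heapq import nlargest
from math import sqrt

# Invented 3-dimensional vectors, for illustration only (not spaCy data)
toy_vectors = {
    "cat":   [0.9, 0.1, 0.0],
    "kitty": [0.8, 0.2, 0.1],
    "dog":   [0.7, 0.3, 0.0],
    "pizza": [0.0, 0.1, 0.9],
    "deli":  [0.1, 0.0, 0.8],
}

def toy_cosine(v1, v2):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

def toy_closest(token_list, target_vec, n=10):
    """Return the n tokens whose toy vectors are most similar to target_vec."""
    return nlargest(n, token_list,
                    key=lambda word: toy_cosine(target_vec, toy_vectors[word]))

closest = toy_closest(list(toy_vectors), toy_vectors["cat"], n=3)
print(closest)
```

For a list of a few thousand tokens either approach is fine; `nlargest` mainly avoids fully sorting tokens that will be thrown away.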

In [40]:
# Get the 10 words appearing in Seinfeld that are closest to the target word "cat"
spacy_closest(seinfeld_tokens,vec("cat"))

['cat',
 'cats',
 'dog',
 'kitty',
 'pet',
 'poodle',
 'puppy',
 'retriever',
 'dogs',
 'rabbit']

In [41]:
spacy_closest(seinfeld_tokens,vec("deli"),15)

['bodega',
 'deli',
 'delicatessen',
 'delicatessens',
 'knish',
 'appetizers',
 'pastrami',
 'pastries',
 'sandwiches',
 'artisan',
 'bakeries',
 'bakery',
 'supermarket',
 'grocery',
 'pizza']

# Seinfeld-themed song title transformer

Let's create a fun method that transforms any song title or lyrics into a Seinfeld-themed word combination, using the closest-neighbors method we just created.

Make sure you have the following libraries installed and imported:

```python
    from numpy import dot
    from numpy.linalg import norm
    from random import choice
    import spacy
    nlp = spacy.load("en_core_web_md")
```


In [67]:
# A function that takes in a sentence and a list of tokens
def seinfeldTransformer(song_title,word_tokens,num = 2):
    
    # Relies on the helpers defined earlier in this notebook:
    # vec() (vectorize), cosine() (cosine similarity), spacy_closest() (closest neighbors)
    
    # Replace each word of the input sentence with a random pick from its most similar words
    def getNewTitle(sen, li):
        word_list = sen.split(" ")
        new_list = []
        for word in word_list:
            replace_word = choice(spacy_closest(li,vec(word),10))
            new_list.append(replace_word)
        return ' '.join(new_list)
    
    # Generate
    print('"%s" can also be called:' % (song_title))
    for i in range(num):
        print("---")
        print("%d. %s" % (i+1,getNewTitle(song_title,word_tokens)))

In [68]:
# Queen?
seinfeldTransformer("Another one bites the dust",seinfeld_tokens,5)

"Another one bites the dust" can also be called:
---
1. an it antidote its pebble
---
2. another only treats one debris
---
3. least that mauled that grime
---
4. a even fleas entire sand
---
5. one it bite its debris


In [72]:
# 
seinfeldTransformer("This guy is in love with you",seinfeld_tokens,5)

"This guy is in love with you" can also be called:
---
1. one jocks being within know well sure
---
2. it dude comes the loving together sure
---
3. one dude it in loooove up can
---
4. one guy only around really both get
---
5. that someone comes the loves with you


In [71]:
# more Queen?
seinfeldTransformer("We are the champions",seinfeld_tokens,5)

"We are the champions" can also be called:
---
1. they are part victory
---
2. let these one decathlon
---
3. what these part finals
---
4. they are part triumph
---
5. we those into decathlon


In [73]:
# or any sentence really:
seinfeldTransformer("A fox is chasing a chicken",seinfeld_tokens,5)

"A fox is chasing a chicken" can also be called:
---
1. second rabbit is pouncing another marsala
---
2. one wolf one tailing is brisket
---
3. another skunk is chasing the pork
---
4. kind skunk be pouncing something chowder
---
5. another rabbit comes chasing first meat
