The purpose of this notebook is to introduce two popular word embedding methods, **Word2Vec** and **FastText**, using Gensim.

# Traditional Approach
***
A traditional way of representing words is to use a one-hot vector, so that each word gets its own basis vector. The length of this one-hot vector is always equal to the number of unique words in the corpus vocabulary.

This representation is simple and easy to implement, but it does not encode any semantic relationship between words. Every pair of distinct one-hot vectors is orthogonal and equally far apart, so the synonyms "endure" and "tolerate" are no closer to each other than either is to an unrelated word, regardless of how index positions are assigned (alphabetically or otherwise). Ideally, we want to embed words in a space where these two words are close to one another.
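
To make the lack of semantic structure concrete, here is a minimal sketch using a hypothetical four-word vocabulary (the words and the `one_hot` helper are illustrative and not part of the notebook's dataset). It compares the distance between two synonyms with the distance between two unrelated words.

```python
import numpy as np

# Toy vocabulary (hypothetical), indexed in alphabetical order.
vocab = sorted(["endure", "apple", "tolerate", "zebra"])
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Euclidean distance between two synonyms and between two unrelated words.
dist_synonyms = np.linalg.norm(one_hot("endure") - one_hot("tolerate"))
dist_unrelated = np.linalg.norm(one_hot("endure") - one_hot("zebra"))

# The dot product (and hence cosine similarity) of distinct words is zero.
dot = one_hot("endure") @ one_hot("tolerate")

print(dist_synonyms, dist_unrelated, dot)
```

Both distances come out equal (the square root of 2), so the encoding carries no information about which words are related.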

# Word2Vec
***
Word2Vec addresses this problem efficiently by leveraging the context in which each target word appears. Following the distributional hypothesis (words that occur in similar contexts tend to have similar meanings), it uses the surrounding words to represent target words, training a neural network whose hidden layer encodes the word representation.

There are two variations of Word2Vec: **Skip-gram** and **Continuous Bag of Words (CBOW)**.

## Skip-gram
***
In skip-gram, the input is the target word, and the outputs are the words surrounding the target word. For example, consider the sentence "I have a cute dog". The input to the neural network could be the word "a", and the output could be ["I","have","cute","dog"] (assuming a window of length 5, that is, two words on each side of the target). The input and output are one-hot encoded and have the same dimension, the vocabulary size. The network is shallow: it contains a single hidden layer whose number of nodes equals the dimension of the embedding space, which is typically much smaller than the vocabulary size. A softmax function is applied at the output layer so that each element of the output vector describes the probability of seeing that word in the target word's context.
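
The way training examples are formed can be sketched as follows. The helper `skipgram_pairs` is a hypothetical name introduced for illustration; here `window` counts the words on each side of the target, so the "window of length 5" in the text corresponds to `window=2`.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs, taking up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "I have a cute dog".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words paired with the target "a".
context_of_a = [context for target, context in pairs if target == "a"]
print(context_of_a)  # ['I', 'have', 'cute', 'dog']
```

Each pair becomes one training example: the target word is the input and the context word is the label the softmax output should assign high probability to.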

The word embedding for a target word is obtained by feeding its one-hot representation into the network and extracting the hidden-layer activations, which is equivalent to reading off the corresponding row of the input weight matrix.
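
A minimal NumPy sketch of this forward pass is shown below. The weights are random and untrained, and the names `W_in`, `W_out`, `V` and `N` are assumptions made for illustration; the point is only to show that multiplying a one-hot vector by the input weights selects one row, and that the softmax output is a probability distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3  # toy vocabulary size and hidden-layer (embedding) size

W_in = rng.normal(size=(V, N))   # input -> hidden weights (random, untrained)
W_out = rng.normal(size=(N, V))  # hidden -> output weights (random, untrained)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

word_index = 2
x = np.zeros(V)
x[word_index] = 1.0  # one-hot encoding of the target word

hidden = x @ W_in                # hidden-layer activation for this word
probs = softmax(hidden @ W_out)  # probability of each word appearing in the context

print(np.allclose(hidden, W_in[word_index]))  # the hidden layer is a row of W_in
print(probs.sum())
```

Because of this equivalence, practical implementations skip the matrix multiplication and simply look up the row.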

With skip-gram, the dimension of the representation decreases from the vocabulary size to the size of the hidden layer (N). In addition, the vectors are more meaningful: words that tend to appear in the same contexts end up with similar vectors (again, by the distributional hypothesis). Under this structure, the difference between the vectors of two related words sometimes expresses a meaningful concept such as gender or verb tense (for example, king - man + woman ≈ queen).
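
The vector arithmetic can be illustrated with hand-made 2-D vectors. These values are invented for the example and do not come from a trained model: the first axis loosely stands for "royalty" and the second for "gender".

```python
import numpy as np

# Hand-made toy vectors (illustrative only, not learned embeddings).
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen.
query = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(best)  # queen
```

With a trained Gensim model, the comparable lookup is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`; whether the top result is actually "queen" depends on the training corpus.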

## Continuous Bag of Words (CBOW)
***
The Continuous Bag of Words model is very similar to skip-gram, except that it swaps the input and output: given some context, we predict the target word.

The largest difference between the skip-gram and CBOW methods is the way the hidden layer is computed. For CBOW, the context words of each example are fed into the network together, and the hidden layer is the average of their individual hidden-layer vectors. For example, assume we have two sentences: ["He is a nice guy","She is a wise queen"]. For the target word "a", the contexts are ["He","is","nice","guy"] and ["She","is","wise","queen"]; for each sentence, we take the average of the hidden-layer values of the four context words, and the network is trained to predict "a" from that average.
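
The averaging step can be sketched as below, assuming the standard CBOW formulation in which the hidden layer is the mean of the context words' input vectors. The weight matrix `W_in` is random and untrained, and `cbow_hidden` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["he", "is", "a", "nice", "guy", "she", "wise", "queen"]
index = {word: i for i, word in enumerate(vocab)}

N = 4  # toy embedding size
W_in = rng.normal(size=(len(vocab), N))  # input -> hidden weights (random, untrained)

def cbow_hidden(context_words):
    """Average the input vectors of the context words to form the hidden layer."""
    rows = [W_in[index[word]] for word in context_words]
    return np.mean(rows, axis=0)

# Both contexts surround the target word "a".
h1 = cbow_hidden(["he", "is", "nice", "guy"])
h2 = cbow_hidden(["she", "is", "wise", "queen"])
print(h1.shape, h2.shape)
```

During training, each averaged vector would be passed through the output layer and softmax, and the weights adjusted so that "a" receives a high probability.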

In skip-gram, by contrast, we feed in a single target word as a one-hot vector and get back a vector of probabilities over the possible context words.

# Word2Vec Implementation
***

In [21]:
import pandas as pd
import numpy as np
import os 
import re
from gensim.models import Word2Vec, FastText

In [12]:
data = pd.read_csv("data/yelp/raw_train.csv", header=None,names=["sentiment","review"])

In [13]:
data.head()

| | sentiment | review |
|---|---|---|
| 0 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 2 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 1 | I don't know what Dr. Goldberg was like before... |
| 3 | 1 | I'm writing this review to give you a heads up... |
| 4 | 2 | All the food is great here. But the best thing... |


Now we do some preprocessing so that contexts can be learned a little better, and then we tokenize each review so that the final version can be read in by the Word2Vec model.

In [14]:
def preprocess(review):
    # remove parenthesized text, including the parentheses themselves
    review = re.sub(r"\([^)]*\)", "", review)
    # lowercase, replace runs of non-alphanumeric characters with a space,
    # then split on whitespace into a list of tokens
    review = re.sub(r"[^a-z0-9]+", " ", review.lower()).split()
    return review
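
As a quick check of what this preprocessing does, the sketch below repeats the same two substitutions so that it runs on its own and applies them to two made-up sample strings (the samples are not taken from the Yelp data).

```python
import re

def preprocess(review):
    # Same logic as the notebook cell, repeated so this snippet is self-contained.
    review = re.sub(r"\([^)]*\)", "", review)
    return re.sub(r"[^a-z0-9]+", " ", review.lower()).split()

tokens = preprocess("Great food (really!), 10/10 would return.")
print(tokens)  # ['great', 'food', '10', '10', 'would', 'return']

# Apostrophes are treated like any other non-alphanumeric character,
# so contractions are split into two tokens.
contraction = preprocess("I don't know")
print(contraction)  # ['i', 'don', 't', 'know']
```

The second example shows one side effect worth keeping in mind: tokens such as "don" and "t" will appear in the vocabulary.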

In [18]:
data["cleaned_review"] = data.review.apply(preprocess)

In [19]:
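# Train a CBOW model (sg=0; use sg=1 for skip-gram).
# Note: `size` is the Gensim 3.x parameter name; Gensim 4.x renamed it to `vector_size`.
model = Word2Vec(sentences=data["cleaned_review"], size=100, window=5, min_count=5,
                 workers=8, sg=0)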

In [20]:
model.wv.most_similar("man")

[('guy', 0.858910322189331),
 ('woman', 0.8234066963195801),
 ('gentleman', 0.8075653910636902),
 ('lady', 0.8026242852210999),
 ('girl', 0.7914206385612488),
 ('dude', 0.7794543504714966),
 ('gal', 0.7622164487838745),
 ('gent', 0.7058786749839783),
 ('bouncer', 0.6754287481307983),
 ('gentlemen', 0.6743558049201965)]

One of the biggest challenges for Word2Vec is that it cannot represent words it did not see in the training set (out-of-vocabulary words). For this reason, we can use **FastText**.

# FastText
***
FastText is an extension of Word2Vec. Instead of feeding whole words into the neural network, FastText breaks each word into several character n-grams (sub-parts of the word). For example, the tri-grams for "apple" are ["app","ppl","ple"]; the actual implementation also adds word-boundary markers, which yields the additional n-grams "<ap" and "le>". Under this featurization, the embedding vector for "apple" is the sum of the vectors of its n-grams. After training, rare and unseen words can still be represented, since it is highly likely that some of their n-grams appeared in the training set.
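
The n-gram decomposition can be sketched in a few lines. The helper `char_ngrams` and the random vectors are illustrative assumptions; the real FastText implementation uses a range of n-gram lengths (3 to 6 by default in Gensim), includes the whole word as well, and hashes n-grams into a fixed number of buckets.

```python
import numpy as np

def char_ngrams(word, n=3, add_boundaries=False):
    """Return the character n-grams of `word`, optionally wrapped in < and >."""
    if add_boundaries:
        word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

plain = char_ngrams("apple")
bounded = char_ngrams("apple", add_boundaries=True)
print(plain)    # ['app', 'ppl', 'ple']
print(bounded)  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Toy n-gram vectors (random, untrained) to show how a word vector is assembled.
rng = np.random.default_rng(0)
ngram_vectors = {gram: rng.normal(size=4) for gram in bounded}
apple_vector = np.sum([ngram_vectors[gram] for gram in bounded], axis=0)
print(apple_vector.shape)
```

An unseen word such as "apples" shares most of its n-grams with "apple", which is why its summed vector ends up close to that of the known word.
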
# FastText Implementation
***

In [22]:
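# Same hyperparameters as the Word2Vec model above, for a like-for-like comparison.
# Note: `size` is the Gensim 3.x parameter name; Gensim 4.x renamed it to `vector_size`.
fasttext_model = FastText(sentences=data["cleaned_review"], size=100, window=5,
                          min_count=5, workers=8, sg=0)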

Now we can try it with a word that does not appear in the training set:

In [23]:
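word_not_in_training_set = "Gastroenteritis"
# Check membership in the Word2Vec vocabulary
# (`wv.vocab` is Gensim 3.x; Gensim 4.x uses `wv.key_to_index`).
print("Word in training set: {}"
      .format(word_not_in_training_set in model.wv.vocab))
# FastText can still build a vector for the unseen word from its n-grams.
fasttext_model.wv.most_similar(word_not_in_training_set)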

Word in training set: False


[('antibiotics', 0.6070331335067749),
 ('pregnancies', 0.5956847667694092),
 ('surgeries', 0.5834647417068481),
 ('pregnancy', 0.574320912361145),
 ('resurgence', 0.573951244354248),
 ('innocence', 0.5651587247848511),
 ('invasion', 0.5600395798683167),
 ('illness', 0.5597577095031738),
 ('symptoms', 0.557685375213623),
 ('probiotics', 0.5571395754814148)]