<h1 align="center">PART II</h1>
<h1 align="center">Sentiment Analysis Classifications - Review and Comparison</h1>

First, we needed to create word vectors. For simplicity, we used a pre-trained model.

Google trained its Word2Vec model on a massive Google News dataset containing about 100 billion words. From this model, Google published [3 million word and phrase vectors](https://code.google.com/archive/p/word2vec/#Pre-trained_word_and_phrase_vectors), each with 300 dimensions.

Ideally, we would use these vectors, but because the word-vector matrix is quite large (3.6 GB), we used a much more manageable matrix trained with [GloVe](https://nlp.stanford.edu/projects/glove/), a similar method for generating word vectors. This matrix contains 400,000 word vectors, each with 50 dimensions. You can also download the model [here](https://www.kaggle.com/anindya2906/glove6b?select=glove.6B.50d.txt).

#### How word2vec works:

The idea behind word2vec is as follows:

    Take a 3-layer neural network (1 input layer + 1 hidden layer + 1 output layer).
    Feed it a word and train it to predict its neighbouring word.
    Remove the last (output) layer and keep the input and hidden layers.
    Now, input a word from the vocabulary. The output of the hidden layer is the ‘word embedding’ of the input word.
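
The steps above can be sketched in a few lines of NumPy. This is a toy illustration, not the actual word2vec training code: the four-word vocabulary, the 3-dimensional hidden layer, and the random weights are all made up, and no training is performed. It only shows why the hidden-layer output for a one-hot input is simply one row of the input weight matrix, which is the part that is kept as the embedding.

```python
import numpy as np

# Hypothetical toy vocabulary and sizes (assumptions for illustration only).
vocab = ["i", "like", "football", "matches"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vocab_size, hidden_size = len(vocab), 3

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_size))   # input -> hidden weights
W_out = rng.normal(size=(hidden_size, vocab_size))  # hidden -> output weights


def one_hot(word):
    """Return the one-hot input vector for a vocabulary word."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec


def forward(word):
    """Forward pass: hidden activation and softmax over possible neighbours."""
    hidden = one_hot(word) @ W_in            # linear hidden layer, no activation
    scores = hidden @ W_out
    exp_scores = np.exp(scores - scores.max())
    return hidden, exp_scores / exp_scores.sum()


hidden, neighbour_probs = forward("football")

# Once training is done, the output layer (W_out) is dropped and the hidden
# activation serves as the embedding. For a one-hot input it is a row of W_in.
embedding = W_in[word_to_index["football"]]
print(np.allclose(hidden, embedding))
```

In a trained network the rows of `W_in` would carry meaning; here they are random, so only the mechanics are shown.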
    
Two popular methods of learning word embeddings from text are:

    Word2Vec
    GloVe

To get started, let's import the necessary libraries:

In [1]:
import numpy as np
import pandas as pd
import pickle
import gensim, logging
import gensim.models.keyedvectors as word2vec
import matplotlib.pyplot as plt

%matplotlib inline

Let's also add a style that centers all graphs, images, etc.:

In [2]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Next, we will load the sample data we processed in the previous part:

In [3]:
with open('documents.pql', 'rb') as f:
    docs = pickle.load(f)

In [4]:
print("Number of documents:", len(docs))

Number of documents: 38544
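
For readers who skipped the previous part: the loop later in this section unpacks each document as a `(tokens, label)` pair, where the label is `'pos'` or `'neg'`. Below is a minimal sketch of that layout and of the pickle round trip. It uses made-up documents and a temporary file rather than the real `documents.pql`, so the contents are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real data: (list of tokens, 'pos' or 'neg').
toy_docs = [
    (["great", "match", "today"], "pos"),
    (["terrible", "service"], "neg"),
]

# Write and read back through pickle, the same way documents.pql is loaded.
path = os.path.join(tempfile.mkdtemp(), "toy_documents.pql")
with open(path, "wb") as f:
    pickle.dump(toy_docs, f)
with open(path, "rb") as f:
    loaded_docs = pickle.load(f)

for tokens, label in loaded_docs:
    print(label, tokens)
```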


Now we will load our GloVe model in word2vec format. The GloVe dump from Stanford's site differs slightly from the word2vec format, so it has to be converted first. You can convert a GloVe file to word2vec format using the following command in your console:

`python -m gensim.scripts.glove2word2vec --input  model/glove.6B.50d.txt --output model/glove.6B.50d.w2vformat.txt`

After that, you can delete the original GloVe file.
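
For reference, the two text formats differ only in a header: the word2vec text format begins with a line holding the vocabulary size and the vector dimension, while a GloVe file starts directly with the first word. The sketch below does the conversion by hand on a made-up three-word file. The function name `glove_to_word2vec` and the file names are hypothetical, and the gensim script remains the practical choice.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    vocab_size = len(lines)
    # Each line is "<word> v1 v2 ... vN", so the dimension is the field count minus one.
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        dst.writelines(lines)
    return vocab_size, dim


# Build a tiny made-up GloVe-style file to convert.
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_glove.w2vformat.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\n")
    f.write("football -0.5 0.4 0.0\n")
    f.write("match 0.7 -0.1 0.2\n")

result = glove_to_word2vec(glove_file, w2v_file)
print(result)  # (3, 3)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)  # 3 3
```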

The next operation may take some time: the model contains 400,000 words, so we will get a 400,000 x 50 embedding matrix that holds all the word vectors.

In [5]:
model = word2vec.KeyedVectors.load_word2vec_format('model/glove.6B.50d.w2vformat.txt', binary=False)

Now let's get a list of all the words from our dictionary:

In [6]:
# works with gensim < 4.0; gensim >= 4.0 removed `vocab`, use list(model.key_to_index) instead
words = list(model.vocab)

Just to make sure everything is loaded correctly, we can look at the dimensions of the dictionary list and the embedding matrix:

In [7]:
print(words[:50], "\n\nTotal words:", len(words), "\n\nWord-Vectors shape:", model.vectors.shape)

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as', 'it', 'by', 'at', '(', ')', 'from', 'his', "''", '``', 'an', 'be', 'has', 'are', 'have', 'but', 'were', 'not', 'this', 'who', 'they', 'had', 'i', 'which', 'will', 'their', ':', 'or', 'its', 'one', 'after'] 

Total words: 400000 

Word-Vectors shape: (400000, 50)


We can also look up a word such as "football" and retrieve its vector from the embedding matrix:

In [8]:
print(model['football'])

[-1.8209    0.70094  -1.1403    0.34363  -0.42266  -0.92479  -1.3942
  0.28512  -0.78416  -0.52579   0.89627   0.35899  -0.80087  -0.34636
  1.0854   -0.087046  0.63411   1.1429   -1.6264    0.41326  -1.1283
 -0.16645   0.17424   0.99585  -0.81838  -1.7724    0.078281  0.13382
 -0.59779  -0.45068   2.5474    1.0693   -0.27017  -0.75646   0.24757
  1.0261    0.11329   0.17668  -0.23257  -1.1561   -0.10665  -0.25377
 -0.65102   0.32393  -0.58262   0.88137  -0.13465   0.96903  -0.076259
 -0.59909 ]


<h2 align="center">Word Average Embedding Model</h2>

Let's start analyzing our vectors. Our first approach will be the **word average embedding model**.

The essence of this naive approach is to average all the word vectors in a sentence into a single 50-dimensional vector that represents the tone of the whole sentence, then feed that vector to a model to get a quick first result.
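
As a small worked example of the averaging step, the sketch below uses a made-up 3-dimensional embedding table instead of the 50-dimensional GloVe vectors. The helper name `average_embedding` is hypothetical. It substitutes a zero vector for any word missing from the table, which is also how the `sent_embed` function in this section handles unknown words.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the real GloVe vectors have 50 dimensions).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}
dim = 3


def average_embedding(tokens, vectors, dim):
    """Average the vectors of all tokens, using a zero vector for unknown words."""
    rows = [vectors.get(token, np.zeros(dim)) for token in tokens]
    return np.mean(rows, axis=0)


known_only = average_embedding(["good", "match"], toy_vectors, dim)
with_unknown = average_embedding(["good", "match", "zzz"], toy_vectors, dim)
print(known_only)    # [2. 1. 1.]
print(with_unknown)  # same direction, but pulled toward zero by the unknown word
```

Note that a zero vector still counts toward the mean, so every unknown word shrinks the sentence vector a little.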

Strictly speaking, the try/except should not be needed, but even though I cleaned up our sample, a couple of words that are missing from the vocabulary remained after processing and have to be caught.

In [11]:
def sent_embed(words, docs):
    x_sent_embed, y_sent_embed = [], []
    count_words, count_non_words = 0, 0  
    
    # build the embedding of each sentence as the average of its word vectors
    # sent - sentence (list of words), state - label of the sentence (pos/neg)
    for sent, state in docs:
        # average embedding of all words in a sentence
        sent_embed = []
        for word in sent:
            try:
                # count every word, then add its vector if it is in the vocabulary
                count_words += 1
                sent_embed.append(model[word])
            except KeyError:
                # if word is not in the dictionary - add a zero vector
                count_non_words += 1
                sent_embed.append([0] * 50)
        
        # add a sentence vector to the list
        x_sent_embed.append(np.mean(sent_embed, axis=0).tolist())
        
        # add a label to y_sent_embed
        if state == 'pos': y_sent_embed.append(1)
        elif state == 'neg': y_sent_embed.append(0)
            
    print(count_non_words, "out of", count_words, "words were not found in the vocabulary.")
    
    return x_sent_embed, y_sent_embed

In [12]:
x, y = sent_embed(words, docs)

30709 out of 1802696 words were not found in the vocabulary.
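
To put that count in perspective, the share of out-of-vocabulary words can be computed directly from the two numbers printed above. This is just arithmetic on the reported counts, not an additional measurement.

```python
# Counts copied from the output above.
count_non_words = 30709
count_words = 1802696

miss_rate = count_non_words / count_words
print(f"{miss_rate:.2%} of words were replaced by a zero vector")
```

That is roughly 1.7% of all words, so the zero vectors probably have only a limited effect on most sentence averages.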
