# Sentiment Analysis using word2vec
In this tutorial competition, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning-inspired method that learns vector representations of words, capturing their meaning and the semantic relationships among them. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis.

### Reference
* https://www.kaggle.com/c/word2vec-nlp-tutorial/overview
* https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec/data

In [None]:
# import nltk; nltk.download("popular")

In [None]:
# Importing the built-in logging module
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

In [None]:
# Note: Google's Word2Vec performs better on large datasets.
# This example uses only the 25,000 training examples from the IMDB dataset,
# so its performance is similar to that of the "bag of words" model.

# Importing libraries
import numpy as np
import pandas as pd

# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup
import re  # For regular expressions

# Stopwords can be useful for understanding the semantics of a sentence.
# Therefore stopwords are not removed while creating the word2vec model,
# but they will be removed while averaging feature vectors.
from nltk.corpus import stopwords

# word2vec expects a list of sentences, each of which is a list of words.
# The punkt tokenizer is used to split a paragraph into sentences more reliably.

import nltk.data


tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

In [162]:
# Read data from files
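
A minimal sketch of the loading step, assuming the competition's tab-separated layout with `id`, `sentiment` and `review` columns. The rows below are made up and read from an in-memory string so the cell runs without the data files; `sample_tsv` is a hypothetical stand-in.

```python
import csv
import io

import pandas as pd

# Tiny in-memory stand-in for labeledTrainData.tsv (made-up rows, not real data).
sample_tsv = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A <b>great</b> film. I loved it!"\n'
    '"2_3"\t0\t"Dull plot.<br />The acting was worse."\n'
)

# quoting=csv.QUOTE_NONE stops pandas from treating double quotes inside
# reviews as field delimiters; the quote characters stay in the values.
train = pd.read_csv(
    io.StringIO(sample_tsv), header=0, sep="\t", quoting=csv.QUOTE_NONE
)

print(train.shape)
print(train.columns.tolist())
```

For the real data, the same call would presumably take a file path such as `"labeledTrainData.tsv"` in place of the `io.StringIO` object; the exact location of the files is an assumption.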


In [None]:
# This function converts a text to a sequence of words.
def review_wordlist(review, remove_stopwords=False):
    # 1. Removing html tags
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # 2. Replacing non-letter characters with spaces
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Converting to lower case and splitting
    words = review_text.lower().split()

    # 4. Optionally remove stopwords
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words


# This function splits a review into sentences
def review_sentences(review, tokenizer, remove_stopwords=False):
    # 1. Using nltk tokenizer
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []

    # 2. Loop for each sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_wordlist(raw_sentence, remove_stopwords))

    # This returns the list of lists
    return sentences
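
The list-of-lists shape that `review_sentences()` returns can be illustrated without NLTK or BeautifulSoup. The sketch below is a simplified stand-in, not the functions above: `simple_wordlist` and `simple_sentences` are hypothetical helpers that use a naive regular-expression split instead of the punkt tokenizer.

```python
import html
import re


def simple_wordlist(text):
    # Drop anything that looks like an HTML tag, then unescape entities.
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    # Keep letters only, lower-case, and split on whitespace.
    return re.sub(r"[^a-zA-Z]", " ", text).lower().split()


def simple_sentences(review):
    # Naive split after ., ! or ? followed by whitespace (punkt is smarter).
    raw_sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    # Skip sentences that contain no words after cleaning.
    return [words for words in map(simple_wordlist, raw_sentences) if words]


sample = "A <b>great</b> film! I loved it. Would watch again."
print(simple_sentences(sample))
# [['a', 'great', 'film'], ['i', 'loved', 'it'], ['would', 'watch', 'again']]
```

Each inner list is one sentence, which is the input format word2vec training expects.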

In [None]:
# test `review_sentences()`


In [None]:
# !pip install -U tqdm

In [15]:
from tqdm.auto import tqdm

# Parse sentences from training set 


In [None]:
import multiprocessing

# Train & save word2vec model

In [None]:
# Few tests: print the odd word among them

In [None]:
# Print the most similar words present in the model


### Load a model trained on a bigger corpus (for better results)

### Solving Word Analogies!

* Man is to Woman what King is to ___?
* USA is to hamburger what UK is to ___?
* Korea is to kimchi what USA is to ___?

![](./figures/analogy.png)
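
The idea behind analogy solving is vector arithmetic: take the offset from "man" to "woman", add it to "king", and look for the nearest remaining word by cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are purely hypothetical; real Word2Vec vectors are learned and have many more dimensions.

```python
import numpy as np

# Hand-made 3-d vectors (hypothetical): axes are [royalty, male, female].
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
}


def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Return the word d such that a is to b what c is to d."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(solve_analogy("man", "woman", "king", toy_vectors))  # queen
```

With a gensim model, a comparable query is `wv.most_similar(positive=["woman", "king"], negative=["man"])`.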

### Now Back to our analysis again...

In [None]:
# The shape is (vocabulary size, vector dimensionality): the first entry is the
# total number of words in the vocabulary created from this dataset
model.wv.vectors.shape

In [None]:
# Function to average all word vectors in a paragraph
def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    # Converting index_to_key, which is a list, to a set for faster membership tests.
    index2word_set = set(model.wv.index_to_key)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv.get_vector(word))

    # Dividing the result by the number of words to get the average
    # (skipped when no word is in the vocabulary, to avoid dividing by zero)
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec


# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter + 1

    return reviewFeatureVecs
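
The averaging step can be checked by hand on a toy vocabulary. This is a sketch only: `toy_embeddings` is a hypothetical dictionary standing in for `model.wv`, and `average_vector` is an illustrative helper that returns zeros when no word is known rather than dividing by zero.

```python
import numpy as np

# Hypothetical 2-d embeddings standing in for model.wv.
toy_embeddings = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0], dtype="float32"),
}


def average_vector(words, embeddings, num_features):
    # Collect the vectors of the words that are in the vocabulary.
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        # No in-vocabulary words: return zeros rather than dividing by zero.
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)


print(average_vector(["good", "movie", "unseen"], toy_embeddings, 2))  # [0.5 0.5]
print(average_vector(["unseen"], toy_embeddings, 2))  # [0. 0.]
```

Out-of-vocabulary words such as `"unseen"` are ignored, so they do not pull the average toward zero.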

In [None]:
# Calculating average feature vector for training set


In [None]:
# Calculating average feature vectors for test set


In [None]:
# Fitting a random forest classifier to the training data


In [None]:
# Predicting the sentiment values for test data and saving the results in a csv file
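
A minimal sketch of the fit, predict and save steps with scikit-learn, run on synthetic vectors so it works without the trained model. The arrays, ids, labels and `n_estimators=100` are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the averaged review vectors (not real data).
num_features = 10
train_vecs = rng.normal(size=(40, num_features)).astype("float32")
train_labels = (train_vecs[:, 0] > 0).astype(int)  # made-up sentiment labels
test_vecs = rng.normal(size=(5, num_features)).astype("float32")
test_ids = [f"review_{i}" for i in range(len(test_vecs))]

# n_estimators=100 is an illustrative choice, not a tuned value.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_vecs, train_labels)

predictions = forest.predict(test_vecs)

# The submission is a CSV with "id" and "sentiment" columns.
submission = pd.DataFrame({"id": test_ids, "sentiment": predictions})
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead of leaving it empty writes the submission to disk.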

Submit the output at https://www.kaggle.com/c/word2vec-nlp-tutorial/leaderboard

# Bonus: Aspect-based Sentiment Analysis

In [None]:
sentences = [
    "The food we had yesterday was delicious",
    "My time in Italy was very enjoyable",
    "I found the meal to be tasty",
    "The internet was slow.",
    "Our experience was suboptimal",
]