# Sentiment Analysis using word2vec
In this tutorial, based on a Kaggle tutorial competition, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning-inspired method that focuses on the meaning of words: it learns a vector for each word so that words used in similar contexts end up close together, which captures semantic relationships among words. It is related to deep approaches such as recurrent neural nets, but it uses a shallow network and is computationally much more efficient. This tutorial applies Word2Vec to sentiment analysis.

### Reference
* https://www.kaggle.com/c/word2vec-nlp-tutorial/overview
* https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec/data

In [2]:
# Uncomment to download the NLTK resources used below (punkt tokenizer, stopwords)
# import nltk; nltk.download("popular")

In [3]:
# Importing the built-in logging module
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

In [4]:
# Note: Word2Vec performs better on large datasets.
# In this example we use only the 25,000 labeled training examples from the IMDB dataset,
# so the performance is similar to that of the "bag of words" model.

# Importing libraries
import numpy as np
import pandas as pd

# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup
import re  # For regular expressions

# Stopwords can be useful for understanding the semantics of a sentence.
# Therefore stopwords are not removed while creating the word2vec model,
# but they will be removed while averaging feature vectors.
from nltk.corpus import stopwords

# word2vec expects a list of sentences, where each sentence is a list of words.
# We use the punkt tokenizer to split a paragraph into sentences.

import nltk.data


tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

In [5]:
# Read data from files
train = pd.read_csv(
    "./data/labeledTrainData.tsv.gz",
    delimiter="\t",
)
test = pd.read_csv("./data/testData.tsv.gz", delimiter="\t")

In [6]:
# This function converts a text to a sequence of words.
def review_wordlist(review, remove_stopwords=False):
    # 1. Removing html tags
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # 2. Removing non-letter.
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Converting to lower case and splitting
    words = review_text.lower().split()

    # 4. Optionally remove stopwords
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words


# This function splits a review into sentences
def review_sentences(review, tokenizer, remove_stopwords=False):
    # 1. Using nltk tokenizer
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []

    # 2. Loop for each sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_wordlist(raw_sentence, remove_stopwords))

    # This returns the list of lists
    return sentences
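
To see what steps 2 and 3 of `review_wordlist` do to a piece of text, here is a minimal sketch that uses only the standard library. The helper name `letters_only_tokens` and the sample string are made up for illustration, and the HTML-stripping step is left out on purpose.

```python
import re


def letters_only_tokens(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


# Illustrative input, not taken from the dataset
sample = "MJ's music, watched The Wiz!"
tokens = letters_only_tokens(sample)
print(tokens)
```

Note that the apostrophe is replaced by a space, so a contraction or possessive is split into two tokens. The same effect is visible in the real output, where "i've" becomes `'i', 've'`.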

In [7]:
text = train["review"].iloc[0]
print(text)

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally star

In [8]:
review_sentences(text, tokenizer)

[['with',
  'all',
  'this',
  'stuff',
  'going',
  'down',
  'at',
  'the',
  'moment',
  'with',
  'mj',
  'i',
  've',
  'started',
  'listening',
  'to',
  'his',
  'music',
  'watching',
  'the',
  'odd',
  'documentary',
  'here',
  'and',
  'there',
  'watched',
  'the',
  'wiz',
  'and',
  'watched',
  'moonwalker',
  'again'],
 ['maybe',
  'i',
  'just',
  'want',
  'to',
  'get',
  'a',
  'certain',
  'insight',
  'into',
  'this',
  'guy',
  'who',
  'i',
  'thought',
  'was',
  'really',
  'cool',
  'in',
  'the',
  'eighties',
  'just',
  'to',
  'maybe',
  'make',
  'up',
  'my',
  'mind',
  'whether',
  'he',
  'is',
  'guilty',
  'or',
  'innocent'],
 ['moonwalker',
  'is',
  'part',
  'biography',
  'part',
  'feature',
  'film',
  'which',
  'i',
  'remember',
  'going',
  'to',
  'see',
  'at',
  'the',
  'cinema',
  'when',
  'it',
  'was',
  'originally',
  'released'],
 ['some',
  'of',
  'it',
  'has',
  'subtle',
  'messages',
  'about',
  'mj',
  's',
  'feeli

In [9]:
# !pip install -U tqdm

In [10]:
from tqdm.auto import tqdm

sentences = []
print("Parsing sentences from training set")
# Wrapping the iterable in tqdm shows a progress bar while parsing
for review in tqdm(train["review"]):
    sentences += review_sentences(review, tokenizer)

Parsing sentences from training set




In [11]:
import multiprocessing


# Creating the model and setting values for the various parameters
num_features = 300  # Word vector dimensionality
min_word_count = 10  # Minimum word count
num_workers = max(1, multiprocessing.cpu_count() // 2)  # Number of parallel threads (must be an integer)
context = 10  # Context window size
downsampling = 1e-3  # (0.001) Downsample setting for frequent words

# Initializing the train model
from gensim.models import word2vec

print("Training model....")
model = word2vec.Word2Vec(
    sentences,
    workers=num_workers,
    vector_size=num_features,
    min_count=min_word_count,
    window=context,
    sample=downsampling,
)

# Saving the model for later use. Can be loaded using Word2Vec.load()
model_name = "word2vec.model"
model.save(model_name)

2021-07-03 02:54:48,818 : INFO : collecting all words and their counts
2021-07-03 02:54:48,818 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-07-03 02:54:48,881 : INFO : PROGRESS: at sentence #10000, processed 225664 words, keeping 17775 word types
2021-07-03 02:54:48,923 : INFO : PROGRESS: at sentence #20000, processed 451582 words, keeping 24944 word types
2021-07-03 02:54:48,964 : INFO : PROGRESS: at sentence #30000, processed 670632 words, keeping 30023 word types


Training model....


2021-07-03 02:54:49,030 : INFO : PROGRESS: at sentence #40000, processed 896478 words, keeping 34329 word types
2021-07-03 02:54:49,073 : INFO : PROGRESS: at sentence #50000, processed 1115469 words, keeping 37741 word types
2021-07-03 02:54:49,119 : INFO : PROGRESS: at sentence #60000, processed 1336692 words, keeping 40702 word types
2021-07-03 02:54:49,166 : INFO : PROGRESS: at sentence #70000, processed 1559365 words, keeping 43300 word types
2021-07-03 02:54:49,209 : INFO : PROGRESS: at sentence #80000, processed 1778623 words, keeping 45699 word types
2021-07-03 02:54:49,252 : INFO : PROGRESS: at sentence #90000, processed 2002603 words, keeping 48113 word types
2021-07-03 02:54:49,294 : INFO : PROGRESS: at sentence #100000, processed 2224101 words, keeping 50180 word types
2021-07-03 02:54:49,338 : INFO : PROGRESS: at sentence #110000, processed 2442894 words, keeping 52050 word types
2021-07-03 02:54:49,409 : INFO : PROGRESS: at sentence #120000, processed 2665092 words, keepin

KeyboardInterrupt: 
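
The `min_count` and `window` parameters used above can be illustrated on a toy corpus. The sketch below is only conceptual: the helper names and sentences are made up, and the library's real training loop does more than this (for example, the frequent-word downsampling controlled by `sample`).

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop every word that occurs fewer than min_count times in the whole corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Toy corpus for illustration only
toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "awful"],
]

filtered = filter_rare_words(toy_sentences, min_count=2)
pairs = context_pairs(toy_sentences[0], window=1)
print(filtered)
print(pairs)
```

With `min_count=2` only "the" and "was" survive, which shows why a high minimum count needs a reasonably large corpus. A larger `window` gives each word more neighbours to learn from, at the cost of more computation.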

In [None]:
# A few tests: this returns the word that does not fit with the others
model.wv.doesnt_match("man woman dog child kitchen".split())

In [None]:
model.wv.doesnt_match("france england germany berlin".split())

In [None]:
# This will print the most similar words present in the model
model.wv.most_similar("man")

In [None]:
model.wv.most_similar("awful")
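
`most_similar` ranks words by the cosine similarity between their vectors. A minimal sketch of that ranking, using made-up two-dimensional vectors instead of a trained model (the function names and numbers here are assumptions for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-d vectors; real word vectors have hundreds of dimensions
toy_vectors = {
    "awful": [0.9, 0.1],
    "terrible": [0.8, 0.2],
    "great": [-0.7, 0.6],
    "kitchen": [0.1, 0.9],
}

ranked = rank_by_similarity("awful", toy_vectors)
print(ranked)
```

Because cosine similarity depends only on direction, two words can be close even if their vectors have very different lengths.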

### Load a model trained on a bigger corpus (for better results)

In [None]:
# https://github.com/RaRe-Technologies/gensim-data#models
import gensim.downloader as api

# Pretrained GloVe vectors (trained with 6B tokens), loaded as KeyedVectors.
# They are stored under a separate name so that the Word2Vec `model` trained above
# is still available for the feature-averaging cells below.
glove_vectors = api.load("glove-wiki-gigaword-100")

### Solving Word Analogies!

* Man is to Woman what King is to ___?
* USA is to hamburger what UK is to ___?
* Korea is to kimchi what USA is to ___?

![](./figures/analogy.png)

In [None]:
glove_vectors.most_similar(positive=["king", "woman"], negative=["man"])
# glove_vectors.most_similar(positive=["hamburger", "uk"], negative=["usa"])
# glove_vectors.most_similar(positive=["kimchi", "usa"], negative=["korea"])
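
The idea behind the `positive`/`negative` query above is vector arithmetic: add the positive vectors, subtract the negative ones, and look for the nearest remaining word. The sketch below shows this on hand-made two-dimensional vectors. All names and numbers are assumptions for illustration, and the library's implementation differs in details such as normalisation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "prince": [0.8, 0.7],
    "apple": [-0.6, 0.1],
}

answer = solve_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)
```

The query words themselves are excluded from the candidates; otherwise "king" would often be its own nearest neighbour.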

In [None]:
# Shape of the embedding matrix: (number of words in the vocabulary, vector dimensionality)
model.wv.vectors.shape

In [None]:
# Function to average all word vectors in a paragraph
def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    # Converting index_to_key, which is a list, to a set for faster membership tests.
    index2word_set = set(model.wv.index_to_key)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv.get_vector(word))

    # Dividing the result by number of words to get average.
    # If none of the words is in the vocabulary, keep the zero vector
    # instead of dividing by zero.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec


# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter + 1

    return reviewFeatureVecs
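
The averaging step can be checked by hand on a tiny example. This is a dependency-free sketch with a made-up vocabulary of two-dimensional vectors; `average_vector` is a hypothetical stand-in for `featureVecMethod`, not the same function.

```python
def average_vector(words, vectors, num_features):
    """Average the vectors of the words found in `vectors`; unknown words are skipped."""
    total = [0.0] * num_features
    n_found = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue  # out-of-vocabulary word
        n_found += 1
        total = [t + x for t, x in zip(total, vec)]
    if n_found == 0:
        return total  # no known words: return the zero vector
    return [t / n_found for t in total]


# Hypothetical vocabulary for illustration
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

avg = average_vector(["good", "movie", "unseenword"], toy_vectors, num_features=2)
empty = average_vector(["unseenword"], toy_vectors, num_features=2)
print(avg, empty)
```

Averaging throws away word order, so "not good" and "good not" get the same feature vector. That is one reason this approach tends to perform about as well as bag of words on a dataset of this size.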

In [None]:
# Calculating average feature vector for training set
clean_train_reviews = []
for review in train["review"]:
    cleaned = review_wordlist(review, remove_stopwords=True)
    clean_train_reviews.append(cleaned)

trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)

In [None]:
# Calculating average feature vectors for test set
clean_test_reviews = []
for review in test["review"]:
    cleaned = review_wordlist(review, remove_stopwords=True)
    clean_test_reviews.append(cleaned)

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

In [None]:
# Fitting a random forest classifier to the training data
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)

print("Fitting random forest to training data....")
forest = forest.fit(trainDataVecs, train["sentiment"])

In [None]:
# Predicting the sentiment values for test data and saving the results in a csv file
result = forest.predict(testDataVecs)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("output.csv", index=False, quoting=3)
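
The value `quoting=3` passed to `to_csv` corresponds to `csv.QUOTE_NONE` from the standard library, which writes fields without surrounding quotes. A small sketch of the resulting file layout, using placeholder ids rather than real ones from the test set:

```python
import csv
import io

# Placeholder rows; the real ids come from test["id"]
rows = [("sample_1", 1), ("sample_2", 0)]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["id", "sentiment"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Using the named constant (`quoting=csv.QUOTE_NONE`) instead of the bare `3` would make the intent easier to read.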

Submit the output at https://www.kaggle.com/c/word2vec-nlp-tutorial/leaderboard

# Bonus: Aspect-based Sentiment Analysis
* ref: https://towardsdatascience.com/aspect-based-sentiment-analysis-using-spacy-textblob-4c8de3e0d2b9

In [34]:
import spacy
from spacy import displacy
from pprint import pprint

nlp = spacy.load("en_core_web_md")

In [19]:
sentences = [
    "The food we had yesterday was delicious",
    "My time in Italy was very enjoyable",
    "I found the meal to be tasty",
    "The internet was slow.",
    "Our experience was suboptimal",
]

### First, we pick up the sentiment description

In [23]:
for sentence in sentences:
    doc = nlp(sentence)
    descriptive_term = ""
    for token in doc:
        if token.pos_ == "ADJ":
            descriptive_term = token
    print(sentence)
    print(descriptive_term)
    print()

The food we had yesterday was delicious
delicious

My time in Italy was very enjoyable
enjoyable

I found the meal to be tasty
tasty

The internet was slow.
slow

Our experience was suboptimal
suboptimal



### Try to also extract intensifiers (e.g., "very")

In [25]:
for sentence in sentences:
    doc = nlp(sentence)
    descriptive_term = ""
    for token in doc:
        if token.pos_ == "ADJ":
            prepend = ""
            for child in token.children:
                if child.pos_ != "ADV":
                    continue
                prepend += child.text + " "
            descriptive_term = prepend + token.text
    print(sentence)
    print(descriptive_term)
    print()

The food we had yesterday was delicious
delicious

My time in Italy was very enjoyable
very enjoyable

I found the meal to be tasty
tasty

The internet was slow.
slow

Our experience was suboptimal
suboptimal



### Now, identify the targets of the sentiments

In [38]:
doc = nlp(sentences[0])
displacy.render(doc, style="dep")

In [33]:
aspects = []
for sentence in sentences:
    doc = nlp(sentence)
    descriptive_term = ""
    target = ""
    for token in doc:
        if token.dep_ == "nsubj" and token.pos_ == "NOUN":
            target = token.text
        if token.pos_ == "ADJ":
            prepend = ""
            for child in token.children:
                if child.pos_ != "ADV":
                    continue
                prepend += child.text + " "
            descriptive_term = prepend + token.text
    aspects.append({"aspect": target, "description": descriptive_term})
pprint(aspects)

[{'aspect': 'food', 'description': 'delicious'},
 {'aspect': 'time', 'description': 'very enjoyable'},
 {'aspect': 'meal', 'description': 'tasty'},
 {'aspect': 'internet', 'description': 'slow'},
 {'aspect': 'experience', 'description': 'suboptimal'}]


### Classify the sentiment using `TextBlob`

In [31]:
from textblob import TextBlob

for aspect in aspects:
    aspect["sentiment"] = TextBlob(aspect["description"]).sentiment  # or other sentiment classifiers
pprint(aspects)

[{'aspect': 'food',
  'description': 'delicious',
  'sentiment': Sentiment(polarity=1.0, subjectivity=1.0)},
 {'aspect': 'time',
  'description': 'very enjoyable',
  'sentiment': Sentiment(polarity=0.65, subjectivity=0.78)},
 {'aspect': 'meal',
  'description': 'tasty',
  'sentiment': Sentiment(polarity=0.0, subjectivity=0.0)},
 {'aspect': 'internet',
  'description': 'slow',
  'sentiment': Sentiment(polarity=-0.30000000000000004, subjectivity=0.39999999999999997)},
 {'aspect': 'experience',
  'description': 'suboptimal',
  'sentiment': Sentiment(polarity=0.0, subjectivity=0.0)}]
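
To turn the polarity scores above into labels, one option is a simple threshold. The sketch below is an assumption for illustration: the function name and the 0.1 threshold are not part of TextBlob, and the polarity values are copied (rounded) from the output above.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity values taken from the TextBlob output above (rounded)
polarities = {
    "food": 1.0,
    "time": 0.65,
    "meal": 0.0,
    "internet": -0.3,
    "experience": 0.0,
}

labels = {aspect: polarity_label(score) for aspect, score in polarities.items()}
print(labels)
```

"tasty" and "suboptimal" both received a polarity of 0.0, so they come out as neutral here even though a human reader would not call them neutral. This suggests the classifier does not recognise these words, and it is a reason to consider other sentiment classifiers for this step.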
