# Analyzing Movie Reviews - Sentiment Analysis
In this notebook, we focus on trying to analyze a large corpus of movie reviews and derive the sentiment.

[![image](http://www.moviereviewworld.com/wp-content/uploads/2013/06/movie-review-world-homepage-image.jpg)](http://www.moviereviewworld.com/)

We cover a wide variety of techniques for analyzing sentiment, which include the following.
- Unsupervised lexicon-based models
- Traditional supervised Machine Learning models
- Newer supervised Deep Learning models
- Advanced supervised Deep Learning models

Besides looking at various approaches and models, we also focus on important aspects in the Machine Learning pipeline including text pre-processing, normalization, and in-depth analysis of models, including model interpretation and topic models. The key idea here is to understand how we tackle a problem like sentiment analysis on unstructured text, learn various techniques, models and understand how to interpret the results. This will enable you to use these methodologies in the future on your own datasets. Let's get started!

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#How-to-classify-Sentiment?" data-toc-modified-id="How-to-classify-Sentiment?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>How to classify Sentiment?</a></span></li></ul></li><li><span><a href="#Preparing-environment-and-data" data-toc-modified-id="Preparing-environment-and-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparing environment and data</a></span><ul class="toc-item"><li><span><a href="#Import-and-Setting-Up-Dependencies" data-toc-modified-id="Import-and-Setting-Up-Dependencies-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import and Setting Up Dependencies</a></span></li><li><span><a href="#Text-Pre-Processing-and-Normalization" data-toc-modified-id="Text-Pre-Processing-and-Normalization-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Text Pre-Processing and Normalization</a></span><ul class="toc-item"><li><span><a href="#Cleaning-Text---strip-HTML" data-toc-modified-id="Cleaning-Text---strip-HTML-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Cleaning Text - strip HTML</a></span></li><li><span><a href="#Removing-accented-characters" data-toc-modified-id="Removing-accented-characters-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Removing accented characters</a></span></li><li><span><a href="#Expanding-Contractions" data-toc-modified-id="Expanding-Contractions-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>Expanding Contractions</a></span></li><li><span><a href="#Removing-Special-Characters" data-toc-modified-id="Removing-Special-Characters-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>Removing Special Characters</a></span></li><li><span><a href="#Lemmatizing-text" data-toc-modified-id="Lemmatizing-text-2.2.5"><span class="toc-item-num">2.2.5&nbsp;&nbsp;</span>Lemmatizing text</a></span></li><li><span><a href="#Removing-Stopwords" data-toc-modified-id="Removing-Stopwords-2.2.6"><span class="toc-item-num">2.2.6&nbsp;&nbsp;</span>Removing Stopwords</a></span></li><li><span><a href="#Normalize-text-corpus---tying-it-all-together" data-toc-modified-id="Normalize-text-corpus---tying-it-all-together-2.2.7"><span class="toc-item-num">2.2.7&nbsp;&nbsp;</span>Normalize text corpus - tying it all together</a></span></li></ul></li><li><span><a href="#Topics-Help-Functions" data-toc-modified-id="Topics-Help-Functions-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Topics Help Functions</a></span></li><li><span><a href="#Simplify-Get-Results" data-toc-modified-id="Simplify-Get-Results-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Simplify Get Results</a></span></li><li><span><a href="#Load-and-normalize-data" data-toc-modified-id="Load-and-normalize-data-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Load and normalize data</a></span></li></ul></li><li><span><a href="#Sentiment-Analysis---Unsupervised-Lexical" data-toc-modified-id="Sentiment-Analysis---Unsupervised-Lexical-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Sentiment Analysis - Unsupervised Lexical</a></span><ul class="toc-item"><li><span><a href="#Sentiment-Analysis-with-AFINN" data-toc-modified-id="Sentiment-Analysis-with-AFINN-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Sentiment Analysis with AFINN</a></span></li><li><span><a href="#Sentiment-Analysis-with-SentiWordNet" data-toc-modified-id="Sentiment-Analysis-with-SentiWordNet-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Sentiment Analysis with SentiWordNet</a></span></li><li><span><a href="#Sentiment-Analysis-with-VADER" data-toc-modified-id="Sentiment-Analysis-with-VADER-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Sentiment Analysis with VADER</a></span></li></ul></li><li><span><a href="#Classifying-Sentiment-with-Supervised-Learning" data-toc-modified-id="Classifying-Sentiment-with-Supervised-Learning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Classifying Sentiment with Supervised Learning</a></span><ul class="toc-item"><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Traditional-Supervised-Machine-Learning-Models" data-toc-modified-id="Traditional-Supervised-Machine-Learning-Models-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Traditional Supervised Machine Learning Models</a></span><ul class="toc-item"><li><span><a href="#Model-Training" data-toc-modified-id="Model-Training-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Model Training</a></span></li><li><span><a href="#Prediction-and-Performance-Evaluation" data-toc-modified-id="Prediction-and-Performance-Evaluation-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Prediction and Performance Evaluation</a></span></li></ul></li><li><span><a href="#Newer-Supervised-Deep-Learning-Models" data-toc-modified-id="Newer-Supervised-Deep-Learning-Models-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Newer Supervised Deep Learning Models</a></span><ul class="toc-item"><li><span><a href="#Prediction-class-label-encoding" data-toc-modified-id="Prediction-class-label-encoding-4.3.1"><span class="toc-item-num">4.3.1&nbsp;&nbsp;</span>Prediction class label encoding</a></span></li><li><span><a href="#Feature-Engineering-with-word-embeddings" data-toc-modified-id="Feature-Engineering-with-word-embeddings-4.3.2"><span class="toc-item-num">4.3.2&nbsp;&nbsp;</span>Feature Engineering with word embeddings</a></span></li><li><span><a href="#Modeling-with-deep-neural-networks" data-toc-modified-id="Modeling-with-deep-neural-networks-4.3.3"><span class="toc-item-num">4.3.3&nbsp;&nbsp;</span>Modeling with deep neural networks</a></span><ul class="toc-item"><li><span><a href="#Building-Deep-neural-network-architecture" data-toc-modified-id="Building-Deep-neural-network-architecture-4.3.3.1"><span class="toc-item-num">4.3.3.1&nbsp;&nbsp;</span>Building Deep neural network architecture</a></span></li><li><span><a href="#Model-Training,-Prediction-and-Performance-Evaluation" data-toc-modified-id="Model-Training,-Prediction-and-Performance-Evaluation-4.3.3.2"><span class="toc-item-num">4.3.3.2&nbsp;&nbsp;</span>Model Training, Prediction and Performance Evaluation</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Advanced-Supervised-Deep-Learning-Models" data-toc-modified-id="Advanced-Supervised-Deep-Learning-Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Advanced Supervised Deep Learning Models</a></span><ul class="toc-item"><li><span><a href="#Preparing-data" data-toc-modified-id="Preparing-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Preparing data</a></span><ul class="toc-item"><li><span><a href="#Tokenize-train-&amp;-test-datasets" data-toc-modified-id="Tokenize-train-&amp;-test-datasets-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Tokenize train &amp; test datasets</a></span></li><li><span><a href="#Build-Vocabulary-Mapping-(word-to-index)" data-toc-modified-id="Build-Vocabulary-Mapping-(word-to-index)-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Build Vocabulary Mapping (word to index)</a></span></li><li><span><a href="#Encode-and-Pad-datasets-&amp;-Encode-prediction-class-labels" data-toc-modified-id="Encode-and-Pad-datasets-&amp;-Encode-prediction-class-labels-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Encode and Pad datasets &amp; Encode prediction class labels</a></span></li></ul></li><li><span><a href="#Build-the-LSTM-Model-Architecture" data-toc-modified-id="Build-the-LSTM-Model-Architecture-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Build the LSTM Model Architecture</a></span><ul class="toc-item"><li><span><a href="#Visualize-model-architecture" data-toc-modified-id="Visualize-model-architecture-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Visualize model architecture</a></span></li></ul></li><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Train the model</a></span></li><li><span><a href="#Predict-and-Evaluate-Model-Performance" data-toc-modified-id="Predict-and-Evaluate-Model-Performance-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Predict and Evaluate Model Performance</a></span></li></ul></li><li><span><a href="#Analyzing-Sentiment-Causation" data-toc-modified-id="Analyzing-Sentiment-Causation-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Analyzing Sentiment Causation</a></span><ul class="toc-item"><li><span><a href="#Build-Text-Classification-Pipeline-with-The-Best-Model" data-toc-modified-id="Build-Text-Classification-Pipeline-with-The-Best-Model-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Build Text Classification Pipeline with The Best Model</a></span></li><li><span><a href="#Interpreting-Predictive-Models" data-toc-modified-id="Interpreting-Predictive-Models-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Interpreting Predictive Models</a></span><ul class="toc-item"><li><span><a href="#Analyze-Model-Prediction-Probabilities" data-toc-modified-id="Analyze-Model-Prediction-Probabilities-6.2.1"><span class="toc-item-num">6.2.1&nbsp;&nbsp;</span>Analyze Model Prediction Probabilities</a></span></li><li><span><a href="#Interpreting-Model-Decisions" data-toc-modified-id="Interpreting-Model-Decisions-6.2.2"><span class="toc-item-num">6.2.2&nbsp;&nbsp;</span>Interpreting Model Decisions</a></span></li></ul></li><li><span><a href="#Analyzing-Topic-Models" data-toc-modified-id="Analyzing-Topic-Models-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Analyzing Topic Models</a></span><ul class="toc-item"><li><span><a href="#Extract-features-from-positive-and-negative-reviews" data-toc-modified-id="Extract-features-from-positive-and-negative-reviews-6.3.1"><span class="toc-item-num">6.3.1&nbsp;&nbsp;</span>Extract features from positive and negative reviews</a></span></li><li><span><a href="#Topic-Modeling-on-Reviews" data-toc-modified-id="Topic-Modeling-on-Reviews-6.3.2"><span class="toc-item-num">6.3.2&nbsp;&nbsp;</span>Topic Modeling on Reviews</a></span></li><li><span><a href="#Visualize-topics-for-positive-reviews" data-toc-modified-id="Visualize-topics-for-positive-reviews-6.3.3"><span class="toc-item-num">6.3.3&nbsp;&nbsp;</span>Visualize topics for positive reviews</a></span></li><li><span><a href="#Display-and-visualize-topics-for-negative-reviews" data-toc-modified-id="Display-and-visualize-topics-for-negative-reviews-6.3.4"><span class="toc-item-num">6.3.4&nbsp;&nbsp;</span>Display and visualize topics for negative reviews</a></span></li></ul></li></ul></li></ul></div>

## Introduction

The problem at hand is sentiment analysis or opinion mining, where we want to analyze some textual documents and predict their sentiment or opinion based on the content of these documents.

A text corpus consists of multiple text documents and each document can be as simple as a single sentence to a complete document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents:
- ***Factual/objective documents***: typically depict some form of statements or facts with no specific feelings or emotion attached to them. 
- ***Subjective documents***: text that expresses feelings, moods, emotions, and opinions.

Typically sentiment analysis seems to work best on subjective text, where people express opinions, feelings, and their mood. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews for movies, places, commodities, and many more. The idea is to analyze and understand the reactions of people toward a specific entity and take insightful actions based on their sentiment.

![image](https://www.kdnuggets.com/images/sentiment-fig-1-689.jpg)

**Sentiment analysis** is also popularly known as **opinion analysis** or **opinion mining**. The key idea is to use techniques from text analytics, NLP, Machine Learning, and linguistics to extract important information or data points from unstructured text. This in turn can help us derive ***qualitative outputs*** like the overall sentiment being on a ***positive***, ***neutral***, or ***negative*** scale and ***quantitative outputs*** like the sentiment ***polarity***, ***subjectivity***, and ***objectivity*** proportions. 

**Sentiment polarity** is typically a numeric score that's assigned to both the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has 0 polarity since it does not express and specific sentiment, positive sentiment will have polarity > 0, and negative < 0. Of course, you can always change these thresholds based on the type of text you are dealing with.

### How to classify Sentiment?
![image](https://www.kdnuggets.com/images/sentiment-fig-2-532.jpg)
__Machine Learning__:

This approach, employes a machine-learning technique and diverse features to construct a classifier that can identify text that expresses sentiment. Nowadays, deep-learning methods are popular because they fit on data learning representations.

__Lexicon-Based__:

This method uses a variety of words annotated by polarity score, to decide the general assessment score of a given content. The strongest asset of this technique is that it does not require any training data, while its weakest point is that a large number of words and expressions are not included in sentiment lexicons.

__Hybrid__:

The combination of machine learning and lexicon-based approaches to address Sentiment Analysis is called Hybrid. Though not commonly used, this method usually produces more promising results than the approaches mentioned above.


## Preparing environment and data
### Import and Setting Up Dependencies

Let’s load the necessary dependencies and settings before getting started.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata

nlp = spacy.load('en', parse = False, tag=False, entity=False)
tokenizer = ToktokTokenizer()

import datetime
from datetime import timedelta
 
datetimeFormat = '%Y-%m-%d %H:%M:%S.%f'

from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import NMF
from sklearn.base import clone

from scipy import interp

from afinn import Afinn
afn = Afinn(emoticons=True) 

import nltk
nltk.download('all', halt_on_error=False)
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import gensim

from collections import Counter

from IPython.display import SVG

import keras
from keras.models import Sequential
from keras.layers import Dropout, Activation, Dense, Embedding, Dropout, SpatialDropout1D, LSTM, Bidirectional
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing import sequence

#from skater.core.local_interpretation.lime.lime_text import LimeTextExplainer
from lime.lime_text import LimeTextExplainer

import pyLDAvis
import pyLDAvis.sklearn

np.set_printoptions(precision=2, linewidth=80)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /usr/share/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /usr/share/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /usr/share/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to
[nltk_data]    |     /usr/share/nltk_data

[nltk_data]    |   Package sentiwordnet is already up-to-date!
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package sentence_polarity is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package sinica_treebank is already up-to-date!
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package smultron is already up-to-date!
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package stopwo

Using TensorFlow backend.


**Notes**: NLP libraries which will be used include spacy, nltk, and gensim. Do remember to check that your installed nltk version is at least >= 3.2.4, otherwise, the ToktokTokenizer class may not be present. If you want to use a lower nltk version for some reason, you can use any other tokenizer like the default word_tokenize() based on the TreebankWordTokenizer. The version for gensim should be at least 2.3.0 and for spacy, the version used was 1.9.0. We recommend using the latest version of spacy which was recently released (version 2.x) as this has fixed several bugs and added several improvements.

### Text Pre-Processing and Normalization

An initial step in text and sentiment classification is pre-processing. A significant amount of techniques is applied to data in order to improvement of classification effectiveness. This enables standardization across a document corpus, which helps build meaningful features, to reduce dimensionality and reduce noise that can be introduced due to many factors like irrelevant symbols, special characters, XML and HTML tags, and so on.

The main components in our text normalization pipeline are:

#### Cleaning Text - strip HTML
Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing sentiment. Hence we need to make sure we remove them before extracting features. The BeautifulSoup library does an excellent job in providing necessary functions for this. Our strip_html_tags(...) function enables in cleaning and stripping out HTML code.

In [2]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

#### Removing accented characters
In our dataset, we are dealing with reviews in the English language so we need to make sure that characters with any other format, especially accented characters are converted and standardized into ASCII characters. A simple example would be converting é to e. Our remove_accented_chars(...) function helps us in this respect.

In [3]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

#### Expanding Contractions
In the English language, contractions are basically shortened versions of words or syllables. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe and we also have to convert each contraction to its expanded, original form. Our expand_contractions(...) function uses regular expressions and various contractions mapped to expand all contractions in our text corpus.

In [4]:
# -*- coding: utf-8 -*-

# Contraction Map
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

#### Removing Special Characters
Simple regexes can be used to achieve this. Our function remove_special_characters(...) helps us remove special characters. In our code, we have retained numbers but you can also remove numbers if you do not want them in your normalized corpus.

In [5]:
def remove_special_characters(text):
    text = re.sub(r'[^a-zA-z0-9\s]', '', text)
    return text

#### Lemmatizing text
**Word stems** are usually the base form of possible words that can be created by ***attaching affixes*** like prefixes and suffixes ***to the stem*** to create new words. This is known as **inflection**. The **reverse process** of obtaining the base form of a word is known as **stemming**. The nltk package offers a wide range of stemmers like the PorterStemmer and LancasterStemmer. **Lemmatization** is very similar to stemming, where we remove word affixes to get to the base form of a word. However the base form in this case is known as the **root word** but not the root stem. The difference being that ***the root word is always a lexicographically correct word***, present in the dictionary, but the root stem may not be so. We will be using lemmatization only in our normalization pipeline to retain lexicographically correct words. The function lemmatize_text(...) helps us with this aspect.

In [6]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

#### Removing Stopwords
Words which have little or no significance especially when constructing meaningful features from text are also known as stopwords or stop words. These are usually words that end up having the maximum frequency if you do a simple term or word frequency in a document corpus. Words like a, an, the, and so on are considered to be stopwords. There is no universal stopword list but we use a standard English language stopwords list from nltk. You can also add your own domain specific stopwords if needed. The function remove_stopwords(...) helps us remove stopwords and retain words having the most significance and context in a corpus.

In [7]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

#### Normalize text corpus - tying it all together

We use all these components and tie them together in the following function called normalize_corpus(...), which can be used to take a document corpus as input and return the same corpus with cleaned and normalized text documents.

In [8]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
        # insert spaces between special characters to isolate them    
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters    
        if special_char_removal:
            doc = remove_special_characters(doc)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus


### Topics Help Functions

We will also leverage some utility functions to support get and display topics from a corpus with their terms and weights.

In [9]:
# Prints components of all the topics obtained from topic modeling
def print_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):
    
    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        topic = [(word, round(wt,2)) 
                 for word, wt in topic 
                 if abs(wt) >= weight_threshold]
                     
        if display_weights:
            print('Topic #'+str(index+1)+' with weights')
            print(topic[:num_terms]) if num_terms else topic
        else:
            print('Topic #'+str(index+1)+' without weights')
            tw = [term for term, wt in topic]
            print(tw[:num_terms]) if num_terms else tw
        print()
        

# Extracts topics with their terms and weights 
# Format is Topic N: [(term1, weight1), ..., (termn, weightn)]        
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1]) 
                           for row 
                           in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index]) 
                               for wt, index 
                               in zip(weights,sorted_indices)])
    sorted_terms = np.array([list(feature_names[row]) 
                             for row 
                             in sorted_indices])
    
    topics = [np.vstack((terms.T, term_weights.T)).T 
              for terms, term_weights 
              in zip(sorted_terms, sorted_weights)]     
    
    return topics         

### Simplify Get Results
Let's build a function to standardize the capture and exposure of the results of our models.

As a classification problem, Sentiment Analysis uses the evaluation metrics of Precision, Recall, F-score, and Accuracy. Also, average measures like macro, micro, and weighted F1-scores are useful for multi-class problems. 

In [10]:
def get_results(model, name, data, true_labels, target_names = ['positive', 'negative'], results=None, reasume=False):

    if hasattr(model, 'layers'):
        param = wtp_dnn_model.history.params
        best = np.mean(wtp_dnn_model.history.history['val_acc'])
        predicted_labels = model.predict_classes(data) 
        im_model = InMemoryModel(model.predict, examples=data, target_names=target_names)

    else:
        param = gs.best_params_
        best = gs.best_score_
        predicted_labels = model.predict(data).ravel()
        if hasattr(model, 'predict_proba'):
            im_model = InMemoryModel(model.predict_proba, examples=data, target_names=target_names)
        elif hasattr(clf, 'decision_function'):
            im_model = InMemoryModel(model.decision_function, examples=data, target_names=target_names)
        
    print('Mean Best Accuracy: {:2.2%}'.format(best))
    print('-'*60)
    print('Best Parameters:')
    print(param)
    print('-'*60)
    
    y_pred = model.predict(data).ravel()
    
    display_model_performance_metrics(true_labels, predicted_labels = predicted_labels, target_names = target_names)
    if len(target_names)==2:
        ras = roc_auc_score(y_true=true_labels, y_score=y_pred)
    else:
        roc_auc_multiclass, ras = roc_auc_score_multiclass(y_true=true_labels, y_score=y_pred, target_names=target_names)
        print('\nROC AUC Score by Classes:\n',roc_auc_multiclass)
        print('-'*60)

    print('\n\n              ROC AUC Score: {:2.2%}'.format(ras))
    prob, score_roc, roc_auc = plot_model_roc_curve(model, data, true_labels, label_encoder=None, class_names=target_names)
    
    interpreter = Interpretation(data, feature_names=cols)
    plots = interpreter.feature_importance.plot_feature_importance(im_model, progressbar=False, n_jobs=1, ascending=True)
    
    r1 = pd.DataFrame([(prob, best, np.round(accuracy_score(true_labels, predicted_labels), 4), 
                         ras, roc_auc)], index = [name],
                         columns = ['Prob', 'CV Accuracy', 'Accuracy', 'ROC AUC Score', 'ROC Area'])
    if reasume:
        results = r1
    elif (name in results.index):        
        results.loc[[name], :] = r1
    else: 
        results = results.append(r1)
        
    return results

def roc_auc_score_multiclass(y_true, y_score, target_names, average = "macro"):

  #creating a set of all the unique classes using the actual class list
  unique_class = set(y_true)
  roc_auc_dict = {}
  mean_roc_auc = 0
  for per_class in unique_class:
    #creating a list of all the classes except the current class 
    other_class = [x for x in unique_class if x != per_class]

    #marking the current class as 1 and all other classes as 0
    new_y_true = [0 if x in other_class else 1 for x in y_true]
    new_y_score = [0 if x in other_class else 1 for x in y_score]
    num_new_y_true = sum(new_y_true)

    #using the sklearn metrics method to calculate the roc_auc_score
    roc_auc = roc_auc_score(new_y_true, new_y_score, average = average)
    roc_auc_dict[target_names[per_class]] = np.round(roc_auc, 4)
    mean_roc_auc += num_new_y_true * np.round(roc_auc, 4)
    
  mean_roc_auc = mean_roc_auc/len(y_true)  
  return roc_auc_dict, mean_roc_auc

def get_metrics(true_labels, predicted_labels):
    
    print('Accuracy:  {:2.2%} '.format(metrics.accuracy_score(true_labels, predicted_labels)))
    print('Precision: {:2.2%} '.format(metrics.precision_score(true_labels, predicted_labels, average='weighted')))
    print('Recall:    {:2.2%} '.format(metrics.recall_score(true_labels, predicted_labels, average='weighted')))
    print('F1 Score:  {:2.2%} '.format(metrics.f1_score(true_labels, predicted_labels, average='weighted')))
                        

def train_predict_model(classifier,  train_features, train_labels,  test_features, test_labels):
    # build model    
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features) 
    return predictions    


def display_confusion_matrix(true_labels, predicted_labels, target_names):
    
    total_classes = len(target_names)
    level_labels = [total_classes*[0], list(range(total_classes))]

    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels)
    cm_frame = pd.DataFrame(data=cm, 
                            columns=pd.MultiIndex(levels=[['Predicted:'], target_names], labels=level_labels), 
                            index=pd.MultiIndex(levels=[['Actual:'], target_names], labels=level_labels)) 
    print(cm_frame) 
    
def display_classification_report(true_labels, predicted_labels, target_names):

    report = metrics.classification_report(y_true=true_labels, y_pred=predicted_labels, target_names=target_names) 
    print(report)
    
def display_model_performance_metrics(true_labels, predicted_labels, target_names):
    print('Model Performance metrics:')
    print('-'*30)
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print('\nModel Classification report:')
    print('-'*30)
    display_classification_report(true_labels=true_labels, predicted_labels=predicted_labels, target_names=target_names)
    print('\nPrediction Confusion Matrix:')
    print('-'*30)
    display_confusion_matrix(true_labels=true_labels, predicted_labels=predicted_labels, target_names=target_names)


def plot_model_roc_curve(clf, features, true_labels, label_encoder=None, class_names=None):
    
    ## Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    if hasattr(clf, 'classes_'):
        class_labels = clf.classes_
    elif label_encoder:
        class_labels = label_encoder.classes_
    elif class_names:
        class_labels = class_names
    else:
        raise ValueError('Unable to derive prediction classes, please specify class_names!')
    n_classes = len(class_labels)
   
    if n_classes == 2:
        if hasattr(clf, 'predict_proba'):
            prb = clf.predict_proba(features)
            if prb.shape[1] > 1:
                y_score = prb[:, prb.shape[1]-1] 
            else:
                y_score = clf.predict(features).ravel()
            prob = True
        elif hasattr(clf, 'decision_function'):
            y_score = clf.decision_function(features)
            prob = False
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
        
        fpr, tpr, _ = roc_curve(true_labels, y_score)      
        roc_auc = auc(fpr, tpr)

        plt.plot(fpr, tpr, label='ROC curve (area = {0:3.2%})'.format(roc_auc), linewidth=2.5)
        
    elif n_classes > 2:
        if  hasattr(clf, 'clfs_'):
            y_labels = label_binarize(true_labels, classes=list(range(len(class_labels))))
        else:
            y_labels = label_binarize(true_labels, classes=class_labels)
        if hasattr(clf, 'predict_proba'):
            y_score = clf.predict_proba(features)
            prob = True
        elif hasattr(clf, 'decision_function'):
            y_score = clf.decision_function(features)
            prob = False
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
            
        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_labels[:, i], y_score[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])

        ## Compute micro-average ROC curve and ROC area
        fpr["micro"], tpr["micro"], _ = roc_curve(y_labels.ravel(), y_score.ravel())
        roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

        ## Compute macro-average ROC curve and ROC area
        # First aggregate all false positive rates
        all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
        # Then interpolate all ROC curves at this points
        mean_tpr = np.zeros_like(all_fpr)
        for i in range(n_classes):
            mean_tpr += interp(all_fpr, fpr[i], tpr[i])
        # Finally average it and compute AUC
        mean_tpr /= n_classes
        fpr["macro"] = all_fpr
        tpr["macro"] = mean_tpr
        roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

        ## Plot ROC curves
        plt.figure(figsize=(6, 4))
        plt.plot(fpr["micro"], tpr["micro"], label='micro-average ROC curve (area = {0:2.2%})'
                       ''.format(roc_auc["micro"]), linewidth=3)

        plt.plot(fpr["macro"], tpr["macro"], label='macro-average ROC curve (area = {0:2.2%})'
                       ''.format(roc_auc["macro"]), linewidth=3)

        for i, label in enumerate(class_names):
            plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:2.2%})'
                                           ''.format(label, roc_auc[i]), linewidth=2, linestyle=':')
        roc_auc = roc_auc["macro"]   
    else:
        raise ValueError('Number of classes should be atleast 2 or more')
        
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([-0.01, 1.0])
    plt.ylim([0.0, 1.01])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
    
    return prob, y_score, roc_auc

### Load and normalize data
We can now load our IMDb movie reviews dataset, use the first 40,000 reviews for training models and the remaining 10,000 reviews as the test dataset to evaluate model performance.

In [11]:
dataset = pd.read_csv(r'../input/movie_reviews.csv')
reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])

# take a peek at the data
display(dataset.head())

# build train and test datasets
train_reviews, test_reviews, train_sentiments, test_sentiments =\
    train_test_split(reviews, sentiments , test_size=0.20,  random_state=101)

sample_review_ids = [7626, 3533, 9010]

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


 Now, we will also use our normalization module to normalize our review datasets. This is a time-consuming operation.

In [13]:
now = datetime.datetime.now()
print('Current date and time: {}'.format(now.strftime("%Y-%m-%d %H:%M:%S")))

# normalize training dataset
print('-'*60)
print('Normalize training dataset:')
norm_train_reviews = normalize_corpus(train_reviews)
diff = (datetime.datetime.now() - now)
now = datetime.datetime.now()
print('Elapsed time: {}\n'.format(diff))

# normalize test dataset
print('-'*60)
print('Normalize test dataset:')
norm_test_reviews = normalize_corpus(test_reviews)
diff = (datetime.datetime.now() - now)
print('Elapsed time: {}\n'.format(diff))

Current date and time: 2019-01-30 15:45:15
------------------------------------------------------------
Normalize training dataset:
Elapsed time: 0:42:41.881894

------------------------------------------------------------
Normalize test dataset:
Elapsed time: 0:10:23.421805



## Sentiment Analysis - Unsupervised Lexical

Even though we have labeled data, this section should give you a good idea of how lexicon based models work and you can apply the same in your own datasets when you do not have labeled data.

Unsupervised sentiment analysis models use well curated knowledgebases, ontologies, lexicons, and databases that have detailed information pertaining to subjective words, phrases including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words specifically aligned toward sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. You can use these lexicons and compute sentiment of a text document by matching the presence of specific words from the lexicon, look at other additional factors like presence of negation parameters, surrounding words, overall context and phrases and aggregate overall sentiment polarity scores to decide the final sentiment score. 

![image](https://image.slidesharecdn.com/iccbr-12-main-121111094448-phpapp01/95/sentiment-classification-with-casebased-reasoning-10-638.jpg)

There are several popular lexicon models used for sentiment analysis. Some of them are mentioned as follows.
- Bing Liu’s Lexicon
- MPQA Subjectivity Lexicon
- Pattern Lexicon
- AFINN Lexicon
- SentiWordNet Lexicon
- VADER Lexicon

This is not an exhaustive list of lexicon models, but definitely lists among the most popular ones available today. Since we have labeled data, it will be easy for us to see how well our actual sentiment values for these movie reviews match our lexiconmodel based predicted sentiment values. We will be covering the last three lexicon models in more detail and predict their sentiment and see how well our model performs based on model evaluation metrics like accuracy, precision, recall, and F1-score.


### Sentiment Analysis with AFINN

The [AFINN lexicon](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010) is perhaps one of the simplest and most popular lexicons that can be used extensively for sentiment analysis. It is a list of words rated for valence with an integer between minus five (negative) and plus five (positive).  The current version of the lexicon is [AFINN-en-165.txt](https://github.com/fnielsen/afinn/blob/master/afinn/data/) and it contains over 3,300+ words with a polarity score associated with each word. The author has also created a nice wrapper library on top of this in Python called afinn which we will be using for our analysis needs. AFINN takes into account other aspects like emoticons and exclamations.
![image](https://image.slidesharecdn.com/phpbnl18-machine-learning-180126163450/95/learning-machine-learning-31-638.jpg)

We can now use this object and compute the polarity of our chosen four sample reviews. The results permit you compare the actual sentiment label for each review and also check out the predicted sentiment polarity score. A negative polarity typically denotes negative sentiment. 

In [14]:
sample_review_ids = [7626, 3533, 9010]

In [15]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', afn.score(review))
    print('-'*60)
    print('\n')

REVIEW: For those not in the know, the Asterix books are a hugely successful series of comic books about a village of indomitable Gauls who resist Caesar's invasion thanks to a magic potion that renders them invulnerable supermen. There have been several animated features (only one of them, The Twelve Tasks of Asterix really capturing the wit and spirit of the books despite being an original screen story) before a perfectly cast Christian Clavier and Gerard Depardieu took the lead roles in two live action adaptations that proved colossally successful throughout Europe but made no impression whatsoever in the English-speaking world. <br /><br />The uncut French version is great fun, but sadly does not appear to be available in a version with English subtitles outside of the UK DVD. While there's still no sign of a US theatrical or DVD release, the Miramax version of Asterix et Obelix: Mission Cleopatre is also on that DVD (and has played on UK TV), and you'll never guess what - it's bee

Below we used a threshold of >= 2.0 to determine if the overall sentiment is positive else negative. You can choose your own threshold based on analyzing your own corpora in the future.

In [16]:
sentiment_polarity = [afn.score(review) for review in test_reviews]
predicted_sentiments = ['positive' if score >= 2.0 else 'negative' for score in sentiment_polarity]

Now that we have our predicted sentiment labels, we can evaluate our model performance based on standard performance metrics using our utility function.

In [17]:
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  target_names=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy:  72.42% 
Precision: 73.20% 
Recall:    72.42% 
F1 Score:  72.15% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.77      0.63      0.69      4959
    negative       0.69      0.82      0.75      5041

   micro avg       0.72      0.72      0.72     10000
   macro avg       0.73      0.72      0.72     10000
weighted avg       0.73      0.72      0.72     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       3111     1848
        negative        910     4131


We get an overall F1-Score of 72%, which is quite decent considering it's an unsupervised model. Looking at the confusion matrix we can clearly see that quite a number of positive sentiment based reviews have been misclassified as negative (1,848) and this leads to the lower recall of 63% for the positive sentiment class. Performance for negative class is better with regard to recall or f1-score, where we correctly predicted 4,131 out of 5,041 negative reviews, but precision is 69% because of the many wrong negative predictions made in case of positive sentiment reviews.

### Sentiment Analysis with SentiWordNet
The WordNet corpus is definitely one of the most popular corpora for the English language used extensively in natural language processing and semantic analysis. WordNet gave us the concept of ***synsets*** or ***synonym sets***. The SentiWordNet lexicon is based on WordNet synsets and can be used for sentiment analysis and opinion mining. The [SentiWordNet](http://sentiwordnet.isti.cnr.it) lexicon typically assigns three sentiment scores for each WordNet synset. These include a positive polarity score, a negative polarity score and an objectivity score. We will be using the nltk library, which provides a Pythonic interface into [SentiWordNet](https://pt.coursera.org/lecture/text-mining-analytics/5-6-how-to-do-sentiment-analysis-with-sentiwordnet-5RwtX). Consider we have the adjective awesome. 
![image](https://player.slideplayer.com/11/3238511/data/images/img18.png)

In [18]:
awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())

Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0


Let's now build a generic function to extract and aggregate sentiment scores for a complete textual document based on matched synsets in that document. Our function basically takes in a movie review, tags each word with its corresponding POS tag, extracts out sentiment scores for any matched synset token based on its POS tag, and finally aggregates the scores. We can clearly see the predicted sentiment along with sentiment polarity scores and an objectivity score for each sample movie review depicted in formatted dataframes. 

In [19]:
def analyze_sentiment_sentiwordnet_lexicon(review, verbose=False):

    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0.05 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score, 
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                             ['Predicted Sentiment', 'Objectivity',
                                                              'Positive', 'Negative', 'Overall']], 
                                                             labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        display(sentiment_frame)
        
    return final_sentiment

Let's use this model now to predict the sentiment of samples reviews and compare their results with its actual values.

In [20]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:\n', review)
    print('\nActual Sentiment:', sentiment)
    pred = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)    

REVIEW:
 For those not in the know, the Asterix books are a hugely successful series of comic books about a village of indomitable Gauls who resist Caesar's invasion thanks to a magic potion that renders them invulnerable supermen. There have been several animated features (only one of them, The Twelve Tasks of Asterix really capturing the wit and spirit of the books despite being an original screen story) before a perfectly cast Christian Clavier and Gerard Depardieu took the lead roles in two live action adaptations that proved colossally successful throughout Europe but made no impression whatsoever in the English-speaking world. <br /><br />The uncut French version is great fun, but sadly does not appear to be available in a version with English subtitles outside of the UK DVD. While there's still no sign of a US theatrical or DVD release, the Miramax version of Asterix et Obelix: Mission Cleopatre is also on that DVD (and has played on UK TV), and you'll never guess what - it's be

Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Objectivity,Positive,Negative,Overall
0,negative,0.84,0.1,0.06,0.04


REVIEW:
 Horrendously acted and completely laughable haunted-house horror flick that has an out of place Anna Paquin playing a neurotic teenager fighting off the "things-that-go-bump-in-the-dark" that are plaguing her and her family shortly after moving to their new home in Spain(?!). Little more than a geographically re-planted rip-off of "The Shining" and most notably "The Others", the weak-plotted "Darkness" is basically your typical run-of-the mill B-horror feature with a few predictable lame scares that can be seen by audiences a mile off (so to speak)! In retrospect I suppose I shouldn't have set my personal expectations quite as high for this movie to actually be good considering the well-known fact that it was shelved for nearly three years before finally being released around Christmas of last year in American cinemas across the country to what was ultimately lukewarm ticket-sales and very harsh reviews from critics. When will filmmakers ever learn that there's more to making 

Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Objectivity,Positive,Negative,Overall
0,negative,0.84,0.09,0.07,0.02


REVIEW:
 This is the only David Zucker movie that does not spoof anything the first of its kind. The funniest movie of 98 with Night at the Roxbury right behind But I did not think Theres something about mary was funny so that doesnt count except for the frank and beans thing he he. Dont listen to the critics especially Roger Ebert he does not know solid entertainment just look at his reviews.Anyway see it you wont be dissapionted

Actual Sentiment: positive


Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Objectivity,Positive,Negative,Overall
0,negative,0.85,0.08,0.07,0.01


Let's use this model now to predict the sentiment of all our test reviews and evaluate its performance. A threshold of >=0 has been used for the overall sentiment polarity to be classified as positive and < 0 for negative sentiment.

In [21]:
predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in norm_test_reviews]

display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  target_names=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy:  60.03% 
Precision: 70.24% 
Recall:    60.03% 
F1 Score:  54.51% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.56      0.95      0.70      4959
    negative       0.85      0.25      0.39      5041

   micro avg       0.60      0.60      0.60     10000
   macro avg       0.70      0.60      0.55     10000
weighted avg       0.70      0.60      0.55     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       4726      233
        negative       3764     1277


We get an overall F1-Score of 55%, which is definitely a step down from our AFINN based model. While we have lesser number of negative sentiment based reviews being misclassified as positive, the other aspects of the model performance have been affected.

### Sentiment Analysis with VADER
The [VADER lexicon](https://www.researchgate.net/publication/275828927_VADER_A_Parsimonious_Rule-based_Model_for_Sentiment_Analysis_of_Social_Media_Text), developed by C.J. Hutto, is a lexicon that is based on a rule-based sentiment analysis framework, specifically tuned to analyze sentiments in social media. VADER stands for Valence Aware Dictionary and Sentiment Reasoner. You can use the library based on nltk's interface under the nltk.sentiment.vader module. Besides this, you can also [download the actual lexicon or install the framework](https://github.com/cjhutto/
vaderSentiment). The file titled vader_lexicon.txt contains necessary sentiment scores associated with words, emoticons and slangs (like wtf, lol, nah, and so on). There were a total of over 9,000 lexical features from which over 7,500 curated lexical features were finally selected in the lexicon with proper validated valence scores. Each feature was rated on a scale from "[-4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". The process of selecting lexical features was done by keeping all features that had a non-zero mean rating and whose standard deviation was less than 2.5, which was determined by the aggregate of ten independent raters. 
![image](https://image.slidesharecdn.com/capstoneprojectgadatasciencelinm-160512021206/95/sentiment-analysis-of-airline-tweets-15-638.jpg)

Now let's use VADER to analyze our movie reviews! We build our own modeling function as follows. In our modeling function, we do some basic pre-processing but keep the punctuations and emoticons intact. Besides this, we use VADER to get the sentiment polarity and also proportion of the review text with regard to positive, neutral and negative sentiment. We also predict the final sentiment based on a user-input threshold for the aggregated sentiment polarity.

In [22]:
def analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=False):
    # pre-process text
    review = strip_html_tags(review)
    review = remove_accented_chars(review)
    review = expand_contractions(review)
    
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
        # display detailed sentiment statistics
        positive = str(round(scores['pos'], 2)*100)+'%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive, negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative', 'Neutral']], 
                                                              labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        display(sentiment_frame)
    
    return final_sentiment

Let's see how our model classify our samples and compare with their actual values. Typically, VADER recommends using positive sentiment for aggregated polarity >= 0.5, neutral between [-0.5, 0.5], and negative for polarity < -0.5. We use a threshold of >= 0.4 for positive and < 0.4 for negative in our corpus. The following is the analysis of our sample reviews.

In [23]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=True)    

REVIEW: For those not in the know, the Asterix books are a hugely successful series of comic books about a village of indomitable Gauls who resist Caesar's invasion thanks to a magic potion that renders them invulnerable supermen. There have been several animated features (only one of them, The Twelve Tasks of Asterix really capturing the wit and spirit of the books despite being an original screen story) before a perfectly cast Christian Clavier and Gerard Depardieu took the lead roles in two live action adaptations that proved colossally successful throughout Europe but made no impression whatsoever in the English-speaking world. <br /><br />The uncut French version is great fun, but sadly does not appear to be available in a version with English subtitles outside of the UK DVD. While there's still no sign of a US theatrical or DVD release, the Miramax version of Asterix et Obelix: Mission Cleopatre is also on that DVD (and has played on UK TV), and you'll never guess what - it's bee

Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Polarity Score,Positive,Negative,Neutral
0,positive,0.51,11.0%,11.0%,78.0%


REVIEW: Horrendously acted and completely laughable haunted-house horror flick that has an out of place Anna Paquin playing a neurotic teenager fighting off the "things-that-go-bump-in-the-dark" that are plaguing her and her family shortly after moving to their new home in Spain(?!). Little more than a geographically re-planted rip-off of "The Shining" and most notably "The Others", the weak-plotted "Darkness" is basically your typical run-of-the mill B-horror feature with a few predictable lame scares that can be seen by audiences a mile off (so to speak)! In retrospect I suppose I shouldn't have set my personal expectations quite as high for this movie to actually be good considering the well-known fact that it was shelved for nearly three years before finally being released around Christmas of last year in American cinemas across the country to what was ultimately lukewarm ticket-sales and very harsh reviews from critics. When will filmmakers ever learn that there's more to making m

Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Polarity Score,Positive,Negative,Neutral
0,negative,-0.96,5.0%,14.000000000000002%,81.0%


REVIEW: This is the only David Zucker movie that does not spoof anything the first of its kind. The funniest movie of 98 with Night at the Roxbury right behind But I did not think Theres something about mary was funny so that doesnt count except for the frank and beans thing he he. Dont listen to the critics especially Roger Ebert he does not know solid entertainment just look at his reviews.Anyway see it you wont be dissapionted
Actual Sentiment: positive


Unnamed: 0_level_0,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:,SENTIMENT STATS:
Unnamed: 0_level_1,Predicted Sentiment,Polarity Score,Positive,Negative,Neutral
0,positive,0.71,11.0%,7.000000000000001%,82.0%


Let's try out our model on the complete test movie review corpus now and evaluate the model performance.

In [24]:
predicted_sentiments = [analyze_sentiment_vader_lexicon(review, threshold=0.5, verbose=False) for review in test_reviews]

display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  target_names=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy:  72.07% 
Precision: 72.79% 
Recall:    72.07% 
F1 Score:  71.81% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.77      0.63      0.69      4959
    negative       0.69      0.81      0.75      5041

   micro avg       0.72      0.72      0.72     10000
   macro avg       0.73      0.72      0.72     10000
weighted avg       0.73      0.72      0.72     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       3106     1853
        negative        940     4101


We get an overall F1-Score and model accuracy of 72%, which is quite similar to the AFINN based model. The AFINN based model only wins for very little, both models have a similar performance.

## Classifying Sentiment with Supervised Learning

__Introduction:__

We will be building an automated sentiment text classification system in subsequent sections. The major steps to achieve this are mentioned as follows.
1. Prepare train and test datasets (optionally a validation dataset)
2. Pre-process and normalize text documents
3. Feature engineering
4. Model training
5. Model prediction and evaluation

![image](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-1-4842-2388-8_4/MediaObjects/427287_1_En_4_Fig2_HTML.jpg)<center>Blueprint for building an automated text classification system (Source: Text Analytics with Python, Apress 2016)</center>

In our scenario, documents indicate the movie reviews and classes indicate the review sentiments that can either be positive or negative, making it a binary classification problem. 

### Feature Engineering
Our feature engineering techniques will be based on the Bag of Words model and the TF-IDF model.

The ***bag-of-words model*** is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier. The core principle is to convert text documents into numeric vectors. The dimension or size of each vector is N where N indicates all possible distinct words across the corpus of documents. Each document once transformed is a numeric vector of size N where the values or weights in the vector indicate the frequency of each word in that specific document. Hence the name bag of words because this model represents unstructured text into a bag of words without taking into account word positions, syntax, or semantics.
![image](https://i1.wp.com/datameetsmedia.com/wp-content/uploads/2017/05/bagofwords.004.jpeg?resize=800%2C203)

There are some potential problems which might arise with the Bag of Words model when it is used on large corpora. Since the feature vectors are based on absolute term frequencies, there might be some terms which occur frequently across all documents and these will tend to overshadow other terms in the feature set. The ***[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) model*** tries to combat this issue by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, which uses a combination of two metrics in its computation, namely: term frequency (tf) and inverse document frequency (idf). This technique was developed for ranking results for queries in search engines and now it is an indispensable model in the world of information retrieval and text analytics.
![image](https://skymind.ai/images/wiki/tfidf.png?resize=40%2C20)

In [25]:
# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)

# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2), sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)


# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)

print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (40000, 2340032)  Test features shape: (10000, 2340032)
TFIDF model:> Train features shape: (40000, 2340032)  Test features shape: (10000, 2340032)


### Traditional Supervised Machine Learning Models

We can now use some traditional supervised Machine Learning algorithms which work very well on text classification. We recommend using logistic regression, support vector machines, and multinomial Naïve Bayes models

#### Model Training
The logistic regression is intended for binary (two-class) classification problems, where it will predict the probability of an instance belonging to the default class, which can be snapped into a 0 or 1 classification. In this case, we try to predict the probability that a given movie review will belong to one of the discrete classes.

**<center>P(X) = P(Y=1|X)</center>**

In [26]:
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
svm = SGDClassifier(loss='hinge', l1_ratio=0.15, max_iter=300, n_jobs=4, random_state=101)

#### Prediction and Performance Evaluation
We will now use our utility function train_predict_model(...) to build a logistic regression model on our training features and evaluate the model performance on our test features.

In [27]:
# Logistic Regression model on BOW features
lr_bow_predictions = train_predict_model(classifier=lr, 
                                         train_features=cv_train_features, train_labels=train_sentiments,
                                         test_features=cv_test_features, test_labels=test_sentiments)
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_predictions,
                                  target_names=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy:  90.79% 
Precision: 90.79% 
Recall:    90.79% 
F1 Score:  90.79% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.91      0.91      0.91      4959
    negative       0.91      0.91      0.91      5041

   micro avg       0.91      0.91      0.91     10000
   macro avg       0.91      0.91      0.91     10000
weighted avg       0.91      0.91      0.91     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       4493      466
        negative        455     4586


We get all metrics as **91%**, which is really excellent! 

We can now build a logistic regression model similarly on our TF-IDF features and see if we can get better results.

In [28]:
# Logistic Regression model on TF-IDF features
lr_tfidf_predictions = train_predict_model(classifier=lr, 
                                           train_features=tv_train_features, train_labels=train_sentiments,
                                           test_features=tv_test_features, test_labels=test_sentiments)
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_tfidf_predictions,
                                  target_names=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy:  89.94% 
Precision: 89.94% 
Recall:    89.94% 
F1 Score:  89.94% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.90      0.89      0.90      4959
    negative       0.90      0.91      0.90      5041

   micro avg       0.90      0.90      0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       4427      532
        negative        474     4567


As you can see we get all metrics close to 90%, which which is great but our previous model is still slightly better.

Let's see if we can do better with SVM:

In [29]:
svm_bow_predictions = train_predict_model(classifier=svm, 
                                          train_features=cv_train_features, train_labels=train_sentiments,
                                          test_features=cv_test_features, test_labels=test_sentiments)
print('SVM results with Bow:')
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_bow_predictions,
                                 target_names=['positive', 'negative'])


svm_tfidf_predictions = train_predict_model(classifier=svm, 
                                            train_features=tv_train_features, train_labels=train_sentiments,
                                            test_features=tv_test_features, test_labels=test_sentiments)
print('-'*60)
print('\nSVM results with TF-IDF:')
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_tfidf_predictions,
                                  target_names=['positive', 'negative'])

SVM results with Bow:
Model Performance metrics:
------------------------------
Accuracy:  90.86% 
Precision: 90.87% 
Recall:    90.86% 
F1 Score:  90.86% 

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.91      0.90      0.91      4959
    negative       0.90      0.92      0.91      5041

   micro avg       0.91      0.91      0.91     10000
   macro avg       0.91      0.91      0.91     10000
weighted avg       0.91      0.91      0.91     10000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       4471      488
        negative        426     4615
--------------------------------------------------------------------------------

SVM results with TF-IDF:
Model Performance metrics:
------------------------------
Accuracy:  90.24% 
Precision: 90.26% 
Recall:    90.24% 
F1 Score:  90.24% 

Model

As you can see, again we obtened all scores close to 90%. Let's see if we can improve with apply a DNN model.

### Newer Supervised Deep Learning Models
In this section, we will be building some deep neural networks and train them on some advanced text features based on word embeddings to build a text sentiment classification system.

[![image](http://159.89.224.205/wp-content/uploads/2016/07/tumblr_inline_oabas5sThb1sleek4_540.png)](http://blog.aylien.com/leveraging-deep-learning-for-multilingual/)

#### Prediction class label encoding

The following snippet helps us tokenize our movie reviews and also converts the text-based sentiment class labels into one-hot encoded vectors.

In [30]:
le = LabelEncoder()
num_classes=2  # positive -> 1, negative -> 0

# tokenize train reviews & encode train labels
tokenized_train = [tokenizer.tokenize(text) for text in norm_train_reviews]
y_tr = le.fit_transform(train_sentiments)
y_train = keras.utils.to_categorical(y_tr, num_classes)

# tokenize test reviews & encode test labels
tokenized_test = [tokenizer.tokenize(text) for text in norm_test_reviews]
y_ts = le.fit_transform(test_sentiments)
y_test = keras.utils.to_categorical(y_ts, num_classes)

# print class label encoding map and encoded labels
print('Sentiment class label map:', dict(zip(le.classes_, le.transform(le.classes_))))
print('Sample test label transformation:\n'+'-'*35,
      '\nActual Labels:', test_sentiments[:3], '\nEncoded Labels:', y_ts[:3], 
      '\nOne hot encoded Labels:\n', y_test[:3])

Sentiment class label map: {'negative': 0, 'positive': 1}
Sample test label transformation:
----------------------------------- 
Actual Labels: ['positive' 'positive' 'positive'] 
Encoded Labels: [1 1 1] 
One hot encoded Labels:
 [[0. 1.]
 [0. 1.]
 [0. 1.]]


Thus, we can see from the preceding sample outputs how our sentiment class labels have been encoded into numeric representations, which in turn have been converted into one-hot encoded vectors. 

#### Feature Engineering with word embeddings
Basically, ***word embeddings*** can be used for **feature extraction** and **language modeling**. This representation tries to map each word or phrase into a complete numeric vector such that semantically similar words or terms tend to occur closer to each other and these can be quantified using these embeddings. 

The ***word2vec model*** was built by Google is perhaps one of the most popular neural network based probabilistic language models and can be used to learn distributed representational vectors for words. Word embeddings produced by word2vec involve taking in a corpus of text documents, representing words in a large high dimensional vector space such that each word has a corresponding vector in that space and similar words (even semantically) are located close to one another.

We will be using the gensim framework to implement the same model of word2vec created by Google, on our corpus to extract features. Some of the important parameters in the model are explained briefly as follows:
- **size**: Represents the feature vector size for each word in the corpus when transformed.
- **window**: Sets the context window size specifying the length of the window of words to be taken into account as belonging to a single, similar context when training.
- **min_count**: Specifies the minimum word frequency value needed across the corpus to consider the word as a part of the final vocabulary during training the model.
- **sample**: Used to downsample the effects of words which occur very frequently.

In [31]:
# build word2vec model
w2v_num_features = 500
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150, min_count=10, sample=1e-3)    

Each word in the corpus with at least 10 counts will essentially now be a vector itself of size 500. 

A question might arise in your mind now that so far, we had feature vectors for each complete document, but now we have vectors for each word. How do we represent entire documents now? We can do that using various aggregation and combinations. A simple scheme would be to use an averaged word vector representation, where we simply sum all the word vectors occurring in a document and then divide by the count of word vectors to represent an averaged word vector for the document. The following code enables us to do the same.

In [32]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

We can now use the previous function to generate averaged word vector representations on our two movie review datasets.

In [None]:
# generate averaged word vector features from word2vec model
avg_wv_train_features = averaged_word2vec_vectorizer(corpus=tokenized_train, model=w2v_model, num_features=w2v_num_features)
avg_wv_test_features = averaged_word2vec_vectorizer(corpus=tokenized_test, model=w2v_model, num_features=w2v_num_features)

We complete our generate embeddings by using the Global Vectors for Word Representation (GloVe) models, is an unsupervised model for obtaining word vector representations. Created at [Stanford University](https://nlp.stanford.edu/pubs/glove.pdf#_blank), this model is trained on various corpora like Wikipedia, Common Crawl, and Twitter and corresponding pre-trained word vectors are available that can be used for our analysis needs. 

The spacy library provided 384-dimensional word vectors trained on the Common Crawl corpus using the GloVe model. They provide a simple standard interface to get feature vectors of size 384 for each word as well as the averaged feature vector of a complete text document. 

Check on the [GloVe project site](https://nlp.stanford.edu/projects/glove) others pre-trained models and examples.

In [None]:
# feature engineering with GloVe model
train_nlp = [nlp(item) for item in norm_train_reviews]
train_glove_features = np.array([item.vector for item in train_nlp])

test_nlp = [nlp(item) for item in norm_test_reviews]
test_glove_features = np.array([item.vector for item in test_nlp])

print('Word2Vec model:> Train features shape:', avg_wv_train_features.shape, ' Test features shape:', avg_wv_test_features.shape)
print('GloVe model:> Train features shape:', train_glove_features.shape, ' Test features shape:', test_glove_features.shape)

#### Modeling with deep neural networks 
[![image](https://www.mdpi.com/algorithms/algorithms-09-00041/article_deploy/html/images/algorithms-09-00041-g002.png)](https://www.mdpi.com/1999-4893/9/2/41/htm)
##### Building Deep neural network architecture
We will be using a fully-connected four layer deep neural network (multi-layer perceptron or deep ANN) for our model. We call this a fully connected deep neural network (DNN) because neurons or units in each pair of adjacent layers are fully pairwise connected. These networks are also known as deep artificial neural networks (ANNs) or Multi-Layer Perceptrons (MLPs) since they have more than one hidden layer. The following function leverages keras on top of tensorflow to build the desired DNN model. We build a Sequential model, which helps us linearly stack our hidden and output layers.

In [None]:
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(1024, activation='relu', input_shape=(num_input_features,)))
    dnn_model.add(Dropout(0.5))
    dnn_model.add(Dense(1024, activation='relu'))
    dnn_model.add(Dropout(0.5))
    dnn_model.add(Dense(512, activation='relu'))
    dnn_model.add(Dropout(0.2))
    dnn_model.add(Dense(2, activation='softmax'))

    dnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return dnn_model

w2v_dnn = construct_deepnn_architecture(num_input_features=500)

We do not count the input layer usually in any deep architecture, hence our model will consist of three hidden layers of 512 neurons or units and one output layer with two units that will be used to either predict a positive or negative sentiment based on the input layer features.

We use 1024, 1024 and 512 units for our hidden layers respectively and the **activation function relu** indicates a ***rectified linear unit***.  This function tries to solve the ***vanishing gradient problem***. This problem occurs when x > 0 and as x increases, where x is typically the input to a neuron, the gradient from sigmoids becomes really small (almost vanishing) but relu prevents this from happening. Besides this, it also ***helps with faster convergence of gradient descent***. 

We also use regularization in the network in the form of ***Dropout layers***. By adding a **dropout rates of 0.5 and 0.2**, it randomly sets 50% and 20% of the input feature units to 0 at each update during training the model. This form of regularization ***helps prevent overfitting the model***.

The final output layer consists of two units with a ***softmax activation function***. The softmax function is basically a generalization of the **logistic function**, which can be used to represent a probability distribution over n possible class outcomes. In our case n = 2 where the class can either be positive or negative and the softmax probabilities will help us determine the same. 

The compile(...) method is used to configure the learning or training process of the DNN model before we actually train it. This involves providing a [***cost*** or ***loss function***](https://keras.io/losses/) in the loss parameter. This will be the goal or objective which the model will try to minimize. 

We will be using ***binary_crossentropy***, which helps us minimize the error or loss from the softmax output. We need an optimizer for helping us converge our model and minimize the loss or error function. Gradient descent or stochastic gradient descent is a popular optimizer. We will be using the ***rmsprop optimizer***, other option is [adam](https://arxiv.org/pdf/1412.6980v8.pdf ) also uses momentum where basically each update is based on not only the gradient computation of the current point but also includes a fraction of the previous update. This helps with faster convergence. 

Let's visualize our deep architecture:

In [None]:
SVG(model_to_dot(w2v_dnn, show_shapes=True, show_layer_names=False, rankdir='TB').create(prog='dot', format='svg'))

##### Model Training, Prediction and Performance Evaluation
We will be using the fit(...) function from keras for the training process and there are some parameters which you should be aware of:
- **epoch**: indicates one complete forward and backward pass of all the training examples through the network. 
- **batch_size**: indicates the total number of samples which are propagated through the DNN model at a time for one backward and forward pass for training the model and updating the gradient. 
- **validation_split**: we use 0.15 to extract 15% of the training data and use it as a validation dataset for evaluating the performance at each epoch. 
- **shuffle**: helps shuffle the samples in each epoch when training the model. 

In [None]:
batch_size = 64
w2v_dnn.fit(avg_wv_train_features, y_train, epochs=15, batch_size=batch_size, shuffle=True, 
            validation_split=0.15, verbose=2)

Let's evaluate our model performance on the test review word2vec features:

In [None]:
y_pred = w2v_dnn.predict_classes(avg_wv_test_features)
predictions = le.inverse_transform(y_pred) 
print('-'*60)
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions, 
                                  target_names=['positive', 'negative'])  

The results is close to 88% in all metrics. As you see, the results were improve with the increase of epochs, but you need take care with the overfitting, maybe need try others DNN configurations to you can get better results! 

Let's see how our DNN model perform with our GloVe based features: 

In [None]:
glove_dnn = construct_deepnn_architecture(num_input_features=384)

batch_size = 64
glove_dnn.fit(train_glove_features, y_train, epochs=15, batch_size=batch_size, shuffle=True, 
              validation_split=0.15, verbose=2)

As expected, this model was somewhat lower given the reduction of inputs, however it seems more resistant to overfitting, it allow us to observe the potential in using pre-trained models. 

Let's take a look at the behavior of this model versus the test base:

In [None]:
y_pred = glove_dnn.predict_classes(test_glove_features)
predictions = le.inverse_transform(y_pred) 
print('-'*80)
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions, 
                                  target_names=['positive', 'negative'])  

## Advanced Supervised Deep Learning Models

In this section we will use a more advanced models than your regular fully connected deep networks, a recurrent neural networks (RNNs) and long short term memory networks (LSTMs) which also considers the sequence of data (words,events, and so on). More over the Bidirectional lstms keep the contextual information in both directions.

![image](http://thelillysblog.com/images/architecture-nn2.jpg)

### Preparing data

We will start with the procedures to preparing the data for our needs on the RNN and LSTM.

#### Tokenize train & test datasets

The following snippet helps us tokenize our movie reviews.

In [None]:
tokenized_train = [tokenizer.tokenize(text) for text in norm_train_reviews]
tokenized_test = [tokenizer.tokenize(text) for text in norm_test_reviews]

#### Build Vocabulary Mapping (word to index)
For feature engineering, we will be creating word embeddings. Word embeddings tend to vectorize text documents into fixed sized vectors such that these vectors try to capture contextual and semantic information.

For generating embeddings, we will use the Embedding layer from keras, which requires documents to be represented as tokenized and numeric vectors. We already have tokenized text vectors, so we would need to convert them into numeric representations. Besides this, we would also need the vectors to be of uniform size even though the tokenized text reviews will be of variable length due to the difference in number of tokens in each review. For this, one strategy could be to take the length of the longest review (with maximum number of tokens\words) and set it as the vector size, let's call this max_len. Reviews of shorter length can be padded with a PAD term in the beginning to increase their length to max_len.

We would need to create a word to index vocabulary mapping for representing each tokenized text review in a numeric form. Do note you would also need to create a numeric mapping for the padding term which we shall call PAD_INDEX and assign it the numeric index of 0. For unknown terms, in case they are encountered later on in the test dataset or newer, previously unseen reviews, we would need to assign it to some index too. This would be because we will vectorize, engineer features, and build models only on the training data. Hence, if some new term should come up in the future, we will consider it as an out of vocabulary (OOV) term and assign it to a constant index.

In [None]:
# build word to index vocabulary
token_counter = Counter([token for review in tokenized_train for token in review])
vocab_map = {item[0]: index+1 for index, item in enumerate(dict(token_counter).items())}
max_index = np.max(list(vocab_map.values()))
vocab_map['PAD_INDEX'] = 0
vocab_map['NOT_FOUND_INDEX'] = max_index+1
vocab_size = len(vocab_map)

# view vocabulary size and part of the vocabulary map
print('Vocabulary Size:', vocab_size)
print('Sample slice of vocabulary map:\n', dict(list(vocab_map.items())[10:20]))

You may notice that we have used all the terms found in training dataset in our vocabulary. As alternative, you can easily filter and use more relevant terms here, based on their frequency, by using the most_common(count) function from Counter and taking the first count terms from the list of unique terms in the training corpus.

#### Encode and Pad datasets & Encode prediction class labels

The following snippet helps us encode and pad our movie reviews encode the tokenized text reviews based on the previous vocab_map. Also converts the text-based sentiment class labels into one-hot encoded vectors.

In [None]:
# get max length of train corpus and initialize label encoder
le = LabelEncoder()
num_classes=2 # positive -> 1, negative -> 0
max_len = np.max([len(review) for review in tokenized_train])

## Train reviews data corpus
# Convert tokenized text reviews to numeric vectors
train_X = [[vocab_map[token] for token in tokenized_review] for tokenized_review in tokenized_train]
train_X = sequence.pad_sequences(train_X, maxlen=max_len) # pad 
## Train prediction class labels
# Convert text sentiment labels (negative\positive) to binary encodings (0/1)
train_y = le.fit_transform(train_sentiments)

## Test reviews data corpus
# Convert tokenized text reviews to numeric vectors
test_X = [[vocab_map[token] if vocab_map.get(token) else vocab_map['NOT_FOUND_INDEX'] 
           for token in tokenized_review] 
              for tokenized_review in tokenized_test]
test_X = sequence.pad_sequences(test_X, maxlen=max_len)
## Test prediction class labels
# Convert text sentiment labels (negative\positive) to binary encodings (0/1)
test_y = le.transform(test_sentiments)

# view vector shapes
print('Max length of train review vectors:', max_len)
print('Train review vectors shape:', train_X.shape, ' Test review vectors shape:', test_X.shape)

### Build the LSTM Model Architecture
Let's introducing the ***Embedding layer*** and coupling it with the deep network architecture based on ***LSTMs***.

The **Embedding layer** helps us generate the word embeddings from scratch. This layer is also initialized with some weights initially and this gets updated based on our optimizer similar to weights on the neuron units in other layers when the network tries to minimize the loss in each epoch. Thus, the embedding layer tries to optimize its weights such that we get the best word embeddings which will generate minimum error in the model and also capture semantic similarity and relationships among words.

**LSTMs** basically try to overcome the shortcomings of RNN models especially with regard to handling long term dependencies and problems which occur when the weight matrix associated with the units/neurons become too small,***leading to vanishing gradient***, or too large, ***leading to exploding gradient***. The RNN units usually have a chain of repeating modules such that the module has a simple structure of having maybe one layer with the tanh activation. LSTMs are also a special type of RNN, having a similar structure but the LSTM unit has four neural network layers instead of just one. A **Bidirectional LSTM Layer** connects two hidden layers of opposite directions to the same output.
![image](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

In the diagram below, the notation t indicates one time step, C depicts the cell states, and h indicates the hidden states. The gates i, f, o and c̅ help in removing or adding information to the cell state. The gates i, f and o represent the input, output and forget gates respectively and each of them are modulated by the sigmoid layer which outputs numbers from 0 to 1 controlling how much of the output from these gates should pass. Thus this helps is protecting and controlling the cell state.
![image](https://i.stack.imgur.com/aTDpS.png)
For a detailed work flow of how information flows through the LSTM cell consult the [Christopher Olah’s blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

The final layer in our deep network is the Dense layer with 1 unit and the sigmoid activation function. We basically use the binary_crossentropy function with the adam optimizer since this is a binary classification problem and the model will ultimately predict a 0 or a 1, which we can decode back to a negative or positive sentiment prediction with our label encoder. 

In [None]:
EMBEDDING_DIM = 128 # dimension for dense embeddings for each token
LSTM_DIM = 64 # total LSTM units

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, input_length=max_len))
model.add(SpatialDropout1D(0.3))
#model.add(LSTM(LSTM_DIM, dropout=0.2, recurrent_dropout=0.3))
model.add( Bidirectional( LSTM(lstm_out = 196, dropout_U = 0.2, dropout_W = 0.2)))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

#### Visualize model architecture

In [None]:
print(model.summary())
SVG(model_to_dot(model, show_shapes=True, show_layer_names=False, rankdir='TB').create(prog='dot', format='svg'))

### Train the model
Training LSTMs on CPU is notoriously slow. Of course, a  GPU based Deep Learning environment or a cloud-based environment, like Google Cloud Platform or AWS on GPU, took approximately at least less than four times to train the same model. So I would recommend you choose GPU environment, especially when working with RNNs or LSTM based network architectures.

In [None]:
batch_size = 100
model.fit(train_X, train_y, epochs=5, batch_size=batch_size, shuffle=True, validation_split=0.1, verbose=2)

Based on the preceding output, we can see that just with five epochs we have decent validation accuracy, but like before, validation accuracy was nor better and the training accuracy starts shooting up indicating some over-fitting might be happening. Ways to overcome this include adding more data or by increasing the drouput rate. 

### Predict and Evaluate Model Performance

In [None]:
pred_test = model.predict_classes(test_X)
predictions = le.inverse_transform(pred_test.flatten())

display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions, 
                                  target_names=['positive', 'negative'])  

Like in the other deep learning architecture we get close to 88% at all metrics, which is quite good! With more quality data, you can expect to get even better results. Try experimenting with different architectures and see if you get better results!

## Analyzing Sentiment Causation

Business and key stakeholders often perceive Machine Learning models as complex black boxes and poses the question, why should I trust your model? Explaining to them complex mathematical or theoretical concepts doesn't serve the purpose. Is there some way in which we can explain these models in an easy-to-interpret manner?

[![image](https://raw.githubusercontent.com/marcotcr/lime/master/doc/images/video_screenshot.png)](https://www.youtube.com/watch?v=hUnRCxnydCc)

In the analyze sentiment causation, the main idea is to determine the root cause or key factors causing positive or negative sentiment. The first area of focus will be model interpretation, where we will try to understand, interpret, and explain the mechanics behind predictions made by our classification models. The second area of focus is to apply topic modeling and extract key topics from positive and negative sentiment reviews.

### Build Text Classification Pipeline with The Best Model
Let's first build a basic text classification pipeline for the model that worked best for us so far. This is the Logistic Regression model based on the Bag of Words feature model. We will leverage the pipeline module from scikit-learn to build this Machine Learning pipeline using the following code.

In [None]:
# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)

# build Logistic Regression model
lr = LogisticRegression()
lr.fit(cv_train_features, train_sentiments)

# Build Text Classification Pipeline
lr_pipeline = make_pipeline(cv, lr)

# save the list of prediction classes (positive, negative)
classes = list(lr_pipeline.classes_)

We build our model based on norm_train_reviews, which contains the normalized training reviews that we have used in all our earlier analyses. Now that we have our classification pipeline ready, you can actually deploy the model by using pickle or joblib to save the classifier and feature objects 

### Interpreting Predictive Models
There are various ways to interpret the predictions made by our predictive sentiment classification models. We want to understand more into why a positive review was correctly predicted as having positive sentiment or a negative review having negative sentiment. Besides this, no model is a 100% accurate always, so we would also want to understand the reason for mis-classifications or wrong predictions. 

#### Analyze Model Prediction Probabilities
Assuming our pipeline is in production, how do we use it for new movie reviews? Let's try to predict the sentiment for two new sample reviews, which were not used in training the model:

In [None]:
lr_pipeline.predict(['the lord of the rings is an excellent movie', 'i hated the recent movie on tv, it was so bad'])

Our classification pipeline predicts the sentiment of both the reviews correctly! This is a good start, but how do we interpret the model predictions? One way is to typically use the model prediction class probabilities as a measure of confidence. You can use the following code to get the prediction probabilities for our sample reviews.

In [None]:
pd.DataFrame(lr_pipeline.predict_proba(['the lord of the rings is an excellent movie', 
                     'i hated the recent movie on tv, it was so bad']), columns=classes)

Thus we can say that the first movie review has a prediction confidence or probability of 83% to have positive sentiment as compared to the second movie review with a 73% probability to have negative sentiment. 

#### Interpreting Model Decisions
Besides prediction probabilities, we will be using the [skater framework](https://github.com/marcotcr/lime) for easy interpretation of the model decisions. First, to do this we define a helper function which takes in a document index, a corpus, its response predictions, and an explainer object and helps us with the our model interpretation analysis.

In [None]:
explainer = LimeTextExplainer(class_names=classes)
def interpret_classification_model_prediction(doc_index, norm_corpus, corpus, prediction_labels, explainer_obj):
    # display model prediction and actual sentiments
    print("Test document index: {index}\nActual sentiment: {actual}\nPredicted sentiment: {predicted}"
      .format(index=doc_index, actual=prediction_labels[doc_index],
              predicted=lr_pipeline.predict([norm_corpus[doc_index]])))
    # display actual review content
    print("\nReview:", corpus[doc_index])
    # display prediction probabilities
    print("\nModel Prediction Probabilities:")
    for probs in zip(classes, lr_pipeline.predict_proba([norm_corpus[doc_index]])[0]):
        print(probs)
    # display model prediction interpretation
    exp = explainer.explain_instance(norm_corpus[doc_index], 
                                     lr_pipeline.predict_proba, num_features=10, 
                                     labels=[1])
    exp.show_in_notebook()

The preceding snippet leverages skater to explain our text classifier to analyze its decision-making process in a global perspective. This is done by learning the model around the vicinity of the data point of interest X by sampling instances around X and assigning weightages based on their proximity to X. Thus, these locally learned linear models help in explaining complex models in a more easy to interpret way with class probabilities, contribution of top features to the class probabilities that aid in the decision making process. 
[![image](https://raw.githubusercontent.com/marcotcr/lime/master/doc/images/lime.png)](https://arxiv.org/pdf/1602.04938.pdf)

In [None]:
doc_index = 100 
interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
                                         corpus=test_reviews, prediction_labels=test_sentiments,
                                         explainer_obj=explainer)

The results show us the top 10 features and we can notice that our model performs quite well in this scenario. Besides this, the word great contributed the maximum to the positive probability of 0.16 and in fact if we had removed this word from our review text, the positive probability would have dropped significantly.

Let's see a positive classification case:

In [None]:
doc_index = 2000
interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
                                         corpus=test_reviews, prediction_labels=test_sentiments,
                                         explainer_obj=explainer)

Based on the content, the reviewer really liked this model and also it was a real cult classic among certain age groups. In our final analysis, we will look at the model interpretation of an example where the model makes a wrong prediction.

In [None]:
doc_index = 347 
interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
                                         corpus=test_reviews, prediction_labels=test_sentiments,
                                         explainer_obj=explainer)

The results tell us that the reviewer in fact shows signs of positive sentiment in the movie review, especially in parts where he\she tells us that “I loved it. I still think the directing and cinematography are excellent, as is the music... Alan Rickman is great, a bit old perhaps, but he plays the role beautifully. And Elizabeth Spriggs, she is absolutely fantastic as always.” and feature words from the same have been depicted in the top features contributing to positive sentiment. The model interpretation also correctly identifies the aspects of the review contributing to negative sentiment like, “But it's really the script that has over the time started to bother me more and more.”. Hence, this is one of the more complex reviews which indicate both positive and negative sentiment and the final interpretation would be in the reader's hands. 

### Analyzing Topic Models
The main aim of topic models is to extract and depict key topics or concepts which are otherwise latent and not very prominent in huge corpora of text documents. 

For do this we can use some topic modeling technique like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix
factorization. let's proceed with the second one.

#### Extract features from positive and negative reviews
The first step in this analysis is to combine all our normalized train and test reviews and separate out these reviews into positive and negative sentiment reviews. Once we do this, we will extract features from these two datasets using the TF-IDF feature vectorizer. 

In [None]:
# consolidate all normalized reviews
norm_reviews = norm_train_reviews+norm_test_reviews

# get tf-idf features for only positive reviews
positive_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'positive']
ptvf = TfidfVectorizer(use_idf=True, min_df=0.05, max_df=0.95, ngram_range=(1,1), sublinear_tf=True)
ptvf_features = ptvf.fit_transform(positive_reviews)

# get tf-idf features for only negative reviews
negative_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'negative']
ntvf = TfidfVectorizer(use_idf=True, min_df=0.05, max_df=0.95, ngram_range=(1,1), sublinear_tf=True)
ntvf_features = ntvf.fit_transform(negative_reviews)

# view feature set dimensions
print(ptvf_features.shape, ntvf_features.shape)

From the preceding output dimensions, you can see that we have filtered out a lot of the features we used previously when building our classification models by making min_df to be 0.05 and max_df to be 0.95. This is to speed up the topic modeling process and remove features that either occur too much or too rarely.

#### Topic Modeling on Reviews

The NMF class from scikit-learn will help us with topic modeling. We also use pyLDAvis for building interactive visualizations of topic models. The core principle behind Non-Negative Matrix Factorization (NNMF) is to apply matrix decomposition (similar to SVD) to a non-negative feature matrix X such that the decomposition can be represented as X ≈ WH where W & H are both non-negative matrices which if multiplied should approximately re-construct the feature matrix X. A cost function like L2 norm can be used for getting this approximation. Let’s now apply NNMF to get 15 topics from our positive sentiment reviews. We will also leverage the functions to display the results by topics in a clean format.

In [None]:
pyLDAvis.enable_notebook()
total_topics = 10

# build topic model on positive sentiment review features
pos_nmf = NMF(n_components=total_topics, random_state=101, alpha=0.1, l1_ratio=0.2)
pos_nmf.fit(ptvf_features)

# extract features and component weights
pos_feature_names = ptvf.get_feature_names()
pos_weights = pos_nmf.components_

# extract and display topics and their components
pos_topics = get_topics_terms_weights(pos_weights, pos_feature_names)
print_topics_udf(topics=pos_topics,
                 total_topics=total_topics,
                 num_terms=15,
                 display_weights=True)

#### Visualize topics for positive reviews

You can leverage [pyLDAvis](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) now to visualize these topics in an interactive visualization.

In [None]:
pyLDAvis.sklearn.prepare(pos_nmf, ptvf_features, ptvf, R=15)

From the topics and the terms, we can see terms like movie cast, actors, performance, play, characters, music, wonderful, good, and so on have contributed toward positive sentiment in various topics. This is quite interesting and gives you a good insight into the components of the reviews that contribute toward positive sentiment of the reviews. 

This visualization is completely interactive if you are using the jupyter notebook and you can click on any of the bubbles representing topics in the Intertopic Distance Map on the left and see the most relevant terms in each of the topics in the right bar chart.

The plot on the left is rendered using Multi-dimensional Scaling (MDS). Similar topics should be close to one another and dissimilar topics should be far apart. The size of each topic bubble is based on the frequency of that topic and its components in the overall corpus.

The visualization on the right shows the top terms. When no topic it selected, it shows the top 15 most salient topics in the corpus. A term's saliency is defined as a measure of how frequently the term appears the corpus and its distinguishing factor when used to distinguish between topics. When some topic is selected, the chart changes to shows the top 15 most relevant terms for that topic. The relevancy metric is controlled by λ, which can be changed based on a slider on top of the bar chart.

#### Display and visualize topics for negative reviews

From the topics and the terms, we can see terms like waste, time, money, crap, plot, terrible, acting, and so on have contributed toward negative sentiment in various topics. Of course, there are high chances of overlap between topics from positive and negative sentiment reviews, but there will be distinguishable, distinct topics that further help us with interpretation and causal analysis.

In [None]:
# build topic model on negative sentiment review features
neg_nmf = NMF(n_components=total_topics, random_state=101, alpha=0.1, l1_ratio=0.2)
neg_nmf.fit(ntvf_features)      

# extract features and component weights
neg_feature_names = ntvf.get_feature_names()
neg_weights = neg_nmf.components_

# extract and display topics and their components
neg_topics = get_topics_terms_weights(neg_weights, neg_feature_names)
print_topics_udf(topics=neg_topics,
                 total_topics=total_topics,
                 num_terms=15,
                 display_weights=True) 

pyLDAvis.sklearn.prepare(neg_nmf, ntvf_features, ntvf, R=15)

In [None]:
!pip3 install jinja --user --upgrade