# Sentiment Analysis - Machine Learning and Basic Deep Neural Network Models

We have already discussed that sentiment analysis, also popularly known as opinion analysis or opinion mining, is one of the most important applications of NLP. The key idea is to predict the sentiment of a body of text from its content. In this sub-unit, we will explore supervised learning models.

Another way to build a model that understands text content and predicts the sentiment of text-based reviews is supervised machine learning. More specifically, we will use classification models to solve this problem, and we will build an automated sentiment classification system in the subsequent sections. The major steps are as follows.

1.	Prepare train and test datasets (optionally a validation dataset)
2.	Pre-process and normalize text documents
3.	Feature engineering
4.	Model training
5.	Model prediction and evaluation
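
The steps above can be sketched end to end on a handful of documents. This is only a minimal illustration, assuming scikit-learn is installed; the toy reviews, labels, and the names `toy_pipeline` and `toy_predictions` are hypothetical and are not part of the movie review dataset used later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Step 1: hypothetical toy train and test documents, for illustration only.
toy_train_docs = [
    "a wonderful and moving film",
    "great acting and a great story",
    "i loved every minute of it",
    "terrible plot and awful acting",
    "a boring waste of time",
    "i hated the ending",
]
toy_train_labels = ["positive", "positive", "positive",
                    "negative", "negative", "negative"]
toy_test_docs = ["a great and moving story", "awful and boring"]

# Steps 2 and 3: the vectorizer lowercases, tokenizes and builds TF-IDF features.
# Step 4: the classifier is trained on those features.
toy_pipeline = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("classifier", LogisticRegression(max_iter=500, random_state=42)),
])
toy_pipeline.fit(toy_train_docs, toy_train_labels)

# Step 5: predict on unseen documents.
toy_predictions = toy_pipeline.predict(toy_test_docs)
print(list(toy_predictions))
```

The rest of this sub-unit performs the same steps one at a time, which makes it easier to inspect the intermediate features and to swap models.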

In our scenario, the documents are movie reviews and the classes are the review sentiments, which can be either positive or negative, making this a binary classification problem. We will build models using both traditional machine learning methods and newer deep learning methods in the subsequent sections.

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/00/92/a05b76a692ac08d470ae5c23873cf1c9a041532f1ee065e74b374f218306/contractions-0.0.25-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done

True

# Load and View Dataset

In [None]:
import pandas as pd

dataset = pd.read_csv(r'movie_reviews.csv.bz2')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
dataset.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


# Build Train and Test Datasets

In [None]:
# build train and test datasets
reviews = dataset['review'].values
sentiments = dataset['sentiment'].values

train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]

test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]
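
The slices above keep the rows in their original order, so it can be worth checking that both splits contain a similar mix of classes. A minimal sketch of such a check follows; `toy_labels` is a hypothetical stand-in for the real `sentiments` array, and the 70/30 split point mirrors the 35,000 of 50,000 rows used for training.

```python
from collections import Counter

# Hypothetical toy labels standing in for the `sentiments` array.
toy_labels = ["positive", "negative"] * 5

# 70% of the rows go to training, the rest to testing (integer arithmetic).
split_point = len(toy_labels) * 7 // 10
toy_train_labels = toy_labels[:split_point]
toy_test_labels = toy_labels[split_point:]

# Count how many examples of each class ended up in each split.
train_counts = Counter(toy_train_labels)
test_counts = Counter(toy_test_labels)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

If the counts turn out to be heavily skewed, shuffling the rows before slicing is one common remedy.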

# Text Wrangling & Normalization

In [None]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata


def strip_html_tags(text):
  # parse the markup and drop embedded iframes and scripts before extracting text
  soup = BeautifulSoup(text, "html.parser")
  for tag in soup(['iframe', 'script']):
    tag.extract()
  stripped_text = soup.get_text()
  # collapse runs of line breaks into a single newline
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
  return stripped_text

def remove_accented_chars(text):
  # decompose accented characters and drop the non-ASCII combining marks
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm.tqdm(docs):
    doc = strip_html_tags(doc)
    # replace newlines, tabs and carriage returns with spaces
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = remove_accented_chars(doc)
    doc = contractions.fix(doc)
    # remove special characters; flags must be passed by keyword because the
    # fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # collapse repeated spaces and trim the ends
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()
    norm_docs.append(doc)

  return norm_docs
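
To see what this kind of normalization does to a single string, here is a small standard-library sketch. It is a simplified stand-in, not the pipeline above: `toy_normalize` and the sample text are hypothetical, and it skips HTML stripping and contraction expansion.

```python
import re
import unicodedata

def toy_normalize(text):
    # Decompose accented characters, then drop the non-ASCII combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Keep only letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "  Caf\u00e9   Ol\u00e9!\tA truly GREAT film...  "
normalized = toy_normalize(sample)
print(normalized)  # cafe ole a truly great film
```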

In [None]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 35000/35000 [00:15<00:00, 2238.12it/s]
100%|██████████| 15000/15000 [00:06<00:00, 2249.39it/s]

CPU times: user 22.2 s, sys: 184 ms, total: 22.3 s
Wall time: 22.3 s





# Traditional Supervised Machine Learning Models

## Feature Engineering

In [None]:
%%time

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=5, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)


# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=5, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

CPU times: user 43.1 s, sys: 789 ms, total: 43.9 s
Wall time: 43.9 s


In [None]:
%%time

# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)

CPU times: user 10 s, sys: 16.8 ms, total: 10 s
Wall time: 10.1 s


In [None]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (35000, 194919)  Test features shape: (15000, 194919)
TFIDF model:> Train features shape: (35000, 194919)  Test features shape: (15000, 194919)
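
The roles of `ngram_range`, `min_df` and `sublinear_tf` can be illustrated on three tiny documents using only the standard library. This is a sketch under stated assumptions rather than scikit-learn's implementation: the helper names are hypothetical, tokenization is a plain whitespace split, and the IDF formula and L2 normalization follow what the scikit-learn documentation describes for the default `smooth_idf=True` and `norm='l2'` settings.

```python
import math
from collections import Counter

def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

toy_docs = ["good movie", "good plot", "bad movie"]
doc_terms = [word_ngrams(doc.split()) for doc in toy_docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for terms in doc_terms for term in set(terms))

# Keep terms found in at least min_df documents (2 here; the notebook uses 5).
min_df = 2
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)

# Bag of words: one row of raw counts per document, one column per term.
bow_matrix = [[Counter(terms)[term] for term in vocabulary] for terms in doc_terms]

# Smoothed inverse document frequency for each vocabulary term.
n_docs = len(toy_docs)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(terms):
    counts = Counter(terms)
    # Sublinear term frequency: replace tf with 1 + ln(tf) for present terms.
    weights = [(1 + math.log(counts[t])) * idf[t] if counts[t] > 0 else 0.0
               for t in vocabulary]
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm > 0 else 0.0 for w in weights]

tfidf_matrix = [tfidf_row(terms) for terms in doc_terms]
print(vocabulary)   # ['good', 'movie']
print(bow_matrix)   # [[1, 1], [1, 0], [0, 1]]
```

Both feature sets share the same vocabulary, which is why the BOW and TF-IDF matrices above have identical shapes; only the cell values differ.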


## Model Training, Prediction and Performance Evaluation

### Try out Logistic Regression

Logistic regression is a statistical model developed by the statistician David Cox in
1958. It is also known as the logit or logistic model, since it uses the logistic function
(popularly also known as the sigmoid function) to map a weighted sum of the features to
the probability of the outcome. The weights are the coefficients of our features, and
training estimates them so that the overall loss is minimized when predicting the outcome.
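
The logistic function and the loss that is minimized can be sketched in a few lines. This is a minimal illustration only: the coefficients, bias and feature values below are hypothetical, and the regularization term that scikit-learn adds to the loss is left out.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one document."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def log_loss(y_true, p):
    """Loss for one example; y_true is 1 (positive) or 0 (negative)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

toy_weights = [1.5, -2.0]   # hypothetical coefficients for two features
toy_bias = 0.0
toy_features = [2, 1]       # e.g. counts of two terms in one review

p_positive = predict_proba(toy_weights, toy_bias, toy_features)
print(round(p_positive, 4))               # 0.7311
print(round(log_loss(1, p_positive), 4))  # loss if the review is positive
print(round(log_loss(0, p_positive), 4))  # loss if the review is negative
```

The loss is small when the predicted probability agrees with the true label and large when it does not, which is what drives the coefficients during training.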

In [None]:
%%time

# Logistic Regression model on BOW features
from sklearn.linear_model import LogisticRegression

# instantiate model
lr = LogisticRegression(penalty='l2', max_iter=500, C=1, solver='lbfgs', random_state=42)

# train model
lr.fit(cv_train_features, train_sentiments)

# predict on test data
lr_bow_predictions = lr.predict(cv_test_features)

CPU times: user 1min 10s, sys: 47 s, total: 1min 57s
Wall time: 59.9 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative', 'positive']
print(classification_report(test_sentiments, lr_bow_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, lr_bow_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.91      0.90      0.90      7490
    positive       0.90      0.91      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



          negative  positive
negative      6756       734
positive       707      6803
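
The figures in the classification report can be recomputed by hand from the confusion matrix above, where rows are the actual classes and columns are the predicted classes. A short sketch, treating `positive` as the class of interest (the variable names are only illustrative):

```python
# Counts copied from the confusion matrix above.
true_neg, false_pos = 6756, 734   # actual negative reviews
false_neg, true_pos = 707, 6803   # actual positive reviews

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total

# Positive class: how many predicted positives are right, and how many
# actual positives were found.
precision_pos = true_pos / (true_pos + false_pos)
recall_pos = true_pos / (true_pos + false_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class: the same ratios with the roles of the classes swapped.
precision_neg = true_neg / (true_neg + false_neg)
recall_neg = true_neg / (true_neg + false_pos)

print(round(accuracy, 2))                               # 0.9
print(round(precision_pos, 2), round(recall_pos, 2))    # 0.9 0.91
print(round(precision_neg, 2), round(recall_neg, 2))    # 0.91 0.9
```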


In [None]:
%%time

# Logistic Regression model on TF-IDF features

# train model
lr.fit(tv_train_features, train_sentiments)

# predict on test data
lr_tfidf_predictions = lr.predict(tv_test_features)

CPU times: user 3.24 s, sys: 2.12 s, total: 5.35 s
Wall time: 2.81 s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, lr_tfidf_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, lr_tfidf_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      7490
    positive       0.90      0.91      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



          negative  positive
negative      6688       802
positive       666      6844


### Try out Random Forest

Decision trees are a family of supervised machine learning algorithms that can learn and
represent sets of rules automatically from the underlying data. They use metrics like
information gain and the Gini index to build the tree. However, a major drawback of
decision trees is that, since they are non-parametric, the tree tends to grow deeper as
the amount of data grows. We can end up with really huge and deep trees that are prone to
overfitting. Such a model might work really well on training data, but instead of learning
general patterns it just memorizes the training samples and builds very specific rules
for them. Hence, it performs poorly on the test data. Random forests try to tackle this
problem.

A random forest is a meta-estimator or ensemble model that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to improve the
predictive accuracy and control overfitting. Each tree in the ensemble is built from a
bootstrap sample, that is, a sample drawn with replacement from the training set, which
by default has the same size as the original input sample. Because the trees are built
independently of one another, they can be trained in parallel (bagging, or bootstrap
aggregation). Also, when splitting a node during the construction of a tree, the split
that is chosen is no longer the best split among all features. Instead, it is the best
split among a random subset of the features.
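
The two sources of randomness described above, and the way the trees are combined, can be sketched with the standard library. This is a simplified illustration with hypothetical sizes and votes: the square-root subset size is an assumption based on the documented scikit-learn default for classifiers, and scikit-learn averages the trees' predicted probabilities rather than taking the hard majority vote shown here.

```python
import math
import random
from collections import Counter

rng = random.Random(42)
n_samples, n_features = 10, 16   # hypothetical dataset dimensions

# Bootstrap sample for one tree: draw n_samples row indices with replacement,
# so some rows repeat and others are left out.
bootstrap_rows = [rng.randrange(n_samples) for _ in range(n_samples)]

# Candidate features for one split: a random subset drawn without replacement.
n_candidates = int(math.sqrt(n_features))
candidate_features = rng.sample(range(n_features), n_candidates)

# Combining the trees: majority vote over hypothetical per-tree predictions.
tree_votes = ["positive", "negative", "positive", "positive", "negative"]
forest_prediction = Counter(tree_votes).most_common(1)[0][0]

print(sorted(bootstrap_rows))
print(sorted(candidate_features))
print(forest_prediction)  # positive
```

Because every tree sees a different sample and a different set of candidate features at each split, the individual trees make different mistakes, and combining them reduces the variance of the final prediction.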

In [None]:
%%time

# Random Forest model on BOW features
from sklearn.ensemble import RandomForestClassifier

# instantiate model
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# train model
rf.fit(cv_train_features, train_sentiments)

# predict on test data
rf_bow_predictions = rf.predict(cv_test_features)

CPU times: user 3min 38s, sys: 170 ms, total: 3min 39s
Wall time: 1min 52s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, rf_bow_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, rf_bow_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      7490
    positive       0.86      0.86      0.86      7510

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



          negative  positive
negative      6410      1080
positive      1060      6450


In [None]:
%%time

# Random Forest model on TF-IDF features

# train model
rf.fit(tv_train_features, train_sentiments)

# predict on test data
rf_tfidf_predictions = rf.predict(tv_test_features)

CPU times: user 3min 9s, sys: 150 ms, total: 3min 10s
Wall time: 1min 37s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, rf_tfidf_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, rf_tfidf_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.84      0.86      0.85      7490
    positive       0.86      0.84      0.85      7510

    accuracy                           0.85     15000
   macro avg       0.85      0.85      0.85     15000
weighted avg       0.85      0.85      0.85     15000



          negative  positive
negative      6439      1051
positive      1216      6294
