# Hands-On Natural Language Processing: Predicting Text Sentiment
### Big Data Toronto Master Class, June 12, 2018
### Nick Pogrebnyakov

## 1. Imports and setup

Import libraries and download the dataset.

The dataset is here: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

It contains sentences from Amazon, IMDB and Yelp reviews, each labeled with its sentiment. Each sentence is on a separate line, followed by a tab ('\t') character and its sentiment label: 0 for negative, 1 for positive. For example:

`Wow... Loved this place.	1`
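
As a minimal sketch of how one such line can be separated into its text and its label (the helper name `parse_line` is hypothetical and not part of the notebook):

```python
def parse_line(line):
    """Split one 'sentence<TAB>label' line into (sentence, label)."""
    # Split from the right so that a stray tab inside the sentence
    # cannot be mistaken for the separator before the label.
    sentence, label = line.rstrip('\n').rsplit('\t', 1)
    return sentence, int(label)

print(parse_line("Wow... Loved this place.\t1\n"))
```

The label comes back as an integer here, whereas the notebook keeps it as a string until section 2.5.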

First, import libraries.

In [1]:
# imports
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import os
import zipfile
import urllib.request
import shutil
import re
useLocal = True  # use locally saved dataset, rather than download it from the internet

Next, download the dataset, extract files and read the IMDB movie reviews into a variable.

In [2]:
if not useLocal:
    print('Downloading dataset...', end = '')
    # download the data: we're using the dataset of sentences labeled with sentiment from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
    urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip", "sentiment.zip")
    print(' done.')

    # unzip the file
    with zipfile.ZipFile('sentiment.zip', 'r') as z:
        z.extractall('sentiment')
    
# get IMDB movie reviews from the dataset and put them into a list
with open(os.path.join('sentiment', 'sentiment labelled sentences', 'imdb_labelled.txt'), 'r') as f:
    moviesSet = f.readlines()

if not useLocal:
    # clean up: remove downloaded files
    shutil.rmtree('sentiment')
    os.remove('sentiment.zip')

## 2. Prepare data for analysis: convert text to features

#### 2.1. Convert to lowercase and extract sentiment

In [3]:
# separate sentiment so we get a list of [sentence, sentiment] for each sentence; also convert to lowercase
movies = [m.rstrip('\n').lower().split('\t') for m in moviesSet]

# print the first item
print(movies[0])

['a very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ', '0']


#### 2.2. Remove non-alphanumeric characters (commas, brackets etc.)

In [4]:
# remove non-alphanumeric characters
movies = [[re.sub(r'\W+', ' ', m[0]), m[1]] for m in movies]

print(movies[0])

['a very very very slow moving aimless movie about a distressed drifting young man ', '0']
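
To see what the `\W+` pattern does on other inputs, here is a small sketch with made-up sample strings (they are not taken from the dataset). Note that `\W` treats the underscore as a word character, and that apostrophes are replaced too, so contractions are split apart.

```python
import re

# Made-up sample strings, for illustration only
samples = ["slow-moving, aimless movie.", "don't miss it!", "snake_case stays"]

# Replace every run of non-word characters with a single space
cleaned = [re.sub(r'\W+', ' ', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```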


#### 2.3. Tokenize the sentence

In [5]:
# tokenize
movies = [[nltk.word_tokenize(m[0]), m[1]] for m in movies]

print(movies[0])

[['a', 'very', 'very', 'very', 'slow', 'moving', 'aimless', 'movie', 'about', 'a', 'distressed', 'drifting', 'young', 'man'], '0']
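
Because the previous step already replaced punctuation with spaces, a plain whitespace split produces the same tokens for this particular sentence. The sketch below uses `str.split()` only as a stand-in: `nltk.word_tokenize` applies additional rules and may give different results on other input. Depending on the NLTK version, it may also need its tokenizer data downloaded once before first use.

```python
# The cleaned first sentence, as printed in step 2.2
cleaned = 'a very very very slow moving aimless movie about a distressed drifting young man '

# Split on runs of whitespace; leading and trailing whitespace is ignored
tokens = cleaned.split()
print(tokens)
print(len(tokens), 'tokens')
```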


#### 2.4. Stem

In [6]:
# stem
stemmer = PorterStemmer()
movies = [[[stemmer.stem(x) for x in m[0]], m[1]] for m in movies]

print(movies[0])

[['a', 'veri', 'veri', 'veri', 'slow', 'move', 'aimless', 'movi', 'about', 'a', 'distress', 'drift', 'young', 'man'], '0']


#### 2.5. Convert text to features

In [15]:
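# create an instance of TfidfVectorizer; get it to remove stopwords
# (use_idf = False, norm = None and binary = True make each feature a 0/1 indicator of whether the word occurs)
tf = TfidfVectorizer(stop_words = 'english', use_idf = False, norm = None, binary = True)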

# convert text to features
x = tf.fit_transform([' '.join(m[0]) for m in movies])

# get feature names (scikit-learn >= 1.0; older versions use get_feature_names())
featureNames = tf.get_feature_names_out()

print(x[0].toarray().shape)

(1, 2353)
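
With `use_idf = False`, `norm = None` and `binary = True`, each row of `x` records only whether a vocabulary word occurs in the sentence. The sketch below builds the same kind of matrix by hand for three made-up stemmed sentences. The stop list and all names in it are illustrative assumptions, and it skips details of the real vectorizer, such as its default token pattern, which drops one-character tokens.

```python
# Made-up stemmed sentences and a tiny hypothetical stop list
docs = ["veri slow move movi", "movi beauti soundtrack", "slow plot bad movi"]
stop_words = {"veri"}

# Vocabulary: every non-stop-word token, sorted, mapped to a column index
vocab = sorted({t for d in docs for t in d.split() if t not in stop_words})
index = {term: i for i, term in enumerate(vocab)}

def to_binary_vector(doc):
    """Return a 0/1 vector marking which vocabulary terms occur in doc."""
    vec = [0] * len(vocab)
    for token in doc.split():
        if token in index:
            vec[index[token]] = 1
    return vec

matrix = [to_binary_vector(d) for d in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each sentence becomes one row with one column per vocabulary term, which matches the `(1, 2353)` shape printed above: one sentence and 2353 terms.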


In [8]:
y = [int(m[1]) for m in movies]
print(y[:5])

[0, 0, 0, 0, 1]


## 3. Train the model

#### 3.1. Split into training and test sets

In [9]:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.3, random_state = 10)
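
Conceptually, the split shuffles the row indices with a fixed seed and holds out 30% of them for testing. The following is a simplified sketch of that idea using only the standard library. The function `simple_split` is hypothetical, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_split(items, labels, test_size=0.3, seed=10):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)           # same seed -> same shuffle
    n_test = round(len(items) * test_size)     # size of the held-out part
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda seq, ids: [seq[i] for i in ids]
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

items = ['s%d' % i for i in range(10)]   # ten placeholder sentences
labels = [i % 2 for i in range(10)]      # placeholder 0/1 labels
x_train, x_test, y_train, y_test = simple_split(items, labels)
print(len(x_train), len(x_test))
```

Fixing the seed, as `random_state = 10` does above, makes the split repeatable across runs.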

#### 3.2. Classifier: SVM

In [10]:
clf = LinearSVC()
clf.fit(xTrain, yTrain)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

#### 3.3. Test prediction accuracy

In [11]:
yPred = clf.predict(xTest)
accuracy = accuracy_score(yTest, yPred)
print("SVM prediction accuracy:", accuracy)

SVM prediction accuracy: 0.7633333333333333
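
Accuracy here is the share of test sentences whose predicted label equals the true label. A minimal sketch with made-up labels (the function name `accuracy` is illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: four of the five predictions are correct
score = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(score)
```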


Prediction accuracy is decent, at about 76%. However, we don't yet know how the classifier makes its decisions. Which words are associated with positive sentiment and which with negative?

To answer this, we'll run a different classifier.

#### 3.4. Classifier: random forest

Run the random forest and calculate accuracy score.

In [12]:
clfRf = RandomForestClassifier(n_estimators = 1000)
clfRf.fit(xTrain, yTrain)
yPredRf = clfRf.predict(xTest)
accuracyRf = accuracy_score(yTest, yPredRf)
print("Random forest accuracy:", accuracyRf)

Random forest accuracy: 0.7


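The accuracy is lower than the SVM's (0.70 versus 0.76). However, with random forest we can explore which features matter most for the prediction.

Extract feature importances from the random forest classifier object and merge them with our list of feature names obtained earlier from `TfidfVectorizer`. Feature importances are unsigned, so to separate positive words from negative ones the code below also sorts the coefficients of the linear SVM, which do carry a sign.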

In [13]:
featureImportances = list(zip(featureNames, clfRf.feature_importances_))
featureImportances = sorted(featureImportances, key = lambda x: x[1], reverse = True)

coef = clf.coef_[0]

# how many top positive and negative features to return
topFeatures = 10

# sort the coefficients obtained from SVM (ascending) and get the indices of the largest and smallest values
sortedCoef = np.argsort(coef)
topPos = sortedCoef[-topFeatures:]
topNeg = sortedCoef[:topFeatures]

print("Top positive features\n---------------------")
for i in topPos:
    print(featureNames[i], '\t', coef[i])
    
print("\nTop negative features\n---------------------")
for i in topNeg:
    print(featureNames[i], '\t', coef[i])

Top positive features
---------------------
cool 	 0.7368554299985697
actual 	 0.7467719684772607
wonder 	 0.7702247101916363
charismat 	 0.7823762970273515
saw 	 0.8316727047110705
entertain 	 0.8415738319021893
beauti 	 0.8481407258332057
miss 	 0.8667819561461167
soundtrack 	 0.945208490392524
advis 	 0.9965359141672195

Top negative features
---------------------
disappoint 	 -1.2436092133288785
rate 	 -1.1542279462255924
hate 	 -1.0993691256411333
bad 	 -1.0812671036166939
ridicul 	 -1.069634034095794
plot 	 -1.0208036532619846
suck 	 -1.0177028676085733
flick 	 -0.9451023504389604
aw 	 -0.917235790819219
worst 	 -0.8461383377673802
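
The selection logic above can be sketched without NumPy. Sorting the indices by coefficient value in ascending order puts the most negative weights first and the most positive ones last. The names and weights below are made up for illustration and are not the fitted values.

```python
# Made-up feature names and signed weights, for illustration only
names = ["bad", "beauti", "movi", "plot", "soundtrack"]
weights = [-1.08, 0.85, 0.02, -1.02, 0.95]

# Indices sorted by weight, ascending (the same ordering np.argsort gives)
order = sorted(range(len(weights)), key=lambda i: weights[i])

k = 2  # how many features to report on each side
top_neg = [names[i] for i in order[:k]]    # most negative first
top_pos = [names[i] for i in order[-k:]]   # most positive last

print("negative:", top_neg)
print("positive:", top_pos)
```

As in the notebook's output, the positive list runs from weaker to stronger and the negative list from stronger to weaker.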
