
# Scikit-Learn Sentiment Analysis

In this notebook we will demonstrate sentiment analysis using labelled training data.

This is an example of supervised learning, specifically classification.

#### Step 1: Load, Clean, Prepare the data

In [1]:
import pandas as pd

df = pd.read_csv('https://s3.eu-west-1.amazonaws.com/neueda.conygre.com/pydata/ml_fc/Restaurant_Reviews.tsv', sep='\t')

print(df.shape)
df.head()

(1000, 2)


| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


Note: to keep things simple, the preparation function below does very little. We could include further steps such as:
* stemming / lemmatization
* removing stop words
* identification of the most semantically valuable words
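
As a minimal sketch of one of these steps, the function below adds stop-word removal to the same regular-expression clean-up. It assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` list is an acceptable stop-word source; the function name and the `NEGATIONS` set are hypothetical choices made for illustration. Negation words are kept on purpose, because dropping "not" would make "not good" look like "good".

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical choice: keep negations, since they carry sentiment
NEGATIONS = {"not", "no", "nor", "never"}
STOP_WORDS = set(ENGLISH_STOP_WORDS) - NEGATIONS


def prepare_review_without_stop_words(review):
    # Same clean-up as before: letters only, lower case
    review = re.sub('[^a-zA-Z]', ' ', review).lower()
    # Drop stop words but keep everything else in its original order
    kept_words = [word for word in review.split() if word not in STOP_WORDS]
    return ' '.join(kept_words)


example = prepare_review_without_stop_words('Not tasty and the texture was just nasty.')
print(example)
```

Whether this helps is an empirical question: it would need to be checked against the accuracy of the resulting model.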

In [2]:
# for this situation we will use a regular expression to remove all non-alphabetical characters e.g. '[^a-zA-Z]'
# we will also lower case everything
# we could improve on this function alongside testing resulting models to verify gains in accuracy
import re

def prepare_review(review):
    
    # A VERY simple "clean up" - remove everything non-alphabetical, then lower case
    review = re.sub('[^a-zA-Z]', ' ', review)    
    
    return review.lower()


In [3]:
# test the prepare review function with one of our reviews
prepare_review('Not tasty and the texture was just nasty.')

'not tasty and the texture was just nasty '

In [4]:
# apply the prepare_review function to each review (Series.apply)
cleaned_reviews = df['Review'].apply(prepare_review).tolist()

# print the first few "cleaned" reviews
cleaned_reviews[:5]

['wow    loved this place ',
 'crust is not good ',
 'not tasty and the texture was just nasty ',
 'stopped by during the late may bank holiday off rick steve recommendation and loved it ',
 'the selection on the menu was great and so were the prices ']

##### This step encodes the reviews as a "Bag of Words". The CountVectorizer utility does this for us.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [5]:
# use the CountVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)

#### Step 2: Split into independent and dependent variables

In [6]:
# create the X and y variables from the appropriate columns
# X should be a multi-dimensional element (array/DataFrame)
# y in this case will be a single-dimension on the label column (Liked)
#
# Note: remember to pass reviews through the count vectorizer fit_transform method first
X = cv.fit_transform(cleaned_reviews).toarray()  # Independent variables

y = df['Liked'].values

#### Step 3: Split our data into training and test sets 

In [7]:
# We can use sklearn.model_selection train_test_split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
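
The split above is random, so the accuracy figures in this notebook will vary from run to run. As a sketch of one way to make a split repeatable, the example below uses the `random_state` and `stratify` parameters of `train_test_split` on made-up data (the `toy_` arrays and the seed value 42 are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, balanced labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# random_state fixes the shuffle; stratify keeps the class balance in both sets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Repeating the call with the same random_state gives the same split
_, toy_X_test_again, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(toy_X_test.shape)                              # (2, 2)
print(np.array_equal(toy_X_test, toy_X_test_again))  # True
```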

#### Step 4: Choose and train the model

In [8]:
# we could use any classification model e.g. LogisticRegression
from sklearn.linear_model import LogisticRegression

# create and train the model
model = LogisticRegression()

model.fit(X_train, y_train)

LogisticRegression()

#### Step 5: Examine & Measure the model

Because this is a classification problem, we will use a confusion matrix. Further metrics, such as precision and recall, can be derived from it.


#### Here we are demonstrating inference / prediction

Let's use the model to predict the sentiment of some "new" reviews.

Try making up some reviews of your own and passing them to the model.

In [9]:
# simulating a new review
new_review = "Service was awful, tasted revolting!" # should be predicted as negative!

# pass the new review through our "prepare_review" function
prepared_review = prepare_review(new_review)

# pass the cleaned new review to the count vectorizer to create the "number list"
feature_set = cv.transform( [prepared_review] )

# ask the model to predict the sentiment
result = model.predict(feature_set.toarray())

# print a simple message
print(result)
print()

if result[0] == 1:
    print('Glad you enjoyed your visit!')
else:
    print('Apologies')

[0]

Apologies
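
To try several reviews at once, the same three steps (prepare, vectorize, predict) can be wrapped in a helper. The sketch below is self-contained, so it trains a tiny model on four made-up reviews; the `predict_sentiments` function and all `toy_` names are hypothetical. In the notebook you would pass `cv` and `model` instead.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def prepare_review(review):
    # Same simple clean-up as earlier in the notebook
    return re.sub('[^a-zA-Z]', ' ', review).lower()


def predict_sentiments(reviews, vectorizer, classifier):
    # Prepare, vectorize and predict a whole list of reviews in one call
    prepared = [prepare_review(review) for review in reviews]
    features = vectorizer.transform(prepared)
    return classifier.predict(features)


# Tiny made-up training set, for illustration only
toy_reviews = ['great food', 'loved it', 'awful service', 'terrible food']
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_features = toy_cv.fit_transform([prepare_review(r) for r in toy_reviews])
toy_model = LogisticRegression().fit(toy_features, toy_labels)

predictions = predict_sentiments(['Loved it, great!', 'Awful, terrible.'], toy_cv, toy_model)
print(predictions)  # [1 0]
```

Note that the new reviews go through `transform`, not `fit_transform`, so they are encoded with the vocabulary learned during training.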


For a more formal measurement we create a confusion matrix for the test data.

This gives us information on where the model is likely to be correct/incorrect.

In [10]:
# sklearn.metrics has a confusion_matrix function
from sklearn.metrics import confusion_matrix

cmatrix = confusion_matrix(y_test, model.predict(X_test))

print(cmatrix)

print('Accuracy: ', (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())

[[79 20]
 [16 85]]
Accuracy:  0.82
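
As a sketch of the further metrics that can be derived from a confusion matrix, the example below computes precision, recall and F1 by hand. The matrix values are copied from the output above; a different random split would give different numbers. In scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
import numpy as np

# Values copied from the run above: [[TN, FP], [FN, TP]]
cmatrix = np.array([[79, 20],
                    [16, 85]])

tn, fp, fn, tp = cmatrix.ravel()

accuracy = (tp + tn) / cmatrix.sum()
precision = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of the positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.820
print(f"Precision: {precision:.3f}")  # 0.810
print(f"Recall:    {recall:.3f}")     # 0.842
print(f"F1 score:  {f1:.3f}")         # 0.825
```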


#### Further Steps

For reference, how do some other sklearn classification models perform?

In [11]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [12]:
# Try MultinomialNB
MNB_classifier = MultinomialNB()
MNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, MNB_classifier.predict(X_test))
print("MNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


MNB_classifier accuracy: 0.84


In [13]:
# Try BernoulliNB
BNB_classifier = BernoulliNB()
BNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, BNB_classifier.predict(X_test))
print("BNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


BNB_classifier accuracy: 0.81


In [14]:
# Try SGDClassifier
SGD_classifier = SGDClassifier()
SGD_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, SGD_classifier.predict(X_test))
print("SGD_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


SGD_classifier accuracy: 0.8


In [15]:
# Try LinearSVC
SVC_classifier = LinearSVC()
SVC_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, SVC_classifier.predict(X_test))
print("SVC_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


SVC_classifier accuracy: 0.8


In [16]:
# Try MLPClassifier
from sklearn.neural_network import MLPClassifier

MLP_classifier = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)  # one hidden layer of 20 units
MLP_classifier.fit(X_train, y_train)

cmatrix = confusion_matrix(y_test, MLP_classifier.predict(X_test))
print("MLP_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())

MLP_classifier accuracy: 0.81
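
The cells above repeat the same fit, predict and score pattern. One possible way to shorten this, sketched below, is to loop over a dictionary of models and use `accuracy_score`, which computes the same value as the manual confusion-matrix calculation. The data here is randomly generated count data standing in for `X` and `y` (an assumption made so the example runs on its own), so the printed accuracies are not comparable with those above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up count data standing in for the bag-of-words matrix
rng = np.random.default_rng(0)
toy_X = rng.poisson(1.0, size=(200, 30))
toy_y = (toy_X[:, 0] > toy_X[:, 1]).astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

# Hypothetical selection of models to compare
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
}

scores = {}
for name, classifier in candidates.items():
    classifier.fit(toy_X_train, toy_y_train)
    scores[name] = accuracy_score(toy_y_test, classifier.predict(toy_X_test))
    print(f"{name} accuracy: {scores[name]:.2f}")
```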
