<h1>Predicting user ratings based on corresponding text reviews</h1>  
<h6>Paul Disbeschl and Timothy Smeets</h6>
<h6>Department of Data Science and Knowledge Engineering, Maastricht University</h6>

This notebook applies pre-processing to the SAR14 dataset of Nguyen et al. and showcases the effectiveness of different techniques used to predict the sentiments of film reviews.
<b><i> DISCLAIMER: As the training and test data split is partially randomized, keep in mind that running the code might result in small deviations from the accuracy scores given.</i></b>

In [1]:
import nltk
import numpy
import matplotlib.pyplot
from nltk.tokenize import word_tokenize
from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize
import pandas as pd
import re
import os
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
from nltk.corpus import sentiwordnet as sent
from io import StringIO

from sklearn.model_selection import train_test_split #to split data, also for stratification
import random #also to split our shit

df = pd.read_csv('data/sar14.txt', sep=" ", header=None, names=["review", "score"])
df.head()

reviews = df['review']
scores = df['score']
x_tr, x_te, y_tr, y_te = train_test_split(reviews, scores, stratify=scores, test_size=0.2)
#This applies stratification to the dataset and splits it up equally according to score. X = review col, Y = score col

train_x = pd.DataFrame({'review':x_tr})
train_y = pd.DataFrame({'score':y_tr})
train_data = pd.concat([train_x,train_y], axis=1) #train_data is the training data, 80% of the text and scores
train_data.head()

test_x = pd.DataFrame({'review':x_te})
test_y = pd.DataFrame({'score':y_te})
data = pd.concat([test_x,test_y], axis=1) #data is the testing data, 20% of the text and scores from the dataset
data.head()

Unnamed: 0,review,score
136516,I love the movie Rent !!!! . I love the movie...,",10"
88514,Breathtaking visuals . If they gave Oscars fo...,",8"
154144,A Breath of Fresh Air . Some of you are takin...,",10"
53637,Fight Movie Clich s in Spaaaaaaaaaaaaace . . ...,",4"
83508,"Just crap . This movie is a total crap , ther...",",2"


<h1><b> Pre-processing </b>  </h1>

In order to run experiments on our dataframe, it needs to be cleaned first. Our approach for this is to use regular expressions.

The following method is used to clean the dataset using regular expressions.

In [2]:
def clean_text(text):

    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r"-RRB-",'',text)
    text = re.sub(r"-LRB-",'',text)
    text = re.sub(r"\\", "", text)    
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)    
    text = text.strip().lower()
    
    # replace punctuation characters with spaces
    filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text

The following code uses the method above to clean both the training and the test data. This is more for visual purposes so you can see what the cleaned text looks like, as later on we use the clean_text method within the vectorizer functions provided by the sklearn package.

In [3]:
for i in train_data.index:
    train_data.at[i, 'review'] = clean_text(train_data.at[i, 'review'])
    train_data.at[i, 'score'] = clean_text(train_data.at[i,'score'])
    
for i in data.index:
    data.at[i, 'review'] = clean_text(data.at[i, 'review'])
    data.at[i, 'score'] = clean_text(data.at[i,'score'])
    
data.head()

Unnamed: 0,review,score
136516,i love the movie rent i love the movie ...,10
88514,breathtaking visuals if they gave oscars for...,8
154144,a breath of fresh air some of you are taking...,10
53637,fight movie clich s in spaaaaaaaaaaaaace h...,4
83508,just crap this movie is a total crap there...,2


In order to test the data on just positive/negative sentiment, we need to assign sentiment to the dataframe.

In [4]:
train_data.score = pd.to_numeric(train_data.score, errors = 'coerce')
data.score = pd.to_numeric(data.score, errors = 'coerce')

for i in train_data.index:
    if train_data.at[i,'score']> 7:
        train_data.at[i,'binarysentiment'] = 1
    else:
        train_data.at[i,'binarysentiment'] = 0
    
    
for j in data.index:
    if data.at[j,'score']> 7:
        data.at[j,'binarysentiment'] = 1
    else:
        data.at[j,'binarysentiment'] = 0
        
train_data.head()

Unnamed: 0,review,score,binarysentiment
14066,more stupid jerry once again jerry stinks up...,1,0.0
57123,nice little drama good performances from the...,7,0.0
53460,what an amazing love story i adored this fil...,10,1.0
174391,great i was told about the 4400 after the ...,7,0.0
54693,pan is back purportedly steven spielberg is ...,9,1.0


In order to do research on multi-class classification, we can use the "score" column. Later on, we also want to lower the amount of classes, for which we have to create a new column where a value is assigned corresponding to the "score" column.

In [5]:
#assign sentiment with classes
# 1-2 = 1, 3-4 = 2, 7-8 = 3, 9-10 = 4

train_data.score = pd.to_numeric(train_data.score, errors = 'coerce')
data.score = pd.to_numeric(data.score, errors = 'coerce')

for i in train_data.index:
    if train_data.at[i,'score']<= 2:
        train_data.at[i,'sentiment'] = 1
    elif train_data.at[i,'score']== 4 or train_data.at[i,'score']== 3:
        train_data.at[i,'sentiment'] = 2
    elif train_data.at[i,'score']== 7 or train_data.at[i,'score']== 8:
        train_data.at[i,'sentiment'] = 3
    elif train_data.at[i,'score']== 9 or train_data.at[i,'score']== 10:
        train_data.at[i,'sentiment'] = 4
    else:
        train_data.at[i,'sentiment'] = 0

    
for j in data.index:
    if data.at[j,'score']<= 2:
        data.at[j,'sentiment'] = 1
    elif data.at[j,'score']== 4 or data.at[j,'score']== 3:
        data.at[j,'sentiment'] = 2
    elif data.at[j,'score']== 7 or data.at[j,'score']== 8:
        data.at[j,'sentiment'] = 3
    elif data.at[j,'score']== 9 or data.at[j,'score']== 10:
        data.at[j,'sentiment'] = 4
    else:
        data.at[j,'sentiment'] = 0
        
train_data.head()

Unnamed: 0,review,score,binarysentiment,sentiment
14066,more stupid jerry once again jerry stinks up...,1,0.0,1.0
57123,nice little drama good performances from the...,7,0.0,3.0
53460,what an amazing love story i adored this fil...,10,1.0,4.0
174391,great i was told about the 4400 after the ...,7,0.0,3.0
54693,pan is back purportedly steven spielberg is ...,9,1.0,4.0


<h1><b>Tests on binary sentiment (positive/negative)  </b>  </h1>

<b>SVM without tf-idf and without N-grams </b>

77.72% accuracy on binary sentiment

In [11]:
# Transform each review into a vector of word counts
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text)
                             

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

# Training a Linear Support Vector Machine model
model = LinearSVC()
model.fit(training_features, train_data["binarysentiment"])
y_pred = model.predict(test_features)

# Calculating the accuracy
acc = accuracy_score(data["binarysentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 77.72




<b>SVM with tf-idf and with N-grams </b>

84.26% accuracy on binary sentiment

In [12]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["binarysentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["binarysentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 84.26


<b>SVM with tf-idf and without N-grams </b>

Score of 82.11% accuracy

In [None]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text)

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["binarysentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["binarysentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

<b>SVM without tf-idf and with N-grams </b>

Score of 82.11% accuracy

In [None]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["binarysentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["binarysentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

<b> Naive Bayes without tf-idf and without N-grams </b>

83.01% accuracy on binary sentiment

In [8]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text)
                             
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=6)
clf.fit(training_features,train_data["binarysentiment"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["binarysentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 82.30


<b> Naive Bayes with tf-idf and with N-grams </b>

78.46% accuracy on binary sentiment

In [7]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["binarysentiment"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["binarysentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 78.46


<b> Naive Bayes with tf-idf and without N-grams </b>

Score is 81.65

In [None]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text)
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["binarysentiment"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["binarysentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

<b> Naive Bayes without tf-idf and with N-grams </b>

Score is 83.81

In [None]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["binarysentiment"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["binarysentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

<h1><b>Tests on grouped scores </b>  </h1>

Here, scores are grouped together to create fewer classes. The "sentiment" column is used for training and testing.

The groups are as follows: {1,2} , {3,4} , {7,8} , {9,10}

<b>SVM without tf-idf and without N-grams </b>

58.82% accuracy on the grouped scores

In [18]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text)
                             

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["sentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 58.82




<b>SVM with tf-idf and with N-grams </b>

67.84% accuracy on the grouped scores

In [19]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["sentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 67.91


<b>SVM with tf-idf and without N-grams </b>

Score of 64.42%

In [15]:
#SVM WITH TFIDF NO NGRAMS 64.42%
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text)

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["sentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 64.42


<b>SVM without tf-idf and with N-grams </b>

Score of 65.78%

In [16]:
#SVM NO TFIDF WITH NGRAMS 65.78%
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["sentiment"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 65.78




<b> Naive Bayes without tf-idf and without N-grams </b>

69.38% accuracy on the grouped scores

In [9]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text)
                             
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["sentiment"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["sentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 69.38


<b> Naive Bayes with tf-idf and with N-grams </b>

52.04% accuracy on the grouped scores

In [10]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["sentiment"])

y_pred= clf.predict(training_features)

acc = accuracy_score(train_data["sentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 52.04


<b>NB with tf-idf and without N-grams </b>

Score of 56.43%

In [13]:
#NB WITH TFIDF NO NGRAMS 56.43%
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["sentiment"])

y_pred= clf.predict(training_features)

acc = accuracy_score(train_data["sentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 56.43


<b>NB without tf-idf with N-grams </b>

Score of 94.11%

In [19]:
#NB NO TFIDF WITH NGRAMS 94.11%
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["sentiment"])

y_pred= clf.predict(training_features)

acc = accuracy_score(train_data["sentiment"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 94.11


<h1><b>Tests on scores from 1-4 and 7-10 </b>  </h1>

This is the hardest category for the model to train and test on, as it has the most classes. The "score" column is used for training and testing.

<b>SVM without tf-idf and without N-grams </b>

35.88% accuracy on the scores

In [25]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text)
                             

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["score"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["score"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 35.88




<b>SVM with tf-idf and with N-grams </b>

45.98% accuracy on the scores

In [31]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["score"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["score"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 45.98


<b>SVM with tf-idf and without N-grams </b>

Score of 42.02%

In [17]:
#SVM WITH TFIDF NO NGRAMS 42.02%
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text)

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["score"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["score"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 42.02


<b>SVM without tf-idf and with N-grams </b>

Score of 42.97%

In [18]:
#SVM NO TFIDF WITH NGRAMS 42.97%
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams

training_features = vectorizer.fit_transform(train_data["review"])    
test_features = vectorizer.transform(data["review"])

model = LinearSVC()
model.fit(training_features, train_data["score"])
y_pred = model.predict(test_features)

acc = accuracy_score(data["score"], y_pred)

print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 42.97


<b> Naive Bayes without tf-idf and without N-grams </b>

56.25% accuracy on the scores

In [11]:
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text) #This is CountVectorizer, NOT TfidfVectorizer
                             
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["score"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["score"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 56.25


<b> Naive Bayes with tf-idf and with N-grams </b>

28.40 % accuracy on the scores

In [12]:
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["score"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["score"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

Accuracy: 28.42


<b>NB with tf-idf and without N-grams </b>

Score of 78.55%

In [None]:
#NB WITH TFIDF NO NGRAMS 78.55%
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["score"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["score"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))

<b>NB without tf-idf and with N-grams </b>

Score of 33.98%

In [None]:
#NB NO TFIDF WITH NGRAMS 33.98%
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2)) #Set here for unigrams and bigrams
training_features = vectorizer.fit_transform(train_data["review"].values)
test_features = vectorizer.transform(data["review"].values)

clf = MultinomialNB(alpha=1)
clf.fit(training_features,train_data["score"])

y_pred = clf.predict(training_features)

acc = accuracy_score(train_data["score"], y_pred, normalize=True)
print("Accuracy: {:.2f}".format(acc*100))