# Sentiment Analysis Coursework
Alex Dawkins (asd60@bath.ac.uk), python 3.8

The aim of this coursework is to write a sentiment analysis application to classify
movie reviews as either **positive**, or **negative**.

## The Dataset

The dataset is 25,000 highly polar movie reviews. It has already been split into training and testing subsets.
The dataset can be found [here](https://ai.stanford.edu/~amaas/data/sentiment/).


In [87]:
import string
from typing import List

# Import and setup NLTK
import nltk
import os
from nltk.corpus import stopwords
from nltk.lm import Vocabulary
from sklearn.naive_bayes import MultinomialNB
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexdawkins/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexdawkins/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [124]:
# Load the reviews
import os
n_reviews = 25_000
data_dir = '../../data'
train_dir = f'{data_dir}/aclImdb/train'
test_dir = f'{data_dir}/aclImdb/test'
neg_train_dir = f'{train_dir}/neg'
pos_train_dir = f'{train_dir}/pos'

pos_test_dir = f'{test_dir}/pos'
neg_test_dir = f'{test_dir}/neg'


def load_reviews(fps: List[str], dir_: str, max_n: int = -1) -> List[str]:
    fps_cut = fps
    if max_n != -1:
        fps_cut = fps[:max_n]
    reviews = []
    for fp in fps_cut:
        with open(dir_ + '/' + fp, 'r') as f:
            reviews.append(f.read())

    return reviews

def get_rating(fp: str) -> int:
    try:
        return int(fp.split('_')[1].split('.')[0])
    except (ValueError, IndexError) as e:
        raise Exception(f"Couldn't extract rating from filepath: '{fp}'") from e


print("Finding reviews...")
neg_fps = [fp for fp in os.listdir(neg_train_dir) if fp.endswith('.txt')]
pos_fps = [fp for fp in os.listdir(pos_train_dir) if fp.endswith('.txt')]
test_pos_fps = [fp for fp in os.listdir(pos_test_dir) if fp.endswith('.txt')]
test_neg_fps = [fp for fp in os.listdir(neg_test_dir) if fp.endswith('.txt')]

print("Loading reviews...")
pos_reviews = load_reviews(pos_fps, pos_train_dir, 1000)
neg_reviews = load_reviews(neg_fps, neg_train_dir, 1000)
test_pos_reviews = load_reviews(test_pos_fps, pos_test_dir, 100)
test_neg_reviews = load_reviews(test_neg_fps, neg_test_dir, 100)
# print(f"{test_pos_reviews[0]=}")

print("Extracting ratings...")
pos_ratings = [get_rating(fp) for fp in pos_fps][:len(pos_reviews)]
neg_ratings = [get_rating(fp) for fp in neg_fps][:len(neg_reviews)]
test_pos_ratings = [get_rating(fp) for fp in test_pos_fps][:len(test_pos_reviews)]
test_neg_ratings = [get_rating(fp) for fp in test_neg_fps][:len(test_neg_reviews)]

print(f"Loaded {len(pos_reviews)} positive reviews, and {len(neg_reviews)} negative reviews.\n---")
print(f"{pos_reviews[0]=}")
print(f"{neg_reviews[0]=}")

Finding reviews...
Loading reviews...
Extracting ratings...
Loaded 1000 positive reviews, and 1000 negative reviews.
---
pos_reviews[0]='For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.'
neg_reviews[0]="Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."


Although the reviews in the dataset are written by many different people, it's possible that there could be some
deviations in the lengths of reviewers' words or sentences, depending on whether they are talking about something
in a positive or a negative manner.

Perhaps more critical reviews are more likely to use longer, more technical words as the reviewers wants to use technical
language to reason their point.

Alternatively, there might be no correlation at all, as any consistency between up to 12,500 reviews is quite unlikely.
This approach is more likely to be effective to distinguish between two different authors.

In [89]:
def lengths(data: str, name: str):
    words = nltk.tokenize.word_tokenize(data)
    num_words = len(words)
    avg_word_len = round(len(data) / num_words)
    avg_sent_len = round(num_words / len(nltk.tokenize.sent_tokenize(data)))
    # average number of times each word occurs uniquely
    avg_n_unique_word = round(num_words / len(set(w.lower() for w in words)))
    print(avg_word_len, avg_sent_len, avg_n_unique_word, name)

# Turn the lists into strings
all_pos_reviews = '\n'.join(pos_reviews)
all_neg_reviews = '\n'.join(neg_reviews)

lengths(all_pos_reviews, 'pos')
lengths(all_neg_reviews, 'neg')

5 28 14 pos
5 26 14 neg


There doesn't appear to be any difference at all! **This technique will not work for categorising the data,** as both
positive and negative reviews have very similar word and sentence length, and amount of unique vocabulary.


As part of the data set, a tokenised list of words (`imdb.vocab`), and the associated expected rating for each token
(`imdbEr.txt`). This list of expected ratings was computed by (Potts, 2011).

We can take the sum of each word's expected rating as the review's expected rating.

In [90]:
def try_make_float(x):
    try:
        return float(x)
    except ValueError:
        return 0.

with open(f'{data_dir}/aclImdb/imdb.vocab', 'r') as f:
        vocab = f.read().split('\n')

# faster than vocab.index()
vocab_index = {word: i for i, word in enumerate(vocab)}

with open(f'{data_dir}/aclImdb/imdbEr.txt', 'r') as f:
    expected_ratings = list(map(try_make_float, f.read().split('\n')))

In [84]:
def evaluate(to_test, *, positive, modifier=None):
    if not modifier:
        def modifier(_):
            return 1
    n_positive = 0
    for i, rev in enumerate(to_test):
        words = [word.lower() for word in nltk.word_tokenize(rev) if word not in '.,\'"']
        expected_rating = 0
        n_words = len(words)
        for j, word in enumerate(words):
            idx = vocab_index.get(word, None)
            if not idx:
                continue
            expected_rating += modifier(j/n_words) * expected_ratings[idx]

        if (expected_rating > 0) == positive:
            # print(n_positive, i)
            n_positive += 1

    print(f"Accuracy: {n_positive}/{len(to_test)} ({n_positive/len(to_test)})")

In [85]:
evaluate(test_pos_reviews, positive=True)

Accuracy: 79/100 (0.79)


In [86]:
evaluate(test_neg_reviews, positive=False)

Accuracy: 75/100 (0.75)


From reading some of the reviews, I found that a lot of them include a lot of pre-amble, describing the context of how
the reviewer watched the movie, their initial thoughts, etc. These words are independent of the sentiment of the whole
review.

For example, `pos/13_9.txt`:

 >I work at a movie theater and every Thursday night we have an employee screening of one movie that comes out the next day...Today it was The Guardian. I saw the trailers and the ads and never expected much from it, and in no way really did i anticipate seeing this...


In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def parabola(x_):
    return (-(1.75 * x_ - 0.875) ** 2) + 1
def positive_gradient(x_):
    return ( 0.5 * x_ ) + 0.5
def negative_gradient(x_):
    return -( 0.5 * x_ ) + 1


x = np.linspace(0, 1, 41)
plt.plot(x, parabola(x),label="Parabola")
plt.plot(x, positive_gradient(x),label="+ve Grad.")
plt.plot(x, negative_gradient(x),label="-ve Grad.")
plt.xlabel('Relative word position')
plt.legend()
plt.show()

In [None]:
evaluate(modifier=parabola)

In [None]:
evaluate(modifier=positive_gradient)

Some reviews get straight to the point, such as `pos/17_8.txt`:
 > Brilliant and moving performances by Tom Courtenay and Peter Finch.

In [None]:
evaluate(modifier=negative_gradient)

Adding weights for the positions of words within the reviews doesn't appear to significantly improve performance, and I
haven't experimented enough to warrant keeping this technique within the solution.

As using Potts' list appears to be very effective, I wanted to try to recreate the list, using the training data.

This code iterates through each word in every review and assigns the word a portion of the rating: if a review is 10
stars, and it contains 2 words, these 2 words are clearly (and assumed equally) positive words, and should be considered
more significant than a word that appears once in a long review.

In [None]:
import json
import re

regex = re.compile('[^a-zA-Z]')

def clean(x):
    return regex.sub(' ', x).lower()

def make_er(train_data, ratings, use_cache=False):
    if use_cache:
        with open('results.txt', 'r') as f:
            return json.load(f)

    vocab_ratings = {}
    vocab_occurrences = {}
    porter = nltk.PorterStemmer()
    n_reviews = len(train_data)
    stop_words = set(stopwords.words('english'))
    for i, rev in enumerate(train_data):
        words = [word for word in nltk.word_tokenize(clean(rev)) if word not in stop_words]
        stems = [porter.stem(word) for word in words]
        n_words = len(words)
        # Convert (1 to 10) to (-5 to 5)
        rating = ratings[i]
        if rating > 5:
            rating -= 5
        else:
            rating -= 6
        rel_rating = rating / n_words
        for word in stems:
            vocab_ratings[word] = vocab_ratings.get(word, 0) + rel_rating
            vocab_occurrences[word] = vocab_occurrences.get(word, 0) + 1
        if i % 1000 == 0:
            print(f"{i}/{n_reviews}...")
    vocab_ratings = {word: sum_ / vocab_occurrences[word] for word, sum_ in vocab_ratings.items()}
    with open('results.txt', 'w') as f:
        f.write(json.dumps(vocab_ratings))

    return vocab_ratings

Unfortunately, this algorithm gives poor results, which are only marginally better than guessing randomly.


In [None]:
print("Generating ERs...")
# expected_ratings_dict = make_er(pos_reviews + neg_reviews, pos_ratings + neg_ratings)
expected_ratings_dict = make_er([], [], use_cache=True)

positive_reviews = True
to_test = test_pos_reviews if positive_reviews else test_neg_reviews
def evaluate():
    porter = nltk.PorterStemmer()
    n_correct = 0
    n_reviews = len(to_test)
    for i, rev in enumerate(to_test):
        words = nltk.word_tokenize(clean(rev))
        stems = [porter.stem(word) for word in words]
        sum_expected_rating = 0
        n_words = len(words)
        for j, word in enumerate(stems):
            expected_rating = expected_ratings_dict.get(word, 0)
            sum_expected_rating += expected_rating

        if (sum_expected_rating > 0) == positive_reviews:
            n_correct += 1

        if i % 1000 == 0 and i != 0:
            print(f"Processed: {i}/{n_reviews}...")
            print(f"Accuracy: {n_correct}/{i} ({n_correct / i})...\n")

    print("Done evaluating.")
    print(f"Accuracy: {n_correct}/{len(to_test)} ({n_correct / len(to_test)})")

    print("Evaluating...")
    evaluate()

So, we must try another method.

After researching popular sentiment analysis methods on the internet, I learnt about BoW and TF-IDF.

## BoW
Bag of Words (BoW) is a vectorisation technique that allows us to store a document in terms of the presence (or absence)
of each word in the global vocabulary.

If our global vocabulary is "the weather is good bad",
we can store "the weather is good" as [1, 1, 1, 1, 0], and "the weather is bad" as [1, 1, 1, 0, 1].

This makes it easier for Machine Learning models to work with the data, as the actual word isn't necessary.

## TF-IDF
TF-IDF means "Term Frequency - Inverse Document Frequency" and is a weight that signifies how important a word is in the
corpus.

TF is how frequently a term occurs in a given document.

`TF(w) = number of occurences of w / number of words in document`

IDF signifies how important a word is. Stop words like "and", "is" and "the" will have a very high
term frequency, but they aren't significant in sentiment analysis. The IDF is low for common terms, and high
for rare ones.

`IDF(w) = log(Number of documents / Number of documents with w in)`

SciKit Learn comes with a built-in TF-IDF vectoriser, but to fully understand the process, I implemented it myself.

In [92]:
import math

ps = nltk.PorterStemmer()
sw = set(stopwords.words())
html_tag = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def make_tfidf(reviews, vocab):
    # Build TF matrix
    tf = np.ndarray(shape=(len(reviews), len(vocab)))
    idf = []
    for i, review in enumerate(reviews):
        clean_review = re.sub(html_tag, '', review)
        tokens = nltk.word_tokenize(clean_review)
        stems = [ps.stem(token, to_lowercase=True) for token in tokens if token not in sw]

        tf[i] = [stems.count(stem) for stem in sorted(vocab)]

    n_docs = len(reviews)
    # Calculate Inverse Document Frequencies
    for i, word in enumerate(sorted(vocab)):
        if word not in vocab or word == '<UNK>':
            idf.append(0)
            continue
        docs_with_word = 0
        for row in tf:
            if row[i] > 0:
                docs_with_word += 1

        if docs_with_word == 0:
            idf.append(0)
            continue

        x = n_docs / docs_with_word
        assert x >= 1, word

        idf.append(math.log(x))
    return tf * idf

In [93]:
import re
reviews = pos_reviews + neg_reviews
test_reviews = test_pos_reviews + test_neg_reviews

ratings = pos_ratings + neg_ratings
test_ratings = test_pos_ratings + test_neg_ratings

print("Collecting vocab...")
# https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
all_words = []
for i, review in enumerate(reviews):
    # Remove HTML tags
    clean_review = re.sub(html_tag, '', review)
    tokens = nltk.word_tokenize(clean_review)
    stems = [ps.stem(token, to_lowercase=True) for token in tokens if token not in sw]

    # remove all punctuation
    stems = [x for x in ["".join(c for c in s if c not in string.punctuation) for s in stems] if x]
    all_words += stems

vocab = Vocabulary(all_words, unk_cutoff=5)

Collecting vocab...


In [94]:
print("Making TFIDF for training data...")
tfidf = make_tfidf(reviews, vocab)
print("Making TFIDF for testing data...")
tfidf_test = make_tfidf(test_reviews, vocab)

Making TFIDF for training data...
Making TFIDF for testing data...


We can then use Multinomial Naive Bayes (MNB) from scikit learn to create a model for our reviews.
Gaussian Naive Bayes (GNB) assumes the vectors are continuous (such as temperature and time),
but our vectors are integer counts, so we will use MNB, which is designed for counts or relative frequency.

In [95]:
mnb = MultinomialNB()
sentiment = ["pos" if rating > 5 else "neg" for rating in ratings]

print("Fitting model...")
model = mnb.fit(tfidf, sentiment)

Fitting model...


In [96]:
print("Testing model...")
results = mnb.predict(tfidf_test)

n_correct = 0
for i, rating in enumerate(test_ratings):
    actual = "pos" if rating > 5 else "neg"
    predict = results[i]
    if actual == predict:
        n_correct += 1

print(f"{n_correct} correct out of {len(results)}. ({n_correct/len(results)})")

Testing model...
148 correct out of 200. (0.74)


This implementation gives similar results to the Potts 2011 implementation!

I believe my implementation of TFIDF is quite inefficient, so instead let's use the SciKit Learn vectoriser.

the sklearn implementation doesn't clean the data as it goes like mine, so we will clean the reviews before vectorising them.

In [113]:
def clean_reviews(reviews_):
    clean_reviews = []
    for review in reviews_:
        clean_review = re.sub(html_tag, '', review)
        tokens = nltk.word_tokenize(clean_review)
        stems = [ps.stem(token, to_lowercase=True) for token in tokens if token not in sw]
        clean_reviews.append(" ".join(stems))
    return clean_reviews

In [143]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# To access the global vocabulary, we produce tfidfs of training and testing together.
all_tfidf = vectorizer.fit_transform(clean_reviews(reviews+test_reviews)).toarray()

# We can use either method to split, but as we're comparing to my implementation from earlier, lets use the same data for train / test
# X_train, X_test, y_train, y_test = train_test_split(all_tfidf, (ratings+test_ratings), test_size=0.1, random_state=0)
X_train, X_test = all_tfidf[:len(reviews)], all_tfidf[len(reviews):]
y_train, y_test = ratings, test_ratings

print(vectorizer.get_feature_names_out()[500:550])
print(f"{len(X_train)=}, {len(y_train)=}")
print(f"{len(X_test)=}, {len(y_test)=}")

['aditya' 'adjac' 'adjani' 'adjoin' 'adjunct' 'adjust' 'adjut' 'administ'
 'administr' 'admir' 'admiss' 'admit' 'admitt' 'admittedli' 'adolesc'
 'adolf' 'adopt' 'ador' 'adorn' 'adrenalin' 'adrian' 'adriana' 'adrienn'
 'adul' 'adult' 'adulter' 'adulteri' 'adulthood' 'advanc' 'advani'
 'advantag' 'advantage' 'advent' 'adventist' 'adventur' 'adventuresom'
 'advers' 'adversari' 'advert' 'adverter' 'advertis' 'advertising' 'advic'
 'advis' 'advoc' 'aeon' 'aerial' 'aerodynam' 'aeryn' 'aesthet']
len(X_train)=2000, len(y_train)=2000
len(X_test)=200, len(y_test)=200


In [146]:
mnb = MultinomialNB()

print("Fitting model...")
model = mnb.fit(X_train, y_train)
print("Done")

Fitting model...
Done


In [145]:
print("Testing model...")
y_pred = mnb.predict(X_test)

n_correct = 0
for i, actual in enumerate(y_test):
    predict = y_pred[i]
    if (actual > 5) == (predict > 5):
        n_correct += 1

print(f"{n_correct} correct out of {len(y_test)}. ({n_correct/len(y_test)})")

Testing model...
140 correct out of 200. (0.7)


Using SciKit Learn's Vectorizer gives comparable results to my own implementation (and is a lot faster at calculating the
TF-IDFs!)

## Conslusion

In this report, we have explored the data set, and analysed it for possible features that could be effective
in determining the sentiment. We have implemented a baseline model that uses other research from others, and experimented
with a possible improvement, which didn't add any significant improvement. We then attempted to recreate the dataset
that we used with a naive method, but couldn't produce any useful results.

We then explored a further approach, using a different analysis method, TFIDF, and wrote an implementation that
successfully classifies the reviews as positive or negative, the majority of the time. We then compared our
implementation to another, de-facto implementation.