| Name (Last, First) | Student ID | Section contributed  | Section edited      | Other contributions   |
|--------------------|------------|----------------------|---------------------|-----------------------|
| Hawlader, Antanila | 301332035  | Researched the codes | finding datasets    | provided functions    |
| Long, Jiang        | 200099436  | Inputted the codes   | edited all sections | inputted functions    |
| Savkovic, Sava     | 301397121  | Research codes       | researched reviews  | reviewed all sections |

In [1]:
import csv
import itertools
import math
import nltk

Our solution is based largely on the DataCamp tutorial [Python sentiment analysis tutorial](https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python) by Sayak Paul (Paul, 2021). The tutorial takes a "bag-of-words" approach, using supervised machine learning with a Naive Bayes classifier. We adapted it to perform sentiment analysis on a large set of Rotten Tomatoes reviews, downloaded as a CSV file from the Kaggle [Rotten Tomatoes movies and critic reviews dataset](https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset) (Leone, 2020).

We truncated the CSV file to under 100 MB so that it stays within GitHub's file size limit. The truncated CSV contains 485,212 movie reviews from Rotten Tomatoes, each labeled 'Fresh' (positive) or 'Rotten' (negative). To keep the running time manageable, we work only with the first 128,000 reviews, so we set `max_documents` to `128000`.

In [2]:
max_documents = 128000  # None (the value, not the string) means no limit; tokenizing all 485,212 reviews takes a long time

file_path = 'rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_critic_reviews.csv'

# Open the CSV file for reading; DictReader yields one dict per row, keyed by column name
f = open(file_path, encoding='UTF8')
csv_reader = csv.DictReader(f)

We read the CSV file row by row. For each row, we store the record as a tuple: the first element is the tokenized review text, and the second element is either 'pos' for positive ('Fresh') reviews or 'neg' for negative ('Rotten') reviews.

In [3]:
documents = []
documents_text = ''

for line in itertools.islice(csv_reader, 0, max_documents):
    # For some reason '10,000' shows up as an "informative feature", so we skip reviews that contain it
    if "10,000" in line['review_content']:
        continue
    if(line['review_type']):
        sentiment = False
        if line['review_type'] == 'Fresh':
            sentiment = 'pos'
        if line['review_type'] == 'Rotten':
            sentiment = 'neg'
        if(sentiment):
            tokens = nltk.word_tokenize(line['review_content'])
            documents_text = documents_text + line['review_content']
            documents.append((tokens, sentiment))
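
To make the record layout concrete, here is a minimal sketch of the same loop run on two made-up rows. The sample rows and the `LABELS` mapping are invented for illustration, and `str.split` stands in for `nltk.word_tokenize` so that the snippet runs without NLTK data (the real tokenizer also separates punctuation).

```python
import csv
import io

# Two made-up rows that reuse the column names of the Kaggle CSV (illustrative only).
sample_csv = io.StringIO(
    "review_type,review_content\n"
    "Fresh,A deftly told gem\n"
    "Rotten,Painfully unfunny\n"
)

# Hypothetical helper mapping: dataset label -> class name used by the classifier.
LABELS = {"Fresh": "pos", "Rotten": "neg"}

sample_documents = []
for row in csv.DictReader(sample_csv):
    label = LABELS.get(row["review_type"])
    if label is None:
        continue  # skip rows with a missing or unexpected review_type
    tokens = row["review_content"].split()  # stand-in for nltk.word_tokenize
    sample_documents.append((tokens, label))

print(sample_documents)
```

Each entry is a `(tokens, label)` tuple, which is the shape the later feature-extraction step expects.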

At the same time, the loop above concatenates every review it processes into a single long string, `documents_text`, which we now tokenize into `documents_tokens`.

In [4]:
documents_tokens = nltk.word_tokenize(documents_text)

print(f"Using {len(documents)} documents.")

Using 127959 documents.


Next, we retrieve the 2,000 most frequently occurring words in the reviews we processed. These words are used for feature extraction. We set `max_word_features` to `2000` so that only the 2,000 most common words are considered.

In [5]:
max_word_features = 2000

all_words = nltk.FreqDist(w.lower() for w in documents_tokens)

word_features = list(all_words)[:max_word_features]
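
As a small illustration of selecting the most frequent words, the sketch below uses `collections.Counter` on a made-up token list. In NLTK 3, `nltk.FreqDist` is a subclass of `Counter`, so we assume that calling `most_common` on `all_words` would behave the same way; the toy tokens and variable names here are hypothetical.

```python
from collections import Counter

# Made-up tokens; in the notebook these would come from documents_tokens.
toy_tokens = ["The", "film", "is", "a", "gem", ",", "the", "cast", "is", "superb", "the"]

# Count lowercased tokens, as the cell above does with nltk.FreqDist.
toy_counts = Counter(w.lower() for w in toy_tokens)

# Keep only the words themselves from the (word, count) pairs.
top_words = [word for word, _ in toy_counts.most_common(2)]
print(top_words)  # ['the', 'is']
```

Asking for the ranking explicitly with `most_common` makes the "most frequent" intent visible in the code, instead of depending on the iteration order of the frequency distribution.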

We then use these 2,000 words to extract features. For each review (`document`), we add a boolean feature `contains(WORD)` for each of the 2,000 words: `contains(the)`, `contains(a)`, `contains(it)`, and so on. Each review therefore has 2,000 features. We store the results in a list of tuples: the first element of each tuple is a dictionary holding the 2,000 features, and the second element is the class (`pos` or `neg`).

In [6]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
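
The following sketch shows what one feature dictionary looks like, using a three-word vocabulary instead of 2,000 words. The vocabulary, the sample tokens, and the helper name `toy_document_features` are made up for illustration.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_word_features = ["the", "gem", "unfunny"]

def toy_document_features(document, vocabulary):
    """Return one boolean 'contains(word)' feature per vocabulary word."""
    document_words = set(document)  # set membership tests are fast
    return {f"contains({word})": (word in document_words) for word in vocabulary}

toy_features = toy_document_features(["a", "deftly", "told", "gem"], toy_word_features)
print(toy_features)
# {'contains(the)': False, 'contains(gem)': True, 'contains(unfunny)': False}
```

One thing to watch in the cell above: `word_features` is built from lowercased tokens, while the tokens in `documents` keep their original casing, so a capitalized token such as `The` does not match `contains(the)`.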

We split our corpus into a training set and a test set. The training set contains the first 90% of the reviews, and the test set contains the remaining 10%.

We train a Naive Bayes `classifier` on the training set and evaluate it on the test set.

In [7]:
train_set_ratio = 0.9 # 0.9 means that 90% of the data from the corpus is used for training, 10% is for testing

train_set_size = math.floor(len(documents) * train_set_ratio)

train_set, test_set = featuresets[:train_set_size], featuresets[train_set_size:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
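
As a quick check of the split arithmetic, the sketch below recomputes the set sizes for the 127,959 documents used in this run and applies the same slicing to a toy list. The variable names are ours, chosen for illustration.

```python
import math

n_documents = 127959  # number of documents reported for this run
train_ratio = 0.9

# Same computation as train_set_size in the cell above.
n_train = math.floor(n_documents * train_ratio)
n_test = n_documents - n_train
print(n_train, n_test)  # 115163 12796

# The same slicing on a toy list: the first 90% train, the rest test.
toy = list(range(10))
toy_train_size = math.floor(len(toy) * train_ratio)
toy_train, toy_test = toy[:toy_train_size], toy[toy_train_size:]
print(toy_train, toy_test)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] [9]
```

Because the split is a plain slice, the test set is always the last 10% of the rows in file order.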

We now test the classifier. We tried different values for parameters including `max_documents` and `max_word_features`. As we feed more training data to the model, the accuracy settles in the range of roughly 70% to 73%:

| Documents used  | Training set size | Test set size | `max_word_features` | Accuracy     |
|-----------------|-------------|----------|-------------|----------------------|
|   963           |   866       |   97     |   2000      |   0.701030927835052  |
|   1963          |   1766      |   197    |   2000      |   0.644670050761421  |
|   3963          |   3566      |   397    |   2000      |   0.680100755667506  |
|   7963          |   7166      |   797    |   2000      |   0.713927227101631  |
|   15962         |   14365     |   1597   |   2000      |   0.731371321227301  |
|   31962         |   28765     |   3197   |   2000      |   0.717547700969659  |
|   63961         |   57564     |   6397   |   2000      |   0.705486947006409  |
|   127959        |   115163    |   12796  |   2000      |   0.72718036886527   |

In this run we used `max_documents = 128000` and `max_word_features = 2000`, which gave an accuracy of 72.7%.

In [8]:
print(nltk.classify.accuracy(classifier, test_set))

0.7271803688652704
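
Accuracy here is the fraction of test reviews whose predicted label matches the true label. The sketch below computes that fraction by hand on made-up labels; the helper `simple_accuracy` is our own illustration, not NLTK's implementation.

```python
def simple_accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Made-up labels: three of the four predictions are correct.
gold_labels = ["pos", "neg", "pos", "neg"]
predicted_labels = ["pos", "neg", "neg", "neg"]
print(simple_accuracy(predicted_labels, gold_labels))  # 0.75

# The reported accuracy corresponds to roughly this many correct test reviews.
approx_correct = round(0.7271803688652704 * 12796)
print(approx_correct)  # 9305
```

Read this way, the reported score means that about 9,305 of the 12,796 test reviews were classified correctly.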


Finally, we show the most informative features, that is, the features that most strongly affect the classifier's outcome.

In [9]:
classifier.show_most_informative_features(5)

Most Informative Features
       contains(unfunny) = True              neg : pos    =     67.2 : 1.0
      contains(superbly) = True              pos : neg    =     28.3 : 1.0
        contains(deftly) = True              pos : neg    =     24.9 : 1.0
  contains(refreshingly) = True              pos : neg    =     21.1 : 1.0
           contains(gem) = True              pos : neg    =     16.5 : 1.0
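
Each ratio in the output compares how likely a feature value is under one class versus the other. The sketch below shows the idea with invented counts and no smoothing. NLTK estimates these probabilities with smoothing, so its figures will not match this simplified calculation, and the helper `likelihood_ratio` is hypothetical.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | class a) to P(feature | class b), without smoothing."""
    return (count_a / total_a) / (count_b / total_b)

# Invented counts: a word appears in 30 of 1,000 negative reviews
# and in 2 of 1,000 positive reviews.
ratio = likelihood_ratio(30, 1000, 2, 1000)
print(f"neg : pos = {ratio:.1f} : 1.0")
```

A large ratio means that seeing the word shifts the prediction strongly toward the first class listed, which is why words such as `unfunny` rank highly for `neg`.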


## References

Paul, S. (2021, May 17). Python sentiment analysis tutorial. DataCamp Community. Retrieved March 10, 2022, from https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python

Leone, S. (2020, November 4). Rotten Tomatoes movies and critic reviews dataset. Kaggle. Retrieved March 10, 2022, from https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset