| Name (Last, First) | Student ID | Section contributed  | Section edited      | Other contributions   |
|--------------------|------------|----------------------|---------------------|-----------------------|
| Hawlader, Antanila | 301332035  | Researched the code  | Found datasets      | Provided functions    |
| Long, Jiang        | 200099436  | Input the code       | Edited all sections | Input functions       |
| Savkovic, Sava     | 301397121  | Researched the code  | Researched reviews  | Reviewed all sections |

In [None]:
import csv
import itertools
import math
import nltk

The solution to the assignment is based mostly on the DataCamp tutorial [Python sentiment analysis tutorial](https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python) by Sayak Paul (Paul, 2021). The tutorial takes a "bag-of-words" approach, using supervised machine learning with a Naive Bayes classifier. We adapted it to perform sentiment analysis on a large set of Rotten Tomatoes reviews, downloaded as a CSV file from the Kaggle [Rotten Tomatoes movies and critic reviews dataset](https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset) (Leone, 2020).

We truncated the CSV file to under 100 MB to stay within GitHub's file size limit. The truncated CSV contains 485,212 movie reviews from Rotten Tomatoes, each labeled 'Fresh' (positive) or 'Rotten' (negative). We work with only the first 128,000 reviews so that processing does not take too long.

In [None]:
max_documents = 128000 # 'None' means no limit. Tokenizing all 485,212 reviews will take a LONG TIME

file_path = 'rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_critic_reviews.csv'

# Open the file for reading; the reader is consumed in the next cell
f = open(file_path, encoding='UTF8')
csv_reader = csv.DictReader(f)

We read the CSV file row by row and store each usable record as a tuple. The first element is the tokenized review text; the second is either 'pos' for positive ('Fresh') reviews or 'neg' for negative ('Rotten') reviews. Rows without a 'Fresh' or 'Rotten' label are skipped.

In [1]:
documents = []
documents_text = ''

for line in itertools.islice(csv_reader, 0, max_documents):
    # For some reason '10,000' shows up as an "informative feature", so we skip reviews that contain it
    if "10,000" in line['review_content']:
        continue
    # Map the dataset's labels to 'pos' / 'neg'; skip rows with any other (or empty) label
    if line['review_type'] == 'Fresh':
        sentiment = 'pos'
    elif line['review_type'] == 'Rotten':
        sentiment = 'neg'
    else:
        continue
    tokens = nltk.word_tokenize(line['review_content'])
    documents_text = documents_text + line['review_content']
    documents.append((tokens, sentiment))
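
As a minimal sketch of the record structure built above, the snippet below runs the same labelling logic on a tiny in-memory CSV. The sample rows and the names `sample_csv`, `label_map`, and `sample_documents` are made up for illustration, and `str.split` stands in for `nltk.word_tokenize` so the snippet runs without NLTK; the real tokenizer also separates punctuation from words.

```python
import csv
import io

# Hypothetical sample rows using the same column names as the dataset.
sample_csv = io.StringIO(
    "review_type,review_content\n"
    "Fresh,A superbly acted gem\n"
    "Rotten,Painfully unfunny\n"
    ",No label on this row\n"
)

label_map = {"Fresh": "pos", "Rotten": "neg"}

sample_documents = []
for row in csv.DictReader(sample_csv):
    sentiment = label_map.get(row["review_type"])
    if sentiment is None:
        continue  # skip rows without a usable label
    tokens = row["review_content"].split()  # stand-in for nltk.word_tokenize
    sample_documents.append((tokens, sentiment))

print(sample_documents)
```

Each entry is a `(tokens, label)` pair, which is the shape the later feature-extraction step expects.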

We also concatenate all the reviews we looked at into a single long string, `documents_text`, and tokenize it into `documents_tokens`.

In [None]:
documents_tokens = nltk.word_tokenize(documents_text)
print(f"Using {len(documents)} documents.")

Using 127959 documents.

Next, we define the feature extractor and build the training and test sets that the classifier uses to decide whether a review is positive or negative. We will also look at the classifier's accuracy and at the word features it finds most informative.

Assigning each review to a class ('pos' or 'neg') is known as document classification.

As features, we use the 2,000 most frequent words across both positive and negative reviews. To find them, we build a frequency distribution over all the tokens.

In [4]:
max_word_features = 2000

all_words = nltk.FreqDist(w.lower() for w in documents_tokens)
word_features = list(all_words)[:max_word_features]
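
The frequency-distribution step can be sketched with the standard library alone. This is an illustration under assumptions: `collections.Counter` stands in for `nltk.FreqDist`, and the token list and the `toy_` names are made up.

```python
from collections import Counter

# Hypothetical tokens from a single short review.
toy_tokens = ["The", "plot", "is", "thin", "but", "the", "cast", "is", "the", "highlight"]
toy_max_word_features = 2

# Count lowercased tokens, then keep the most frequent ones as the vocabulary.
toy_counts = Counter(w.lower() for w in toy_tokens)
toy_word_features = [w for w, _ in toy_counts.most_common(toy_max_word_features)]

print(toy_word_features)  # ['the', 'is']
```

As the toy result suggests, the most frequent tokens tend to be function words, so many of the top-ranked features carry little sentiment on their own.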

We define a feature extractor, `document_features`, that records for each word in `word_features` whether it occurs in the document's set of tokens, `document_words`.

In [5]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
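
To make the output of the feature extractor concrete, here is a small sketch that applies the same logic to one made-up review. `toy_document_features` is a hypothetical variant that takes the vocabulary as a parameter so the snippet is self-contained; the vocabulary and review are invented.

```python
def toy_document_features(document, vocabulary):
    """Return a 'contains(word)' flag for every word in the vocabulary."""
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in vocabulary}

toy_vocabulary = ["the", "gem", "unfunny"]  # lowercased, like word_features
toy_review = ["The", "film", "is", "a", "gem"]

toy_features = toy_document_features(toy_review, toy_vocabulary)
print(toy_features)
# {'contains(the)': False, 'contains(gem)': True, 'contains(unfunny)': False}

# Lowercasing the tokens first makes the capitalized 'The' match as well.
toy_features_lower = toy_document_features([w.lower() for w in toy_review], toy_vocabulary)
print(toy_features_lower)
# {'contains(the)': True, 'contains(gem)': True, 'contains(unfunny)': False}
```

The membership test is case-sensitive, so a capitalized token matches a lowercased vocabulary entry only if the tokens are lowercased before the lookup.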

We split the corpus into a training set and a test set.

The classifier learns from labeled examples in a process called training. We train it on the training set and evaluate it on the test set.

The training set is used to estimate how strongly each feature is associated with each label.

The test set is held out for the final evaluation; the classifier is not trained on it.

In [6]:
train_set_ratio = 0.9 # 0.9 means that 90% of the data from the corpus is used for training, 10% is for testing

train_set_size = math.floor(len(documents) * train_set_ratio)

train_set, test_set = featuresets[:train_set_size], featuresets[train_set_size:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
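
The split sizes follow directly from the ratio. The sketch below repeats the arithmetic on placeholder data of the same length as the corpus; `split_train_test` and `placeholder_items` are hypothetical names used only for this illustration.

```python
import math

def split_train_test(items, train_ratio):
    """Split a list into a leading training slice and a trailing test slice."""
    train_size = math.floor(len(items) * train_ratio)
    return items[:train_size], items[train_size:]

# Placeholder items standing in for the 127,959 (features, label) pairs.
placeholder_items = list(range(127959))
train_part, test_part = split_train_test(placeholder_items, 0.9)

print(len(train_part), len(test_part))  # 115163 12796
```

Because the split is positional, the test set is simply the last 10% of the reviews in file order.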

We then measure the classifier's accuracy on the test set and show the features that Naive Bayes found most informative.

In [7]:
print(nltk.classify.accuracy(classifier, test_set))
 
classifier.show_most_informative_features(5)

0.7271803688652704
Most Informative Features
       contains(unfunny) = True              neg : pos    =     67.2 : 1.0
      contains(superbly) = True              pos : neg    =     28.3 : 1.0
        contains(deftly) = True              pos : neg    =     24.9 : 1.0
  contains(refreshingly) = True              pos : neg    =     21.1 : 1.0
           contains(gem) = True              pos : neg    =     16.5 : 1.0
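
Each line of the listing is a likelihood ratio: `neg : pos = 67.2 : 1.0` says that `contains(unfunny) = True` was about 67 times more likely among negative training reviews than among positive ones. The sketch below computes such a ratio from raw counts. The counts are invented for illustration, and the add-0.5 smoothing is an assumption about the estimator, so the numbers are not meant to reproduce the output above.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: reviews per label, and how many of them contain the word.
neg_total, neg_with_word = 1000, 40
pos_total, pos_with_word = 1000, 2

p_neg = smoothed_probability(neg_with_word, neg_total)
p_pos = smoothed_probability(pos_with_word, pos_total)
ratio = p_neg / p_pos

print(f"neg : pos = {ratio:.1f} : 1.0")  # neg : pos = 16.2 : 1.0
```

Rare words that appear almost exclusively under one label produce the largest ratios, which is why the listing is dominated by words such as "unfunny" and "superbly" rather than by the most frequent words.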


## References

Paul, S. (2021, May 17). Python sentiment analysis tutorial. DataCamp Community. Retrieved March 10, 2022, from https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python

Leone, S. (2020, November 4). Rotten Tomatoes movies and critic reviews dataset. Kaggle. Retrieved March 10, 2022, from https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset