# Sentiment Analysis of Movie Reviews

## Overview

This dataset contains movie reviews along with their associated binary sentiment polarity labels.

## Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb.
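
As a small illustration of the naming convention, the sketch below splits a review path into its id and rating. The helper name `parse_review_filename` is my own and is not part of the dataset or any library; it assumes every file name has exactly the `[id]_[rating].txt` shape described above.

```python
import os

def parse_review_filename(path):
    """Split a review path such as 'test/pos/200_8.txt' into (id, rating)."""
    # Drop the directories and the .txt extension, leaving e.g. '200_8'.
    stem, _ = os.path.splitext(os.path.basename(path))
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

review_id, rating = parse_review_filename("test/pos/200_8.txt")
print(review_id, rating)  # 200 8
```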

## Data Loading

The data is spread across many files in two folders:
    
    - train
    - test
    
In each of these folders data is further split based on sentiment:

    - pos
    - neg
    
Each of these folders holds one file per review, and the file names follow this format: <index>_<rating>.txt
    
Since we only care whether a review has positive or negative sentiment, and not (at this point) what the actual rating is, I will load the review texts into one list and create a second, parallel list that holds the polarity of each review (1 for positive, 0 for negative).
    
This will therefore be a binary classification problem.

In [1]:
import os

train_set = []
train_labels = []
test_set = []
test_labels = []

def data_sentiment_split(path_to_files, train=True):
    """Read every review in a folder and append its text and label to the train or test lists."""
    reviews, labels = (train_set, train_labels) if train else (test_set, test_labels)

    for filename in os.listdir(path_to_files):
        # File names look like <index>_<rating>.txt; pull out the rating.
        rating = int(filename.split("_")[1].split(".")[0])
        with open(os.path.join(path_to_files, filename), 'r', encoding='utf-8') as f:
            file_line = f.readline()

        reviews.append(file_line.strip())
        # Ratings of 7 or more count as positive (1); everything else is negative (0).
        labels.append(1 if rating >= 7 else 0)

data_sentiment_split('../capstone_01_movie_reviews/aclImdb/train/pos/')
data_sentiment_split('../capstone_01_movie_reviews/aclImdb/train/neg/')
data_sentiment_split('../capstone_01_movie_reviews/aclImdb/test/pos/', train=False)
data_sentiment_split('../capstone_01_movie_reviews/aclImdb/test/neg/', train=False)
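
A quick sanity check after loading is to count how many labels of each class ended up in a list. The sketch below uses a hypothetical helper, `summarize_labels`, and a toy list in place of the real `train_labels` and `test_labels`.

```python
from collections import Counter

def summarize_labels(labels):
    """Return the number of negative (0) and positive (1) labels."""
    counts = Counter(labels)
    return {"neg": counts[0], "pos": counts[1]}

# Toy labels standing in for train_labels / test_labels.
example_labels = [1, 1, 0, 0, 1, 0]
print(summarize_labels(example_labels))  # {'neg': 3, 'pos': 3}
```

Given the balanced split described in the Dataset section, I would expect the two counts to match when this is run on `train_labels` or `test_labels`.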

## Cleaning the Data

Once loaded, the data looks pretty messy. For example:

In [2]:
print(train_set[100])

I was prepared for a turgid talky soap opera cum travelogue, but was pleased to find a fast-paced script, an underlying moral, excellent portrayals from all the actors, especially Peter Finch, amazing special effects, suspense, and beautiful cinematography--there's even a shot of the majestic stone Buddhas recently destroyed by the Taliban. Not to mention Elizabeth Taylor at her most gloriously beautiful and sympathetic, before she gave in to the gaspy hysterics that marred her later work. All the supporting players round it out, and I do wonder who trained all those elephants.<br /><br />Speaking of the stone-Buddha sequence, you really can discern that it's Vivien Leigh in the long shots. Her shape and the way she moves is distinct from Taylor's. The only thing marring that sequence are the poorly done process shots, where the background moves by much too fast for horses at a walk.<br /><br />If you want a thought-provoking film that is beautiful to watch and never boring, spend a fe

As we can see, the text contains several \<br /\> tags. It also contains other elements, such as punctuation and digits, that we need to get rid of before doing anything else with the data.

In [3]:
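import re

# Compile both patterns once, as raw strings, rather than on every review.
# Punctuation and digits are deleted outright.
PUNCT_AND_DIGITS = re.compile(r"[.;:!'?,\"()\[\]]|\d+")
# Doubled <br /> tags, hyphens and slashes are replaced with a space.
BREAKS_AND_SEPARATORS = re.compile(r"(<br\s*/><br\s*/>)|-|/")

def remove_punc_tags(review_list):
    """Lowercase each review in place, then apply the two patterns above."""
    for i, review in enumerate(review_list):
        review = PUNCT_AND_DIGITS.sub("", review.lower())
        review = BREAKS_AND_SEPARATORS.sub(" ", review)
        review_list[i] = review
    return review_list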

train_set = remove_punc_tags(train_set)
test_set = remove_punc_tags(test_set)

print(train_set[100])

i was prepared for a turgid talky soap opera cum travelogue but was pleased to find a fast paced script an underlying moral excellent portrayals from all the actors especially peter finch amazing special effects suspense and beautiful cinematography  theres even a shot of the majestic stone buddhas recently destroyed by the taliban not to mention elizabeth taylor at her most gloriously beautiful and sympathetic before she gave in to the gaspy hysterics that marred her later work all the supporting players round it out and i do wonder who trained all those elephants speaking of the stone buddha sequence you really can discern that its vivien leigh in the long shots her shape and the way she moves is distinct from taylors the only thing marring that sequence are the poorly done process shots where the background moves by much too fast for horses at a walk if you want a thought provoking film that is beautiful to watch and never boring spend a few hours with elephant walk
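
The cleaned review still contains doubled spaces (for example, between "cinematography" and "theres", where the original text had "--"). One possible follow-up step, which is not part of the cleaning above, is to collapse runs of whitespace. The helper name `collapse_whitespace` is my own.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "beautiful cinematography  theres even a shot "
print(collapse_whitespace(sample))  # beautiful cinematography theres even a shot
```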


Below I extract all the unique words used in these reviews and count how often each one appears. I plan to use these counts to build a word cloud later.

In [4]:
unique_words = {}
for review in train_set:
    # split() with no argument skips the empty strings left by repeated spaces
    for word in review.split():
        unique_words[word] = unique_words.get(word, 0) + 1
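
Before building the word cloud, it can help to look at the most frequent words directly. The sketch below wraps a frequency dictionary like `unique_words` in a `collections.Counter`. The helper `top_words` and the toy counts are illustrative only.

```python
from collections import Counter

def top_words(word_counts, n=10):
    """Return the n most frequent (word, count) pairs from a frequency dict."""
    return Counter(word_counts).most_common(n)

# Toy frequencies standing in for unique_words.
toy_counts = {"the": 5, "movie": 3, "was": 2, "great": 1}
print(top_words(toy_counts, n=2))  # [('the', 5), ('movie', 3)]
```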