## Description

This notebook shows how to download the IMDB movie reviews dataset, preprocess the review texts, and build a simple logistic regression classifier that predicts the sentiment of a review.

## Imports

In [1]:
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
from sklearn.metrics import classification_report

## Download IMDB movie reviews dataset

Run this cell only if you have not yet downloaded the IMDB reviews dataset from ai.stanford.edu.

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!gunzip aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar
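
On systems without `wget`, `gunzip`, or `tar`, the same steps can be sketched with the Python standard library. This is only an illustration: the URL is the one from the cell above, while the helper names `extract_archive` and `download_and_extract` and the local archive name are assumptions introduced here, not part of the original notebook.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the shell cell above.
DATA_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def extract_archive(archive_path, dest_dir="."):
    """Unpack a tar archive (gzip-compressed or not) into dest_dir."""
    # "r:*" lets tarfile detect the compression transparently
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(dest_dir)


def download_and_extract(url=DATA_URL, archive_name="aclImdb_v1.tar.gz", dest_dir="."):
    """Download the archive unless it is already present, then unpack it."""
    archive_path = Path(dest_dir) / archive_name
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    extract_archive(archive_path, dest_dir)
    # the archive is expected to unpack into an aclImdb folder
    return Path(dest_dir) / "aclImdb"
```

Calling `download_and_extract()` once should leave an `aclImdb` folder in the working directory, which is the folder name the loading code expects.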

## Load the texts into a dataframe

In [2]:
def load_train_test_data(folder_name='aclImdb'):
    """
    Load the data into pandas DataFrames.
    
    :param folder_name: name of directory with movie reviews in a string format
    :return: pd.DataFrames with training and testing texts and sentiments
    """
    data = {}
    for mode in ['train', 'test']:
        data[mode] = []
        for sent in ['neg', 'pos']:
            # encode the label: 0 = negative review, 1 = positive review
            sent_score = 0 if sent == 'neg' else 1
            path = os.path.join(folder_name, mode, sent)
            file_names = os.listdir(path)
            for f_name in file_names:
                # read as UTF-8 explicitly instead of relying on the platform default
                with open(os.path.join(path, f_name), "r", encoding="utf-8") as f:
                    text = f.read()
                data[mode].append([text, sent_score])

    data['train'] = pd.DataFrame(data['train'],
                                 columns=['text', 'sentiment'])

    data['test'] = pd.DataFrame(data['test'],
                                columns=['text', 'sentiment'])

    return data['train'], data['test']

In [3]:
train_data, test_data = load_train_test_data(folder_name='aclImdb')

In [4]:
print(train_data.shape)
train_data.head(2)

(25000, 2)


|   | text | sentiment |
|---|------|-----------|
| 0 | may contain spoilers!!!! so i watched this mov... | 0 |
| 1 | Who in their right mind does anything so stupi... | 0 |


In [5]:
print(test_data.shape)
test_data.head(2)

(25000, 2)


|   | text | sentiment |
|---|------|-----------|
| 0 | `^^contains spoilers^^<br /><br />This movie is...` | 0 |
| 1 | Charlie Chaplin responds to open auditions at ... | 0 |
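
As a quick sanity check on the loaded data, the number of review files per folder can be counted directly on disk. The sketch below assumes the same `aclImdb/<train|test>/<neg|pos>` layout that the loading function walks; the helper name `count_reviews` is hypothetical and introduced only for illustration.

```python
import os


def count_reviews(folder_name="aclImdb"):
    """Return {(split, label): number of .txt files}; missing folders count as 0."""
    counts = {}
    for mode in ["train", "test"]:
        for sent in ["neg", "pos"]:
            path = os.path.join(folder_name, mode, sent)
            if os.path.isdir(path):
                counts[(mode, sent)] = sum(
                    1 for name in os.listdir(path) if name.endswith(".txt")
                )
            else:
                counts[(mode, sent)] = 0
    return counts
```

With the full dataset in place, `count_reviews('aclImdb')` should report 12,500 files for each of the four folders, matching the `(25000, 2)` shape of each DataFrame.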


## Text preprocessing

In [6]:
from string import punctuation
import re

def preprocess_text(text):
    """
    Preprocess input text.
    
    :param text: input text
    :return: preprocessed text
    """
    # remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    
    # lowercase the text
    text = text.lower()
    
    # remove punctuation from text
    for p in punctuation:
        text = text.replace(p, '')
    
    return text.strip()
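
To make the effect of these three steps concrete, here is a small self-contained sketch applied to a made-up review string. It follows the same steps as `preprocess_text` but deletes punctuation in a single pass with `str.translate`; the name `preprocess_text_translate` and the sample sentence are assumptions added for illustration.

```python
import re
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)


def preprocess_text_translate(text):
    """Strip HTML tags, lowercase, and drop punctuation in one pass."""
    text = re.sub(r"<.*?>", " ", text)   # replace each HTML tag with a space
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)  # remove punctuation characters
    return text.strip()


sample = "Don't miss it!<br /><br />A GREAT film."
print(preprocess_text_translate(sample))
# prints: dont miss it  a great film
```

Two details are worth noticing: apostrophes are removed along with the rest of the punctuation, so "Don't" becomes "dont", and each removed tag leaves a space behind, so runs of spaces can remain. The latter is harmless here because the vectorizer splits the text into word tokens anyway.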

In [7]:
train_data['clean_text'] = train_data['text'].apply(preprocess_text)
train_data.head(2)

|   | text | sentiment | clean_text |
|---|------|-----------|------------|
| 0 | may contain spoilers!!!! so i watched this mov... | 0 | may contain spoilers so i watched this movie l... |
| 1 | Who in their right mind does anything so stupi... | 0 | who in their right mind does anything so stupi... |


In [8]:
test_data['clean_text'] = test_data['text'].apply(preprocess_text)
test_data.head(2)

|   | text | sentiment | clean_text |
|---|------|-----------|------------|
| 0 | `^^contains spoilers^^<br /><br />This movie is...` | 0 | contains spoilers this movie is utter crap do... |
| 1 | Charlie Chaplin responds to open auditions at ... | 0 | charlie chaplin responds to open auditions at ... |


## Vectorization and sentiment classification
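
Before fitting on the full corpus, it may help to see what a count vectorizer produces on a tiny input. The two toy documents below are invented for illustration; the sketch only shows how `CountVectorizer` with `stop_words='english'` drops common words and counts the rest.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "the" and "and" are English stop words.
toy_docs = [
    "the boring plot and the boring acting",
    "the brilliant acting",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index; sort words by that index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)
# prints: ['acting', 'boring', 'brilliant', 'plot']
print(toy_counts.toarray())
# prints:
# [[1 2 0 1]
#  [1 0 1 0]]
```

Each row is one document and each column one vocabulary word, so the first row records that "boring" occurs twice. On the real data the matrix has one row per review and is stored in sparse form.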

In [9]:
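# create a simple vectorizer that turns every text into a vector of word counts;
# preprocess_text is also passed as the preprocessor, which is effectively a no-op on
# the already-cleaned clean_text column but lets the vectorizer accept raw text too
vectorizer = CountVectorizer(stop_words='english', preprocessor=preprocess_text)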

# transform train sample into vectors
X_train_vectors = vectorizer.fit_transform(train_data['clean_text'])

# create a Logistic Regression classifier
clf = linear_model.LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)

# train the classifier
clf.fit(X_train_vectors, train_data['sentiment'])

# transform test sample into vectors
X_test_vectors = vectorizer.transform(test_data['clean_text'])

# predict sentiment for test sample
y_predicted = clf.predict(X_test_vectors)

# observe the classification quality metrics
print(classification_report(test_data['sentiment'], y_predicted))

              precision    recall  f1-score   support

           0       0.86      0.87      0.87     12500
           1       0.87      0.86      0.86     12500

   micro avg       0.86      0.86      0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000



With this simple classification setup we already achieve about 0.86 precision, recall, and F1 on the 25,000-review test set!
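
One advantage of a linear model over word counts is that its learned weights can be read directly: a large positive coefficient pushes a review towards class 1, a large negative one towards class 0. The sketch below illustrates the idea on four invented two-word reviews; the toy data and variable names are assumptions, and in the notebook the same inspection would use `vectorizer` and `clf` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up miniature corpus: label 1 = positive, 0 = negative.
toy_texts = ["brilliant acting", "brilliant plot", "boring acting", "boring plot"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)

toy_clf = LogisticRegression(solver="liblinear", random_state=42)
toy_clf.fit(X_toy, toy_labels)

# Pair every vocabulary word with its coefficient and sort by weight.
words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
weights = toy_clf.coef_[0]
ranked = sorted(zip(weights, words))

print("most negative:", ranked[0][1])
# prints: most negative: boring
print("most positive:", ranked[-1][1])
# prints: most positive: brilliant
```

The words "acting" and "plot" appear equally often in both classes, so their weights stay close to zero, while "brilliant" and "boring" carry the signal.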