<div align="center">
<a href="https://vbti.nl"><img src="./vbti_logo.png" width="400"></a>
</div>

# IMDB - Preprocessing

In this example we are going to build a model that predicts the sentiment (negative or positive) of user movie reviews.

## Download dataset
For this notebook we use a dataset of movie reviews from the [Internet Movie Database (IMDb)](https://www.imdb.com). The dataset is available in several formats from several sources; we use the version on [this Kaggle page](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). Kaggle is a data science competition website. Downloading the dataset requires a Kaggle account, which is free to create. Once the dataset is downloaded, unzip it.

To quote from the Kaggle page, the dataset contains the following information:

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. 

### Data fields
- **sentiment** - Sentiment of the review; 'positive' for positive reviews and 'negative' for negative reviews
- **review** - Text of the review

Rename the unzipped csv file to 'IMDB.csv'.

In [1]:
# load some common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# read IMDB reviews
data = pd.read_csv('https://raw.githubusercontent.com/illyakaynov/masterclass-nlp/master/Case-IMBD_reviews/IMDB.csv')

In [3]:
# What does this dataset look like?
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
# show size of dataframe
print("We have in total: {} reviews".format(len(data)))
print("We have {} positive reviews".format(len(data[data.sentiment=="positive"])))
print("We have {} negative reviews".format(len(data[data.sentiment=="negative"])))

We have in total: 50000 reviews
We have 25000 positive reviews
We have 25000 negative reviews


# What do the reviews look like?
Movie reviews on the IMDb site are rated on a 10-point scale, from `1` (bad) to `10` (good). The labels in our dataset are binary: `0` for a low rating (IMDb score below 5) and `1` for a high rating (IMDb score of 7 or higher).
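
The binarisation rule described above can be written down as a small function. This is only a sketch: `rating_to_label` is a hypothetical helper, not part of the dataset tooling, and it assumes that ratings of 5 and 6 are left out as neutral, which is what the gap between the two thresholds suggests.

```python
def rating_to_label(rating):
    """Map an IMDb rating to a binary sentiment label (hypothetical helper)."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # 5 and 6: assumed to be excluded from the dataset

labels_demo = [rating_to_label(r) for r in (2, 5, 6, 7, 10)]
print(labels_demo)  # [0, None, None, 1, 1]
```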

To see what data we are dealing with, we first inspect some reviews by hand.

In [5]:
# Let's print a few reviews:
for review in range(3):
    print("\nThe: {}'th review, this review is: {}.".format(review, data.iloc[review]["sentiment"]))
    print(data.iloc[review]["review"])


The: 0'th review, this review is: positive.
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say 

# Cleaning the data
As you just saw, the reviews come in all kinds of shapes. We preprocess each review with the following steps.
 1. Convert the string sentiment to an integer (0 for negative and 1 for positive).
 2. Remove the HTML markup (like `<br />`). For this we use the Python package `BeautifulSoup`.
 3. Remove the stop words (e.g. the, is). For this we use the package `nltk` (Natural Language Toolkit).
 4. Convert all capital letters to lower case. For this we use Python's built-in `str.lower()`.
 5. Remove non-letter characters and 1- and 2-letter words. For this we use the module `re` (regular expressions).

The function `clean_text()` takes as input a review and performs the steps 2-5.

In [6]:
# 1. convert string sentiment labels to integer values
data['sentiment'] = data['sentiment'].apply(lambda sentiment:0 if sentiment=='negative' else 1)
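
An alternative sketch for this step, assuming the column should hold only the strings `'positive'` and `'negative'`: `Series.map` with an explicit dictionary turns any unexpected label into `NaN` instead of silently mapping it to 1. The small `demo` frame below is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (made-up rows).
demo = pd.DataFrame({"sentiment": ["positive", "negative", "Positive"]})

label_map = {"negative": 0, "positive": 1}
mapped = demo["sentiment"].map(label_map)

# Labels missing from label_map (here the capitalised "Positive") become NaN,
# so they can be detected instead of being counted as positive.
unexpected = demo.loc[mapped.isna(), "sentiment"].tolist()
print(unexpected)  # ['Positive']
```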

In [7]:
# If these packages are not yet installed you can download and install them by running this cell.
!pip install BeautifulSoup4
!pip install nltk



In [8]:
# import the required packages:
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup             

In [None]:
# And download the stop words
nltk.download('stopwords')

In [10]:
# Build the stop word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

def clean_text(s):
    # 2. Remove HTML markup (name the parser explicitly to avoid a warning)
    html_free_text = BeautifulSoup(s, "html.parser").get_text()

    # 5. Replace non-letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", html_free_text)

    # 4. Convert to lower case and split into words
    words = letters_only.lower().split()

    # 3. Remove stop words
    # 5. Remove all words with 1 or 2 letters
    words = [w for w in words if w not in STOP_WORDS and len(w) > 2]

    # Return the words as a single string again
    return " ".join(words)
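
To see what steps 2-5 do to a single short string, here is a dependency-free sketch. It is not the function used in this notebook: `clean_text_sketch` strips tags with a crude regular expression instead of `BeautifulSoup`, and `TOY_STOP_WORDS` is a tiny made-up stand-in for the `nltk` stop word list.

```python
import re

# Tiny stand-in for nltk's English stop word list (assumption, illustration only).
TOY_STOP_WORDS = {"the", "is", "a", "this", "was", "it", "for"}

def clean_text_sketch(s):
    # Step 2 (crude): replace anything that looks like an HTML tag with a space.
    no_tags = re.sub(r"<[^>]+>", " ", s)
    # Step 5a: replace non-letters with spaces.
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # Step 4: lower-case and split into words.
    words = letters_only.lower().split()
    # Steps 3 and 5b: drop stop words and words shorter than 3 letters.
    return " ".join(w for w in words if w not in TOY_STOP_WORDS and len(w) > 2)

print(clean_text_sketch("This movie was GREAT!<br /><br />It's a 10/10 for me."))
# movie great
```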

In [11]:
# process all reviews
# This operation can take a long time (up to ~20 min) for the whole dataset
# Therefore we only do a part of it
# We have preprocessed the data for you so you do not need to wait
labels = []
reviews = []
max_reviews = 1000  # Set to None to process the whole dataset

for i, (_, row) in enumerate(data.iterrows()):
    if max_reviews is not None and i >= max_reviews:
        break
    if i % 1000 == 0:
        # show progress
        print(f'Doing {i}/{data.shape[0]}')
    labels.append(row['sentiment'])
    reviews.append(clean_text(row['review']))

Doing 0/50000


In [12]:
# Let's print some reviews again. `data` still holds the raw text; only the sentiment is now an integer:
for review in range(3):
    print("\nThe: {}'th review, this review is: {}.".format(review, data.iloc[review]["sentiment"]))
    print(data.iloc[review]["review"])


The: 0'th review, this review is: 1.
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the mai

## Splitting the dataset into a train and a test part
To measure how well a model performs, we hold back part of the dataset (`test_data`). After the model is trained on the rest of the dataset (`train_data`), we measure its accuracy on the unseen `test_data`.

In [13]:
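split = 0.5  # fraction of the data used for training

# Use one split index for both parts so they never overlap, whatever the value of `split`
split_index = int(split * len(reviews))

train_data = (reviews[:split_index], labels[:split_index])
test_data  = (reviews[split_index:], labels[split_index:])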
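
The slicing above keeps the reviews in their original order. If the file happened to be sorted by sentiment, one part could end up with mostly one class. A common precaution, sketched here under the assumption that a random split is acceptable for this task, is to shuffle reviews and labels together before cutting. `shuffled_split` and the toy lists are hypothetical names used only for this illustration.

```python
import random

def shuffled_split(reviews, labels, split=0.5, seed=42):
    """Shuffle (review, label) pairs together, then cut them at `split`."""
    pairs = list(zip(reviews, labels))
    random.Random(seed).shuffle(pairs)       # seeded so the split is reproducible
    cut = int(split * len(pairs))
    train, test = pairs[:cut], pairs[cut:]
    # Unzip back into the (reviews, labels) tuple layout used above.
    # This assumes both parts are non-empty.
    train_reviews, train_labels = map(list, zip(*train))
    test_reviews, test_labels = map(list, zip(*test))
    return (train_reviews, train_labels), (test_reviews, test_labels)

toy_reviews = ["good film", "bad film", "great plot", "dull plot"]
toy_labels = [1, 0, 1, 0]
toy_train, toy_test = shuffled_split(toy_reviews, toy_labels, split=0.5)
print(len(toy_train[0]), len(toy_test[0]))  # 2 2
```

Shuffling the pairs rather than the two lists separately keeps every review attached to its own label.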

# Best practices: serializing with `pickle`
As you might have noticed, reading and cleaning the text takes a long time. In a production environment you would want to parallelize the whole task. For now, we save the Python tuples `train_data` and `test_data` to pickle files. Python's `pickle` module (de)serializes Python objects to and from a file. This way objects can be shared between Python programs, and, as in our case, you avoid spending time recreating an object. A nice tutorial on the pickle module can be found [here](https://www.datacamp.com/community/tutorials/pickle-python-tutorial).

In [14]:
import pickle

filename = './train_data_small.pickle'
with open(filename, 'wb') as file_object:
    pickle.dump(train_data, file_object)
    
filename = './test_data_small.pickle'
with open(filename, 'wb') as file_object:
    pickle.dump(test_data, file_object)    
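
A quick way to convince yourself that nothing is lost when saving and loading is a round trip on a small object. The sketch below uses a made-up `toy_data` tuple and a temporary directory, so it does not touch the files written above.

```python
import os
import pickle
import tempfile

toy_data = (["good film", "bad film"], [1, 0])   # same (reviews, labels) layout

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data.pickle")
    with open(path, "wb") as file_object:
        pickle.dump(toy_data, file_object)
    with open(path, "rb") as file_object:
        restored = pickle.load(file_object)

print(restored == toy_data)  # True
```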

We have preprocessed the whole dataset for you. The following cell will download the pickles.

In [None]:
def download_file(url, path):
    """
    Download file and save it to the defined location
    
    https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests/37573701
    """
    import requests
    from tqdm.notebook import tqdm
    import os
    
    
    if os.path.exists(path):
        print('File "{}" already exists. Skipping download.'.format(path))
        return
    
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024  # 1 kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open(path, 'wb') as file:
        for chunk in response.iter_content(block_size):
            progress_bar.update(len(chunk))
            file.write(chunk)
    progress_bar.close()
    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR, something went wrong")

download_file('https://github.com/illyakaynov/masterclass-nlp/blob/master/Case-IMBD_reviews/train_data.pickle?raw=true',
             path='train_data.pickle')

download_file('https://github.com/illyakaynov/masterclass-nlp/blob/master/Case-IMBD_reviews/test_data.pickle?raw=true',
             path='test_data.pickle')

To read the pickle files, the following code can be used.

In [15]:
# read pickle files
import pickle

filename = './train_data.pickle'
with open(filename, 'rb') as file_object:
    train_data = pickle.load(file_object)
    
filename = './test_data.pickle'
with open(filename, 'rb') as file_object:
    test_data = pickle.load(file_object)    

In [16]:
train_data[0][0]

'one reviewers mentioned watching episode hooked right exactly happened first thing struck brutality unflinching scenes violence set right word trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance mess around first episode ever saw struck nasty surreal say ready watched developed taste got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skills prison experience w

In [17]:
test_data[0][0]

'movie bad start purpose movie angela wanted get high body count acting horrible killings acted badly like ally got stuffed toilet guess abandoned cabin end movie comes molly guy cabin see ally angela must gone get part really got black girl angela cabin angela took guitar string chocked one horrible acting two turn around punch bitch molly getting chased angela neigh turn around stab stupid movie sucked'