# 1-Preprocessing

In this notebook, I import the IMDB movie-review dataset, clean it, and write the cleaned train and test splits to a pair of CSV files.

## Import

In [2]:
import pandas as pd

from src.clean import load, clean_text

## Load IMDB Dataset

In [3]:
# Load the 'imdb' dataset into a dict of DataFrames with two keys:
# 'train' and 'test'.
dataset_name = 'imdb'
df_dict = load(dataset_name)

df_dict.keys()

dict_keys(['train', 'test'])

## Clean

To clean each review, I convert all text to lower case, collapse every whitespace sequence to a single space, and strip out punctuation.
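The body of `clean_text` lives in `src/clean.py` and is not shown in this notebook, so the following is only a minimal sketch of the three steps described above, not the actual implementation. The name `clean_text_sketch`, the `column` parameter, and the choice to remove ASCII punctuation via `string.punctuation` are assumptions made for illustration. The sketch strips punctuation before collapsing whitespace so that a removed character (for example a dash surrounded by spaces) cannot leave a double space behind, and it also trims leading and trailing whitespace.

```python
import re
import string

import pandas as pd

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_text_sketch(df: pd.DataFrame, column: str = 'text') -> pd.DataFrame:
    """Hypothetical stand-in for src.clean.clean_text; returns a cleaned copy."""

    def _clean(text: str) -> str:
        text = text.lower()                        # 1. convert to lower case
        text = text.translate(_PUNCT_TABLE)        # 2. delete punctuation
        return re.sub(r'\s+', ' ', text).strip()   # 3. collapse whitespace runs

    # assign() returns a new DataFrame, so the input is left unmodified.
    return df.assign(**{column: df[column].map(_clean)})


# Toy input (made up for this example, not taken from the dataset).
demo = pd.DataFrame({
    'text': ["I'm  guessing...\n\tGREAT all-talking film!"],
    'label': [1],
})
cleaned = demo.pipe(clean_text_sketch)
print(cleaned['text'][0])  # im guessing great alltalking film
```

Because punctuation is deleted rather than replaced with a space, contractions and hyphenated words are fused ("im", "alltalking"), which matches the cleaned samples shown in the Test section.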

In [9]:
# strings needed to specify output filenames
dir = 'data'
tag = 'cleaned'
suffix = 'csv'

# Clean both DataFrames and write each one to a CSV file.
for (key, df) in df_dict.items():
    df_clean = df.pipe(clean_text)

    filename = f'{dir}/{dataset_name}-{key}-{tag}.{suffix}'
    df_clean.to_csv(
        filename, 
        sep=',', 
        header=True, 
        index=False
    )

## Test

As a test of the above code, I read the generated CSV files back in and do some light exploratory data analysis (EDA).

In [10]:
train = pd.read_csv(f'{dir}/{dataset_name}-train-{tag}.{suffix}')
test = pd.read_csv(f'{dir}/{dataset_name}-test-{tag}.{suffix}')

print(f'Train shape: {train.shape}')
print(f'Test shape: {test.shape}')


Train shape: (25000, 2)
Test shape: (25000, 2)


In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB
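The non-null counts above confirm that no review was lost in the round trip. One thing to keep in mind is that `pd.read_csv` reads empty fields as `NaN` by default, so a review that became empty after cleaning would show up as a null here. The check can be taken one step further by counting rows that still violate one of the cleaning rules. The helper below is a sketch under assumptions: `find_cleaning_problems` is a hypothetical name, and "punctuation" is taken to mean the ASCII characters in `string.punctuation`.

```python
import re
import string

import pandas as pd

# Character class matching any single ASCII punctuation character.
_PUNCT_PATTERN = f'[{re.escape(string.punctuation)}]'
# Two or more whitespace characters in a row, or any whitespace
# character other than a plain space (tab, newline, ...).
_WHITESPACE_PATTERN = r'\s{2,}|[^\S ]'


def find_cleaning_problems(df: pd.DataFrame, column: str = 'text') -> dict:
    """Count the rows that violate each cleaning rule (hypothetical helper)."""
    text = df[column].dropna().astype(str)
    return {
        'null': int(df[column].isna().sum()),
        'uppercase': int((text != text.str.lower()).sum()),
        'punctuation': int(text.str.contains(_PUNCT_PATTERN, regex=True).sum()),
        'whitespace': int(text.str.contains(_WHITESPACE_PATTERN, regex=True).sum()),
    }


# Toy frame (made up for this example): one clean row, one dirty row, one null.
sample = pd.DataFrame({
    'text': ['im guessing the writers', 'Bad  Row!', None],
    'label': [0, 1, 0],
})
problems = find_cleaning_problems(sample)
print(problems)
```

On a correctly cleaned split, every count should be zero, so `find_cleaning_problems(train)` could serve as a quick regression check whenever the cleaning code changes.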


In [25]:
train.sample(5)

                                                    text  label
21841  first off let me say i have wanted to see this...      1
397    i saw this movie when i was much younger and i...      0
9714   im guessing the writers have never read a book...      0
21600  this is one of the movies of dev anand who gav...      1
6682   odd slasher movie from producer charles band i...      0


In [18]:
test.sample(5)

                                                    text  label
19057  it came as no surprise to me that this was a v...      1
18858  although i totally agree with the previous com...      1
23894  lights of new york was the first alltalking fe...      1
10897  i had a different experience with this movie i...      0
5873   and unfortunately so did i any movie that reli...      0


In [19]:
train['label'].value_counts()

label
0    12500
1    12500
Name: count, dtype: int64

The training set contains 12,500 positive reviews and 12,500 negative reviews, so this classification problem is balanced.
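The same balance check can be expressed as class proportions rather than raw counts, which makes it easy to apply to any split. The snippet below is a small sketch on a made-up toy frame; the helper name `class_balance` and the 0.05 tolerance used to call a split "balanced" are illustrative assumptions, not values from this project.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.Series:
    """Return each class's share of the rows, ordered by label value."""
    return labels.value_counts(normalize=True).sort_index()


# Toy labels (made up for this example): four negatives and four positives.
toy = pd.DataFrame({'label': [0, 1] * 4})
shares = class_balance(toy['label'])
print(shares)

# Assumed tolerance: treat the split as balanced when the largest and
# smallest class shares differ by less than 5 percentage points.
is_balanced = bool((shares.max() - shares.min()) < 0.05)
print(is_balanced)
```

With the real data, `class_balance(train['label'])` and `class_balance(test['label'])` would give the shares for each split. For a balanced binary problem, always predicting a single class yields 50% accuracy, which is the baseline any model needs to beat.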