This notebook preprocesses the raw IMDB review dataset by hand. Keras ships a pre-tokenized version of the data (used in `imdb-sentiment.ipynb`), but it is useful to learn how to build that representation ourselves. The preprocessing recipe follows: https://www.kaggle.com/code/affand20/imdb-with-pytorch/notebook

In [29]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas()

In [2]:
# download data into data/ from: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
data = pd.read_csv("data/IMDB_Dataset.csv")
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
# add a binary label column: 1 = positive, 0 = negative
data['label'] = (data['sentiment'] == 'positive').astype(int)

In [4]:
# peek at review lengths (whitespace-separated words) to pick a truncation length
lengths = data['review'].apply(lambda t: len(t.split()))
lengths.describe()

count    50000.000000
mean       231.156940
std        171.343997
min          4.000000
25%        126.000000
50%        173.000000
75%        280.000000
max       2470.000000
Name: review, dtype: float64
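
One way to turn these quartiles into a truncation length is to ask what fraction of reviews fits entirely within a candidate `seq_len`. The sketch below assumes a hypothetical helper `coverage` and uses toy word counts; in the notebook you would pass `lengths` instead. Note that these are raw word counts, so the sequences left after stop-word removal will be shorter.

```python
import pandas as pd

def coverage(lengths: pd.Series, seq_len: int) -> float:
    """Fraction of texts whose word count is <= seq_len (hypothetical helper)."""
    return float((lengths <= seq_len).mean())

# Toy word counts for illustration only; in the notebook, pass `lengths`.
toy_lengths = pd.Series([4, 126, 173, 280, 2470])
for seq_len in (128, 256, 512):
    print(seq_len, coverage(toy_lengths, seq_len))
```

A `seq_len` that covers most reviews without being dominated by the few very long ones is usually a reasonable starting point.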

### Preprocess text

In [5]:
import re
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/muhammadali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/muhammadali/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


For comparison, this is how the same tokenize-and-lemmatize step would look in spaCy. The notebook uses NLTK instead because its pipeline is simpler.

```
nlp = spacy.load("en_core_web_sm")
doc = nlp(data.review.iloc[0])

def spacy_tokenize(text):
    doc = nlp(text)
    return [tok.lemma_ for tok in doc]

tokens = data.review.apply(spacy_tokenize)
```

#### Text cleaning pipeline

In [6]:
def rm_link(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def rm_inner_punct(text):
    # remove punctuation inside sentences
    return re.sub(r'[\"\#\$\%\&\'\(\)\*\+\/\:\;\<\=\>\@\[\\\]\^\_\`\{\|\}\~]', '', text)

def rm_html(text):
    return re.sub(r'<[^>]+>', '', text)
    
def rm_number(text):
    return re.sub(r'\d+', '', text)

def rm_whitespaces(text):
    return re.sub(r' +', ' ', text)

def rm_nonascii(text):
    return re.sub(r'[^\x00-\x7f]', r'', text)

def space_bt_punct(text):
    pattern = r'([.,!?-])'
    s = re.sub(pattern, r' \1 ', text)     # add whitespaces between punctuation
    s = re.sub(r'\s{2,}', ' ', s)        # remove double whitespaces    
    return s

def rm_punct2(text):
    # return re.sub(r'[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]', ' ', text)
    return re.sub(r'[\"\#\$\%\&\'\(\)\*\+\/\:\;\<\=\>\@\[\\\]\^\_\`\{\|\}\~]', ' ', text)

def spell_correction(text):
    # not real spell checking: caps any run of a repeated character at two ('soooo' -> 'soo')
    return re.sub(r'(.)\1+', r'\1\1', text)

def rm_emoji(text):
    emojis = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    return emojis.sub(r'', text)

def clean_pipeline(text):
    # chain all cleaning operations, in order
    no_link = rm_link(text)
    no_html = rm_html(no_link)
    space_punct = space_bt_punct(no_html)
    no_punct = rm_punct2(space_punct)
    no_number = rm_number(no_punct)
    no_whitespaces = rm_whitespaces(no_number)
    no_nonascii = rm_nonascii(no_whitespaces)
    # emoji are non-ASCII, so rm_nonascii has already removed them; kept as a safeguard
    no_emoji = rm_emoji(no_nonascii)
    spell_corrected = spell_correction(no_emoji)
    return spell_corrected
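
Two behaviours of the regexes above are worth seeing on a concrete string. Because `rm_html` replaces a tag with the empty string, words separated only by a tag get glued together, and the `spell_correction` regex only caps repeated characters at two. The sketch below reuses those two patterns; `rm_html_keep_gap` and `squeeze_spaces` are hypothetical helpers, not part of the original recipe.

```python
import re

def rm_html_keep_gap(text: str) -> str:
    """Hypothetical variant of rm_html: replace each tag with a space."""
    return re.sub(r'<[^>]+>', ' ', text)

def squeeze_spaces(text: str) -> str:
    """Collapse runs of spaces into a single space."""
    return re.sub(r' +', ' ', text)

sample = "great movie<br /><br />The cast is soooo goood"
glued = re.sub(r'<[^>]+>', '', sample)        # same regex as rm_html
spaced = squeeze_spaces(rm_html_keep_gap(sample))
capped = re.sub(r'(.)\1+', r'\1\1', spaced)   # same regex as spell_correction

print(glued)   # great movieThe cast is soooo goood
print(spaced)  # great movie The cast is soooo goood
print(capped)  # great movie The cast is soo good
```

When a tag is followed by punctuation the gluing is harmless, since the punctuation-spacing step separates the words again; it only matters when a tag sits directly between two words.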

#### Tokenizing and lemmatizing pipeline

In [7]:
def tokenize(text):
    return word_tokenize(text)

def rm_stopwords(tokens):
    return [tok for tok in tokens if tok not in stopwords]

def lemmatize(tokens):
    # lemmatize each token, then filter again in case a lemma is itself a stop word
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]
    return rm_stopwords(lemmas)

def preprocess_pipeline(text):
    tokens = tokenize(text)
    tokens = rm_stopwords(tokens)
    lemmas = lemmatize(tokens)
    return " ".join(lemmas)

In [8]:
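data['clean'] = data.review.progress_apply(clean_pipeline)
# NOTE: this tokenizes the raw review, not data['clean'], so the cleaning step is not
# reflected in 'processed'; use data.clean.progress_apply(preprocess_pipeline) to chain them
data['processed'] = data.review.progress_apply(preprocess_pipeline)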

In [11]:
# export processed data
data[['clean', 'processed']].to_csv('imdb-processed.csv', index=False, header=True)

### Vectorize data into model-compatible format

In [14]:
# build vocabulary from all texts
reviews = data.processed.values
all_reviews = ' '.join(reviews)
words = all_reviews.split()

print(words[:10])

['One', 'reviewer', 'mentioned', 'watching', '1', 'Oz', 'episode', "'ll", 'hooked', '.']


In [114]:
from collections import Counter

NUM_WORDS = 20000
counter = Counter(words)
# sort words in descending order of frequency (stop words have been removed)
vocab = sorted(counter, key=counter.get, reverse=True)
# ids 1..NUM_WORDS go to the most frequent words; id 0 is reserved for padding
id2word = dict(enumerate(vocab[:NUM_WORDS], 1))
id2word[0] = '<PAD>'
word2id = {word: idx for idx, word in id2word.items()}
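
A common variation is to reserve a second special id for out-of-vocabulary words instead of dropping them, so the encoded sequence keeps its original length. The sketch below is one way to do that, assuming a hypothetical helper `build_vocab` and an `<UNK>` token that the notebook itself does not use; it relies on `Counter.most_common` rather than a manual sort.

```python
from collections import Counter

def build_vocab(words, num_words, specials=("<PAD>", "<UNK>")):
    """Map the num_words most frequent words to ids placed after the special tokens."""
    id2word = dict(enumerate(specials))
    offset = len(specials)
    for i, (word, _count) in enumerate(Counter(words).most_common(num_words)):
        id2word[offset + i] = word
    word2id = {word: idx for idx, word in id2word.items()}
    return id2word, word2id

# Toy corpus for illustration only.
toy_words = "good film good cast bad film good".split()
toy_id2word, toy_word2id = build_vocab(toy_words, num_words=2)

unk = toy_word2id["<UNK>"]
encoded = [toy_word2id.get(w, unk) for w in ["good", "bad", "film"]]
print(toy_id2word)  # {0: '<PAD>', 1: '<UNK>', 2: 'good', 3: 'film'}
print(encoded)      # [2, 1, 3]
```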

In [116]:
# encode each review as a list of word ids
reviews_enc = []
for review in tqdm(reviews):
    # NOTE: this is a whitespace split; the nltk pipeline should not have joined the tokens back
    # words outside the NUM_WORDS most frequent are dropped
    reviews_enc.append([word2id[word] for word in review.split() if word in word2id])

`reviews_enc` is now a list of integer word-id lists, one per review, which is the same general form in which Keras ships its IMDB dataset.
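
A quick sanity check on an encoding like this is to decode a few sequences back into words and confirm that only out-of-vocabulary words went missing. The sketch below uses a toy vocabulary and a hypothetical `decode` helper; in the notebook you would pass the real `id2word` and an entry of `reviews_enc`.

```python
def decode(ids, id2word):
    """Turn a list of word ids back into a space-joined string (hypothetical helper)."""
    return " ".join(id2word[i] for i in ids)

# Toy vocabulary for illustration; "weak" is deliberately out of vocabulary.
toy_id2word = {0: "<PAD>", 1: "film", 2: "great", 3: "plot"}
toy_word2id = {w: i for i, w in toy_id2word.items()}

text = "great film weak plot"
ids = [toy_word2id[w] for w in text.split() if w in toy_word2id]
print(ids)                       # [2, 1, 3]
print(decode(ids, toy_id2word))  # great film plot
```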

In [120]:
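DIM = NUM_WORDS + 1  # +1 for the <PAD> id 0

def vectorize_sequences(sequences, dim=DIM):
    # multi-hot encoding: row i has a 1.0 at every word id that occurs in review i
    matrix = np.zeros((len(sequences), dim))
    for i, sequence in enumerate(sequences):
        for word_id in sequence:
            matrix[i, word_id] = 1.0

    return matrix

# word ids never exceed NUM_WORDS, so DIM columns suffice
# (len(vocab)+1 would allocate one column per distinct word in the corpus)
reviews_vec = vectorize_sequences(reviews_enc, DIM)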
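
A dense multi-hot matrix grows quickly: 50,000 reviews by 20,001 columns of `float64` is about 8 GB. The sketch below shows the same encoding with a configurable dtype plus the size arithmetic; `multi_hot` and `dense_bytes` are hypothetical helpers, and `float32` is an assumption about what the downstream model needs.

```python
import numpy as np

def multi_hot(sequences, dim, dtype=np.float32):
    """Multi-hot encode lists of ids into a (len(sequences), dim) matrix."""
    matrix = np.zeros((len(sequences), dim), dtype=dtype)
    for i, seq in enumerate(sequences):
        # integer-array indexing sets every id of row i at once
        matrix[i, np.asarray(seq, dtype=np.intp)] = 1
    return matrix

def dense_bytes(n_rows, dim, dtype):
    """Size in bytes of a dense (n_rows, dim) array of the given dtype."""
    return n_rows * dim * np.dtype(dtype).itemsize

toy = multi_hot([[1, 3, 3], [2], []], dim=4)
print(toy)
print(dense_bytes(50_000, 20_001, np.float64) / 1e9)  # about 8.0 (GB)
print(dense_bytes(50_000, 20_001, np.float32) / 1e9)  # about 4.0 (GB)
```

Repeated ids collapse to a single 1, so this representation records presence, not counts.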

In [44]:
# instead of a vocab-sized vector, we can encode each review as a fixed-length id sequence with padding
def vectorize_pad(reviews, pad_id, seq_len=128):
    # reviews_vec is a [num_reviews, seq_len] matrix, initially filled with pad_id
    reviews_vec = np.full((len(reviews), seq_len), pad_id, dtype=int)
    for i, sequence in enumerate(reviews):
        # keep at most the first seq_len tokens; shorter reviews stay right-padded
        truncated = sequence[:seq_len]
        reviews_vec[i, :len(truncated)] = truncated

    return reviews_vec
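
When a review is longer than `seq_len`, NumPy clamps an out-of-range slice to the array bounds, so the assignment target never exceeds `seq_len` columns and the extra tokens are simply cut off. A small sketch with a hypothetical `pad_or_truncate` helper makes both the padding and the truncation visible:

```python
import numpy as np

def pad_or_truncate(seqs, pad_id=0, seq_len=4):
    """Right-pad short sequences with pad_id and truncate long ones to seq_len."""
    out = np.full((len(seqs), seq_len), pad_id, dtype=int)
    for i, seq in enumerate(seqs):
        kept = seq[:seq_len]
        out[i, :len(kept)] = kept
    return out

batch = pad_or_truncate([[5, 6], [1, 2, 3, 4, 5, 6]])
print(batch)
# [[5 6 0 0]
#  [1 2 3 4]]

# Slices past the end are clamped rather than raising an error.
print(np.arange(4)[:10].shape)  # (4,)
```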

In [121]:
# TODO: shuffle before vectorizing (rows are currently kept in file order)
# X = vectorize_pad(reviews_enc, word2id['<PAD>'], seq_len=256)
X = vectorize_sequences(reviews_enc)
y = np.array(data['label'])

**Train-val-test split**

In [123]:
train_size=.8
val_size=.05

train_i = int(train_size * X.shape[0])
X_train = X[:train_i]
y_train = y[:train_i]

val_i = int(val_size * X.shape[0])
X_val = X[train_i:train_i+val_i]
y_val = y[train_i:train_i+val_i]

X_test = X[train_i+val_i:]
y_test = y[train_i+val_i:]
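
The split above takes contiguous slices, so it depends on the file order being free of any pattern. A minimal sketch of a shuffled variant with the same 80/5/15 proportions follows; `shuffled_split` and the fixed `seed` are assumptions for illustration, and the toy arrays stand in for `X` and `y`.

```python
import numpy as np

def shuffled_split(X, y, train_size=0.8, val_size=0.05, seed=0):
    """Shuffle row indices once, then cut them into train / val / test parts."""
    n = X.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    train_end = int(train_size * n)
    val_end = train_end + int(val_size * n)
    parts = np.split(order, [train_end, val_end])
    # index X and y with the same permutation so features and labels stay aligned
    return [(X[idx], y[idx]) for idx in parts]

# Toy data: row i of X_toy is [2i, 2i + 1] and its label is i % 2.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2
train, val, test = shuffled_split(X_toy, y_toy)
print(len(train[0]), len(val[0]), len(test[0]))  # 80 5 15
```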

In [101]:
import keras

In [124]:
MLP = keras.Sequential([
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

MLP.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

In [125]:
MLP.fit(
    X_train,
    y_train,
    epochs=20,
    batch_size=512,
    validation_data=(X_val, y_val)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x384c50d90>
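
The `History` object returned by `fit` keeps per-epoch metrics in its `history` attribute, a dict of lists keyed by metric name. Assigning the result (for example `history = MLP.fit(...)`) makes it possible to find the epoch where validation loss bottomed out. The sketch below uses a hypothetical `best_epoch` helper and made-up numbers that only illustrate the dict's shape; they are not results from this notebook.

```python
import numpy as np

def best_epoch(history_dict, metric="val_loss"):
    """Return the 1-based epoch at which `metric` was lowest (hypothetical helper)."""
    return int(np.argmin(history_dict[metric])) + 1

# Made-up values, for illustration only; with a real run, pass `history.history`.
toy_history = {
    "loss": [0.60, 0.40, 0.30, 0.22],
    "val_loss": [0.55, 0.42, 0.45, 0.50],
}
print(best_epoch(toy_history))  # 2
```

If validation loss starts rising while training loss keeps falling, retraining for only that many epochs is a simple guard against overfitting.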