## Quiz #0801

### "Text Classification with Keras"

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import os
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, SimpleRNN, LSTM, Embedding
from tensorflow.keras.utils import to_categorical
from keras.preprocessing import sequence
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

#nltk.download('stopwords')

#### Answer the following question by providing Python code:

1). Read in the movie review data from Cornell CS department. Carry out the EDA. <br>
- The data can be found [here](https://www.cs.cornell.edu/people/pabo/movie-review-data). <br>
- Download the “polarity dataset” and unzip. <br>
- Under the "txt_sentoken” folder, there are “pos” and “neg" subfolders. <br>

In [2]:
# Specify the folder and read in the subfolders.
reviews = load_files('txt_sentoken/')
my_docs, y = reviews.data, reviews.target
X = my_docs

In [3]:
type(my_docs)


list

In [4]:
len(my_docs)

2000

In [5]:
np.unique(y, return_counts=True)


(array([0, 1]), array([1000, 1000], dtype=int64))

2). Carry out the data preprocessing: <br>
- Cleaning.
- Stopword removal.

In [6]:
documents = []

for sen in range(0, len(my_docs)):  
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(my_docs[sen]))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'\^[a-zA-Z]\s+', ' ', document) 

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    document = document.split()

    document = ' '.join(document)

    documents.append(document)
    


In [7]:
X = documents

In [8]:
#transform our dataframe to df 
d = {'target': y.tolist(), 'text': documents}
df = pd.DataFrame(d)

In [9]:
df.head()

Unnamed: 0,target,text
0,0,arnold schwarzenegger has been an icon for act...
1,1,good films are hard to find these days ngreat ...
2,1,quaid stars as man who has taken up the proffe...
3,0,we could paraphrase michelle pfieffer characte...
4,1,kolya is one of the richest films ve seen in s...


In [10]:
# count dataset samples to know how many of each class we have
df.target.value_counts()

0    1000
1    1000
Name: target, dtype: int64

3). Carry out label encoding by integers (required form by Keras):

In [11]:
#already done

In [12]:
# count dataset samples to know how many of each class we have
df.target.value_counts()

0    1000
1    1000
Name: target, dtype: int64

4). Prepare the data for AI: <br>
- Apply the padding.
- Split the data into training and testing.

#### FIRST : the splitting

In [13]:
train_size = 0.8

# split our data into train set (80%) and test set (20%)
train_data, test_data = train_test_split(df, test_size = 1 - train_size, random_state = 0, stratify = df.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)



#### SECOND : the padding.

In [14]:
# create a tokenizer
tokenizer = Tokenizer()
# fit the tokenizer in the train text
tokenizer.fit_on_texts(train_data.text)

we will take the maximum length from the training dataset to pad all the text sequences 

In [15]:
from keras.preprocessing.sequence import pad_sequences

# get max length of the train data
max_length = max([len(s.split()) for s in train_data.text])

# pad sequences in x_train data set to the max length
x_train = pad_sequences(tokenizer.texts_to_sequences(train_data.text),
                        maxlen = max_length)
# pad sequences in x_test data set to the max length
x_test = pad_sequences(tokenizer.texts_to_sequences(test_data.text),
                       maxlen = max_length)

In [16]:
print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)

print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

x_train shape:  (1600, 2305)
x_test shape:  (400, 2305)
y_train shape: (1600,)
y_test shape: (400,)


In [17]:
# data is padded 

5). Define the AI model (Embedding + LSTM):

#### FIRST : Embedding

The embedding used can be downloaded from GloVe website: https://nlp.stanford.edu/projects/glove/
The following functions were taken from a Jason Brownlee tutorial available at: https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

These functions load the embedding file dowloaded and create the weight matrix that is needed to create the embedding layer in the model.

In [18]:
# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip first line
    file = open(filename,'r',encoding="utf-8")
    lines = file.readlines()
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, embedding_dim))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

In [19]:
# contains the index for each word
vocab = tokenizer.word_index
# total number of words in our vocabulary, plus one for unknown words
vocab_size = len(tokenizer.word_index) + 1
# embedding dimensions
embedding_dim = 25

print("Vocab size: ", vocab_size)
print("Max length: ", max_length)

# load embedding from file
raw_embedding = load_embedding('glove.twitter.27B.25d.txt')
# get vectors in the right order
embedding_matrix = get_weight_matrix(raw_embedding, vocab)

Vocab size:  40615
Max length:  2305


In [20]:
print(vocab_size)

40615


In [22]:
# create the embedding layer
embedding_layer = Embedding(vocab_size, 
                            embedding_dim, 
                            weights = [embedding_matrix], 
                            input_length = max_length, 
                            trainable = False)

#### SECOND : Define model

In [23]:
# define model
from keras.layers.core import Dropout
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(LSTM(200, dropout = 0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation = "sigmoid"))



6). Define the optimizer and compile the model:

In [24]:

model.compile(optimizer = "adam", loss = 'binary_crossentropy', metrics = ['accuracy'])

7). Train the model and visualize the summary:

#### model summary

In [25]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2305, 25)          1015375   
_________________________________________________________________
dropout (Dropout)            (None, 2305, 25)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 200)               180800    
_________________________________________________________________
dense (Dense)                (None, 64)                12864     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,209,104
Trainable params: 193,729
Non-trainable params: 1,015,375
_________________________________________________________________
None


#### train model

In [26]:
# train model
BATCH_SIZE = 1024
EPOCHS = 15

from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', 
                              factor = 0.1,
                              min_lr = 0.01)

history = model.fit(x_train, y_train, batch_size = BATCH_SIZE, epochs = EPOCHS,validation_split = 0.1, verbose = 1, callbacks = [reduce_lr])

Epoch 1/15


ResourceExhaustedError:    OOM when allocating tensor with shape[1024,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[{{node while_26/body/_1/while/MatMul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[sequential/lstm/PartitionedCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_3313]

Function call stack:
train_function -> train_function -> train_function


8). Display the test result (accuracy):

In [27]:
 #evaluate model
score = model.evaluate(x_test, y_test, batch_size = BATCH_SIZE)
print("Test accuracy:", score[1])


Test accuracy: 0.47999998927116394
