## Quiz #0801

### "Text Classification with Keras"

In [4]:
import numpy as np
import pandas as pd
import re
import nltk
import os
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Embedding
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
#nltk.download('stopwords')

#### Answer the following question by providing Python code:

1). Read in the movie review data from the Cornell CS department and carry out the EDA. <br>
- The data can be found [here](https://www.cs.cornell.edu/people/pabo/movie-review-data). <br>
- Download the "polarity dataset" and unzip it. <br>
- Under the "txt_sentoken" folder, there are "pos" and "neg" subfolders. <br>

In [5]:
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

--2021-09-02 11:00:52--  https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘review_polarity.tar.gz’


2021-09-02 11:00:52 (10.8 MB/s) - ‘review_polarity.tar.gz’ saved [3127238/3127238]



In [None]:
!tar xvzf review_polarity.tar.gz

In [10]:
# Specify the folder and read in the subfolders.
reviews = load_files('txt_sentoken/')
my_docs, y = reviews.data, reviews.target

In [59]:
y

array([0, 1, 1, ..., 1, 0, 0])
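The cells above load the data but do not yet explore it. The following is a minimal sketch of two basic EDA checks, class balance and review length in words. The helper name `summarize_corpus` and the toy inputs are assumptions made for illustration; in this notebook one would pass `my_docs` and `y` instead. Note that `load_files` numbers the classes in alphabetical order of the subfolder names, so `neg` maps to 0 and `pos` maps to 1.

```python
from collections import Counter
from statistics import mean, median

def summarize_corpus(docs, labels):
    """Return class counts and word-count statistics for a labeled corpus.

    docs   : iterable of bytes or str (load_files returns bytes)
    labels : iterable of integer class labels
    """
    # Decode bytes if needed, then count whitespace-separated tokens per document.
    texts = [d.decode("utf-8", errors="replace") if isinstance(d, bytes) else d
             for d in docs]
    lengths = [len(t.split()) for t in texts]
    return {
        "n_docs": len(texts),
        "class_counts": dict(Counter(int(lab) for lab in labels)),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "mean_len": mean(lengths),
        "max_len": max(lengths),
    }

# Toy data standing in for (my_docs, y); these are not the real reviews.
toy_docs = [b"a dull , lifeless film", b"great fun", b"one of the best thrillers"]
toy_labels = [0, 1, 1]
summary = summarize_corpus(toy_docs, toy_labels)
print(summary)
```

The length statistics are also a reasonable guide for choosing `maxlen` when padding later on.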

2). Carry out the data preprocessing: <br>
- Cleaning.
- Stopword removal.

In [11]:
import nltk
nltk.download('stopwords')
  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
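# Cleaning
new_docs = []
stpw = set(stopwords.words('english'))   # Set membership tests are faster than list lookups.
for review in my_docs:
  review = review.decode()
  # Lowercase first; otherwise the a-z filter below would delete every capital letter.
  review = review.lower()
  # Remove newlines and anything enclosed in <...> or [...].
  review = re.sub(r"\n", " ", review)
  review = re.sub(r"[<\[].*?[>\]]", " ", review)
  # Keep only lowercase letters and spaces.
  review = re.sub(r"[^a-z ]", " ", review)
  # Drop stopwords.
  review = " ".join([x for x in review.split() if x not in stpw])
  new_docs.append(review)
print(review)   # Show the last cleaned review as a spot check.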


remake alfred hitchcock film best uncertain project perfect murder illustrates frankly dial murder one master director greatest efforts ample room improvement unfortunately instead updating script ironing faults speeding pace little perfect murder inexplicably managed eliminate almost everything worthwhile dial murder leaving behind nearly unwatchable wreckage would thriller almost suspense films loaded plot implausibilities best thrillers keep viewers involved enough going flaws logic become apparent long final credits rolled unfortunately perfect murder faults often overt become aware happening bad sign occurrences shatter suspension disbelief astute viewer looking next blunder course case perfect murder least gives audience member something besides concentrating inane plot lifeless cardboard characters perfect murder strict remake dial murder borrow heavily frederick knott play also source material hitchcock version well made tv retelling emily hayes gwyneth paltrow wealthy wife pow
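The cleaning steps can also be packaged as a function so that they are easy to test on a single string. The sketch below is one way to do that, not the only one. The name `clean_review` and the tiny `TOY_STOPWORDS` set are assumptions made to keep the example self-contained; in the notebook the stop word set would come from `stopwords.words('english')`. The function lowercases before filtering on `a-z`, so capital letters are kept rather than deleted.

```python
import re

# Tiny stand-in stop word list (an assumption for this sketch); the notebook
# would use set(stopwords.words('english')) from NLTK instead.
TOY_STOPWORDS = {"the", "a", "is", "was", "of", "and", "it"}

def clean_review(text, stop_words=TOY_STOPWORDS):
    """Lowercase, strip bracketed markup and non-letters, drop stop words."""
    text = text.lower()                          # lowercase before the a-z filter
    text = text.replace("\n", " ")               # flatten newlines
    text = re.sub(r"[<\[].*?[>\]]", " ", text)   # remove <...> and [...] spans
    text = re.sub(r"[^a-z ]", " ", text)         # keep only letters and spaces
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean_review("The plot is THIN, <br> and it was [sic] 2 hours of Boredom!\n")
print(cleaned)
```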

3). Carry out label encoding by integers (the form required by Keras):

In [13]:
corpus=new_docs

In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Make a dictionary with the top words.
# Index 0 is reserved for padding, so word indices run from 1 to n_words - 1.
n_words = 2000
words = []
for i in range(len(corpus)):
    words += nltk.word_tokenize(corpus[i])
top_words = pd.Series(words).value_counts().index
top_words = top_words[0:n_words - 1]                 # Apply a limitation.
my_dict = {}
my_dict_inv = {}
for i in range(len(top_words)):
    my_dict_inv[i + 1] = top_words[i]
    my_dict[top_words[i]] = i + 1



In [33]:
# Convert the corpus into the label encoded form.
corpus_int =[]
for i in range(len(corpus)):
    words = nltk.word_tokenize(corpus[i])
    words2int = []
    for x in words:
        if x in my_dict:
            words2int += [my_dict[x]]
    corpus_int.append(words2int)
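The two steps above (build a vocabulary of the most frequent words, then map each review to a list of integers) can be sketched more compactly with `collections.Counter`. The helper names `build_vocab` and `encode` and the toy corpus are assumptions for illustration. Unlike a plain 0-based mapping, this sketch starts the ids at 1 so that 0 remains free for the padding value that `pad_sequences` inserts by default.

```python
from collections import Counter

def build_vocab(token_lists, n_words):
    """Map the most frequent tokens to ids 1..n_words-1; id 0 stays free for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(n_words - 1))}

def encode(tokens, vocab):
    """Replace each in-vocabulary token by its id; other tokens are dropped."""
    return [vocab[t] for t in tokens if t in vocab]

# Toy corpus standing in for the cleaned reviews.
toy_corpus = ["good film good cast", "bad film good", "good film bad pacing"]
token_lists = [doc.split() for doc in toy_corpus]
vocab = build_vocab(token_lists, n_words=4)   # keeps the 3 most frequent tokens
encoded = [encode(toks, vocab) for toks in token_lists]
print(vocab)
print(encoded)
```

Dropping out-of-vocabulary words, as done here and in the cell above, shortens the sequences; mapping them to a dedicated "unknown" id is a common alternative.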

4). Prepare the data for AI: <br>
- Apply the padding.
- Split the data into training and testing.

In [36]:
X = corpus_int              # Keep the list of lists: np.array() on rows of unequal length warns or fails, depending on the NumPy version.
y = np.array(y)


In [37]:
# Padding: review lengths are uniformly matched to maxlen.
X = sequence.pad_sequences(X, maxlen = 100)

# y is already binary. Thus, there is no need to convert it to the one-hot encoding scheme.
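With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` adds zeros at the front of short sequences and cuts long sequences from the front, so only the last `maxlen` tokens of a long review survive. The plain-Python sketch below mimics that default behaviour for illustration; the function name `pad_pre` is an assumption and not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for s in seqs:
        s = list(s)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(s)) + s)   # pad at the front
    return padded

result = pad_pre([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4)
print(result)
```

If the opening of a review is expected to carry more signal than its ending, `truncating='post'` is the argument to change.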

In [38]:
#split the data into training and testing
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.2)
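The split above is drawn at random on every run, so the accuracy reported at the end will vary between runs. `train_test_split` accepts `random_state` to fix the shuffle and `stratify` to keep the class proportions equal in both parts. What stratification does can be sketched in plain Python as follows; the function name `stratified_split` and the toy labels are assumptions for illustration.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)            # seeded generator makes the split repeatable
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 negative and 10 positive examples.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```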

5). Define the AI model (Embedding + LSTM):

In [39]:
n_neurons = 100                    # Neurons within each memory cell.
n_input = 500                     # Dimension of the embedding space.

In [99]:
my_model = Sequential()
my_model.add(Embedding(n_words, output_dim=n_input, input_length=100))

my_model.add(LSTM(n_neurons, dropout=0.2, recurrent_dropout=0.2,activation='relu'))

my_model.add(Dense(1, activation='sigmoid'))
print(my_model.summary())

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 100, 500)          1000000   
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               240400    
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 101       
Total params: 1,240,501
Trainable params: 1,240,501
Non-trainable params: 0
_________________________________________________________________
None
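The parameter counts in the summary can be checked by hand. An embedding layer stores one vector per vocabulary entry, an LSTM layer has four gates that each hold input weights, recurrent weights, and a bias, and a dense layer holds a weight matrix plus a bias. The sketch below reproduces the three numbers; the helper function names are assumptions for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

n_words, embed_dim, n_neurons = 2000, 500, 100
counts = [
    embedding_params(n_words, embed_dim),
    lstm_params(embed_dim, n_neurons),
    dense_params(n_neurons, 1),
]
print(counts, sum(counts))
```

Most of the 1,240,501 parameters sit in the embedding layer, which is worth keeping in mind given how small the training set is.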


6). Define the optimizer and compile the model:

In [100]:
n_epochs = 15                      # Number of epochs.
batch_size = 100                    # Size of each batch.
learn_rate = 0.001 

In [101]:
opt=Adam(learning_rate=learn_rate)
my_model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
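Binary cross-entropy, the loss chosen above, is the mean of `-[y*log(p) + (1-y)*log(1-p)]` over the samples, where `p` is the sigmoid output. The following plain-Python sketch computes it on made-up values; the clipping constant `eps` is an assumption used here to avoid `log(0)`, and the function is an illustration rather than the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

loss = binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.5])   # made-up labels and probabilities
chance = binary_crossentropy([1, 0], [0.5, 0.5])         # always predicting 0.5
print(loss, chance)
```

A model that always outputs 0.5 scores `log(2)`, roughly 0.693, which is a useful reference point when reading the loss values during training.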


7). Train the model and visualize the summary:

In [102]:
from tensorflow.keras.callbacks import EarlyStopping


In [103]:
# validation_split holds out the last 10% of the training data for validation.
my_model.fit(X_train, Y_train, batch_size=batch_size, epochs=n_epochs, verbose=1,
             validation_split=0.1,
             callbacks=[EarlyStopping(monitor='accuracy', patience=3, min_delta=0.0001)])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f820338fe10>
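All 15 epochs ran, which means the early-stopping condition was never met. The patience rule behind `EarlyStopping` can be sketched as follows for a metric that should increase, such as accuracy. This is a simplified illustration with made-up histories, not the Keras source; the function name `early_stop_epoch` is an assumption.

```python
def early_stop_epoch(history, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end.

    An epoch counts as an improvement only if it beats the best value so far
    by more than min_delta; `patience` such epochs in a row trigger the stop.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value - min_delta > best:
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

stalled = early_stop_epoch([0.60, 0.65, 0.70, 0.70, 0.69, 0.70, 0.72])
rising = early_stop_epoch([0.60, 0.65, 0.70, 0.72, 0.75])
print(stalled, rising)
```

Training accuracy tends to keep rising as a model overfits, so monitoring a validation metric such as `val_loss` is the more common choice when validation data is available.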

In [97]:
my_model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 100, 500)          1000000   
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               240400    
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 101       
Total params: 1,240,501
Trainable params: 1,240,501
Non-trainable params: 0
_________________________________________________________________


8). Display the test result (accuracy):

In [104]:
my_model.evaluate(X_test,Y_test)




[0.6192383766174316, 0.675000011920929]
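`evaluate` returns the loss first and then the compiled metrics, so the output above reads as a test loss of about 0.619 and a test accuracy of 0.675. Accuracy alone hides which class the errors fall on. The sketch below shows how accuracy and the confusion counts follow from sigmoid outputs thresholded at 0.5; the helper names and the toy values are assumptions, and in the notebook the probabilities would come from `my_model.predict(X_test)`.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives after thresholding probabilities."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if pred == 1:
            counts["tp" if t == 1 else "fp"] += 1
        else:
            counts["tn" if t == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Fraction of correct predictions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Made-up labels and predicted probabilities, for illustration only.
toy_true = [1, 0, 1, 1, 0, 0]
toy_prob = [0.81, 0.40, 0.35, 0.66, 0.72, 0.10]
c = confusion_counts(toy_true, toy_prob)
print(c, accuracy(c))
```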