## WiP D213 Task 2: NLP Deep Learning

### Non-narrative steps

 #### 1. Select dataset
 #### 2. Data EDA/Preprocessing
        - Detection and handling of unusual characters (e.g., emojis and non-English characters)
        - Determine vocabulary size
        - Select word embedding length based on statistical inference
 #### 3. `Tokenization`/normalization
 #### 4. `Padding` (standardize `length of sequences`)
        - Pad as prefix or suffix
        - Pull out at least one example of padded sequence for observation
 #### 5. Determine number of `categories of sentiment` to use and `activation function` to use (specifically in the final Dense layer?)
 #### 6. Determine `train/test split` (<font size=2em>*model.fit(validation_split=`x`)*</font>)
 #### 7. Design the model in `TensorFlow`
        - Number of layers
        - Type of layers
        - Total number of parameters
 #### 8. Choice of `hyperparameters` including:
        - Activation function (see 5.)
              - 'relu'
        - Number of nodes per layer
              - TBD
        - Loss function
              - Probably 'categorical_crossentropy', but backup with reference to the literature
        - Optimizer
              - Probably 'adam', but same as above
        - EarlyStopping criteria
              - Maybe 2? Experiment and decide
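        - Evaluation metric
              - Accuracy, matching `metrics=['accuracy']` in the model below (MSE or RMSE suit regression more than classification)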
 #### 9. `Evaluate` the model
        - Find out how changing EarlyStopping(x) impacts the model vs number of training epochs
        - Plot a line graph of training vs validation curves per epoch, for both the loss and the evaluation metric (e.g., accuracy)
        - Determine model fitness vs overfitting
        - Quantify accuracy of the model
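
A minimal sketch of the EDA items in step 2 (unusual characters, vocabulary size, sequence lengths), using only the standard library. The three `reviews` strings, the helper names, and the whitespace tokenisation are assumptions made for illustration; in the notebook the input would be the review column.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be df.review.
reviews = [
    "A bit predictable.",
    "Loved it \U0001F600",              # ends with an emoji
    "Tr\u00e8s bon film, a bit slow.",  # contains a non-English character
]

def non_ascii_chars(texts):
    """Count every character outside the ASCII range across all texts."""
    return Counter(ch for text in texts for ch in text if ord(ch) > 127)

def vocabulary(texts):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    vocab = set()
    for text in texts:
        for token in text.lower().split():
            token = token.strip('.,!?;:"\'()')
            if token:
                vocab.add(token)
    return vocab

unusual = non_ascii_chars(reviews)
vocab = vocabulary(reviews)
lengths = sorted(len(text.split()) for text in reviews)

print("unusual characters:", dict(unusual))
print("vocabulary size:", len(vocab))
print("words per review (sorted):", lengths)
```

The sorted lengths are a starting point for choosing the padded sequence length in step 4, for example by picking a high percentile rather than the maximum.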

In [1]:
# Suppress TensorFlow C++ log output (level 3 hides info, warning, and error messages)
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Imports
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Initial checks
print(f'''
         TF ver: {tf.__version__}
     TF-hub ver: {hub.__version__}
Eager Execution: {tf.executing_eagerly()}
            GPU: {"available" if tf.config.list_physical_devices("GPU") else "not available"}
''')



         TF ver: 2.12.0
     TF-hub ver: 0.13.0
Eager Execution: True
            GPU: available



In [2]:
# Begin modeling
import numpy as np
import pandas as pd

# Train/validation/test split: first 60% of 'train', remaining 40% of 'train', and the full 'test' set
train_data, validation_data, test_data = tfds.load(
    name='imdb_reviews',
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
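
A quick check of what this split produces, assuming the standard `imdb_reviews` sizes of 25,000 training and 25,000 test examples (these sizes are not printed anywhere in this notebook). The batch size of 512 is the one used in the training and evaluation cells.

```python
import math

TRAIN_SIZE = 25_000  # assumed size of the 'train' split
TEST_SIZE = 25_000   # assumed size of the 'test' split
BATCH_SIZE = 512     # batch size used in model.fit / model.evaluate

n_train = TRAIN_SIZE * 60 // 100   # 'train[:60%]'
n_val = TRAIN_SIZE - n_train       # 'train[60%:]'

# Number of batches per pass over each subset (last batch may be partial).
steps_train = math.ceil(n_train / BATCH_SIZE)
steps_val = math.ceil(n_val / BATCH_SIZE)
steps_test = math.ceil(TEST_SIZE / BATCH_SIZE)

print(n_train, n_val, TEST_SIZE)
print(steps_train, steps_val, steps_test)
```

Under these assumptions the test set needs 49 batches, which agrees with the `49/49` steps reported by `model.evaluate` further down.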

In [3]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))

train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [6]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)

hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423194 , -0.01190171,  0.06337537,  0.0686297 , -0.16776839,
        -0.10581177,  0.168653  , -0.04998823, -0.31148052,  0.07910344,
         0.15442258,  0.01488661,  0.03930155,  0.19772716, -0.12215477,
        -0.04120982, -0.27041087, -0.21922147,  0.26517656, -0.80739075,
         0.25833526, -0.31004202,  0.2868321 ,  0.19433866, -0.29036498,
         0.0386285 , -0.78444123, -0.04793238,  0.41102988, -0.36388886,
        -0.58034706,  0.30269453,  0.36308962, -0.15227163, -0.4439151 ,
         0.19462997,  0.19528405,  0.05666233,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201319 , -0.04418665, -0.08550781,
        -0.55847436, -0.2333639 , -0.20782956, -0.03543065, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862677,  0.7753425 , -0.07667087,
        -0.15752274,  0.01872334, -0.08169781, -0.3521876 ,  0.46373403,
        -0.08492758,  0.07166861, -0.00670818,  0.12686071, -0.19326551,
 

In [7]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_2 (KerasLayer)  (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________
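
The Dense parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input/unit pair plus one bias per unit. The sketch below checks the numbers above; `dense_params` is a helper name introduced here, and the embedding count is copied from the summary rather than derived.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = 48_190_600      # keras_layer row, copied from model.summary()
hidden_params = dense_params(50, 16)   # 50-dim embedding into 16 units
output_params = dense_params(16, 1)    # 16 units into a single logit

total = embedding_params + hidden_params + output_params
print(hidden_params, output_params, total)
```

Almost all trainable parameters sit in the pre-trained embedding, which is worth keeping in mind when deciding whether `trainable=True` is appropriate for a small dataset.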


In [8]:
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])
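
Because the final Dense layer has no activation and the loss is built with `from_logits=True`, the model emits raw logits rather than probabilities. A minimal sketch of turning a logit into a probability and a class label follows; the example logits and the 0.5 threshold are assumptions for illustration, not values from this model.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def predict_label(logit, threshold=0.5):
    """1 (positive) if the probability reaches the threshold, else 0."""
    return int(sigmoid(logit) >= threshold)

example_logits = [-2.0, 0.0, 1.5]  # hypothetical model outputs
probs = [sigmoid(z) for z in example_logits]
labels = [predict_label(z) for z in example_logits]

for z, p, y in zip(example_logits, probs, labels):
    print(f"logit={z:+.1f}  prob={p:.3f}  label={y}")
```

A probability threshold of 0.5 corresponds to a logit of 0, so the sign of the raw output alone determines the predicted class.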

In [9]:
history = model.fit(train_data.shuffle(10000).batch(512), epochs=10, validation_data=validation_data.batch(512), verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
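
The plan mentions experimenting with the EarlyStopping patience (maybe 2). The plain-Python sketch below is a simplified stand-in for the patience rule, not the Keras implementation: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The loss values are made up for illustration. In Keras the equivalent would be passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)` via `callbacks=[...]` in `model.fit`.

```python
def stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.39]

for patience in (1, 2, 3):
    print(patience, stopping_epoch(val_losses, patience=patience))
```

With these numbers a patience of 1 or 2 stops before the late improvement at epoch 6, while a patience of 3 lets training run to the end, which is the kind of trade-off step 9 sets out to examine.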


In [22]:
results = model.evaluate(test_data.batch(512), verbose=2)

# Print each metric name with its value on the test set
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.3f}")

49/49 - 1s - loss: 0.3766 - accuracy: 0.8454 - 964ms/epoch - 20ms/step
loss: 0.377
accuracy: 0.845


In [126]:
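# Load in IMDB dataset (tab-separated: review text, then 0/1 sentiment label)
# A regex separator such as '\t+' requires the Python parsing engine
df = pd.read_csv('./data/imdb_labelled.txt', sep=r'\t+', header=None, names=['review', 'sentiment'], engine='python')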
df

,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
...,...,...
995,I just got bored watching Jessice Lange take h...,0
996,"Unfortunately, any virtue in this film's produ...",0
997,"In a word, it is embarrassing.",0
998,Exceptionally bad!,0


In [127]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [128]:
df.describe(include='all')

,review,sentiment
count,1000,1000.0
unique,997,
top,10/10,
freq,2,
mean,,0.5
std,,0.50025
min,,0.0
25%,,0.0
50%,,0.5
75%,,1.0
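
The `describe()` output implies a perfectly balanced label column: a mean of 0.5 over 1,000 binary labels means 500 positives and 500 negatives. The sketch below reproduces the reported standard deviation from that assumption, using the sample formula (divide by n - 1), which is what pandas uses by default.

```python
import math

n = 1000     # count from describe()
mean = 0.5   # mean of the 0/1 sentiment column

n_pos = round(n * mean)  # labels equal to 1
n_neg = n - n_pos        # labels equal to 0

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = (n_pos * (1 - mean) ** 2 + n_neg * (0 - mean) ** 2) / (n - 1)
std = math.sqrt(variance)

# Accuracy of always predicting the majority class.
baseline_accuracy = max(n_pos, n_neg) / n

print(n_pos, n_neg, round(std, 5), baseline_accuracy)
```

A balanced set puts the majority-class baseline at 0.5, so any model accuracy should be read against that floor.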


In [131]:
for i in df.review[:20]:
    print(i)

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  
Very little music or anything to speak of.  
The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  
The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  
Wasted two hours.  
Saw the movie today and thought it was a good effort, good messages for kids.  
A bit predictable.  
Loved the casting of Jimmy Buffet as the science teacher.  
And those baby owls were adorable.  
The movie showed a lot of Florida at it's best, made it look very appealing.  
The Songs Were The Best And The Muppets Were So Hilarious

In [61]:
tf.convert_to_tensor(df.review)

<tf.Tensor: shape=(748,), dtype=string, numpy=
array([b'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ',
       b'Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  ',
       b'Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  ',
       b'Very little music or anything to speak of.  ',
       b'The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  ',
       b"The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  ",
       b'Wasted two hours.  ',
       b'Saw the movie today and thought it was a good effort, good messages for kids.  ',
       b'A bit predictable.  ',
       b'Loved the casting of Jimmy Buffet as the science teacher.  ',
       b'And tho

In [147]:
for i in df.review:
    print(i.strip())

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.
Very little music or anything to speak of.
The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.
The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.
Wasted two hours.
Saw the movie today and thought it was a good effort, good messages for kids.
A bit predictable.
Loved the casting of Jimmy Buffet as the science teacher.
And those baby owls were adorable.
The movie showed a lot of Florida at it's best, made it look very appealing.
The Songs Were The Best And The Muppets Were So Hilarious.
It Was So Cool.
This i

In [164]:
import re

def strip_punctuation(sentences):
    """Lowercase each sentence in an iterable and drop punctuation and underscores."""
    word_list = []
    for sentence in sentences:
        w = re.sub(r'[^\w\s]', '', sentence)  # keep only word characters and whitespace
        w = re.sub(r'_', '', w)  # \w also matches underscores, so remove them separately
        w = w.lower()
        word_list.append(w)
    return word_list
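
Note that the function above loops over its argument, so it expects a collection of sentences; a single string would be processed one character at a time. A compact per-sentence variant is sketched here, folding both substitutions into one pattern. The name `clean` and the third sample string are assumptions added for illustration.

```python
import re

def clean(sentence):
    """Lowercase one sentence and drop punctuation and underscores."""
    # [^\w\s] removes punctuation; the extra |_ covers underscores,
    # which \w would otherwise keep.
    return re.sub(r'[^\w\s]|_', '', sentence).lower().strip()

reviews = [
    "Wasted two hours.  ",
    "A bit predictable.  ",
    "snake_case, really?",  # hypothetical, to exercise the underscore rule
]
cleaned = [clean(r) for r in reviews]
print(cleaned)
```

Applied to a DataFrame column, the same function could be used with `df.review.apply(clean)`.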

In [165]:
df.review[5]


"The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  "

In [168]:
re.sub(r'[^\w\s]', '', df.review[5]).lower().strip()

'the rest of the movie lacks art charm meaning if its about emptiness it works i guess because its empty'
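
Once the text is cleaned, steps 3 and 4 of the plan (tokenization and padding) can be sketched in plain Python. This is an illustration of the idea, not the Keras tokenizer: the helper names are hypothetical, ids are assigned in order of first appearance, and 0 is reserved for padding. Over-long sequences are cut from the end here, which is a simplifying assumption.

```python
def build_index(token_lists):
    """Assign each new token the next integer id, starting at 1."""
    index = {}
    for tokens in token_lists:
        for token in tokens:
            index.setdefault(token, len(index) + 1)  # 0 is kept for padding
    return index

def pad(seq, maxlen, padding="post", value=0):
    """Pad (or cut) a sequence to exactly maxlen items."""
    seq = seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq

cleaned = [
    "wasted two hours",
    "a bit predictable",
    "very little music or anything to speak of",
]
token_lists = [sentence.split() for sentence in cleaned]
index = build_index(token_lists)
sequences = [[index[t] for t in tokens] for tokens in token_lists]
maxlen = max(len(s) for s in sequences)

post_padded = [pad(s, maxlen, padding="post") for s in sequences]
pre_padded = [pad(s, maxlen, padding="pre") for s in sequences]

print(post_padded[0])  # padding as suffix
print(pre_padded[0])   # padding as prefix
```

Printing one padded sequence in each style covers the "pull out at least one example" item in step 4 and makes the prefix/suffix choice easy to compare.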

In [108]:
import string

# for i in df.review[19:20]:
#     print(i)
print(df.review[19])
# string.Template(df.review[2]).safe_substitute()

 The structure of this film is easily the most tightly constructed in the history of cinema.  	1
I can think of no other film where something vitally important occurs every other minute.  	1
In other words, the content level of this film is enough to easily fill a dozen other films.  	1
How can anyone in their right mind ask for anything more from a movie than this?  	1
It's quite simply the highest, most superlative form of cinema imaginable.  	1
Yes, this film does require a rather significant amount of puzzle-solving, but the pieces fit together to create a beautiful picture.  	1
This short film certainly pulls no punches.  	0
Graphics is far from the best part of the game.  	0
This is the number one best TH game in the series.  	1
It deserves strong love.  	1
It is an insane game.  	1
There are massive levels, massive unlockable characters... it's just a massive game.  	1
Waste your money on this game.  	1
This is the kind of money that is wasted properly.  	1
Actually, the graphic

In [110]:
df.review[20]


"This if the first movie I've given a 10 to in years.  "