## Homework 3: Sentiment Classification

**Jackson Rudoff**

*April 17, 2023*

### The Data, the Models, and Why Bother?

For this week's project, we're looking into ways to model sequential data. The first examples we discussed were, of course, things like video and audio, which have a highly time-oriented structure and can be divided into slices quite easily.

Text classification may not necessarily be something we think of as being inherently sequential. We could, for example, build some decent predictive models that just focus on texts as purely a bag of words, and try to extract meaningful sentiment information with those words as individual units. But words don't really happen in isolation, natural language occurs as a stream, meaning that we may be able to get better predictions by representing this language sequentially. Of course, it is also entirely possible that the sequence doesn't reveal much about the semantics; however, it's definitely worth trying, as there may be some multi-word patterns that are especially valuable for prediction.

With this in mind, let's look at the data that we're working with. First we'll read in the ```aimodelshare``` library and download the competition data:

In [63]:
### Read in modelshare library

! pip install aimodelshare==0.0.189

Defaulting to user installation because normal site-packages is not writeable


In [None]:
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [65]:
# Borrow code from example notebook to get workable data

import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

X_train=pd.read_csv("sst2_competition_data/X_train.csv", squeeze=True)
X_test=pd.read_csv("sst2_competition_data/X_test.csv", squeeze=True)

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv", squeeze=True)

# Set up one-hot encoding
y_train = pd.get_dummies(y_train_labels)


'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .'

Let's take a look at a few observations:

In [70]:
print(X_train[4])
print(X_train[67])
print(X_train[445])

Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .
The production has been made with an enormous amount of affection , so we believe these characters love each other .
This documentary is a dazzling , remarkably unpretentious reminder of what ( Evans ) had , lost , and got back .


So, our observations are not *super* long, and generally don't have a lot of words for us to extract information from. This is perhaps why a sequential approach might be useful, as it provides an additional dimension to accommodate what are relatively brief extracts.


The question remains, what is valuable about a model like this? To begin with, the most obvious and related example would be a review aggregation site. It's possible to do this already using reviewers' provided scores, but the inherent flaw in this system is that people usually provide their own rating arbitrarily. Sites like Letterboxd, for example, are saturated with reviews that are hyperbolic or intended to be comedic, with users attaching 1 or 2 stars based on the point they want to make. But sites like this could easily benefit from having an in-house model that can provide a "true" negative/positive rating based on a classification of the actual review, rather than just the number of stars a user enters. This could maybe help ward off issues like review bombing and trolling, and give site users a secondary rating that goes beyond the typical, skewed star-based ratings.

Generally, these text-based models are extremely useful for tasks involving language assessment. A sentiment classifier can be very useful for research in the field of linguistics. One of the major challenges in the study of language is the difficulty in coding speech beyond just the words a subject uses. If you are trying to do research on how respondents' answers to questions vary based on a certain kind of priming, for example, the biggest hurdle isn't eliciting the response, but coding it. Models incorporating embeddings (like what we're doing here) could provide more operationalized and consistent classifications of a respondent's answers, and offload very tedious work that may otherwise be undermined by qualitative coding issues. 

Let's start experimenting with some models, and see what we can get rolling here.

### MODEL 1: Basic LSTM

The first model we're gonna run is going to be a fairly basic `LSTM()` model. We will first need to tokenize our reviews and construct word embeddings. The code provided in the example notebook will help us get these embeddings computed, but I'm going to tweak some of the parameters here and experiment. We're going to run with a max sample length of 60, and work with 10 output features to start.

In [2]:
# Read libraries
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Set variables for preprocessor and later code
vocab_size = 10000
max_length = 60
features = 10

In [5]:
# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(X_train)

# Use the packaged preprocessor; I'm going to adjust the maxlen 100
def preprocessor(data, maxlen=50, max_words=10000):

    sequences = tokenizer.texts_to_sequences(data)

    word_index = tokenizer.word_index
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train, maxlen=max_length, max_words = vocab_size).shape)
print(preprocessor(X_test, maxlen=max_length, max_words = vocab_size).shape)

(6920, 60)
(1821, 60)


We're going to run a fairly minimal example here, with an embeddings layer feeding into an LSTM layer with a recurrent dropout of 0.2 to try and improve its generalizability.

In [71]:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

vocab_size = 10000

# Define the model architecture
model_one = Sequential([
    Embedding(input_dim=vocab_size, output_dim=features, input_length=max_length),
    LSTM(64, recurrent_dropout=0.2),
    Flatten(),
    Dense(2, activation='softmax')
])

# Compile the model
model_one.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

model_one.summary()


Model: "sequential_36"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_39 (Embedding)    (None, 60, 10)            100000    
                                                                 
 lstm_31 (LSTM)              (None, 64)                19200     
                                                                 
 flatten_15 (Flatten)        (None, 64)                0         
                                                                 
 dense_36 (Dense)            (None, 2)                 130       
                                                                 
Total params: 119,330
Trainable params: 119,330
Non-trainable params: 0
_________________________________________________________________


In [72]:
# Train the model
model_one.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x21dfe1057f0>

From the basic stats, it looks like we did okay for a fairly low-context model. Our validation accuracy got to around 75% at one point; however, the validation loss increased substantially with each epoch, which means we might want to consider some form of monitoring for that. When we submit this to the challenge we'll be able to see how that affected its overall peformance. 

In [73]:
# Save Preprocessor
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"")


# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model_one, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_one.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())


# Login
from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)


#Instantiate Competition

mycompetition= ai.Competition(apiurl)


#Submit model_one
prediction_column_index=model_one.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_one.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)



Your preprocessor is now saved to 'preprocessor.zip'
AI Model Share login credentials set successfully.

Your model has been submitted as model version 383

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


And we can check the leaderboard:

In [74]:
# Get leaderboard

condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 383]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
120,0.780461,0.779924,0.783343,0.780516,keras,,True,Sequential,4.0,119330.0,...,1.0,1.0,,str,Adam,478192.0,,jer2240,2023-04-18 00:37:25.382011,383


This actually isn't bad for a first attempt, but we hopefully could improve on this a bit and get our accuracy into the 80s. 

### Model 2: Adding Layers

Like with models already run in this class, we can take a convolution approach to these models to try and extract more information from the embeddings. I'm going to replicate the structure we used previously, with some stacked `Conv1D` layers feeding into `MaxPooling` layers.

In [77]:
import tensorflow as tf
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
from tensorflow.keras.callbacks import ReduceLROnPlateau


# Define the model architecture
model_two = Sequential([
    Embedding(input_dim=vocab_size, output_dim=features, input_length=max_length),
    Conv1D(filters=32, kernel_size=7, padding='same', activation='relu'),
    Conv1D(filters=32, kernel_size=7, padding='same', activation='relu'),
    MaxPooling1D(5),
    Conv1D(filters=64, kernel_size=7, padding='same', activation='relu'),
    Conv1D(filters=64, kernel_size=7, padding='same', activation='relu'),
    GlobalMaxPooling1D(),
    Dense(2, activation='softmax')
])

# Compile the model
model_two.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
plateau_check = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)

model_two.summary()

Model: "sequential_38"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_41 (Embedding)    (None, 60, 10)            100000    
                                                                 
 conv1d_226 (Conv1D)         (None, 60, 32)            2272      
                                                                 
 conv1d_227 (Conv1D)         (None, 60, 32)            7200      
                                                                 
 max_pooling1d_101 (MaxPooli  (None, 12, 32)           0         
 ng1D)                                                           
                                                                 
 conv1d_228 (Conv1D)         (None, 12, 64)            14400     
                                                                 
 conv1d_229 (Conv1D)         (None, 12, 64)            28736     
                                                     

Not a ridiculous amount of depth here, but let's see how it goes. Two other adjustments here: we're going to increase our epochs, add a plateau_check callback, and fix our optimizer to be `RMSprop` instead of `adam`.

In [78]:
model_two.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=20, batch_size=32, callbacks=[plateau_check])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 7: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 10: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 13: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 16: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 19: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.
Epoch 20/20


<keras.callbacks.History at 0x21de539e2b0>

Looks like we reached an upper limit in our validation accuracy of around 76.45% within our split; I also wish I had put in a feature to recall the best weights. At any rate, let's submit and see how it performed:

In [80]:
# Save model_two

onnx_model = model_to_onnx(model_two, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_two.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



#Submit model_two
prediction_column_index=model_two.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_two.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)




Your model has been submitted as model version 386

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


And to see how we did:

In [81]:
condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 386]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
69,0.793633,0.792933,0.797806,0.793698,keras,,True,Sequential,8.0,152738.0,...,1.0,,4.0,str,RMSprop,612544.0,,jer2240,2023-04-18 00:54:12.859599,386


This was a substantial improvement over our last model, with our accuracy scraping 80% and putting us firmly into the upper quadrant of the leaderboard.

### Model 3: Transferring Weights that Fit Like a Glove

Like with the image classification problems, we can also build models with pre-trained weights. For this challenge, we're going to bring in the "Glove" weights as provided through the example notebook.

In [82]:
### Read in pre-trained Glove weights

import requests
import zipfile
import io
import os
glove_dir = os.getcwd()


# Set the URL of the zip file you want to download
url = "http://nlp.stanford.edu/data/wordvecs/glove.6B.zip"

# Download the zip file and read its contents into memory
r = requests.get(url)
zip_contents = io.BytesIO(r.content)

# Create a ZipFile object from the in-memory contents of the zip file
zip_file = zipfile.ZipFile(zip_contents)

# Extract all the files in the zip file to a folder
zip_file.extractall(glove_dir)

# Close the ZipFile object
zip_file.close()


We then need to rebuild an embeddings index for the new weights, like we did from our initial vectorization of the competition data. 

In [11]:
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


For this model, I'm going to work with the 200 feature version of the embedding weights.

In [12]:
# Build embedding matrix
embedding_dim = 200 # change if you use txt files using larger number of features

word_index = tokenizer.word_index

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < vocab_size:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

Alright, let's get things fed into the model. I really want to keep working with the `LSTM` to get practice with optimizing it, so I am going to make some more tweaks. First, I'm going to add a checkpoint to keep the best performing weights in terms of validation accuracy. We're also going to increase the number of units and keep the whole sequences.

In [112]:
import tensorflow as tf
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Reshape
from tensorflow.keras.callbacks import ModelCheckpoint


# Define the model architecture
model_three_transfer = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length,
              weights=[embedding_matrix], trainable=False),
    LSTM(128, recurrent_dropout=0.2),
    Flatten(),
    Dense(2, activation='softmax')
])

# Compile the model
model_three_transfer.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# add checkpoints
checkpoint = ModelCheckpoint('best_model_three.h5', monitor='val_accuracy', mode='max', save_best_only=True, verbose=1)

model_three_transfer.summary()

Model: "sequential_58"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_63 (Embedding)    (None, 60, 100)           1000000   
                                                                 
 lstm_59 (LSTM)              (None, 128)               117248    
                                                                 
 flatten_34 (Flatten)        (None, 128)               0         
                                                                 
 dense_58 (Dense)            (None, 2)                 258       
                                                                 
Total params: 1,117,506
Trainable params: 117,506
Non-trainable params: 1,000,000
_________________________________________________________________


In [113]:
model_three_transfer.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=20, batch_size=32, callbacks = [checkpoint])

Epoch 1/20
Epoch 1: val_accuracy improved from -inf to 0.84393, saving model to best_model_three.h5
Epoch 2/20
Epoch 2: val_accuracy did not improve from 0.84393
Epoch 3/20
Epoch 3: val_accuracy did not improve from 0.84393
Epoch 4/20
Epoch 4: val_accuracy did not improve from 0.84393
Epoch 5/20
Epoch 5: val_accuracy did not improve from 0.84393
Epoch 6/20
Epoch 6: val_accuracy did not improve from 0.84393
Epoch 7/20
Epoch 7: val_accuracy did not improve from 0.84393
Epoch 8/20
Epoch 8: val_accuracy did not improve from 0.84393
Epoch 9/20
Epoch 9: val_accuracy did not improve from 0.84393
Epoch 10/20
Epoch 10: val_accuracy did not improve from 0.84393
Epoch 11/20
Epoch 11: val_accuracy did not improve from 0.84393
Epoch 12/20
Epoch 12: val_accuracy did not improve from 0.84393
Epoch 13/20
Epoch 13: val_accuracy did not improve from 0.84393
Epoch 14/20
Epoch 14: val_accuracy did not improve from 0.84393
Epoch 15/20
Epoch 15: val_accuracy did not improve from 0.84393
Epoch 16/20
Epoch 16

<keras.callbacks.History at 0x21e1dfd4610>

In [117]:
from tensorflow.keras.models import load_model

# Save model_three

model_three_best = load_model("best_model_three.h5")

onnx_model = model_to_onnx(model_three_best, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_three.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



#Submit model_two
prediction_column_index=model_three_transfer.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_three.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)




Your model has been submitted as model version 392

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [120]:
# Get raw data in pandas data frame
condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 392]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
54,0.798024,0.797847,0.799148,0.798058,keras,,True,Sequential,4.0,1117506.0,...,1.0,1.0,,function,RMSprop,4470896.0,,jer2240,2023-04-18 01:50:35.620872,392


A little more improvement! After some discussion, we will hopefully have a better idea on what to tweak next.

### Model 4: Mixing and Matching Architectures

After talking with my teammates, we all found a few things to be true across our projects. One, models using `Conv1D` layers appeared to be performing the best, followed closely behind by my `LSTM` model with Glove weights. Two, paying attention to callbacks is very important to maximize your model's performance. And lastly, `RMSprop` is deinifitely the way to go for an optimizer. 

Interestingly, we had mixed success using the pre-trained weights. One teammate found them to be very effective, another said they re-ran their Glove model a few times and never got great results, albeit with the same number of available weights. Something worth exploring here is whether fewer weights does serve to improve accuracy. We also generally weren't sure of whether the number of embedding outputs and LSTM/RNN units was something to increase or decrease, since we used the same values for the most part. We all agreed that adding more units would certainly help depth and immediate accuracy, but could just be opening up the opportunity for overfitting. 

All of us moved forward with the same few parameters in mind. For one, we were all going to keep adding `Dropout` layers or parameters, since they seemed to be key to achieving consistent generalizability. Additionally, concluding a `Conv1D`-based model with a `GlobalMaxPooling` layer performed very well. I also wanted to experiment with `EarlyStopping`, as this proved very effective for one teammate in making sure they didn't incur too much validation loss.

Model four's major architectural change is going to be greater depth from stacking together more `Conv1D` layers. I'm also going to work in the early stopping, and go back to our original competition data to embeddings weights. 

In [126]:
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.callbacks import EarlyStopping

max_length = 60
embedding_dim = 7

# Define the model architecture
model_four = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'),
    Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'),
    MaxPooling1D(3),
    Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'),
    Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'),
    MaxPooling1D(3),
    Conv1D(filters=64, kernel_size=5, padding='same', activation='relu'),
    Conv1D(filters=64, kernel_size=5, padding='same', activation='relu'),
    MaxPooling1D(3),
    Conv1D(filters=64, kernel_size=5, padding='same', activation='relu'),
    Conv1D(filters=64, kernel_size=5, padding='same', activation='relu'),
    GlobalMaxPooling1D(),
    Dropout(0.33),
    Dense(2, activation='softmax')
])

# Compile the model
model_four.compile(optimizer='RMSprop', loss='binary_crossentropy', metrics=['accuracy'])

# add checkpoints
early_stopping = EarlyStopping(monitor='val_loss', patience=4)

model_four.summary()

Model: "sequential_62"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_67 (Embedding)    (None, 60, 7)             70000     
                                                                 
 conv1d_254 (Conv1D)         (None, 60, 32)            1152      
                                                                 
 conv1d_255 (Conv1D)         (None, 60, 32)            5152      
                                                                 
 max_pooling1d_118 (MaxPooli  (None, 20, 32)           0         
 ng1D)                                                           
                                                                 
 conv1d_256 (Conv1D)         (None, 20, 32)            5152      
                                                                 
 conv1d_257 (Conv1D)         (None, 20, 32)            5152      
                                                     

In [127]:
model_four.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=30, batch_size=32, callbacks = [early_stopping])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30


<keras.callbacks.History at 0x21e18e28190>

Looks pretty good, let's see how it fares with the real stuff:

In [129]:
# Save model_four


onnx_model = model_to_onnx(model_four, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_four.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



#Submit model_four
prediction_column_index=model_four.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_four.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)



Your model has been submitted as model version 393

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [133]:
# Get raw data in pandas data frame
condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 393]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
11,0.81449,0.814457,0.814676,0.814476,keras,,True,Sequential,15.0,158674.0,...,1.0,,8.0,str,RMSprop,637376.0,,jer2240,2023-04-18 02:08:08.549804,393


### Model 5: Bringing Back the Weights, Adding More Depth

We've managed to home in on a great `Conv1D` model, so what if we tried tweaking our Glove + LSTM approach to maximize performance within that architecture. We'll bring down our maximum length and go down a stage of Glove weights to 100. We'll have to rebuild the embeddings index:

In [130]:
max_length = 50
# Build new embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

word_index = tokenizer.word_index

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < vocab_size:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

And then we're going to try our redesigned `LSTM` model (reducing the units), with our adjusted Glove parameter and a slightly higher dropout rate. We're also again going to including early stopping again, with a slightly higher patience.

In [143]:
from tensorflow.keras.callbacks import EarlyStopping

# Define the model architecture
model_five_transfer = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length, weights=[embedding_matrix], trainable=False),
    LSTM(32, recurrent_dropout=.5),
    Flatten(),
    Dense(2, activation='softmax')
])

# Compile the model
model_five_transfer.compile(optimizer='RMSprop', loss='binary_crossentropy', metrics=['accuracy'])

# add checkpoints
early_stoppping = EarlyStopping(patience=6, monitor='val_loss', verbose=1)

model_five_transfer.summary()

Model: "sequential_66"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_71 (Embedding)    (None, 50, 100)           1000000   
                                                                 
 lstm_63 (LSTM)              (None, 32)                17024     
                                                                 
 flatten_38 (Flatten)        (None, 32)                0         
                                                                 
 dense_66 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,017,090
Trainable params: 17,090
Non-trainable params: 1,000,000
_________________________________________________________________


In [144]:
model_five_transfer.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=10, batch_size=32, callbacks = [early_stopping])

Epoch 1/10
Epoch 1: val_accuracy did not improve from 0.80347
Epoch 2/10
Epoch 2: val_accuracy did not improve from 0.80347
Epoch 3/10
Epoch 3: val_accuracy did not improve from 0.80347
Epoch 4/10
Epoch 4: val_accuracy did not improve from 0.80347
Epoch 5/10
Epoch 5: val_accuracy did not improve from 0.80347
Epoch 6/10
Epoch 6: val_accuracy did not improve from 0.80347
Epoch 7/10
Epoch 7: val_accuracy did not improve from 0.80347
Epoch 8/10
Epoch 8: val_accuracy did not improve from 0.80347
Epoch 9/10
Epoch 9: val_accuracy did not improve from 0.80347
Epoch 10/10
Epoch 10: val_accuracy did not improve from 0.80347


<keras.callbacks.History at 0x21e9606ba60>

In [145]:
# Save model_five

onnx_model = model_to_onnx(model_five_transfer, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_five.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())


#Submit model_five
prediction_column_index=model_five_transfer.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_five.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)




Your model has been submitted as model version 394

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [148]:
# Get leaderboard

condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 394]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
97,0.780461,0.77665,0.801346,0.780605,keras,,True,Sequential,4.0,1017090.0,...,1.0,1.0,,str,RMSprop,4069232.0,,jer2240,2023-04-18 02:23:42.259191,394


### Final Model: Packing It All In

In [161]:


model_final = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length, weights=[embedding_matrix], trainable=False),
    
    LSTM(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
    LSTM(32, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
    LSTM(16),
    Flatten(),
    Dense(2, activation='softmax')
])

# Compile the model
model_final.compile(optimizer='RMSprop', loss='binary_crossentropy', metrics=['accuracy'])

# Add plateau-checker
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.2, patience=5, min_lr=0.0001, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1, restore_best_weights=True)



model_final.summary()

Model: "sequential_71"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_76 (Embedding)    (None, 50, 100)           1000000   
                                                                 
 lstm_75 (LSTM)              (None, 50, 64)            42240     
                                                                 
 lstm_76 (LSTM)              (None, 50, 32)            12416     
                                                                 
 lstm_77 (LSTM)              (None, 16)                3136      
                                                                 
 flatten_43 (Flatten)        (None, 16)                0         
                                                                 
 dense_71 (Dense)            (None, 2)                 34        
                                                                 
Total params: 1,057,826
Trainable params: 57,826
Non-

In [162]:
model_final.fit(preprocessor(X_train, maxlen=max_length, max_words = vocab_size), y_train, validation_split=0.2, epochs=30, batch_size=32, callbacks = [early_stopping, reduce_lr])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30

Epoch 14: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 14: early stopping


<keras.callbacks.History at 0x21e97ab0040>

In [163]:
# Save final model

onnx_model = model_to_onnx(model_final, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model_final.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())


#Submit model_two
prediction_column_index=model_final.predict(preprocessor(X_test, maxlen=max_length, max_words = vocab_size)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]


mycompetition.submit_model(model_filepath = "model_final.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)




Your model has been submitted as model version 396

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [164]:
# Get leaderboard

condensed_leaderboard = mycompetition.get_leaderboard()

condensed_leaderboard[condensed_leaderboard['version'] == 396]

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,...,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,timestamp,version
107,0.78595,0.78594,0.785985,0.785943,keras,,True,Sequential,6.0,1057826.0,...,1.0,3.0,,str,RMSprop,4232976.0,,jer2240,2023-04-18 03:05:23.562487,396


### Conclusions


There is a lot to be learned from this little experiments when it comes to using text data with a classifier. 

First, although it's definitely a valuable approach worth exploring, the complexity of an `LSTM` will not necessarily be the best approach for all language problems. The sequential aspect of the text data may not have been that important for classifying the overall sentiment of each review/headline. Everyone on my team had fairly effective Conv1D models and did not find an LSTM structure that was able to improve on it, though maybe more time to exlore hyperparameters would have revealed something in the smaller details.

Additonally, all of these models benefitted greatly from callbacks and the "in-between" layers/parameters. For example, adding `Dropoff` both in the `LSTM` and the `Conv1D` models were essential for good generalizing to the competition Test data. Similarly, structuring the `Conv1D` models around a `GlobalMaxPooling` instead of a `Flatten()` proved to bring out greater performance. Though this was a valuable lesson from the previous project as well, these test models really demonstrated how important it is to not just get caught up in feeding the depth of your model, but also *extending* this depth to be as flexible as possible. 

Generally, this task was far more demanding of a design that was well thought-through. Working with Dropoff, keeping sequences, and learning rate reducers didn't just automatically make the model better; in the case of learning rate, for example, it may have conflicted with the function of `EarlyStopping` and prevented that from being as useful as possible. EarlyStopping was also a valuable design lesson, as it required a bit of tweaking to reach the balance between overfit prevention and sacrificed training time. Given that there are differing approaches you could take with representing this data, being mindful of these kinds of trade-offs will help you construct a model architecture that isn't subtley working against itself. With my final model, for example, I actually re-ran it's initial training period a few times to see how different EarlyStopping timings affected its performance; I observed that there definitely was a distinct possibility that too early of a cutoff will prevent your model from learning everything it needs to know about your data. 

I have a background in linguistics, and spent a lot of time learning much more qualitative attempts at modeling language. Implementing these models was fascinating because it made me consider both how much information you could extract from words alone, but also what you're still losing by being constrained to them. I can definitely see the advantages of an `LSTM` when it comes to trying to capture consistent multi-word patterns in how we express appreciation or contempt. But I wonder if capturing these patterns may sometimes create noise, especially when you have fairly condensed text extracts like the ones we were training with. 