# Homework 3

## Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.

The SST-2 (Stanford Sentiment Treebank, binary version) competition dataset is a popular benchmark for sentiment analysis. It contains sentences taken from movie reviews, each labeled with its sentiment (positive or negative). The original treebank represents each sentence as a parse tree with a sentiment label on every node, but the competition data used here is plain text with one label per sentence. It is split into a training set of 6,920 sentences and a test set of 1,821 sentences.

Building a predictive model using the SST-2 dataset can be practically useful in a number of ways. One of the most straightforward applications is to use such a model to automatically classify the sentiment of new movie reviews, allowing businesses to monitor customer feedback and sentiment in near-real-time. 

In addition to businesses, a sentiment analysis model based on the SST-2 dataset can also benefit consumers by providing them with more personalized and relevant recommendations. For example, a movie streaming platform can use sentiment analysis to suggest movies to users based on their previous viewing history and the sentiment of the movies they have enjoyed in the past.

## Run at least three prediction models to try to predict the SST sentiment dataset well.

In [16]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [22]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [19]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

# The squeeze argument of read_csv was removed in pandas 2.0; .squeeze("columns") returns the same Series
X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the y labels
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
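
To make the one-hot step concrete, here is a minimal pure-Python sketch of the encoding that `pd.get_dummies` produces and of how an argmax over predicted probabilities maps back to a label. The label strings and probabilities are made-up placeholders, not values taken from the competition files.

```python
# Illustrative only: these label names and probabilities are hypothetical.
labels = ["Positive", "Negative", "Positive"]

# Column order: sorted unique labels (the same ordering pandas uses for string labels)
columns = sorted(set(labels))  # ['Negative', 'Positive']

# One-hot encode: one row per label, with a 1 in the matching column
one_hot = [[int(label == col) for col in columns] for label in labels]

def decode(probabilities):
    """Map each row of class probabilities to the label with the highest score."""
    decoded = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)  # argmax
        decoded.append(columns[best_index])
    return decoded

predicted = decode([[0.2, 0.8], [0.9, 0.1]])
print(one_hot)
print(predicted)
```

The decoding step plays the same role as indexing `y_train.columns` with the argmax result in the submission cells.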

## 2. Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [76]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and pads/truncates every document to the same length
def preprocessor(data, maxlen=40, max_words=10000):
    # NOTE: max_words is unused here; the vocabulary size is fixed by Tokenizer(num_words=10000) above

    sequences = tokenizer.texts_to_sequences(data)

    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
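
The fixed width of 40 comes from padding and truncation. With its default settings, `pad_sequences` pads with zeros at the front and, for long inputs, drops tokens from the front. The sketch below mimics that behavior in plain Python; the helper name `pad_pre` and the token ids are hypothetical and only serve as an illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded

# Hypothetical token ids: one sequence is too long, one too short
example = pad_pre([[5, 9, 2], [7]], maxlen=2)
print(example)
```

Because truncation happens at the front, a review longer than 40 tokens keeps its ending rather than its opening words.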


In [21]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


Model 1: Use an Embedding layer and Conv1d layers in at least one model



**Model version 292**

Accuracy: 80.68%	
f1-score: 80.56%

I used a vocabulary of 10,000 words and an embedding size of 16, followed by one Conv1D layer with 32 filters and a kernel size of 5. This was the strongest of my first three models.
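
The accuracy and F1-score quoted for each model version come from the competition leaderboard. As a rough sketch of how such figures can be computed from predictions, the code below implements accuracy and a macro-averaged F1-score in plain Python. Whether the leaderboard uses macro averaging is an assumption, and the labels are toy values.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy labels, not competition data
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```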

In [32]:
from tensorflow.keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Conv1D(32, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 40, 16)            160000    
                                                                 
 conv1d (Conv1D)             (None, 36, 32)            2592      
                                                                 
 global_max_pooling1d (Globa  (None, 32)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_2 (Dense)             (None, 2)                 66        
                                                                 
Total params: 162,658
Trainable params: 162,658
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
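
The shapes and parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic for this architecture; the variable names are my own and are used only for illustration.

```python
# Hyperparameters taken from the Model 1 code
vocab_size, embed_dim, seq_len = 10000, 16, 40
filters, kernel_size, num_classes = 32, 5, 2

# Embedding: one 16-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# Conv1D: each filter has kernel_size * embed_dim weights plus one bias
conv_params = (kernel_size * embed_dim + 1) * filters

# With no padding, a kernel of width 5 fits in 40 - 5 + 1 positions
conv_output_len = seq_len - kernel_size + 1

# Dense: one weight per (pooled feature, class) pair plus one bias per class
dense_params = filters * num_classes + num_classes

total_params = embedding_params + conv_params + dense_params
print(conv_output_len, embedding_params, conv_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the convolution itself adds very little capacity.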


In [7]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [85]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [86]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [10]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns class probabilities; argmax(axis=1) converts them to the predicted column index
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 292

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Model 2: Use an Embedding layer and LSTM layers in at least one model




**Model version: 293**

Accuracy: 59.17%	
f1-score: 52.78%

I used an embedding size of 16 and a vocabulary size of 10,000, with 64 LSTM units in the first layer and 32 in the second. The model may be too complex and might benefit from fewer LSTM units. It was also trained for only one epoch, which likely limits its performance. I will try a smaller first layer in my next set of models.

In [12]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model2 = Sequential()
model2.add(Embedding(10000, 16, input_length=40))
model2.add(LSTM(64, return_sequences=True, dropout=0.2))
model2.add(LSTM(32, dropout=0.2))
model2.add(Flatten())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)
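
# A note on the loss: this model pairs a two-unit softmax output with 'binary_crossentropy', while Model 1 uses 'categorical_crossentropy'. For one-hot labels and two outputs that sum to 1, the two losses should give the same value, assuming the binary loss is averaged over the two outputs. The plain-Python sketch below checks this on one made-up prediction (it ignores the small clipping Keras applies for numerical stability).

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    """Per-output binary cross-entropy, averaged over the outputs."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Hypothetical example: true class is column 0, softmax output is (0.7, 0.3)
y_true_example = [1, 0]
y_pred_example = [0.7, 0.3]

cce = categorical_crossentropy(y_true_example, y_pred_example)
bce = binary_crossentropy(y_true_example, y_pred_example)
print(cce, bce)
```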



In [13]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [14]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 293

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Model 3: Use transfer learning with glove embeddings for at least one of these models

**Model version: 306**

Accuracy: 68.61%	
f1-score: 68.26%

In [1]:
# What if we wanted to use a matrix of pretrained embeddings?  Same as transfer learning before, but now we are importing a pretrained Embedding matrix:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip


--2023-04-17 20:29:41--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-17 20:29:41--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-17 20:29:41--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [2]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [5]:
# Extract embedding data for 100 feature embedding matrix
import os
import numpy as np
glove_dir = os.getcwd()

embeddings_index = {}
# Open with an explicit encoding and a context manager so the file is always closed
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [78]:
# Build embedding matrix

embedding_dim = 100 # change if you use txt files using larger number of features
max_words = 10000 
word_index = tokenizer.word_index


embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index keep their all-zeros row.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
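
Since words missing from GloVe are left as all-zero rows, it can be worth checking how much of the vocabulary is actually covered. The sketch below runs the same loop on a tiny made-up vocabulary and reports the coverage. Every name and value in it is hypothetical and merely stands in for `tokenizer.word_index` and `embeddings_index`.

```python
# Hypothetical stand-ins for tokenizer.word_index and the GloVe lookup
toy_word_index = {"the": 1, "movie": 2, "charming": 3, "zzyzx": 4, "rare": 5}
toy_embeddings = {"the": [0.1, 0.2], "movie": [0.3, 0.4], "charming": [0.5, 0.6]}
toy_max_words, toy_dim = 5, 2

# Row 0 is never assigned because Keras word indices start at 1 (0 is padding)
matrix = [[0.0] * toy_dim for _ in range(toy_max_words)]
found = 0
for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # "rare" (index 5) falls outside the vocabulary cap
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = list(vector)
        found += 1

# Share of in-vocabulary words (indices 1..toy_max_words-1) that have a vector
coverage = found / (toy_max_words - 1)
print(f"coverage: {coverage:.0%}")
```

A low coverage figure on the real vocabulary would mean many inputs reach the dense layers as zeros, which could limit what frozen embeddings can contribute.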

In [82]:
# Set up a simple Embedding + Flatten + Dense architecture, then import GloVe weights into the Embedding layer:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten



model3 = tf.keras.Sequential()
model3.add(tf.keras.layers.Embedding(10000, embedding_dim, input_length=40))
model3.add(tf.keras.layers.Flatten())
model3.add(tf.keras.layers.Dense(32, activation='relu'))
model3.add(tf.keras.layers.Dense(2, activation='sigmoid'))  # note: 'softmax' is the more usual pairing with categorical_crossentropy
model3.summary()

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_20 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_14 (Flatten)        (None, 4000)              0         
                                                                 
 dense_34 (Dense)            (None, 32)                128032    
                                                                 
 dense_35 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,128,098
Trainable params: 1,128,098
Non-trainable params: 0
_________________________________________________________________


In [83]:
# Add weights in the same manner as transfer learning and turn off the trainable option before fitting the model to freeze the weights.
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False



model3.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
model3.save_weights('pre_trained_glove_model.h5')


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
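
The model summary shown earlier was printed before the embedding layer was frozen, so it lists every parameter as trainable. As a quick check of what should remain trainable once the GloVe layer is frozen, the sketch below redoes the count. The layer names in the dictionary are labels of my own, not the names Keras assigns.

```python
# Parameter counts derived from the layer shapes in the Model 3 summary
layer_params = {
    "embedding": 10000 * 100,        # GloVe matrix: 10,000 words x 100 dimensions
    "dense_hidden": 4000 * 32 + 32,  # flattened 40 x 100 input into 32 units
    "dense_output": 32 * 2 + 2,      # 32 units into 2 classes
}
frozen_layers = {"embedding"}        # layers with trainable = False

trainable = sum(n for name, n in layer_params.items() if name not in frozen_layers)
non_trainable = sum(n for name, n in layer_params.items() if name in frozen_layers)
print(trainable, non_trainable, trainable + non_trainable)
```

Under these assumptions, only about 11% of the parameters are updated during training.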


In [88]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 306

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Looking at all three models, the Conv1D model performed best by far (80.68% accuracy, versus 59.17% for the LSTM model and 68.61% for the GloVe model). For the next part of this report, I will use advice from my teammates to improve my models.

Specifically, I will:

1.  Increase number of epochs for the Conv1d model
2.  Decrease my first LSTM layer to 32 units
3.  Add another dense layer to the transfer learning model



Model 4: New Conv1d model

**Model version 309**

Accuracy: 79.80%	
f1-score: 79.69%

In [91]:
from tensorflow.keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

model4 = Sequential()
model4.add(Embedding(10000, 16, input_length=40))
model4.add(Conv1D(32, 5, activation='relu'))
model4.add(GlobalMaxPooling1D())
model4.add(Dense(2, activation='softmax'))
model4.summary()

model4.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model4.fit(preprocessor(X_train), y_train,
                    epochs=15,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_24"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_22 (Embedding)    (None, 40, 16)            160000    
                                                                 
 conv1d_6 (Conv1D)           (None, 36, 32)            2592      
                                                                 
 global_max_pooling1d_6 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_37 (Dense)            (None, 2)                 66        
                                                                 
Total params: 162,658
Trainable params: 162,658
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Ep

In [94]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#Submit Model 4: 

#-- Generate predicted y values (Model 4)
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 309

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Model 5: New LSTM model

**Model version 310**

Accuracy: 79.39%	
f1-score: 80.78%

In [93]:
model5 = Sequential()
model5.add(Embedding(10000, 16, input_length=40))
model5.add(LSTM(32, return_sequences=True, dropout=0.2))
model5.add(LSTM(32, dropout=0.2))
model5.add(Flatten())
model5.add(Dense(2, activation='softmax'))

model5.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model5.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
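
To put the "too complex" concern about Model 2 in numbers, the sketch below compares the LSTM parameter counts of Model 2 (64 then 32 units) and Model 5 (32 then 32 units), using the standard formula for an LSTM layer with biases. The helper name `lstm_params` is my own.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units + 1) * units)

embed_dim = 16  # embedding size used by both models

# Model 2: LSTM(64) followed by LSTM(32)
model2_lstm = lstm_params(embed_dim, 64) + lstm_params(64, 32)

# Model 5: LSTM(32) followed by LSTM(32)
model5_lstm = lstm_params(embed_dim, 32) + lstm_params(32, 32)

print(model2_lstm, model5_lstm)
```

Model 5's recurrent layers have fewer than half the parameters of Model 2's. Both counts are still small next to the 160,000-parameter embedding layer, so the longer training (10 epochs instead of 1) probably matters as well.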


In [95]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model5.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#Submit Model 5: 

#-- Generate predicted y values (Model 5)
prediction_column_index=model5.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 5 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 310

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Model 6: New transfer learning model

**Model version: 311**

Accuracy: 69.15%	
f1-score: 69.15%

In [96]:
model6 = tf.keras.Sequential()
model6.add(tf.keras.layers.Embedding(10000, embedding_dim, input_length=40))
model6.add(tf.keras.layers.Flatten())
model6.add(tf.keras.layers.Dense(32, activation='relu'))
model6.add(tf.keras.layers.Dense(32, activation='relu'))
model6.add(tf.keras.layers.Dense(2, activation='sigmoid'))  # note: 'softmax' is the more usual pairing with categorical_crossentropy
model6.summary()

Model: "sequential_26"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_24 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_16 (Flatten)        (None, 4000)              0         
                                                                 
 dense_39 (Dense)            (None, 32)                128032    
                                                                 
 dense_40 (Dense)            (None, 32)                1056      
                                                                 
 dense_41 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,129,154
Trainable params: 1,129,154
Non-trainable params: 0
_________________________________________________________________


In [97]:
model6.layers[0].set_weights([embedding_matrix])
model6.layers[0].trainable = False



model6.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
history = model6.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
model6.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [98]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model6.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#Submit Model 6: 

#-- Generate predicted y values (Model 6)
prediction_column_index=model6.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 6 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 311

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Overall, the hyperparameter changes suggested by my groupmates improved two of my three models.

For the Conv1D model, increasing the number of epochs from 10 to 15 slightly decreased accuracy (80.68% to 79.80%) and F1-score (80.56% to 79.69%). This could be due to overfitting. Either way, the Conv1D model was already my best-performing one overall.
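
If overfitting is the cause, one common remedy is early stopping: halt training once the validation loss stops improving. The sketch below shows the idea in plain Python. The loss values are invented for illustration and are not from my training runs, and the helper `best_epoch` is hypothetical. In Keras, the `tf.keras.callbacks.EarlyStopping` callback serves this purpose.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, stopping
    once `patience` consecutive epochs fail to improve on it."""
    best_loss = float("inf")
    best_index = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_index, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_index

# Invented validation losses: improvement stalls after the fourth epoch
hypothetical_val_losses = [0.62, 0.51, 0.47, 0.46, 0.48, 0.50, 0.55]
stop_at = best_epoch(hypothetical_val_losses)
print(stop_at)
```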

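For the other two models, the modifications improved performance. The LSTM model's accuracy rose from 59.17% to 79.39%, although that run also trained for 10 epochs instead of 1, so the gain cannot be attributed to the smaller first layer alone. The transfer learning model improved only marginally, from 68.61% to 69.15%. Neither matched the accuracy of the original Conv1D model, though the new LSTM model's F1-score (80.78%) was slightly higher than the Conv1D model's (80.56%).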

Hence, I would conclude that the most effective configuration in these experiments was a Conv1D(32, 5, activation='relu') layer trained for 10 epochs with a batch size of 32.