Ivy Wong

xw2860

Repo: https://github.com/ivster/adv_ml_proj

Your final report should be written up in a Jupyter notebook. It should be posted to a public GitHub repo as an ipynb and submitted to this assignment via Courseworks. Please include the link to your GitHub repo in this ipynb file.

Use the deep learning and sklearn example ipynb notebooks from the Week 11 folder for example submission code.

In [4]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [2]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

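# read_csv's squeeze argument was removed in pandas 2.0; .squeeze("columns") returns the same Series
X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")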

# ohe encode Y data
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

In [3]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is unused here; the vocabulary cap is set on the tokenizer above
    sequences = tokenizer.texts_to_sequences(data)

    # Pad or truncate every document to maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
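
The `(6920, 40)` and `(1821, 40)` shapes above come from `pad_sequences` forcing every review to `maxlen=40` tokens. As a rough illustration only, the pure-Python sketch below mimics the Keras defaults (zeros added at the front, overly long sequences cut from the front). `pad_pre` and the toy sequences are hypothetical and written for this note; they are not part of Keras or of the competition code.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: padding='pre', truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6]]  # made-up token ids
result = pad_pre(toy_sequences, maxlen=4)
print(result)
```

The short sequence gains leading zeros and the long one loses its first tokens, so every row ends up with the same length.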


In [30]:
y_train

      Negative  Positive
0            0         1
1            0         1
2            0         1
3            0         1
4            0         1
...        ...       ...
6915         1         0
6916         1         0
6917         0         1
6918         1         0


In [4]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


In [5]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [6]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

## Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.


#### The dataset used in this assignment is the Stanford Sentiment Treebank (SST), a collection of movie reviews used for sentiment analysis. Each review is labeled as either positive or negative. A predictive model built on this data is particularly useful to media and film businesses, because it lets them analyze reviews and comments at scale and understand customer sentiment. That understanding also helps during product development: by analyzing the sentiment their customers express, businesses can allocate resources accordingly, for example by using movies that draw positive sentiment as a baseline for the kinds of movies they produce.

## Run at least three prediction models to try to predict the SST sentiment dataset well.
* Use an Embedding layer and LSTM layers in at least one model

* Use an Embedding layer and Conv1d layers in at least one model

* Use transfer learning with glove embeddings for at least one of these models

* Submit your best three models to the leader board for the SST Model Share competition.


### Embedding + LSTM Layer

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(LSTM(64, return_sequences=True, dropout=0.2))
model.add(LSTM(32, return_sequences=True, dropout=0.2))
model.add(LSTM(32, dropout=0.2))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))


model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(preprocessor(X_train), y_train,
                    epochs=3,
                    batch_size=64,
                    validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 186

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
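
The submission cell turns softmax outputs into text labels by taking the argmax of each row and looking it up in `y_train.columns`. A minimal sketch of that lookup, assuming the `Negative`, `Positive` column order shown in the `y_train` output earlier; the probabilities are made up for illustration and `argmax` is a hypothetical stand-in for the NumPy method.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

columns = ["Negative", "Positive"]  # assumed column order of y_train
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]  # made-up softmax outputs

labels = [columns[argmax(p)] for p in toy_probs]
print(labels)
```

Because the mapping depends on column order, the same `y_train` used for training should be used for the lookup.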


### Embedding + Conv1D Layer

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model2 = Sequential()
model2.add(Embedding(input_dim=10000, output_dim=128, input_length=40))
model2.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model2.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model2.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model2.add(Conv1D(filters=16, kernel_size=3, activation='relu'))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(units=16, activation='softmax'))
model2.add(Dense(units=2, activation='softmax'))


model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=3,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model2 = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model2.SerializeToString())

In [None]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index2=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels2 = [y_train.columns[i] for i in prediction_column_index2]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels2)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 187

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Transfer Learning + GloVe Embedding on Model 1

In [7]:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-18 00:51:39--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-18 00:51:39--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-18 00:51:39--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [8]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
replace glove.6B.100d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: glove.6B.100d.txt       
replace glove.6B.200d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace glove.6B.300d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [9]:
# Extract embedding data for 100 feature embedding matrix
import os
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by its 100 embedding coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [10]:
# Determine number of unique tokens, and then use number of tokens in the embedding matrix
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 13835 unique tokens.


In [11]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features
max_words = 13835
maxlen = 40

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

In [12]:
embedding_matrix.shape

(13835, 100)

In [13]:
len(tokenizer.word_index)

13835
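
The Keras `Tokenizer` numbers words starting at 1 and leaves index 0 for padding, so row `i` of the embedding matrix has to hold the vector for the word whose tokenizer index is `i`. A small sketch of that alignment with plain lists; `toy_word_index` and `toy_vectors` are made-up stand-ins for `tokenizer.word_index` and `embeddings_index`.

```python
toy_word_index = {"the": 1, "movie": 2, "zzyzx": 3}      # Tokenizer-style indices, starting at 1
toy_vectors = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}  # stand-in for embeddings_index
dim = 2

# One extra row so index 0 (padding) and the largest word index both fit.
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:          # words missing from the vectors stay all-zero
        matrix[i] = list(vec)
print(matrix)
```

The notebook's matrix has `len(word_index)` rows rather than one more, so the highest-indexed word is skipped. That should not affect these models, since the tokenizer was built with `num_words=10000` and the preprocessor therefore never emits indices that large.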

In [26]:
# Update Model 1 to use the full tokenizer vocabulary (input_dim) and GloVe-sized embeddings
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Flatten, Dense
import numpy as np

model3 = Sequential()
# GloVe weights are loaded in the next cell via set_weights([embedding_matrix])
model3.add(Embedding(max_words, embedding_dim, input_length=40))
model3.add(LSTM(64, return_sequences=True, dropout=0.2))
model3.add(LSTM(32, dropout=0.2))
model3.add(Flatten())
model3.add(Dense(2, activation='softmax'))


In [27]:
# Add weights as transfer learning 

model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False

# Compile the model
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model3.fit(preprocessor(X_train), y_train, validation_split = 0.2, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb4186abc10>

In [19]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model3 = model_to_onnx(model3, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model3.SerializeToString())

In [20]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index3=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels3 = [y_train.columns[i] for i in prediction_column_index3]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels3)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 342

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


## Discuss which models performed better and point out relevant hyper-parameter values for successful models.


#### Of the 3 models run above, the best performing is Model 3, which adds GloVe embeddings through transfer learning to an LSTM architecture like that of Model 1. It has 1 Embedding layer (100-dimensional, initialized with GloVe weights and frozen), 2 LSTM layers (64 and 32 units, each with dropout of 0.2), and a 2-unit softmax output for the two categories: positive and negative sentiment. It was trained with the Adam optimizer for 10 epochs with a batch size of 32.
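
To put the LSTM layer sizes in perspective, the sketch below estimates Model 3's trainable parameter count. It assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias); `lstm_params` is a hypothetical helper, and the figures are arithmetic from the layer sizes in the code, not output copied from `model3.summary()`.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * ((input_dim + units) * units + units)

lstm1 = lstm_params(100, 64)   # first LSTM reads 100-dim GloVe vectors
lstm2 = lstm_params(64, 32)    # second LSTM reads the 64-dim sequence output
dense = 32 * 2 + 2             # softmax output layer
frozen_embedding = 13835 * 100 # not trained, since the layer is frozen

model3_trainable = lstm1 + lstm2 + dense
print(lstm1, lstm2, dense, model3_trainable, frozen_embedding)
```

Under these assumptions, nearly all of the model's parameters sit in the frozen embedding, and only a few tens of thousands are actually trained.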

## After you submit your first three models, describe your best model with your team via your team slack channel
* Fit and submit up to three more models after learning from your team.

* Discuss results

### Transfer Learning + GloVe Embedding (more neurons)

In [25]:
# Update Model 3 with more neurons per LSTM layer and higher dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Flatten, Dense
import numpy as np

model4 = Sequential()
# Use embedding_matrix so that row i holds the vector for the word with tokenizer index i
# (a list built from word_index.keys() is shifted by one row, since indices start at 1)
model4.add(Embedding(max_words, embedding_dim, input_length=40, weights=[embedding_matrix]))
model4.add(LSTM(128, return_sequences=True, dropout=0.5))
model4.add(LSTM(64, dropout=0.5))
model4.add(Flatten())
model4.add(Dense(2, activation='softmax'))


model4.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model4.fit(preprocessor(X_train), y_train,
                    epochs=5,
                    batch_size=64,
                    validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [28]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model4 = model_to_onnx(model4, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model4.SerializeToString())

In [29]:
#Submit Model 4: 

#-- Generate predicted y values (Model 4)
prediction_column_index4=model4.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels4 = [y_train.columns[i] for i in prediction_column_index4]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels4)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 351

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Embedding + LSTM (with more layers)

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model5 = tf.keras.Sequential()
model5.add(keras.layers.Embedding(10000, 16, input_length=40))
model5.add(LSTM(256, return_sequences=True, dropout=0.2)) # Add layer with more neurons
model5.add(LSTM(128, return_sequences=True, dropout=0.2)) 
model5.add(LSTM(128, return_sequences=True, dropout=0.2))
model5.add(LSTM(64, return_sequences=True, dropout=0.2))
model5.add(LSTM(64, return_sequences=True, dropout=0.2))
model5.add(LSTM(32, dropout=0.2))
model5.add(Flatten())
model5.add(Dense(2, activation='softmax'))

In [None]:
model5.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model5.fit(preprocessor(X_train), y_train, epochs=3, batch_size=64, validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model5 = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model5.onnx", "wb") as f:
    f.write(onnx_model5.SerializeToString())

In [None]:
#Submit Model 5: 

#-- Generate predicted y values (Model 5)
prediction_column_index5=model5.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels5 = [y_train.columns[i] for i in prediction_column_index5]

# Submit Model 5 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels5)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 338

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Embedding + Conv1D (with more layers)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

model6 = Sequential()
model6.add(Embedding(input_dim=10000, output_dim=128, input_length=40))
model6.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model6.add(MaxPooling1D(pool_size=2))
model6.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model6.add(MaxPooling1D(pool_size=2))
model6.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model6.add(MaxPooling1D(pool_size=2))
model6.add(Conv1D(filters=16, kernel_size=3, activation='relu'))
model6.add(GlobalMaxPooling1D())
model6.add(Dense(units=16, activation='softmax'))
model6.add(Dense(units=2, activation='softmax'))


In [None]:
model6.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model6.fit(preprocessor(X_train), y_train,
                    epochs=3,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model6 = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model6.onnx", "wb") as f:
    f.write(onnx_model6.SerializeToString())

In [None]:
#Submit Model 6: 

#-- Generate predicted y values (Model 6)
prediction_column_index6=model6.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels6 = [y_train.columns[i] for i in prediction_column_index6]

# Submit Model 6 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels6)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 340

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Transfer Learning + GloVe Embedding with Dense layers (more neurons)

In [63]:
# A second preprocessor that tokenizes and pads the text and also one-hot encodes the labels

def preprocessor2(data, labels, maxlen=40, max_words=10000):
    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)
    y = pd.get_dummies(labels)
    return X, y

X_train_processed, y_train_processed = preprocessor2(X_train, y_train_labels)


In [85]:
model7 = tf.keras.Sequential()
model7.add(keras.layers.Embedding(input_dim=len(tokenizer.word_index),
                           output_dim=100,
                           input_length=maxlen,
                           trainable=False,
                           weights=[np.array([embeddings_index[word] if word in embeddings_index else np.zeros(100) for word in tokenizer.word_index.keys()])]))
model7.add(tf.keras.layers.Flatten())
model7.add(tf.keras.layers.Dense(6412, activation='relu'))  # Added additional Dense layer with 6412 neurons and ReLU activation
model7.add(tf.keras.layers.Dropout(0.5))  # Added Dropout layer with dropout rate of 0.5
model7.add(tf.keras.layers.Dense(3214, activation='relu'))  # Added additional Dense layer with 3214 neurons and ReLU activation
model7.add(tf.keras.layers.Dense(2, activation='softmax'))
model7.summary()

model7.layers[0].set_weights([embedding_matrix])
model7.layers[0].trainable = False

# Compile the model
model7.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model7.fit(X_train_processed, y_train_processed, validation_split=0.2, epochs=10, batch_size=32)


Model: "sequential_20"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_19 (Embedding)    (None, 40, 100)           1383500   
                                                                 
 flatten_12 (Flatten)        (None, 4000)              0         
                                                                 
 dense_30 (Dense)            (None, 6412)              25654412  
                                                                 
 dropout_1 (Dropout)         (None, 6412)              0         
                                                                 
 dense_31 (Dense)            (None, 3214)              20611382  
                                                                 
 dense_32 (Dense)            (None, 2)                 6430      
                                                                 
Total params: 47,655,724
Trainable params: 46,272,224
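
The parameter counts in this summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output. The sketch below redoes that arithmetic from the layer sizes in the code; `dense_params` is a hypothetical helper written for this note.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = 13835 * 100            # vocabulary size x embedding dimension (frozen)
flat = 40 * 100                    # Flatten output: maxlen x embedding dimension
d1 = dense_params(flat, 6412)
d2 = dense_params(6412, 3214)
d3 = dense_params(3214, 2)

total = embedding + d1 + d2 + d3
trainable = total - embedding      # the embedding layer is not trained
print(d1, d2, d3, total, trainable)
```

These values match the `Param #` column and the totals reported above, which shows that nearly all of the 46 million trainable parameters sit in the two large Dense layers.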

<keras.callbacks.History at 0x7f8d82a39640>

In [24]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model7 = model_to_onnx(model7, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model7.onnx", "wb") as f:
    f.write(onnx_model7.SerializeToString())

In [87]:
#Submit Model 7: 

#-- Generate predicted y values (Model 7)
prediction_column_index7=model7.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels7 = [y_train.columns[i] for i in prediction_column_index7]

# Submit Model 7 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model7.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels7)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 323

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


## Discuss which models you tried and which models performed better and point out relevant hyper-parameter values for successful models.


#### After submitting the second set of models, the best performing model is Model 6, which uses Embedding and Conv1D layers. Compared with Model 2, max pooling layers were added between the convolutions to improve performance. The model has 1 Embedding layer (output dimension 128), 4 Conv1D layers (64, 32, 32, and 16 filters, kernel size 3, ReLU activation), 3 MaxPooling1D layers (pool size 2), and 1 GlobalMaxPooling1D layer, followed by Dense layers that output 2 categories: positive or negative sentiment. It was trained with RMSprop for 3 epochs with a batch size of 32.
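
To see how the pooling layers in Model 6 shrink the sequence, the sketch below traces the length of the time axis through the convolution and pooling stack. It assumes the Keras defaults used in the code (`'valid'` padding and stride 1 for Conv1D, stride equal to `pool_size` for MaxPooling1D); the helper names are hypothetical.

```python
def conv1d_len(n, kernel_size):
    """Output length of Conv1D with 'valid' padding and stride 1."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of MaxPooling1D with stride equal to pool_size."""
    return n // pool_size

n = 40  # padded review length
trace = [n]
for layer in ["conv", "pool", "conv", "pool", "conv", "pool", "conv"]:
    n = conv1d_len(n, 3) if layer == "conv" else maxpool1d_len(n, 2)
    trace.append(n)
print(trace)
```

Under these assumptions the 40-token input is reduced to a single time step by the last Conv1D layer, so the final GlobalMaxPooling1D has only one position to pool over.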