# Task: Classify a movie review based on sentiment

04/17/2023

[Link to GitHub repo](https://github.com/lprockop/TextClassification-StanfordSST)

## Import modules, import data, instantiate competition, create preprocessor

### Import modules and data

In [None]:
#install aimodelshare library
! pip install aimodelshare==0.0.189
#RESTART RUNTIME

In [None]:
#import data

# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 

# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

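# read_csv's squeeze= argument was deprecated in pandas 1.4 and removed in 2.0;
# DataFrame.squeeze("columns") returns the same single-column Series.
X_train = pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the labels (columns: Negative, Positive)
y_train = pd.get_dummies(y_train_labels)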

X_train.head()


Data downloaded successfully.


0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

In [None]:
#Set credentials using modelshare.org username/password
from aimodelshare.aws import set_credentials
import aimodelshare as ai

apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m"
set_credentials(apiurl=apiurl)

#Instantiate Competition
mycompetition= ai.Competition(apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


### Preprocess

In [None]:
#preprocessing

# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

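# The preprocessor converts each document to a sequence of word indices and
# pads/truncates it so that all documents have the same length (maxlen)
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is kept in the signature but is not used here: the vocabulary
    # cap is already applied by the tokenizer above (num_words=10000)
    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X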

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
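
To make the padding step concrete, here is a NumPy-only sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long ones lose their earliest tokens. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation, and the token ids are made up.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]          # drop tokens from the front if too long
        out[row, -len(kept):] = kept  # right-align the kept tokens
    return out

# Hypothetical token ids: one short review, one long review
padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```

With `maxlen=40`, as in the preprocessor above, any review longer than 40 tokens keeps only its final 40.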


In [None]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


## Question 1: Discuss dataset in general terms and the purpose of building this model. Who could benefit?

The feature in this dataset is text (movie reviews) and the target is a negative or positive sentiment classification. Predicting sentiment has a variety of applications, including generating overall impression scores (for example, a movie's composite rating could be gleaned from the percentage of reviews that are positive or negative) or identifying potential customer support issues for a company by identifying negative reviews.
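
As a rough illustration of the composite-rating idea above, the sketch below turns a list of predicted labels into a percent-positive score. The function name `percent_positive` and the sample labels are made up for illustration; they are not part of the competition data.

```python
def percent_positive(labels):
    """Return the share of 'Positive' labels as a percentage (0-100)."""
    if not labels:
        raise ValueError("need at least one label")
    positives = sum(1 for label in labels if label == "Positive")
    return 100.0 * positives / len(labels)

# Hypothetical predictions for five reviews of one movie
sample_predictions = ["Positive", "Negative", "Positive", "Positive", "Negative"]
score = percent_positive(sample_predictions)
print(f"{score:.0f}% positive")
```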

In [None]:
X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

In [None]:
y_train.head()

   Negative  Positive
0         0         1
1         0         1
2         0         1
3         0         1
4         0         1


## Question 2: Build 3 prediction models to predict SST sentiment well & Question 4: Submit to competition

### 2.1: Embedding layer and LSTM layers

In [None]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model2 = Sequential()
model2.add(Embedding(10000, 16, input_length=40))
model2.add(LSTM(32, return_sequences=True, dropout=0.2))
model2.add(LSTM(32, dropout=0.2))
model2.add(Flatten())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model2.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
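
A note on the loss: the model above pairs a 2-unit softmax with `binary_crossentropy`, while the Conv1D model later uses `categorical_crossentropy`. For one-hot labels and exactly two softmax outputs, the two losses should give the same per-sample value, because the two probabilities sum to 1. The NumPy sketch below checks this on made-up numbers; it assumes binary cross-entropy is averaged over the output units (as Keras does) and ignores the small epsilon clipping Keras applies.

```python
import numpy as np

def categorical_ce(y, p):
    """Categorical cross-entropy per sample."""
    return -np.sum(y * np.log(p), axis=-1)

def binary_ce(y, p):
    """Binary cross-entropy per sample, averaged over the output units."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p), axis=-1)

# Made-up one-hot labels and softmax outputs (each row sums to 1)
y = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([[0.3, 0.7], [0.9, 0.1]])

cce = categorical_ce(y, p)
bce = binary_ce(y, p)
print(cce, bce)
```

The equality holds only for two classes; with three or more softmax outputs the two losses differ.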


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#submit model

#-- Generate predicted y values (model2)
#Note: Keras predict returns class probabilities; argmax picks the column index of the most likely class
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 252

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### 2.2: Embedding layer and Conv1d layers

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Embedding, Flatten

model3 = Sequential()
model3.add(Embedding(10000, 16, input_length=40))
model3.add(layers.Conv1D(16, 4, activation='relu'))
model3.add(layers.MaxPooling1D(8))
model3.add(layers.Conv1D(16, 4, activation='relu'))
model3.add(layers.GlobalMaxPooling1D())
model3.add(layers.Dense(4))
model3.add(Flatten())  # no-op here: the input is already 2-D (batch, 4)
# Three softmax layers are stacked here, as submitted (the leaderboard row records 3 softmax activations);
# only the last one is needed to produce class probabilities.
model3.add(Dense(2, activation='softmax'))
model3.add(Dense(2, activation='softmax'))
model3.add(Dense(2, activation='softmax'))
model3.summary()

model3.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 40, 16)            160000    
                                                                 
 conv1d (Conv1D)             (None, 37, 16)            1040      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 4, 16)            0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 1, 16)             1040      
                                                                 
 global_max_pooling1d (Globa  (None, 16)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                
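
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes the Keras defaults used in the model: `'valid'` padding and stride 1 for `Conv1D`, and a `MaxPooling1D` stride equal to its pool size. The helper names are made up for illustration.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a Conv1D layer with 'valid' padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters

seq_len = 40
after_conv1 = conv1d_out_len(seq_len, 4)        # (None, 37, 16)
after_pool = maxpool1d_out_len(after_conv1, 8)  # (None, 4, 16)
after_conv2 = conv1d_out_len(after_pool, 4)     # (None, 1, 16)

embedding_params = 10000 * 16                   # vocabulary size x embedding dim
conv_params = conv1d_params(16, 16, 4)          # same for both Conv1D layers

print(after_conv1, after_pool, after_conv2, embedding_params, conv_params)
```

By the time the second convolution runs, only 4 time steps remain, so its kernel of size 4 covers the whole sequence and produces a single step.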

In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
# Submit model

#-- Generate predicted y values (model3)
#Note: Keras predict returns class probabilities; argmax picks the column index of the most likely class
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 253

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### 2.3: Transfer learning with GloVe embeddings

In [None]:
# What if we wanted to use a matrix of pretrained embeddings?  Same as transfer learning before, but now we are importing a pretrained Embedding matrix:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-17 01:22:21--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-17 01:22:21--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-17 01:22:21--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [None]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [None]:
# Extract embedding data for 100 feature embedding matrix
import os
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

glove_dir = os.getcwd()

embeddings_index = {}
# Each line of the GloVe file holds a word followed by its vector components
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# Settings for converting the text into padded sequences of word indices

maxlen = 100  # We will cut reviews after 100 words in sequence
training_samples = 5536  # 80% of the 6920 training reviews
validation_samples = 1384  # The remaining 20%, held out for validation
max_words = 10000  # We will only consider the top 10,000 words in the dataset

Found 400001 word vectors.


In [None]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

sequences = tokenizer.texts_to_sequences(X_train) # converts words in each text to each word's numeric index in tokenizer dictionary.

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(y_train['Negative'])

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Found 13835 unique tokens.
Shape of data tensor: (6920, 100)
Shape of label tensor: (6920,)


In [None]:
# Split the data into a training set and a validation set
# But first, shuffle the data, in case the samples
# are ordered by label.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

X_train_1 = data[:training_samples] #100 words
y_train_1 = labels[:training_samples]
X_val_1 = data[training_samples: training_samples + validation_samples]
y_val_1 = labels[training_samples: training_samples + validation_samples]
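
One thing worth checking with slice-based splits: NumPy slices whose bounds exceed the array length do not raise an error, they just return fewer rows (possibly none). The sketch below uses a stand-in array with 6920 rows, the size of this training set, to show the effect of oversized bounds such as 10000 and an 80/20 split that fits the data. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed, for reproducibility only
n_rows = 6920                     # number of training reviews in this dataset
stand_in = np.arange(n_rows)      # stand-in for the padded data matrix
shuffled = stand_in[rng.permutation(n_rows)]

# Bounds larger than the dataset: no error, but the second slice is empty
oversized_train = shuffled[:10000]
oversized_val = shuffled[10000:20000]
print(len(oversized_train), len(oversized_val))

# An 80/20 split sized to the data
n_train = n_rows * 8 // 10
train_part, val_part = shuffled[:n_train], shuffled[n_train:]
print(len(train_part), len(val_part))
```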

In [None]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
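
The loop above can be tried on a toy vocabulary to see which rows get filled. In this sketch the word vectors, the word index, and the sizes are all made up; the point is that row 0 (the padding index), words missing from the pretrained vectors, and words ranked at or beyond `max_words` all stay as zeros.

```python
import numpy as np

# Hypothetical 3-dimensional "pretrained" vectors for two words
toy_embeddings_index = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
# Hypothetical tokenizer word index (1-based, most frequent word first)
toy_word_index = {"the": 1, "good": 2, "bad": 3, "zzyzx": 4}

toy_max_words = 4
toy_dim = 3
toy_matrix = np.zeros((toy_max_words, toy_dim))
found = 0
for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # outside the vocabulary cap, no row exists for this word
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector
        found += 1

print(f"{found} of {toy_max_words - 1} in-vocabulary words have a pretrained vector")
print(toy_matrix)
```

Counting `found` this way on the real data would show how much of the top-10,000 vocabulary GloVe actually covers.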

In [None]:
# Set up same model architecture as before and then import Glove weights to Embedding layer:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential
import tensorflow as tf

model4 = tf.keras.Sequential()
model4.add(tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen))
model4.add(tf.keras.layers.Flatten())
model4.add(tf.keras.layers.Dense(32, activation='relu'))
model4.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model4.summary()

# Add weights in same manner as transfer learning and turn off the trainable option before fitting the model to freeze the weights.
model4.layers[0].set_weights([embedding_matrix])
model4.layers[0].trainable = False

model4.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model4.fit(X_train_1, y_train_1,
                    epochs=10,
                    batch_size=32,
                    validation_data=(X_val_1, y_val_1))
model4.save_weights('pre_trained_glove_model.h5')

# Training data small to speed up training. Increase for better fit.

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_2 (Flatten)         (None, 10000)             0         
                                                                 
 dense_5 (Dense)             (None, 32)                320032    
                                                                 
 dense_6 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
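
The parameter counts in the summary above can be checked by hand. Note that `model4.summary()` is called before `trainable = False` is set on the embedding layer, which is why the summary reports 0 non-trainable parameters; once the layer is frozen, only the two dense layers are trained. The helper name below is made up for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding = 10000 * 100            # max_words x embedding_dim
flattened = 100 * 100              # maxlen x embedding_dim
dense_hidden = dense_params(flattened, 32)
dense_output = dense_params(32, 1)

total = embedding + dense_hidden + dense_output
trainable_after_freeze = dense_hidden + dense_output

print(total, trainable_after_freeze)
```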


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model4, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
# submit model

#using the tokenizer object we fit to the training data above
sequences = tokenizer.texts_to_sequences(X_test)
X_test_1 = pad_sequences(sequences, maxlen=maxlen)

#-- Generate predicted y values (model4)
#Note: model4 has a single sigmoid output trained on the 'Negative' column, so predict returns
#P(Negative) with shape (n, 1); argmax over one column is always 0, so threshold at 0.5 instead
prob_negative = model4.predict(X_test_1).ravel()
prediction_column_index = np.where(prob_negative > 0.5, 0, 1)  # 0 -> 'Negative', 1 -> 'Positive'

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 255

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
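
A single-unit sigmoid head needs different post-processing from a 2-unit softmax head. `predict` returns an array of shape `(n, 1)`, so `argmax(axis=1)` returns 0 for every row regardless of the probabilities; the class index has to come from a threshold instead. The sketch below uses made-up probabilities and assumes, as in the label setup above, that the output is P(Negative) and that the label columns are ordered `Negative`, `Positive`.

```python
import numpy as np

columns = ["Negative", "Positive"]

# Made-up sigmoid outputs, interpreted as P(Negative), shape (3, 1)
sigmoid_out = np.array([[0.9], [0.2], [0.6]])

# argmax over a single column can only ever return 0
argmax_index = sigmoid_out.argmax(axis=1)

# Thresholding recovers a usable class index: 0 -> Negative, 1 -> Positive
threshold_index = np.where(sigmoid_out.ravel() > 0.5, 0, 1)
labels = [columns[i] for i in threshold_index]

print(argmax_index, threshold_index, labels)
```

If every test review is assigned the same label, accuracy on a balanced test set will sit near 50%.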


## Question 3: Which models performed better? Relevant hyper params

My best model was a Sequential model with a depth of 5: 1 embedding layer, 2 LSTM layers, 1 flatten layer, and 1 dense layer, with no Conv1D layers. It did not use transfer learning, and the optimizer was RMSprop. Its F1 score was 79.06%.
(Note: the leaderboard code below was re-run after I submitted later models; version 55 (index 48) was the best-performing of the first 3 models I ran.)

In [None]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

my_data = data[data['username'] == 'lprockop']

# Stylize leaderboard data
mycompetition.stylize_leaderboard(my_data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,simplernn_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,concatenate_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,batchnormalization_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
46,79.36%,79.25%,80.03%,79.37%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,252
48,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
99,77.50%,77.41%,77.97%,77.50%,keras,,True,Sequential,3.0,161282.0,1.0,,,,,1.0,,,,,,,1.0,,,1.0,,,str,RMSprop,645600.0,,lprockop,44
117,76.40%,76.23%,77.18%,76.41%,keras,,True,Sequential,5.0,161294.0,1.0,,,,,1.0,,,,,,,3.0,,,3.0,,,str,RMSprop,646160.0,,lprockop,47
160,70.47%,70.02%,71.82%,70.49%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,56
166,69.92%,69.73%,70.46%,69.93%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,253
176,68.72%,68.20%,70.04%,68.73%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,53
184,65.42%,63.11%,70.65%,65.45%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,,lprockop,52
226,50.05%,33.36%,25.03%,50.00%,keras,,True,Sequential,4.0,1320065.0,1.0,,,,,1.0,,,,,,,2.0,,1.0,,,1.0,str,RMSprop,5281008.0,7.0,lprockop,54
227,50.05%,33.36%,25.03%,50.00%,keras,True,True,Sequential,4.0,1320065.0,1.0,,,,,1.0,,,,,,,2.0,,1.0,,,1.0,str,RMSprop,5281008.0,7.0,lprockop,57


In [None]:
#show differentiating model metrics to share with team

my_data_smp = my_data[['accuracy', 'f1_score', 'transfer_learning', 'depth', 'conv1d_layers', 'maxpooling1d_layers', 'lstm_layers']]

my_data_smp

Unnamed: 0,accuracy,f1_score,transfer_learning,depth,conv1d_layers,maxpooling1d_layers,lstm_layers
46,0.793633,0.792511,,5.0,,,2.0
48,0.792536,0.790582,,5.0,,,2.0
99,0.774973,0.774057,,3.0,,,
117,0.763996,0.762342,,5.0,,,
160,0.70472,0.700189,,10.0,2.0,1.0,
166,0.699232,0.697288,,10.0,2.0,1.0,
176,0.687157,0.682001,,10.0,2.0,1.0,
184,0.654226,0.631121,,5.0,,,2.0
226,0.500549,0.333577,,4.0,,,
227,0.500549,0.333577,True,4.0,,,
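
The two bottom rows (accuracy near 50%, precision near 25%, recall 50%, F1 near 33%) have the signature of a model that predicts the same class for every review. The pure-Python sketch below reproduces that pattern on synthetic, perfectly balanced labels. It assumes the leaderboard reports macro-averaged precision, recall, and F1, which the notebook does not state; the numbers are consistent with that assumption.

```python
def macro_scores(y_true, y_pred, classes):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Synthetic balanced labels and a constant predictor
y_true = ["Negative"] * 500 + ["Positive"] * 500
y_pred = ["Negative"] * 1000

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred, ["Negative", "Positive"])
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```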


## Question 5: build 3 more models after team feedback


### 5.1: Embedding layer and LSTM layers, more epochs

In [None]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model5 = Sequential()
model5.add(Embedding(10000, 16, input_length=40))
model5.add(LSTM(32, return_sequences=True, dropout=0.2))
model5.add(LSTM(32, dropout=0.2))
model5.add(Flatten())
model5.add(Dense(2, activation='softmax'))

model5.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model5.fit(preprocessor(X_train), y_train,
                    epochs=100,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/100
[Epochs 2/100 through 77/100 omitted; per-epoch metrics were not captured]
Epoch 78

In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model5.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#submit model

#-- Generate predicted y values (model5)
#Note: Keras predict returns class probabilities; argmax picks the column index of the most likely class
prediction_column_index=model5.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model5 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 256

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### 5.2: Embedding layer and Conv1d layers, more epochs

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Embedding, Flatten

model6 = Sequential()
model6.add(Embedding(10000, 16, input_length=40))
model6.add(layers.Conv1D(16, 4, activation='relu'))
model6.add(layers.MaxPooling1D(8))
model6.add(layers.Conv1D(16, 4, activation='relu'))
model6.add(layers.GlobalMaxPooling1D())
model6.add(layers.Dense(4))
model6.add(Flatten())  # no-op here: the input is already 2-D (batch, 4)
# Three softmax layers are stacked here, as submitted (the leaderboard row records 3 softmax activations);
# only the last one is needed to produce class probabilities.
model6.add(Dense(2, activation='softmax'))
model6.add(Dense(2, activation='softmax'))
model6.add(Dense(2, activation='softmax'))
model6.summary()

model6.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model6.fit(preprocessor(X_train), y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.1)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 40, 16)            160000    
                                                                 
 conv1d_2 (Conv1D)           (None, 37, 16)            1040      
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 4, 16)            0         
 1D)                                                             
                                                                 
 conv1d_3 (Conv1D)           (None, 1, 16)             1040      
                                                                 
 global_max_pooling1d_1 (Glo  (None, 16)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_8 (Dense)             (None, 4)                

In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model6.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
# Submit model

#-- Generate predicted y values (model6)
#Note: Keras predict returns class probabilities; argmax picks the column index of the most likely class
prediction_column_index=model6.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model6 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 258

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### 5.3: Transfer learning with GloVe embeddings, more epochs

In [None]:
# What if we wanted to use a matrix of pretrained embeddings?  Same as transfer learning before, but now we are importing a pretrained Embedding matrix:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

In [None]:
! unzip glove.6B.zip 

In [None]:
# Extract embedding data for 100 feature embedding matrix
import os
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

glove_dir = os.getcwd()

embeddings_index = {}
# Each line of the GloVe file holds a word followed by its vector components
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# Settings for converting the text into padded sequences of word indices

maxlen = 100  # We will cut reviews after 100 words in sequence
training_samples = 5536  # 80% of the 6920 training reviews
validation_samples = 1384  # The remaining 20%, held out for validation
max_words = 10000  # We will only consider the top 10,000 words in the dataset

Found 400001 word vectors.


In [None]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

sequences = tokenizer.texts_to_sequences(X_train) # converts words in each text to each word's numeric index in tokenizer dictionary.

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(y_train['Negative'])

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Found 13835 unique tokens.
Shape of data tensor: (6920, 100)
Shape of label tensor: (6920,)


In [None]:
# Split the data into a training set and a validation set
# But first, shuffle the data, in case the samples
# are ordered by label.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

X_train_1 = data[:training_samples] #100 words
y_train_1 = labels[:training_samples]
X_val_1 = data[training_samples: training_samples + validation_samples]
y_val_1 = labels[training_samples: training_samples + validation_samples]

In [None]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

In [None]:
# Set up same model architecture as before and then import Glove weights to Embedding layer:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential
import tensorflow as tf

model7 = tf.keras.Sequential()
model7.add(tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen))
model7.add(tf.keras.layers.Flatten())
model7.add(tf.keras.layers.Dense(32, activation='relu'))
model7.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model7.summary()

# Add weights in same manner as transfer learning and turn off the trainable option before fitting the model to freeze the weights.
model7.layers[0].set_weights([embedding_matrix])
model7.layers[0].trainable = False

model7.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model7.fit(X_train_1, y_train_1,
                    epochs=50,
                    batch_size=16,
                    validation_data=(X_val_1, y_val_1))
model7.save_weights('pre_trained_glove_model.h5')

# Training data small to speed up training. Increase for better fit.

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_5 (Flatten)         (None, 10000)             0         
                                                                 
 dense_12 (Dense)            (None, 32)                320032    
                                                                 
 dense_13 (Dense)            (None, 1)                 33        
                                                                 
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch

In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model7, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model7.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
# submit model

#using the tokenizer object we fit to the training data above
sequences = tokenizer.texts_to_sequences(X_test)
X_test_1 = pad_sequences(sequences, maxlen=maxlen)

#-- Generate predicted y values (model7)
#Note: model7 has a single sigmoid output trained on the 'Negative' column, so predict returns
#P(Negative) with shape (n, 1); argmax over one column is always 0, so threshold at 0.5 instead
prob_negative = model7.predict(X_test_1).ravel()
prediction_column_index = np.where(prob_negative > 0.5, 0, 1)  # 0 -> 'Negative', 1 -> 'Positive'

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit model7 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model7.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata = {'team': 7})

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 259

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


## Question 6: Discuss results & Question 7: Discuss which models I tried, which performed better, relevant hyper-params

My team recommended I try to improve model performance by increasing the number of epochs on the models I had already run. 

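I increased the epochs for the Sequential LSTM model (depth of 5, with 1 embedding layer, 2 LSTM layers, no Conv1D layers, and the RMSprop optimizer) from 10 to 100. Going by the version numbers printed at submission, the 100-epoch run (version 256) scored an F1 of 77.39%, below the two 10-epoch runs of the same architecture (79.06% for version 55 and 79.25% for version 252), so the extra epochs did not help. This architecture remained my best-performing model.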

In [None]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

my_data = data[data['username'] == 'lprockop']

# Stylize leaderboard data
mycompetition.stylize_leaderboard(my_data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,simplernn_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,concatenate_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,batchnormalization_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
46,79.36%,79.25%,80.03%,79.37%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,252
47,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
89,77.61%,77.39%,78.73%,77.62%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,256
101,77.50%,77.41%,77.97%,77.50%,keras,,True,Sequential,3.0,161282.0,1.0,,,,,1.0,,,,,,,1.0,,,1.0,,,str,RMSprop,645600.0,,lprockop,44
119,76.40%,76.23%,77.18%,76.41%,keras,,True,Sequential,5.0,161294.0,1.0,,,,,1.0,,,,,,,3.0,,,3.0,,,str,RMSprop,646160.0,,lprockop,47
162,70.47%,70.02%,71.82%,70.49%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,56
168,69.92%,69.73%,70.46%,69.93%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,253
178,68.72%,68.20%,70.04%,68.73%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,53
185,65.42%,63.11%,70.65%,65.45%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,,lprockop,52
187,67.40%,67.32%,67.59%,67.40%,keras,,True,Sequential,10.0,162170.0,1.0,2.0,1.0,,,1.0,,,,,1.0,,4.0,,,3.0,,2.0,str,RMSprop,650480.0,7.0,lprockop,258
