This notebook compares the performance of a convolutional neural network (CNN) on a binary text classification problem using three different sets of word embeddings. Using a dataset of Amazon reviews, I compare a CNN that trains its own embeddings against CNNs that use pretrained embeddings from GloVe and FastText. The self-trained embeddings give the best validation accuracy (0.9119, versus 0.9111 for FastText and 0.8914 for GloVe). I then use Keras Tuner to tune the hyperparameters of the best model.



# Preparing the data



In [3]:
# Load dependencies
import tarfile
import numpy as np
import pandas as pd
from google.colab import drive
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, SpatialDropout1D

In [4]:
# Mount drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
# Load data
# Xiang Zhang's Amazon Reviews Polarity dataset
# Available for download here: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz

# Unpack tar file
tar = tarfile.open('gdrive/My Drive/Colab Notebooks/Data/amazon_review_polarity_csv.tar.gz')
tar_train = tar.extractfile('amazon_review_polarity_csv/train.csv')
tar_test = tar.extractfile('amazon_review_polarity_csv/test.csv')

# Read csv
full_train = pd.read_csv(tar_train, header=None, names=['label', 'title', 'text'])
full_eval = pd.read_csv(tar_test, header=None, names=['label', 'title', 'text'])

In [7]:
# Look at the data
full_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600000 entries, 0 to 3599999
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   label   int64 
 1   title   object
 2   text    object
dtypes: int64(1), object(2)
memory usage: 82.4+ MB


In [8]:
full_train.head()

Unnamed: 0,label,title,text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."


In [9]:
full_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   label   400000 non-null  int64 
 1   title   399990 non-null  object
 2   text    400000 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.2+ MB


In [10]:
full_eval.head()

Unnamed: 0,label,title,text
0,2,Great CD,My lovely Pat has one of the GREAT voices of h...
1,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
2,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
3,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
4,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...


In [11]:
# Classes are evenly balanced
full_train['label'].value_counts()

2    1800000
1    1800000
Name: label, dtype: int64

In [12]:
full_eval['label'].value_counts()

2    200000
1    200000
Name: label, dtype: int64

In [13]:
# Look at two training examples
full_train['text'][0]

'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [14]:
full_train['text'][1]

"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."

In [15]:
# Look at two training labels
full_train['label'][:2]

0    2
1    2
Name: label, dtype: int64

In [16]:
# Recode labels as binary: 1 (negative) -> 0, 2 (positive) -> 1
full_train['label'] = full_train['label'] - 1
full_eval['label'] = full_eval['label'] - 1

In [17]:
full_train['label'].value_counts()

1    1800000
0    1800000
Name: label, dtype: int64

In [18]:
full_eval['label'].value_counts()

1    200000
0    200000
Name: label, dtype: int64

In [19]:
# Full dataset is 3.6 million rows and takes prohibitively long to tokenize and model 
# Take a 10% sample - still 360,000 rows!
small_train = full_train.sample(frac=.1, random_state=42).reset_index(drop=True)
small_eval = full_eval.sample(frac=.1, random_state=42).reset_index(drop=True)

In [20]:
# Save copies of prepared training and eval data for easy recall
X_train, X_eval, y_train, y_eval = small_train['text'], small_eval['text'], small_train['label'], small_eval['label']

# Convolutional Neural Network (CNN) Sentiment Classifier

In [21]:
# Hyperparameters

# Training
epochs = 4
batch_size = 128

# Vector-space embedding
n_dim = 64
n_unique_words = 5000
max_review_length = 400
oov_token = 'OOV'
pad_type = trunc_type = 'pre'
drop_embed = 0.2

# Convolutional layer architecture
n_conv = 256 # filters
k_conv = 3 # kernel length

# Dense layer architecture
n_dense = 256
dropout = 0.2

In [22]:
# Preprocess text 

# Reload prepared raw data
X_train, X_eval, y_train, y_eval = small_train['text'], small_eval['text'], small_train['label'], small_eval['label']

# Instantiate tokenizer
tokenizer = Tokenizer(num_words=n_unique_words, oov_token=oov_token)

# Fit on training
tokenizer.fit_on_texts(X_train)

# Convert training and eval data to sequences
X_train = tokenizer.texts_to_sequences(X_train)
X_eval = tokenizer.texts_to_sequences(X_eval)

# Padding
X_train = pad_sequences(X_train, maxlen=max_review_length, 
                        padding=pad_type, truncating=trunc_type, value=0)
X_eval = pad_sequences(X_eval, maxlen=max_review_length, 
                        padding=pad_type, truncating=trunc_type, value=0)
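
With `pad_type = trunc_type = 'pre'`, short reviews are left-padded with zeros and long reviews lose their earliest tokens. The pure-Python sketch below imitates that behaviour on toy sequences. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Imitate padding='pre' and truncating='pre' for a single sequence."""
    seq = list(seq)[-maxlen:]                    # 'pre' truncation keeps the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding adds the fill value on the left

short_seq = [7, 8, 9]
long_seq = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_pre(short_seq, maxlen=5)
padded_long = pad_pre(long_seq, maxlen=5)
print(padded_short)  # [0, 0, 7, 8, 9]
print(padded_long)   # [3, 4, 5, 6, 7]
```

Keeping the end of a long review rather than its start is a design choice: it assumes the closing sentences carry at least as much sentiment as the opening ones.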

In [20]:
# Convolutional model architecture
model = Sequential()

# Vector-space embedding
model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))

# Convolutional layer
model.add(Conv1D(n_conv, k_conv, activation='relu'))
model.add(GlobalMaxPooling1D())

# Dense layer
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))

# Output layer
model.add(Dense(1, activation='sigmoid'))

# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summarize
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 64)           320000    
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 400, 64)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 398, 256)          49408     
_________________________________________________________________
global_max_pooling1d (Global (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2
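
The Param # column and the convolution output length can be checked by hand from the hyperparameters defined earlier. The arithmetic below is a sketch using the standard formulas for embedding, 1D convolution, and dense layers (weights plus one bias per output unit); the variable names mirror the notebook's.

```python
n_unique_words, n_dim, max_review_length = 5000, 64, 400
n_conv, k_conv, n_dense = 256, 3, 256

embedding_params = n_unique_words * n_dim         # one n_dim vector per token id
conv_params = k_conv * n_dim * n_conv + n_conv    # kernel weights + one bias per filter
conv_steps = max_review_length - k_conv + 1       # 'valid' padding with stride 1
dense_params = n_conv * n_dense + n_dense         # pooled features -> dense units
output_params = n_dense * 1 + 1                   # single sigmoid unit: weights + bias

print(embedding_params, conv_params, conv_steps, dense_params, output_params)
```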

In [21]:
# Train the model and evaluate

model.fit(X_train, y_train, 
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(X_eval, y_eval))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fbd883467d0>

# Result: CNN best validation accuracy: 0.9119

# CNN with Pretrained embeddings (GloVe)

In [30]:
# Hyperparameters are the same as the first CNN model with these 2 changes

n_dim = 100 # To match 100D GloVe embeddings
n_unique_words = len(tokenizer.word_index) + 1 # +1 is for padding token

## Prepare the GloVe embedding layer

In [23]:
# Download the GloVe embeddings

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2021-07-25 18:34:41--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-07-25 18:34:41--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-07-25 18:34:42--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [24]:
# Load the whole embedding into memory
embeddings_index = dict()

# GloVe files are UTF-8 text: one word per line, followed by its vector
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [31]:
# Create a weight matrix for words in the training docs
# (rows stay all-zero for words that have no GloVe vector)
embedding_matrix = np.zeros((n_unique_words, n_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
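
Words without a pretrained vector keep an all-zero row in `embedding_matrix`, so it can be useful to know how much of the vocabulary is actually covered. The following is a minimal sketch on toy data. `toy_word_index`, `toy_embeddings`, and `coverage` are hypothetical stand-ins for `tokenizer.word_index` and `embeddings_index`, and the vectors are made up.

```python
def coverage(word_index, embeddings_index):
    """Return the fraction of vocabulary words that have a pretrained vector."""
    found = sum(1 for word in word_index if word in embeddings_index)
    return found / len(word_index)

toy_word_index = {'OOV': 1, 'great': 2, 'soundtrack': 3, 'recomend': 4}
toy_embeddings = {'great': [0.1, 0.2], 'soundtrack': [0.3, 0.4]}

toy_dim = 2
# Row 0 is reserved for padding; rows of uncovered words stay at zero
matrix = [[0.0] * toy_dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = list(vector)

share = coverage(toy_word_index, toy_embeddings)
print(share)  # 0.5
```

Misspellings such as "recomend" are common in reviews and will typically fall into the uncovered share.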

In [32]:
# Define GloVe embedding layer
glove_embeddings = Embedding(n_unique_words, 
                             n_dim, 
                             weights=[embedding_matrix], 
                             input_length=max_review_length, 
                             trainable=False)

## Model


In [33]:
# Convolutional model architecture
model = Sequential()

# Vector-space embedding
model.add(glove_embeddings) # Pretrained GloVe embeddings
model.add(SpatialDropout1D(drop_embed))

# Convolutional layer
model.add(Conv1D(n_conv, k_conv, activation='relu'))
model.add(GlobalMaxPooling1D())

# Dense layer
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))

# Output layer
model.add(Dense(1, activation='sigmoid'))

# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summarize
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 400, 100)          26350500  
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 400, 100)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 398, 256)          77056     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                

In [34]:
# Train the model and evaluate
model.fit(X_train, y_train, 
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(X_eval, y_eval))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fbd70211f90>

# Result: CNN with GloVe embeddings best validation accuracy: 0.8914

# CNN with Pretrained FastText Embeddings

In [23]:
# Hyperparameters are the same as the first CNN model with these 2 changes

n_dim = 300 # To match 300D FastText embeddings
n_unique_words = len(tokenizer.word_index) + 1 # +1 is for padding token

## Prepare FastText embedding layer

In [1]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip -q wiki-news-300d-1M.vec.zip

--2021-07-25 20:35:45--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip.2’


2021-07-25 20:35:57 (53.3 MB/s) - ‘wiki-news-300d-1M.vec.zip.2’ saved [681808098/681808098]

replace wiki-news-300d-1M.vec? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
y




In [3]:
# Load the whole FastText embedding into memory

embeddings_index = dict()

# The .vec file is UTF-8 text; its first line is a "<count> <dim>" header,
# which this loop reads like any other line
with open('wiki-news-300d-1M.vec', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 999995 word vectors.
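
FastText `.vec` files begin with a header line giving the vocabulary size and the vector dimension, followed by one word per line. The sketch below parses a tiny in-memory sample in that format and reads the header separately so it is not stored as a word. The sample words and values are invented, and `load_vec` is a hypothetical helper rather than part of any library.

```python
import io

def load_vec(handle):
    """Parse word2vec-style text: a '<count> <dim>' header, then 'word v1 ... vdim' lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        values = line.rstrip().split(' ')
        vectors[values[0]] = [float(v) for v in values[1:]]
    return count, dim, vectors

sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nmusic 0.4 0.5 0.6\n")
count, dim, vectors = load_vec(sample)
print(count, dim, sorted(vectors))
```

Comparing `count` with `len(vectors)` and `dim` with each vector's length gives a cheap sanity check on the file.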


In [24]:
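# Create a weight matrix for words in the training docs
embedding_matrix = np.zeros((n_unique_words, n_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Require a full-length vector: the .vec header line ("<count> <dim>") is
    # parsed by the loader above as a one-element entry and must not be broadcast
    if embedding_vector is not None and len(embedding_vector) == n_dim:
        embedding_matrix[i] = embedding_vector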

In [25]:
# Define FastText embedding layer
fasttext_embeddings = Embedding(n_unique_words, 
                             n_dim, 
                             weights=[embedding_matrix], 
                             input_length=max_review_length, 
                             trainable=False)

## Model



In [27]:
# Convolutional model architecture
model = Sequential()

# Vector-space embedding
model.add(fasttext_embeddings) # Pretrained FastText embeddings
model.add(SpatialDropout1D(drop_embed))

# Convolutional layer
model.add(Conv1D(n_conv, k_conv, activation='relu'))
model.add(GlobalMaxPooling1D())

# Dense layer
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))

# Output layer
model.add(Dense(1, activation='sigmoid'))

# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summarize
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 300)          79051200  
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 400, 300)          0         
_________________________________________________________________
conv1d (Conv1D)              (None, 398, 256)          230656    
_________________________________________________________________
global_max_pooling1d (Global (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2

In [28]:
# Train the model and evaluate

model.fit(X_train, y_train, 
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(X_eval, y_eval))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f73fb361190>

# Result: CNN with FastText best validation accuracy: 0.9111

# Tune Hyperparameters with Keras Tuner

In [1]:
# Install keras tuner and import
!pip install keras-tuner -q
from keras_tuner import RandomSearch



In [2]:
# Default hyperparameters

# Training
epochs = 4
batch_size = 128

# Vector-space embedding
n_dim = 64
n_unique_words = 5000
max_review_length = 400
oov_token = 'OOV'
pad_type = trunc_type = 'pre'
drop_embed = 0.2

# Convolutional layer architecture
n_conv = 256 # filters
k_conv = 3 # kernel length

# Dense layer architecture
n_dense = 256
dropout = 0.2

In [87]:
# Create model function
def build_model(hp):
    model = Sequential()

    # Embedding layer
    model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
    model.add(SpatialDropout1D(drop_embed))

    # Convolutional layer
    model.add(Conv1D(filters=hp.Choice('n_filters', values=[64, 128, 256]),
                     kernel_size=k_conv, activation='relu'))
    model.add(GlobalMaxPooling1D())

    # Dense layer
    model.add(Dense(units=hp.Choice("n_dense", values=[32, 64, 128, 256, 512]), activation='relu'))
    model.add(Dropout(rate=hp.Choice("dense_dropout_rate", values=[0.0, 0.2, 0.5])))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    # Compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [89]:
# Create the tuner

tuner = RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=3,
    executions_per_trial=2,
    overwrite=True,
)

In [84]:
tuner.search_space_summary()

Search space summary
Default search space size: 3
n_filters (Choice)
{'default': 64, 'conditions': [], 'values': [64, 128, 256], 'ordered': True}
n_dense (Choice)
{'default': 32, 'conditions': [], 'values': [32, 64, 128, 256, 512], 'ordered': True}
dense_dropout_rate (Choice)
{'default': 0.0, 'conditions': [], 'values': [0.0, 0.2, 0.5], 'ordered': True}
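
The summary lists three `Choice` hyperparameters, so the full grid is the product of their value counts. Here is a quick sketch of that arithmetic, using the values printed in the summary. With `max_trials=3`, random search visits only a small part of the grid.

```python
from itertools import product

search_space = {
    'n_filters': [64, 128, 256],
    'n_dense': [32, 64, 128, 256, 512],
    'dense_dropout_rate': [0.0, 0.2, 0.5],
}

# Every possible combination of the three hyperparameters
combinations = list(product(*search_space.values()))
max_trials = 3

print(len(combinations))               # 45
print(max_trials / len(combinations))  # share of the grid that 3 trials can visit
```

Raising `max_trials` widens the search at a roughly proportional cost in training time.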


In [85]:
tuner.search(X_train, y_train, epochs=2, verbose=1, validation_data=(X_eval, y_eval))

Trial 3 Complete [00h 08m 21s]
val_loss: 0.22162976115942

Best val_loss So Far: 0.22162976115942
Total elapsed time: 00h 23m 48s
INFO:tensorflow:Oracle triggered exit


In [86]:
# Best set of hyperparameters
tuner.results_summary()

Results summary
Results in ./untitled_project
Showing 10 best trials
Objective(name='val_loss', direction='min')
Trial summary
Hyperparameters:
n_filters: 256
n_dense: 64
dense_dropout_rate: 0.2
Score: 0.22162976115942
Trial summary
Hyperparameters:
n_filters: 128
n_dense: 128
dense_dropout_rate: 0.5
Score: 0.2265392318367958
Trial summary
Hyperparameters:
n_filters: 64
n_dense: 64
dense_dropout_rate: 0.5
Score: 0.2386171743273735
