# 15.773 Homework 2 (Spring 2025): Natural Language Processing

-----
**IMPORTANT: Choose a T4 GPU with High-RAM or an A100 GPU**
-----

In [None]:
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import keras

keras.utils.set_random_seed(42)

# Introduction

This homework assignment will ask you to build models using BoW and BERT.

We will work with a famous dataset in natural language processing called **20 Newsgroup**, which consists of posts from an online forum grouped under topics such as politics, religion, and sports. As the name suggests, there are a total of 20 topics in this dataset. The 20 Newsgroup dataset is a popular benchmark for text classification algorithms.

The entire dataset is quite large. To ensure training takes a reasonable amount of time, we will only choose 6 out of the 20 topics, including topics like religion, space (astronomy) and medicine.

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space', 'sci.med', 'rec.autos'])
newsgroups_test = fetch_20newsgroups(subset='test', categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space', 'sci.med', 'rec.autos'])

train_df = pd.DataFrame({'text': newsgroups_train.data, 'label': newsgroups_train.target})
test_df = pd.DataFrame({'text': newsgroups_test.data, 'label': newsgroups_test.target})

print(f"""
Train samples: {train_df.shape[0]}
Test samples: {test_df.shape[0]}
""")

train_df.head(10)


Train samples: 3222
Test samples: 2145



Unnamed: 0,text,label
0,From: boylan@pi.eai.iastate.edu (Terran Boylan...,1
1,From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...,0
2,From: snichols@adobe.com (Sherri Nichols)\nSub...,3
3,From: johnm@spudge.lonestar.org (John Munsch)\...,1
4,From: Nanci Ann Miller <nm0w+@andrew.cmu.edu>\...,0
5,From: nsmca@aurora.alaska.edu\nSubject: 30826\...,4
6,From: geb@cs.pitt.edu (Gordon Banks)\nSubject:...,3
7,From: higgins@fnalf.fnal.gov (Bill Higgins-- B...,4
8,From: mmm@cup.portal.com (Mark Robert Thorson)...,3
9,From: daniel@lclark.edu (Daniel Snodgrass)\nSu...,1


The distribution of labels across these 6 classes is fairly balanced.



In [None]:
train_df['label'].value_counts() / train_df.shape[0]

label,count
2,0.184358
3,0.184358
4,0.184047
1,0.181254
0,0.148976
5,0.117008


Let's convert our dependent variable into a one-hot encoded vector.

In [None]:
# Let's turn the target into a dummy vector
y_train = pd.get_dummies(train_df['label'], dtype="int").to_numpy()
y_test = pd.get_dummies(test_df['label'], dtype="int").to_numpy()

y_train[:10]

array([[0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0]])
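
To make the one-hot encoding above concrete, here is a minimal sketch using NumPy instead of `pd.get_dummies`. The `labels` array is a hypothetical stand-in for the first few values of `train_df['label']`; it is not the notebook's actual code path.

```python
import numpy as np

# Hypothetical integer labels for five posts drawn from 6 classes
labels = np.array([1, 0, 3, 1, 4])
num_classes = 6

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes all of them at once
one_hot = np.eye(num_classes, dtype=int)[labels]

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

Each row contains exactly one 1, in the column matching the class index, which is the target format that `categorical_crossentropy` expects.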

# Problem 1: Bag-of-Words (BoW) Model [50 Points]

In this problem, we will build a bag-of-words model using the text vectorization capabilities of Keras. We will then change some of the parameters of this vectorization process and see how it changes the performance.



## Part (a): Build a Base Model [15 Points]

**Text Vectorization**

Please fill in the code in the following cell. We would like to create a text vectorization layer which uses:

* Maximum of 2000 tokens.
* Unigrams
* Outputs a multi-hot encoded BoW encoding
* Converts text to lower case and strips punctuation

In [None]:
# Set the maximum number of tokens (which is the size of the vocabulary) to 2000
max_tokens = 2000

# Configure the text vectorization layer
text_vectorization = keras.layers.TextVectorization(
     max_tokens=max_tokens,
     output_mode='multi_hot',
     ngrams=1,
     standardize='lower_and_strip_punctuation',
)

# Let's adapt the Text Vectorization layer using the training corpus
text_vectorization.adapt(train_df['text'])

# We vectorize our input with the adapted Text Vectorization layer
X_train = text_vectorization(train_df['text'])
X_test = text_vectorization(test_df['text'])

pd.DataFrame(X_train, columns = text_vectorization.get_vocabulary())

Unnamed: 0,[UNK],the,of,to,a,and,in,is,i,that,...,loving,learning,jaegerbuphybuedu,hour,helps,glutamate,finding,delta,careful,bother
0,1,1,1,1,1,1,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,...,0,0,1,0,0,0,0,0,0,0
2,1,1,0,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3218,1,1,1,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3219,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3220,1,1,1,1,1,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
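
The table above is a multi-hot encoding: each column is a vocabulary word and each cell records whether that word occurs in the post. The sketch below mimics this in plain Python, and also shows the count variant. It is a simplified illustration rather than the Keras implementation: the `encode` helper and the tiny `vocabulary` are hypothetical, and punctuation stripping is omitted.

```python
from collections import Counter

def encode(text, vocabulary, mode="multi_hot"):
    """Encode text as a bag-of-words vector over a fixed vocabulary."""
    tokens = text.lower().split()
    # Tokens outside the vocabulary are mapped to the [UNK] slot
    counts = Counter(tok if tok in vocabulary else "[UNK]" for tok in tokens)
    if mode == "count":
        # Count mode: how many times each vocabulary word occurs
        return [counts[word] for word in vocabulary]
    # Multi-hot mode: only whether each vocabulary word occurs
    return [1 if counts[word] > 0 else 0 for word in vocabulary]

vocabulary = ["[UNK]", "the", "space", "shuttle"]
text = "The shuttle left the launch pad"

print(encode(text, vocabulary, "multi_hot"))  # [1, 1, 0, 1]
print(encode(text, vocabulary, "count"))      # [3, 2, 0, 1]
```

In both modes the word order of the post is discarded, which is why the representation is called a bag of words.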


In the following cell, please build a simple neural network with a single hidden layer of 128 neurons and a ReLU activation function. Be sure to specify the shape of the input correctly and use the appropriate activation function for the output. Your model should have 256,902 parameters. The code for compiling has been written for you already.
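
As a sanity check on the expected parameter count, the arithmetic can be sketched as follows. The `dense_params` helper is a hypothetical name introduced for illustration; it assumes fully connected layers with one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(2000, 128)  # 2000-dimensional BoW input -> 128 hidden units
output = dense_params(128, 6)     # 128 hidden units -> 6 classes

print(hidden, output, hidden + output)  # 256128 774 256902
```

If `bow_model.summary()` reports a different total, the input shape or the number of output units is the likely culprit.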

In [None]:
### YOUR CODE BELOW ###
inputs = keras.layers.Input(shape=(max_tokens,))
x = keras.layers.Dense(128, activation='relu')(inputs)
outputs = keras.layers.Dense(6, activation='softmax')(x)

### YOUR CODE ABOVE ###

bow_model = keras.Model(inputs, outputs)
bow_model.summary()

bow_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy']
)

A theme that we will be investigating in this homework is the **impact of the number of training examples on the different models**. As a result, in addition to training the model on all 3000+ training examples, we will also test how our model performs with only 200 examples.

The code below shows a fitting process in which we first fit the model `bow_model` to the first 200 examples from the training set (`X_train[:200]`, `y_train[:200]`). We print the accuracy of the model on the test set. Then, we run the training procedure for another 20 epochs, this time with the full dataset (`X_train`, `y_train`). You just need to run the cell.

In [None]:
# Fit model on the first 200 training examples with 20 epochs and batch size of 32
bow_model.fit(
    x=X_train[:200], y=y_train[:200],
    epochs=20, batch_size=32,
    verbose=1,
)

print(f"*** Test accuracy with 200 examples:{bow_model.evaluate(x=X_test, y=y_test)[1]:.2%} ***")


# Continue fitting on the full training data with 20 epochs and batch size of 32
bow_model.fit(
    x=X_train, y=y_train,
    epochs=20, batch_size=32,
    verbose=1,
)

print(f"*** Test accuracy with ALL examples:{bow_model.evaluate(x=X_test, y=y_test)[1]:.2%} ***")

Epoch 1/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 79ms/step - accuracy: 0.2368 - loss: 1.8083
Epoch 2/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9361 - loss: 1.1549
Epoch 3/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9828 - loss: 0.7803
Epoch 4/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.5037
Epoch 5/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 1.0000 - loss: 0.3188
Epoch 6/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.2032
Epoch 7/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.1325
Epoch 8/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 1.0000 - loss: 0.0893
Epoch 9/20
7/7 ━━━━━━━━━━━━━━━━━━━━

**<font color='red'>Please Answer Below</font>**

BoW Base Model
* Test accuracy with 200 Examples: _63.17__%
* Test accuracy with All Examples: _82.80__%

What is the baseline performance by predicting the most frequent class?

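About 18.4%. The most frequent classes in the training set (labels 2 and 3) each account for 18.44% of the examples, so always predicting one of them is correct about that often, assuming the test set has a similar label distribution. Uniform random guessing over 6 classes would give 16.7% (1/6).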
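
The majority-class baseline can be computed directly rather than estimated. The following is a minimal sketch with small hypothetical label lists; in the notebook one would pass `train_df['label']` and `test_df['label']` instead. The function name `majority_baseline_accuracy` is introduced here for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy on the test set when always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Hypothetical toy labels, not the homework data
train_labels = [2, 2, 2, 3, 3, 1, 0, 4, 5]
test_labels = [2, 3, 1, 2, 0, 4]

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(f"Always predicting label {label} gives {accuracy:.2%} test accuracy")
```

Note that the label is chosen from the training set but scored on the test set, so the result can differ slightly from the training-set share.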

## Part (b): Explore Hyperparameters [35 Points]

Now, let us try changing some of the hyperparameters in the text vectorization process.

* Use bigrams instead of unigrams.
* Increase the maximum number of tokens from 2000 to 5000
* Use count vectorization instead of multi-hot.

For each of the above modifications,
* modify the code cell below (feel free to copy-paste the code from above)
* run the code cell and
* report the test accuracy.

<font color='red'>Note that you should make each change independent of the other two changes. When you are done, you should have 3 versions of the code cell below.</font>
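
To make the first modification concrete, here is a small sketch of n-gram extraction in plain Python. It follows the documented behavior of the `ngrams` argument of `TextVectorization`, where an integer value creates n-grams up to that size, so `ngrams=2` yields unigrams as well as bigrams. The `ngrams_up_to` helper is a hypothetical name and not part of Keras.

```python
def ngrams_up_to(tokens, n):
    """Return all contiguous n-grams of length 1..n, joined with spaces."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

tokens = "the space shuttle".split()

print(ngrams_up_to(tokens, 1))  # ['the', 'space', 'shuttle']
print(ngrams_up_to(tokens, 2))  # ['the', 'space', 'shuttle', 'the space', 'space shuttle']
```

Because unigrams and bigrams share the same `max_tokens` budget, adding bigrams leaves fewer vocabulary slots for individual words.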




In [None]:
# Use bigrams instead of unigrams.
# Note: an integer ngrams=2 creates n-grams up to size 2, i.e. unigrams and bigrams.
max_tokens = 2000

text_vectorization = keras.layers.TextVectorization(
     max_tokens=max_tokens,
     output_mode='multi_hot',
     ngrams=2,
     standardize='lower_and_strip_punctuation',
)

text_vectorization.adapt(train_df['text'])

X_train = text_vectorization(train_df['text'])
X_test = text_vectorization(test_df['text'])

### YOUR CODE BELOW ###
inputs = keras.layers.Input(shape=(max_tokens,))
x = keras.layers.Dense(128, activation='relu')(inputs)
outputs = keras.layers.Dense(6, activation='softmax')(x)

### YOUR CODE ABOVE ###

bow_model = keras.Model(inputs, outputs)
bow_model.summary()

bow_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy']
)

bow_model.fit(
    x=X_train[:200], y=y_train[:200],
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with 200 Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

bow_model.fit(
    x=X_train, y=y_train,
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with All Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

Epoch 1/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 80ms/step - accuracy: 0.2398 - loss: 1.7987
Epoch 2/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9473 - loss: 1.0136
Epoch 3/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9981 - loss: 0.6463
Epoch 4/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9931 - loss: 0.3915
Epoch 5/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 1.0000 - loss: 0.2407
Epoch 6/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.1516
Epoch 7/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.0987
Epoch 8/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 1.0000 - loss: 0.0671
Epoch 9/20
7/7 ━━━━━━━━━━━━━━━━━━━━

In [None]:
# Increase the maximum number of tokens from 2000 to 5000

max_tokens = 5000

text_vectorization = keras.layers.TextVectorization(
     max_tokens=max_tokens,
     output_mode='multi_hot',
     ngrams=1,
     standardize='lower_and_strip_punctuation',
)

text_vectorization.adapt(train_df['text'])

X_train = text_vectorization(train_df['text'])
X_test = text_vectorization(test_df['text'])

### YOUR CODE BELOW ###
inputs = keras.layers.Input(shape=(max_tokens,))
x = keras.layers.Dense(128, activation='relu')(inputs)
outputs = keras.layers.Dense(6, activation='softmax')(x)

### YOUR CODE ABOVE ###

bow_model = keras.Model(inputs, outputs)
bow_model.summary()

bow_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy']
)

bow_model.fit(
    x=X_train[:200], y=y_train[:200],
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with 200 Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

bow_model.fit(
    x=X_train, y=y_train,
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with All Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

Epoch 1/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 84ms/step - accuracy: 0.2527 - loss: 1.7625
Epoch 2/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9718 - loss: 0.9889
Epoch 3/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9963 - loss: 0.5551
Epoch 4/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 1.0000 - loss: 0.2937
Epoch 5/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 1.0000 - loss: 0.1575
Epoch 6/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 1.0000 - loss: 0.0895
Epoch 7/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 1.0000 - loss: 0.0545
Epoch 8/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 1.0000 - loss: 0.0358
Epoch 9/20
7/7 ━━━━━━━━━━━━━━━━━━━━

In [None]:
# Use count vectorization instead of multi-hot.

max_tokens = 2000

text_vectorization = keras.layers.TextVectorization(
     max_tokens=max_tokens,
     output_mode='count',
     ngrams=1,
     standardize='lower_and_strip_punctuation',
)

text_vectorization.adapt(train_df['text'])

X_train = text_vectorization(train_df['text'])
X_test = text_vectorization(test_df['text'])

### YOUR CODE BELOW ###
inputs = keras.layers.Input(shape=(max_tokens,))
x = keras.layers.Dense(128, activation='relu')(inputs)
outputs = keras.layers.Dense(6, activation='softmax')(x)

### YOUR CODE ABOVE ###

bow_model = keras.Model(inputs, outputs)
bow_model.summary()

bow_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy']
)

bow_model.fit(
    x=X_train[:200], y=y_train[:200],
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with 200 Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

bow_model.fit(
    x=X_train, y=y_train,
    epochs=20, batch_size=32,
    verbose=1,
)
print("\n*** Test accuracy with All Examples: %.4f ***\n" % bow_model.evaluate(x=X_test, y=y_test)[1])

Epoch 1/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 82ms/step - accuracy: 0.1418 - loss: 2.5006
Epoch 2/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6095 - loss: 1.3074
Epoch 3/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8299 - loss: 0.8777
Epoch 4/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9531 - loss: 0.6360
Epoch 5/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9858 - loss: 0.4607
Epoch 6/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9963 - loss: 0.3299
Epoch 7/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 1.0000 - loss: 0.2411
Epoch 8/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 1.0000 - loss: 0.1763
Epoch 9/20
7/7 ━━━━━━━━━━━━━━━━━━━━

**<font color='red'>Please Answer Below</font>**

Fill in the blanks with the test accuracies you obtained from re-running Problem 1(a)'s code with the relevant modifications.

For each of the 3 modifications, briefly comment on why one might expect such a change to be beneficial. Using the numbers above, comment on whether it improved, hurt or didn't affect the model's performance.

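* Bigrams instead of unigrams
  * 200 Training Examples: 0.57
  * All Training Examples: 0.78
  * Comment: Bigrams capture short phrases and local word order that single words miss, so one might expect a gain. Here the change hurt performance relative to the base model (0.57 vs. 0.63 and 0.78 vs. 0.83), plausibly because bigrams compete with unigrams for the same 2000 vocabulary slots.
* Increasing max_tokens to 5000
  * 200 Training Examples: 0.66
  * All Training Examples: 0.87
  * Comment: A larger vocabulary keeps rarer, topic-specific words that would otherwise be mapped to [UNK]. This improved performance (0.66 vs. 0.63 and 0.87 vs. 0.83).
* Count instead of multi-hot
  * 200 Training Examples: 0.6275
  * All Training Examples: 0.827
  * Comment: Counts record how often each word appears rather than only whether it appears, which could help when frequency signals the topic. Here the effect was negligible: accuracy was marginally lower than the base model (0.6275 vs. 0.6317 and 0.827 vs. 0.828).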

# Problem 2: BERT Transformer Model [50 Points]

In this problem, you will demonstrate how using a pre-trained model like [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) can yield high accuracy with as few as 200 examples.

## Setup

In [None]:
pip install --upgrade keras-hub

Collecting keras-hub
  Downloading keras_hub-0.19.1-py3-none-any.whl.metadata (7.7 kB)
Downloading keras_hub-0.19.1-py3-none-any.whl (704 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 704.8/704.8 kB 31.5 MB/s eta 0:00:00
Installing collected packages: keras-hub
  Attempting uninstall: keras-hub
    Found existing installation: keras-hub 0.18.1
    Uninstalling keras-hub-0.18.1:
      Successfully uninstalled keras-hub-0.18.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
keras-nlp 0.18.1 requires keras-hub==0.18.1, but you have keras-hub 0.19.1 which is incompatible.
Successfully installed keras-hub-0.19.1


In [None]:
import keras_hub

We will download BERT from the Keras Hub model repository. This will take a minute or two the first time you do it.

In [None]:
bert = keras_hub.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    activation="softmax",
    num_classes=6
)

Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/config.json...


100%|██████████| 457/457 [00:00<00:00, 846kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/model.weights.h5...


100%|██████████| 418M/418M [00:27<00:00, 15.9MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/tokenizer.json...


100%|██████████| 761/761 [00:00<00:00, 1.35MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/assets/tokenizer/vocabulary.txt...


100%|██████████| 226k/226k [00:00<00:00, 285kB/s]


Take a look at BERT's architecture.

In [None]:
bert.summary()

109 million parameters!!

In lecture 7 (Monday, Feb 24th), we show how to finetune this model end-to-end. Here, we show a different way to use BERT: we will take the BERT backbone, run our data through it and get the embedding for the CLS token for each data point, and then use that as the input for our own, little MLP.

Here's the backbone.

In [None]:
bert.backbone.summary()

Notice that the backbone outputs a 768-dimensional vector for each token. The vector at position 0 is the embedding of the CLS token.

This will be the input to our MLP.
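
The CLS extraction amounts to a single slice over the token axis. The sketch below illustrates it with a random NumPy array standing in for the backbone's sequence output; the batch size and sequence length are illustrative assumptions, and no BERT model is involved.

```python
import numpy as np

# Stand-in for the backbone's sequence output: (batch, tokens, hidden size).
# The sizes 4 and 8 are illustrative; only 768 matches the BERT base hidden size.
batch_size, seq_len, hidden_dim = 4, 8, 768
rng = np.random.default_rng(0)
sequence_output = rng.normal(size=(batch_size, seq_len, hidden_dim))

# Keep every example, take token position 0 (CLS), keep all hidden dimensions
cls_embeddings = sequence_output[:, 0, :]

print(cls_embeddings.shape)  # (4, 768)
```

Each post is thereby reduced to one fixed-length 768-dimensional vector, regardless of how many tokens it contains.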

## Run datasets through the BERT backbone

Step 1: We need to run our data through the BERT backbone and get the embedding for the CLS token for each data point.

The code cell below does this for the training dataset.

In [None]:
tokenized_input = bert.preprocessor(train_df.text)
bert_output = bert.backbone.predict(tokenized_input)['sequence_output'][:, 0, :]

101/101 ━━━━━━━━━━━━━━━━━━━━ 20s 151ms/step


Your turn: Do what we did above, for the test set.

In [None]:
test_tokenized_input = bert.preprocessor(test_df.text)
test_bert_output = bert.backbone.predict(test_tokenized_input)['sequence_output'][:, 0, :]

68/68 ━━━━━━━━━━━━━━━━━━━━ 13s 191ms/step


## Fine-tune on 200 examples

Define a neural network with 2 dense layers of 128 neurons each using a ReLU activation. Each dense layer should be followed by a Dropout layer with probability of 0.1.

In [None]:
### YOUR CODE BELOW ###

inputs = keras.layers.Input(shape=(768,))  # Input shape should match BERT output dimensionality
x1 = keras.layers.Dense(128, activation='relu')(inputs)  # First hidden layer
x1 = keras.layers.Dropout(0.1)(x1) # Dropout for regularization
x2 = keras.layers.Dense(128, activation='relu')(x1) # Second hidden layer
x2 = keras.layers.Dropout(0.1)(x2) # Dropout for regularization
outputs = keras.layers.Dense(6, activation='softmax')(x2) # Output layer

### YOUR CODE ABOVE ###


# Model
bert_finetuned = keras.Model(inputs, outputs) # Create the Keras model

In [None]:
bert_finetuned.summary()

**How many parameters does your model have: ____115,718____?**

*Answer: 115,718*

Let's compile and train it.

In [None]:
bert_finetuned.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

bert_finetuned.fit(x=bert_output[:200],
                   y=y_train[:200],
                   epochs=20,
                   batch_size=32,verbose=1,
)

accuracy_200 = bert_finetuned.evaluate(x=test_bert_output, y=y_test)[1]
print(f"*** Test accuracy with 200 Examples: {accuracy_200:.2%} ***")

Epoch 1/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 3s 186ms/step - accuracy: 0.2410 - loss: 1.7789
Epoch 2/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.4343 - loss: 1.5154
Epoch 3/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.5610 - loss: 1.3195
Epoch 4/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6427 - loss: 1.1467
Epoch 5/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7122 - loss: 0.9684
Epoch 6/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7319 - loss: 0.8077
Epoch 7/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8224 - loss: 0.7089
Epoch 8/20
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7664 - loss: 0.6446
Epoch 9/20
7/7 ━━━━━━━━━━━━━━━━━━━━

## Fine-tune on ALL examples

Your turn: Compile and train the model on ALL examples. Use the same batch size and number of epochs as above.

In [None]:
# Model
# Note: this wraps the same layer objects defined above, so the weights learned
# on the 200 examples are kept and training continues from them.
bert_finetuned = keras.Model(inputs, outputs)

### YOUR CODE BELOW ###

bert_finetuned.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

bert_finetuned.fit(x=bert_output,
                   y=y_train,
                   epochs=20,
                   batch_size=32, verbose=1,
)

### YOUR CODE ABOVE ###

accuracy_ALL = bert_finetuned.evaluate(x=test_bert_output, y=y_test)[1]
print(f"*** Test accuracy with ALL Examples: {accuracy_ALL:.2%} ***")

Epoch 1/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.9640 - loss: 0.0909
Epoch 2/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9699 - loss: 0.0853
Epoch 3/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9692 - loss: 0.0801
Epoch 4/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9691 - loss: 0.0863
Epoch 5/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9667 - loss: 0.0780
Epoch 6/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9776 - loss: 0.0648
Epoch 7/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9734 - loss: 0.0883
Epoch 8/20
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9806 - loss: 0.0521
Epoch 9/20
101/101 ━━━━━━━

<font color='red'>**Please Answer Below**</font>

Fill in the table below with the accuracies from above.

The base BoW Model:
* Test accuracy with 200 Examples: _63.17__%
* Test accuracy with All Examples: _82.80__%


BERT Model Accuracy
* Test accuracy with 200 Examples: 74.08%
* Test accuracy with All Examples: 78.93%


Comment on the performance of the BERT-based model relative to the BoW model.

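200 examples: The BERT-based model outperforms the BoW model (74.08% vs. 63.17%).

All examples: The BoW model outperforms the BERT-based model (82.80% vs. 78.93%).

Conclusion: The BERT-based model performs better when only a few labeled examples are available, so it needs less training data to reach a given accuracy. With the full training set, the simpler BoW model overtakes it.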