Portfolio Assignment 20-1
This notebook demonstrates how to compile a model using TensorFlow and Keras.
```python
import tensorflow as tf  # type: ignore
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
    pooling='avg',
    classifier_activation='softmax'
)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```
This code snippet loads the CIFAR-100 dataset, initializes a ResNet50 model, compiles it with the Adam optimizer and sparse categorical crossentropy loss, and trains it for 5 epochs.


In [None]:
#%pip uninstall protobuf -y
#%pip install protobuf==3.20.1
# After running the above commands, please restart the kernel BEFORE running the import statements below.
# conda install pandas
#%pip install tensorflow-macos
#%pip install tensorflow-metal
#%pip install tensorflow-datasets
#%pip install tensorflow
#%pip install scikit-learn
import pandas                      as pd
import tensorflow                  as tf  # type: ignore
from   tensorflow              import keras # type: ignore
from   sklearn.model_selection import train_test_split

tf.random.set_seed(42)

## IMDB movie reviews

## Retrieving and preparing the Data

We will work with the IMDb movie reviews data.

In [2]:
# question 1
# Read in the IMDB Dataset into "data". Do not set an index column
data = pd.read_csv('files/IMDB Dataset.csv')

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
# question 2
# Replace all "negative" and "positive" sentiment values with 0 and 1 respectively.
# You can use a simple logical operator instead of label encoding.
data['sentiment'] = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
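
The "simple logical operator" mentioned in the comment can also be applied to the whole column at once rather than row by row. A minimal sketch, assuming a small hypothetical DataFrame `toy` that stands in for `data`:

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame; only the 'sentiment' column matters here.
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})

# Comparing the column to 'positive' yields booleans; casting to int gives 1/0.
toy["sentiment"] = (toy["sentiment"] == "positive").astype(int)

print(toy["sentiment"].tolist())  # [1, 0, 1]
```

Both forms should produce the same labels; the vectorized comparison simply avoids calling a Python lambda once per row.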

In [5]:
# question 3
# Get the dependent data and assign to y
y = data['sentiment']   
print(y[0:10])

0    1
1    1
2    1
3    0
4    1
5    1
6    1
7    0
8    0
9    1
Name: sentiment, dtype: int64


In [6]:
# question 4
# Split the X data (data['review']) and y data into X_train, X_test, y_train, and y_test
# With a test size of 0.2 and a random_state of 42
X_train, X_test, y_train, y_test = train_test_split(data['review'], y, test_size=0.2, random_state=42)

In [None]:
print(f"""
Train samples: {X_train.shape[0]}
Test samples : {X_test.shape[0]}
"""
)


Train samples: 40000
Test samples: 10000



In [8]:
y_train

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64

Inspect the frequency of each sentiment in the training dataset (it is balanced!).

In [9]:
# question 5
# Calculate the training data's frequency and assign the output to "frequency"
frequency = y_train.value_counts() / y_train.shape[0]
print(frequency)

sentiment
0    0.500975
1    0.499025
Name: count, dtype: float64


In [10]:
# question 6
# Let's turn the target into a dummy vector
y_train = pd.get_dummies(y_train).to_numpy()
y_test  = pd.get_dummies(y_test).to_numpy()

In [11]:
y_train.shape

(40000, 2)
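
The shape `(40000, 2)` reflects one column per class. A small sketch of what the dummy encoding looks like, using a hypothetical four-label series in place of `y_train`; the cast to `float32` mirrors the cast applied later in the notebook (recent pandas versions return boolean dummy columns by default):

```python
import pandas as pd

# Hypothetical labels standing in for y_train.
labels = pd.Series([0, 1, 1, 0])

# One column per class: column 0 marks negatives, column 1 marks positives.
one_hot = pd.get_dummies(labels).to_numpy().astype("float32")

print(one_hot.shape)     # (4, 2)
print(one_hot.tolist())  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```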

## Unigram Multi-hot Encoding Baseline

Next, let us see the performance of a neural net trained from scratch using multi-hot encoding.

In [12]:
from tensorflow.keras.layers import TextVectorization # type: ignore

# Set the maximum number of tokens to 2412. 
# Also set up our Text Vectorization layer using multi-hot encoding
max_tokens      = 2412
text_vectorization = TextVectorization(max_tokens  = max_tokens, 
                                    output_mode = 'multi_hot') 

In [13]:
# The vocabulary that will be indexed is given by the text corpus on our train dataset
text_vectorization.adapt(X_train)

In [14]:
# Question 7
# We vectorize our input
X_train = text_vectorization(X_train)
X_test  = text_vectorization(X_test)
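
To make the encoding concrete, here is a simplified pure-Python sketch of multi-hot vectorization. It is not the Keras implementation: it assumes lowercasing and whitespace splitting only (the real layer also strips punctuation by default), and the helper names `build_vocab` and `multi_hot` are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Keep the most frequent words; index 0 is reserved for unknown words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # One slot is taken by the unknown-word index, so max_tokens - 1 words remain.
    most_common = [word for word, _ in counts.most_common(max_tokens - 1)]
    return {word: i + 1 for i, word in enumerate(most_common)}


def multi_hot(text, vocab, max_tokens):
    """Return a 0/1 vector with a 1 at the index of every word present."""
    vector = [0] * max_tokens
    for word in text.lower().split():
        vector[vocab.get(word, 0)] = 1  # unknown words all share index 0
    return vector


corpus = ["great movie great cast", "dull movie", "great"]
vocab = build_vocab(corpus, max_tokens=3)
encoded = multi_hot("great unseen film", vocab, max_tokens=3)

print(vocab)    # {'great': 1, 'movie': 2}
print(encoded)  # [1, 1, 0]
```

Word order and repetition are discarded: the vector only records whether each vocabulary word occurs, which is why every review becomes a fixed-length input of size `max_tokens`.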

In [15]:
# Question 8
# Now create your model. Start with a Dense layer of 32 ReLU units, a Dropout layer with rate 0.5, and a final softmax layer
inputs  = keras.Input(shape=(max_tokens, ))
x       = keras.layers.Dense(32, activation="relu")(inputs)
x       = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.summary()
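
The parameter count of this architecture can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. A short sketch of the arithmetic (the helper `dense_params` is hypothetical); the total should agree with what `model.summary()` reports:

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


max_tokens = 2412
hidden = dense_params(max_tokens, 32)  # 2412 * 32 + 32
output = dense_params(32, 2)           # 32 * 2 + 2
total = hidden + output

print(hidden, output, total)  # 77216 66 77282
```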

In [16]:
# Compile your model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
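
For intuition about the loss chosen here, categorical cross-entropy with a one-hot target reduces to the negative log of the probability assigned to the true class. A minimal pure-Python sketch (the function name is hypothetical and this is not how Keras implements it internally):

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# True class is "positive" (index 1) in both cases.
confident_right = categorical_crossentropy([0, 1], [0.1, 0.9])
confident_wrong = categorical_crossentropy([0, 1], [0.9, 0.1])

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

A confident wrong prediction is penalized far more heavily than a confident correct one is rewarded, which is what pushes the softmax outputs toward the true labels.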

In [17]:
# Fit model
# Use one-hot encoded y for training and testing
model.fit(
    x              = X_train, 
    y              = y_train.astype('float32'), 
    validation_data= (X_test, 
                      y_test.astype('float32')),
    epochs         = 5
)


Epoch 1/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 633us/step - accuracy: 0.7783 - loss: 0.4543 - val_accuracy: 0.8770 - val_loss: 0.2829
Epoch 2/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 566us/step - accuracy: 0.8754 - loss: 0.2952 - val_accuracy: 0.8800 - val_loss: 0.2792
Epoch 3/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 575us/step - accuracy: 0.8918 - loss: 0.2672 - val_accuracy: 0.8782 - val_loss: 0.2836
Epoch 4/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 574us/step - accuracy: 0.9005 - loss: 0.2457 - val_accuracy: 0.8777 - val_loss: 0.2972
Epoch 5/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 577us/step - accuracy: 0.9072 - loss: 0.2317 - val_accuracy: 0.8774 - val_loss: 0.3072


<keras.src.callbacks.history.History at 0x16c6ef950>
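
In the log above, training loss keeps falling while validation loss bottoms out early and then rises, a common sign of overfitting. A small sketch that picks the best epoch from the logged values (numbers copied from the output above; variable names are illustrative):

```python
# Per-epoch values copied from the training log above.
train_loss = [0.4543, 0.2952, 0.2672, 0.2457, 0.2317]
val_loss = [0.2829, 0.2792, 0.2836, 0.2972, 0.3072]

# Epochs are 1-indexed in the Keras log, hence the + 1.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1

# Epochs where validation loss got worse while training loss improved.
diverging = [
    i + 1
    for i in range(1, len(val_loss))
    if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]
]

print(best_epoch)  # 2
print(diverging)   # [3, 4, 5]
```

One option for automating this, not used in the notebook, is the `keras.callbacks.EarlyStopping` callback, which can monitor `val_loss` and stop training once it stops improving.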

In [18]:
# Question 9
# Evaluate your model. You should be able to reach at least 85% test accuracy at this point
# Use one-hot encoded y for evaluation as well
model.evaluate(x=X_test, y=y_test.astype('float32'))[1] > 0.85

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 352us/step - accuracy: 0.8752 - loss: 0.3083


True

## Extend Baseline Model

Let's create more complex models to increase the accuracy on our test sample. Try combining different models by changing:
- Number of hidden units
- Adding another hidden layer.
- Changing the number of epochs.
- Using bigrams instead of unigrams.

To guide your search for the best parameters, note how the accuracy changes on both train and test data.
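
As an illustration of the bigram option listed above, the sketch below shows what "unigrams plus bigrams" means for a single review. It is a pure-Python approximation with a hypothetical helper name; in Keras the usual route would be passing `ngrams=2` to `TextVectorization`, which builds n-grams up to length 2.

```python
def unigrams_and_bigrams(text):
    """Return single words followed by all adjacent word pairs."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams("not a good movie")
print(features)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams keep a little local word order, so a multi-hot model gets features for word pairs rather than only for the individual words. The cost is a larger vocabulary, so `max_tokens` may need to grow accordingly.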

In [19]:
embedding_dim = 64  # not used in this cell
hidden_units  = 32  # not used in this cell; the Dense layer below uses 128 units
num_classes   = y_train.shape[1]  # Use y_train, which is one-hot encoded
model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_tokens,)),  # Use max_tokens as input shape
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation='softmax')
])

In [20]:
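# Note: this cell does not compile the IMDB model defined above. It loads CIFAR-100,
# rebinds `model` to an untrained ResNet50, and overwrites y_train / y_test.
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top = True,
    weights     = None,         # train from scratch, no pretrained weights
    input_shape = (32, 32, 3),
    classes     = 100,
)
# The default top layer applies softmax, so the loss receives probabilities, not logits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)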

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 273s 338ms/step - accuracy: 0.0573 - loss: 5.0049
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 272s 348ms/step - accuracy: 0.0828 - loss: 4.6745
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 265s 339ms/step - accuracy: 0.1209 - loss: 4.0045
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 264s 337ms/step - accuracy: 0.1757 - loss: 3.6470
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 262s 335ms/step - accuracy: 0.1754 - loss: 3.7303


<keras.src.callbacks.history.History at 0x32e106390>
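
To put these CIFAR-100 numbers in context, it helps to compare them with a model that guesses uniformly over the 100 classes. A short sketch of that baseline arithmetic, assuming the standard 50,000-image CIFAR-100 training split:

```python
import math

num_classes = 100
train_images = 50_000  # size of the standard CIFAR-100 training split
batch_size = 64

# A uniform guess is right 1 time in 100 and has cross-entropy ln(100).
chance_accuracy = 1 / num_classes
uniform_loss = math.log(num_classes)

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_images / batch_size)

print(chance_accuracy)         # 0.01
print(round(uniform_loss, 4))  # 4.6052
print(steps_per_epoch)         # 782
```

The step count matches the `782/782` shown in the log. The first-epoch loss is above the uniform baseline, and the final training accuracy of about 0.175 is well above chance but still low, so five epochs from randomly initialized weights appear to be far from enough for this model.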