<span style="color: cyan;">

### `Portfolio-Assignment-20-1`

This notebook demonstrates how to build, compile, and train Keras models with TensorFlow: first a multi-hot text classifier on IMDB movie reviews, then a ResNet50 on CIFAR-100.

This code snippet loads the CIFAR-100 dataset, initializes a ResNet50 model, compiles it with the Adam optimizer and sparse categorical crossentropy loss, and trains it for 5 epochs.
```python
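import tensorflow as tf

# Load CIFAR-100: 32x32 RGB images with integer class labels
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

# ResNet50 trained from scratch (weights=None) with a 100-way softmax head
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
    pooling='avg',  # only takes effect when include_top=False
    classifier_activation='softmax'
)

# Sparse categorical cross-entropy accepts the integer labels directly
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train for 5 epochs, reporting metrics on the test split after each epoch
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))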
```

This cell imports the libraries used in the rest of the notebook. `pandas` is imported as `pd` for handling tabular data. `tensorflow` is imported as `tf`, together with its `keras` module and the `TextVectorization` layer, which are used to preprocess text and to build and train the models. `train_test_split` from `sklearn.model_selection` splits the dataset into training and testing sets. Finally, `tf.random.set_seed(42)` fixes TensorFlow's global random seed so that operations such as weight initialization and dropout are repeatable from run to run; results can still vary slightly across hardware or library versions.

In [1]:
import pandas                      as pd
import tensorflow                  as tf                # type: ignore
from   tensorflow              import keras             # type: ignore
from   tensorflow.keras.layers import TextVectorization # type: ignore
from   sklearn.model_selection import train_test_split  # type: ignore
tf.random.set_seed(42)

<span style="color: cyan;">

IMDB movie reviews

Retrieving and preparing the Data

We will work with the IMDb movie reviews data.

In [2]:
# question 1
# Read in the IMDB Dataset into "data". Do not set an index column
data = pd.read_csv('files/IMDB Dataset.csv')

<span style="color: cyan;">

The code `data.head()` displays the first five rows of the DataFrame `data`. This is a common practice in data analysis to quickly inspect the initial entries of a dataset. By viewing these rows, you can verify that the data has been loaded correctly, check the column names, and get an initial sense of the structure and contents of the DataFrame before proceeding with further analysis or processing.

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


<span style="color: cyan;">

This cell converts the `'sentiment'` column of `data` from text labels to numeric values. It uses `apply` with a lambda function: `'positive'` becomes `1`, and every other value becomes `0`. Because the column is expected to contain only `'positive'` and `'negative'`, the result is a binary (0/1) label encoding; note that the `else` branch would also silently map a misspelled or missing label to `0`. Models require numeric targets, so this step makes the labels usable for classification.

In [4]:
# question 2
# Replace all "negative" and "positive" sentiment values with 0 and 1 respectively.
# You can use a simple logical operator instead of label encoding.
data['sentiment'] = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
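
The lambda above sends every value other than `'positive'` to `0`. As a minimal sketch of a stricter alternative, the plain-Python function below uses an explicit dictionary so that an unexpected label raises an error instead of being encoded as negative. The names `LABEL_TO_INT` and `encode_sentiment` are illustrative and are not part of the assignment.

```python
# Hypothetical helper: explicit label mapping that rejects unknown values.
LABEL_TO_INT = {"negative": 0, "positive": 1}

def encode_sentiment(label: str) -> int:
    """Return 0 for 'negative' and 1 for 'positive'; raise on anything else."""
    try:
        return LABEL_TO_INT[label]
    except KeyError:
        raise ValueError(f"unexpected sentiment label: {label!r}") from None

labels = ["positive", "negative", "positive"]
encoded = [encode_sentiment(label) for label in labels]
print(encoded)  # [1, 0, 1]
```

With pandas, the same dictionary could be passed to `Series.map`, which leaves unmapped values as `NaN` rather than raising.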

In [5]:
# question 3
# Get the dependent data and assign to y
y = data['sentiment']   
y[0:10]

0    1
1    1
2    1
3    0
4    1
5    1
6    1
7    0
8    0
9    1
Name: sentiment, dtype: int64

<span style="color: cyan;">

This code splits the dataset into training and testing sets using the `train_test_split` function from scikit-learn. It takes the `'review'` column from the DataFrame `data` as the feature set and `y` as the target variable. The `test_size=0.2` argument specifies that 20% of the data should be reserved for testing, while the remaining 80% is used for training. The `random_state=42` parameter ensures that the split is reproducible by setting a fixed seed for the random number generator. The result is four variables: `X_train` and `y_train` for training, and `X_test` and `y_test` for testing, which are commonly used in machine learning workflows to evaluate model performance.

In [6]:
# question 4
# Split the X data (data['review']) and y data into X_train, X_test, y_train, and y_test
# With a test size of 0.2 and a random_state of 42
X_train, X_test, y_train, y_test = train_test_split(data['review'], y, test_size=0.2, random_state=42)
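
To make the roles of `test_size` and `random_state` concrete, here is a rough pure-Python sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation and will not reproduce its exact permutation; `shuffled_split` and the toy `reviews` list are assumptions made for illustration.

```python
import random

def shuffled_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then reserve a test_size share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)   # same seed -> same order on every run
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test

reviews = [f"review {i}" for i in range(10)]   # toy stand-in for data['review']
train, test = shuffled_split(reviews)
print(len(train), len(test))  # 8 2
```

Calling `shuffled_split(reviews)` again returns the same two lists, which is the property `random_state=42` provides in the notebook.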

<span style="color: cyan;">

This code prints the number of samples in the training and testing datasets. It uses an f-string to format the output, displaying the counts of `X_train` and `X_test` by accessing their `.shape[0]` attributes, which represent the number of rows (samples) in each set. This is useful for quickly verifying the sizes of your splits after using `train_test_split`, ensuring that the data was divided as expected and that you have the correct number of samples for model training and evaluation.

In [7]:
print(f"""
Train samples: {X_train.shape[0]}
Test samples : {X_test.shape[0]}
"""
)


Train samples: 40000
Test samples : 10000



<span style="color: cyan;">

The code `y_train` displays the contents of the variable `y_train`, which represents the target values for the training dataset. In this context, `y_train` contains the sentiment labels (such as 0 for negative and 1 for positive) corresponding to each review in the training set. Viewing `y_train` allows you to inspect the distribution and encoding of the target variable, which is useful for verifying that the data preparation steps have been performed correctly before training a machine learning model.

In [8]:
y_train

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64

<span style="color: cyan;">

Inspect the frequency of each sentiment in the training dataset (it is balanced!)

This code calculates the relative frequency of each sentiment class in the training dataset. The `value_counts()` method counts the occurrences of each unique value in `y_train`, which represents the sentiment labels. By dividing these counts by the total number of samples (`y_train.shape[0]`), the code computes the proportion of each class within the training set. Assigning the result to `frequency` provides a quick way to check if the dataset is balanced or if one class is more prevalent than the other. Displaying `frequency` helps you understand the class distribution, which is important for evaluating and improving model performance.

In [9]:
# question 5
# Calculate the training data's frequency and assign the output to "frequency"
frequency = y_train.value_counts() / y_train.shape[0]
frequency

sentiment
0    0.500975
1    0.499025
Name: count, dtype: float64
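
The same calculation can be written without pandas. The sketch below counts labels with `collections.Counter` and divides by the sample size; `relative_frequency` and `toy_labels` are made-up names and data used only to show the arithmetic.

```python
from collections import Counter

def relative_frequency(labels):
    """Return the share of samples that carry each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]   # illustrative labels, not the IMDB data
toy_frequency = relative_frequency(toy_labels)
print(toy_frequency)  # {0: 0.5, 1: 0.5}
```

In pandas, `y_train.value_counts(normalize=True)` performs the division in one call.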

<span style="color: cyan;">

This cell converts the targets `y_train` and `y_test` from integer labels to one-hot encoded arrays. `pd.get_dummies()` creates one indicator column per class, so label `0` becomes `[1, 0]` and label `1` becomes `[0, 1]`. Recent pandas versions return these indicators as booleans rather than integers, which is presumably why the arrays are cast to `float32` before training. `.to_numpy()` then converts the resulting DataFrame into a NumPy array. A single 0/1 column would be enough for a binary problem, but the one-hot form is needed here because the model defined later ends in a two-unit softmax layer trained with categorical cross-entropy.

In [10]:
# question 6
# Let's turn the target into a dummy vector
y_train = pd.get_dummies(y_train).to_numpy()
y_test  = pd.get_dummies(y_test).to_numpy()
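
As a minimal sketch of what one-hot encoding does to each label, the function below builds the rows by hand. `one_hot` is a hypothetical helper, and it takes the number of classes as an explicit argument rather than inferring it from the data.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

print(one_hot([1, 0, 1], num_classes=2))
# [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Passing `num_classes` explicitly guards against a case that `pd.get_dummies` does not: if a class happened to be absent from one split, `get_dummies` would produce fewer columns for that split.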

<span style="color: cyan;">

The code `y_train.shape` returns the dimensions of the `y_train` array. In this context, after converting `y_train` to a one-hot encoded NumPy array, `y_train.shape` will output a tuple indicating the number of samples and the number of classes. This is useful for verifying that the target variable has been correctly transformed and matches the expected input shape for machine learning models, especially neural networks that require specific input dimensions.

In [11]:
y_train.shape

(40000, 2)

<span style="color: cyan;">

Unigram Multi-hot Encoding Baseline

Next, let us see how a neural network trained from scratch performs with multi-hot encoding.

This cell sets up text preprocessing for the neural network. `max_tokens` is set to 2412, the maximum vocabulary size: only the most frequent tokens in the corpus are kept, and one of those slots is reserved for out-of-vocabulary words. The Keras `TextVectorization` layer is initialized with this limit and `output_mode='multi_hot'`. In this mode each input text becomes a binary vector of length `max_tokens`, where each position records whether the corresponding vocabulary token appears in the text (1) or not (0). Word order and word counts are discarded; only presence is kept, which gives the model a fixed-size numeric input.

In [12]:
# Set the maximum number of tokens to 2412. 
# Also set up our Text Vectorization layer using multi-hot encoding
max_tokens      = 2412
text_vectorization = TextVectorization(max_tokens  = max_tokens, 
                                       output_mode = 'multi_hot') 

<span style="color: cyan;">

The code `text_vectorization.adapt(X_train)` prepares the `TextVectorization` layer by analyzing the training data. This step builds the vocabulary from the text samples in `X_train`, allowing the layer to learn which words are present and how to map them to indices in the output vectors. Adapting on the training set ensures that the vocabulary reflects the data the model will learn from, which helps prevent information leakage from the test set and improves generalization. This is a crucial preprocessing step before transforming text data for machine learning models.

In [13]:
# The vocabulary that will be indexed is given by the text corpus on our train dataset
text_vectorization.adapt(X_train)

<span style="color: cyan;">

This code applies the `TextVectorization` layer to both the training and testing datasets. By calling `text_vectorization(X_train)` and `text_vectorization(X_test)`, each text sample in `X_train` and `X_test` is transformed into a multi-hot encoded vector, where each element indicates the presence or absence of a specific word from the vocabulary. This step converts raw text data into a fixed-size, numeric format suitable for input into machine learning models, such as neural networks. Applying the same transformation to both sets ensures consistency in how the data is represented during training and evaluation.

In [14]:
# Question 7
# We vectorize our input
X_train = text_vectorization(X_train)
X_test  = text_vectorization(X_test)
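
The adapt-then-transform steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: it lowercases, strips ASCII punctuation, splits on whitespace, keeps the most frequent tokens, and reserves index 0 for unknown words. The helper names and the toy sentences are invented for the example.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(texts, max_tokens):
    """Index 0 is the out-of-vocabulary slot; the rest are the most frequent tokens."""
    counts = Counter(token for text in texts for token in tokenize(text))
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common

def multi_hot(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in the text."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        vector[index.get(token, 0)] = 1   # unknown tokens set the [UNK] slot
    return vector

toy_train = ["Great movie, great acting", "A great movie", "Boring"]
vocabulary = build_vocabulary(toy_train, max_tokens=3)   # built from training text only
print(vocabulary)                            # ['[UNK]', 'great', 'movie']
print(multi_hot("A great film!", vocabulary))  # [1, 1, 0]
```

The vocabulary is built from the training sentences alone and then reused for new text, mirroring how `adapt` is called on `X_train` before both splits are vectorized.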

<span style="color: cyan;">

This code defines a simple neural network model using Keras for text classification. The first line creates an input layer that expects vectors of length `max_tokens`, matching the size of the multi-hot encoded text data. The next line adds a dense (fully connected) layer with 32 units and ReLU activation, which helps the model learn non-linear relationships in the data. A dropout layer with a rate of 0.5 follows, randomly setting half of the input units to zero during training to help prevent overfitting. The output layer is another dense layer with 2 units and softmax activation, producing probabilities for each of the two sentiment classes (positive or negative). The model is then constructed by specifying the input and output layers, and `model.summary()` displays a summary of the model architecture, including the layers and the number of parameters.

In [15]:
# Question 8
# Now create your model. Start with a dense layer of 32 ReLU units, then a dropout layer with rate 0.5, and a final softmax layer
inputs  = keras.Input(shape=(max_tokens, ))
x       = keras.layers.Dense(32, activation="relu")(inputs)
x       = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(2, activation="softmax")(x)
model   = keras.Model(inputs, outputs)
model.summary()
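
The parameter counts that `model.summary()` should report for this architecture can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout has no parameters. The helper `dense_params` below is an illustrative name, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

max_tokens = 2412
hidden = dense_params(max_tokens, 32)   # Dense(32) on the multi-hot input
output = dense_params(32, 2)            # Dense(2) softmax head
print(hidden, output, hidden + output)  # 77216 66 77282
```

Almost all of the parameters sit in the first layer, so its width and the vocabulary size are the main levers on model size.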

<span style="color: cyan;">

This code compiles the Keras model, specifying how it should be trained. The `optimizer='adam'` argument selects the Adam optimization algorithm, which is widely used for its efficiency and adaptive learning rate. The `loss='categorical_crossentropy'` argument sets the loss function to categorical cross-entropy, which is appropriate for multi-class classification problems with one-hot encoded targets. The `metrics=['accuracy']` argument tells Keras to track accuracy during training and evaluation, providing a straightforward measure of model performance. Compiling the model with these settings prepares it for the training process.

In [16]:
# Compile your model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
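
To show what the loss measures for a single sample, the sketch below computes a softmax and the categorical cross-entropy against a one-hot target in plain Python. It omits the numerical safeguards and batch averaging that a library implementation applies; the function names are chosen for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

probs = softmax([0.0, 0.0])                       # an undecided two-class prediction
loss = categorical_crossentropy([0.0, 1.0], probs)
print(probs, round(loss, 4))  # [0.5, 0.5] 0.6931
```

A two-class model that always predicts 50/50 therefore scores about 0.693, which gives a reference point for reading the loss values in the training log.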

<span style="color: cyan;">

This code trains the neural network model using the `fit` method from Keras. The training data (`X_train` and `y_train`) is provided, with the target labels converted to `float32` for compatibility with the model. The `validation_data` argument supplies the test set (`X_test` and `y_test`, also as `float32`), allowing the model to evaluate its performance on unseen data after each epoch. The `epochs=5` parameter specifies that the model will train for five complete passes through the training dataset. This setup helps monitor both training and validation accuracy, making it easier to detect overfitting or underfitting during the learning process.

In [17]:
# Fit model
# Use one-hot encoded y for training and testing
model.fit(
    x              = X_train, 
    y              = y_train.astype('float32'), 
    validation_data= (X_test, 
                      y_test.astype('float32')),
    epochs         = 5
)

Epoch 1/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 610us/step - accuracy: 0.7822 - loss: 0.4466 - val_accuracy: 0.8791 - val_loss: 0.2823
Epoch 2/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 570us/step - accuracy: 0.8799 - loss: 0.2928 - val_accuracy: 0.8816 - val_loss: 0.2805
Epoch 3/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 556us/step - accuracy: 0.8928 - loss: 0.2648 - val_accuracy: 0.8790 - val_loss: 0.2870
Epoch 4/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 659us/step - accuracy: 0.9014 - loss: 0.2467 - val_accuracy: 0.8805 - val_loss: 0.2938
Epoch 5/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 1s 700us/step - accuracy: 0.9088 - loss: 0.2280 - val_accuracy: 0.8780 - val_loss: 0.2970


<keras.src.callbacks.history.History at 0x31d9b50d0>

<span style="color: cyan;">

This cell evaluates the trained model on the test dataset. `model.evaluate` computes the loss and accuracy from the test features (`X_test`) and the one-hot encoded test labels (`y_test`, cast to `float32`). It returns a list whose second element (`[1]`) is the accuracy. The expression checks whether this accuracy is strictly greater than 0.85 and returns `True` if so. Here the test accuracy is roughly 0.88, so the check passes.

In [18]:
# Question 9
# Evaluate your model. You should be able to get your model to 85% at this point
# Use one-hot encoded y for evaluation as well
model.evaluate(x=X_test, y=y_test.astype('float32'))[1] > 0.85

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 396us/step - accuracy: 0.8788 - loss: 0.2946


True
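
The five-epoch training log above can also be read for signs of overfitting. The sketch below copies the logged `loss` and `val_loss` values into lists, finds the epoch with the lowest validation loss, and checks whether validation loss rises afterwards while training loss keeps falling. The variable names are illustrative.

```python
# Values copied from the five-epoch training log in this notebook.
train_loss = [0.4466, 0.2928, 0.2648, 0.2467, 0.2280]
val_loss = [0.2823, 0.2805, 0.2870, 0.2938, 0.2970]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# After that epoch, does validation loss rise while training loss keeps falling?
diverging = all(
    val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]
    for i in range(best_epoch, len(val_loss))
)
print(best_epoch, diverging)  # 2 True
```

For this run the validation loss bottoms out at epoch 2, so additional epochs alone are unlikely to help; Keras offers the `EarlyStopping` callback for stopping automatically on a monitored metric.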

<span style="color: cyan;">

Extend Baseline Model

Let's create more complex models to increase the accuracy on our test sample. Try combining different models by changing:
- The number of hidden units.
- The number of hidden layers (for example, by adding another one).
- The number of epochs.
- The tokens used: bigrams instead of unigrams.

To guide your search for the best parameters, note how the accuracy changes on both train and test data.
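
Of the options listed above, the bigram one is the least self-explanatory. The plain-Python sketch below shows which features a text produces when unigrams and bigrams are both kept; `ngrams_up_to` is a hypothetical helper written for this illustration.

```python
def ngrams_up_to(tokens, n):
    """Return all contiguous 1..n-grams, each joined with a single space."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

print(ngrams_up_to(["not", "a", "good", "movie"], 2))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

In Keras, the `TextVectorization` layer accepts an `ngrams` argument; passing `ngrams=2` makes the vocabulary compete over unigrams and bigrams together, so `max_tokens` may need to grow as well.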

This cell sets up a neural network model for text classification using TensorFlow's Keras API. The variables `embedding_dim` and `hidden_units` are defined but not used in this cell: the hidden layer's width is hard-coded as 128 rather than taken from `hidden_units`. `num_classes` is read from the number of columns in the one-hot encoded `y_train`, which is 2 here.

The model is built with the `Sequential` API. It starts with an input layer that expects vectors of length `max_tokens`, matching the multi-hot encoded text. The single hidden layer is a dense (fully connected) layer with 128 units and ReLU activation, four times the width of the 32-unit baseline. A dropout layer with a rate of 0.5 follows, randomly dropping half of the units during training to reduce overfitting. The output layer is a dense layer with `num_classes` units and softmax activation, producing a probability distribution over the classes. In the cells shown here, this model is only defined: it is not compiled or trained before the next cell reassigns `model`.

In [19]:
embedding_dim = 64
hidden_units  = 32
num_classes   = y_train.shape[1]  # Use y_train, which is one-hot encoded
model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_tokens,)),  # Use max_tokens as input shape
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation='softmax')
])

<span style="color: cyan;">

This code demonstrates how to set up and train a deep learning model using TensorFlow and Keras on the CIFAR-100 image dataset. First, the CIFAR-100 dataset is loaded, splitting the data into training and testing sets. The model is defined using the ResNet50 architecture, a popular convolutional neural network for image classification. The model is initialized without pre-trained weights (`weights=None`), and the input shape is set to match CIFAR-100 images (32x32 pixels with 3 color channels). The number of output classes is set to 100, corresponding to the dataset's categories.

The loss function is sparse categorical cross-entropy, which accepts the integer class labels directly; `from_logits=False` matches the softmax output of the model. The model is compiled with the Adam optimizer and tracks accuracy during training. Finally, it is trained for 5 epochs with a batch size of 64. Unlike the snippet at the top of the notebook, this cell passes no `validation_data`, so the log reports training metrics only. Note that the cell reassigns `y_train`, `y_test`, and `model`, replacing the IMDB objects defined earlier.

In [20]:
# Load CIFAR-100, then build, compile, and train a ResNet50 from scratch
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top = True,
    weights     = None,
    input_shape = (32, 32, 3),
    classes     = 100,)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 263s 326ms/step - accuracy: 0.0498 - loss: 5.0023
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 256s 328ms/step - accuracy: 0.0758 - loss: 4.6368
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 254s 325ms/step - accuracy: 0.1159 - loss: 4.1135
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 255s 327ms/step - accuracy: 0.1780 - loss: 3.6378
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 254s 325ms/step - accuracy: 0.1901 - loss: 3.6509


<keras.src.callbacks.history.History at 0x32e3490d0>

In [23]:
model.evaluate(x=x_test, y=y_test.astype('float32'))[1] > 0.00

313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.0401 - loss: 2115.0334


True
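
The ResNet50 cell uses the sparse form of categorical cross-entropy, while the IMDB model used the one-hot form. As a minimal single-sample sketch, the two give the same value when the integer label and the one-hot vector describe the same class. The functions below are plain-Python illustrations, not the Keras losses, and the probabilities are made up.

```python
import math

def sparse_categorical_crossentropy(label, probabilities):
    """Loss from an integer class label: -log of the true class probability."""
    return -math.log(probabilities[label])

def categorical_crossentropy(one_hot_target, probabilities):
    """Loss from a one-hot target: the same quantity written as a sum."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

probabilities = [0.7, 0.2, 0.1]          # illustrative prediction over 3 classes
sparse = sparse_categorical_crossentropy(1, probabilities)
dense = categorical_crossentropy([0, 1, 0], probabilities)
print(round(sparse, 4), round(dense, 4))  # 1.6094 1.6094

# Loss of a uniform guess over the 100 CIFAR-100 classes.
chance_loss = math.log(100)
print(round(chance_loss, 4))  # 4.6052
```

The uniform-guess value of about 4.61 is a useful yardstick: the training losses in the log above start near it and fall below it, whereas the test loss reported by the evaluation cell is far above it, which suggests the trained network does not yet generalize to the test images.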