## Assignment 4

In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, Input, MaxPooling2D

In class on Wednesday, March 26, we trained two neural network models for classifying Fashion-MNIST.
We'll call them **Model A** and **Model B**.

In [6]:
# Model A
modelA = Sequential([Input((28, 28)),
                    Flatten(),
                    Dense(256, activation="relu"),
                    Dropout(0.2),
                    Dense(128, activation="relu"),
                    Dense(64, activation="relu"),
                    Dropout(0.2),
                    Dense(10, activation="softmax")], name="Model A")
modelA.summary()
modelA.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

Model: "Model A"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_14 (Dense)             (None, 256)               200960    
_________________________________________________________________
dropout_8 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 128)               32896     
_________________________________________________________________
dense_16 (Dense)             (None, 64)                8256      
_________________________________________________________________
dropout_9 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_17 (Dense)             (None, 10)                650 

In [7]:
# Model B
modelB = Sequential(name="Model B")
modelB.add(Conv2D(64, (7, 7), activation="relu", padding="same", input_shape=(28, 28, 1))) # zero-padding
modelB.add(MaxPooling2D(2))
modelB.add(Conv2D(128, (3, 3), activation="relu", padding="same"))
modelB.add(Conv2D(128, (3, 3), activation="relu", padding="same"))
modelB.add(MaxPooling2D(2))
modelB.add(Conv2D(256, (3, 3), activation="relu", padding="same"))
modelB.add(Conv2D(256, (3, 3), activation="relu", padding="same"))
modelB.add(MaxPooling2D(2))
modelB.add(Flatten())
modelB.add(Dense(128, activation="relu"))
modelB.add(Dropout(0.5))
modelB.add(Dense(64, activation="relu"))
modelB.add(Dropout(0.5))
modelB.add(Dense(10, activation="softmax"))
modelB.summary()
modelB.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

Model: "Model B"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_10 (Conv2D)           (None, 28, 28, 64)        3200      
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 14, 14, 128)       73856     
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 14, 14, 128)       147584    
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 7, 7, 128)         0         
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 7, 7, 256)         295168    
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 7, 7, 256)         5900

### 1.

**Train** each model for 20 epochs, just like we did in class, and then **save** the models. (See [the docs](https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model) or page 314 of Géron.)

### 2.

Do this exercise using **either Model A or Model B**.

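Construct the confusion matrix $C$ of the test set, i.e., the $10\times 10$ matrix whose $(i,j)$-entry is the number of test images of class $i$ that the model classifies as class $j$. The off-diagonal entries count the mistakes; the diagonal entries count the correct classifications.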
Produce a nice visualization of the confusion matrix in the spirit of [this example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) from the scikit-learn docs.

The **symmetric confusion score** for an unordered pair $\{i,j\}$ of distinct indices is the sum $C_{ij} + C_{ji}$. It's the number of times images of classes $i$ and $j$ were mistaken for one another. Which pairs of classes have the three highest symmetric confusion scores?
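As a rough illustration of the definitions above, here is a minimal NumPy sketch that builds a confusion matrix from integer labels and ranks unordered pairs by $C_{ij} + C_{ji}$. The function names `confusion_matrix` and `top_symmetric_pairs` and the three-class toy labels are made up for this example; with one-hot labels or `model.predict` output you would first take `argmax` along the class axis.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    """C[i, j] = number of samples of class i that were classified as class j."""
    C = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(C, (y_true, y_pred), 1)  # accumulates correctly for repeated (i, j) pairs
    return C

def top_symmetric_pairs(C, k=3):
    """Return the k unordered pairs {i, j}, i < j, with the largest C[i, j] + C[j, i]."""
    S = C + C.T
    i, j = np.triu_indices_from(S, k=1)      # strictly upper triangle: each pair once
    order = np.argsort(S[i, j])[::-1][:k]    # largest scores first
    return [((int(i[t]), int(j[t])), int(S[i[t], j[t]])) for t in order]

# Toy data with 3 classes (hypothetical labels, not Fashion-MNIST).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([1, 0, 0, 1, 2, 0, 1, 1])

C = confusion_matrix(y_true, y_pred, num_classes=3)
pairs = top_symmetric_pairs(C, k=2)
print(C)
print(pairs)
```

Restricting to the strict upper triangle of $C + C^T$ ensures each unordered pair is counted once and the diagonal (correct classifications) is ignored.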

### 3.

Do this exercise using **either Model A or Model B**.

For each ordered pair of indices $(i,j)$ with $i \neq j$, find the images of class $i$ mistakenly classified as class $j$ with the lowest and the highest
- categorical crossentropy loss. Loss functions are located in the `tensorflow.keras.losses` module.

- predicted probability of belonging to class $j$. These predicted probabilities are the output of `model.predict`.

Plot these extremal predictions in grids: four grids in total, corresponding to the four possible choices of lowest/highest and loss/probability. Assuming you're using `matplotlib`, I suggest either [`subplots`](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplot.html) or [`ImageGrid`](https://matplotlib.org/3.1.1/api/_as_gen/mpl_toolkits.axes_grid1.axes_grid.ImageGrid.html#mpl_toolkits.axes_grid1.axes_grid.ImageGrid).
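One possible way to locate the extremal images for a single pair $(i,j)$ is sketched below in plain NumPy. It assumes `probs` is an array of predicted probabilities shaped like the output of `model.predict`, and that `y_true` holds integer labels. For a one-hot target, the categorical crossentropy of a sample reduces to $-\log p_i$, where $p_i$ is the probability assigned to the true class, so the sketch computes the loss directly instead of calling `tensorflow.keras.losses`. The helper name `extremal_indices`, the clipping constant, and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def extremal_indices(probs, y_true, i, j):
    """Among samples of class i predicted as class j, return the indices with the
    lowest/highest per-sample crossentropy and lowest/highest probability of class j.
    Returns None if no sample of class i was classified as class j."""
    y_pred = probs.argmax(axis=1)
    idx = np.flatnonzero((y_true == i) & (y_pred == j))
    if idx.size == 0:
        return None
    # Per-sample crossentropy for a one-hot target: -log(probability of the true class).
    # The clip avoids log(0); the exact epsilon is an assumption.
    loss = -np.log(np.clip(probs[idx, i], 1e-7, 1.0))
    p_j = probs[idx, j]
    return {
        "lowest_loss": int(idx[loss.argmin()]),
        "highest_loss": int(idx[loss.argmax()]),
        "lowest_prob": int(idx[p_j.argmin()]),
        "highest_prob": int(idx[p_j.argmax()]),
    }

# Toy predictions for 5 samples and 3 classes (hypothetical values).
probs = np.array([
    [0.1, 0.8, 0.1],   # class 0 predicted as 1
    [0.4, 0.5, 0.1],   # class 0 predicted as 1
    [0.7, 0.2, 0.1],   # class 0 predicted correctly
    [0.2, 0.6, 0.2],   # class 1 predicted correctly
    [0.2, 0.6, 0.2],   # class 0 predicted as 1
])
y_true = np.array([0, 0, 0, 1, 0])

result = extremal_indices(probs, y_true, i=0, j=1)
print(result)
```

Looping this over all ordered pairs gives the indices to plot in each of the four grids; pairs for which the function returns `None` correspond to empty grid cells.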

### 4.

We use the architecture and weights of a neural network trained on a large data set as initialization values for retraining the same network on a much smaller data set of a similar nature. This approach is known as **transfer learning**.

Load the **MNIST handwritten digit data** (`keras.datasets.mnist.load_data`), both training and testing sets.
For $n=100, 200, \ldots$, retrain your models A and B on $n$ randomly selected training images and record the accuracy on the full test set.
How large does $n$ need to be to achieve 95% accuracy on the test set? Make sure you reload your weights between training runs!

For $n=100, 200, \ldots$, train models A and B from scratch (i.e., with random initializations) on $n$ randomly selected training images and record the accuracy on the full test set.
How large does $n$ need to be to achieve 95% accuracy on the test set? Make sure you reinitialize your weights between training runs!
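Both experiments above share the same outer loop, which might be organized along the following lines. This is only a sketch: `random_subset` and `smallest_n_reaching` are hypothetical helper names, and `train_and_score` stands in for whatever you write to reset the model (reloading the saved weights for transfer learning, or rebuilding the model for training from scratch), fit it on the subset, and return the accuracy on the full test set. A fake scoring function is used here so that the loop can be run without training anything.

```python
import numpy as np

def random_subset(x, y, n, rng):
    """Pick n distinct training examples uniformly at random."""
    idx = rng.choice(len(x), size=n, replace=False)
    return x[idx], y[idx]

def smallest_n_reaching(x, y, sizes, train_and_score, target=0.95, seed=0):
    """Try each subset size in order; stop at the first whose test accuracy reaches target.

    train_and_score(x_n, y_n) is a placeholder callable: it should start from a fresh
    copy of the weights on every call, train on (x_n, y_n), and return test accuracy.
    """
    rng = np.random.default_rng(seed)
    history = []
    for n in sizes:
        x_n, y_n = random_subset(x, y, n, rng)
        acc = train_and_score(x_n, y_n)
        history.append((n, acc))
        if acc >= target:
            return n, history
    return None, history

# Stand-in data and a fake scoring function (not a real model).
x = np.arange(1000).reshape(1000, 1)
y = np.arange(1000) % 10
fake_score = lambda x_n, y_n: min(1.0, 0.5 + len(x_n) / 1000)

n_star, history = smallest_n_reaching(x, y, range(100, 1001, 100), fake_score)
print(n_star, history)
```

Because each call to `train_and_score` is expected to start from the same initial state, the recorded accuracies depend only on $n$ and the random subset, not on earlier runs.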

Comment on the effectiveness of transfer learning in the context of your computations.

