In [None]:
#pip install keras
#pip install tensorflow
#pip install matplotlib
#pip install scikit-learn
#pip install numpy
#pip install seaborn

## Dataset

The dataset we will work with today is a famous AI starter dataset called the MNIST database of handwritten digits. More documentation can be found here: http://yann.lecun.com/exdb/mnist/. It is important to know what you are working with, so I highly recommend reading over the documentation and learning a few things about your data format and meaning.

First, we will want to load the data, then we will explore it and try to train some classifiers!

The code below loads the dataset from the Keras Python library. Keras ships several public datasets, including image classification sets, movie reviews, and Boston housing prices. Keras also does some work for us here by separating the data into training and test sets, like we talked about in the lecture.

If we did not want to use keras, we could also use sklearn's train_test_split function (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)) or write a method by ourselves!
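
As a rough sketch of the scikit-learn route, the snippet below splits a small stand-in array with `train_test_split`. The random `images` and the cyclic `labels` are placeholders invented for illustration, not MNIST itself. Passing `stratify=labels` is one reasonable choice (an assumption here, not something the lecture prescribes) that keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 random 28x28 "images" and labels cycling through 0..9.
rng = np.random.default_rng(0)
images = rng.random((100, 28, 28), dtype=np.float32)
labels = np.arange(100) % 10

# Hold out 20% of the samples for testing, keeping every digit equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test, minlength=10))  # test samples per digit
```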

For today, we will just use the pre-existing functionality but feel free to try other alternatives if you want.

In [None]:
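from keras.datasets import mnist
import tensorflow as tf

# Load the images and integer labels, already split into training and test sets
(train_X, train_y), (test_X, test_y) = mnist.load_data()

# Convert pixels from integers in [0, 255] to floats in [0, 1]
train_X = train_X.astype('float32')
test_X = test_X.astype('float32')
train_X /= 255
test_X /= 255

# One-hot encode the labels: digit d becomes a length-10 vector with a 1 at index d
train_y = tf.keras.utils.to_categorical(train_y, 10)
test_y = tf.keras.utils.to_categorical(test_y, 10)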
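
To make the two preprocessing steps in the cell above concrete, here is a NumPy-only sketch on tiny made-up arrays. The names `raw_images` and `raw_labels` are hypothetical stand-ins, and indexing an identity matrix is just one way to build the same one-hot layout that `to_categorical` returns.

```python
import numpy as np

# Hypothetical stand-ins for raw data: uint8 pixels in 0..255 and integer digit labels.
raw_images = np.array([[[0, 128], [255, 64]]], dtype=np.uint8)  # a single 2x2 "image"
raw_labels = np.array([3, 0, 9])

# Scale pixels to the range [0, 1], mirroring astype('float32') followed by /= 255.
scaled = raw_images.astype("float32") / 255

# One-hot encode: row d of the 10x10 identity matrix is the encoding of digit d.
one_hot = np.eye(10, dtype="float32")[raw_labels]

print(scaled.min(), scaled.max())
print(one_hot[0])
print(one_hot.argmax(axis=1))  # argmax turns one-hot rows back into integer labels
```

Because the notebook's labels are one-hot encoded from this point on, any later analysis of `train_y` or `test_y` needs `argmax(axis=1)` to get the digits back.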

### Questions

One of the toughest parts of Artificial Intelligence is getting to know your data and making it as useful as possible. There is a whole field centered around this called Data Science.

The code below should get you started on analyzing the data. We first want to visualize it and see the pixels, but we also want to know how the data is stored. Dig deeper into the actual contents now.

In [None]:
from matplotlib import pyplot

# Show the first nine training images in a 3x3 grid
_, axs = pyplot.subplots(3, 3, figsize=(5, 5))
pyplot.subplots_adjust(hspace=0.5)
for ax, image in zip(axs.flatten(), train_X[:9]):
    ax.imshow(image, cmap='gray')
pyplot.show()

Play around with the data a bit. Here are some questions you should try to answer:
- What is the size of each dataset?
- How many examples of each digit are in the label datasets?
- Does the test set accurately represent the training set? Are the percentages of each label the same?
- Is the data ordered by digit?
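
If you are unsure where to start, the helpers below sketch one possible approach. The function names are invented for this example, and the toy arrays stand in for the real labels so that the snippet runs on its own. It assumes the labels are one-hot encoded, as they are after the loading cell.

```python
import numpy as np

def class_counts(one_hot_labels):
    """Count samples per class from one-hot labels of shape (n_samples, n_classes)."""
    return one_hot_labels.sum(axis=0).astype(int)

def class_fractions(one_hot_labels):
    """Fraction of samples that belong to each class."""
    return one_hot_labels.mean(axis=0)

def is_sorted_by_label(one_hot_labels):
    """True when the integer labels never decrease from one sample to the next."""
    digits = one_hot_labels.argmax(axis=1)
    return bool(np.all(np.diff(digits) >= 0))

# Toy stand-in labels with 3 classes (hypothetical, not MNIST).
toy_train = np.eye(3)[[0, 2, 1, 1, 0, 2]]
toy_test = np.eye(3)[[0, 1, 2]]

print(toy_train.shape, toy_test.shape)  # dataset sizes
print(class_counts(toy_train))          # samples per class
# Largest gap between the train and test class fractions
print(np.abs(class_fractions(toy_train) - class_fractions(toy_test)).max())
print(is_sorted_by_label(toy_train), is_sorted_by_label(toy_test))
```

With the notebook's arrays, the same helpers can be called as `class_counts(train_y)` and `class_counts(test_y)`.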

In [None]:
# sizes of each dataset

In [None]:
# value counts in each label dataset
import numpy as np

In [None]:
# do the test and train sets have similar label distributions?

In [None]:
# is the dataset ordered by label or unordered?

## Neural Networks

Now that we know a little more about the data we are trying to use, let's start working on a neural network that, given the pixel information, will be able to tell us which handwritten digit it is.

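We are going to create a neural network classifier! Note that the starting model below uses a convolutional layer, so it is a small convolutional network rather than a plain Multi-layer Perceptron. Read [this](https://www.tensorflow.org/api_docs/python/tf/keras/Model) over and get a general idea of the methods we are going to use.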

Below is an implementation of a sequential model with a few layers in it. If you run the rest of this section, you will see that the model does not do a good job of classifying the digits. I want you to add more layers, move things around, or change the loss function or optimizer to get a better working model. Feel free to ask any questions!

In [None]:
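input_shape = (28, 28, 1)  # 28x28 grayscale images with a single channel

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=input_shape),
        # 10 filters, each looking at a 10x10 patch of pixels
        tf.keras.layers.Conv2D(10, kernel_size=(10, 10), activation="relu"),
        # keep only the maximum of each 7x7 block
        tf.keras.layers.MaxPooling2D(pool_size=(7, 7)),
        tf.keras.layers.Flatten(),
        # one output probability per digit class
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
model.compile(loss="categorical_crossentropy", optimizer="SGD", metrics=["accuracy"])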

In [None]:
model.summary()
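
The output shapes and parameter counts that `model.summary()` reports can be checked by hand. The sketch below assumes the Keras defaults for these layers (`padding="valid"`, a convolution stride of 1, and a pooling stride equal to the pool size). The helper names are made up for illustration.

```python
def conv_output_size(size, kernel, stride=1):
    """Output length of a 'valid' (unpadded) convolution along one axis."""
    return (size - kernel) // stride + 1

def pool_output_size(size, pool):
    """Output length of non-overlapping 'valid' pooling along one axis."""
    return size // pool

filters = 10
after_conv = conv_output_size(28, 10)          # 28 - 10 + 1
after_pool = pool_output_size(after_conv, 7)   # whole 7-pixel blocks that fit
flat_features = after_pool * after_pool * filters

conv_params = 10 * 10 * 1 * filters + filters  # kernel weights plus one bias per filter
dense_params = flat_features * 10 + 10         # weights plus one bias per digit class

print(after_conv, after_pool, flat_features)
print(conv_params, dense_params, conv_params + dense_params)
```

Under these assumptions only 40 features reach the final `Dense` layer, which is one plausible reason the starting model struggles. A smaller kernel or pool size would keep more spatial detail.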

In [None]:
model.fit(x=train_X, y=train_y, epochs=10, batch_size=300, validation_split=0.1)
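
To see what `validation_split=0.1` and `batch_size=300` mean in numbers, here is a small arithmetic sketch. It assumes the standard 60,000 MNIST training images, and it relies on the documented Keras behavior of taking the validation samples from the end of the data before any shuffling.

```python
import math

n_samples = 60000        # assumed size of the MNIST training set
validation_split = 0.1
batch_size = 300

# Keras holds out the last fraction of the samples, selected before shuffling.
split_at = math.floor(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# Number of gradient updates in one pass over the remaining training data.
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Because the held-out samples come from the end of the array, it is worth checking whether the data is ordered by digit before trusting the validation accuracy.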

In [None]:
loss, accuracy = model.evaluate(test_X, test_y, verbose=0)
print(f"Test accuracy: {accuracy}")

## Metrics Visualization

After training, we usually want to look at more than a single accuracy number. One of the most used plots is a confusion matrix, which shows how often each true digit was predicted as each possible digit. We will display one here.
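
Before plotting one for the full model, it can help to read a confusion matrix on a tiny example. The labels below are made-up values for a 3-class problem, not MNIST results. Dividing the diagonal by the row sums is one common way to get a per-class score.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for a 3-class problem (hypothetical values, not MNIST results).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# scikit-learn takes the true labels first: rows are true classes, columns are predictions.
conf = confusion_matrix(y_true, y_pred)
print(conf)

# Fraction of each true class that was predicted correctly (per-class recall).
per_class_recall = conf.diagonal() / conf.sum(axis=1)
print(per_class_recall)
```

Off-diagonal entries show which classes get mistaken for which. For example, entry `[2, 0]` counts samples of true class 2 that were predicted as class 0.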

In [None]:
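from sklearn.metrics import confusion_matrix
import seaborn as sns

# Convert predicted probabilities and one-hot labels back to digit classes
prediction = model.predict(test_X)
prediction = prediction.argmax(axis=1)
true_labels = test_y.argmax(axis=1)

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf = confusion_matrix(true_labels, prediction)

ax = pyplot.subplot()
sns.heatmap(conf, annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
pyplot.show()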