# <u>**MNIST Digit workbook**</u>

This workbook shows you how to load a dataset from Keras and view the data using `matplotlib`.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import mnist

## Importing the data

In [2]:
(train_images, train_labels),(test_images, test_labels) = mnist.load_data()

## **Problem 1.** Get the dimensions of the data.

Specifically, get the dimensions of `train_images`, `train_labels`, `test_images`, and `test_labels` using the `.shape` attribute.

* For example, if `myArr` is an array, you can get its shape with:
`myArr.shape`

For the image arrays, you should get a shape with **three numbers**, meaning the array is **three-dimensional**. The first number is **how many images you have**. The second and third numbers are the height and width of each image in pixels. The label arrays are one-dimensional, with one label per image.

To get the 50th test image, for example, you could write:
* `image_ex = test_images[49]`
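
As a minimal sketch of both ideas, the snippet below uses a small stand-in array rather than the MNIST data. The name `demo_images` and its size are made up for illustration.

```python
import numpy as np

# Stand-in for an image array: 5 blank "images", each 28 x 28 pixels.
demo_images = np.zeros((5, 28, 28), dtype='uint8')

# .shape is an attribute (no parentheses) holding one number per dimension.
print(demo_images.shape)   # (5, 28, 28)

# Indexing starts at 0, so index 2 selects the 3rd image.
third_image = demo_images[2]
print(third_image.shape)   # (28, 28)
```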

## **Problem 2.** Make variables to store the following:
* The number of training images: `num_train_images`
* The number of training labels: `num_train_labels`
* The number of test images: `num_test_images`
* The number of test labels: `num_test_labels`
* The width of each image in pixels: `image_width`
* The height of each image in pixels: `image_height`

## **Problem 3.** Using `matplotlib`, display a test image using `plt.imshow`.

* To display a matrix `myArr`, call `plt.imshow(myArr)`. Then use `plt.title` to set the title of the plot to the **label** of that image from the `test_labels` array.
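
A rough sketch of the two calls together is shown below. It uses a random stand-in image and a made-up label (`demo_image`, `demo_label`) instead of the MNIST arrays, and the `cmap='gray'` argument is an optional extra that draws the image in grayscale.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for one image and its label (random pixels, made-up label).
demo_image = np.random.randint(0, 256, (28, 28), dtype='uint8')
demo_label = 7

plt.imshow(demo_image, cmap='gray')      # draw the 2-D array as an image
plt.title('Label: ' + str(demo_label))   # the title must be a string
```

In a notebook the figure appears when the cell finishes running.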

## **Problem 4.** Find the maximum and minimum pixel values for an image.

For example, the line

`myArr = np.random.randint(0,100,(28,28))`

creates a NumPy array of random integers from `0` to `99` (the upper bound `100` is excluded) with dimensions `(28,28)`. To get the maximum and minimum values in it, we can write:

`max_val = myArr.max()` and `min_val = myArr.min()`
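
To see these calls on values you can check by eye, here is a small sketch with a hand-written array. The names `demo`, `largest`, and `smallest` are illustrative only.

```python
import numpy as np

# A tiny 2 x 3 array with known contents.
demo = np.array([[3, 40, 12],
                 [99, 0, 57]])

largest = demo.max()    # largest value anywhere in the array
smallest = demo.min()   # smallest value anywhere in the array

print(largest, smallest)  # 99 0
```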

## **Problem 5.** Reformat the data to get it ready for machine learning

The maximum value of a pixel is the maximum value that an unsigned **8-bit** integer can have, which is $255$. TensorFlow layers generally work with **32-bit** floating-point inputs, so we have to reformat our data.

* First we will **normalize** the data by dividing all pixel values by the maximum pixel value. This will make all pixel values between $0$ and $1$.

* Then we change the datatype to a 32-bit floating-point number.

Below is an example. **You may want to read through the next few examples.**

In [3]:
examples = np.random.randint(0,256,(100,32,32), dtype='uint8')
print(examples[0])
print('\nmax value: ' +str(examples[0].max()))
print('\nmin value: ' +str(examples[0].min()))
print('\ndatatype: ' +str(examples[0].dtype))
examples.shape

[[ 92  17  29 ... 241  82  91]
 [122  59 181 ... 207  20  88]
 [ 13 111 174 ... 115 133 116]
 ...
 [ 73  12 151 ... 101  47 161]
 [ 93  76  35 ... 192  45 177]
 [ 17 157  15 ... 211  44  90]]

max value: 255

min value: 0

datatype: uint8


(100, 32, 32)

Each of the 100 examples has dimensions $32\times 32$, and the array has datatype `uint8`, which is an unsigned 8-bit integer. To **normalize** the data, just divide it by the maximum value, as follows:

In [4]:
examples = examples/255.0
print(examples[0])

[[0.36078431 0.06666667 0.11372549 ... 0.94509804 0.32156863 0.35686275]
 [0.47843137 0.23137255 0.70980392 ... 0.81176471 0.07843137 0.34509804]
 [0.05098039 0.43529412 0.68235294 ... 0.45098039 0.52156863 0.45490196]
 ...
 [0.28627451 0.04705882 0.59215686 ... 0.39607843 0.18431373 0.63137255]
 [0.36470588 0.29803922 0.1372549  ... 0.75294118 0.17647059 0.69411765]
 [0.06666667 0.61568627 0.05882353 ... 0.82745098 0.17254902 0.35294118]]


To change the datatype to a 32-bit floating-point number, we can do the following:

In [5]:
examples = examples.astype('float32')

In [6]:
print(examples.dtype)

float32
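
The two steps can also be combined in one helper. This is only a sketch: `to_unit_float32` is a hypothetical name, and it assumes the input pixels are 8-bit values from 0 to 255.

```python
import numpy as np

def to_unit_float32(images):
    """Scale uint8 pixel values (0..255) into [0, 1] and return float32."""
    # Casting first keeps the whole computation in float32; dividing a uint8
    # array by 255.0 directly would produce float64 before the cast.
    return images.astype('float32') / 255.0

demo = np.array([[0, 51, 255]], dtype='uint8')
scaled = to_unit_float32(demo)

print(scaled.dtype)   # float32
print(scaled)
```

Either order (divide then cast, or cast then divide) ends with `float32` values between 0 and 1. Casting first simply avoids the temporary 64-bit array.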


## **Problem 6.** Lastly, add a **color channel** to the data

Store the results in arrays called `formatted_train_images` and `formatted_test_images`.

Images normally contain an extra dimension to account for color. If our images were in color, each one would have shape `(28, 28, 3)`. Ours are grayscale, so they have only one channel and that last dimension is missing. To input our data into a machine learning model, we need to state explicitly that there is one color channel, so we want the training data to have shape `(60000, 28, 28, 1)`. We add the dimension by doing the following:


In [7]:
print('Shape of examples array: ' +str(examples.shape))

formatted_data = np.expand_dims(examples,-1)

print('Shape of examples array after adding dimension: ' + str(formatted_data.shape))

Shape of examples array: (100, 32, 32)
Shape of examples array after adding dimension: (100, 32, 32, 1)
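
For comparison, NumPy offers a few other spellings that give the same result as `np.expand_dims`. The sketch below uses a blank stand-in array named `demo`, and the names `a`, `b`, and `c` are illustrative.

```python
import numpy as np

# Stand-in with the same shape as the examples array.
demo = np.zeros((100, 32, 32), dtype='float32')

a = np.expand_dims(demo, -1)          # append a new last axis
b = demo[..., np.newaxis]             # the same, via indexing with np.newaxis
c = demo.reshape(demo.shape + (1,))   # the same, via an explicit reshape

print(a.shape, b.shape, c.shape)  # (100, 32, 32, 1) three times
```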


## **Problem 7.** Save the data for later use

To do this, use the following:
```
with open('formatted_train_images.npy', 'wb') as f:
    np.save(f, formatted_train_images)
```

In [8]:
with open('formatted_train_images.npy','wb') as f:
    np.save(f,formatted_data)

To reload your data later, you can use the following command:

```
with open('formatted_train_images.npy', 'rb') as f:
    a = np.load(f)
```

This will load your array into the variable `a`.
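
To convince yourself that nothing is lost on the way to disk and back, you can compare the array before saving with the one you reload. This is a small sketch with a made-up array and file name (`demo`, `demo_array.npy`).

```python
import numpy as np

# A small array with known contents.
demo = np.arange(12, dtype='float32').reshape(3, 2, 2)

# Save it to disk ...
with open('demo_array.npy', 'wb') as f:
    np.save(f, demo)

# ... and read it back in.
with open('demo_array.npy', 'rb') as f:
    reloaded = np.load(f)

print(np.array_equal(demo, reloaded))   # True
print(reloaded.dtype, reloaded.shape)   # float32 (3, 2, 2)
```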

## **Problem 8.** Format Labels

We need to convert the labels to a format called **[one hot encoding](https://www.geeksforgeeks.org/ml-one-hot-encoding/)**. Since there are 10 possible digits that the model could predict, we need to change each label to be an array having the following format:

$$
0 \rightarrow [\underbrace{1}_0,0,0,0,0,0,0,0,0,0]\\~\\
1 \rightarrow [0,\underbrace{1}_1,0,0,0,0,0,0,0,0]\\~\\
2 \rightarrow [0,0,\underbrace{1}_2,0,0,0,0,0,0,0]\\~\\
3 \rightarrow [0,0,0,\underbrace{1}_3,0,0,0,0,0,0]\\~\\
4 \rightarrow [0,0,0,0,\underbrace{1}_4,0,0,0,0,0]\\~\\
5 \rightarrow [0,0,0,0,0,\underbrace{1}_5,0,0,0,0]\\~\\
6 \rightarrow [0,0,0,0,0,0,\underbrace{1}_6,0,0,0]\\~\\
7 \rightarrow [0,0,0,0,0,0,0,\underbrace{1}_7,0,0]\\~\\
8 \rightarrow [0,0,0,0,0,0,0,0,\underbrace{1}_8,0]\\~\\
9 \rightarrow [0,0,0,0,0,0,0,0,0,\underbrace{1}_9]\\~\\
$$
**Hint:** For the training labels, create $60,000$ arrays, each containing $10$ zeros, using the `np.zeros((60000,10))` command (the shape is passed as a single tuple). For each label in the training labels, put a `1` in the position that matches the label.
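
The hint can be sketched on a tiny example first. The snippet below encodes three made-up labels (`demo_labels`) rather than the real training labels. The `float32` datatype is an assumption chosen to match the image data.

```python
import numpy as np

# Three made-up labels standing in for the real label array.
demo_labels = np.array([3, 0, 9], dtype='uint8')

# One row of 10 zeros per label. The shape is passed as a single tuple.
one_hot = np.zeros((len(demo_labels), 10), dtype='float32')

# For each label, set the entry at that label's position to 1.
for row, label in enumerate(demo_labels):
    one_hot[row, label] = 1.0

print(one_hot)
```

Keras also provides `tf.keras.utils.to_categorical`, which performs this conversion. Writing the loop by hand shows what it does.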