# MNIST Part 1: Establishing a Baseline

Before we train a neural net to recognize handwritten digits, we'll try a simpler technique, just to see how it compares. I took most of this code and the explanations from chapter 4 of the [fastai book](https://colab.research.google.com/github/fastai/fastbook), but I tried to simplify things. When you're ready (and interested), check out the fuller treatment.

## Imports, setup

In [None]:
import fastbook
from fastai.vision.all import *
fastbook.setup_book()

matplotlib.rc('image', cmap="Greys")

Download a subset of the MNIST data set:

In [None]:
path = untar_data(URLs.MNIST_SAMPLE)

Set `BASE_PATH` to the location of the data set.

In [None]:
Path.BASE_PATH = path

## What's in the data set?

We'll explore the data set a little. First, we can see a fairly typical structure for an ML project:
  - a CSV file with labels
  - a folder with training data
  - a folder with validation data

In [None]:
path.ls()

The training data has a folder of images of handwritten 3s and another folder of images of handwritten 7s. (It'll be easier to start with just two digits.)

In [None]:
(path/'train').ls()

We'll create a list of all the images in each folder. I'm printing `threes` so you can see what it looks like. (`sorted` just guarantees that we always see the list in the same order.)

In [None]:
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes

Here's what one of the images looks like:

In [None]:
ex_three = Image.open(threes[1])
show_image(ex_three)

And here's a seven:

In [None]:
ex_seven = Image.open(sevens[3])
show_image(ex_seven)

## Arrays of gray-scale values

The Python notebook figures that we want to *see* these images *as images*, so that's how it renders them. Each is a square, 28 pixels x 28 pixels. Really, each image in the MNIST data set is a 2D array, a list of 28 lists, each with 28 values. Each of those values is a number between 0 and 255: 0 if it is white, 255 if it is black, and 254 shades of gray in between.

Here's what a part of that 2D array looks like:

In [None]:
array(ex_three)

We can also represent it as a **tensor**, which is a fundamental data type in the PyTorch machine learning library. We'll look more closely at tensors soon.

In [None]:
tensor(ex_three)[4:10,4:10]

I'm going to take a slightly bigger slice -- just the top part of the digit, represented as a tensor -- and use the `pandas` library to show you individual pixels with their values.

In [None]:
ex_three_tensor = tensor(ex_three)
df = pd.DataFrame(ex_three_tensor[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

We could do the same for our example 7:

In [None]:
ex_seven_tensor = tensor(ex_seven)
df = pd.DataFrame(ex_seven_tensor[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

## Baseline: Distinguish by Pixel Similarity

Before we try to train a model to distinguish 3s from 7s, let's think about how we could try to accomplish the same goal **without a model**.

Maybe we could calculate an "average 3" and an "average 7" -- the "image" you'd get if you took the average pixel value across all the images. Then, when we want to make a prediction, we could see if the image is "closer" (whatever that means) to the average 3 or the average 7.

(The "average", we hope, will let us skip the difficulties we'd face if we were trying to compare one specific digit to another specific digit, which may not be identically positioned or sized or curved, etc.)

We can use **tensors** to "stack" all the threes together. Try to picture all those 2D arrays stacked on top of each other so the top-left corner of each lines up with the top-left corner of the others.

First, let's generate two lists where each element is a tensor representing one of the images:

In [None]:
three_tensors = [tensor(Image.open(current_three)) for current_three in threes]
seven_tensors = [tensor(Image.open(current_seven)) for current_seven in sevens]
len(three_tensors), len(seven_tensors)

Before we go any farther, let's take a moment to make sure the images we have in our tensor lists look as expected.

In [None]:
show_image(three_tensors[4321])

In [None]:
show_image(seven_tensors[1234])

Looks good to me! Now let's stack them with PyTorch's `torch.stack` function. While we're doing it, we'll also cast the integers to floating point numbers, since that's what PyTorch expects when we take an average.

In [None]:
stacked_threes = torch.stack(three_tensors).float()
stacked_sevens = torch.stack(seven_tensors).float()
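
Here's a small sketch of why the cast matters. The pixel values are made up, and the exact error message differs between PyTorch versions, so I'm only catching the error rather than showing it. As far as I know, `mean` refuses integer tensors and raises a `RuntimeError`.

```python
import torch

# Three made-up pixel values stored as 8-bit integers, like MNIST pixels
pixels = torch.tensor([0, 128, 255], dtype=torch.uint8)

try:
    pixels.mean()              # integer tensors are expected to be rejected here
    int_mean_worked = True
except RuntimeError:
    int_mean_worked = False

float_mean = pixels.float().mean()   # cast first, then average

print(int_mean_worked)
print(float_mean)
```

Casting with `.float()` before averaging sidesteps the problem entirely, which is why we do it right when we stack.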

The resulting tensors ("rank-3" tensors, because they have three dimensions, or axes) each have a `shape` property.

In [None]:
stacked_threes.shape

That's 6131 images, each 28 pixels x 28 pixels, all "lined up" to make calculations efficient.

Now let's average all the three-images (or rather their tensor representations) along the *0-dimension* (the "stacked" dimension). The result will be the "average" or "ideal" three, a 28 x 28 tensor.

In [None]:
average3 = stacked_threes.mean(0)
show_image(average3)

It makes some intuitive sense that averaging all those images together would produce a blurry three.

And the average or ideal 7?

In [None]:
average7 = stacked_sevens.mean(0)
show_image(average7)

The seven is kind of interesting because it's extra blurry at its tips. We could guess that those are the parts of the images where there's more variation.
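
To see what averaging along dimension 0 does on a small scale, here's a sketch with three made-up 2 x 2 "images". The values are invented for illustration and are not taken from MNIST.

```python
import torch

# Three made-up 2 x 2 "images" (0 = white, 255 = black)
imgs = [
    torch.tensor([[0, 255], [0, 0]]),
    torch.tensor([[0, 255], [255, 0]]),
    torch.tensor([[0, 255], [0, 0]]),
]

stacked = torch.stack(imgs).float()   # shape (3, 2, 2): image index comes first
avg = stacked.mean(0)                 # average over the 3 images, leaving shape (2, 2)

print(stacked.shape)  # torch.Size([3, 2, 2])
print(avg)
```

The top-right pixel is black in every image, so it stays at 255 in the average. The bottom-left pixel is black in only one of the three, so it averages out to a gray 85. That's the same effect that blurs the tips of the average 7.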

### Measuring the "distance" between an image and the averages

Now that we have our average 3 and our average 7, let's take a random 3 image and see how similar it is to each of the average images.

How should we measure similarity? You might think we could just sum the difference for each pixel pair. But some differences will be positive and some will be negative, and summing them will tend to cancel the differences. So what we want is a method that won't cancel the differences.
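
To make the cancellation problem concrete, here's a tiny sketch in plain Python. The two four-pixel "images" are made up so that they disagree at every single pixel.

```python
# Two made-up 4-pixel "images" that differ at every pixel
image = [0, 255, 0, 255]
average = [255, 0, 255, 0]

# Pixel-by-pixel differences: [-255, 255, -255, 255]
diffs = [a - b for a, b in zip(image, average)]

signed_sum = sum(diffs)                               # positives and negatives cancel
abs_mean = sum(abs(d) for d in diffs) / len(diffs)    # absolute values can't cancel

print(signed_sum)  # 0
print(abs_mean)    # 255.0
```

The plain sum says the images are identical (a "distance" of 0), even though they couldn't be more different. Taking absolute values first gives the answer we'd expect.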

#### Mean Absolute Difference

One way to avoid self-cancelling differences is to take the average of the **absolute value** of the pixel-by-pixel differences. This method is called **mean absolute difference**.

We'll use our `ex_three_tensor` (defined above) as the specific image we want to compare to the averages. Here's how we'd calculate its similarity to the average 3 and the average 7:

In [None]:
dist_3_abs = (ex_three_tensor - average3).abs().mean()
dist_7_abs = (ex_three_tensor - average7).abs().mean()
dist_3_abs, dist_7_abs

The "distance" numbers themselves don't mean much. What's important is that the "distance" between our random 3 and our average 3 is smaller than the "distance" between our random 3 and our average 7. Using our method, we would predict that our random 3 is in fact a 3, or at least more likely to be a 3 than a 7.

Let's try again, this time with a random 7. I'll also use a built-in function from PyTorch, `F.l1_loss`. It looks different, but it's performing the same **mean absolute difference** calculation.

In [None]:
ex_seven_tensor = tensor(ex_seven)
dist_3_l1 = F.l1_loss(ex_seven_tensor.float(), average3)
dist_7_l1 = F.l1_loss(ex_seven_tensor.float(), average7)
dist_3_l1, dist_7_l1

Yahtzee! Our random seven was closer to the average 7 than the average 3.

#### Root Mean Squared Error (RMSE)

A second method to avoid self-cancelling differences is to take the average of the *squares* of the differences (because squares are never negative) and then take the square root of that average (which undoes the squaring). Again, we'll implement it twice: once "long hand" to test our random 3, and again, using a PyTorch function, to test our random 7.

In [None]:
dist_3_sqr = ((ex_three_tensor - average3)**2).mean().sqrt()
dist_7_sqr = ((ex_three_tensor - average7)**2).mean().sqrt()
dist_3_sqr, dist_7_sqr

Whew. It, too, shows that our random 3 is more like our average 3 than our average 7.

In [None]:
dist_3_sqr = F.mse_loss(ex_seven_tensor.float(), average3).sqrt()
dist_7_sqr = F.mse_loss(ex_seven_tensor.float(), average7).sqrt()
dist_3_sqr, dist_7_sqr

Nice. Our random 7 is more similar to our average 7 than our average 3. 

#### Mean Absolute Difference or Root Mean Squared Error?

Both worked. So which should you prefer? The answer, as usual, is: it depends. But RMSE will tend to "penalize" big errors more and be (comparatively) more forgiving of small errors. That's often the behavior you want, and you're more likely to see RMSE "in the wild".
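
Here's a quick sketch of that "penalizing" effect in plain Python. The helper names `mean_abs` and `rmse` are mine, and the error values are made up: both lists have the same total error, but one concentrates it in a single pixel.

```python
import math

def mean_abs(errors):
    """Mean absolute difference of a list of per-pixel errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of per-pixel errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

small_everywhere = [10, 10, 10, 10]   # four modest errors
one_big_error = [0, 0, 0, 40]         # same total error, all in one pixel

print(mean_abs(small_everywhere), mean_abs(one_big_error))  # 10.0 10.0
print(rmse(small_everywhere), rmse(one_big_error))          # 10.0 20.0
```

Mean absolute difference can't tell the two cases apart, while RMSE doubles for the case with one large error.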

### Is Our Similarity Comparison Good?

We correctly (if somewhat tediously) classified two digits. But that shouldn't give us a ton of confidence. We'd like to know how well our method will generalize. If we have a couple thousand 3 and 7 images, how many can we correctly classify?

We made an important but hard-to-spot shift, so let's slow down and repeat it. To judge our classifier, we don't want to measure the average *loss* or *distance* or *similarity* over our validation set. Instead, we want to calculate a different metric, the ***accuracy***, or the percentage of images correctly classified. (The ***error rate*** is just 1 minus the accuracy.) *Loss* or *similarity* scores don't mean much on their own. What matters is whether we make an accurate prediction.

#### Preparing our validation set

Because we haven't really "trained" a model, it's not so critical that we use a fresh set of images to validate our classifier. But the MNIST set already has a separate validation set, so we may as well use it.

I'm going to "stack" up all the validation images, but I'm not going to average them. This time, I'm creating that "stacked" tensor because we'll be able to run calculations on it more efficiently. This is also a chance to see how we can condense a lot of code we wrote above into just a few lines. Here goes . . .

In [None]:
validation_3_tensors = torch.stack([tensor(Image.open(img_path)) for img_path in (path/'valid'/'3').ls()]).float()
validation_7_tensors = torch.stack([tensor(Image.open(img_path)) for img_path in (path/'valid'/'7').ls()]).float()
validation_3_tensors.shape, validation_7_tensors.shape

You can see that we have a stack of 1010 3s and 1028 7s, each 28 pixels by 28 pixels.

Now we'd like to know the "distance" of each from *both* the average 3 and the average 7 so we can compare those distances and decide how to classify each: a 3 if it is more similar (less distant) to the average 3 than to the average 7.

Here's a little function to calculate the "distance" between two images (or, because of some optimizations that tensors make possible, between a stack of images and the average image):

In [None]:
def calc_distance(a, b):
    return (a-b).abs().mean((-1, -2))

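We'll use the *mean absolute difference* method. `(-1, -2)` means we'll average over the last and the second-to-last dimensions. For a single image, that's both of its dimensions; for a stack, it's the second and third dimensions, which leaves one distance per image.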
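
To check what `mean((-1, -2))` leaves behind, here's a sketch that repeats `calc_distance` and feeds it stand-in tensors. The all-zeros "average" and all-ones "images" are made up purely to keep the shapes easy to follow.

```python
import torch

def calc_distance(a, b):
    # Same definition as above: average over the last two dimensions
    return (a - b).abs().mean((-1, -2))

average = torch.zeros(28, 28)   # stand-in for an "average" image
single = torch.ones(28, 28)     # one made-up image
stack = torch.ones(5, 28, 28)   # five made-up images

single_dist = calc_distance(single, average)
stack_dist = calc_distance(stack, average)

print(single_dist.shape)  # torch.Size([])  -> a single number
print(stack_dist.shape)   # torch.Size([5]) -> one number per image
```

Because the dimensions are counted from the end, the same function works whether or not there's a leading "stack" dimension.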

And here's a little function to determine if a given image is a 3 -- that is, if it is more similar to the average 3 than the average 7:

In [None]:
def is_3(x):
    return calc_distance(x, average3) <= calc_distance(x, average7)

(Notice that we'll give the tie to 3. In practice, I doubt we'll ever get identical distances, so nothing really rides on that decision.)

We can test out our little functions on our examples from earlier:

In [None]:
is_3(ex_three_tensor)

In [None]:
is_3(ex_seven_tensor)

Okay, moment of truth. Let's see how many of our validation images `is_3` correctly classifies.

In [None]:
accuracy_3s = is_3(validation_3_tensors).float().mean()
accuracy_7s = (1 - is_3(validation_7_tensors).float()).mean()
accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2

Not too bad! We correctly identified 7s more often than we correctly identified 3s, but between the two, we have ~95% accuracy. Before you get too excited, though, remember that 3 and 7 are not *that* similar, and we made the task easier by only needing to distinguish between two digits.

(NOTE: you may have wondered how we could pass a whole "stack" of images to `is_3`. Didn't we write it to find the difference between just two images (one of which was the "averaged" image)? Well yes! But it works because PyTorch defines tensor operations so that the "unstacked" "averaged" image gets *broadcast* -- treated *as if* it had the same dimensions as the stack. On top of that, if executed on a GPU, it can do whole batches of those calculations in parallel.)
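
If you'd like to convince yourself that broadcasting really does behave "as if" the average image were copied once per stacked image, here's a small sketch. The 2 x 2 tensors are made up, and `expand` is just one way to spell out the copying explicitly.

```python
import torch

stack = torch.arange(12.).reshape(3, 2, 2)      # 3 made-up 2 x 2 "images"
average = torch.tensor([[1., 1.], [1., 1.]])    # one 2 x 2 "average" image

broadcast_diff = stack - average                   # shapes (3, 2, 2) and (2, 2)
explicit_diff = stack - average.expand(3, 2, 2)    # the average repeated 3 times

print(broadcast_diff.shape)                        # torch.Size([3, 2, 2])
print(torch.equal(broadcast_diff, explicit_diff))  # True
```

PyTorch gets the same result without actually making the copies, which is part of why the stacked calculation is fast.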