# Classification

## Background
In the previous notebooks 'MinimizingLoss.ipynb', 'SequentialModel.ipynb' and 'MultiLayerSeqMod.ipynb', we looked at how computers learn with ML: the internal model parameters of a function (or a neuron) are fitted so that input values map onto output values when there is a linear relationship between them.
**This is often called Regression, where the neural network predicts a numeric value given one or more inputs.**

Inputs and outputs can also have non-linear relationships in neural nets. Such relationships can be useful for extracting hidden feature information from the data. For example, if we have image data, a neural net can learn the features of the images and then tell us what kind of image each one is.
**This is called Classification.**

## Main Idea Breakdown
The main idea in classification also has its roots in the simpler regression neuron. Classification entails a slightly more advanced dense neural network (DNN), which means multiple layers of neurons where each neuron from the previous layer is connected to each neuron in the next layer.
Note that we are now talking about multiple neurons in the topmost layer as well. We haven't played with such a layer yet; in the previous notebooks we always had a single neuron in the topmost layer.
So, imagine we have two layers:
1) Layer1 has i=20 neurons
2) Layer2 has j=10 neurons
Every neuron in Layer1 is connected to every neuron in Layer2, which is what makes this a DNN.
Imagine that each neuron provides a function y=wx+b, where y is the output of the neuron and x is its input.  

Now think of each neuron in Layer1 as learning (optimizing) a weight on the input data x and calculating a y. Each neuron in the next layer, Layer2, receives inputs from every neuron in Layer1. The following relationships hold:

Eq.1: Layer1_Ni_Output = (L1NiW * valueToPredict) + L1NiB  
Eq.2: Layer2_Nj_Output = [(L2NjW0 * Layer1_N0_Output) + (L2NjW1 * Layer1_N1_Output) + ... + (L2NjWi * Layer1_Ni_Output)] + L2NjB

We can see that every neuron in Layer2 calculates the regression function given in Eq.2: it learns a weight for the output of each Layer1 neuron, sums up the weighted outputs, and adds its own learned bias parameter to reach its output.
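To make Eq.1 and Eq.2 a bit more concrete, here is a minimal NumPy sketch of the two layers. The random parameters stand in for learned ones, and the variable names (`w1`, `b1`, `w2`, `b2`) and the input value are assumptions made up for illustration; this is not how Keras stores things internally.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sizes from the text: 20 neurons in Layer1, 10 neurons in Layer2, one input value.
n_inputs, n_layer1, n_layer2 = 1, 20, 10

# Random parameters stand in for learned ones (illustration only).
w1 = rng.normal(size=(n_inputs, n_layer1))   # one weight per Layer1 neuron
b1 = rng.normal(size=n_layer1)               # one bias per Layer1 neuron
w2 = rng.normal(size=(n_layer1, n_layer2))   # 20 weights per Layer2 neuron
b2 = rng.normal(size=n_layer2)               # one bias per Layer2 neuron

x = np.array([3.0])  # plays the role of valueToPredict

layer1_output = x @ w1 + b1               # Eq.1 for all 20 Layer1 neurons at once
layer2_output = layer1_output @ w2 + b2   # Eq.2 for all 10 Layer2 neurons at once

# Eq.2 written out term by term for a single Layer2 neuron j
j = 4
manual = sum(w2[i, j] * layer1_output[i] for i in range(n_layer1)) + b2[j]

print(layer1_output.shape, layer2_output.shape)
print(np.isclose(manual, layer2_output[j]))
```

The matrix product is just a compact way of writing the sum in Eq.2 for every Layer2 neuron at the same time.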

However, here is a twist. For classification purposes, you can associate a certain Layer2 neuron with a certain property of the data. For example, if Layer2N1 gives a distinct value for a dog, you could associate Layer2N1 with dog and not cat. This is how, when you predict or validate, you can calculate the probability that the input matches a pattern of Layer2 outputs that the model previously saw for a cat or a dog. How correct is this understanding?

## Linus Says: <F*** off, show me the code>
Let's build a DNN with the same layering approach as described above. We are going to use the MNIST data set of handwritten digits 0-9.

### Info about the data set
The data set contains 60,000 images of handwritten digits from 0 to 9 for training. There are 10,000 more images for validation purposes. 
Each image is made up of 28x28 pixels, where each pixel is monochrome with a value ranging from 0 to 255. Usually, to simplify and speed up the processing, we divide every pixel value by 255.0 so that we get values in the range from 0 to 1.
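As a small sketch of that scaling step, here is the division applied to a stand-in image. The random array below is an assumption that only mimics the shape and value range of an MNIST image; it is not real data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for one MNIST image: 28x28 integer pixels in the range 0-255.
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Dividing by 255.0 converts to floats in the range 0-1.
scaled = image / 255.0

print(image.dtype, image.min(), image.max())
print(scaled.dtype, scaled.min(), scaled.max())
```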

### Laying down the model
To understand the basics of what a tensor is, check this out: https://tensorflow.org/guide/tensor.
Each image has shape (28, 28), i.e., it is a rank-2 tensor: it has 2 axes/dimensions, each of length 28. Our dense layers treat each input sample as a one-dimensional list of values, with one weight per value, so the two-dimensional image has to be rearranged first. Thus we choose a simple approach.

This approach entails laying down a one-dimensional layer of neurons and then pre-processing each image to become one-dimensional as well. **This technique is called Flattening.** So we flatten the image from shape (28, 28) into shape (784,). We have basically flattened a two-dimensional tensor into a vector, i.e., a tensor with 1 axis or dimension. Now we have 784 values or data points that represent an image.   
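Flattening can be sketched with a plain NumPy reshape. The array below is a made-up stand-in for an image, filled with 0..783 so that it is easy to see where each pixel ends up; as far as I can tell, the Keras `Flatten` layer does the equivalent rearrangement per sample.

```python
import numpy as np

# Stand-in for one image: values 0..783 arranged in 28 rows of 28.
image = np.arange(28 * 28).reshape(28, 28)

# Flatten the two axes into one, row after row.
flat = image.reshape(-1)

print(image.ndim, image.shape)   # 2 axes
print(flat.ndim, flat.shape)     # 1 axis with 784 values

# The pixel at (row, col) ends up at position row * 28 + col.
row, col = 5, 3
print(flat[row * 28 + col] == image[row, col])
```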

The number of neurons in the first layer is arbitrary for now, although there are techniques for arriving at an optimized number. We choose to have 20 neurons in the first layer. While defining layer 1 we are going to use an activation function called ReLU, which stands for rectified linear unit. **Every neuron calls its activation function when its layer is in use. The ReLU function changes any output that is less than 0 to 0.**

if Output < 0
   then Output = 0

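One intuition is that negative outputs can then no longer cancel out positive ones when they are summed up in the next layer. The ReLU activation also introduces non-linearity into the system, which is not a bad thing: without it, stacked layers of y=wx+b neurons collapse into a single linear function. How can we demonstrate that?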
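One possible way to demonstrate it, as a rough NumPy sketch: two layers without an activation function can always be rewritten as a single linear layer, whereas ReLU breaks the additivity that a linear function must have. The layer sizes and random parameters below are assumptions chosen only to keep the example small.

```python
import numpy as np

def relu(values):
    """Replace every value below 0 with 0."""
    return np.maximum(values, 0.0)

rng = np.random.default_rng(seed=1)
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
w2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
x = rng.normal(size=4)

# Two layers with no activation in between...
stacked = (x @ w1 + b1) @ w2 + b2
# ...give the same result as one layer with combined weights and bias.
collapsed = x @ (w1 @ w2) + (b1 @ w2 + b2)
linear_layers_collapse = np.allclose(stacked, collapsed)

# A linear function f satisfies f(a + b) == f(a) + f(b). ReLU does not.
relu_is_additive = relu(-1.0 + 1.0) == relu(-1.0) + relu(1.0)

print(linear_layers_collapse)   # the two linear layers are really one
print(relu_is_additive)         # ReLU is not linear
```

So without the activation, adding more layers would not let the model learn anything a single layer couldn't.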

The number of neurons in the 2nd layer is not arbitrary. The number is 10 because we have 10 digits, 0-9, which we want to classify. Layer 2 is going to use another activation function, known as softmax. This function turns the 10 outputs into probabilities that sum to 1; the class with the largest probability is then taken as the prediction.  
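Here is a minimal sketch of softmax in NumPy. The three input values are made up for illustration; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw layer outputs into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # numerical stability, same result
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([1.0, 3.0, 0.5])      # made-up raw outputs of three neurons
probabilities = softmax(logits)
predicted_class = int(np.argmax(probabilities))

print(probabilities, probabilities.sum())
print(predicted_class)
```

Picking the winner is a separate step (`np.argmax`), which is what the prediction cell further down in this notebook does as well.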

## Model Compilation Parameters
We are using some new things here. So let's add some info about them. 

An optimizer called 'adam': It is like stochastic gradient descent, but it adapts the descent step size for each parameter on the fly. That often means it converges more quickly. 

A loss function called 'sparse categorical cross entropy': I am not too sure about this yet. Some useful info can be read here: https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other.
A rough rule of thumb would be to use one of the categorical loss functions for classification models. It seems that if you use one-hot encoding (using 0s and 1s to mark labels), you use categorical cross entropy.
If you use integer labels instead (0, 1, 2, ...), you use sparse categorical cross entropy.  
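As far as I understand it, the two losses compute the same number for a single sample and differ only in how the label is written down. A small NumPy sketch, with made-up probabilities for a 3-class problem:

```python
import numpy as np

# Made-up softmax output for one sample of a 3-class problem.
predicted = np.array([0.1, 0.7, 0.2])

# The same label, encoded in two ways.
sparse_label = 1                            # integer class index
one_hot_label = np.array([0.0, 1.0, 0.0])   # one-hot vector

# Sparse version: look up the probability of the true class directly.
sparse_loss = -np.log(predicted[sparse_label])

# Categorical version: the one-hot vector selects the same term.
categorical_loss = -np.sum(one_hot_label * np.log(predicted))

print(sparse_loss, categorical_loss)
```

The MNIST labels come as integers 0-9, which is why the sparse variant fits here without any extra encoding step.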

An indication of the type of metric to be used: We have chosen the 'accuracy' metric. There are numerous metrics defined here https://www.tensorflow.org/api_docs/python/tf/keras/metrics. The 'accuracy' metric basically reports how often predictions match the corresponding labels.
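A rough sketch of what such an accuracy number means, using made-up predictions for four samples of a 3-class problem (this mimics the idea, not the Keras implementation):

```python
import numpy as np

# Made-up predicted probabilities: 4 samples, 3 classes.
probabilities = np.array([
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
labels = np.array([1, 0, 2, 1])

# The predicted class is the one with the highest probability.
predicted = np.argmax(probabilities, axis=1)

# Accuracy is the fraction of predictions that match their label.
accuracy = np.mean(predicted == labels)

print(predicted)
print(accuracy)
```

Here three of the four predictions match their labels, so the accuracy is 0.75.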

In [2]:
# Let's use the MNIST data set, why not
import tensorflow as tf
data = tf.keras.datasets.mnist
(training_images, training_labels), (validation_images, validation_labels) = data.load_data()
# Scale pixel values from 0-255 down to the range 0-1
training_images = training_images/255.0
validation_images = validation_images/255.0

# Hidden layer with 20 neurons, output layer with one neuron per digit
layer1 = tf.keras.layers.Dense(units=20, activation=tf.nn.relu)
layer2 = tf.keras.layers.Dense(units=10, activation=tf.nn.softmax)

# Flatten turns each 28x28 image into 784 values before the dense layers
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape=(28,28)), layer1, layer2])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x=training_images, y=training_labels, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f122855b0a0>

In [6]:
# This function evaluates the model on the set of 10,000 validation images.
# Note the accuracy is usually a bit lower than the accuracy measured while training.
# The reason is that the model hasn't seen this data before.
model.evaluate(validation_images, validation_labels)



[0.1437711864709854, 0.9595000147819519]

In [24]:

import numpy as np

# Predict class probabilities for every validation image, then inspect the first one
label = validation_labels[0]
classification = model.predict(validation_images)

print(f'Value: {np.argmax(classification[0])} \n Probability:{np.amax(classification[0])*100},  \n Label: {label}')


Value: 7 
 Probability:99.89122152328491,  
 Label: 7


## Some manual analysis

In [30]:
print(f'Total number of weights at layer1: {layer1.get_weights()[0].size}')
print(f'Expected number of weights: {20*784}')

Total number of weights at layer1: 15680
Expected number of weights: 15680


Each of the 20 neurons in layer1 receives the 784 flattened pixel values of an image and has one weight per value. Hence the 20*784 = 15680 printed above.

In [31]:
print(f'Total number of weights at layer2: {layer2.get_weights()[0].size}')
print(f'Expected number of weights: {20*10}')

Total number of weights at layer2: 200
Expected number of weights: 200


Each of the 10 neurons in layer2 receives 20 inputs from the previous layer, and hence each neuron has to learn 20 weights. 
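The counts above cover only the weights. Each neuron also has one bias, which Keras returns as the second array from `get_weights()`. A small sketch of the full tally follows; the helper `dense_parameter_count` is a hypothetical name made up for this example, not a Keras function.

```python
def dense_parameter_count(n_inputs, n_units):
    """Return (weights, biases) for a dense layer: one weight per
    input per neuron, plus one bias per neuron."""
    weights = n_inputs * n_units
    biases = n_units
    return weights, biases

layer1_weights, layer1_biases = dense_parameter_count(784, 20)
layer2_weights, layer2_biases = dense_parameter_count(20, 10)

total_parameters = (layer1_weights + layer1_biases
                    + layer2_weights + layer2_biases)

print(layer1_weights, layer1_biases)
print(layer2_weights, layer2_biases)
print(total_parameters)
```

Under these assumptions the total is 15680 + 20 + 200 + 10 = 15910 trainable parameters, which should be the number that `model.summary()` reports for this model.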