# TensorFlow 2: Softmax Regression on the MNIST Handwritten Digit Dataset

<h2>[1]. Introduction</h2>
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems.
The database is also widely used for training and testing in the field of machine learning.
Furthermore, the black and white images from MNIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

The MNIST database contains 60,000 training images and 10,000 testing images.
A number of scientific papers have attempted to achieve the lowest error rate; one paper,
using a hierarchical system of convolutional neural networks, reports an error rate on the MNIST database of 0.23%.
In their original paper, the creators of the database used a support vector machine to get an error rate of 0.8%.
An extended dataset similar to MNIST, called EMNIST, was published in 2017; it contains 240,000 training images and 40,000 testing images of handwritten digits and characters.

The images are 28x28 grayscale pictures of the 10 digits. More info can be found at the MNIST homepage.

MNIST is a public-domain dataset compiled by LeCun, Cortes, and Burges. Each image shows how a human manually wrote a particular digit from 0 to 9 and is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see http://yann.lecun.com/exdb/mnist/

<h2>[2]. Complete Code</h2>

In [39]:
import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data(path="C:\\Users\\RAZ3KOR\\SimpliLearn\\data\\mnist.npz")
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.08143853396177292, 0.9761000275611877]

<h2>[3]. Code Explanation</h2>

Helpful links:

1. Machine Learning Glossary: https://developers.google.com/machine-learning/glossary

2. TensorFlow 2 quickstart for beginners:  https://www.tensorflow.org/tutorials/quickstart/beginner

<h3>[3.1]. First of all, we import the dependencies.</h3>

In [1]:
# Import the modules/dependencies
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

print("TensorFlow version:", tf.__version__)  # prints the installed version (2.2.0 in the run below)

TensorFlow version: 2.2.0


<h3>[3.2]. Download and load the dataset.</h3>

<h5>tf.keras.datasets.mnist.load_data()</h5>

It loads the MNIST dataset and returns a tuple of NumPy arrays: (x_train, y_train), (x_test, y_test).

x_train: uint8 NumPy array of grayscale image data with shape (60000, 28, 28), containing the training data. Pixel values range from 0 to 255.

y_train: uint8 NumPy array of digit labels (integers in range 0-9) with shape (60000,) for the training data.

x_test: uint8 NumPy array of grayscale image data with shape (10000, 28, 28), containing the test data. Pixel values range from 0 to 255.

y_test: uint8 NumPy array of digit labels (integers in range 0-9) with shape (10000,) for the test data.

In [2]:
# The mnist.npz file has already been downloaded here, so we pass its full path.
# If the file exists at the given path, it is loaded from disk; otherwise Keras tries to download it from the server.
# For more information, see: https://github.com/keras-team/keras/blob/master/keras/datasets/mnist.py#L11
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data(path="C:\\Users\\RAZ3KOR\\SimpliLearn\\data\\mnist.npz")
x_train, x_test = x_train / 255.0, x_test / 255.0
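
The division by 255.0 above rescales the uint8 pixel values to floating-point numbers in the range [0, 1]. The following is a minimal NumPy sketch of that step on a synthetic stand-in for the images; the array `fake_images` is made up for illustration and is not MNIST data.

```python
import numpy as np

# Synthetic stand-in for a small batch of MNIST-like images (NOT real data):
# uint8 values in [0, 255] with shape (batch, 28, 28).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Dividing by the float 255.0 promotes the array to floating point
# and maps the pixel range [0, 255] onto [0.0, 1.0].
scaled = fake_images / 255.0

print(fake_images.dtype, scaled.dtype)
print(scaled.min(), scaled.max())
```

Keeping inputs in a small, fixed range like this is a common choice because it tends to make gradient-based training better behaved.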

<h3>[3.3]. Build the machine learning model.</h3>

<h4>tf.keras.layers.Flatten: Flattens the input. Does not affect the batch size</h4>
tf.keras.layers.Flatten(
    data_format=None, **kwargs
)

data_format =	A string, one of channels_last (default) or channels_first. The ordering of the dimensions in the inputs. channels_last corresponds to inputs with shape (batch, ..., channels) while channels_first corresponds to inputs with shape (batch, channels, ...). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last". 

In [3]:
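model = tf.keras.Sequential()
# 64 filters, kernel_size=3, strides=3; with channels_last the input is read as height 3, width 32, 32 channels.
model.add(tf.keras.layers.Conv2D(64, kernel_size=3, strides=3, input_shape=(3, 32, 32)))
print(model.output_shape) #(None, 1, 10, 64)

model.add(tf.keras.layers.Flatten())
print(model.output_shape) #(None, 640)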

(None, 1, 10, 64)
(None, 640)
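
Conceptually, flattening keeps the batch axis and merges all remaining axes into one. The sketch below mimics this with a plain NumPy reshape; it only illustrates the shape arithmetic and is not how the Keras layer is implemented.

```python
import numpy as np

# A batch of 5 images of 28x28 pixels (all zeros, only the shape matters here).
batch = np.zeros((5, 28, 28), dtype=np.float32)

# Keep the batch axis, merge every other axis into a single one.
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # (5, 784)

# Same idea for a tensor shaped like the Conv2D output above, with batch size 2.
conv_out = np.zeros((2, 1, 10, 64), dtype=np.float32)
conv_flat = conv_out.reshape(conv_out.shape[0], -1)
print(conv_flat.shape)  # (2, 640)
```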


<h4>tf.keras.layers.Dense: Just your regular densely-connected NN layer</h4>
tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True). These are all attributes of Dense.

Note: If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).

Besides, layer attributes cannot be modified after the layer has been called once (except the trainable attribute). When the popular kwarg input_shape is passed, Keras creates an input layer to insert before the current layer. This can be treated as equivalent to explicitly defining an InputLayer.

Example:

In [4]:
# Create a `Sequential` model and add a Dense layer as the first layer.
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(16,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
# Now the model will take as input arrays of shape (None, 16)
# and output arrays of shape (None, 32).
# Note that after the first layer, you don't need to specify
# the size of the input anymore:
model.add(tf.keras.layers.Dense(32))
model.output_shape #(None, 32)

(None, 32)

Args:

units: 	Positive integer, dimensionality of the output space.

activation: 	Activation function to use. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).

use_bias: 	Boolean, whether the layer uses a bias vector.

kernel_initializer: 	Initializer for the kernel weights matrix.

bias_initializer: 	Initializer for the bias vector.

kernel_regularizer: 	Regularizer function applied to the kernel weights matrix.

bias_regularizer: 	Regularizer function applied to the bias vector.

activity_regularizer: 	Regularizer function applied to the output of the layer (its "activation").

kernel_constraint: 	Constraint function applied to the kernel weights matrix.

bias_constraint: 	Constraint function applied to the bias vector.

Input shape:
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).

Output shape:
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
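
The Dense operation described in this section, output = activation(dot(input, kernel) + bias), can be sketched in NumPy as follows. The helper names `relu` and `dense_forward` are hypothetical and the weights are random; this is an illustration of the shapes involved, not the Keras implementation.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU activation."""
    return np.maximum(x, 0.0)

def dense_forward(inputs, kernel, bias, activation=None):
    """Hypothetical helper: activation(dot(inputs, kernel) + bias).

    The dot product runs over the last axis of `inputs` and axis 0 of `kernel`,
    so inputs of rank greater than 2 are handled as well.
    """
    z = np.tensordot(inputs, kernel, axes=1) + bias
    return z if activation is None else activation(z)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(16, 32))  # (input_dim, units)
bias = np.zeros(32)                 # (units,)

x2d = rng.normal(size=(8, 16))      # (batch_size, input_dim)
x3d = rng.normal(size=(8, 5, 16))   # (batch_size, d0, input_dim)

out2d = dense_forward(x2d, kernel, bias, activation=relu)
out3d = dense_forward(x3d, kernel, bias)
print(out2d.shape)  # (8, 32)
print(out3d.shape)  # (8, 5, 32)
```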

<h4>tf.keras.layers.Dropout: Applies Dropout to the input</h4>
tf.keras.layers.Dropout(
    rate, noise_shape=None, seed=None, **kwargs
)

The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.

Note that the Dropout layer only applies when training is set to True such that no values are dropped during inference. When using model.fit, training will be appropriately set to True automatically, and in other contexts, you can set the kwarg explicitly to True when calling the layer.

(This is in contrast to setting trainable=False for a Dropout layer. trainable does not affect the layer's behavior, as Dropout does not have any variables/weights that can be frozen during training.)





In [5]:
tf.random.set_seed(0)
layer = tf.keras.layers.Dropout(.2, input_shape=(2,))
data = np.arange(10).reshape(5, 2).astype(np.float32)
print(data)

[[0. 1.]
 [2. 3.]
 [4. 5.]
 [6. 7.]
 [8. 9.]]


In [6]:
outputs = layer(data, training=True)
print(outputs)

tf.Tensor(
[[ 0.    1.25]
 [ 2.5   3.75]
 [ 5.    6.25]
 [ 7.5   8.75]
 [10.    0.  ]], shape=(5, 2), dtype=float32)


Args:

rate: 	Float between 0 and 1. Fraction of the input units to drop.

noise_shape: 	1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features).

seed: 	A Python integer to use as random seed.

Call arguments:

    inputs: Input tensor (of any rank).
    training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (doing nothing).
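
In the sample output above, nine of the ten values are kept and multiplied by 1/(1 - 0.2) = 1.25, and one is set to 0. The following NumPy sketch shows the same scaling idea (often called inverted dropout). The helper `inverted_dropout` is a hypothetical name, and the sketch is an illustration rather than the Keras implementation.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Hypothetical helper: zero each element with probability `rate`
    and scale the surviving elements by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
out = inverted_dropout(x, rate=0.2, rng=rng)

dropped_fraction = float((out == 0).mean())
print(dropped_fraction)  # close to the rate of 0.2
print(out.mean())        # close to 1.0, the mean of the input
```

Because the surviving values are scaled up, the expected sum of the outputs matches the sum of the inputs, so no rescaling is needed at inference time.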


In [10]:
# Build a tf.keras.Sequential model by stacking layers.
# Each image is a 2D matrix; the Flatten layer converts it to a 1D vector based on input_shape.
model = tf.keras.models.Sequential([
  # Flatten: output size 28*28 = 784, no parameters; does not affect the batch size.
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  # Dense: output size 128; parameters = weights 784*128 + biases 128 = 100480.
  tf.keras.layers.Dense(128, activation='relu'),
  # Dropout: during training, zeroes 20% of its inputs and scales the rest by 1/(1 - 0.2); output size 128, no parameters.
  tf.keras.layers.Dropout(0.2),
  # Dense with no activation, since softmax is applied later: output size 10; parameters = weights 128*10 + biases 10 = 1290.
  tf.keras.layers.Dense(10)
])

In [9]:
#Once a model is "built", you can call its summary() method to display its contents:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               100480    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
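
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A small sketch of that arithmetic, using a hypothetical helper `dense_param_count`:

```python
def dense_param_count(input_dim, units, use_bias=True):
    """Hypothetical helper: number of trainable parameters in a Dense layer."""
    return input_dim * units + (units if use_bias else 0)

flatten_out = 28 * 28                                # Flatten has no parameters
hidden_params = dense_param_count(flatten_out, 128)  # first Dense layer
output_params = dense_param_count(128, 10)           # second Dense layer
total_params = hidden_params + output_params

print(flatten_out, hidden_params, output_params, total_params)
# 784 100480 1290 101770
```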


In [8]:
#to know the input shape
model.input_shape #(None, 28, 28)

(None, 28, 28)

In [58]:
#to know the output shape
model.output_shape #(None, 10)

(None, 10)

<h4>logits: in a multi-class classification problem, logits typically become an input to the softmax function</h4>

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see tf.nn.sigmoid_cross_entropy_with_logits.

<h4>log-odds</h4>

The logarithm of the odds of some event.

If the event refers to a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

odds = p/(1-p) = 0.9/0.1 = 9

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

log-odds = ln(9) ≈ 2.2

The log-odds are the inverse of the sigmoid function.
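
The relationship between odds, log-odds, and the sigmoid function can be checked numerically for the 90% example above. The function names `log_odds` and `sigmoid` in this sketch are illustrative choices, not part of any library used in this notebook.

```python
import math

def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic sigmoid, which maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.9
z = log_odds(p)
recovered = sigmoid(z)

print(round(z, 4))          # 2.1972
print(round(recovered, 4))  # 0.9
```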

In [14]:
predictions = model(x_train[:1]).numpy() #the model returns a vector of logits or log-odds scores, one for each class
predictions

array([[-0.18428129, -0.7219969 , -0.47647095,  0.10074701, -0.2048079 ,
        -0.30569717, -0.27898103, -0.15912877, -0.20344016, -0.17858827]],
      dtype=float32)

In [15]:
#The tf.nn.softmax function converts these logits to probabilities for each class:
tf.nn.softmax(predictions).numpy()

array([[0.1058458 , 0.06182252, 0.07902733, 0.14075372, 0.1036953 ,
        0.09374398, 0.09628221, 0.10854186, 0.10383722, 0.1064501 ]],
      dtype=float32)
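
To see what the softmax step does, the sketch below applies a plain NumPy softmax to the logits printed above. The `softmax` helper is written here for illustration (it is not tf.nn.softmax), and it subtracts the maximum logit before exponentiating, a common trick for numerical stability.

```python
import numpy as np

def softmax(logits):
    """Illustrative softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Logits copied from the output of the untrained model above.
logits = np.array([[-0.18428129, -0.7219969, -0.47647095, 0.10074701, -0.2048079,
                    -0.30569717, -0.27898103, -0.15912877, -0.20344016, -0.17858827]])

probs = softmax(logits)
print(probs.round(4))
print(probs.sum())  # the probabilities sum to 1 (up to rounding)
```

The largest logit (index 3) receives the largest probability, and all values stay close to 1/10 because the model is still untrained.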

<h4>Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.</h4>

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and the index of the true class and returns a scalar loss for each example.

In [30]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [27]:
print(y_train[1:])

[0 4 1 ... 5 6 8]


This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.

In [31]:
loss_fn(y_train[:1], predictions).numpy()

2.3671877
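
The loss value above can be reproduced by hand as the negative log-probability of the true class. The sketch below does this in NumPy for the logits printed earlier. It assumes the label of the first training image is 5, which is not shown in this notebook, and the helper name `sparse_ce_from_logits` is hypothetical.

```python
import numpy as np

def sparse_ce_from_logits(label, logits):
    """Hypothetical helper: negative log-softmax probability of `label`.

    Uses the log-sum-exp form so that no probability is computed explicitly.
    """
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

# Logits of the untrained model for the first training image (copied from above).
logits = np.array([-0.18428129, -0.7219969, -0.47647095, 0.10074701, -0.2048079,
                   -0.30569717, -0.27898103, -0.15912877, -0.20344016, -0.17858827])
label = 5  # ASSUMPTION: true class of the first training image

loss = sparse_ce_from_logits(label, logits)
uniform_loss = float(-np.log(1 / 10))  # loss of a perfectly uniform guess

print(round(loss, 4))          # 2.3672
print(round(uniform_loss, 4))  # 2.3026
```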

<h3>[3.4]. Compile the machine learning model</h3>

Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.

In [32]:
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

<h3>[3.5]. Train and evaluate your model</h3>


In [33]:
#Use the Model.fit method to adjust your model parameters and minimize the loss:
model.fit(x_train, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x249dbba8048>

In [34]:
#The Model.evaluate method checks the model's performance, usually on a "Validation-set" or "Test-set".
model.evaluate(x_test,  y_test, verbose=2)

313/313 - 0s - loss: 0.0754 - accuracy: 0.9768


[0.07540382444858551, 0.9768000245094299]
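
The numbers in the evaluation output can be related back to the test set with a little arithmetic. The sketch below assumes the Keras default batch size of 32 for Model.evaluate, since no batch size was passed above.

```python
import math

num_test_images = 10_000
batch_size = 32  # ASSUMPTION: Keras default, none was set explicitly above

# Number of evaluation steps shown as "313/313" in the log.
steps = math.ceil(num_test_images / batch_size)

# Number of correctly classified test images implied by the reported accuracy.
accuracy = 0.9768000245094299
correct = round(accuracy * num_test_images)

print(steps, correct)  # 313 9768
```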

The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.

In [35]:
#If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:
probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

In [37]:
x_test[:5]

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

    

In [36]:
probability_model(x_test[:5])

<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[1.35855611e-08, 2.77398762e-08, 1.94282929e-05, 1.47160172e-05,
        4.52103563e-11, 1.24562516e-09, 5.92374186e-15, 9.99949932e-01,
        7.01450489e-08, 1.57407758e-05],
       [9.70423844e-06, 2.09432710e-05, 9.99934077e-01, 2.17639972e-05,
        3.52543569e-13, 1.33275580e-05, 1.80792341e-08, 2.92987070e-14,
        1.17694384e-07, 9.48589471e-12],
       [2.52938025e-06, 9.99515414e-01, 3.47253372e-05, 2.93347057e-05,
        6.33147065e-05, 5.51480935e-06, 3.93062073e-05, 2.50117824e-04,
        5.94164667e-05, 4.25687773e-07],
       [9.99409318e-01, 7.16708648e-09, 3.37866441e-05, 3.05442889e-08,
        3.95960296e-06, 6.81661686e-06, 5.38714230e-04, 3.43730267e-06,
        2.16267892e-07, 3.69892700e-06],
       [2.30002456e-06, 3.09313570e-08, 5.78733961e-05, 7.96103379e-08,
        9.96653378e-01, 9.81490231e-08, 4.82993564e-05, 5.18467095e-05,
        4.14015085e-06, 3.18196113e-03]], dtype=float32)>
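
Each row of the output above is a probability distribution over the ten digits, so the predicted class is the index of the largest entry. A short NumPy sketch using the values printed above:

```python
import numpy as np

# Probabilities copied from the probability_model output above (5 images x 10 classes).
probs = np.array([
    [1.35855611e-08, 2.77398762e-08, 1.94282929e-05, 1.47160172e-05, 4.52103563e-11,
     1.24562516e-09, 5.92374186e-15, 9.99949932e-01, 7.01450489e-08, 1.57407758e-05],
    [9.70423844e-06, 2.09432710e-05, 9.99934077e-01, 2.17639972e-05, 3.52543569e-13,
     1.33275580e-05, 1.80792341e-08, 2.92987070e-14, 1.17694384e-07, 9.48589471e-12],
    [2.52938025e-06, 9.99515414e-01, 3.47253372e-05, 2.93347057e-05, 6.33147065e-05,
     5.51480935e-06, 3.93062073e-05, 2.50117824e-04, 5.94164667e-05, 4.25687773e-07],
    [9.99409318e-01, 7.16708648e-09, 3.37866441e-05, 3.05442889e-08, 3.95960296e-06,
     6.81661686e-06, 5.38714230e-04, 3.43730267e-06, 2.16267892e-07, 3.69892700e-06],
    [2.30002456e-06, 3.09313570e-08, 5.78733961e-05, 7.96103379e-08, 9.96653378e-01,
     9.81490231e-08, 4.82993564e-05, 5.18467095e-05, 4.14015085e-06, 3.18196113e-03],
])

predicted_digits = np.argmax(probs, axis=1)  # most likely class per image
row_sums = probs.sum(axis=1)                 # each row sums to about 1

print(predicted_digits)  # [7 2 1 0 4]
```

Comparing `predicted_digits` with the corresponding entries of y_test is one way to spot-check individual predictions.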

<h2>[4]. Conclusion</h2>
Congratulations! You have trained a machine learning model on a prebuilt dataset using the Keras API.

For more examples of using Keras, check out the tutorials. To learn more about building models with Keras, read the guides. If you want to learn more about loading and preparing data, see the tutorials on image data loading or CSV data loading.