<a href="https://colab.research.google.com/github/marionwenger/DLColabNotebooks/blob/main/notebooks/12b_mnist_loglike.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calculation of the cross entropy loss (NLL) for a classification task


**Goal:** In this notebook you will use Keras to set up a CNN for classification of MNIST images and calculate the cross entropy before the CNN is trained. You will use basic numpy functions to calculate the loss that is expected from random guessing and see that an untrained CNN is no better than guessing.

**Usage:** The idea of the notebook is that you try to understand the provided code by running it, checking the output and playing with it by slightly changing the code and rerunning it.

**Dataset:** You work with the MNIST dataset. You have 60'000 28x28 pixel greyscale images of digits (0-9).

**Content:**
* load the original MNIST data
* define a CNN in Keras
* evaluate the cross entropy loss function of the untrained CNN for all classes
* implement the loss function yourself using the predicted probabilities and numpy


| [open in colab](https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_04/nb_ch04_02.ipynb)



#### Imports

First you load all the required libraries.

In [14]:
# load required libraries:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')
from sklearn.metrics import confusion_matrix

import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Convolution2D, MaxPooling2D, Flatten , Activation
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import optimizers

### Exercise: Likelihood if you have no clue
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  
If you have no idea about the training dataset, your best guess for every image is to assign each class the same probability, 1/nr_of_classes. Calculate the NLL for that case.

In [15]:
# @title Solution
nr_of_classes=10
-np.log(1/nr_of_classes) # but this is not the whole NLL, is it? wouldn't you have to sum over the classes?

2.3025850929940455
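
The single term above is already the full NLL for one image: with a one-hot label, the sum over classes keeps only the term of the true class. A minimal numpy sketch, assuming a uniform guess and an arbitrarily chosen true class (`true_class = 3` is a hypothetical value used only for illustration):

```python
import numpy as np

nr_of_classes = 10
true_class = 3  # hypothetical label, any class gives the same result

# uniform guess: every class gets probability 1/nr_of_classes
p_uniform = np.full(nr_of_classes, 1 / nr_of_classes)

# one-hot encoding of the true label
y_one_hot = np.zeros(nr_of_classes)
y_one_hot[true_class] = 1.0

# cross entropy for one image: sum over classes of -y_k * log(p_k);
# all terms with y_k = 0 vanish, only the true class contributes
nll_single = -np.sum(y_one_hot * np.log(p_uniform))
print(nll_single)
```

Because every image contributes the same value, the average over the whole dataset is also log(10), roughly 2.3026.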

## Likelihood of untrained CNN





#### Loading and preparing the MNIST data


In [16]:
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_train=x_train / 255 #divide by 255 so that they are in range 0 to 1
X_train=np.reshape(X_train, (X_train.shape[0],28,28,1))
Y_train=tensorflow.keras.utils.to_categorical(y_train,10) # one-hot encoding


Y_train.shape, X_train.shape

((60000, 10), (60000, 28, 28, 1))
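
To see what the one-hot encoding does to the labels, here is a small numpy sketch. It assumes a handful of made-up integer labels and uses `np.eye` instead of `to_categorical`; it produces the same 0/1 pattern, though the dtype may differ from what Keras returns.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # hypothetical integer labels

# row i of the identity matrix has a single 1 in column i,
# so indexing with the labels yields one one-hot row per label
one_hot = np.eye(10)[labels]

# argmax along the class axis recovers the integer labels
recovered = np.argmax(one_hot, axis=1)
print(one_hot.shape, recovered)
```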

## CNN model



In [17]:
# here you define the hyperparameters of the CNN
batch_size = 128
nb_classes = 10
img_rows, img_cols = 28, 28
kernel_size = (3, 3)
input_shape = (img_rows, img_cols, 1)
pool_size = (2, 2) # pool size for 2D max pooling, see below

In [18]:
# define CNN with 2 convolution blocks and 2 fully connected layers
model = Sequential()

model.add(Convolution2D(8,kernel_size,padding='same',input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(8, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Convolution2D(16, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(Convolution2D(16,kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Flatten())
model.add(Dense(40))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# compile model and initialize weights
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [19]:
# summarize model along with number of model weights
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_4 (Conv2D)           (None, 28, 28, 8)         80        
                                                                 
 activation_6 (Activation)   (None, 28, 28, 8)         0         
                                                                 
 conv2d_5 (Conv2D)           (None, 28, 28, 8)         584       
                                                                 
 activation_7 (Activation)   (None, 28, 28, 8)         0         
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 14, 14, 8)         0         
 g2D)                                                            
                                                                 
 conv2d_6 (Conv2D)           (None, 14, 14, 16)        1168      
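
The parameter counts in the summary can be checked by hand: a convolution layer has one weight per kernel position and input channel, plus one bias, for each output filter. A small sketch (the helper `conv2d_params` is a hypothetical name, not a Keras function):

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Number of trainable weights of a 2D convolution layer with bias."""
    kernel_h, kernel_w = kernel_size
    return (kernel_h * kernel_w * in_channels + 1) * filters

first = conv2d_params((3, 3), in_channels=1, filters=8)    # first conv layer
second = conv2d_params((3, 3), in_channels=8, filters=8)   # second conv layer
third = conv2d_params((3, 3), in_channels=8, filters=16)   # third conv layer
print(first, second, third)  # 80 584 1168
```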
                                                      

Here you predict the probabilities for all images in the training data set. You did not train the network yet, therefore the probabilities will be around 10% for each class.

meaning: the best you get is chance level
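
Why the probabilities end up near 10% can be sketched with numpy: if the inputs to the softmax are all close to zero, the softmax output is close to uniform. The small random logits below are an assumption standing in for the output of the last Dense layer of an untrained network, and `softmax` is a helper written for this sketch.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# hypothetical near-zero logits for 5 images and 10 classes
logits = rng.normal(loc=0.0, scale=0.01, size=(5, 10))
probs = softmax(logits)

# treat class 0 as the true class of every image, just for illustration
mean_nll = -np.mean(np.log(probs[:, 0]))
print(probs.round(3))
print(mean_nll)
```

Each row sums to 1 and every entry is close to 0.1, so the mean NLL is close to log(10).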

In [20]:
# Calculate the probabilities for the training data
Pred_prob = model.predict(X_train)



In [21]:
Pred_prob[0:5] # predicted probability for each class; the network has not been optimized yet, hence the values are essentially random!

array([[0.10296943, 0.09263191, 0.10035602, 0.0961155 , 0.11109497,
        0.0962236 , 0.09257398, 0.08844453, 0.10369592, 0.11589415],
       [0.10268418, 0.09098673, 0.10038212, 0.10209884, 0.10864345,
        0.09776118, 0.09255474, 0.08970154, 0.10368382, 0.1115034 ],
       [0.10375823, 0.09088536, 0.0971105 , 0.09826044, 0.10639428,
        0.10105354, 0.09742084, 0.09685304, 0.10034951, 0.10791427],
       [0.10585216, 0.09597373, 0.10191199, 0.09743162, 0.10418429,
        0.09507533, 0.09522017, 0.09180664, 0.1051529 , 0.10739113],
       [0.09703027, 0.09535006, 0.0986185 , 0.0992071 , 0.10573916,
        0.10074426, 0.09695639, 0.09739409, 0.1038496 , 0.10511065]],
      dtype=float32)

In [22]:
Pred_prob.shape, Y_train.shape

((60000, 10), (60000, 10))

### Exercise : Calculate the loss function using numpy
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

*Exercise : Use numpy to calculate the value of the negative log-likelihood loss (=cross entropy) that you expect for the untrained CNN, which you have constructed above to discriminate between the 10 classes. Determine the cross entropy that results from the predicted probabilities (Pred_prob). To determine the cross entropy of the prediction, you can loop over each example and use its true label (Y_train) and the predicted probability for the true class. Do you get the cross entropy value that you have expected?*




In [25]:
# Write your code here
import numpy as np

# n is the number of examples, not batch_size: a batch only covers one
# training step, while the loss here is averaged over the whole training set
n_examples = Pred_prob.shape[0]
true_class = np.argmax(Y_train, axis=1)  # Y_train is one-hot encoded

loss: float = 0.0
for index in range(n_examples):
    # negative log of the probability predicted for the true class
    loss += -np.log(Pred_prob[index][true_class[index]]) / n_examples
loss

Scroll down to see the solution.

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

In the next cell you calculate the log of the probability predicted for the true class of each single image, sum up all these values, and divide the sum by the number of training examples. You take the negative of this result to get the NLL, also known as categorical cross entropy.

In [26]:
loss=np.zeros(len(X_train))
Y=np.argmax(Y_train,axis=1) # Y_train is one-hot encoded; argmax recovers the integer labels
for i in range(0,len(X_train)):
  loss[i]=np.log(Pred_prob[i][Y[i]])
-np.sum(loss)/len(X_train)

2.3025348495125773
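
The same quantity can be computed without an explicit loop. The sketch below compares three equivalent formulations on small made-up arrays; `pred_prob`, `labels` and `y_one_hot` are hypothetical stand-ins for `Pred_prob`, `Y` and `Y_train`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_classes = 6, 10

# hypothetical predicted probabilities: positive rows normalized to sum to 1
raw = rng.random((n_examples, n_classes)) + 0.1
pred_prob = raw / raw.sum(axis=1, keepdims=True)

# hypothetical integer labels and their one-hot encoding
labels = rng.integers(0, n_classes, size=n_examples)
y_one_hot = np.eye(n_classes)[labels]

# 1) explicit loop over the examples
loop_loss = 0.0
for i in range(n_examples):
    loop_loss -= np.log(pred_prob[i, labels[i]])
loop_loss /= n_examples

# 2) integer array indexing picks the true-class probability of each row
vec_loss = -np.mean(np.log(pred_prob[np.arange(n_examples), labels]))

# 3) the one-hot labels act as a mask that keeps only the true-class terms
mask_loss = -np.sum(y_one_hot * np.log(pred_prob)) / n_examples

print(loop_loss, vec_loss, mask_loss)
```

All three values agree up to floating point rounding; the vectorized forms are much faster on the full 60'000 images.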

You get nearly the same result with the model.evaluate function for the untrained CNN.

In [28]:
model.evaluate(X_train, Y_train,verbose=2)

1875/1875 - 5s - loss: 2.3025 - accuracy: 0.0716 - 5s/epoch - 3ms/step


[2.302535057067871, 0.07164999842643738]

In [None]:
# this is good for a model that has not been trained yet...

### Exercise: Change normalization
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

Load the data again, but this time do not scale the data. Repeat the analysis. What is the result? Why is it a good idea to check the loss for untrained networks?

In [None]:
# Scroll down for the solution
# at the very top, do not divide by 255...
# the results are very poor

# inf as soon as a probability of zero is predicted for the correct class; then the loss becomes infinite...





























#

In the unscaled case, the predicted probabilities are usually not close to 1/10, and thus the loss is usually larger than 2.3. So the training starts worse than what you would get from pure guessing. This poor start not only gives a higher initial loss but can also increase training time. Checking the loss of the untrained network is therefore a quick sanity check that input scaling and initialization give the model a reasonable starting point.
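
How an infinite loss can arise is easy to reproduce with numpy. The sketch below assumes very large logits, as unscaled inputs can produce, so that the softmax probability of the true class underflows to exactly zero. The clipping value `eps = 1e-7` is an assumption for illustration; loss implementations in deep learning libraries commonly clip probabilities in a similar way, which is why they report a large but finite value instead of inf.

```python
import numpy as np

def softmax(logits):
    # subtract the maximum for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# hypothetical logits of one image; the true class has by far the smallest logit
logits = np.array([0.0, 500.0, 1000.0])
true_class = 0
probs = softmax(logits)  # probability of the true class underflows to 0

# log(0) is -inf, so the NLL of this image is +inf
with np.errstate(divide="ignore"):
    nll_raw = -np.log(probs[true_class])

# clipping the probabilities away from 0 and 1 keeps the loss finite
eps = 1e-7  # assumed clipping value
nll_clipped = -np.log(np.clip(probs, eps, 1.0 - eps)[true_class])

print(nll_raw, nll_clipped)
```

A single such image is enough to make the hand-computed mean over the dataset infinite, while a clipped loss stays finite and only signals the problem through a value far above 2.3.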