## Chapter - Metrics and Loss functions 

- Metrics 
- Loss functions 

Before discussing these two concepts, let's run our validation data through the model we saved in the earlier section. We will use the predictions and the original labels of the validation data to calculate all the metrics and then discuss the loss functions.

Note: Since our **agumentation** function in the coco_dataloader.py file has randomness involved in it (choosing random scales and choosing random locations for crop), I have created another function called **val_augmentation**. This uses only one scale and removes random cropping and flipping, which helps in replicating the results.

```python
from coco_dataloader import aspect_ratio_calc, resize_image, pad_image

def val_augmentation(image, label, resize, crop_size, mean_bgr, ignore_label=255):
    
    ## h and w select with changing aspect ratio
    h, w = aspect_ratio_calc(image, label, resize)
    
    ## resize
    image, label = resize_image(image, label, (int(h), int(w)))
    
    # Padding to fit for crop_size
    image, label = pad_image(image, label, crop_size, mean_bgr, ignore_label)
    
    ## crop the necessary portion of the image.
    image = image[:crop_size, :crop_size]
    label = label[:crop_size, :crop_size]
    return image, label
```

In [1]:
## import necessary libraries 
from segmentation_models.backbones import get_preprocessing
from segmentation_models import Unet
import numpy as np 
import glob

## Calling val ids 
val_ids = [i.rsplit("/")[-1].rsplit(".")[0] for  i in glob.glob("../../data/cocostuff/images/val/*.jpg")]

## Load the model  and the preprocessing func
BACKBONE = 'resnet50'
preprocess_input = get_preprocessing(BACKBONE)
model = Unet(BACKBONE, encoder_weights='imagenet', classes=1, activation="sigmoid")

## Load the trained weights 
model.load_weights("../../data/cocostuff/model.h5")
print("Model weights loaded")

Using TensorFlow backend.


Model weights loaded


In [2]:
from coco_dataloader import get_image_and_mask, val_augmentation
from tqdm import tqdm 

root="../../data/cocostuff/images/"
folder_name = "val"

original = []
predicted = []
for i in tqdm(val_ids):
    ## get image and mask
    image, label = get_image_and_mask(root, folder_name, i)
    
    ## preprocess using val augmentation
    image, label = val_augmentation(image, label, 512, 448, (0, 0, 0), 0)
    
    ## preprocess as per network requirement
    image_preprocess = preprocess_input(image)
    
    ## convert to float
    image_final = np.expand_dims(image_preprocess, 0).astype(np.float64)
    
    ## predict on image
    pred = model.predict(image_final)
    
    ## Store the labels
    original.append(label)
    predicted.append(pred)
print(len(original), len(predicted))

100%|██████████| 564/564 [00:36<00:00, 15.60it/s]

564 564





In [47]:
## Concatenating all the predictions 
final_preds = np.concatenate([np.squeeze(i, 3) for i in predicted])
final_preds = np.where(final_preds>0.5, 1, 0) ## Using a threshold of 0.5 
print(final_preds.shape)

(564, 448, 448)


In [48]:
## Concatenating all the original labels 
final_orig = np.concatenate([np.expand_dims(i, 0) for i in original])
print(final_orig.shape)

(564, 448, 448)


In [49]:
## b is the total number of images 
## h is height 
## w is width
b, h, w = final_orig.shape

## 1. Metrics 
Metrics tell you how well your model is performing. In the examples above we used iou_score without much explanation. Segmentation uses several metrics to measure network performance. Let's look at the most important ones and the use cases for each.  
1) f1_score or dice_score  
2) f2_score  
3) iou_score  
4) pixel accuracy.

Before going into the metrics in depth, let's cover some terminology. The diagram below, from Wikipedia, answers a lot of questions.

![confusion_matrix](../images/confusion_matrix.png)
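
The four cells of the confusion matrix can be counted directly from two binary masks. The block below is a minimal sketch on a toy example; the helper name `confusion_counts` and the toy arrays are assumptions made for illustration and are not part of coco_dataloader.py or the validation data.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two binary arrays of the same shape."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # positive and predicted positive
    fp = int(np.sum(~y_true & y_pred))   # negative but predicted positive
    fn = int(np.sum(y_true & ~y_pred))   # positive but predicted negative
    tn = int(np.sum(~y_true & ~y_pred))  # negative and predicted negative
    return tp, fp, fn, tn

# Toy 2x4 "masks" (hypothetical values, not taken from the validation set)
toy_true = np.array([[1, 1, 0, 0],
                     [1, 0, 0, 0]])
toy_pred = np.array([[1, 0, 1, 0],
                     [1, 0, 0, 0]])

toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)  # (2, 1, 1, 4)
```

The four counts always add up to the total number of pixels, which is a quick sanity check when computing them on real masks.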

In [72]:
beta = 2
smooth = 1e-12
tp  = 0.91
fp = 0.19
fn = 0.03 

score = ((1 + beta ** 2) * tp + smooth) \
/ ((1 + beta ** 2) * tp + beta ** 2 * fn + fp + smooth)
score

0.9362139917695603

In [58]:
(final_orig == final_preds).sum()/ (564 * 448 * 448)

0.856757979642156

## Precision 
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. In our case, precision answers the question: of all the pixels predicted as human, how many are actually human?

\begin{equation*}
\text{Precision} = \frac{tp}{tp + fp}
\end{equation*}

In [73]:
## Of all the original pos labels. How many are actually predicted as pos ?
tp = np.sum(final_preds[np.where(final_orig == 1)])

## of all the original neg labels, how many are actually predicted as pos ?
fp = np.sum(final_preds[np.where(final_orig == 0)])

## precision is defined above
precision = tp/(tp+fp)
precision

0.7980823777051048

## Recall (Sensitivity)
Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. In our use case, recall answers the question: of all the actual human pixels, how many are correctly predicted as human?

\begin{equation*}
\text{Recall} = \frac{tp}{tp + fn}
\end{equation*}

In [76]:
## Of all the original pos labels. How many are actually predicted as pos ?
tp = np.sum(final_preds[np.where(final_orig == 1)])

## of all the original pos labels, how many are actually predicted as neg ?
fn = final_preds[np.where(final_orig == 1)].shape[0] - np.sum(final_preds[np.where(final_orig == 1)])

recall = tp/(tp+fn)
print(recall)

0.9136739722577483


## Pixel Accuracy
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. One may think that a model with high accuracy is the best model. Accuracy is a useful measure, but only for symmetric datasets, where the classes are roughly balanced and the numbers of false positives and false negatives are about the same. Otherwise, you have to look at other metrics to evaluate the performance of your model.

\begin{equation*}
\text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn}
\end{equation*}
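
To see why accuracy alone can mislead on an uneven dataset, here is a small sketch with a made-up mask in which only 5% of the pixels are positive. The arrays and the `imb_` variable names are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical 10x10 mask in which only 5 of 100 pixels are positive
imb_true = np.zeros((10, 10), dtype=int)
imb_true[0, :5] = 1

# A "model" that predicts background for every pixel
imb_pred = np.zeros_like(imb_true)

imb_accuracy = np.mean(imb_true == imb_pred)
imb_tp = np.sum((imb_true == 1) & (imb_pred == 1))
imb_fn = np.sum((imb_true == 1) & (imb_pred == 0))
imb_recall = imb_tp / (imb_tp + imb_fn)

print(imb_accuracy)  # 95 of 100 pixels are "correct"
print(imb_recall)    # yet none of the positive pixels were found
```

The predictor never finds a single positive pixel, but it still scores 95% accuracy because the background class dominates.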

In [77]:
## Of all the original pos labels. How many are actually predicted as pos ?
tp = np.sum(final_preds[np.where(final_orig == 1)])

## of all the original neg labels, how many are actually predicted as pos ?
fp = np.sum(final_preds[np.where(final_orig == 0)])

## of all the original pos labels, how many are actually predicted as neg ?
fn = final_preds[np.where(final_orig == 1)].shape[0] - np.sum(final_preds[np.where(final_orig == 1)])


## of all the original neg labels, how many are actually predicted as neg?
tn = final_preds[np.where(final_orig == 0)].shape[0] - np.sum(final_preds[np.where(final_orig == 0)])


accuracy = (tp+tn)/(tp+tn+fp+fn)
print(accuracy)

0.856757979642156


## F1 score or Dice Score 
The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is not as intuitive as accuracy, but F1 is usually more useful, especially when the class distribution is uneven. Accuracy works best if false positives and false negatives have similar cost. If their costs are very different, it is better to look at both precision and recall.


\begin{equation*}
F_1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}
\end{equation*}


which can also be written as  

\begin{equation*}
F_1 = \frac{2TP}{2TP + FP + FN}
\end{equation*}
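
The two forms agree because substituting precision = TP/(TP+FP) and recall = TP/(TP+FN) into the first expression and simplifying gives the second. A quick numeric check, using hypothetical counts chosen only for illustration:

```python
import math

# Hypothetical counts chosen only for illustration
toy_tp, toy_fp, toy_fn = 6, 2, 1

toy_precision = toy_tp / (toy_tp + toy_fp)   # 6 / 8
toy_recall = toy_tp / (toy_tp + toy_fn)      # 6 / 7

# F1 from precision and recall
f1_from_pr = 2 * toy_recall * toy_precision / (toy_recall + toy_precision)

# F1 directly from the counts
f1_from_counts = 2 * toy_tp / (2 * toy_tp + toy_fp + toy_fn)

print(f1_from_pr, f1_from_counts)
print(math.isclose(f1_from_pr, f1_from_counts))
```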

In [78]:
f1_score = (2*tp)/(2*tp+ fp+ fn)
print(f1_score)

0.851975336609741


## F2 Score 
Since the F1 score is the plain harmonic mean of precision and recall, it weights the two equally. What if you want to give more weight to precision than to recall, or the reverse? This is where the **fbeta_score** comes in.


\begin{equation*}
F_\beta = \frac{(1 + \beta^{2}) \cdot tp}{(1 + \beta^{2}) \cdot tp + \beta^{2} \cdot fn + fp}
\end{equation*}

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall relative to precision in the combined score: beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).

F2 score means keeping beta value at 2, which means favoring recall more.
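
The effect of beta is easiest to see on a case where precision and recall differ. The sketch below uses hypothetical counts for which recall (0.9) is higher than precision (0.6); the function name `fbeta_from_counts` and the `demo_` values are assumptions for illustration.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score computed directly from confusion-matrix counts."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Hypothetical counts with more false positives than false negatives,
# so recall (90/100) is higher than precision (90/150)
demo_tp, demo_fp, demo_fn = 90, 60, 10

demo_scores = {}
for demo_beta in (0.5, 1, 2):
    demo_scores[demo_beta] = fbeta_from_counts(demo_tp, demo_fp, demo_fn, demo_beta)
    print(demo_beta, round(demo_scores[demo_beta], 4))
```

Because recall is the stronger of the two here, the score rises as beta grows: F0.5 is pulled toward the lower precision and F2 toward the higher recall.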

In [79]:
beta = 2 
fbeta_score = ((1 + beta ** 2) * tp ) \
/ ((1 + beta ** 2) * tp + beta ** 2 * fn + fp )
print(fbeta_score)

0.8879523595533807


## IOU score 
The Jaccard index, also known as Intersection over Union (IoU) or the Jaccard similarity coefficient, is a statistic used to compare the similarity and diversity of sample sets. It measures the similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.

In simple terms, it is the ratio of the area of overlap to the area of union. The Venn diagram below illustrates this.




![IOU score](../images/iou.png)

In the diagram above there are two rectangles. The top rectangle is the ground truth (in our case, the actual human pixels) and the bottom rectangle is the prediction (the pixels the algorithm predicted as human). In the ideal case the two rectangles overlap completely, but sometimes we miss some object (human) pixels (false negatives, shown in yellow) and sometimes we predict non-object (non-human) pixels as object (human) pixels (false positives, shown in red). IoU as a metric accounts for both kinds of error. It is the ratio of the area of correctly predicted pixels (true positives) to the total area of true positives, false positives and false negatives.

In terms of TP, FP and FN it can be written as 
\begin{equation*}
IoU = \frac{TP}{TP + FP + FN}
\end{equation*}


- For multi-class classification problems, IoU is calculated separately for each class and then averaged over all the classes.
- The maximum value is 1 when FP and FN are zero. The minimum value is zero when TP is zero.
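
For the multi-class case, a minimal sketch of per-class IoU followed by averaging could look like the block below. The function name `mean_iou`, the toy label maps, and the choice to skip classes that appear in neither mask are assumptions; libraries may handle absent classes differently.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Average the per-class IoU over classes that appear in either mask."""
    ious = []
    for c in range(num_classes):
        true_c = (y_true == c)
        pred_c = (y_pred == c)
        union = np.sum(true_c | pred_c)
        if union == 0:
            continue  # class absent from both masks: skipped by assumption
        intersection = np.sum(true_c & pred_c)
        ious.append(intersection / union)
    return float(np.mean(ious)), ious

# Hypothetical 2x3 label maps with three classes (0, 1, 2)
mc_true = np.array([[0, 0, 1],
                    [2, 2, 1]])
mc_pred = np.array([[0, 1, 1],
                    [2, 2, 0]])

mc_mean, mc_per_class = mean_iou(mc_true, mc_pred, num_classes=3)
print(mc_per_class, mc_mean)
```

In this toy example class 2 is predicted perfectly while classes 0 and 1 each overlap on one pixel out of three, so the mean is pulled down by the two weaker classes.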

In [80]:
iou_score = (tp)/(tp+fp+fn)
iou_score

0.7421228513451552

## Quiz

Q) When precision and recall are of equal importance, which of the following metrics is best to use? (one or more correct answers).  
A) F1 Score  
B) F2 Score  
C) pixel accuracy  
D) Any of the metric is okay.  

Ans) A. F1 score is used when precision and recall are of equal importance. 

## Task
In the above example, what is the F0.5 score?

Objective: calculate the F0.5 score.

Instructions:
- Replace the beta value with 0.5 in the code from the F2 Score section.

Solution:
```python
beta = 0.5
fbeta_score = ((1 + beta ** 2) * tp ) \
/ ((1 + beta ** 2) * tp + beta ** 2 * fn + fp )
print(fbeta_score)
```

## Loss Functions 
For any deep learning model we need a loss function to optimize. In the experiments above we used **bce_jaccard_loss** without much explanation. In this section, let's look at the intuition behind each loss function and how it is calculated.

The major loss functions include
- Jaccard loss
- dice_loss 


## Jaccard loss or BCE_Jaccard loss or cce_jaccard loss
- Jaccard loss is 1-iou_score. 
- Cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, so every pixel in the image contributes equally to learning. 

\begin{equation*}
CE = -\sum_{c=1}^{N_{classes}} y_{true,c} \log(y_{pred,c})
\end{equation*}

- Jaccard loss measures the overlap between two samples, while cross entropy loss evaluates at the pixel level, so combining the two loss functions helps the network train better. This has been observed empirically and is common practice among data science practitioners. 
- In a multi-class classification problem, Jaccard loss is first calculated individually for each class as described above and then averaged over all the classes. For the multi-class case we use categorical cross entropy loss instead of binary cross entropy loss.
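
The idea of combining the two terms can be sketched in plain NumPy. This is an illustration under stated assumptions, not the implementation used by the segmentation models repo: the function names, the unweighted sum of the two terms, and the `smooth` and `eps` constants are all choices made for this example.

```python
import numpy as np

def soft_jaccard_loss(y_true, y_prob, smooth=1e-12):
    """1 - IoU, computed on probabilities instead of thresholded masks."""
    intersection = np.sum(y_true * y_prob)
    union = np.sum(y_true) + np.sum(y_prob) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean per-pixel binary cross entropy."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def bce_jaccard_sketch(y_true, y_prob):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return binary_cross_entropy(y_true, y_prob) + soft_jaccard_loss(y_true, y_prob)

# Hypothetical 2x2 mask and two sets of predicted probabilities
loss_true = np.array([[1.0, 0.0], [1.0, 0.0]])
loss_good = np.array([[0.9, 0.1], [0.8, 0.2]])
loss_bad = np.array([[0.2, 0.9], [0.1, 0.8]])

print(bce_jaccard_sketch(loss_true, loss_good))
print(bce_jaccard_sketch(loss_true, loss_bad))
```

The prediction that agrees with the mask gets the lower loss on both terms: the cross entropy term rewards confident per-pixel predictions, while the Jaccard term rewards overlap of the predicted region as a whole.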


## Dice loss or bce_dice_loss or cce_dice_loss 
- Dice_loss is 1-f1_score, since the dice score and the F1 score are the same quantity.
- It is very similar to what we have discussed previously, except that instead of iou_score we use f1_score.
- For multi-class classification, dice loss is likewise calculated for each class separately and then averaged over all the classes.
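
A minimal NumPy sketch of a Dice loss is shown below, assuming the Dice coefficient 2TP / (2TP + FP + FN). The function name `soft_dice_loss`, the `smooth` constant, and the toy masks are assumptions for illustration, not the library's code.

```python
import numpy as np

def soft_dice_loss(y_true, y_prob, smooth=1e-12):
    """1 - Dice, where Dice = 2 * overlap / (sum of both masks)."""
    intersection = np.sum(y_true * y_prob)
    denominator = np.sum(y_true) + np.sum(y_prob)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

# Hypothetical hard (0/1) masks with tp = 2, fp = 1, fn = 1
dice_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
dice_pred = np.array([1.0, 1.0, 0.0, 1.0, 0.0])

dice_loss_value = soft_dice_loss(dice_true, dice_pred)
dice_f1_value = 2 * 2 / (2 * 2 + 1 + 1)   # 2TP / (2TP + FP + FN)

print(dice_loss_value, 1 - dice_f1_value)
```

On hard masks the sum of the ground truth is TP + FN and the sum of the prediction is TP + FP, so the denominator equals 2TP + FP + FN and the loss matches one minus the count-based score. During training the same expression is applied to probabilities, which keeps it differentiable.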

## End Notes:
- In the first section, we saw the fundamental difference between image classification and segmentation, and the practical use cases of segmentation across different industries. We then looked at the different kinds of segmentation along with use cases for each one.
- In the second section, we saw the public datasets available for semantic segmentation and how to process the cocostuff dataset using cocostuffapi.
- In the third section, we saw how to construct a data loader for a semantic segmentation task using the keras.utils.sequence module and learnt how to do different kinds of augmentation and color transformation.
- In the fourth section, we trained a segmentation model using the unet architecture, introduced the segmentation models repo and discussed the various architectures available.
- In the final section, we looked into the various metrics and loss functions used in semantic segmentation.

With this, we have come to an end. Now you are equipped with all the tools necessary to deal with semantic segmentation. 