# Model Evaluation 🔍📊

### In this kernel we'll look at how to evaluate a *tf-keras* model on `multilabel-text` data.

![Header Image](https://www.aihr.com/wp-content/uploads/Training-evaluation-cover-1000x553-1.png)

## About Project📜:

- I'm currently learning `MLOps` for Deep Learning/BERT models, so I picked a text dataset from Kaggle: the `jigsaw-toxic-comment-classification-challenge`.
- The code for the whole ongoing `comment-toxicity-detection` project is on https://github.com/karan842/comment-toxicity
- I created a `DVC`-based data cleaning pipeline that cleans the text data and saves it in `.csv` format.
- Meanwhile, I analyzed the text data to learn about the comments and how they are labeled, i.e. toxic, insulting, threatening, or all of these.
- I performed EDA and plotted charts and graphs using Python libraries such as `Matplotlib` and `Seaborn`. You can find the text analytics notebook in the `notebooks` folder of the repository.
- After cleaning and analyzing the text data, I started building a TensorFlow-based neural network model that can detect toxicity in a comment. For that I used a pre-trained `BERT` model. You can also find that notebook in the `notebooks` folder.
- I trained the model on a `Kaggle GPU P100`; although I have a GPU in my system, I prefer Kaggle for acceleration. I saved the model and served it through APIs in a `Flask` app.
- I tested the APIs using `Postman` and then wrote a Dockerfile that supports `GPU and cuDNN` to run the neural network model.
- I'll deploy this Flask API to the cloud using GitHub Actions.


## What is in this kernel🤔?

- In this kernel we'll look at the model evaluation part for multilabel classification data.

- *Model evaluation* is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses.


#### Let's revise some important concepts for model evaluation techniques on classification data.
- Source: `ChatGPT`

1. **Confusion Matrix**: A matrix that displays the number of true positive, true negative, false positive and false negative predictions made by a model.

   1. **True Positive (TP)**: the number of observations that belong to the positive class and have been predicted correctly.
   
   2. **True Negative (TN)**: the number of observations that belong to the negative class and have been predicted correctly.
   
   3. **False Positive (FP)**, also known as a `Type 1` error: the number of observations that have been predicted as positive but actually belong to the negative class.
   
   4. **False Negative (FN)**, also known as a `Type 2` error: the number of observations that have been predicted as negative but actually belong to the positive class.
   
 
2. **Accuracy**: The proportion of correctly classified instances out of the total number of instances. It is the most commonly used metric for classification problems, but it can be misleading if the classes are imbalanced. `(TP + TN) / (TP + TN + FP + FN)`

3. **Precision**: The proportion of true positive predictions out of all positive predictions made by the model. It measures how many of the positive predictions were actually correct. `TP / (TP + FP)`
   
4. **Recall**: The proportion of true positive predictions out of all actual positive instances. It measures how well the model can find all the positive instances. `TP / (TP + FN)`

5. **F1 Score**: The harmonic mean of precision and recall. It balances precision and recall and can be used when both are important. `2 * Precision * Recall / (Precision + Recall)`
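To make the definitions above concrete, here is a small sketch that counts TP, TN, FP and FN by hand on made-up binary labels and compares the resulting metrics with scikit-learn. The arrays `y_true` and `y_hat` are invented for illustration and have nothing to do with the toxicity data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary labels, made up for illustration (1 = positive class)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# The four cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_hat == 1)))
tn = int(np.sum((y_true == 0) & (y_hat == 0)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)

# The hand-computed values should agree with scikit-learn
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```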

**P.S.: We're not using the `AUC-ROC` metric here because it is mainly used for binary classification problems.**


### What is Multi-label classification? How is it different from Single-label classification?

- Multi-label classification is a type of supervised machine learning problem where each instance can be assigned multiple labels, rather than exactly one label as in single-label classification.


- **Key Differences**
 1. **Label Independence**:  In single-label classification, the labels are mutually exclusive, meaning that an instance can only have one label. In multi-label classification, the labels are not mutually exclusive, meaning that an instance can have multiple labels.
 
 2. **Output format**: In single-label classification, the output format is typically a single label or class. In multi-label classification, the output format is typically a binary vector or a probability vector, where each element represents the likelihood of a particular label being assigned to the instance.
 
 3. **Evaluation Metrics**: Because of the different output format, the evaluation metrics for multi-label classification also differ from those of single-label classification. Some commonly used metrics include Hamming loss, subset accuracy, Jaccard similarity, and F1-score.
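The difference in output format can be sketched with a few lines of NumPy. The `logits` values below are hypothetical raw scores for one comment (they are not produced by the model in this kernel); the label names are the six classes of the dataset.

```python
import numpy as np

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical raw scores for one comment; the numbers are made up
logits = np.array([2.0, -1.0, 1.5, -3.0, 0.5, -2.0])

# Single-label view: softmax makes the classes compete (scores sum to 1)
# and exactly one class is picked
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()
single_label = labels[int(np.argmax(softmax))]

# Multi-label view: sigmoid scores every label independently
# and every label above the threshold is kept
sigmoid = 1.0 / (1.0 + np.exp(-logits))
multi_label = [name for name, p in zip(labels, sigmoid) if p > 0.5]

print(single_label)
print(multi_label)
```

In the single-label view the comment can only ever be one thing; in the multi-label view the same scores can mark it as several things at once.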


### Hamming Loss:

Hamming loss is a commonly used metric for evaluating a multi-label classifier. It is defined as the proportion of individual labels that are predicted incorrectly, counted over every instance and every label. The value ranges from 0 (perfect prediction) to 1 (all labels are incorrectly predicted).

Hamming loss has two main advantages. It is easy to compute and understand, and it penalizes all types of errors equally: it does not distinguish between false positives and false negatives. It also doesn't take the order of the labels into account, and it doesn't consider correlations between labels.
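As a small sketch of the definition above, Hamming loss is just the fraction of cells in the label matrix that differ between truth and prediction. The two arrays are made up for illustration; the result is checked against `sklearn.metrics.hamming_loss`.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Made-up label matrices: 3 comments x 6 labels
Y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 0, 1, 0]])
Y_hat = np.array([[1, 0, 0, 0, 0, 0],   # one missed label (false negative)
                  [0, 0, 0, 0, 1, 0],   # one extra label (false positive)
                  [1, 1, 1, 0, 1, 0]])  # exact match

# Fraction of mismatching cells: 2 wrong cells out of 3 * 6 = 18
manual_hamming = np.mean(Y_true != Y_hat)

print(manual_hamming)
print(hamming_loss(Y_true, Y_hat))
```

The missed label and the extra label each cost exactly one cell, which is what "penalizes all types of errors equally" means in practice.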

### Jaccard Similarity

For one instance, it is defined as the number of labels that appear in both the true and the predicted set, divided by the number of labels that appear in either set (intersection over union). The Jaccard similarity for a dataset is the average Jaccard similarity over all instances in the dataset. The value ranges from 0 (no labels are correctly predicted) to 1 (all labels are correctly predicted). It penalizes both extra predicted labels (false positives) and missed labels (false negatives), because both enlarge the union without adding to the intersection. It ignores labels that are absent from both sets (true negatives).
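Here is a minimal sketch of the per-instance computation, again on made-up arrays, compared with `sklearn.metrics.jaccard_score` using `average='samples'`.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up label matrices: 2 comments x 6 labels
Y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [1, 1, 0, 0, 1, 0]])
Y_hat = np.array([[1, 0, 0, 0, 0, 0],   # misses one true label
                  [1, 1, 0, 1, 1, 0]])  # adds one wrong label

# Labels present in both sets vs. labels present in either set, per row
intersection = np.logical_and(Y_true, Y_hat).sum(axis=1)
union = np.logical_or(Y_true, Y_hat).sum(axis=1)
per_sample = intersection / union

# Dataset score = mean of the per-instance scores
manual_jaccard = per_sample.mean()

print(per_sample)
print(manual_jaccard)
print(jaccard_score(Y_true, Y_hat, average="samples"))
```

The first row scores 1/2 because of a missed label and the second scores 3/4 because of an extra label, so both kinds of mistakes lower the score.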

In [2]:
!pip install tensorflow_text==2.8.2

Collecting tensorflow_text==2.8.2
  Downloading tensorflow_text-2.8.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.9 MB)
Collecting tensorflow<2.9,>=2.8.0
  Downloading tensorflow-2.8.4-cp37-cp37m-manylinux2010_x86_64.whl (497.9 MB)
Collecting tensorboard<2.9,>=2.8
  Downloading tensorboard-2.8.0-py3-none-any.whl (5.8 MB)
Collecting libclang>=9.0.1
  Downloading libclang-15.0.6.1-py2.py3-none-manylinux2010_x86_64.whl (21.5 MB)


## Importing Dependencies

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub

%matplotlib inline
sns.set_style('whitegrid')

## Loading the model and dataset

In [4]:
# loading the dataset
train = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv')
test = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv')

In [5]:
# loading the keras model
# (options are stored as `load_options` so the name is not confused with the load_model function)
load_options = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
model = tf.keras.models.load_model(
    '/kaggle/input/commenttoxicitymodel/comment-toxicity-bert.h5',
    custom_objects={'KerasLayer': hub.KerasLayer},
    options=load_options,
)
print("Model loaded Successfully!")

Model loaded Successfully!


In [6]:
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [7]:
test.head()

| | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


## Clean the data
I copied this code from my GitHub repository.

In [8]:
import nltk
from nltk.corpus import stopwords
import os
import re
import string
import numpy as np
import pandas as pd
from itertools import groupby

## DATA PREPROCESSING

# Removing duplicates: keep only the first occurrence of every word in a comment.
def remove_duplicates(text_before):
    # dict.fromkeys preserves insertion order, so words stay in first-seen order
    unique_words = dict.fromkeys(text_before.split())
    return " ".join(unique_words)


def denoise_text(text):
    """Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation and remove words containing numbers"""
    text = text.lower()                                              # Convert the text to lowercase
    text = re.sub(r"\[.*?\]", "", text)                              # Remove any text inside square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                # Remove links from the comments
    text = re.sub(r"<.*?>+", "", text)                               # Remove HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # Remove punctuation
    text = re.sub(r"\n", "", text)                                   # Remove newline characters '\n'
    text = re.sub(r"\w*\d\w*", "", text)                             # Remove words that contain digits
    return text

## STOPWORDS
en_stop_words = stopwords.words('english')
more_stopwords = ['u', 'im', 'c', 'cu']
stop_words = en_stop_words + more_stopwords

def remove_stopwords(text):
    text = ' '.join(word for word in text.split(' ') if word not in stop_words)
    return text

# Stemming 
'''
    Using SnowballStemmer, which is better than simple stemming.
    We are not using lemmatization because here we are looking for 
    performance where time matters.  
    In training we are using BERT so it can understand the sentiment behind comment_text.
'''
stemmer = nltk.SnowballStemmer("english")

def stemm_text(text):
    text = ' '.join(stemmer.stem(word) for word in text.split(' '))
    return text

'''
    As we are using BERT for predictive analysis, 
    there is no need to do stemming or lemmatization.
    
    BERT uses WordPiece tokenization (a subword scheme similar to Byte-Pair Encoding
    that keeps the vocabulary small), so a word like running may be split into run + ##ing. 
    So it's better not to convert running into run because, in some NLP problems, 
    you need that information.
'''

## Creating one parent function 

def text_data_cleaning(text):
    text = remove_duplicates(text)
    text = denoise_text(text)
    text = remove_stopwords(text)
    return text
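A quick way to see what the regex steps do is to run them on one made-up comment. The sketch below repeats the same substitutions as `denoise_text` in a stand-in function called `denoise_sketch` (a hypothetical name, used so that this cell runs on its own); the sample string is invented.

```python
import re
import string

def denoise_sketch(text):
    """Stand-in that repeats the regex steps of denoise_text, for illustration only."""
    text = text.lower()                                              # lowercase
    text = re.sub(r"\[.*?\]", "", text)                              # drop [bracketed] text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                # drop links
    text = re.sub(r"<.*?>+", "", text)                               # drop HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\n", "", text)                                   # drop newlines
    text = re.sub(r"\w*\d\w*", "", text)                             # drop words containing digits
    return text

# Invented sample that triggers every step
sample = "Check [edit] this <b>LINK</b>: https://example.com/page now!! room101 ok\n"
cleaned_tokens = denoise_sketch(sample).split()
print(cleaned_tokens)
```

Note that only the tags themselves are removed; the word between `<b>` and `</b>` survives.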

In [9]:
## applying cleaning on test and train

train['comment_text_clean'] = train['comment_text'].apply(text_data_cleaning)
test['comment_text_clean'] = test['comment_text'].apply(text_data_cleaning)

In [10]:
## Splitting data into train data and target data
from sklearn.model_selection import train_test_split
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
target_data = train[list_classes]
train_data = train['comment_text_clean']

X_train, X_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.2, random_state=42)

In [11]:
## model summary
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                    

## Model Evaluation

In [12]:
## y_pred
y_pred = model.predict(X_test)

In [13]:
y_pred[0]

array([0.4170251 , 0.00927003, 0.13616456, 0.00258452, 0.16567819,
       0.0514953 ], dtype=float32)

In [14]:
# Convert predictions to binary format ( 0 or 1)
y_pred = (y_pred>0.5).astype(int)
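The 0.5 cut-off is a choice, not something fixed by the model. As a rough sketch, here is how the first prediction row printed above is binarized at 0.5 and at a hypothetical lower threshold of 0.4 (the 0.4 value is only an example, not a tuned setting; `binarize` is a helper name introduced here).

```python
import numpy as np

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# First row of model.predict(X_test), copied from the output above
probs = np.array([0.4170251, 0.00927003, 0.13616456, 0.00258452, 0.16567819, 0.0514953],
                 dtype=np.float32)

def binarize(p, threshold=0.5):
    """Turn probabilities into 0/1 labels: 1 where p is above the threshold."""
    return (p > threshold).astype(int)

at_050 = binarize(probs)                 # threshold used in this kernel
at_040 = binarize(probs, threshold=0.4)  # hypothetical alternative

flagged_at_040 = [name for name, bit in zip(list_classes, at_040) if bit]
print(at_050, at_040, flagged_at_040)
```

At 0.5 this comment gets no label at all, while at 0.4 it would be flagged as `toxic`. A lower threshold generally catches more positives at the cost of more false alarms.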

## Different Multi-label classification metrics

#### Accuracy Score

In [15]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 90.48%
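One thing to keep in mind when reading this number: for multi-label targets, `accuracy_score` computes subset accuracy, so a row only counts as correct if all of its labels match. The sketch below shows this on made-up arrays and compares it with the per-label accuracy (1 minus the Hamming loss). If most comments carry no label at all, rows of all zeros are easy to get exactly right, which may partly explain a high value.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up label matrices: 4 comments x 6 labels
Y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0]])
Y_hat = np.array([[1, 0, 0, 0, 0, 0],   # 5 of 6 labels right, still counted as wrong
                  [0, 0, 0, 0, 0, 0],
                  [1, 1, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0]])

subset_acc = accuracy_score(Y_true, Y_hat)                   # exact-match ratio
manual_subset = np.mean(np.all(Y_true == Y_hat, axis=1))     # same thing by hand
per_label_acc = 1 - hamming_loss(Y_true, Y_hat)              # fraction of correct cells

print(subset_acc, manual_subset, per_label_acc)
```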


#### Precision and Recall Score

In [16]:
## Precision/Recall Score
from sklearn.metrics import precision_score, recall_score
# Store the results under new names so the imported functions are not overwritten
precision = precision_score(y_test, y_pred, average='micro')
recall = recall_score(y_test, y_pred, average='micro')

print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))

Precision: 82.32%
Recall: 35.54%


- When a model has high precision and low recall, most of the labels it predicts as positive are correct, but it misses many of the actual positives (it predicts many positive examples as negative). Here the model is cautious: when it flags a comment it is usually right, but it finds only about a third of the true labels.
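With `average='micro'`, the counts are pooled over all labels before the ratios are taken. A minimal sketch on made-up arrays, showing the same "high precision, low recall" pattern and checking the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up label matrices: 4 comments x 3 labels
Y_true = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
Y_hat = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 0, 0],
                  [0, 0, 0]])   # a cautious model: few positives, all of them correct

# Pool the counts over every cell of the matrix
tp = np.sum((Y_true == 1) & (Y_hat == 1))
fp = np.sum((Y_true == 0) & (Y_hat == 1))
fn = np.sum((Y_true == 1) & (Y_hat == 0))

micro_precision = tp / (tp + fp)   # how many predicted labels are right
micro_recall = tp / (tp + fn)      # how many true labels were found

print(micro_precision, micro_recall)
print(precision_score(Y_true, Y_hat, average="micro"),
      recall_score(Y_true, Y_hat, average="micro"))
```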

#### Jaccard Score

In [20]:
from sklearn.metrics import jaccard_score

# Compute the Jaccard similarity
# (stored as `jaccard` so the imported function is not overwritten)
jaccard = jaccard_score(y_test, y_pred, average='samples')

print("Jaccard similarity: {:.2f}".format(jaccard))

Jaccard similarity: 0.03


  _warn_prf(average, modifier, msg_start, len(result))
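A likely reason for such a low value is how `average='samples'` treats rows that have no true labels and no predicted labels. For those rows the score is 0/0, and with the default `zero_division` setting scikit-learn warns and counts them as 0. The sketch below uses made-up arrays where every row is predicted perfectly; it assumes a scikit-learn version in which `jaccard_score` accepts the `zero_division` argument.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up data: three comments without any label, one with two labels
Y_true = np.array([[0, 0, 0],
                   [0, 0, 0],
                   [0, 0, 0],
                   [1, 0, 1]])
Y_hat = Y_true.copy()   # every row is predicted exactly right

# Empty rows give 0/0; zero_division decides what they count as
empty_as_zero = jaccard_score(Y_true, Y_hat, average="samples", zero_division=0)
empty_as_one = jaccard_score(Y_true, Y_hat, average="samples", zero_division=1)

print(empty_as_zero, empty_as_one)
```

Even with perfect predictions the score is only 0.25 when empty rows count as 0. On a dataset where most comments have no label, this pulls the sample-averaged score towards zero, so it is worth reading together with the micro-averaged numbers.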


#### Hamming Loss

In [21]:
from sklearn.metrics import hamming_loss

hm_loss = hamming_loss(y_test,y_pred)

print("Hamming Loss: {:.2f}".format(hm_loss))

Hamming Loss: 0.03


#### Classification Report

In [25]:
from sklearn.metrics import classification_report
# Generate the classification report
class_report = classification_report(y_test, y_pred, zero_division=0)

print(class_report)

              precision    recall  f1-score   support

           0       0.84      0.39      0.53      3056
           1       0.64      0.15      0.24       321
           2       0.87      0.40      0.55      1715
           3       0.00      0.00      0.00        74
           4       0.76      0.36      0.49      1614
           5       0.67      0.01      0.03       294

   micro avg       0.82      0.36      0.50      7074
   macro avg       0.63      0.22      0.31      7074
weighted avg       0.80      0.36      0.49      7074
 samples avg       0.04      0.03      0.03      7074
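The rows `0` to `5` in the report correspond to the columns of `list_classes` in order. A small sketch, on made-up arrays, of how `target_names` puts the label names into the report and how `output_dict=True` gives the same numbers as a dictionary:

```python
import numpy as np
from sklearn.metrics import classification_report

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up label matrices: 4 comments x 6 labels
Y_true = np.array([[1, 0, 1, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 0, 1, 0],
                   [0, 0, 0, 1, 0, 0]])
Y_hat = np.array([[1, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0],
                  [1, 0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0]])

# Text report with readable row names
report_text = classification_report(Y_true, Y_hat, target_names=list_classes, zero_division=0)
print(report_text)

# Same numbers as a nested dict, handy for logging or plotting
report_dict = classification_report(Y_true, Y_hat, target_names=list_classes,
                                    zero_division=0, output_dict=True)
print(report_dict["toxic"]["recall"], report_dict["threat"]["recall"])
```

In this toy data `threat` is never predicted, so its precision and recall are reported as 0, the same pattern as row `3` in the report above.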



## Please see my GitHub repository to follow the end-to-end MLOps project on multilabel text data for detecting toxicity in comments.