# Explainability of a TF-IDF-Based Classification Algorithm

## Motivation 

Natural Language Processing (NLP) models can be tedious to understand. At times, you may not be confident that the training data is large enough or that the quality of the samples is good enough. As a data scientist, you do not want to train models on text that is irrelevant to the NLP task. Understanding your NLP model and its training data at an early stage of model development can therefore go a long way.

This notebook focuses on understanding a TF-IDF-based classification algorithm. I have chosen a simple model because most Machine Learning Engineers start with a simple model before experimenting with more complex ones. We will first look at the global explainability of the model and then take a few examples to understand its local explainability.

## 1. Model description 
For this demonstration we will use the open-source [IMDB dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download). The dataset contains 50,000 highly polar movie reviews, originally described as 25,000 for training and 25,000 for testing; the CSV used here holds all 50,000 reviews in one file, so we create our own split in section 3.2. We will use sklearn's TfidfVectorizer to vectorize the text and sklearn's SelectKBest to keep the top keywords. We will then use a Random Forest algorithm to classify the movie reviews as positive or negative.

## 2. Explainability 
**Global Explainability:** Once we have obtained satisfactory results with our random forest classification model, we will try to understand the top keywords and their contributions to the model. This global view helps in understanding the overall behavior of the model: relevant keywords should show up as strong contributors to its predictions.

**Local Explainability:** Next we will focus on a few examples to understand which keywords were the major contributors to an individual prediction. I believe that local explainability can be very helpful for human-in-the-loop use cases.

## 3. Model Building 

Let's dive in to build the model! 

### 3.1 Data preparation 

In [4]:
import pandas as pd
import numpy as np
import os
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score

import matplotlib.pyplot as plt
import seaborn as sns

import shap

In [2]:
# Get path locations 
path_current = os.getcwd()
print("Current Path:", path_current)

os.chdir('../')
path_root = os.getcwd()
print("Root Path:", path_root)

Current Path: /home/vib/Desktop/Github/Explainable-NLP/TFIDF
Root Path: /home/vib/Desktop/Github/Explainable-NLP


Make sure that you have downloaded the IMDB dataset and that it is stored in the correct folder (`data/` under the root path). Then let's load the data and review it.

In [3]:
# Load the data
df_data = pd.read_csv(os.path.join(path_root, 'data', 'IMDB Dataset.csv'))
df_data.head(10)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


In [4]:
# data exploration 
df_data.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [5]:
#sentiment count
df_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [6]:
# Functions to clean up the reviews dataframe
# (requires the NLTK 'stopwords' and 'wordnet' corpora, e.g. via nltk.download)
def label_preprocess(sentiment: pd.Series) -> pd.Series:
    """
    Does the following conversions:
    positive: 1
    negative: 0 
    """
    return (sentiment == 'positive').astype(int)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))    # build the set once, not once per word

def text_preprocess(ds: pd.Series) -> pd.Series:
    """
    Keep letters only, lowercase, remove stopwords, lemmatize 
    """
    def clean(text: str) -> str:
        main_words = re.sub('[^a-zA-Z]', ' ', text)                               # Retain only alphabets
        main_words = main_words.lower().split()                                   # Lower case and tokenize
        main_words = [w for w in main_words if w not in stop_words]               # Remove stopwords
        main_words = [lemmatizer.lemmatize(w) for w in main_words if len(w) > 1]  # Lemmatization
        return ' '.join(main_words)
    return ds.apply(clean)                                                        # Returns a new Series

df_data['review'] = text_preprocess(df_data['review'])

# Assign the result back; apply() on its own does not reliably modify df_data
df_data['sentiment'] = label_preprocess(df_data['sentiment'])
df_data.head()

| | review | sentiment |
|---|---|---|
| 0 | one reviewer mentioned watching oz episode hoo... | 1 |
| 1 | wonderful little production br br filming tech... | 1 |
| 2 | thought wonderful way spend time hot summer we... | 1 |
| 3 | basically family little boy jake think zombie ... | 0 |
| 4 | petter mattei love time money visually stunnin... | 1 |


### 3.2 Prepare Training and Test data set

In [7]:
# Split the data into test and train set
X = df_data['review']
y = df_data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)

print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))

40000
10000
40000
10000


### 3.3 Create TF-IDF Vectorizer

In [8]:
tfidf_vectorizer = TfidfVectorizer(max_features = 5000, 
                                   sublinear_tf = True, 
                                   max_df = 0.7, 
                                   ngram_range = (1,3))

X_train_vectors = tfidf_vectorizer.fit_transform(X_train)
X_train_vectors

<40000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 3440577 stored elements in Compressed Sparse Row format>
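
To see what the numbers in this matrix mean, the sketch below recomputes TF-IDF weights by hand for a tiny corpus, following the formulas scikit-learn documents for `sublinear_tf = True` together with the default `smooth_idf = True` and `norm = 'l2'`. The corpus and the helper `tfidf_row` are hypothetical, and the sketch leaves out `max_features`, `max_df` and `ngram_range`, which only decide which terms enter the vocabulary.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized on whitespace
docs = ["good good movie", "bad movie", "good plot"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """TF-IDF weights of one document (illustrative helper, not a library function)."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)                            # sublinear_tf = True
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0    # smooth_idf = True
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))           # norm = 'l2'
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(tokenized[0])
print(row)
```

Because of the sublinear scaling, the repeated word "good" is weighted higher than "movie" in the first document, but by less than a factor of two, and the L2 normalization keeps every document vector at unit length regardless of review length.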

In [10]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_feature_names = pd.DataFrame(columns = tfidf_vectorizer.get_feature_names_out())
X_train_vectors = pd.DataFrame(X_train_vectors.toarray(), 
                               columns = tfidf_vectorizer.get_feature_names_out())
print(X_train_vectors.shape)
X_train_vectors.head()



(40000, 5000)


Unnamed: 0,abandoned,abc,ability,able,absence,absolute,absolutely,absolutely nothing,absurd,abuse,...,young,young boy,young girl,young man,young woman,younger,youth,zero,zombie,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.097754,0.0,0.0,0.0,0.167999,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.4 Feature Selection 

In [3]:
selector = SelectKBest(f_classif, 
                       k = 200)
X_train_select_vectors = selector.fit_transform(X_train_vectors, 
                                                y_train)
print(X_train_select_vectors.shape)
X_train_select_vectors[:5]    # fit_transform returns a NumPy array, which has no .head()


In [None]:
# Get columns to keep and create a new dataframe with the selected features only
cols = selector.get_support(indices = True)
cols

In [None]:
df_features_new = df_feature_names.iloc[:, cols]
print("Selected Vectors: ", len(df_features_new.columns))
X_train_new_features = pd.DataFrame(X_train_select_vectors, 
                                    columns = df_features_new.columns)
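
The selection step can be checked on a small example. The sketch below uses a hypothetical four-column TF-IDF table (`X_toy`, `y_toy` and `toy_selector` are names invented for this illustration) in which only "great" and "boring" differ between the classes, and shows how `get_support(indices = True)` maps the selected columns back to keyword names.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical TF-IDF weights for six reviews and four keywords
X_toy = pd.DataFrame({
    'great':  [0.9, 0.8, 0.7, 0.1, 0.0, 0.2],
    'movie':  [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],
    'boring': [0.0, 0.1, 0.2, 0.8, 0.9, 0.7],
    'plot':   [0.3, 0.1, 0.2, 0.2, 0.3, 0.1],
})
y_toy = [1, 1, 1, 0, 0, 0]    # 1 = positive, 0 = negative

# f_classif scores each column with a one-way ANOVA F-test against the labels
toy_selector = SelectKBest(f_classif, k = 2)
X_toy_selected = toy_selector.fit_transform(X_toy, y_toy)

# Recover the names of the columns that were kept
selected_names = X_toy.columns[toy_selector.get_support(indices = True)]
X_toy_new_features = pd.DataFrame(X_toy_selected, columns = selected_names)

print(list(selected_names))
print(X_toy_new_features.shape)
```

"movie" and "plot" have the same mean weight in both classes, so their F-scores are close to zero and they are dropped.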

### 3.5 Get selected vectors for test data

In [None]:
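# Reuse the vectorizer and selector fitted on the training data (transform only, no refit)
X_test_vectors = pd.DataFrame(tfidf_vectorizer.transform(X_test).toarray(), 
                              columns = tfidf_vectorizer.get_feature_names_out())
X_test_select_vectors = selector.transform(X_test_vectors)
X_test_new_features = pd.DataFrame(X_test_select_vectors, 
                                   columns = df_features_new.columns)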

### 3.6 Implement Classification 

In [None]:
forest = RandomForestClassifier(n_estimators = 100, 
                                random_state = 19)
# Train on the selected TF-IDF features, not on the raw review text
forest.fit(X_train_new_features, y_train)

y_test_pred = forest.predict(X_test_new_features)
print("Classification Report:")
print(classification_report(y_test, y_test_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("F1 score: ", f1_score(y_test, y_test_pred))
print("Accuracy: ", accuracy_score(y_test, y_test_pred))
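
As a reading aid for the report above, the following sketch derives accuracy, precision, recall and F1 by hand from a small set of made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical and are not results of this model). With labels 0 and 1, sklearn's `confusion_matrix` puts true labels in rows and predicted labels in columns, so it reads `[[TN, FP], [FN, TP]]`.

```python
# Hypothetical labels: 1 = positive review, 0 = negative review
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)    # positive reviews predicted positive
fn = sum(t == 1 and p == 0 for t, p in pairs)    # positive reviews predicted negative
fp = sum(t == 0 and p == 1 for t, p in pairs)    # negative reviews predicted positive
tn = sum(t == 0 and p == 0 for t, p in pairs)    # negative reviews predicted negative

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Confusion matrix:", [[tn, fp], [fn, tp]])
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1)
```

Because the IMDB classes are balanced, accuracy and F1 should tell a similar story; a large gap between them would be a reason to look at the confusion matrix more closely.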

### 3.7 Global explainability: Feature Importance

In [None]:
importances = forest.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis = 0)
df_feature_importance = pd.DataFrame({'feature': X_train_new_features.columns, 
                                      'importance': importances, 
                                      'importance_std': std})
df_feature_importance = df_feature_importance.sort_values(by = 'importance', 
                                                          ascending = False).reset_index(drop = True)
df_top_feature_importance = df_feature_importance.head(25).copy() 

# plot the importance of the 25 top features, most important at the top
plt.barh(df_top_feature_importance.feature, df_top_feature_importance.importance)
plt.gca().invert_yaxis()
plt.xlabel('importance')
plt.show()

df_feature_importance.head(25)

### 3.8 Local explainability using SHAP

In [None]:
# Reference: "Using SHAP Values to Explain How Your Machine Learning Model Works"
# https://towardsdatascience.com/using-shap-values-to-explain-how-your-machine-learning-model-works-732b3f40e137