<a href="https://colab.research.google.com/github/joaocarvoli/xai/blob/main/intro/02_xai_lime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explainable AI - introduction to LIME
[Data - Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)

- LIME paper: ["Why Should I Trust You?": Explaining the Predictions of Any Classifier](https://arxiv.org/abs/1602.04938)
- LIME GitHub repository: [marcotcr/lime](https://github.com/marcotcr/lime)

In [None]:
#@title Loading libraries
import sys
sys.path.insert(0,"/content/drive/MyDrive/studies/explainable-AI/notebooks/")
!pip install interpret

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from interpret.blackbox import LimeTabular
from interpret import show

from deepfindr_utils import DataLoader  # Source code: https://github.com/deepfindr/xai-series/blob/master/utils.py

## 1. Data Analysis and Preprocessing

In [3]:
data_loader = DataLoader()
data_loader.load_dataset(path="/content/drive/MyDrive/datasets/healthcare-dataset-stroke-data.csv")
data_loader.preprocess_data()
X_train, X_test, y_train, y_test = data_loader.get_data_split() # Split the data for evaluation
X_train, y_train = data_loader.oversample(X_train, y_train) # Oversample the training split only; the test split is left untouched

In [4]:
X_train

index,gender_Female,gender_Male,gender_Other,ever_married_No,ever_married_Yes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,age,hypertension,heart_disease,avg_glucose_level,bmi
0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,53.0,0.0,0.0,175.92,26.9
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,31.0,0.0,0.0,72.60,31.6
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,16.0,0.0,0.0,136.23,22.6
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,17.0,0.0,0.0,83.23,0.0
4,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,38.0,0.0,0.0,162.30,23.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7773,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,58.0,0.0,0.0,107.26,38.6
7774,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,68.0,1.0,0.0,79.79,29.7
7775,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,74.0,1.0,1.0,70.09,27.4
7776,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,74.0,0.0,0.0,74.96,26.6


In [5]:
print(f'Features shape - Train: {X_train.shape} Test: {X_test.shape}')
print(f'Labels shape - Train: {y_train.shape} Test: {y_test.shape}')

Features shape - Train: (7778, 21) Test: (1022, 21)
Labels shape - Train: (7778,) Test: (1022,)


## 2. Algorithm
The example model is a Random Forest, an ensemble technique. It behaves as a black-box model because it combines the votes of many decision trees, so we cannot easily trace which internal decisions led to a given prediction.
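
To make the "many decision trees" point concrete, the sketch below fits a forest on synthetic stand-in data (an assumption, not the stroke dataset) and shows that its predicted probability is the average over all fitted trees. The variable names and the `n_estimators=100` setting are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: used only for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The fitted forest holds one decision tree per estimator
n_trees = len(forest.estimators_)
total_nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
print(f"{n_trees} trees, {total_nodes} nodes in total")

# The forest's probability is the mean of the individual trees' probabilities
sample = X[:1]
per_tree = np.stack([tree.predict_proba(sample) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
print("forest:", forest.predict_proba(sample), "mean of trees:", averaged)
```

Each tree on its own is readable, but a prediction that averages a hundred of them over thousands of nodes is not, which is the gap a local explainer such as LIME tries to fill.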

In [6]:
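rf = RandomForestClassifier()  # Default hyperparameters; no random_state, so scores vary slightly between runs
rf.fit(X_train, y_train)       # Train on the oversampled training split
y_pred = rf.predict(X_test)    # Predict on the held-out test split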

In [7]:
# Metrics
print('The F1-Score is:', f1_score(y_test, y_pred, average='macro'))
print('The Accuracy is:', accuracy_score(y_test, y_pred))

The F1-Score is: 0.5322285353535354
The Accuracy is: 0.9432485322896281
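
A large gap between accuracy and macro F1, as above, is typical when one class dominates the test set, which is plausible here given that the training split had to be oversampled. The sketch below uses hypothetical labels (95 negatives, 5 positives, not the stroke data) and a "model" that always predicts the majority class, to show how the two metrics diverge.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives (illustrative only)
y_true = np.array([0] * 95 + [1] * 5)
# A trivial "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 scores the never-predicted class as F1 = 0 without a warning
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

In this toy case accuracy is 0.95 while macro F1 stays below 0.5, because the minority class contributes an F1 of zero to the average. Macro F1 is therefore the more informative of the two scores for this kind of problem.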


## 3. Explainer

In [None]:
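# Build the explainer around the model's probability output; X_train supplies the feature statistics used for sampling
explainer = LimeTabular(predict_fn=rf.predict_proba, data=X_train, random_state=1)
# Explain ten test instances (rows 30 to 39), passing the true labels for display
lime_local = explainer.explain_local(X_test[30:40], y_test[30:40], name="LIME")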

In [19]:
show(lime_local)
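
To give an intuition for what the explainer does internally, here is a simplified sketch of the LIME idea: perturb the instance, query the black box, weight the perturbed points by proximity, and fit a weighted linear model whose coefficients serve as the local explanation. The helper `explain_instance_sketch`, the Gaussian sampling, the exponential kernel, and the synthetic data are all assumptions made for illustration; the `lime` and `interpret` libraries differ in details such as feature discretization and feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge


def explain_instance_sketch(predict_proba, instance, feature_scale,
                            n_samples=2000, kernel_width=None, seed=0):
    """Simplified LIME-style local surrogate (hypothetical helper, not the lime API)."""
    rng = np.random.default_rng(seed)
    n_features = instance.shape[0]
    if kernel_width is None:
        kernel_width = 0.75 * np.sqrt(n_features)

    # 1. Perturb: draw standardized noise and scale it per feature around the instance
    noise = rng.normal(size=(n_samples, n_features))
    neighbours = instance + noise * feature_scale

    # 2. Query the black box for the positive-class probability
    targets = predict_proba(neighbours)[:, 1]

    # 3. Weight each neighbour by its proximity to the instance (exponential kernel)
    distances = np.linalg.norm(noise, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # 4. Fit a weighted linear model on the standardized offsets;
    #    its coefficients are the local feature contributions
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(noise, targets, sample_weight=weights)
    return surrogate.coef_


# Usage on synthetic stand-in data (assumption: not the stroke dataset)
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
coefs = explain_instance_sketch(model.predict_proba, X[0], X.std(axis=0))

# Rank features by the magnitude of their local contribution
for i in np.argsort(-np.abs(coefs)):
    print(f"feature {i}: {coefs[i]:+.4f}")
```

A positive coefficient means that increasing the feature near this instance pushes the predicted probability of the positive class up, and a negative one pushes it down. The explanation is local: a different instance can yield a different ranking.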