<h1 style="text-align:center;">Malaria Progression Prediction with Machine Learning</h1>
<p align="center">
  <img src="https://th.bing.com/th/id/OIP.Ar7K9D0zbN-MISnUoFYk-QHaFj?rs=1&pid=ImgDetMain" width="700" height="300">
</p>

# Machine Learning for Malaria Prediction Using Clinical and Laboratory Data

## Project Overview

This project uses machine learning to predict the severity of malaria. Using a dataset of clinical symptoms, laboratory results, and microscopy findings, we aim to build a predictive model that distinguishes non-malaria infections from uncomplicated and severe malaria.

### Dataset Description

The dataset contains detailed clinical and laboratory information from patients suspected to have malaria. The features range from demographic data to intricate laboratory results essential for determining malaria infection.

**Structure**: Each row corresponds to one patient case. Attributes range from basic information such as consent and location to clinical details such as fever symptoms and suspected organisms. Laboratory findings are extensive and include blood counts and other indicators that may reflect the severity of the infection.

**Target Variable**: The `Clinical_Diagnosis` column is the target variable. It classifies each case as 'Non-malaria Infection', 'Uncomplicated Malaria', or 'Severe Malaria', and is the outcome the machine learning models are trained to predict.

### Objective

Our goal is to devise a machine learning model that can predict the severity of malaria with high accuracy, aiding prompt and precise treatment interventions.

### Approach

To reach our objective, we will:
- Perform exploratory data analysis (EDA) to examine the structure of the data and the feature distributions, and to identify patterns or anomalies.
- Preprocess the data for machine learning algorithms: cleaning, handling missing values, scaling, and selecting informative features.
- Train a variety of machine learning models and compare their performance to choose the best one for malaria severity classification.
- Apply appropriate evaluation metrics to validate the accuracy and reliability of the models, so that they are robust enough for practical application.

In the end, we anticipate having a sophisticated tool powered by machine learning that healthcare professionals can leverage to improve the diagnosis and treatment of malaria, potentially saving lives and enhancing health outcomes in malaria-prone areas.


1. [Data Overview](#data-overview)
2. [Importing Libraries](#importing-libraries)
3. [Data Cleaning & Preprocessing](#data-cleaning-and-preprocessing)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
    - [Univariate Analysis](#univariate-analysis)
    - [Bivariate Analysis](#bivariate-analysis)
    - [Multivariate Analysis](#multivariate-analysis)
5. [Data Encoding](#data-encoding)
6. [Data Scaling](#data-scaling)
7. [Data Modeling](#data-modeling)
8. [Model Evaluation](#model-evaluation)
9. [Pipeline](#pipeline)
10. [Deployment](#deployment)

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    1] 🤗 Adding libraries
</p>

In [448]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
import plotly.graph_objects as go
import plotly.figure_factory as ff
from sklearn.preprocessing import StandardScaler,OneHotEncoder,OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_curve, roc_auc_score, precision_recall_curve

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    2]  Reading the data
</p>

In [449]:
df = pd.read_csv('malaria_clin_data.csv')
df.head()

Unnamed: 0,SampleID,consent_given,location,Enrollment_Year,bednet,fever_symptom,temperature,Suspected_Organism,Suspected_infection,RDT,...,platelet_count,platelet_distr_width,mean_platelet_vl,neutrophils_percent,lymphocytes_percent,mixed_cells_percent,neutrophils_count,lymphocytes_count,mixed_cells_count,RBC_dist_width_Percent
0,CCS20043,yes,Navrongo,2004,,Yes,38.0,Not Known / Missing entry,,Positive,...,156.0,8.2,6.8,61.8,31.7,6.5,3.6,1.8,0.3,19.0
1,CCS20102,yes,Navrongo,2004,,Yes,38.2,Not Known / Missing entry,,Positive,...,55.0,16.5,7.6,68.5,23.6,7.9,5.4,1.8,0.6,14.4
2,CCS20106,yes,Navrongo,2004,,Yes,37.7,Not Known / Missing entry,,Positive,...,20.0,2.3,5.9,32.8,53.3,13.9,2.8,4.3,1.1,18.0
3,CCS20147,yes,Navrongo,2004,,Yes,37.7,Not Known / Missing entry,,Positive,...,132.0,17.2,6.2,82.6,11.5,5.9,13.2,1.8,0.9,13.7
4,CCS20170,yes,Navrongo,2004,,Yes,37.1,Not Known / Missing entry,,Positive,...,85.0,16.1,6.8,83.7,11.3,5.0,3.8,0.5,0.2,15.0


In [450]:
df.columns

Index(['SampleID', 'consent_given', 'location', 'Enrollment_Year', 'bednet',
       'fever_symptom', 'temperature', 'Suspected_Organism',
       'Suspected_infection', 'RDT', 'Blood_culture', 'Urine_culture',
       'Taq_man_PCR', 'parasite_density', 'Microscopy', 'Laboratory_Results',
       'Clinical_Diagnosis', 'wbc_count', 'rbc_count', 'hb_level',
       'hematocrit', 'mean_cell_volume', 'mean_corp_hb', 'mean_cell_hb_conc',
       'platelet_count', 'platelet_distr_width', 'mean_platelet_vl',
       'neutrophils_percent', 'lymphocytes_percent', 'mixed_cells_percent',
       'neutrophils_count', 'lymphocytes_count', 'mixed_cells_count',
       'RBC_dist_width_Percent'],
      dtype='object')

In [451]:
df['Clinical_Diagnosis'].value_counts()

Clinical_Diagnosis
Non-malaria Infection    978
Uncomplicated Malaria    703
Severe Malaria           526
Name: count, dtype: int64
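
The class shares follow directly from these counts. The sketch below reuses the counts printed above to gauge how imbalanced the target is; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series(
    {"Non-malaria Infection": 978, "Uncomplicated Malaria": 703, "Severe Malaria": 526},
    name="count",
)

# Share of each class; on the real data this matches
# df['Clinical_Diagnosis'].value_counts(normalize=True)
shares = counts / counts.sum()
print(shares.round(3))

# Ratio of the largest to the smallest class, a quick gauge of imbalance
imbalance_ratio = counts.max() / counts.min()
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")
```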

In [452]:
df.shape

(2207, 34)

In [453]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 34 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   SampleID                2207 non-null   object 
 1   consent_given           2207 non-null   object 
 2   location                2207 non-null   object 
 3   Enrollment_Year         2207 non-null   int64  
 4   bednet                  1676 non-null   object 
 5   fever_symptom           2200 non-null   object 
 6   temperature             2197 non-null   float64
 7   Suspected_Organism      2207 non-null   object 
 8   Suspected_infection     1569 non-null   object 
 9   RDT                     2065 non-null   object 
 10  Blood_culture           122 non-null    object 
 11  Urine_culture           112 non-null    object 
 12  Taq_man_PCR             176 non-null    object 
 13  parasite_density        2173 non-null   float64
 14  Microscopy              2170 non-null   

In [454]:
df.describe().T # to check for data distribution

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Enrollment_Year,2207.0,2013.123244,5.701969,2002.0,2012.0,2017.0,2017.0,2019.0
temperature,2197.0,37.869822,1.252016,34.2,36.8,38.0,38.9,41.1
parasite_density,2173.0,61751.964657,325839.893959,0.0,0.0,480.0,36880.0,10114000.0
wbc_count,2207.0,10.734209,5.924517,0.5,6.85,9.3,12.9,53.9
rbc_count,2207.0,3.890689,1.139474,0.5,3.3,4.15,4.64,6.67
hb_level,2207.0,9.360222,2.680846,1.4,7.8,10.1,11.3,18.7
hematocrit,2207.0,29.101541,8.91213,4.3,23.7,31.6,35.4,52.7
mean_cell_volume,2207.0,74.63585,8.239094,7.8,69.8,75.0,80.0,121.0
mean_corp_hb,2204.0,24.102704,3.227082,2.1,22.1,24.1,26.2,38.8
mean_cell_hb_conc,2205.0,32.304259,2.893977,15.7,30.6,32.1,33.5,46.6


In [455]:
df.describe(include='object').T # to check for data distribution        

Unnamed: 0,count,unique,top,freq
SampleID,2207,2207,CCS20043,1
consent_given,2207,2,yes,2194
location,2207,3,Accra,857
bednet,1676,4,yes,852
fever_symptom,2200,2,Yes,1704
Suspected_Organism,2207,9,Not Known / Missing entry,1729
Suspected_infection,1569,276,Malaria,440
RDT,2065,2,Positive,1096
Blood_culture,122,14,No bac growth,82
Urine_culture,112,16,No bac growth,66


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    3]  Data Cleaning & Preparation
</p>

In [456]:
df.duplicated().sum() # to check for duplicate rows

0

In [457]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    
    # Percentage of missing values
    mis_val_percent =  df.isnull().sum() / len(df) * 100
    
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    
    # Print some summary information
    print ("The dataset has " + str(df.shape[1]) + " columns.\n"      
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_values_table(df)

The dataset has 34 columns.
There are 19 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Urine_culture,2095,94.9
Blood_culture,2085,94.5
Taq_man_PCR,2031,92.0
Suspected_infection,638,28.9
bednet,531,24.1
RDT,142,6.4
Microscopy,37,1.7
parasite_density,34,1.5
platelet_distr_width,32,1.4
mean_platelet_vl,17,0.8


In [458]:
# Drop the columns with mostly missing values ('Urine_culture', 'Blood_culture', 'Taq_man_PCR') and 'SampleID', a unique identifier that carries no predictive information
df.drop(['Urine_culture','Blood_culture', 'Taq_man_PCR','SampleID'], axis= 1, inplace=True)
df.reset_index(drop=True, inplace=True) 
df.shape

(2207, 30)
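
The same drop can be driven by a missing-value threshold instead of a hand-written column list. This is a minimal sketch on a toy frame; the helper `drop_sparse_columns` and the 0.9 threshold are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Return a copy without columns whose missing fraction exceeds max_missing."""
    missing_fraction = frame.isnull().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index
    return frame.drop(columns=sparse)

# Toy frame: 'culture' is 95% missing, 'temperature' is 5% missing
toy = pd.DataFrame({
    "temperature": [38.0] * 19 + [np.nan],
    "culture": ["No bac growth"] + [np.nan] * 19,
})
cleaned = drop_sparse_columns(toy, max_missing=0.9)
print(cleaned.columns.tolist())
```

Identifier columns such as `SampleID` have no missing values, so they would still need to be dropped by name.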

In [459]:
for col in df.columns:
    print(col, df[col].unique())    

consent_given ['yes' 'Yes']
location ['Navrongo' 'Accra' 'Kintampo']
Enrollment_Year [2004 2003 2002 2012 2013 2014 2016 2017 2018 2019 2011 2010]
bednet [nan 'no' 'yes' 'No' 'Yes']
fever_symptom ['Yes' 'No' nan]
temperature [38.  38.2 37.7 37.1 38.1 39.7 36.  38.7 38.4 38.9 40.  36.9 39.4 39.2
 36.6 40.2 36.7 39.5 38.5 37.9 39.  39.1 36.5 39.6 38.6 40.7 37.4 37.6
 37.5 37.2 36.2 40.1 36.4 37.  40.4 39.9 41.  40.5 36.1 41.1 35.9 34.4
 34.9  nan 39.3 36.8 37.3 38.3 39.8 37.8 38.8 40.3 35.6 35.  36.3 35.5
 35.4 35.8 35.7 40.6 35.2 35.3 35.1 34.5 34.6 34.3 34.2 34.8]
Suspected_Organism ['Not Known / Missing entry' 'Viral' 'Bacteria' 'Viral/bacteria'
 'Protozoan' 'Viral/protozoan' 'Bacteria/Protozoa' 'Fungi'
 'Fungi/protozoan']
Suspected_infection [nan 'respiractory tract infection' 'hernia' 'teething' 'Injury' 'Malaria'
 'Fever' 'gastroenteritis' 'Chicken pox' 'diarrhoea' 'Cough'
 'Neonatal jaundice' 'Jaundice' 'Common cold' 'Tooth decay' 'Eye problem'
 'Headache/Fever' 'Boil' 'Febrile co

In [460]:
missing_values_table(df)

The dataset has 30 columns.
There are 16 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Suspected_infection,638,28.9
bednet,531,24.1
RDT,142,6.4
Microscopy,37,1.7
parasite_density,34,1.5
platelet_distr_width,32,1.4
mean_platelet_vl,17,0.8
neutrophils_count,12,0.5
lymphocytes_count,11,0.5
mixed_cells_count,11,0.5


In [461]:
# Drop rows with missing values in the columns where less than 7% of the values are missing
df.dropna(subset=['RDT','Microscopy','parasite_density','platelet_distr_width','mean_platelet_vl','neutrophils_count','lymphocytes_count','mixed_cells_count','temperature','platelet_count','RBC_dist_width_Percent','fever_symptom','mean_corp_hb','mean_cell_hb_conc'], inplace=True)
df.reset_index(drop=True, inplace=True)
missing_values_table(df)

The dataset has 30 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Suspected_infection,548,27.8
bednet,488,24.7


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    4]  EDA (Exploratory Data Analysis)
</p>

In [462]:
df.shape

(1972, 30)

In [463]:
num_cols = df.select_dtypes(include=['float64','int64']).columns
num_cols

Index(['Enrollment_Year', 'temperature', 'parasite_density', 'wbc_count',
       'rbc_count', 'hb_level', 'hematocrit', 'mean_cell_volume',
       'mean_corp_hb', 'mean_cell_hb_conc', 'platelet_count',
       'platelet_distr_width', 'mean_platelet_vl', 'neutrophils_percent',
       'lymphocytes_percent', 'mixed_cells_percent', 'neutrophils_count',
       'lymphocytes_count', 'mixed_cells_count', 'RBC_dist_width_Percent'],
      dtype='object')

In [464]:
cat_cols = df.select_dtypes(include=['object']).columns
cat_cols

Index(['consent_given', 'location', 'bednet', 'fever_symptom',
       'Suspected_Organism', 'Suspected_infection', 'RDT', 'Microscopy',
       'Laboratory_Results', 'Clinical_Diagnosis'],
      dtype='object')

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    4a]  EDA (Univariate Analysis)
</p>

In [465]:
# Check for the distribution of the numerical columns
for col in num_cols:
    px.histogram(df, x=col, title=f'Distribution of {col}',template = "plotly_dark").show()

In [466]:
for col in num_cols:
    px.box(df, x = df[col], title= f"Box Plot of {col}" , template= "plotly_dark" ).show()

# There are outliers in these numerical columns

In [467]:
# Cap outliers in the numerical columns at the IQR-based lower and upper bounds
for col in num_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
    df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])
    px.box(df, x = df[col], title= f"Box Plot of {col}" , template= "plotly_dark" ).show()
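
The capping rule above can also be written as a small reusable function. This is a sketch on a toy series; the name `cap_outliers_iqr` is hypothetical, and `Series.clip` is used here as an alternative to the two `np.where` calls, on the assumption that the column holds floats.

```python
import pandas as pd

def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data: Q1 = 2, Q3 = 4, IQR = 2, so the bounds are [-1, 7]
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

The function returns a new series and leaves its input untouched, which makes it easy to compare a column before and after capping.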


In [468]:
df[num_cols].describe()

Unnamed: 0,Enrollment_Year,temperature,parasite_density,wbc_count,rbc_count,hb_level,hematocrit,mean_cell_volume,mean_corp_hb,mean_cell_hb_conc,platelet_count,platelet_distr_width,mean_platelet_vl,neutrophils_percent,lymphocytes_percent,mixed_cells_percent,neutrophils_count,lymphocytes_count,mixed_cells_count,RBC_dist_width_Percent
count,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0,1972.0
mean,2013.028905,37.844016,22821.833392,10.520943,3.885317,9.32252,29.023117,74.748986,24.125963,32.206329,213.656694,14.367926,7.951623,58.313844,33.373631,8.296951,6.272059,3.350913,0.821526,16.337972
std,5.762215,1.253626,35073.086317,4.837533,1.135177,2.689506,8.995558,7.775139,3.087085,2.670528,126.877887,2.400666,1.126837,16.590609,14.920584,3.321018,3.479309,2.037182,0.510197,2.402469
min,2002.0,34.2,0.0,0.5,1.275,2.3,4.5875,55.0,16.2,26.0,3.0,8.85,4.95,9.3,3.8,0.3,0.1,0.3,0.0,11.5
25%,2011.0,36.8,0.0,6.9,3.3,7.7,23.075,70.0,22.2,30.5,104.0,12.9,7.2,45.7,21.1,5.775,3.6,1.8,0.4,14.6
50%,2017.0,38.0,200.0,9.4,4.14,10.1,31.5,75.0,24.1,32.1,201.0,14.9,7.9,59.1,32.3,8.0,5.5,2.8,0.7,15.8
75%,2017.0,38.8,36801.75,13.1,4.65,11.3,35.4,80.0,26.2,33.5,302.0,15.6,8.7,71.925,44.5,10.5,8.1,4.4,1.1,17.7
max,2017.0,41.1,92004.375,22.4,6.67,16.7,52.7,95.0,32.2,38.0,599.0,19.65,10.95,93.3,79.6,17.5875,14.85,8.3,2.15,22.35


In [469]:
for col in cat_cols:
    px.histogram(df, x=col, title=f'Distribution of {col}',template = "plotly_dark").show()

In [470]:
df.columns

Index(['consent_given', 'location', 'Enrollment_Year', 'bednet',
       'fever_symptom', 'temperature', 'Suspected_Organism',
       'Suspected_infection', 'RDT', 'parasite_density', 'Microscopy',
       'Laboratory_Results', 'Clinical_Diagnosis', 'wbc_count', 'rbc_count',
       'hb_level', 'hematocrit', 'mean_cell_volume', 'mean_corp_hb',
       'mean_cell_hb_conc', 'platelet_count', 'platelet_distr_width',
       'mean_platelet_vl', 'neutrophils_percent', 'lymphocytes_percent',
       'mixed_cells_percent', 'neutrophils_count', 'lymphocytes_count',
       'mixed_cells_count', 'RBC_dist_width_Percent'],
      dtype='object')

In [471]:
df.drop(['consent_given'], axis=1, inplace=True) # drop the column: apart from letter case ('yes'/'Yes') it holds a single value
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,location,Enrollment_Year,bednet,fever_symptom,temperature,Suspected_Organism,Suspected_infection,RDT,parasite_density,Microscopy,...,platelet_count,platelet_distr_width,mean_platelet_vl,neutrophils_percent,lymphocytes_percent,mixed_cells_percent,neutrophils_count,lymphocytes_count,mixed_cells_count,RBC_dist_width_Percent
0,Navrongo,2004.0,,Yes,38.0,Not Known / Missing entry,,Positive,92004.375,Positive,...,156.0,8.85,6.8,61.8,31.7,6.5,3.6,1.8,0.3,19.0
1,Navrongo,2004.0,,Yes,38.2,Not Known / Missing entry,,Positive,92004.375,Positive,...,55.0,16.5,7.6,68.5,23.6,7.9,5.4,1.8,0.6,14.4
2,Navrongo,2004.0,,Yes,37.7,Not Known / Missing entry,,Positive,5880.0,Positive,...,20.0,8.85,5.9,32.8,53.3,13.9,2.8,4.3,1.1,18.0
3,Navrongo,2004.0,,Yes,37.7,Not Known / Missing entry,,Positive,85000.0,Positive,...,132.0,17.2,6.2,82.6,11.5,5.9,13.2,1.8,0.9,13.7
4,Navrongo,2004.0,,Yes,37.1,Not Known / Missing entry,,Positive,92004.375,Positive,...,85.0,16.1,6.8,83.7,11.3,5.0,3.8,0.5,0.2,15.0


In [472]:
df['bednet'].value_counts()

bednet
yes    776
no     707
Yes      1
Name: count, dtype: int64

In [473]:
# Replace the Yes value in bednet column with yes
df['bednet'] = df['bednet'].replace('Yes','yes')

In [474]:
df['bednet'].value_counts()

bednet
yes    777
no     707
Name: count, dtype: int64
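
Case differences like this one can be normalized for several text columns at once. The sketch below uses a toy frame with made-up entries (including a trailing space that does not occur in the output above) and is only an illustration of the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with inconsistent casing and whitespace
toy = pd.DataFrame({
    "bednet": ["yes", "Yes", "no", "No ", np.nan],
    "consent_given": ["yes", "Yes", "yes", "yes", "Yes"],
})

# Strip whitespace and lower-case every text column; NaN entries stay NaN
normalized = toy.apply(lambda col: col.str.strip().str.lower())
print(normalized["bednet"].value_counts(dropna=False))
```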

In [475]:
# Convert the Enrollment_Year column from float to int then to string 
df['Enrollment_Year'] = df['Enrollment_Year'].astype(int).astype(str)
df['Enrollment_Year'].dtype

dtype('O')

In [476]:
cat_cols = df.select_dtypes(include=['object']).columns

In [477]:
cat_cols

Index(['location', 'Enrollment_Year', 'bednet', 'fever_symptom',
       'Suspected_Organism', 'Suspected_infection', 'RDT', 'Microscopy',
       'Laboratory_Results', 'Clinical_Diagnosis'],
      dtype='object')

In [478]:
num_cols = df.select_dtypes(include=['float64','int64']).columns
num_cols

Index(['temperature', 'parasite_density', 'wbc_count', 'rbc_count', 'hb_level',
       'hematocrit', 'mean_cell_volume', 'mean_corp_hb', 'mean_cell_hb_conc',
       'platelet_count', 'platelet_distr_width', 'mean_platelet_vl',
       'neutrophils_percent', 'lymphocytes_percent', 'mixed_cells_percent',
       'neutrophils_count', 'lymphocytes_count', 'mixed_cells_count',
       'RBC_dist_width_Percent'],
      dtype='object')

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    4b]  EDA (Bivariate Analysis)
</p>

In [479]:
for col in num_cols:
    px.box(df , x = "Clinical_Diagnosis", y = col, title= f"Box Plot of {col} vs Clinical Diagnosis" , template= "plotly_dark", color= "Clinical_Diagnosis" ).show()
    

In [480]:
df['Clinical_Diagnosis'].value_counts()

Clinical_Diagnosis
Non-malaria Infection    889
Uncomplicated Malaria    598
Severe Malaria           485
Name: count, dtype: int64

In [481]:
# Mapping the target variable 
df['Clinical_Diagnosis'] = df['Clinical_Diagnosis'].map({'Non-malaria Infection':0, 'Uncomplicated Malaria':1, 'Severe Malaria':2})
df['Clinical_Diagnosis'].value_counts()

Clinical_Diagnosis
0    889
1    598
2    485
Name: count, dtype: int64

In [482]:
# Correlation Analysis for numerical features
correlations = df.corr(numeric_only=True)
fig = px.imshow(correlations, template='plotly_dark', aspect=True, text_auto="0.3f")
fig.show()

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    5]  Hypothesis Testing
</p>

In [483]:
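# ANOVA test function
from scipy.stats import f_oneway

def anova_test(df, target, num_cols):
    # Note: this groups the target by each distinct value of the numerical column,
    # so the F statistic compares target codes across feature values rather than
    # comparing the feature across diagnosis classes.
    for col in num_cols:
        groups = df.groupby(col)[target].apply(list)
        anova = f_oneway(*groups)
        print(f"Anova Test for {col} : {anova}")

anova_test(df, "Clinical_Diagnosis", num_cols)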

Anova Test for temperature : F_onewayResult(statistic=8.459856887822967, pvalue=8.204379824293047e-66)
Anova Test for parasite_density : F_onewayResult(statistic=17.30507047483811, pvalue=0.0)
Anova Test for wbc_count : F_onewayResult(statistic=1.2332055017856596, pvalue=0.020050637318344552)
Anova Test for rbc_count : F_onewayResult(statistic=13.129142469107245, pvalue=1.3572778833175446e-300)
Anova Test for hb_level : F_onewayResult(statistic=20.50601011799977, pvalue=1.5028086983326363e-260)
Anova Test for hematocrit : F_onewayResult(statistic=8.339609065311759, pvalue=2.958764934684792e-205)
Anova Test for mean_cell_volume : F_onewayResult(statistic=2.4280493586731193, pvalue=4.484610927790533e-28)
Anova Test for mean_corp_hb : F_onewayResult(statistic=1.6879995532059628, pvalue=9.266230555650378e-07)
Anova Test for mean_cell_hb_conc : F_onewayResult(statistic=9.513346323784505, pvalue=4.71178989617757e-122)
Anova Test for platelet_count : F_onewayResult(statistic=3.318991700461188
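
The function above forms one group per distinct feature value. A more conventional one-way ANOVA forms one group per diagnosis class and compares the feature across them, which is how the single-feature temperature test later in this notebook is set up. The sketch below shows that direction on synthetic values; `anova_by_class` is a hypothetical helper, and its results are not comparable to the figures printed above.

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_by_class(frame: pd.DataFrame, target: str, features: list) -> pd.DataFrame:
    """One-way ANOVA of each feature across the classes in `target`."""
    rows = []
    for col in features:
        # One sample of feature values per class
        samples = [grp[col].dropna().to_numpy() for _, grp in frame.groupby(target)]
        f_stat, p_value = f_oneway(*samples)
        rows.append({"feature": col, "F": f_stat, "p_value": p_value})
    return pd.DataFrame(rows).set_index("feature")

# Synthetic values, not taken from the malaria dataset:
# temperature differs by class, wbc_count has the same mean in every class
toy = pd.DataFrame({
    "Clinical_Diagnosis": [0] * 4 + [1] * 4 + [2] * 4,
    "temperature": [36.5, 36.8, 37.0, 36.9, 38.0, 38.2, 37.9, 38.4, 39.5, 39.8, 40.0, 39.6],
    "wbc_count": [9.0, 10.0, 11.0, 10.0, 10.0, 9.0, 11.0, 10.0, 11.0, 10.0, 9.0, 10.0],
})
result = anova_by_class(toy, "Clinical_Diagnosis", ["temperature", "wbc_count"])
print(result)
```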

We conducted ANOVA tests to explore the relationships between enrollment year, body temperature, blood parameters, and clinical diagnosis categories in the malaria dataset. Here is a summary of our findings:

## Enrollment Year and Clinical Diagnosis

- **Hypothesis**: There is a significant difference in enrollment years across different clinical diagnosis categories.
- **ANOVA Result**: `F=1023.505, p<0.0001`
- **Interpretation**: The extremely low p-value suggests a statistically significant difference in enrollment years among different clinical diagnosis categories, indicating potential temporal trends in malaria diagnoses.

## Body Temperature and Clinical Diagnosis

The `temperature` column records the patient's body temperature (34.2 to 41.1 in this dataset), not an environmental measurement. We tested the hypothesis that it differs significantly across clinical diagnosis categories.

- **Temperature**: `F=8.460, p<0.0001`
- **Interpretation**: The low p-value indicates a statistically significant association between body temperature and clinical diagnosis category.

## Blood Parameters and Clinical Diagnosis

For each of the blood parameters (parasite density, white blood cell count, red blood cell count, hemoglobin level, hematocrit, mean cell volume, mean corpuscular hemoglobin, mean cell hemoglobin concentration, platelet count, platelet distribution width, mean platelet volume, neutrophils percent, lymphocytes percent, mixed cells percent, neutrophils count, lymphocytes count, mixed cells count, RBC distribution width percent), we tested the hypothesis that there are significant differences in their levels across different clinical diagnosis categories.

- **Parasite Density**: `F=17.305, p<0.0001`
- **White Blood Cell Count**: `F=1.233, p=0.020`
- **Red Blood Cell Count**: `F=13.129, p<0.0001`
- **Hemoglobin Level**: `F=20.506, p<0.0001`
- **Hematocrit**: `F=8.340, p<0.0001`
- **Mean Cell Volume**: `F=2.428, p<0.0001`
- **Mean Corpuscular Hemoglobin**: `F=1.688, p<0.0001`
- **Mean Cell Hemoglobin Concentration**: `F=9.513, p<0.0001`
- **Platelet Count**: `F=3.319, p<0.0001`
- **Platelet Distribution Width**: `F=10.211, p<0.0001`
- **Mean Platelet Volume**: `F=11.974, p<0.0001`
- **Neutrophils Percent**: `F=0.980, p=0.614`
- **Lymphocytes Percent**: `F=0.988, p=0.565`
- **Mixed Cells Percent**: `F=1.440, p=0.0004`
- **Neutrophils Count**: `F=1.498, p=0.0002`
- **Lymphocytes Count**: `F=1.439, p=0.007`
- **Mixed Cells Count**: `F=3.823, p<0.0001`
- **RBC Distribution Width Percent**: `F=4.885, p<0.0001`

## Conclusion

The ANOVA tests indicate that enrollment year, body temperature, and most blood parameters differ significantly across clinical diagnosis categories. Neutrophils percent (`p=0.614`) and lymphocytes percent (`p=0.565`) are the exceptions. These findings point to associations between temporal trends, clinical measurements, blood parameters, and malaria diagnoses, and support considering multiple factors when diagnosing malaria.


##### Association between Clinical Diagnosis and Location:

- Hypothesis: There is an association between the clinical diagnosis (e.g., Non-malaria Infection, Uncomplicated Malaria, Severe Malaria) and the location (e.g., Navrongo, Accra, Kintampo).
- Test: Chi-square test of independence.
- Question: Are certain clinical diagnoses more prevalent in specific locations?

In [484]:
from scipy.stats import chi2_contingency
def chi2_test(df, target, cat_cols):
    for col in cat_cols:
        cross_tab = pd.crosstab(df[col], df[target])
        chi2, p, dof, expected = chi2_contingency(cross_tab)
        print(f"Chi2 Test for {col} : {chi2} , {p}")
        # Make the hypothesis by p value
        if p < 0.05:
            print("Reject the null hypothesis")
        else:
            print("Fail to reject the null hypothesis")


In [485]:
chi2_test(df, "Clinical_Diagnosis", cat_cols)

Chi2 Test for location : 1970.0418876936596 , 0.0
Reject the null hypothesis
Chi2 Test for Enrollment_Year : 2411.6432332352247 , 0.0
Reject the null hypothesis
Chi2 Test for bednet : 61.388216859947015 , 4.686065820343793e-15
Reject the null hypothesis
Chi2 Test for fever_symptom : 301.64974964926273 , 3.144771753061969e-66
Reject the null hypothesis
Chi2 Test for Suspected_Organism : 410.80460730109877 , 1.9756124634889633e-77
Reject the null hypothesis
Chi2 Test for Suspected_infection : 808.2905048822255 , 7.186112438525056e-62
Reject the null hypothesis
Chi2 Test for RDT : 1853.8989326022543 , 0.0
Reject the null hypothesis
Chi2 Test for Microscopy : 1718.5126247191783 , 0.0
Reject the null hypothesis
Chi2 Test for Laboratory_Results : 1974.9590463431914 , 0.0
Reject the null hypothesis
Chi2 Test for Clinical_Diagnosis : 3944.0 , 0.0
Reject the null hypothesis


In [486]:
# Are certain clinical diagnoses more prevalent in specific locations? 
fig = px.histogram(df, x='location',barmode= 'group', color='Clinical_Diagnosis', title='Clinical Diagnosis per Location', template='plotly_dark')
fig.show()

In [487]:
# So there is a significant difference in the clinical diagnosis per location
contingency_table = pd.crosstab(df["location"] , df['Clinical_Diagnosis'])
contingency_table

Clinical_Diagnosis,0,1,2
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Accra,608,98,0
Kintampo,281,403,0
Navrongo,0,97,485


### Hypothesis: Patients diagnosed with severe malaria have higher average temperatures compared to those with uncomplicated malaria or non-malaria infection.
- Test: Analysis of variance (ANOVA) or Kruskal-Wallis test.
- Question: Is there a significant difference in temperature among different clinical diagnoses?

In [488]:
# Relationship between Temperature and Clinical Diagnosis:
px.histogram(df, x='temperature', y='Clinical_Diagnosis', color='Clinical_Diagnosis', template='plotly_dark', title='Temperature vs Clinical Diagnosis')

In [489]:
# Anova Test for temperature
from scipy.stats import f_oneway
groups = df.groupby('Clinical_Diagnosis')['temperature'].apply(list)
anova = f_oneway(*groups)
print(f"Anova Test for temperature : {anova}")


Anova Test for temperature : F_onewayResult(statistic=100.30270775143815, pvalue=3.2964996841770504e-42)


#### Effect of Suspected Infection on Clinical Diagnosis:

- Hypothesis: Suspected infection (e.g., Malaria, Pneumonia) influences the clinical diagnosis.
- Test: Cross-tabulation and chi-square test.
- Question: Does the suspected infection correlate with the clinical diagnosis?

In [490]:
contingency_table = pd.crosstab(df["Suspected_Organism"] , df['Clinical_Diagnosis'])
contingency_table  
# The chi-square test above gives p < 0.05 for Suspected_Organism, so we reject the null hypothesis: clinical diagnosis differs significantly by suspected organism

Clinical_Diagnosis,0,1,2
Suspected_Organism,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bacteria,101,35,0
Bacteria/Protozoa,11,22,0
Fungi,3,0,0
Fungi/protozoan,2,1,0
Not Known / Missing entry,681,365,485
Protozoan,49,165,0
Viral,37,5,0
Viral/bacteria,2,0,0
Viral/protozoan,3,5,0


### Association between Fever Symptoms and Parasite Density:

- Hypothesis: Patients with fever symptoms have higher parasite density.
- Test: Compare mean parasite density between groups with and without fever symptoms.
- Question: Is there a relationship between fever symptoms and parasite density?

In [491]:
# Relationship between Fever Symptoms and Parasite Density:
px.histogram(df, x='fever_symptom', y='parasite_density', color='fever_symptom', template='plotly_dark', title='Fever Symptoms vs Parasite Density')

In [492]:

# Anova Test for parasite density
from scipy.stats import f_oneway
groups = df.groupby('fever_symptom')['parasite_density'].apply(list)
anova = f_oneway(*groups)
print(f"Anova Test for parasite density : {anova}")
if anova[1] < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Anova Test for parasite density : F_onewayResult(statistic=134.87575426001965, pvalue=3.306001451216918e-30)
Reject the null hypothesis


### Impact of Hemoglobin Level on Clinical Diagnosis:

- Hypothesis: Hemoglobin level is associated with the severity of malaria diagnosis.
- Test: ANOVA or Kruskal-Wallis test.
- Question: Does hemoglobin level vary significantly across different clinical diagnoses?

In [493]:
# Relationship between Hemoglobin Level and Clinical Diagnosis:
px.histogram(df, x='mean_corp_hb', y='Clinical_Diagnosis', color='Clinical_Diagnosis', template='plotly_dark', title='Hemoglobin Level vs Clinical Diagnosis')

In [494]:
# Anova Test for mean_corp_hb (mean corpuscular hemoglobin; the direct hemoglobin measurement is the hb_level column)
groups = df.groupby('Clinical_Diagnosis')['mean_corp_hb'].apply(list)
anova = f_oneway(*groups)
print(f"Anova Test for mean_corp_hb : {anova}")
if anova[1] < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Anova Test for mean_corp_hb : F_onewayResult(statistic=26.371697366463025, pvalue=4.984575203170703e-12)
Reject the null hypothesis


#### Relationship between Mean Platelet Volume and Hematocrit:

- Hypothesis: There is a correlation between mean platelet volume and hematocrit levels.
- Test: Pearson correlation coefficient or Spearman correlation coefficient.
- Question: Are mean platelet volume and hematocrit levels positively or negatively correlated?

In [495]:
px.scatter(df, x='mean_platelet_vl', y='hematocrit', color='Clinical_Diagnosis', template='plotly_dark', title='Mean Platelet Volume vs Hematocrit')

In [496]:
pearson = df[['mean_platelet_vl','hematocrit']].corr(method='pearson')
spearman = df[['mean_platelet_vl','hematocrit']].corr(method='spearman')
print(f"Pearson Correlation : {pearson}") 
print(f"Spearman Correlation : {spearman}")

Pearson Correlation :                   mean_platelet_vl  hematocrit
mean_platelet_vl          1.000000    0.322613
hematocrit                0.322613    1.000000
Spearman Correlation :                   mean_platelet_vl  hematocrit
mean_platelet_vl          1.000000    0.312026
hematocrit                0.312026    1.000000
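
`DataFrame.corr` returns the coefficients only. To judge whether a correlation of this size is distinguishable from zero, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` also return a p-value. The sketch below uses synthetic arrays (an assumption for illustration); with the real data, rows with a missing value in either column would need to be dropped first.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-ins for mean_platelet_vl and hematocrit (illustrative only).
rng = np.random.default_rng(2)
mpv = rng.normal(7, 1, 500)
hematocrit = 30 + 1.5 * mpv + rng.normal(0, 4, 500)

# Each call returns the coefficient together with a two-sided p-value.
r, r_p = pearsonr(mpv, hematocrit)
rho, rho_p = spearmanr(mpv, hematocrit)

print(f"Pearson  r   = {r:.3f} (p = {r_p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
```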


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    6]  Building Pipeline & Testing the ML models
</p>

In [497]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import lightgbm as lgb
from catboost import CatBoostClassifier
from imblearn.ensemble import RUSBoostClassifier
from sklearn.pipeline import Pipeline

In [498]:
df.head(10)

Unnamed: 0,location,Enrollment_Year,bednet,fever_symptom,temperature,Suspected_Organism,Suspected_infection,RDT,parasite_density,Microscopy,...,platelet_count,platelet_distr_width,mean_platelet_vl,neutrophils_percent,lymphocytes_percent,mixed_cells_percent,neutrophils_count,lymphocytes_count,mixed_cells_count,RBC_dist_width_Percent
0,Navrongo,2004,,Yes,38.0,Not Known / Missing entry,,Positive,92004.375,Positive,...,156.0,8.85,6.8,61.8,31.7,6.5,3.6,1.8,0.3,19.0
1,Navrongo,2004,,Yes,38.2,Not Known / Missing entry,,Positive,92004.375,Positive,...,55.0,16.5,7.6,68.5,23.6,7.9,5.4,1.8,0.6,14.4
2,Navrongo,2004,,Yes,37.7,Not Known / Missing entry,,Positive,5880.0,Positive,...,20.0,8.85,5.9,32.8,53.3,13.9,2.8,4.3,1.1,18.0
3,Navrongo,2004,,Yes,37.7,Not Known / Missing entry,,Positive,85000.0,Positive,...,132.0,17.2,6.2,82.6,11.5,5.9,13.2,1.8,0.9,13.7
4,Navrongo,2004,,Yes,37.1,Not Known / Missing entry,,Positive,92004.375,Positive,...,85.0,16.1,6.8,83.7,11.3,5.0,3.8,0.5,0.2,15.0
5,Navrongo,2003,,Yes,38.1,Not Known / Missing entry,,Positive,520.0,Positive,...,383.0,8.85,7.2,66.6,21.6,11.8,6.0,1.8,1.0,18.7
6,Navrongo,2004,,Yes,39.7,Not Known / Missing entry,,Positive,2320.0,Positive,...,297.0,14.1,6.6,77.0,18.2,4.8,14.85,4.5,1.1,16.5
7,Navrongo,2004,,Yes,36.0,Not Known / Missing entry,,Positive,66720.0,Positive,...,103.0,14.2,7.3,43.0,50.2,6.8,2.5,2.7,0.3,16.5
8,Navrongo,2003,,Yes,38.2,Not Known / Missing entry,,Positive,92004.375,Positive,...,41.0,10.0,7.0,58.6,23.3,17.5875,6.9,2.6,2.0,16.9
9,Navrongo,2004,,Yes,38.7,Not Known / Missing entry,,Positive,35840.0,Positive,...,97.0,13.6,6.6,63.5,21.0,15.5,5.5,1.7,1.2,15.2


In [499]:
df.shape

(1972, 29)

In [500]:
for col in cat_cols:
    print(col, df[col].value_counts())

location location
Accra       706
Kintampo    684
Navrongo    582
Name: count, dtype: int64
Enrollment_Year Enrollment_Year
2017    1196
2004     287
2012     214
2002     115
2003      83
2016      64
2011      12
2010       1
Name: count, dtype: int64
bednet bednet
yes    777
no     707
Name: count, dtype: int64
fever_symptom fever_symptom
Yes    1518
No      454
Name: count, dtype: int64
Suspected_Organism Suspected_Organism
Not Known / Missing entry    1531
Protozoan                     214
Bacteria                      136
Viral                          42
Bacteria/Protozoa              33
Viral/protozoan                 8
Fungi                           3
Fungi/protozoan                 3
Viral/bacteria                  2
Name: count, dtype: int64
Suspected_infection Suspected_infection
Malaria                                      408
URTI                                         160
Gastroenteritis                               75
Sepsis                                        63


In [501]:
df["Suspected_Organism"].value_counts()

Suspected_Organism
Not Known / Missing entry    1531
Protozoan                     214
Bacteria                      136
Viral                          42
Bacteria/Protozoa              33
Viral/protozoan                 8
Fungi                           3
Fungi/protozoan                 3
Viral/bacteria                  2
Name: count, dtype: int64

In [502]:
# Merge similar categories
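# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which triggers a chained-assignment warning in recent pandas versions
df['Suspected_Organism'] = df['Suspected_Organism'].replace({
    'Bacteria/Protozoa': 'Mixed Organisms',
    'Viral/protozoan': 'Mixed Organisms',
    'Fungi/protozoan': 'Mixed Organisms',
    'Viral/bacteria': 'Mixed Organisms',
    'Not Known / Missing entry': 'Not Known'})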



In [503]:
df["Suspected_Organism"].value_counts()

Suspected_Organism
Not Known          1531
Protozoan           214
Bacteria            136
Mixed Organisms      46
Viral                42
Fungi                 3
Name: count, dtype: int64

In [504]:
df["Microscopy"].value_counts()

Microscopy
Positive    1011
Negative     961
Name: count, dtype: int64

In [505]:
num_cols

Index(['temperature', 'parasite_density', 'wbc_count', 'rbc_count', 'hb_level',
       'hematocrit', 'mean_cell_volume', 'mean_corp_hb', 'mean_cell_hb_conc',
       'platelet_count', 'platelet_distr_width', 'mean_platelet_vl',
       'neutrophils_percent', 'lymphocytes_percent', 'mixed_cells_percent',
       'neutrophils_count', 'lymphocytes_count', 'mixed_cells_count',
       'RBC_dist_width_Percent'],
      dtype='object')

In [506]:
ohe_cols = ['location','Suspected_Organism','bednet','fever_symptom','RDT','Microscopy']

In [507]:
ord_cols = ["Enrollment_Year"]

In [508]:
BE_cols = ['Laboratory_Results','Suspected_infection']

In [509]:
x = df.drop('Clinical_Diagnosis', axis=1)
y = df['Clinical_Diagnosis']

In [510]:
x.shape

(1972, 28)

In [511]:
num_pipeline = Pipeline([ ('imputer', KNNImputer(n_neighbors=5)), ('scaler', RobustScaler())])
num_pipeline

In [512]:
from sklearn.preprocessing import OrdinalEncoder

# Adjusting the OrdinalEncoder initialization
ord_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])
ord_pipeline

In [513]:
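# handle_unknown='ignore' avoids errors when a rare category (e.g. 'Fungi') is absent from a training fold
ohe_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(handle_unknown='ignore'))])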
ohe_pipeline

In [514]:
from category_encoders import BinaryEncoder
be_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', BinaryEncoder())])
be_pipeline

In [515]:
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([ ('num', num_pipeline, num_cols),
                                    ('ord', ord_pipeline, ord_cols),
                                    ('ohe', ohe_pipeline, ohe_cols),
                                    ('be', be_pipeline, BE_cols)])
preprocessing

In [516]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [517]:
y.value_counts(normalize= True)*100

Clinical_Diagnosis
0    45.081136
1    30.324544
2    24.594320
Name: proportion, dtype: float64

In [518]:
# ImbPipeline, RandomUnderSampler and LogisticRegression are imported in the import cell above.
# A float sampling_strategy is only valid for binary targets and raises an error on this
# three-class problem, so the default ('auto') is used instead.
final_pipeline = ImbPipeline(steps=[
        ('preprocessing', preprocessing),
        ('sampling', RandomUnderSampler()),
        ('Model', LogisticRegression())
])
final_pipeline

In [519]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, f1_score

# Candidate classifiers (all imported in the import cell above)
models = [
    ("Logistic Regression", LogisticRegression()),
    ("Knn", KNeighborsClassifier()),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier()),
    ("Ada boost", AdaBoostClassifier()),
    ("Xgb", XGBClassifier()),
    ("Naive Bayes", GaussianNB()),
]

for name, model in models:
    # The imblearn pipeline applies the under-sampler only while fitting on the training folds
    final_pipeline = ImbPipeline(steps=[('preprocessing', preprocessing), ('sampling', RandomUnderSampler()), ('Model', model)])
    result = cross_validate(final_pipeline, x, y, scoring=make_scorer(f1_score, average='macro'), cv=5, return_train_score=True, n_jobs=-1)
    print(name)
    print('Train F1 Score : ', result['train_score'].mean() * 100)
    print('Test F1 Score : ', result['test_score'].mean() * 100)

Logistic Regression
Train F1 Score :  99.88348553077746
Test F1 Score :  98.85693487242818
Knn
Train F1 Score :  98.72801699947425
Test F1 Score :  96.64373573111463
Decision Tree
Train F1 Score :  99.94950288272939
Test F1 Score :  99.15298070921817
Random Forest
Train F1 Score :  99.93011017918207
Test F1 Score :  97.06977079797234
Ada boost
Train F1 Score :  99.82145903819915
Test F1 Score :  98.45894646056173
Xgb
Train F1 Score :  99.8175907707457
Test F1 Score :  98.78025523606514
Naive Bayes
Train F1 Score :  95.77298726754476
Test F1 Score :  94.54235243278004


In [522]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {
    'Model__max_depth': [None, 10, 20, 30],
    'Model__min_samples_split': [2, 5, 10],
    'Model__min_samples_leaf': [1, 2, 4],
    # 'auto' is omitted: it was deprecated in scikit-learn 1.1 and later removed (it behaved like 'sqrt' for this classifier)
    'Model__max_features': [None, 'sqrt', 'log2']
}
DT_pipeline = ImbPipeline(steps=[ ('preprocessing', preprocessing), ('sampling', RandomUnderSampler()), ('Model', DecisionTreeClassifier())])
grid_search = GridSearchCV(DT_pipeline, param_grid=params, cv=5, scoring='f1_macro', n_jobs=-1)

In [523]:
grid_search.fit(x, y)

In [524]:
grid_search.best_params_

{'Model__max_depth': 10,
 'Model__max_features': None,
 'Model__min_samples_leaf': 1,
 'Model__min_samples_split': 10}

In [527]:
grid_search.best_score_

0.9910655376116104
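
`best_score_` reports only the mean macro F1 of the winning candidate. Inspecting `cv_results_` shows how close the other candidates were and how much each score varies across folds. The sketch below runs a small search on synthetic data; `make_classification` stands in for the malaria dataset, and a plain tree is searched without the preprocessing and sampling steps, so the numbers are not comparable to the result above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Shuffled, seeded folds make the search repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [None, 5, 10], "min_samples_split": [2, 10]},
    scoring="f1_macro",
    cv=cv,
)
search.fit(X_toy, y_toy)

# One row per candidate, best first, with the fold-to-fold spread.
columns = [
    "param_max_depth", "param_min_samples_split",
    "mean_test_score", "std_test_score", "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.head())
```

When several candidates fall within one standard deviation of the best score, the simpler tree is usually a reasonable choice.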

In [529]:
DT_model = Pipeline(steps=[ ('preprocessing', preprocessing), ('sampling', RandomUnderSampler()), ('Model', DecisionTreeClassifier(max_depth=10, max_features=None, min_samples_leaf=1, min_samples_split=10))])
DT_model.fit(x, y)  

In [None]:
DT_model

In [530]:
DT_model.fit(x, y)
train_score = DT_model.score(x, y)*100
print(f"TRAIN SCORE {train_score:0.2f}%")

TRAIN SCORE 99.90%


In [532]:
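# x is the same data DT_model was fitted on, so this is a training-set F1 score, not a held-out test score
prediction = DT_model.predict(x)
train_f1 = f1_score(y, prediction, average='macro')*100
print(f"TRAIN F1 SCORE (macro) {train_f1:0.2f}%")

TRAIN F1 SCORE (macro) 99.91%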


In [533]:
print(classification_report(y, prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       889
           1       1.00      1.00      1.00       598
           2       1.00      1.00      1.00       485

    accuracy                           1.00      1972
   macro avg       1.00      1.00      1.00      1972
weighted avg       1.00      1.00      1.00      1972



In [534]:
# Confusion Matrix (computed on the data DT_model was fitted on)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, prediction)
ticks = ['Non-malaria Infection', 'Uncomplicated Malaria', 'Severe Malaria']
px.imshow(cm, labels=dict(x="Predicted", y="Actual", color="Counts"), x=ticks, y=ticks, title="Confusion Matrix", template="plotly_dark",text_auto=True)
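
With imbalanced classes, raw counts can make the largest class dominate the heatmap. Normalizing each row turns the diagonal into per-class recall. The sketch below uses small made-up label arrays for illustration and passes `labels` explicitly so that the row and column order is fixed at 0, 1, 2, matching the order assumed by `ticks`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three encoded classes (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# Raw counts: rows are actual classes, columns are predicted classes.
cm_counts = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
# normalize='true' divides each row by its total, so the diagonal is per-class recall.
cm_recall = confusion_matrix(y_true, y_pred, labels=[0, 1, 2], normalize="true")

print(cm_counts)
print(cm_recall.round(2))
```

The normalized matrix can be passed to `px.imshow` in the same way as the count matrix.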

<p style = "color: #247881;
            font: bold 20px tahoma;
            background-color: #fff;
            padding: 18px;
            border: 6px solid #247881;
            border-radius: 8px"> 
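    🚀 Accuracy: Approximately 100% (on the training data)
    <br>
    <br>
    🚀 Precision: Approximately 100% (on the training data)
    <br>
    <br>
    🚀 Recall: Approximately 100% (on the training data)
    <br>
    <br>
    🚀 F1 Score: Approximately 100% on the training data; about 99.1% macro F1 in 5-fold cross-validation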
</p>

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    7] Deployment 👩‍💻🚀
</p>