# Disease prediction machine learning project
## Overview
In this project, we aim to develop a classification model that supports physicians in predicting and diagnosing diseases in their patients.

## Dataset
> **TLDR**   
> `132` symptoms mapped to `42` diseases
> Download dataset [here](https://www.kaggle.com/datasets/marslinoedward/disease-prediction-data/download?datasetVersionNumber=1)

The dataset used in this project consists of 133 columns: `132` columns represent symptoms common to the diseases, and one column holds the prognosis, covering `42` different diseases. The dataset is split into a training set and a test set.   
You can download the dataset [here](https://www.kaggle.com/datasets/marslinoedward/disease-prediction-data/download?datasetVersionNumber=1) from Kaggle. After downloading, extract the dataset folder into the same directory as the `model.ipynb` file. The directory structure should look like this:
![Directory structure](directory.png)
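
Before running the notebook, it can help to confirm that the extracted folder sits where the code expects it. The following is a minimal sketch, assuming the folder is named `Disease Prediction Data` and contains `Training.csv` and `Testing.csv` (the names used in the notebook cells); the helper `missing_dataset_files` is hypothetical and not part of the project.

```python
from pathlib import Path

# File names taken from the notebook cells; the helper name is hypothetical.
EXPECTED_FILES = ("Training.csv", "Testing.csv")


def missing_dataset_files(dataset_dir: Path) -> list[str]:
    """Return the expected CSV files that are absent from dataset_dir."""
    return [name for name in EXPECTED_FILES if not (dataset_dir / name).is_file()]


if __name__ == "__main__":
    dataset_dir = Path().resolve() / "Disease Prediction Data"
    missing = missing_dataset_files(dataset_dir)
    if missing:
        print(f"Missing from {dataset_dir}: {', '.join(missing)}")
    else:
        print("Dataset files found.")
```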

## Approach
- **Load training dataset and preprocess:** Load the dataset from disk, preprocess the data, and split it into a feature matrix (X) and labels (y).
- **Model training and selection:** Train several classifiers on the training dataset and select the best one.
- **Best model evaluation:** Evaluate the best model on the test set.
- **Save best model to disk:** Save the best model to disk for later use.

### Import libraries

In [1]:
import joblib
import pandas
from pathlib import Path
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

### Load training set

In [2]:
dataset_dir = Path().resolve() / "Disease Prediction Data"
df = pandas.read_csv(dataset_dir / "Training.csv")
df.describe()

| | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | blackheads | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | Unnamed: 133 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | ... | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 0.0 |
| mean | 0.137805 | 0.159756 | 0.021951 | 0.045122 | 0.021951 | 0.162195 | 0.139024 | 0.045122 | 0.045122 | 0.021951 | ... | 0.021951 | 0.021951 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | NaN |
| std | 0.34473 | 0.366417 | 0.146539 | 0.207593 | 0.146539 | 0.368667 | 0.346007 | 0.207593 | 0.207593 | 0.146539 | ... | 0.146539 | 0.146539 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | NaN |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN |
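
The `Unnamed: 133` column in the summary above has a count of 0, meaning it holds no values at all. A likely cause, stated here as an assumption, is a trailing comma on each row of the CSV. The sketch below reproduces that situation on a small stand-in CSV and drops every all-empty column without naming it explicitly; `csv_text` and `df_demo` are illustrative names, not part of the notebook.

```python
import io

import pandas

# Stand-in CSV with a trailing comma on every row, which yields an empty unnamed column.
csv_text = "itching,skin_rash,prognosis,\n1,0,Allergy,\n0,1,Acne,\n"
df_demo = pandas.read_csv(io.StringIO(csv_text))
print(df_demo.columns.tolist())

# Drop any column in which every value is missing.
df_demo = df_demo.dropna(axis=1, how="all")
print(df_demo.columns.tolist())
```

Dropping by emptiness rather than by name keeps the step working if the position of the stray column changes.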


In [3]:
df = df.drop("Unnamed: 133", axis=1)
df["prognosis"].unique()

array(['Fungal infection', 'Allergy', 'GERD', 'Chronic cholestasis',
       'Drug Reaction', 'Peptic ulcer diseae', 'AIDS', 'Diabetes ',
       'Gastroenteritis', 'Bronchial Asthma', 'Hypertension ', 'Migraine',
       'Cervical spondylosis', 'Paralysis (brain hemorrhage)', 'Jaundice',
       'Malaria', 'Chicken pox', 'Dengue', 'Typhoid', 'hepatitis A',
       'Hepatitis B', 'Hepatitis C', 'Hepatitis D', 'Hepatitis E',
       'Alcoholic hepatitis', 'Tuberculosis', 'Common Cold', 'Pneumonia',
       'Dimorphic hemmorhoids(piles)', 'Heart attack', 'Varicose veins',
       'Hypothyroidism', 'Hyperthyroidism', 'Hypoglycemia',
       'Osteoarthristis', 'Arthritis',
       '(vertigo) Paroymsal  Positional Vertigo', 'Acne',
       'Urinary tract infection', 'Psoriasis', 'Impetigo'], dtype=object)
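
Some labels in the output above carry stray whitespace, for example `'Diabetes '`, `'Hypertension '`, and the double space in `'(vertigo) Paroymsal  Positional Vertigo'`. The sketch below shows one way to normalize whitespace before encoding; it leaves the spelling of the labels untouched. The helper `normalize_labels` is hypothetical and not part of the notebook.

```python
import pandas


def normalize_labels(labels: pandas.Series) -> pandas.Series:
    """Trim leading/trailing whitespace and collapse internal runs of whitespace."""
    return labels.str.strip().str.replace(r"\s+", " ", regex=True)


# Three labels copied from the output above.
raw = pandas.Series(["Diabetes ", "Hypertension ", "(vertigo) Paroymsal  Positional Vertigo"])
clean = normalize_labels(raw)
print(clean.tolist())
```

If the training labels are cleaned this way, the test labels need the same treatment before they are passed to the label encoder; otherwise the encoder will reject them as unseen.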

### Split training set into feature matrix and labels, then encode labels

In [4]:
labels = df["prognosis"]
X = df.iloc[:, :-1]
X = X.values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)
X.shape, y.shape

((4920, 132), (4920,))
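
`LabelEncoder` maps each distinct label to an integer index into its sorted `classes_` array, and `inverse_transform` maps the integers back to the original strings. A small sketch on stand-in labels (a handful of disease names from the dataset, not the full label column):

```python
from sklearn.preprocessing import LabelEncoder

# A few labels from the dataset, used here as stand-in data.
labels = ["Malaria", "Acne", "Dengue", "Acne"]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each integer in y is an index into it.
print(label_encoder.classes_)
print(y)

# Recover the original strings from the encoded integers.
print(label_encoder.inverse_transform(y))
```

Keeping the fitted encoder around is what makes numeric predictions readable later on.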

### Train and select best model

In [6]:
pipeline = Pipeline([
    ("classifier", "passthrough")
])

RAND_STATE = 69

param_grid = [
    {"classifier": [AdaBoostClassifier(LogisticRegression(solver="lbfgs", multi_class="multinomial"))]},
    {"classifier": [RandomForestClassifier(random_state=RAND_STATE, n_jobs=-1)]},
    {"classifier": [GradientBoostingClassifier()],
    "classifier__learning_rate": [0.03, 0.01, 0.1, 1.0]}
]

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy', verbose=2)
grid.fit(X, y)
model = grid.best_estimator_
print(f"Best accuracy: {grid.best_score_}")

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best accuracy: 1.0
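
The best score alone does not show how close the other candidates came. `GridSearchCV` records per-candidate results in its `cv_results_` attribute, which can be loaded into a DataFrame for comparison. The sketch below uses synthetic stand-in data and two lightweight candidates so that it runs quickly without the dataset; `X_demo`, `y_demo`, and the reduced grid are assumptions for illustration, not the notebook's configuration.

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data so the sketch runs without the dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([("classifier", "passthrough")])
param_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)]},
    {"classifier": [RandomForestClassifier(n_estimators=20, random_state=0)]},
]
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
results = pandas.DataFrame(grid.cv_results_)
summary = results[["param_classifier", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score").to_string(index=False))
```

Applied to the notebook's `grid` object, the same three lines at the end would list all six candidates with their mean and standard deviation across the five folds.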


### Evaluate best model on test set

In [7]:
testing_df = pandas.read_csv(dataset_dir / "Testing.csv")
y_test = label_encoder.transform(testing_df["prognosis"])
X_test = testing_df.iloc[:, :-1]
X_test = X_test.values
test_predictions = model.predict(X_test)
print(classification_report(y_test, test_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1
           2       1.00      1.00      1.00         1
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         1
           7       1.00      1.00      1.00         1
           8       1.00      1.00      1.00         1
           9       1.00      1.00      1.00         1
          10       1.00      1.00      1.00         1
          11       1.00      1.00      1.00         1
          12       1.00      1.00      1.00         1
          13       1.00      1.00      1.00         1
          14       1.00      1.00      1.00         1
          15       1.00      0.50      0.67         2
          16       1.00      1.00      1.00         1
          17       1.00    
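
The report above identifies classes by their encoded integers. Passing the encoder's class names as `target_names` makes each row readable as a disease name. The sketch below uses small stand-in labels and predictions; in the notebook the inputs would be `y_test`, `test_predictions`, and the fitted `label_encoder`.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Stand-in labels and predictions (illustrative values only).
label_encoder = LabelEncoder().fit(["Acne", "Dengue", "Malaria"])
y_test = label_encoder.transform(["Acne", "Dengue", "Malaria", "Malaria"])
test_predictions = label_encoder.transform(["Acne", "Dengue", "Malaria", "Dengue"])

report = classification_report(
    y_test,
    test_predictions,
    # Listing every encoded class keeps the rows aligned with target_names.
    labels=list(range(len(label_encoder.classes_))),
    target_names=list(label_encoder.classes_),
    zero_division=0,
)
print(report)
```

Supplying `labels` explicitly matters when a class is missing from the test set; without it, the number of rows and the number of names could disagree.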

### Store best model and label encoder for later use

In [8]:
joblib.dump(label_encoder, "label_encoder.pickle.gz", compress=3)
joblib.dump(model, "best_estimator.pickle.gz", compress=3)

['best_estimator.pickle.gz']
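
For later use, both files can be restored with `joblib.load` and combined to turn a symptom vector into a disease name. The sketch below is self-contained: it trains a tiny stand-in model on hypothetical toy data (two symptoms, two diseases) and writes it to a temporary directory, so the toy data, the symptom-to-disease pairing, and all variable names are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in model: two symptoms, two diseases (hypothetical toy data).
symptoms = ["itching", "skin_rash"]
X_toy = np.array([[1, 0], [0, 1]] * 20)
encoder = LabelEncoder()
y_toy = encoder.fit_transform(["Allergy", "Acne"] * 20)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_estimator.pickle.gz"
    encoder_path = Path(tmp) / "label_encoder.pickle.gz"
    joblib.dump(toy_model, model_path, compress=3)
    joblib.dump(encoder, encoder_path, compress=3)

    # Later, for example in another process: restore both objects.
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Build a one-row feature vector in the same column order used for training.
present = {"itching"}
row = np.array([[1 if name in present else 0 for name in symptoms]])
prediction = loaded_encoder.inverse_transform(loaded_model.predict(row))
print(prediction[0])
```

With the real model, the feature vector would need all 132 symptom columns in the order they appear in `Training.csv`. Files written by joblib should only be loaded from trusted sources, since loading them can execute arbitrary code.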

[CV] END classifier=RandomForestClassifier(n_jobs=-1, random_state=69); total time=   0.8s
[CV] END classifier=RandomForestClassifier(n_jobs=-1, random_state=69); total time=   0.8s
[CV] END classifier=GradientBoostingClassifier(), classifier__learning_rate=0.03; total time= 2.9min
[CV] END classifier=GradientBoostingClassifier(), classifier__learning_rate=0.1; total time= 2.4min
[CV] END classifier=AdaBoostClassifier(estimator=LogisticRegression(multi_class='multinomial')); total time=   9.3s
[CV] END classifier=GradientBoostingClassifier(), classifier__learning_rate=0.03; total time= 2.8min
[CV] END classifier=GradientBoostingClassifier(), classifier__learning_rate=0.1; total time= 2.5min
[CV] END classifier=RandomForestClassifier(n_jobs=-1, random_state=69); total time=   0.8s
[CV] END classifier=RandomForestClassifier(n_jobs=-1, random_state=69); total time=   0.8s
[CV] END classifier=GradientBoostingClassifier(), classifier__learning_rate=0.03; total time= 2.7min
[CV] END classifi