# Build Model: Weather-Related Disease Prediction

In this notebook, we build models to predict the target variable (prognosis) and compare them to find out which one performs best.

## Setup & Import

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

In [4]:
import warnings
warnings.filterwarnings("ignore")

## Data Load

In [6]:
# Load the dataset
path = "D:\\Project Data Analysis\\Data & src code\\Weather Related Disease Prediction\\data\\raw\\Weather-related disease prediction.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,Age,Gender,Temperature (C),Humidity,Wind Speed (km/h),nausea,joint_pain,abdominal_pain,high_fever,chills,...,facial_pain,shortness_of_breath,reduced_smell_and_taste,skin_irritation,itchiness,throbbing_headache,confusion,back_pain,knee_ache,prognosis
0,4,1,25.826,0.74,8.289,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,Heart Attack
1,55,0,21.628,0.6,15.236,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,Influenza
2,45,0,13.8,0.817083,4.291992,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Influenza
3,6,0,37.254,0.61,18.009,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Dengue
4,70,0,18.162,0.87,17.916,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,Sinusitis


## Feature Engineering

### Label Encoding for Target Variable

In [7]:
# Encode the target variable
le = LabelEncoder()
df['prognosis'] = le.fit_transform(df['prognosis']) # Encoding the target variable

As a first step, we encode the categories as numbers so the model can use the target for prediction. Here we apply a label encoder to the prognosis column. The other features, such as Age, Temperature, Humidity, and Wind Speed, are already numeric, so we leave them unencoded and keep the model close to the original data.
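To make the encoding easier to read later, the sketch below shows how the integer codes relate to the class names. It uses a short toy list of labels rather than the real prognosis column, so the codes it prints are not the ones used in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; the real column is df['prognosis'].
labels = ["Influenza", "Dengue", "Heart Attack", "Influenza", "Sinusitis"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted alphabetically; a class's position is its integer code.
mapping = {name: int(code) for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into readable disease names.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around means the numeric classes in a classification report can be translated back to disease names with `le.inverse_transform`.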

### Define Features and Target Variable

In [8]:
# define the features and the target variable
X = df.drop(columns=['prognosis']) # features (all columns except the target)
y = df['prognosis'] # target variable

Next, we define the features and the target variable. We use prognosis as the target, so X is the data without the target column and y is the target.

In [9]:
# checking y value counts
y.value_counts() 

4     1013
8      941
6      658
5      338
7      330
10     329
3      327
2      322
1      321
0      311
9      310
Name: prognosis, dtype: int64

### Train-Test Split

In [10]:
# Split the dataset into training and testing sets 80:20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

After defining X and y, we split the data into training and test sets at an 80:20 ratio.
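Because the classes are imbalanced, a plain random split can leave some classes over- or under-represented in the test set (in the counts below, class 10 ends up with 79 of its 329 rows in the test set). The sketch below shows the `stratify` option of `train_test_split` as one possible alternative. It runs on synthetic data, and the names `X_demo` and `y_demo` are hypothetical. It is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target for illustration: 3 classes with 600/300/100 rows.
rng = np.random.default_rng(42)
y_demo = np.repeat([0, 1, 2], [600, 300, 100])
X_demo = rng.normal(size=(y_demo.size, 4))

# stratify=y_demo keeps each class's share the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training set
print(np.bincount(y_te))  # class counts in the test set
```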

In [11]:
# checking the shape of the data
print(f"Size of data train: {X_train.shape[0]} row, {X_train.shape[1]} column")
print(f"Size of data test: {X_test.shape[0]} row, {X_test.shape[1]} column")

Size of data train: 4160 row, 50 column
Size of data test: 1040 row, 50 column


In [12]:
# checking y values
print(f"y train: {y_train.value_counts()}")
print(f"y test: {y_test.value_counts()}")

y train: 4     821
8     740
6     522
7     269
5     267
2     264
1     262
3     259
9     253
0     253
10    250
Name: prognosis, dtype: int64
y test: 8     201
4     192
6     136
10     79
5      71
3      68
7      61
1      59
0      58
2      58
9      57
Name: prognosis, dtype: int64


## Build Model

With the data ready, we build the models and search for the best parameters through hyperparameter tuning. We use two models. The first is Random Forest, chosen because it is well suited to tabular data. The second is XGBoost, chosen because it handles imbalanced data more flexibly and is good at controlling overfitting.
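Neither model below is given any explicit imbalance handling. One option, shown here only as a sketch on a toy target, is to compute balanced per-sample weights. The `fit` methods of both `RandomForestClassifier` and `XGBClassifier` accept a `sample_weight` argument, so the same weights could be passed to either model.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy encoded target for illustration: class 0 is four times as common as class 1.
y_demo = np.array([0, 0, 0, 0, 1])

# 'balanced' weights each row by n_samples / (n_classes * count of its class).
weights = compute_sample_weight(class_weight="balanced", y=y_demo)
print(weights)

# The weights could then be passed as model.fit(X_train, y_train, sample_weight=weights).
```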

### Random Forest Model Classification

In [13]:
# initializing the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train) # fit the model on the training data
y_pred = rf_model.predict(X_test) # make predictions on the test data

# checking with classification report
print("\n Random Forest Classification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))


 Random Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.91      0.98      0.94        59
           2       0.98      0.98      0.98        58
           3       1.00      1.00      1.00        68
           4       1.00      0.99      1.00       192
           5       0.97      1.00      0.99        71
           6       0.99      0.96      0.97       136
           7       1.00      0.98      0.99        61
           8       0.98      1.00      0.99       201
           9       1.00      1.00      1.00        57
          10       0.99      0.95      0.97        79

    accuracy                           0.99      1040
   macro avg       0.98      0.99      0.98      1040
weighted avg       0.99      0.99      0.99      1040


Accuracy Score: 0.9855769230769231


After fitting the Random Forest model, we can see that it performs well, with 98.6% accuracy. Looking at the individual classes, precision and recall are also well balanced.

### Hyperparameter Tuning for Random Forest Model

In this step, we look for the best parameters for Random Forest using grid search (GridSearchCV), combined with cross-validation to reduce the risk of overfitting.

In [14]:
# make a parameter grid for hyperparameter tuning for Random Forest
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt'],  # 'auto' means 'sqrt' for classifiers and was removed in scikit-learn 1.3
}

# Perform grid search for hyperparameter tuning
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search_rf.fit(X_train, y_train) # fit the model on the training data

# showing the best parameters and score
print("Best Parameters:", grid_search_rf.best_params_)
print("Best Cross-Validation Accuracy:", grid_search_rf.best_score_)

# use the best estimator to make predictions
best_rf_model = grid_search_rf.best_estimator_
y_pred_best_rf = best_rf_model.predict(X_test) # make predictions on the test data

# checking with classification report
print("\n Random Forest Classification Report after Hyperparameter Tuning:")
print(classification_report(y_test, y_pred_best_rf))
print("\nAccuracy Score after Hyperparameter Tuning:", accuracy_score(y_test, y_pred_best_rf))

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Best Parameters: {'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Cross-Validation Accuracy: 0.98124999783255

 Random Forest Classification Report after Hyperparameter Tuning:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.89      0.98      0.94        59
           2       0.98      0.98      0.98        58
           3       1.00      1.00      1.00        68
           4       1.00      0.99      0.99       192
           5       0.97      1.00      0.99        71
           6       0.99      0.95      0.97       136
           7       1.00      0.98      0.99        61
           8       0.98      1.00      0.99       201
           9       1.00      1.00      1.00        57
          10       0.99      0.95      0.97        79

    accuracy                           0.98

After tuning with 3-fold cross-validation, the results do not differ significantly from those before tuning. The untuned model is slightly better, but overall both models give good accuracy. The best cross-validation accuracy is also good at 98.1%, and the best parameters found are:

* max_depth: 30
* max_features: sqrt
* min_samples_leaf: 1
* min_samples_split: 10
* n_estimators: 200
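The "216 candidates, totalling 648 fits" line in the grid search log can be checked by counting the combinations in the grid. The sketch below does this with `ParameterGrid` from scikit-learn, which only enumerates the combinations and does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as rf_param_grid above: 3 * 4 * 3 * 3 * 2 combinations.
rf_param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["auto", "sqrt"],
}

n_candidates = len(ParameterGrid(rf_param_grid))
n_folds = 3  # matches cv=3 in the grid search
print(n_candidates, n_candidates * n_folds)
```

The number of fits grows multiplicatively with every value added to the grid, which is worth keeping in mind before widening the search.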


### XGBoost Model Classification

In [15]:
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_model.fit(X_train, y_train) # fit the model on the training data
y_pred_xgb = xgb_model.predict(X_test) # make predictions on the test data

# checking with classification report
print("\n XGBoost Classification Report:")
print(classification_report(y_test, y_pred_xgb))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred_xgb))


 XGBoost Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.92      0.97      0.94        59
           2       0.98      0.98      0.98        58
           3       1.00      1.00      1.00        68
           4       1.00      0.99      0.99       192
           5       0.97      0.99      0.98        71
           6       0.99      0.96      0.98       136
           7       1.00      0.98      0.99        61
           8       0.97      1.00      0.98       201
           9       1.00      1.00      1.00        57
          10       0.99      0.95      0.97        79

    accuracy                           0.98      1040
   macro avg       0.98      0.98      0.98      1040
weighted avg       0.98      0.98      0.98      1040


Accuracy Score: 0.9836538461538461


Fitting the data to the XGBoost model gives an accuracy of 98.4%, slightly lower than Random Forest, though the difference is small. The model still predicts the classes well, and precision and recall are good for every class, as with Random Forest.

### Hyperparameter Tuning for XGBoost Model

In [16]:
# make a parameter grid for hyperparameter tuning for XGBoost
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
}

# Perform grid search for hyperparameter tuning
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search_xgb.fit(X_train, y_train) # fit the model on the training data

# showing the best parameters and score
print("Best Parameters:", grid_search_xgb.best_params_)
print("Best Cross-Validation Accuracy:", grid_search_xgb.best_score_)

# use the best estimator to make predictions
best_xgb_model = grid_search_xgb.best_estimator_
y_pred_best_xgb = best_xgb_model.predict(X_test) # make predictions on the test data

# checking with classification report
print("\n XGBoost Classification Report after Hyperparameter Tuning:")
print(classification_report(y_test, y_pred_best_xgb))
print("\nAccuracy Score after Hyperparameter Tuning:", accuracy_score(y_test, y_pred_best_xgb))

Fitting 3 folds for each of 243 candidates, totalling 729 fits
Best Parameters: {'colsample_bytree': 0.5, 'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 100, 'subsample': 1.0}
Best Cross-Validation Accuracy: 0.978605362167006

 XGBoost Classification Report after Hyperparameter Tuning:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.91      0.98      0.94        59
           2       0.98      0.98      0.98        58
           3       1.00      1.00      1.00        68
           4       1.00      0.99      0.99       192
           5       0.97      1.00      0.99        71
           6       0.98      0.96      0.97       136
           7       1.00      0.98      0.99        61
           8       0.98      1.00      0.99       201
           9       1.00      1.00      1.00        57
          10       0.99      0.95      0.97        79

    accuracy                           0.98      1040
  

After tuning the XGBoost model with grid search and 3-fold cross-validation, the accuracy is slightly better than before tuning, though the difference is small. The best cross-validation accuracy is 97.9%, which suggests the model already performs well when predicting the classes. The best parameters found for the XGBoost model are:

* colsample_bytree: 0.5 
* learning_rate: 0.2 
* max_depth: 5
* n_estimators: 100 
* subsample: 1.0

The next step is evaluation. There we assess the models and compare their predictions for each class using a confusion matrix, and use feature/permutation importance to see which features have the most influence.
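As a preview of what that evaluation might look like, the sketch below computes a confusion matrix and permutation importance with scikit-learn. It uses a small synthetic dataset and a small Random Forest as stand-ins, so the numbers it prints say nothing about the disease models trained above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the disease dataset): 3 classes, 6 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)

# Permutation importance: drop in score when one feature's values are shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
print(result.importances_mean)
```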

## Save Model & Test Data

In [17]:
import joblib
import os

In [None]:
# Save model & test data
os.makedirs('models', exist_ok=True)
joblib.dump(rf_model, 'models/rf_model.pkl') # using the default model
joblib.dump(X_test, 'models/X_test.pkl')
joblib.dump(y_test, 'models/y_test.pkl')

print("[✓] Random Forest model and test data saved successfully.")

[✓] Random Forest model and test data saved successfully.


In [None]:
joblib.dump(best_xgb_model, 'models/xgb_model.pkl') # using the best model; saved to the 'models' directory created above
print("[✓] XGBoost model saved successfully.")

[✓] XGBoost model saved successfully.
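The saved files can be read back with `joblib.load`. The sketch below shows the save-and-load round trip on a small stand-in model in a temporary directory. The file name and data are hypothetical, and it does not touch the files written above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model, used only to show the round trip.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model.pkl")  # hypothetical path
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = (loaded_model.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```

Loading a pickled model generally requires library versions compatible with the ones used when it was saved.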
