# **This notebook handles ML modelling**

## Objectives

* Create an ML model to predict diabetes from the features engineered in the previous notebook.

## Inputs

* The input is the `diabetes_model_input.csv` file in the `Data/transformed` directory. This file was generated in the Feature_Engineering notebook.

## Outputs

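* Two fitted pipelines (Logistic Regression and Random Forest), held in memory, with a confusion matrix and classification report for each on the train and test sets. No files are written.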


---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed to the project root before running them in the editor.

We change the working directory from the current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
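
Re-running the `os.chdir` cell moves up one more level each time, which can silently break relative paths such as `Data/transformed/...`. A minimal sketch of a guard, assuming the project root can be recognised by a marker directory (the `find_project_root` helper and the `Data` marker are hypothetical and not part of the original notebook):

```python
from pathlib import Path


def find_project_root(start, marker="Data"):
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is an assumed directory name that identifies the project root.
    Returns the matching folder, or None if no ancestor contains the marker.
    """
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).is_dir():
            return folder
    return None
```

Calling `find_project_root(os.getcwd())` and passing a non-`None` result to `os.chdir()` gives the same directory no matter how many times the cell is run.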

# ML model 

* Import the libraries needed

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

* Load the input file to a dataframe

In [None]:
df = pd.read_csv('Data/transformed/diabetes_model_input.csv')
# display the shape and first few rows of the dataframe
print(df.shape)
df.head()

* Check the value counts of the target variable

In [None]:
df['diabetes'].value_counts()

* We split the data into train and test sets.

In [None]:
# Split features (X) and target (y) into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['diabetes'], axis=1),
    df['diabetes'],
    test_size=0.2,
    # stratify=df['diabetes'],  # tried and dropped; see the experimentation notes below
    random_state=42,
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape)

## Logistic Regression

* We create a pipeline with a scaler, a feature selector and a classifier

In [None]:
def pipeline_logistic_regression():

    pipeline = Pipeline(
        [
            ("feat_scaling", StandardScaler()),
            ("feat_selection", SelectFromModel(LogisticRegression(random_state=42))),
            ("model", LogisticRegression(random_state=42)),
        ]
    )

    return pipeline

* Fit the pipeline with the train set

In [None]:
pipeline = pipeline_logistic_regression()
pipeline.fit(X_train, y_train)
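
To make the feature selection step less of a black box, here is a minimal sketch on synthetic stand-in data (an assumption, not the diabetes dataset). It relies on the scikit-learn default that, for an estimator without an L1 penalty, `SelectFromModel` keeps the features whose importance (here the absolute coefficient) is at least the mean importance. The `X_demo`, `y_demo` and `selector` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_demo)

selector = SelectFromModel(LogisticRegression(random_state=42)).fit(X_scaled, y_demo)

# Importance of each feature is the absolute value of its coefficient
importances = np.abs(selector.estimator_.coef_).ravel()
kept = selector.get_support()

print("threshold:", selector.threshold_)
print("importances:", importances.round(3))
print("kept feature indices:", np.flatnonzero(kept))
```

Scaling before selection matters here: without it, coefficient sizes would depend on each feature's units and could not be compared against a single threshold.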

* Next we inspect the fitted model's coefficients

In [None]:
def logistic_regression_coef(model, columns):
    """Print the model coefficients, sorted by absolute value (largest first)."""
    coeff_df = pd.DataFrame(
        model.coef_, index=["Coefficient"], columns=columns
    ).T.sort_values(["Coefficient"], key=abs, ascending=False)
    print(coeff_df)

In [None]:
logistic_regression_coef(
    model=pipeline["model"],
    columns=X_train.columns[pipeline["feat_selection"].get_support()],
)

* Here's what we observe from the coefficients

| Feature              | Coefficient | Interpretation                                                              |
|----------------------|-------------|-----------------------------------------------------------------------------|
| HbA1c_level          | 2.45        | Strong positive impact on the probability of diabetes                       |
| blood_glucose_level  | 1.82        | Also a strong positive impact, but slightly less than HbA1c                 |
| age                  | 0.91        | Smaller positive impact, but still relevant                                 |


* `SelectFromModel` kept 3 features, chosen by the absolute size of their logistic regression coefficients.

* Among them, HbA1c level is the most influential, followed by blood glucose level, and then age.

* All three coefficients are positive, so the model treats higher values of each feature as increasing the probability of diabetes. Because the features are standardised, the coefficient magnitudes are directly comparable.
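
One way to read these coefficients is as odds ratios. Since the features pass through `StandardScaler`, `exp(coefficient)` is the factor by which the odds of diabetes are multiplied when a feature rises by one standard deviation, with the other features held fixed. A small sketch using the values from the table above (the variable names are illustrative):

```python
import math

# Coefficients reported in the table above
coefficients = {
    "HbA1c_level": 2.45,
    "blood_glucose_level": 1.82,
    "age": 0.91,
}

# exp(coefficient) is the multiplicative change in the odds of diabetes
# for a one-standard-deviation increase in the (scaled) feature
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per +1 SD = {ratio:.2f}")
```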

* Now we will look at the confusion matrix and classification report for performance metrics

In [None]:
def confusion_matrix_and_report(X, y, pipeline, label_map):
    
    prediction = pipeline.predict(X)

    print("---  Confusion Matrix  ---")
    print(
        pd.DataFrame(
            confusion_matrix(y_true=y, y_pred=prediction),
            # rows are actual classes, columns are predicted classes
            columns=["Prediction " + sub for sub in label_map],
            index=["Actual " + sub for sub in label_map],
        )
    )
    print("\n")

    print("---  Classification Report  ---")
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)


In [None]:
clf_performance(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    pipeline=pipeline,
    label_map=["No Diabetes", "Diabetes"],
)

## ✅ What Looks Good

- **Overall Accuracy is solid:**
  - Train: 85%
  - Test: 86%
  - No major signs of overfitting or underfitting.

- **Performance is slightly better for the 'Diabetes' class**, which is often the more critical one to catch:
  - Test set F1-score for Diabetes: 0.87
  - Recall for Diabetes: 0.87 (highly valuable in medical diagnosis)

- **Balanced macro averages:**  
  Precision, recall, and F1 are all around 0.85, which shows that the model doesn't heavily favor one class over another.

## ⚠️ What Might Be Concerning

- **False Negatives (missed diabetes cases):**
  - Test: 133 missed diabetic patients
  - That's ~13% of diabetic cases going undetected.
  - In real-world healthcare, this could be dangerous, as undiagnosed diabetes can lead to complications.

- **False Positives (wrongly predicted diabetes):**
  - Test: 126 non-diabetic people predicted as diabetic
  - This can cause anxiety, unnecessary testing, and cost.
  - Even though this number is relatively low, its impact on patient well-being should not be dismissed.
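
The percentage of missed cases follows directly from the confusion matrix: it is the false negatives divided by all actual positives, which equals 1 minus recall. A minimal sketch of that arithmetic, using hypothetical counts chosen only for illustration (the `false_negative_rate` helper is not part of the notebook):

```python
def false_negative_rate(true_positives, false_negatives):
    """Share of actual positive cases that the model missed (1 - recall)."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("no actual positive cases")
    return false_negatives / actual_positives


# Illustrative counts only (hypothetical, not taken from the notebook output)
tp, fn = 87, 13
fnr = false_negative_rate(tp, fn)
recall = 1 - fnr
print(f"false negative rate = {fnr:.2f}, recall = {recall:.2f}")
```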

## ⚙️ Experimentation with Stratification and Class Weights

I experimented with applying **stratification** and **class-weight balancing** to improve model performance, especially to reduce false negatives.

- However, these changes **increased the number of false negatives**, meaning more actual diabetes cases were predicted as no diabetes.
- Since **minimizing false negatives is critical** in this healthcare context, I chose **not to keep these changes**.
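
For reference, this kind of comparison can be sketched as below. The sketch uses synthetic, imbalanced stand-in data and a bare `LogisticRegression` rather than the notebook's pipeline, so its numbers say nothing about the diabetes dataset; the `count_false_negatives` helper is hypothetical. Whether stratification or class weights raise or lower false negatives depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.8, 0.2], random_state=42
)


def count_false_negatives(stratify, class_weight):
    """Fit a logistic regression and return the test-set false negatives."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo,
        y_demo,
        test_size=0.2,
        stratify=y_demo if stratify else None,
        random_state=42,
    )
    model = LogisticRegression(class_weight=class_weight, random_state=42)
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1]).ravel()
    return fn


for stratify in (False, True):
    for class_weight in (None, "balanced"):
        fn = count_false_negatives(stratify, class_weight)
        print(f"stratify={stratify}, class_weight={class_weight}: false negatives = {fn}")
```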


### ✅ Summary
> The model performs consistently across both classes, with an accuracy of **85% on the training set** and **86% on the test set** and no sign of overfitting. It identifies diabetes cases well (**F1-score: 0.87** on the test set). The test-set confusion matrix shows 133 false negatives and 126 false positives; since this is a healthcare application, predictions should be validated further before any clinical action is taken.

---

## Random Forest Classifier

* We implement a Random Forest Classifier to see whether it improves on Logistic Regression.

* Define the pipeline with a scaler, feature selector and the Random Forest Classifier

In [None]:
def pipeline_rf_clf():
    pipeline = Pipeline(
        [
            ("feat_scaling", StandardScaler()),
            ("feat_selection", SelectFromModel(RandomForestClassifier(random_state=42))),
            ("model", RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)),
        ]
    )

    return pipeline

* We use GridSearchCV to search for the best hyperparameters for the Random Forest model (the search cells below are currently commented out)

In [None]:
# param_grid = {"model__n_estimators":[50,20],
#               }

# param_grid = {
#     'model__n_estimators': [50, 100],
#     'model__max_depth': [None, 10, 20]
# }

# param_grid


In [None]:
# grid = GridSearchCV(estimator=pipeline_rf_clf(),
#                     param_grid=param_grid,
#                     cv=2,
#                     n_jobs=-2,
#                     verbose=1,
#                     scoring=make_scorer(recall_score, pos_label=1)
#                     )

# grid = GridSearchCV(
#     estimator=pipeline_rf_clf(),
#     param_grid=param_grid,
#     scoring=make_scorer(recall_score, pos_label=1),  # assuming 1 = 'Diabetes'
#     cv=3,
#     n_jobs=-1,
#     verbose=2
# )

# grid.fit(X_train,y_train)

* We check the results for each hyperparameter combination using `cv_results_`

In [None]:
# (pd.DataFrame(grid.cv_results_)
# .sort_values(by='mean_test_score',ascending=False)
# .filter(['params','mean_test_score'])
# .values
#  )

* Check the best parameters

In [None]:
# grid.best_params_
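
Because the search cells above are commented out, here is a small self-contained sketch of the same idea that runs in a few seconds. It uses synthetic stand-in data and a reduced pipeline with only a `model` step, so all `demo_*` names and grid values are assumptions for illustration, not the notebook's actual search. `scoring="recall"` scores recall for the positive class (label 1).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic stand-in data so the search runs quickly (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

demo_pipeline = Pipeline([("model", RandomForestClassifier(random_state=42))])

# The "model__" prefix routes each parameter to the pipeline step named "model"
demo_param_grid = {
    "model__n_estimators": [10, 20],
    "model__max_depth": [3, None],
}

demo_grid = GridSearchCV(
    estimator=demo_pipeline,
    param_grid=demo_param_grid,
    scoring="recall",  # recall of the positive class (label 1)
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))
```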

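* Fit the pipeline with the hyperparameters hard-coded in `pipeline_rf_clf()` (`n_estimators=50`, `max_depth=10`), since the grid search cells above are commented out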

In [None]:
pipeline_rf = pipeline_rf_clf()
pipeline_rf.fit(X_train, y_train)

* Evaluate the model with a confusion matrix and classification report

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_rf,
                label_map= ['No Diabetes', 'Diabetes'] 
                )

### 🔍 Side-by-Side Comparison

#### ✅ Accuracy

| Model              | Train Accuracy | Test Accuracy |
|--------------------|----------------|---------------|
| Random Forest      | 91%            | 90%           |
| Logistic Regression | 85%            | 86%           |

> 🔹 **Winner: Random Forest**, with better overall accuracy and a similarly small train-test gap (1 percentage point for both models).

---

#### ✅ Recall (especially important for Diabetes class)

| Model              | Train Recall – Diabetes | Test Recall – Diabetes |
|--------------------|------------------------|-----------------------|
| Random Forest      | 0.94                   | 0.94                  |
| Logistic Regression | 0.86                   | 0.87                  |

> 🔹 **Winner: Random Forest** — significantly higher recall for the Diabetes class, which is crucial in medical diagnosis.

---

#### ✅ False Negatives (missed diabetes cases)

| Model              | Test False Negatives    |
|--------------------|------------------------|
| Random Forest      | 66 (6.5%)               |
| Logistic Regression | 133 (13%)               |

> 🔹 **Winner: Random Forest** — detects more true diabetes cases (almost halves the number of missed diagnoses).

### ✅ Final Verdict

Random Forest clearly outperforms Logistic Regression across all major metrics, especially in terms of recall for diabetic patients, overall accuracy, and reduction in false negatives — making it the stronger choice between the 2 models for this diabetes prediction task.
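
The size of the improvement in missed cases can be checked with a short calculation on the test-set false negatives reported in the table above (the variable names are illustrative):

```python
# Test-set false negatives reported in the comparison table above
fn_logistic_regression = 133
fn_random_forest = 66

avoided = fn_logistic_regression - fn_random_forest
relative_reduction = avoided / fn_logistic_regression

print(f"Missed cases avoided: {avoided}")
print(f"Relative reduction: {relative_reduction:.1%}")
```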

## ⚙️ Experimentation with hyperparameter optimisation

- Initially I tuned only `n_estimators`, but the model overfitted the training data: recall on the train set was 1.0, while performance on the test set was poor.

- I then added `max_depth` to the hyperparameter grid to limit the depth of each tree in the forest.

- This reduced overfitting and improved generalization to unseen data.
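
The effect of `max_depth` can be explored with a sketch like the one below, which compares train and test recall for an unconstrained forest and two depth-limited ones. It runs on noisy synthetic stand-in data (an assumption), so the numbers it prints are not the notebook's results and the size of the gap will differ on the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Noisy synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Map each max_depth setting to its (train recall, test recall) pair
recalls = {}
for max_depth in (None, 10, 3):
    forest = RandomForestClassifier(
        n_estimators=50, max_depth=max_depth, random_state=42
    ).fit(X_tr, y_tr)
    recalls[max_depth] = (
        recall_score(y_tr, forest.predict(X_tr)),
        recall_score(y_te, forest.predict(X_te)),
    )

for max_depth, (train_recall, test_recall) in recalls.items():
    print(f"max_depth={max_depth}: train recall={train_recall:.2f}, test recall={test_recall:.2f}")
```

A large gap between train and test recall for a given setting is the overfitting signal described above.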

---