# Machine Learning notebooks - Part 2

## Part II: Supervised classification
We will continue exploring the Pima Indians Diabetes dataset that we explored in Part 1 (JN17), which records diabetes occurrence in a population of Pima Indian heritage. The target variable is discrete and measures a binary outcome, so we will explore different supervised classification models.

Pima Indians Diabetes Dataset
- `Pregnancies` - Number of times pregnant
- `Glucose` - Plasma glucose concentration from a 2-hour oral glucose tolerance test
- `BloodPressure` - Diastolic blood pressure (mm Hg)
- `SkinThickness` - Triceps skin fold thickness (mm)
- `Insulin` - 2-hour serum insulin (mu U/ml)
- `BMI` - Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction` - Diabetes pedigree function
- `Age` - Age (years)
- `Outcome` - Class variable (0 or 1); 268 of the 768 rows are 1, the others are 0
  
Learning outcomes
- Apply different feature engineering and feature scaling techniques to the dataset
- Try different ML classification algorithms

In [1]:
### Load libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
from google.colab import drive
drive.mount('/content/drive')
gdrive='drive/MyDrive/SJSU/SJSU_Fall2024/CS133_Data-Visualization/week12_ml/'
fp=gdrive+'diabetes.csv'

Mounted at /content/drive


In [3]:
pima = pd.read_csv(fp)
pima.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Since we already did some EDA in the previous notebook and explored how we should preprocess our data, we will jump straight into exploring classification models.

## Split the data  
To understand model performance, a good strategy is to divide the dataset into a training set and a test set. A typical choice is to randomly select 20% of the instances and set them aside as the test set, then train on the remaining 80%.

Both the training set and the test set must contain known outputs (labels). The model learns from the labels in the training set, and the labels in the test set are then used to check the model's predictions on data it has not seen.

Let's split the dataset using the function `train_test_split()`. You need to pass three arguments: the features, the target, and the test set size.

`random_state` is a "seed" for the random number generator; it ensures that the same training and test sets are produced across different executions. The particular integer used does not matter.

`stratify=y` keeps the class proportions of `y` the same in the training and test sets, which is useful when dealing with an imbalanced dataset.
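
To make the effect of `random_state` and `stratify` concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration; they only mimic the class balance listed above (268 positives out of 768 rows) and are not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows, 268 of them positive
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 3))
y_demo = np.array([1] * 268 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the share of positives is nearly identical in both sets
print("Positive share overall:", round(y_demo.mean(), 3))
print("Positive share in train:", round(y_tr.mean(), 3))
print("Positive share in test:", round(y_te.mean(), 3))
print("Train/test sizes:", len(y_tr), len(y_te))

# The same random_state reproduces exactly the same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("Same test labels on a second run:", np.array_equal(y_te, y_te_again))
```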

In [5]:
### Select features and target variables
# pima.columns
X = pima.drop('Outcome', axis=1) # Features
y = pima.Outcome # Target variable

### Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


### Pre-processing and feature engineering

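**Replacing null or zero values**  
Several features in this dataset contain zeros that stand in for missing measurements; for example, `SkinThickness` and `Insulin` are 0 in some of the first rows shown above. We will find every column that contains zeros, leave out `Pregnancies` and `Outcome` (where 0 is a valid value), and impute the remaining zeros with the median of the non-zero values.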

**Feature Engineering**  
Another pre-processing task is feature engineering. One example is encoding the data as a numerical vector, where categorical variables are converted to numerical variables. We do this because many ML algorithms require numerical input. Here we bin `Age` into four age groups and encode each group as an integer code.

In [22]:
### Find the columns that contain zeros
zero_cols = []
for col in pima.columns:
    zero_count = (pima[col] == 0).sum()
    if zero_count > 0:
        zero_cols.append(col)
# A zero is a valid value in these two columns, so leave them as they are
zero_cols.remove('Pregnancies')
zero_cols.remove('Outcome')

def ml_process(df, zero_cols):
    """Impute zeros with the column median and bin Age into integer-coded groups."""
    ## Replace 0 values with the median of the non-zero values in df
    df_processed = df.copy()
    for col in zero_cols:
        median_val = df_processed[df_processed[col] != 0][col].median()
        df_processed[col] = df_processed[col].replace(0, median_val)

    ## Feature engineering of Age: bins are (0, 30], (30, 40], (40, 50], (50, 100]
    df_processed['Age_Group'] = pd.cut(df_processed['Age'], bins=[0, 30, 40, 50, 100],
                                       labels=['Young', 'Middle', 'Senior', 'Elderly'])
    df_processed = df_processed.drop('Age', axis=1)
    # Encode the groups as integers: Young=0, Middle=1, Senior=2, Elderly=3
    df_processed['Age_Group'] = df_processed['Age_Group'].cat.codes

    return df_processed

In [24]:
### Run ml_process() on the training and test set
X_train_fe=ml_process(X_train, zero_cols)
X_test_fe=ml_process(X_test, zero_cols)
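
Note that `ml_process()` computes the medians from whichever dataframe it receives, so the test set is imputed with its own medians. A common alternative is to learn the medians from the training set only and reuse them on the test set. The following is a minimal sketch of that variant on a tiny made-up dataframe; `fit_medians` and `apply_medians` are hypothetical helper names and are not part of this notebook.

```python
import pandas as pd

def fit_medians(df, zero_cols):
    """Median of the non-zero values of each column, learned from df."""
    return {col: df.loc[df[col] != 0, col].median() for col in zero_cols}

def apply_medians(df, medians):
    """Return a copy of df in which zeros are replaced by the given medians."""
    out = df.copy()
    for col, med in medians.items():
        out[col] = out[col].replace(0, med)
    return out

# Made-up values, for illustration only
train = pd.DataFrame({'Glucose': [100, 0, 120, 140], 'Insulin': [0, 90, 110, 0]})
test = pd.DataFrame({'Glucose': [0, 95], 'Insulin': [200, 0]})

medians = fit_medians(train, ['Glucose', 'Insulin'])  # learned from train only
train_imp = apply_medians(train, medians)
test_imp = apply_medians(test, medians)               # reuses the train medians

print(medians)
print(test_imp)
```

Using this variant in the notebook would change the imputed test values slightly, so the scores reported below would not necessarily be reproduced exactly.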

In [25]:
X_train_fe.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age_Group
353,1,90,62,12,43,27.2,0.58,0
711,5,126,78,27,22,29.6,0.439,1
373,2,105,58,40,94,34.9,0.225,0
46,1,146,56,29,125,29.7,0.564,0
682,0,95,64,39,105,44.6,0.366,0
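
The `Age_Group` codes in the output above come from `pd.cut` followed by `.cat.codes`. As a quick check of how the bins behave at their edges, here is a small sketch on a few hand-picked ages (the ages are made up for illustration):

```python
import pandas as pd

ages = pd.Series([21, 30, 31, 45, 67])

# Same bins and labels as in ml_process(); intervals are closed on the right
groups = pd.cut(ages, bins=[0, 30, 40, 50, 100],
                labels=['Young', 'Middle', 'Senior', 'Elderly'])
codes = groups.cat.codes  # integer position of each label

print(pd.DataFrame({'Age': ages, 'Age_Group': groups, 'Code': codes}))
```

Because the intervals are closed on the right by default, age 30 falls into `Young` and age 31 into `Middle`. A value outside all bins would become `NaN` and receive the code -1.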


### Supervised classification models
Since the target variable for this dataset is discrete, classification models are preferred over regression models.

### Logistic regression model




In [26]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, roc_curve, ConfusionMatrixDisplay, classification_report
)

log_reg = LogisticRegression(random_state=42, max_iter=5000)

param_grid = {
    'penalty': ['l2', None],
    'solver': ['lbfgs', 'saga'],
    'C': [0.01, 0.1, 1, 10, 100],
    'class_weight': [None, "balanced"]
    }

best_log = GridSearchCV(log_reg,
                        param_grid,
                        cv=5,
                        scoring='accuracy',
                        refit='accuracy',
                        n_jobs=-1,
                        error_score='raise'
                        )
best_log.fit(X_train_fe, y_train)
logreg_y_pred = best_log.predict(X_test_fe)
print("Best parameters for logistic regression:", best_log.best_params_)
print("Best CV score for logistic regression:", best_log.best_score_)

Best parameters for logistic regression: {'C': 1, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Best CV score for logistic regression: 0.7866853258696521


### K-Neighbors classification


In [28]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
### Create a k-nearest neighbors classifier object (default n_neighbors=5)
kn = KNeighborsClassifier()

### Train the k-nearest neighbors classifier
kn = kn.fit(X_train_fe, y_train)

# Predict the response for the test dataset
kn_y_pred = kn.predict(X_test_fe)

# Model accuracy: the fraction of test samples that are classified correctly
print("Accuracy with KNN:", accuracy_score(y_test, kn_y_pred, normalize=True))

Accuracy with KNN: 0.6883116883116883
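
KNN relies on distances between rows, so features with large numeric ranges (for example `Insulin`) can dominate features with small ranges (for example `DiabetesPedigreeFunction`). This notebook imports `StandardScaler` but does not apply it in the cell above. The following is a minimal sketch of how scaling could be combined with KNN in a pipeline. It runs on synthetic data from `make_classification`, not on the diabetes data, so the printed numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; inflate one feature so the features are on very different scales
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_syn[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The scaler is fitted on the training data only, then reused on the test data
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)

# Same classifier without scaling, for comparison
plain_knn = KNeighborsClassifier().fit(X_tr, y_tr)

print("KNN without scaling:", plain_knn.score(X_te, y_te))
print("KNN with scaling:   ", scaled_knn.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline means that `fit` and `predict` apply the same transformation every time, and the pipeline can be passed to a search such as `GridSearchCV` like any other estimator.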


### Decision tree classifier model


In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split

### Parameters for the decision tree
param_dist = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ["gini", "entropy"]
}

## Create a decision tree classifier
dtc = DecisionTreeClassifier(random_state=42) # random_state makes the tree reproducible

## Use random search to find optimal hyperparameters
random_search = RandomizedSearchCV(dtc,
                                   param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='accuracy',
                                   n_jobs=-1, random_state=42) #n_jobs=-1: use all available CPU cores

# Fit random search to the training data
random_search.fit(X_train_fe, y_train)

## Best model for the decision tree
best_dtc = random_search.best_estimator_
print("Optimal hyperparameters for decision tree:", random_search.best_params_)

## Predict on the test data with the best model
dtc_y_pred = best_dtc.predict(X_test_fe)

## Evaluate the model with accuracy
print("Accuracy with Decision tree:", accuracy_score(y_test, dtc_y_pred, normalize=True))

Optimal hyperparameters for decision tree: {'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 5, 'criterion': 'gini'}
Accuracy with Decision tree: 0.7727272727272727


### Support vector machine (SVM) classifier

In [34]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

### SVM classifier object
# Features are standardized first: the original run on the unscaled features
# did not finish and was interrupted (KeyboardInterrupt)
svm = make_pipeline(StandardScaler(), SVC(random_state=42))

# make_pipeline names the SVC step 'svc', so its parameters take the 'svc__' prefix
svm_param_dist = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['rbf', 'linear', 'poly'],
    'svc__gamma': ['scale', 'auto']
}

## Use random search to find optimal hyperparameters
svm_random_search = RandomizedSearchCV(svm,
                                   param_distributions=svm_param_dist,
                                   n_iter=10, cv=5, scoring='accuracy',
                                   n_jobs=-1, random_state=42) #n_jobs=-1: use all available CPU cores

# Fit random search to the training data
svm_random_search.fit(X_train_fe, y_train)

## Best model for the SVM
best_svm = svm_random_search.best_estimator_
print("Optimal hyperparameters:", svm_random_search.best_params_)

## Predict on the test data with the best model
svm_y_pred = best_svm.predict(X_test_fe)

## Evaluate SVM model
print("Accuracy with SVM:", accuracy_score(y_test, svm_y_pred, normalize=True))

**Evaluating models**  
For the decision tree and SVM models we used `RandomizedSearchCV` to find the best hyperparameters. For logistic regression we used `GridSearchCV`, which tries every combination in the grid, and KNN was run with its default settings. `RandomizedSearchCV` randomly samples a fixed number of parameter settings (`n_iter`) from the values or distributions defined for each hyperparameter and uses cross-validation to select the best combination.

In a future notebook we will cover in more detail how to use different metrics to choose the best model for your problem and dataset.
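
To see how much of the search space a randomized search actually covers, the sketch below counts the candidates for the decision tree search space used earlier. It uses `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection` only to enumerate parameter settings; no model is trained here.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the decision tree cell
param_dist = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ["gini", "entropy"]
}

# Every combination, as an exhaustive grid search would try them
all_combos = list(ParameterGrid(param_dist))

# A random subset of 10 settings, as a randomized search with n_iter=10 would try
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print("Combinations in the full grid:", len(all_combos))
print("Combinations sampled:", len(sampled))
print("Model fits with cv=5:", len(all_combos) * 5, "vs", len(sampled) * 5)
```

The full grid has 4 × 3 × 3 × 2 = 72 combinations, while the randomized search evaluates only 10 of them. This makes it cheaper, at the cost of possibly missing the best combination.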

In [None]:
### Build a dataframe with all the accuracy scores to plot
predictions = {
    'Logistic Regression': logreg_y_pred,
    'KNN': kn_y_pred,
    'Decision Tree': dtc_y_pred,
    'SVM': svm_y_pred,
}
accuracy_df = pd.DataFrame({
    'Model': list(predictions.keys()),
    'Accuracy': [accuracy_score(y_test, y_pred, normalize=True) for y_pred in predictions.values()]
})

plt.bar(accuracy_df['Model'], accuracy_df['Accuracy'])
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.show()