
In [1]:
# connect to google drive
from google.colab import drive
from IPython.display import clear_output
drive.mount('/content/drive', force_remount=True)
clear_output()

import warnings
warnings.filterwarnings('ignore')

In [2]:
# install packages
!pip install optuna
clear_output()

In [3]:
# import dataset from kaggle
# dataset link https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
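import os
kaggle_json = os.path.expanduser('~/.kaggle/kaggle.json')
# os.path.isfile does not expand '~', so the path is expanded explicitly before the check
if not os.path.isfile(kaggle_json):
    os.makedirs(os.path.dirname(kaggle_json), exist_ok=True)
    os.system(f'cp /content/drive/MyDrive/kaggle.json {kaggle_json}')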
!kaggle datasets download -d uciml/pima-indians-diabetes-database
!unzip pima-indians-diabetes-database.zip -d .

Downloading pima-indians-diabetes-database.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 8.34MB/s]
Archive:  pima-indians-diabetes-database.zip
  inflating: ./diabetes.csv          


In [4]:
# read file using pandas
import numpy as np
import pandas as pd
data = pd.read_csv('diabetes.csv')
data.head(5)

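|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |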


In [5]:
# split data into train set and test set
def get_X_and_y(data, target):
    X = data.copy()
    y = X.pop(target)
    X, y = X.values, y.values
    return X, y

X, y = get_X_and_y(data, 'Outcome')
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=20, test_size=0.3)

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
# [(train_index, test_index) for train_index, test_index in 

#### Task 1 (30 points): 

Implement a Decision Tree Classifier for your classification problem. You
may use a built-in package to implement your classifier. Try modifying one or more of the input
parameters and describe what changes you notice in your results. Clearly describe how these
factors are affecting your output.

In [6]:
# helper: fit a classifier and return its predictions on X_test (despite the name, nothing is transformed)
def fit_transform(clf, X_train, y_train, X_test):
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    return preds

In [7]:
# Decision Tree
import optuna
from sklearn.tree import DecisionTreeClassifier

def objective(trial):
    params = {
        'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
    }

    clf = DecisionTreeClassifier(**params, random_state=20)
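    # note: each trial is scored on the held-out test set, so the best accuracy found is an optimistic estimate
    preds = fit_transform(clf, X_train, y_train, X_test)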
    return metrics.accuracy_score(y_test, preds)

search_space = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(1, 10).tolist()  # depths 1 through 9, as plain Python ints for GridSampler
}

study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space), direction='maximize')
study.optimize(objective, n_trials=200)
clear_output()

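# print results sorted by accuracy (parameter columns selected by name, followed by the objective value)
results = study.trials_dataframe()
param_cols = [c for c in results.columns if c.startswith('params_')]
results[param_cols + ['value']].sort_values('value', ascending=False).head(15)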

|   | params_criterion | params_max_depth | value |
|---|---|---|---|
| 0 | entropy | 3 | 0.757576 |
| 11 | gini | 2 | 0.757576 |
| 3 | gini | 3 | 0.757576 |
| 8 | entropy | 4 | 0.74026 |
| 14 | entropy | 5 | 0.731602 |
| 12 | entropy | 8 | 0.714286 |
| 10 | entropy | 6 | 0.714286 |
| 1 | entropy | 7 | 0.709957 |
| 2 | entropy | 9 | 0.701299 |
| 16 | gini | 9 | 0.69697 |


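For the decision tree classifier, I tuned two hyperparameters: criterion and max depth. With everything else equal, a shallow tree (max depth 2 to 5) scores higher than a deep one: test accuracy falls from 0.758 at depth 2 or 3 to about 0.70 at depth 9, which is consistent with deeper trees overfitting the training set. The Gini and entropy criteria tie for the best score, so the choice of criterion makes no significant difference.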
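The depth effect described above can be reproduced on synthetic data. The sketch below is only an illustration, not the notebook's experiment: it assumes a `make_classification` dataset as a stand-in for `diabetes.csv` (the sample count, feature count, and label noise are made-up values) and compares training and test accuracy at a few depths.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features, 5% label noise)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, flip_y=0.05, random_state=20
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=20
)

depth_results = []
for depth in [1, 3, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=20)
    tree.fit(Xd_train, yd_train)
    train_acc = accuracy_score(yd_train, tree.predict(Xd_train))
    test_acc = accuracy_score(yd_test, tree.predict(Xd_test))
    depth_results.append((depth, train_acc, test_acc))
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Training accuracy keeps rising as the tree gets deeper, so a growing gap between the training and test columns is the usual sign that the extra depth is fitting noise rather than signal.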

In [8]:
# generate classification report
best_params = study.best_params
clf = DecisionTreeClassifier(**best_params, random_state=20)
preds = fit_transform(clf, X_train, y_train, X_test)

print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.77      0.90      0.83       153
           1       0.70      0.49      0.58        78

    accuracy                           0.76       231
   macro avg       0.74      0.69      0.70       231
weighted avg       0.75      0.76      0.74       231



#### Task 2 (30 points): 

From the Bagging and Boosting ensemble methods pick any one algorithm
from each category. Implement both the algorithms using the same data. Use k-fold cross
validation to find the effectiveness of both the models. Comment on the difference/similarity of
the results.

In [9]:
# Bagging
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 1, 10000),
        'max_depth': trial.suggest_int('max_depth', 1, 10)
    }

    clf = RandomForestClassifier(**params, random_state=20)
    acc_list = []
    for train_index, test_index in skf.split(X, y):
        X_train, y_train, X_test, y_test = X[train_index], y[train_index], X[test_index], y[test_index]
        preds = fit_transform(clf, X_train, y_train, X_test)
        acc_list.append(metrics.accuracy_score(y_test, preds))
    return np.mean(acc_list)

search_space = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': np.arange(4, 10).tolist()  # depths 4 through 9, as plain Python ints for GridSampler
}

study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space), direction='maximize')
study.optimize(objective, n_trials=200)
clear_output()

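# print results sorted by accuracy (parameter columns selected by name, followed by the objective value)
results = study.trials_dataframe()
param_cols = [c for c in results.columns if c.startswith('params_')]
results[param_cols + ['value']].sort_values('value', ascending=False).head(15)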

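|   | params_max_depth | params_n_estimators | value |
|---|---|---|---|
| 0 | 4 | 200 | 0.776055 |
| 29 | 7 | 500 | 0.773466 |
| 10 | 5 | 300 | 0.772176 |
| 23 | 5 | 500 | 0.772176 |
| 5 | 5 | 200 | 0.770877 |
| 12 | 6 | 500 | 0.770877 |
| 7 | 4 | 500 | 0.770868 |
| 9 | 8 | 400 | 0.77086 |
| 11 | 5 | 400 | 0.769578 |
| 16 | 9 | 300 | 0.76957 |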


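For the random forest classifier, a bagging method, I tuned two hyperparameters: max depth and n estimators. Across the grid searched (max depth 4 to 9, n estimators 100 to 500) there are no significant differences: the ten best combinations have cross-validation mean accuracies within 0.007 of each other (0.7696 to 0.7761), and no single depth or forest size dominates. The two hyperparameters should therefore be tuned jointly rather than one at a time.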
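The cross-validation scores above come from `StratifiedKFold`, so it may help to see what stratification does to each fold. The sketch below uses a made-up label vector of 500 negatives and 268 positives (an assumption chosen to give a roughly 2:1 class ratio; it is not loaded from the notebook's data) and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 500 negatives followed by 268 positives
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not influence the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf_demo.split(X_demo, y_demo)):
    # count how many samples of each class land in this test fold
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_counts.append(counts)
    print(f"fold {fold}: negatives={counts[0]}, positives={counts[1]}")
```

Every test fold keeps close to the overall class ratio, which makes the per-fold accuracies comparable and their mean a steadier estimate than a single unstratified split would give.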

In [10]:
# generate classification report
best_params = study.best_params
clf = RandomForestClassifier(**best_params, random_state=20)
preds = fit_transform(clf, X_train, y_train, X_test)

print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.77      0.90      0.83       153
           1       0.71      0.46      0.56        78

    accuracy                           0.75       231
   macro avg       0.74      0.68      0.69       231
weighted avg       0.75      0.75      0.74       231



In [11]:
# Boosting
from sklearn.ensemble import GradientBoostingClassifier

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 1, 10000),
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1),
        'max_depth': trial.suggest_int('max_depth', 1, 10)
    }

    clf = GradientBoostingClassifier(**params, random_state=20)
    acc_list = []
    for train_index, test_index in skf.split(X, y):
        X_train, y_train, X_test, y_test = X[train_index], y[train_index], X[test_index], y[test_index]
        preds = fit_transform(clf, X_train, y_train, X_test)
        acc_list.append(metrics.accuracy_score(y_test, preds))
    return np.mean(acc_list)

search_space = {
    'n_estimators': [200, 500, 800, 1000],
    'learning_rate': [5e-3, 8e-3, 1e-2, 3e-2],
    'max_depth': np.arange(2, 8).tolist()  # depths 2 through 7, as plain Python ints for GridSampler
}

study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space), direction='maximize')
study.optimize(objective, n_trials=200)
clear_output()

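# print results sorted by accuracy (parameter columns selected by name, followed by the objective value)
results = study.trials_dataframe()
param_cols = [c for c in results.columns if c.startswith('params_')]
results[param_cols + ['value']].sort_values('value', ascending=False).head(15)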

|   | params_learning_rate | params_max_depth | params_n_estimators | value |
|---|---|---|---|---|
| 73 | 0.005 | 3 | 500 | 0.774807 |
| 11 | 0.01 | 3 | 200 | 0.773474 |
| 94 | 0.005 | 2 | 800 | 0.772209 |
| 3 | 0.008 | 2 | 500 | 0.770911 |
| 0 | 0.01 | 2 | 500 | 0.770894 |
| 54 | 0.008 | 3 | 200 | 0.770877 |
| 64 | 0.005 | 2 | 1000 | 0.769595 |
| 39 | 0.01 | 2 | 800 | 0.769595 |
| 78 | 0.008 | 2 | 1000 | 0.768296 |
| 56 | 0.005 | 3 | 800 | 0.766998 |


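For the gradient boosting classifier, a boosting method, I tuned three hyperparameters: learning rate, max depth, and n estimators. With everything else equal, shallow trees do better than deep ones (every top-ten configuration uses max depth 2 or 3), and small learning rates (0.005 to 0.01) do better than the largest value tried (0.03, which does not appear in the top ten). The best configuration uses 500 estimators, although configurations with 200 to 1000 estimators all appear in the top ten, within 0.008 of the best score.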
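The interplay between learning rate and number of estimators can be inspected without refitting a model for every estimator count. The sketch below is an illustration under assumptions: it uses synthetic `make_classification` data rather than the diabetes set, and the two learning rates are arbitrary demonstration values. It relies on `staged_predict`, which yields the ensemble's predictions after each boosting stage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, flip_y=0.05, random_state=20
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=20
)

staged_acc = {}
for lr in [0.01, 0.3]:  # hypothetical small and large learning rates
    gbc = GradientBoostingClassifier(
        n_estimators=200, learning_rate=lr, max_depth=3, random_state=20
    )
    gbc.fit(Xd_train, yd_train)
    # one accuracy value per boosting stage: stage k uses the first k trees
    staged_acc[lr] = [accuracy_score(yd_test, p) for p in gbc.staged_predict(Xd_test)]
    best_stage = staged_acc[lr].index(max(staged_acc[lr])) + 1
    print(f"learning_rate={lr}: best test accuracy {max(staged_acc[lr]):.3f} at {best_stage} trees")
```

A smaller learning rate generally needs more trees to reach the same fit, so these per-stage curves are a cheap way to judge whether `n_estimators` is too low or already past the point of diminishing returns for a given rate.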

In [12]:
# generate classification report
best_params = study.best_params
clf = GradientBoostingClassifier(**best_params, random_state=20)
preds = fit_transform(clf, X_train, y_train, X_test)

print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.78      0.88      0.83       153
           1       0.69      0.51      0.59        78

    accuracy                           0.76       231
   macro avg       0.74      0.70      0.71       231
weighted avg       0.75      0.76      0.75       231



#### Task 3 (40 points): 

Compare the effectiveness of the three models implemented above. Clearly
describe the metric you are using for comparison. Describe (with examples) why this
metric (or these metrics) is suited to the problem at hand. How would the choice of a different
metric impact your results? Can you demonstrate that?

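Of the three models shown above (decision tree, random forest, and gradient boosting classifiers), the gradient boosting classifier achieves the best result on the held-out test set, though by a narrow margin: accuracy 0.76 and macro F1 0.71, against 0.76 and 0.70 for the decision tree and 0.75 and 0.69 for the random forest. The cross-validation mean accuracies of the two ensembles are nearly identical (0.776 for the random forest, 0.775 for gradient boosting). I use test-set accuracy and cross-validation mean accuracy as the metrics to evaluate model performance. Accuracy is the fraction of all predictions that are correct; I chose it because overall correctness matters for this problem and the classes are only moderately imbalanced (153 negatives to 78 positives in the test set, roughly 2:1).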

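Other metrics weight errors differently. The F1 score for the positive class combines precision and recall and ignores true negatives, so it penalizes missed diabetes cases more heavily than accuracy does; ROC AUC measures how well the predicted scores rank positives above negatives across all decision thresholds. In the reports above, every model has a positive-class recall of only about 0.5, which accuracy alone does not reveal. For these three models, however, the ranking by positive-class F1 (0.59, 0.58, 0.56) matches the ranking by accuracy, so switching to F1 would not change the model selected. ROC AUC was not computed here, so the same outcome is expected rather than demonstrated for that metric.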
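To make the effect of the metric choice concrete, the sketch below computes accuracy and positive-class F1 from confusion-matrix counts. The baseline counts follow from the test-set support in the reports (153 negatives, 78 positives). The decision tree counts are an assumption: they are reconstructed from the rounded values in its classification report, not read from the fitted model. `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy plus precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Baseline that always predicts class 0 on the test split (153 negatives, 78 positives)
baseline = binary_metrics(tp=0, fp=0, fn=78, tn=153)

# Decision tree counts reconstructed from the rounded report (assumption)
tree = binary_metrics(tp=38, fp=16, fn=40, tn=137)

for name, m in [("always-negative baseline", baseline), ("decision tree", tree)]:
    print(f"{name}: accuracy={m['accuracy']:.3f}, recall={m['recall']:.3f}, f1={m['f1']:.3f}")
```

The always-negative baseline reaches about 0.66 accuracy while its positive-class F1 is 0, so the gap between a useless model and the tuned tree looks far larger under F1 than under accuracy.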