<a href="https://colab.research.google.com/github/michaeljf00/rpicsbot/blob/main/homework2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Homework 2: Ensemble Learning**

**Task 1(30 points)**: Implement a Decision Tree Classifier for your classification problem. You may use a built-in package to implement your classifier. Try modifying one or more of the input parameters and describe what changes you notice in your results. Clearly describe how these factors are affecting your output.

In [130]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [131]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

SEED = 27

In [145]:
df = pd.read_csv('drive/MyDrive/cardio_train.csv', sep = ',', header = 0)
new_df = df.drop(["id"], axis=1)
X = new_df.drop("cardio", axis=1).to_numpy()
Y = new_df["cardio"].to_numpy() 

In [146]:
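# Note: X and Y were built above from all rows, so this truncation
# only affects new_df (used for display), not the models below.
new_df = new_df.head(10000)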
new_df.head()

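| | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 52 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 48 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |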


In [147]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=SEED) 

In [148]:
treeModel1 =  DecisionTreeClassifier(criterion = 'entropy', random_state=SEED)
treeModel1.fit(X_train, Y_train)

DecisionTreeClassifier(criterion='entropy', random_state=27)

In [149]:
treeModel1.score(X_train, Y_train)

0.9769333333333333

In [150]:
treeModel1.score(X_test, Y_test)

0.6352

In [139]:
treeModel2 = DecisionTreeClassifier(criterion = 'gini', min_samples_split=7, random_state=SEED)
treeModel2.fit(X_train, Y_train)

DecisionTreeClassifier(min_samples_split=7, random_state=27)

In [151]:
treeModel2.score(X_train, Y_train)

0.8857904761904762

In [152]:
treeModel2.score(X_test, Y_test)

0.6385714285714286

I created two decision tree models with different values for the parameters `criterion` and `min_samples_split`. The first model used 'entropy' as the `criterion` (the measure of split quality) and kept the default minimum of 2 samples required to split an internal node. The second model used 'gini' as the `criterion` and required a minimum of 7 samples to split an internal node. On the training set, the second model had a lower accuracy than the first (0.886 versus 0.977). On the test set, the second model's accuracy was slightly higher (0.639 versus 0.635).
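The two-model comparison above can be extended to a sweep over `min_samples_split`. The following is only a sketch: it uses a synthetic data set from `make_classification` as a stand-in for the cardio data (an assumption, since the CSV is not available here), and the names `X_demo`, `Y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cardio data (assumption: NOT the real data set).
X_demo, Y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=5, flip_y=0.2, random_state=27
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=27
)

# Train one tree per setting and record (train accuracy, test accuracy).
results = {}
for mss in (2, 7, 50, 200):
    tree = DecisionTreeClassifier(
        criterion="gini", min_samples_split=mss, random_state=27
    )
    tree.fit(X_tr, Y_tr)
    results[mss] = (tree.score(X_tr, Y_tr), tree.score(X_te, Y_te))
    print(f"min_samples_split={mss:>3}  "
          f"train={results[mss][0]:.3f}  test={results[mss][1]:.3f}")
```

Comparing the train and test columns side by side shows how far each tree's training accuracy sits above its held-out accuracy.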

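For a binary problem, Gini impurity ranges over [0, 0.5] while entropy ranges over [0, 1]. In practice the two criteria usually choose similar splits, so the change of criterion probably accounts for little of the difference. Raising `min_samples_split` prevents the tree from splitting any node with fewer than 7 samples, which limits how far it can grow. As a result, the second model captures fewer fine-grained patterns in the training set, which lowers training accuracy but slightly improves test accuracy. The large gap between training and test accuracy in both models indicates that both are still overfitting.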
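To make the difference in range between the two criteria concrete, here is a small sketch of both impurity measures for a binary node. The helper names `gini` and `entropy` are hypothetical and are not part of scikit-learn's public API.

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary node whose positive-class fraction is p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy in bits of a binary node; taken as 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

# Both measures are 0 for a pure node and peak at p = 0.5.
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both curves have the same shape (zero for pure nodes, maximal for a 50/50 node), which is why the two criteria tend to rank candidate splits similarly.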

**Task 2(30 points)**: From the Bagging and Boosting ensemble methods pick any one algorithm from each category. Implement both the algorithms using the same data. Use k-fold cross validation to find the effectiveness of both the models. Comment on the difference/similarity of the results.

In [153]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from numpy import mean, std

In [154]:
def kFoldCrossValAccuracy(model, X, Y, random_state=SEED):
  """Print mean and standard deviation of accuracy over 10 folds repeated 3 times (30 scores)."""
  cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=random_state)
  n_scores = cross_val_score(model, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
  print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In [155]:
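adaBoostModel = AdaBoostClassifier(random_state=SEED)
# Fit on the full data set. Note that this includes the rows in X_test,
# so scoring this fitted model on X_test is not a held-out estimate.
adaBoostModel.fit(X, Y)
print("~~Adaboost~~")
# cross_val_score clones the estimator per fold, so the fit above does not affect this result.
kFoldCrossValAccuracy(adaBoostModel, X, Y)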

~~Adaboost~~
Accuracy: 0.730 (0.005)


In [156]:
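randomForestModel = RandomForestClassifier(random_state=SEED)
# Fit on the full data set. Note that this includes the rows in X_test,
# so scoring this fitted model on X_test is not a held-out estimate.
randomForestModel.fit(X, Y)
print("~~Random Forest~~")
# cross_val_score clones the estimator per fold, so the fit above does not affect this result.
kFoldCrossValAccuracy(randomForestModel, X, Y)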

~~Random Forest~~
Accuracy: 0.706 (0.005)


The boosting method I chose was the AdaBoost classifier, and the bagging method was random forest. Performance was similar for the two implementations, with AdaBoost ahead by 0.024 in mean accuracy (0.730 versus 0.706). Both have the same low standard deviation of 0.005 across the 30 folds, so the gap is almost five times the fold-to-fold spread. Both models use their default parameters, so the performance of both can most likely be improved by tuning them.
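One way to explore the parameter adjustment mentioned above is a small grid search. This is a sketch only: the synthetic data from `make_classification` stands in for the cardio data (an assumption), and the grid values are arbitrary illustrations rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cardio data (assumption: NOT the real data set).
X_demo, Y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=27
)

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

# 5-fold cross-validated accuracy for every combination in the grid.
search = GridSearchCV(
    AdaBoostClassifier(random_state=27), param_grid, scoring="accuracy", cv=5
)
search.fit(X_demo, Y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern would apply to `RandomForestClassifier` with a grid over parameters such as `n_estimators` or `max_depth`.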

**Task 3(40 points)**: Compare the effectiveness of the three models implemented above. Clearly describe the metric you are using for comparison. Describe (with examples) why this metric (or metrics) is suited/appropriate for the problem at hand. How would a choice of a different metric impact your results? Can you demonstrate that?

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

In [None]:
def analyzeData(model, X_test=X_test, Y_test=Y_test):
  Y_pred = model.predict(X_test)
  print(f"Confusion Matrix: {confusion_matrix(Y_test, Y_pred)}")
  print(f"Accuracy: {accuracy_score(Y_test, Y_pred)}")
  print(f"Precision: {precision_score(Y_test, Y_pred)}\n")

In [None]:
# Default Decision Tree Classifier
print("Default Decision Tree Classifier: ")
analyzeData(treeModel1)

# AdaBoost Model
print("AdaBoost Model:")
analyzeData(adaBoostModel)

# Random Forest Model
print("Random Forest Model:")
analyzeData(randomForestModel)

Default Decision Tree Classifier: 
Confusion Matrix: [[5698 3017]
 [3367 5418]]
Accuracy: 0.6352
Precision: 0.6423236514522822

AdaBoost Model:
Confusion Matrix: [[6976 1739]
 [2967 5818]]
Accuracy: 0.7310857142857143
Precision: 0.7698822283975122

Random Forest Model:
Confusion Matrix: [[8571  144]
 [ 294 8491]]
Accuracy: 0.9749714285714286
Precision: 0.98332368268674



In [157]:
# Default Decision Tree K-Fold Cross Validation
kFoldCrossValAccuracy(treeModel1, X, Y)

Accuracy: 0.640 (0.005)


I used the confusion matrix and compared the accuracy and precision of each model. The confusion matrix is a suitable basis for judging the effectiveness of each model because it shows whether both classes are predicted equally well or one class is being neglected. Compared to plain classification accuracy, it reveals which types of errors are being made (false positives versus false negatives). In the medical field, it is important to understand which errors are being made and to minimize them as much as possible, which makes this metric appropriate here, since the goal is to predict cardiovascular disease.
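As a worked example of reading error types from a confusion matrix, the sketch below recomputes accuracy and precision from the AdaBoost matrix printed above and adds recall, which the notebook does not report. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# AdaBoost confusion matrix as printed above.
# Rows are true classes, columns are predicted classes (scikit-learn layout).
cm = np.array([[6976, 1739],
               [2967, 5818]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, the fraction that are correct
recall = tp / (tp + fn)      # of the actual positives, the fraction that are found

print(f"accuracy={accuracy:.4f}  precision={precision:.4f}  recall={recall:.4f}")
```

Recall comes out at about 0.66, lower than the precision of about 0.77: 2967 of the 8785 actual disease cases are predicted as healthy. For a screening problem those false negatives are arguably the costlier error, so ranking models by recall could lead to a different choice than ranking by precision.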

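Looking at the three models on the test split, both boosting and bagging appear beneficial. AdaBoost (accuracy 0.731, precision 0.770) outperformed the decision tree (accuracy 0.635, precision 0.642), and the Random Forest scored highest on both metrics (accuracy 0.975, precision 0.983). These numbers are not directly comparable, however: `adaBoostModel` and `randomForestModel` were fit on the full `X, Y`, which includes the rows in `X_test`, while `treeModel1` was fit on `X_train` only. The Random Forest score in particular largely reflects rows it had already seen during training.

If we instead base the comparison on the cross-validation scores, the ranking changes. The values from Task 2 and the cell above show AdaBoost (0.730) ahead of the Random Forest (0.706), with the decision tree last (0.640). Because every cross-validation fold is scored on rows held out from fitting, this ordering is the more reliable estimate of how the models generalize.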
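One likely source of a discrepancy like this is that the two ensemble models were fit on the full `X, Y`, which contains the rows later used as `X_test`. The sketch below illustrates the effect on synthetic data (an assumption, as the cardio CSV is not available here). The names `leaky` and `honest` are hypothetical labels for a model fit on all rows and a model fit on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardio data (assumption: NOT the real data set).
X_demo, Y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=5, flip_y=0.2, random_state=27
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=27
)

# "leaky" sees the test rows during fitting; "honest" sees only the training split.
leaky = RandomForestClassifier(n_estimators=50, random_state=27).fit(X_demo, Y_demo)
honest = RandomForestClassifier(n_estimators=50, random_state=27).fit(X_tr, Y_tr)

leaky_score = leaky.score(X_te, Y_te)
honest_score = honest.score(X_te, Y_te)
print(f"fit on all rows:    test accuracy = {leaky_score:.3f}")
print(f"fit on train split: test accuracy = {honest_score:.3f}")
```

Refitting the notebook's ensemble models on `X_train, Y_train` before calling `analyzeData` would put all three models on the same footing.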