## Tree Ensemble

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

RANDOM_STATE=55

### Dataset Information

- `Age`: Age of the patient [years]

- `Sex`: Sex of the patient [M: Male, F: Female]

- `ChestPainType`: Chest Pain Type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

- `RestingBP`: Resting Blood Pressure [mm Hg]

- `Cholesterol`: Serum Cholesterol [mg/dl]

- `FastingBS`: Fasting Blood Sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

- `RestingECG`: Resting Electro Cardiogram Results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

- `MaxHR`: Maximum Heart Rate achieved [Numeric value between 60 and 202]

- `ExerciseAngina`: Exercise-Induced Angina [Y: Yes, N: No]

- `Oldpeak`: ST depression induced by exercise relative to rest [Numeric value]

- `ST_Slope`: the Slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

- `HeartDisease`: Output class [1: Heart disease, 0: Normal]

In [None]:
heart_df=pd.read_csv("heart.csv")

In [None]:
heart_df.shape

In [None]:
heart_df.head()

### One-hot Encoding

In [None]:
# Categorical features to one-hot encode (some binary, some with several levels)
cat_bin_features=['Sex', 'ChestPainType', 'RestingECG',
                  'ExerciseAngina', 'ST_Slope']

In [None]:
# Replaces each column listed in 'columns' with its one-hot encoded columns; all other columns are kept as they are
heart_ohe_df=pd.get_dummies(data=heart_df, prefix=cat_bin_features, columns=cat_bin_features)

In [None]:
heart_ohe_df.shape

In [None]:
heart_ohe_df.head()
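
To see what `pd.get_dummies` does with the `columns` and `prefix` arguments, here is a minimal sketch on a toy frame. The three rows are made up for illustration and are not taken from `heart.csv`.

```python
import pandas as pd

# Toy rows (hypothetical values, for illustration only).
toy_df = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

toy_cat_features = ["Sex", "ST_Slope"]
toy_ohe_df = pd.get_dummies(data=toy_df, prefix=toy_cat_features, columns=toy_cat_features)

print(list(toy_ohe_df.columns))
# ['Age', 'Sex_F', 'Sex_M', 'ST_Slope_Flat', 'ST_Slope_Up']
```

Each categorical column is replaced by one indicator column per category, named `<prefix>_<category>`, while `Age` passes through unchanged.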

In [None]:
# Removing our target variable (equality test, so only the exact column name is excluded)
features=[x for x in heart_ohe_df.columns if x != "HeartDisease"]

In [None]:
len(features)
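
When filtering out the target by name, note that `x not in "HeartDisease"` is a substring test against the string, whereas `x != "HeartDisease"` compares whole names. A small sketch of the difference, using a hypothetical column called `Heart` that does not exist in this dataset:

```python
# "Heart" is a hypothetical column name used only to show the difference.
columns = ["Age", "Heart", "HeartDisease"]

# Substring test: drops every name that appears inside "HeartDisease".
substring_filter = [c for c in columns if c not in "HeartDisease"]

# Equality test: drops only the exact target name.
equality_filter = [c for c in columns if c != "HeartDisease"]

print(substring_filter)  # ['Age']
print(equality_filter)   # ['Age', 'Heart']
```

With the column names in this dataset both filters happen to give the same result, but the equality test is the safer habit.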

### Splitting the Dataset

In [None]:
# Splits arrays or matrices into random training and test subsets
x_train, x_val, y_train, y_val=train_test_split(
  heart_ohe_df[features],
  heart_ohe_df["HeartDisease"],
  train_size=0.8,
  random_state=RANDOM_STATE
)

In [None]:
print(f"Number of training samples: {len(x_train)}")
print(f"Number of validation samples: {len(x_val)}")
print(f"Target proportion: {sum(y_train)/len(y_train)}")

### Decision Tree Model

In [None]:
min_samples_split_list=[2, 10, 30, 50, 100, 200, 300, 700]
max_depth_list=[1, 2, 3, 4, 8, 16, 32, 64, None]

In [None]:
accuracy_list_train=[]
accuracy_list_validation=[]

for min_samples_split in min_samples_split_list:
  model=DecisionTreeClassifier(min_samples_split=min_samples_split,
                               random_state=RANDOM_STATE).fit(x_train, y_train)
  
  predictions_train=model.predict(x_train)
  predictions_val=model.predict(x_val)
  accuracy_train=accuracy_score(predictions_train, y_train)
  accuracy_val=accuracy_score(predictions_val, y_val)

  accuracy_list_train.append(accuracy_train)
  accuracy_list_validation.append(accuracy_val)

plt.title("Training x Validation Metrics")
plt.xlabel("Minimum Samples Split")
plt.ylabel("Accuracy")
plt.xticks(range(len(min_samples_split_list)), labels=min_samples_split_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_validation)
plt.legend(['Training', 'Validation'])

Increasing `min_samples_split` reduces overfitting.

- Increasing `min_samples_split` from 10 to 30, and from 30 to 50, does not improve validation accuracy, but it brings training accuracy closer to it, which indicates a reduction in overfitting.
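
One way to see why a larger `min_samples_split` regularizes the tree is to count its leaves. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not `heart.csv`), so the exact numbers will differ from the plot above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=55)

leaves_by_split = {}
train_acc_by_split = {}
for min_samples_split in [2, 50, 200]:
    tree = DecisionTreeClassifier(min_samples_split=min_samples_split,
                                  random_state=55).fit(X_demo, y_demo)
    # A node with fewer samples than min_samples_split is never split further.
    leaves_by_split[min_samples_split] = tree.get_n_leaves()
    train_acc_by_split[min_samples_split] = accuracy_score(y_demo, tree.predict(X_demo))
    print(f"min_samples_split={min_samples_split}: "
          f"{leaves_by_split[min_samples_split]} leaves, "
          f"training accuracy {train_acc_by_split[min_samples_split]:.3f}")
```

With `min_samples_split=2` the tree keeps splitting until it memorizes the training set; larger values stop the splitting earlier and leave a smaller tree.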

In [None]:
accuracy_list_train=[]
accuracy_list_validation=[]

for max_depth in max_depth_list:
  model=DecisionTreeClassifier(max_depth=max_depth,
                               random_state=RANDOM_STATE).fit(x_train, y_train)
  
  predictions_train=model.predict(x_train)
  predictions_val=model.predict(x_val)
  accuracy_train=accuracy_score(predictions_train, y_train)
  accuracy_val=accuracy_score(predictions_val, y_val)

  accuracy_list_train.append(accuracy_train)
  accuracy_list_validation.append(accuracy_val)

plt.title("Training x Validation Metrics")
plt.xlabel("Maximum Depth")
plt.ylabel("Accuracy")
plt.xticks(range(len(max_depth_list)), labels=max_depth_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_validation)
plt.legend(['Training', 'Validation'])

We can see that, in general, reducing `max_depth` helps reduce overfitting.

- Reducing `max_depth` from 8 to 4 raises validation accuracy and brings it closer to training accuracy, while training accuracy drops significantly.

- Validation accuracy peaks at `max_depth=4`.

- When `max_depth` is smaller than 3, both training and validation accuracy decrease. The tree cannot make enough splits to distinguish positives from negatives (the model is underfitting the training set).

- When `max_depth` is too high (>= 5), validation accuracy decreases while training accuracy increases, indicating that the model is overfitting the training set.
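
The same train/validation comparison can be wrapped in a small helper and used to pick the value with the highest validation accuracy. This is only a sketch: `sweep_hyperparameter` is a hypothetical helper name, and the data is synthetic rather than the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def sweep_hyperparameter(model_cls, param_name, values, X_tr, y_tr, X_va, y_va, **fixed):
    """Fit one model per value and return (train_accuracies, val_accuracies)."""
    train_acc, val_acc = [], []
    for value in values:
        model = model_cls(**{param_name: value}, **fixed).fit(X_tr, y_tr)
        train_acc.append(accuracy_score(y_tr, model.predict(X_tr)))
        val_acc.append(accuracy_score(y_va, model.predict(X_va)))
    return train_acc, val_acc


# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=55)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=55)

depths = [1, 2, 4, 8, None]
train_acc, val_acc = sweep_hyperparameter(DecisionTreeClassifier, "max_depth", depths,
                                          X_tr, y_tr, X_va, y_va, random_state=55)

# Index of the depth with the highest validation accuracy (first one on ties).
best_index = max(range(len(depths)), key=lambda i: val_acc[i])
print(f"Best max_depth on the validation split: {depths[best_index]}")
```

A large gap between `train_acc` and `val_acc` at a given depth points to overfitting; low values for both point to underfitting.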

In [None]:
decision_tree_model=DecisionTreeClassifier(min_samples_split=50,
                                           max_depth=4,
                                           random_state=RANDOM_STATE).fit(x_train, y_train)

In [None]:
print(f"Training Metrics:\n\tAccuracy Score: {accuracy_score(decision_tree_model.predict(x_train), y_train)}")
print(f"Validation Metrics:\n\tAccuracy Score: {accuracy_score(decision_tree_model.predict(x_val), y_val)}")

### Random Forest Model

In [None]:
max_depth_list=[2, 4, 8, 16, 32, 64, None]
n_estimators_list=[10, 50, 100, 500]

In [None]:
accuracy_list_train=[]
accuracy_list_validation=[]

for min_samples_split in min_samples_split_list:
  model=RandomForestClassifier(min_samples_split=min_samples_split,
                               random_state=RANDOM_STATE).fit(x_train, y_train)
  
  predictions_train=model.predict(x_train)
  predictions_val=model.predict(x_val)
  accuracy_train=accuracy_score(predictions_train, y_train)
  accuracy_val=accuracy_score(predictions_val, y_val)

  accuracy_list_train.append(accuracy_train)
  accuracy_list_validation.append(accuracy_val)

plt.title("Training x Validation Metrics")
plt.xlabel("Minimum Samples Split")
plt.ylabel("Accuracy")
plt.xticks(range(len(min_samples_split_list)), labels=min_samples_split_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_validation)
plt.legend(['Training', 'Validation'])

In [None]:
accuracy_list_train=[]
accuracy_list_validation=[]

for max_depth in max_depth_list:
  model=RandomForestClassifier(max_depth=max_depth,
                               random_state=RANDOM_STATE).fit(x_train, y_train)
  
  predictions_train=model.predict(x_train)
  predictions_val=model.predict(x_val)
  accuracy_train=accuracy_score(predictions_train, y_train)
  accuracy_val=accuracy_score(predictions_val, y_val)

  accuracy_list_train.append(accuracy_train)
  accuracy_list_validation.append(accuracy_val)

plt.title("Training x Validation Metrics")
plt.xlabel("Maximum Depth")
plt.ylabel("Accuracy")
plt.xticks(range(len(max_depth_list)), labels=max_depth_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_validation)
plt.legend(['Training', 'Validation'])

In [None]:
accuracy_list_train=[]
accuracy_list_validation=[]

for n_estimators in n_estimators_list:
  model=RandomForestClassifier(n_estimators=n_estimators,
                               random_state=RANDOM_STATE).fit(x_train, y_train)
  
  predictions_train=model.predict(x_train)
  predictions_val=model.predict(x_val)
  accuracy_train=accuracy_score(predictions_train, y_train)
  accuracy_val=accuracy_score(predictions_val, y_val)

  accuracy_list_train.append(accuracy_train)
  accuracy_list_validation.append(accuracy_val)

plt.title("Training x Validation Metrics")
plt.xlabel("Number of Estimators")
plt.ylabel("Accuracy")
plt.xticks(range(len(n_estimators_list)), labels=n_estimators_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_validation)
plt.legend(['Training', 'Validation'])

In [None]:
random_forest_model=RandomForestClassifier(n_estimators=100,
                                           max_depth=16,
                                           min_samples_split=10,
                                           random_state=RANDOM_STATE).fit(x_train, y_train)

In [None]:
print(f"Training Metrics:\n\tAccuracy Score: {accuracy_score(random_forest_model.predict(x_train), y_train)}")
print(f"Validation Metrics:\n\tAccuracy Score: {accuracy_score(random_forest_model.predict(x_val), y_val)}")

### XGBoost Model

Boosting methods also train several trees, but instead of training them independently of one another, they fit the trees one after the other, each new tree trying to reduce the error left by the trees before it.

- The learning rate scales how much each new tree contributes to the ensemble. It plays the role of the step size in the gradient-descent-like procedure that XGBoost uses internally to minimize the error at each training step.
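
The sequential idea can be sketched in a few lines. This is plain gradient boosting with squared error on a toy regression problem, not XGBoost's actual implementation (which also uses regularization and second-order gradient information); the data, `n_rounds`, and tree depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (synthetic, for illustration only).
rng = np.random.default_rng(55)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # Residuals are the negative gradient of squared error (up to a constant factor).
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=55).fit(X_demo, residuals)
    # Each tree only nudges the prediction by a fraction set by the learning rate.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

A smaller learning rate makes each step more cautious, so more rounds are typically needed to reach the same training error.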

XGBoost can take in an evaluation dataset of the form `(X_val, y_val)`.

- On each iteration, it measures the cost (or evaluation metric) on the evaluation datasets.

- Once the cost (or metric) stops improving for a number of rounds (called `early_stopping_rounds`), the training will stop.

- More iterations lead to more estimators, and more estimators can result in overfitting.

- By stopping once the validation metric no longer improves, we can limit the number of estimators created, and reduce overfitting.

- `eval_set=[(x_train_eval, y_train_eval)]`: We must pass a list to `eval_set`, because it can hold several `(X, y)` tuples, one per evaluation set.

- `early_stopping_rounds`: This parameter helps to stop the model training if its evaluation metric is no longer improving on the validation set.
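
The stopping rule described above can be sketched in plain Python. The function name `early_stopping_round` and the loss values are hypothetical; this only illustrates the patience logic and is not XGBoost's internal code.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a list of per-round validation losses.

    Training stops once `patience` consecutive rounds pass without a new best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # The loss kept improving often enough: all rounds were used.
    return best_round, len(val_losses) - 1


# Made-up validation losses, one per boosting round.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50]
best_round, stop_round = early_stopping_round(losses, patience=3)
print(best_round, stop_round)  # 5 8
```

Here the best loss is reached at round 5, and training stops at round 8 after three rounds without improvement.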

In [None]:
# Use the first 80% of the training rows to fit and the remaining 20% as the evaluation set
n=int(len(x_train)*0.8)

In [None]:
x_train_fit, x_train_eval, y_train_fit, y_train_eval=x_train[:n], x_train[n:], y_train[:n], y_train[n:]

In [None]:
xgb_model=XGBClassifier(n_estimators=500, learning_rate=0.1, 
                        verbosity=1, random_state=RANDOM_STATE)

xgb_model.fit(x_train_fit, y_train_fit,
              eval_set=[(x_train_eval, y_train_eval)])

In [None]:
print(f"Training Metrics:\n\tAccuracy Score: {accuracy_score(xgb_model.predict(x_train), y_train)}")
print(f"Validation Metrics:\n\tAccuracy Score: {accuracy_score(xgb_model.predict(x_val), y_val)}")