<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = Justin Pham   
* **UCID** = 30139323
***

<font color='Blue'>
In this assignment, you will put together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" within Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on the two files and you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **27**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [50]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

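|     | age | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|:---:|:---:|:---:|:---:|:--------:|:-----:|:---:|:-------:|:-------:|:-----:|:-------:|:-----:|:---:|:----:|:---:|
|  0  | 28  |  1  |  2  |  130.0   | 132.0 | 0.0 |   2.0   |  185.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  1  | 29  |  1  |  2  |  120.0   | 243.0 | 0.0 |   0.0   |  160.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  2  | 29  |  1  |  2  |  140.0   |  NaN  | 0.0 |   0.0   |  170.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  3  | 30  |  0  |  1  |  170.0   | 237.0 | 0.0 |   1.0   |  170.0  |  0.0  |   0.0   |  NaN  | NaN | 6.0  |  0  |
|  4  | 31  |  0  |  2  |  100.0   | 219.0 | 0.0 |   1.0   |  150.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
| ... | ... | ... | ... |   ...    |  ...  | ... |   ...   |   ...   |  ...  |   ...   |  ...  | ... | ...  | ... |
| 289 | 52  |  1  |  4  |  160.0   | 331.0 | 0.0 |   0.0   |  94.0   |  1.0  |   2.5   |  NaN  | NaN | NaN  |  1  |
| 290 | 54  |  0  |  3  |  130.0   | 294.0 | 0.0 |   1.0   |  100.0  |  1.0  |   0.0   |  2.0  | NaN | NaN  |  1  |
| 291 | 56  |  1  |  4  |  155.0   | 342.0 | 1.0 |   0.0   |  150.0  |  1.0  |   3.0   |  2.0  | NaN | NaN  |  1  |
| 292 | 58  |  0  |  2  |  180.0   | 393.0 | 0.0 |   0.0   |  110.0  |  1.0  |   1.0   |  2.0  | NaN | 7.0  |  1  |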


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

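- **1.1** Columns with more than 60% of their values missing carry too little real information to be useful. Imputing them would mean that most of their values are estimated rather than observed, which can make the model less accurate and harder to tune and train. Dropping them is therefore a reasonable choice.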

In [51]:
# 1.1
# Drop the columns in which more than 60% of the values are missing
missing_fraction = df.isna().mean()
columns_to_drop = missing_fraction[missing_fraction > 0.6].index
df = df.drop(columns=columns_to_drop)

# Number of remaining missing values per column
print(df.isna().sum())
# Number of unique values per column
print(df.nunique())

age          0
sex          0
cp           0
trestbps     1
chol        23
fbs          8
restecg      1
thalach      1
exang        1
oldpeak      0
num          0
dtype: int64
age          38
sex           2
cp            4
trestbps     31
chol        153
fbs           2
restecg       3
thalach      71
exang         2
oldpeak      10
num           2
dtype: int64
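
The relationship between a "more than 60% missing" rule and the `thresh` argument of `dropna` is easy to get backwards, because `thresh` counts the minimum number of non-missing values a column must have to be kept. The sketch below uses a small made-up frame (the column names `full`, `half`, and `sparse` are hypothetical) to show both ways of expressing the rule.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 10 rows, columns with 0%, 50% and 70% missing
toy = pd.DataFrame({
    "full": np.arange(10, dtype=float),
    "half": [1.0] * 5 + [np.nan] * 5,
    "sparse": [1.0] * 3 + [np.nan] * 7,
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Option 1: drop the columns whose missing fraction exceeds 60%
to_drop = missing_fraction[missing_fraction > 0.6].index
kept = toy.drop(columns=to_drop)

# Option 2: keep the columns with at least 40% non-missing values
min_non_missing = int(np.ceil(0.4 * len(toy)))
kept_via_thresh = toy.dropna(axis=1, thresh=min_non_missing)

print(list(kept.columns))
print(list(kept_via_thresh.columns))
```

Both options keep `full` and `half` and drop `sparse` here. Passing `0.6 * len(toy)` as `thresh` would instead require 60% of the values to be present, which is a stricter rule than the one in the question.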


<font color='Green'><b>Answer:</b></font>

- **1.2** I chose `SimpleImputer` because it is fast and easy to use. Each missing value is filled in with the median of its column, which is less sensitive to outliers than the mean. Because the imputed values are estimates rather than real measurements, this may result in our model having slightly lower accuracy.

In [52]:
# 1.2
# Add necessary code here.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df.isna().sum())

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
num         0
dtype: int64
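
A single median strategy treats every column as continuous. One possible refinement, sketched below on made-up values, is to impute continuous columns with the median and binary columns with the most frequent value. The column names `chol` and `fbs` come from the dataset, but the numbers and the grouping are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical values; 1000.0 is an artificial outlier
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 1000.0],
    "fbs": [0.0, 0.0, np.nan, 1.0],
})

# Different imputation strategy per column group
imputer = ColumnTransformer([
    ("continuous", SimpleImputer(strategy="median"), ["chol"]),
    ("binary", SimpleImputer(strategy="most_frequent"), ["fbs"]),
])

# Output columns follow the transformer order: chol, then fbs
filled = imputer.fit_transform(toy)
print(filled)
```

The missing `chol` value is filled with the median of the observed values (240), which the outlier does not shift the way it would shift the mean, and the missing `fbs` value is filled with the most common class (0).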


<font color='Green'><b>Answer:</b></font>

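- **1.3** 

numerical_features: 

```['age', 'cp', 'trestbps', 'chol', 'restecg', 'thalach', 'oldpeak']```

binary_features:

```['sex', 'fbs', 'exang']```

The numerical features are measured on different scales, so they are standardized with `StandardScaler`. The binary features only take the values 0 and 1, so they are passed through unchanged. No feature was detected as categorical because every column has a numeric dtype. As a result, `cp` and `restecg`, which are category codes, were grouped with the numerical features.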

In [53]:
# 1.3
# Add necessary code here.
y = df['num']
X = df.drop(columns='num')

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
# get features from X where column is numerical
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
# Binary features is the features in numerical features that have only 2 unique values
binary_features = X[numerical_features].nunique()[X[numerical_features].nunique() == 2].index
# Drop the binary features from numerical_features
numerical_features = numerical_features.drop(binary_features)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features),
        ("pt", "passthrough", binary_features)
    ]).set_output(transform ="pandas")
print("numerical_features: ", numerical_features)
print("categorical_features: ", categorical_features)
print("binary_features: ", binary_features)
X.nunique()

numerical_features:  Index(['age', 'cp', 'trestbps', 'chol', 'restecg', 'thalach', 'oldpeak'], dtype='object')
categorical_features:  Index([], dtype='object')
binary_features:  Index(['sex', 'fbs', 'exang'], dtype='object')


age          38
sex           2
cp            4
trestbps     31
chol        153
fbs           2
restecg       3
thalach      71
exang         2
oldpeak      10
dtype: int64
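
Selecting features by dtype finds no categorical columns here, because every column is numeric. A possible alternative is to list the groups by hand. The sketch below assumes that `cp` (chest pain type) and `restecg` (resting ECG result) are category codes and one-hot encodes them. The toy rows and the name `explicit_preprocessor` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed grouping of the ten remaining features
numerical = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = ["cp", "restecg"]
binary = ["sex", "fbs", "exang"]

# Hypothetical rows that cover all 4 cp codes and all 3 restecg codes
toy = pd.DataFrame({
    "age": [28, 54, 61, 45], "sex": [1, 0, 1, 1], "cp": [2, 4, 3, 1],
    "trestbps": [130, 150, 140, 120], "chol": [132, 294, 250, 210],
    "fbs": [0, 1, 0, 0], "restecg": [2, 0, 1, 0],
    "thalach": [185, 100, 150, 160], "exang": [0, 1, 1, 0],
    "oldpeak": [0.0, 2.5, 1.0, 0.0],
})

explicit_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical),                            # rescale continuous features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),    # one column per category code
    ("pt", "passthrough", binary),                                   # 0/1 features stay unchanged
])

transformed = explicit_preprocessor.fit_transform(toy)
print(transformed.shape)
```

With this grouping the output has 5 scaled columns, 4 + 3 one-hot columns, and 3 passthrough columns. Switching the notebook to this grouping would change the downstream scores, so the reported results would need to be regenerated.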

# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. Interpret the results and compare them with the baseline scores from the previous assignment. **(5 Points)**

- **2.4** Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how the stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what the possible reasons for that are. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1** 

The three models I have chosen are a decision tree classifier, a support vector classifier, and a KNN classifier.


I chose the decision tree classifier because decision trees work for both classification and regression problems. This makes them flexible for different types of data.

Strengths:
1. Works for both classification and regression, making it very flexible.
1. Can be combined into forests for even greater accuracy and performance.
1. Easy to interpret and visualize, making it clear how decisions are made.

Weaknesses:
1. Prone to overfitting if not properly tuned.
1. Not very powerful on its own; usually combined into forests.
1. Sensitive to changes in the training data, potentially leading to unstable predictions for small variations.

I chose the support vector classifier because an SVC can capture non-linear relationships between features through its kernel functions.

Strengths:
1. Powerful for high-dimensional data and works well with small datasets.
1. Can handle complex non-linear relationships between features using kernel functions.

Weaknesses:
1. Hard to explain and understand.
1. Can be computationally expensive for very large datasets due to the optimization process.
1. Choosing the right kernel function and its hyperparameters is crucial for performance and can involve trial and error.


I chose the KNN classifier because it is simple to explain and understand.

Strengths:
1. Easy to explain and understand.
1. Makes no assumptions about the underlying data distribution.
1. Needs little tuning; the main hyperparameter is the number of neighbors.


Weaknesses:
1. Prediction can be slow on large or high-dimensional data because distances to the training points must be computed.
1. Can suffer from the "curse of dimensionality" in high dimensions, where distances become less informative.


In [75]:
# 2.1
# Add necessary code here.
from sklearn.pipeline import Pipeline
from sklearn import tree, svm
from sklearn.neighbors import KNeighborsClassifier

pipe_dt = Pipeline( [("preprocessor", preprocessor), ("classifier", tree.DecisionTreeClassifier())] )
pipe_svm = Pipeline([("preprocessor", preprocessor), ("classifier", svm.SVC())])
pipe_knn = Pipeline([("preprocessor", preprocessor), ("classifier", KNeighborsClassifier())])

<font color='Green'><b>Answer:</b></font>

- **2.2** 

Scores from first run of GridSearchCV:
```
DecisionTreeClassifier()
Best Parameters based on F1: {'classifier__max_depth': 3}

Best score:  0.6467727756114853


KNeighborsClassifier()
Best Parameters based on F1: {'classifier__n_neighbors': 9}

Best score:  0.7650912349299446


SVC()
Best Parameters based on F1: {'classifier__C': 1, 'classifier__gamma': 1, 'classifier__kernel': 'linear'}

Best score:  0.726170264370604
```
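
Updating a pipeline with the best parameters, as 2.2 asks, can be done with `set_params`. The sketch below is self-contained and uses synthetic data from `make_classification` with a plain `StandardScaler` in place of the column transformer, so the names `demo_pipe`, `demo_grid`, `X_demo`, and `y_demo` are hypothetical stand-ins rather than the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])
demo_grid = {"classifier__n_neighbors": [3, 5, 7, 9]}

# GridSearchCV fits clones, so demo_pipe itself is left untouched by the search
search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

# Copy the winning values onto the original pipeline
demo_pipe.set_params(**search.best_params_)

print(search.best_params_, round(search.best_score_, 3))
print(demo_pipe.get_params()["classifier__n_neighbors"])
```

After `set_params`, any later use of the pipeline, for example as a base estimator in a stacking classifier, starts from the tuned hyperparameters instead of the defaults.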

In [79]:
# 2.2
# Add necessary code here.
from sklearn.metrics import f1_score, accuracy_score, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

RANDOM_STATE = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

param_grid_dt = {
    "classifier__max_depth": [2, 3, 4],
}
param_grid_knn = {
    "classifier__n_neighbors": [8, 9, 10],
}
param_grid_svm = {
    "classifier__C": [0.01, 0.1, 1, 1.5, 2],
    "classifier__gamma": [0.01, 0.1, 1, 1.5, 2],
    "classifier__kernel" : ["linear", "poly", "rbf", "sigmoid"]
}
scoring = {
    'accuracy': make_scorer(accuracy_score),         # Scoring based on accuracy_score
    'f1_score': make_scorer(f1_score)                # Scoring based on F1_score
}

def gridSearchResults(pipe, param_grid):
    grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring=scoring, refit='f1_score',  n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_params = grid_search.best_params_
    results = grid_search.cv_results_
    print(pipe.get_params()['classifier'])
    print("Accuracy scores:", results['mean_test_accuracy'])  # Print mean_test_accuracy
    print("\nF1 scores:", results['mean_test_f1_score'])        # Print mean F1 scores
    print("\nBest Parameters based on F1:", best_params)  # Print best parameters based on F1 score
    print("\nBest score: ", grid_search.best_score_)
    print("\n")
    return best_params
    
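for pipe, param_grid in zip([pipe_dt, pipe_knn, pipe_svm], [param_grid_dt, param_grid_knn, param_grid_svm]):
    best_params = gridSearchResults(pipe, param_grid)
    # Update the pipeline with the best hyperparameters found by the grid search
    pipe.set_params(**best_params)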


DecisionTreeClassifier()
Accuracy scores: [0.8        0.77446809 0.77446809]

F1 scores: [0.67879121 0.64308047 0.64563919]

Best Parameters based on F1: {'classifier__max_depth': 2}

Best score:  0.6787912087912088




KNeighborsClassifier()
Accuracy scores: [0.81276596 0.83829787 0.81702128]

F1 scores: [0.70846347 0.76509123 0.72311303]

Best Parameters based on F1: {'classifier__n_neighbors': 9}

Best score:  0.7650912349299446


SVC()
Accuracy scores: [0.80851064 0.63829787 0.63829787 0.63829787 0.80851064 0.63829787
 0.63829787 0.63829787 0.80851064 0.81702128 0.63829787 0.63829787
 0.80851064 0.76170213 0.63829787 0.63829787 0.80851064 0.74042553
 0.63829787 0.63829787 0.82553191 0.63829787 0.63829787 0.63829787
 0.82553191 0.69787234 0.76170213 0.80425532 0.82553191 0.73191489
 0.63829787 0.76170213 0.82553191 0.68510638 0.63829787 0.77021277
 0.82553191 0.6893617  0.63829787 0.74893617 0.81702128 0.63829787
 0.80425532 0.80851064 0.81702128 0.82553191 0.80851064 0.80425532
 0.81702128 0.69787234 0.71914894 0.70638298 0.81702128 0.70638298
 0.68085106 0.74893617 0.81702128 0.70638298 0.65106383 0.71914894
 0.80851064 0.63829787 0.81276596 0.81276596 0.80851064 0.83404255
 0.81276596 0.80851064

<font color='Green'><b>Answer:</b></font>

- **2.3** I chose `RandomForestClassifier` as the `final_estimator` because it performs well on classification tasks and can capture complex interactions between its inputs, which makes it more powerful when the relationship between the inputs and the target is not linear. Due to its ensemble nature, a random forest is also less likely to overfit than a single decision tree. In a stacking classifier, the predictions of the base estimators become the input features of the meta-model, which learns how best to weigh and combine them. This can improve on the performance and generalization ability of a single model, especially when the base models have different biases or assumptions.
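
To make the combination step concrete, the sketch below fits a small stacking classifier on synthetic data and inspects the features that the meta-model receives. It uses bare classifiers rather than the tuned pipelines, and the names `stack`, `X_demo`, and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the heart disease data)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("svm", SVC()),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=RandomForestClassifier(random_state=0),
)
stack.fit(X_demo, y_demo)

# transform() returns the base estimators' outputs, i.e. the meta-model's inputs
meta_features = stack.transform(X_demo)
print(stack.stack_method_)
print(meta_features.shape)
```

For a binary problem each base estimator contributes one column, so the meta-model sees one feature per base model. During `fit`, scikit-learn trains the final estimator on cross-validated predictions of the base estimators, which limits how much the meta-model can benefit from base models that have memorized the training data.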

In [83]:
# 2.3
# Add necessary code here.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Define a list of base estimators for stacking
estimators = [
    ("Decision Tree", pipe_dt),
    ("SVM", pipe_svm),
    ("KNN", pipe_knn)
]

# Create a stacking classifier with the three pipelines as base estimators and a RandomForestClassifier as the final estimator
stacking_classifier = StackingClassifier(
    estimators=estimators,                       # List of base estimators
    final_estimator=RandomForestClassifier()     # Meta-model that combines the base estimators' predictions
)

cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
accuracy_scores = cross_val_score(stacking_classifier, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
f1_scores = cross_val_score(stacking_classifier, X, y, scoring='f1', cv=cv, n_jobs=-1)
accuracy_mean = accuracy_scores.mean()
accuracy_std = accuracy_scores.std()
f1_mean = f1_scores.mean()
f1_std = f1_scores.std()

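print("Accuracy per fold:", accuracy_scores.round(2))
print("F1 score per fold:", f1_scores.round(2))
print(f"Accuracy: {accuracy_mean:.2f} ± {accuracy_std:.2f}")
print(f"F1 Score: {f1_mean:.2f} ± {f1_std:.2f}")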
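
Calling `cross_val_score` twice fits the model twice per fold. A possible alternative is `cross_validate`, which computes several metrics from a single fit per fold. The sketch below uses synthetic data and a `LogisticRegression` as a lightweight stand-in for the stacking classifier, so `demo_model`, `demo_cv`, `X_demo`, and `y_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data (stand-in for the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_model = LogisticRegression(max_iter=1000)
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One fit per fold, both metrics evaluated on the same fitted model
results = cross_validate(demo_model, X_demo, y_demo, cv=demo_cv, scoring=["accuracy", "f1"])

for metric in ("accuracy", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric} per fold: {fold_scores.round(2)}")
    print(f"{metric}: {fold_scores.mean():.2f} ± {fold_scores.std():.2f}")
```

Because both metrics come from the same fitted models, the accuracy and F1 values of a given fold describe the same predictions, which is not guaranteed when a randomized model is cross-validated in two separate calls.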


<font color='Green'><b>Answer:</b></font>

- **2.4** .....................

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may still be room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>