<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***


<font color='Blue'>
In this assignment, you will put together everything you have learned so far. You will need to do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" menu in Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on both files, so you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **25**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [None]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,,,,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,,,,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,,,,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,,,6.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,52,1,4,160.0,331.0,0.0,0.0,94.0,1.0,2.5,,,,1
290,54,0,3,130.0,294.0,0.0,1.0,100.0,1.0,0.0,2.0,,,1
291,56,1,4,155.0,342.0,1.0,0.0,150.0,1.0,3.0,2.0,,,1
292,58,0,2,180.0,393.0,0.0,0.0,110.0,1.0,1.0,2.0,,7.0,1


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** This is reasonable because when more than 60% of a column is missing, the few remaining values provide little reliable insight. Keeping such a column would mean that most of its entries are filled in rather than observed, which would introduce bias into the model.

In [None]:
# 1.1
# Check for columns with more than 60% missing values
missing_percentage = df.isnull().mean()
columns_to_drop = missing_percentage[missing_percentage > 0.6].index

df.drop(columns=columns_to_drop, inplace=True)
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...
289,52,1,4,160.0,331.0,0.0,0.0,94.0,1.0,2.5,1
290,54,0,3,130.0,294.0,0.0,1.0,100.0,1.0,0.0,1
291,56,1,4,155.0,342.0,1.0,0.0,150.0,1.0,3.0,1
292,58,0,2,180.0,393.0,0.0,0.0,110.0,1.0,1.0,1


<font color='Green'><b>Answer:</b></font>

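- **1.2** I chose mean imputation for the numerical features because it preserves each feature's mean and keeps every row in the data set, although it slightly reduces the feature's variance. For the categorical features I used the most frequent value, since a mean is not meaningful for category codes.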
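
The effect of mean imputation on a single feature can be sketched in plain Python. The cholesterol readings below are made-up illustration values (an assumption, not rows from the data set), and the `statistics` module stands in for `SimpleImputer(strategy='mean')`.

```python
import statistics

# Hypothetical cholesterol readings; None marks a missing value.
chol = [132.0, 243.0, None, 237.0, 219.0, None, 331.0, 294.0]

# The fill value is the mean of the observed (non-missing) entries.
observed = [v for v in chol if v is not None]
fill_value = statistics.mean(observed)

# Replace each missing entry with that mean.
imputed = [fill_value if v is None else v for v in chol]

mean_before = statistics.mean(observed)
mean_after = statistics.mean(imputed)
std_before = statistics.pstdev(observed)
std_after = statistics.pstdev(imputed)

print(f"mean: {mean_before:.1f} -> {mean_after:.1f}")
print(f"std:  {std_before:.1f} -> {std_after:.1f}")
```

The mean is unchanged, while the standard deviation shrinks because every imputed entry sits exactly at the mean.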

In [None]:
from sklearn.impute import SimpleImputer

# Group the columns by type so each group gets a suitable imputation strategy
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang']

# Impute missing values for numerical columns with mean
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_features] = numerical_imputer.fit_transform(df[numerical_features])

# Impute missing values for categorical columns with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_features] = categorical_imputer.fit_transform(df[categorical_features])


<font color='Green'><b>Answer:</b></font>

- **1.3** We use `StandardScaler` for the numerical features because they are continuous and span very different ranges of values. Scaling each one to zero mean and unit variance puts them on comparable ranges and magnitudes. We use `OneHotEncoder` for the categorical features because they take discrete values with no natural order, so each category is represented as its own binary indicator variable. The binary features are already 0/1 indicators, so they can be passed through unchanged.
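
To make the two transformations concrete, here is a plain-Python sketch of what they compute. The helper names `standardize` and `one_hot` are hypothetical, and the input values are illustrative; in the notebook itself the work is done by `StandardScaler` and `OneHotEncoder`.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit variance (population std, ddof=0)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Map each value to a 0/1 indicator row, one column per sorted category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = [28, 29, 30, 31, 52]   # illustrative numerical feature
cp = [2, 2, 1, 4, 3]          # illustrative chest-pain codes (categorical)

scaled_ages = standardize(ages)
encoded_cp = one_hot(cp)

print([round(a, 2) for a in scaled_ages])
print(encoded_cp)
```

Each row of `encoded_cp` contains exactly one 1, so the encoding carries no implied ordering between the chest-pain codes.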

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assign 'num' column to y and the rest to X
y = df['num']
X = df.drop(columns=['num'])

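# List of feature names for each group.
# The binary features are kept out of the categorical list so that they are
# passed through once instead of also being one-hot encoded.
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['cp', 'restecg']
binary_features = ['sex', 'fbs', 'exang']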

# Create ColumnTransformer object
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features),
        ('binary', 'passthrough', binary_features)
    ],
    remainder='drop'
)



# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. **(5 Points)**

- **2.4** Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how the stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what the possible reasons for that are. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1**
  - **Logistic Regression** is chosen because it works well for binary classification, its regularization helps limit overfitting, and it is easy and fast to fit. Among the three pipelines it serves as a simple baseline that tends to generalize well. Its main weakness is the linearity assumption: it models a linear relationship between the features and the log-odds of the target, which limits the complexity of the patterns it can capture.
  - **RandomForestClassifier** is chosen because it handles non-linear relationships and is robust to overfitting. It also provides valuable insight into how important each feature is. A downside is its sensitivity to hyperparameters, which can make tuning more time-consuming. In addition, it uses more memory and computation than logistic regression.
  - **SVC** is chosen because it handles high-dimensional data and complex decision boundaries. Depending on the kernel, it can learn both linear and non-linear boundaries. Its main disadvantage is that it does not scale well to large data sets.

In [None]:
# 2.1
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Create Pipeline with ColumnTransformer and Logistic Regression
pipeline_1 = Pipeline([
    ('preprocessor', preprocessor),  # ColumnTransformer from previous question
    ('classifier', LogisticRegression())  # Logistic Regression model
])

pipeline_2 = Pipeline([
    ('preprocessor', preprocessor),  # ColumnTransformer from previous question
    ('classifier', RandomForestClassifier())  # Random Forest Classifier model
])

pipeline_3 = Pipeline([
    ('preprocessor', preprocessor),  # ColumnTransformer from previous question
    ('classifier', SVC())  # Support Vector Classifier model
])


<font color='Green'><b>Answer:</b></font>

- **2.2** See the printed output below the code for the best parameters and the best score of each pipeline.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer
from sklearn import metrics

param_grid_pipeline_1 = {
    'classifier__C': [0.1, 1, 10, 100],  # Regularization parameter for Logistic Regression
    'classifier__solver': ['liblinear', 'lbfgs'],  # Solver for Logistic Regression
}

param_grid_pipeline_2 = {
    'classifier__n_estimators': [50, 100, 200],  # Number of trees in Random Forest
    'classifier__max_depth': [5, 10, 20]  # Maximum depth of trees in Random Forest
}

param_grid_pipeline_3 = {
    'classifier__C': [0.1, 1, 10, 100],  # Regularization parameter for SVC
    'classifier__kernel': ['linear', 'rbf']  # Kernel for SVC
}

scoring = {
    'accuracy': make_scorer(metrics.accuracy_score),
    'f1_score': make_scorer(metrics.f1_score)
}

grid_search_rf = GridSearchCV(pipeline_2, param_grid_pipeline_2, cv=5, scoring=scoring, refit='f1_score', n_jobs=-1)
grid_search_lr = GridSearchCV(pipeline_1, param_grid_pipeline_1, cv=5, scoring=scoring, refit='f1_score', n_jobs=-1)
grid_search_svm = GridSearchCV(pipeline_3, param_grid_pipeline_3, cv=5, scoring=scoring, refit='f1_score', n_jobs=-1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the Logistic Regression model
grid_search_lr.fit(X_train, y_train)

# Fit the Random Forest model
grid_search_rf.fit(X_train, y_train)

# Fit the SVM model
grid_search_svm.fit(X_train, y_train)

best_params_lr = grid_search_lr.best_params_
best_params_rf = grid_search_rf.best_params_
best_params_svm = grid_search_svm.best_params_

pipeline_1.set_params(**best_params_lr)
pipeline_2.set_params(**best_params_rf)
pipeline_3.set_params(**best_params_svm)

print("Best parameters and best score for Pipeline 1 (Logistic Regression):")
print("Best Parameters:", grid_search_lr.best_params_)
print("Best Score (f1_score):", grid_search_lr.best_score_)
print()

print("Best parameters and best score for Pipeline 2 (Random Forest Classifier):")
print("Best Parameters:", grid_search_rf.best_params_)
print("Best Score (f1_score):", grid_search_rf.best_score_)
print()

print("Best parameters and best score for Pipeline 3 (Support Vector Classifier):")
print("Best Parameters:", grid_search_svm.best_params_)
print("Best Score (f1_score):", grid_search_svm.best_score_)
print()




Best parameters and best score for Pipeline 1 (Logistic Regression):
Best Parameters: {'classifier__C': 0.1, 'classifier__solver': 'liblinear'}
Best Score (f1_score): 0.7674005910045955

Best parameters and best score for Pipeline 2 (Random Forest Classifier):
Best Parameters: {'classifier__max_depth': 5, 'classifier__n_estimators': 100}
Best Score (f1_score): 0.7055281882868091

Best parameters and best score for Pipeline 3 (Support Vector Classifier):
Best Parameters: {'classifier__C': 0.1, 'classifier__kernel': 'linear'}
Best Score (f1_score): 0.7732034934621141
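
For a sense of how much work each grid search does, the sketch below enumerates the candidate combinations of the three parameter grids used in 2.2 with `itertools.product`. The helper `expand` is hypothetical and only mimics how a grid is expanded; with `cv=5`, each candidate is fitted five times, and the best one is refitted once more on the full training data.

```python
from itertools import product

# The three grids from 2.2, keyed by a display name.
param_grids = {
    "Logistic Regression": {
        "classifier__C": [0.1, 1, 10, 100],
        "classifier__solver": ["liblinear", "lbfgs"],
    },
    "Random Forest": {
        "classifier__n_estimators": [50, 100, 200],
        "classifier__max_depth": [5, 10, 20],
    },
    "SVC": {
        "classifier__C": [0.1, 1, 10, 100],
        "classifier__kernel": ["linear", "rbf"],
    },
}
n_folds = 5

def expand(grid):
    """Return every combination of the grid's values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidate_counts = {}
for name, grid in param_grids.items():
    candidates = expand(grid)
    candidate_counts[name] = len(candidates)
    print(f"{name}: {len(candidates)} candidates, "
          f"{len(candidates) * n_folds} cross-validation fits")
```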



<font color='Green'><b>Answer:</b></font>

- **2.3** I chose a Random Forest classifier as the meta-model. The stacking classifier trains the meta-model on the cross-validated predictions of the three tuned base pipelines, so it learns how to combine them. The aim is to achieve better predictive performance than any single base estimator.
- Random forests are known for their robustness against overfitting and their ability to handle various types of data. Since a random forest is itself an ensemble method, it improves predictive performance by averaging over many decision trees.
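
The mechanics of stacking and of the `mean ± std` report can be sketched on synthetic data. This is a minimal illustration under stated assumptions: it uses `make_classification` data rather than the heart-disease data, untuned base models, and logistic regression as the meta-model (the default `final_estimator` of `StackingClassifier`) instead of a random forest. It also uses `cross_validate` in place of a manual fold loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic binary-classification data (an assumption, not the assignment data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The meta-model is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Outer stratified folds are used only to evaluate the whole stack.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(stack, X_demo, y_demo, cv=outer_cv,
                         scoring=["accuracy", "f1"])

for metric in ("accuracy", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {np.mean(fold_scores):.2f} ± {np.std(fold_scores):.2f}")
```

Passing several metrics to `cross_validate` scores every fold on both accuracy and F1 in one pass, which avoids refitting the stack once per metric.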


In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Choose a suitable meta-model
from sklearn.ensemble import RandomForestClassifier

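# Combine the tuned pipelines and the meta-model into a stacking classifier.
# Each name matches the model inside its pipeline.
stacking_classifier = StackingClassifier(estimators=[
    ('lr', pipeline_1),   # Logistic Regression
    ('rf', pipeline_2),   # Random Forest
    ('svm', pipeline_3)   # Support Vector Classifier
], final_estimator=RandomForestClassifier())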

# Perform cross-validation with StratifiedKFold
skf = StratifiedKFold(n_splits=5)

accuracy_scores = []
f1_scores = []

for train_index, test_index in skf.split(X_train, y_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[test_index]

    stacking_classifier.fit(X_train_fold, y_train_fold)
    y_pred = stacking_classifier.predict(X_val_fold)

    accuracy_scores.append(accuracy_score(y_val_fold, y_pred))
    f1_scores.append(f1_score(y_val_fold, y_pred))

# Calculate mean and standard deviation of accuracy and F1 scores
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
mean_f1 = np.mean(f1_scores)
std_f1 = np.std(f1_scores)

# Report the results
print(f"Mean Accuracy: {mean_accuracy:.2f} ± {std_accuracy:.2f}")
print(f"Mean F1 Score: {mean_f1:.2f} ± {std_f1:.2f}")


Mean Accuracy: 0.82 ± 0.06
Mean F1 Score: 0.74 ± 0.08


<font color='Green'><b>Answer:</b></font>

- **2.4** The stacking classifier achieves a reasonably high mean accuracy and F1 score, which reflect the diversity of the base models and the way the meta-model weights their predictions. The mean accuracy of 0.82 indicates that, on average, it labels 82% of the samples correctly, while the mean F1 score of 0.74 suggests a reasonable balance between precision and recall. Comparison with the individual models:
  - Pipeline 1 (Logistic Regression): The best F1 score for Logistic Regression was 0.7674. The stacking classifier's mean F1 score of 0.74 is slightly lower, suggesting that stacking might have slightly deteriorated the performance compared to Logistic Regression alone.

  - Pipeline 2 (Random Forest Classifier): The best F1 score for Random Forest was 0.7055. The stacking classifier's mean F1 score of 0.74 is higher, indicating that stacking has improved the performance compared to Random Forest alone.

  - Pipeline 3 (Support Vector Classifier): The best F1 score for Support Vector Classifier was 0.7732. The stacking classifier's mean F1 score of 0.74 is slightly lower, suggesting that stacking might have slightly deteriorated the performance compared to Support Vector Classifier alone.
- Possible reasons for these differences:
  - Diverse Models: Stacking helps most when the base models make different kinds of errors. If their predictions are highly correlated, the meta-model has little new information to combine, which can leave performance unchanged or slightly worse.
  - Predictive Weighting: In stacking, the meta-model learns how much weight to give each base model's predictions.
  - Data Characteristics: The features of the data set affect how successful stacking is. Stacking can be useful for capturing the intricacies of the data set and improving prediction accuracy, particularly when the interactions between the features and the target are complicated.
  - Hyperparameter Tuning: The performance of the stacking classifier also depends on the hyperparameters chosen for each base model and the meta-model. Suboptimal hyperparameters can lead to suboptimal performance of the stacking classifier.
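
One rough way to read the comparison in 2.4 is to set each gap against the fold-to-fold standard deviation of the stacking classifier's F1 score. The sketch below uses only the numbers reported above; treating "within one standard deviation" as "not clearly different" is a heuristic assumption, not a significance test.

```python
# Scores reported in 2.2 and 2.3.
stacking_f1_mean = 0.74
stacking_f1_std = 0.08
base_f1 = {
    "Logistic Regression": 0.7674,
    "Random Forest": 0.7055,
    "SVC": 0.7732,
}

within_one_std = {}
for name, score in base_f1.items():
    gap = stacking_f1_mean - score  # positive means stacking scored higher
    within_one_std[name] = abs(gap) <= stacking_f1_std
    print(f"{name}: gap = {gap:+.4f}, within one std: {within_one_std[name]}")
```

All three gaps are smaller than 0.08, so the differences between the stacking classifier and the individual models may be mostly fold-to-fold noise.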

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may still be room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font> <br>
1. Introduce more diverse base models:
- The more diverse the base models are, the more they are able to capture different aspects of the data and their respective patterns, which can lead to better generalization.
- This can be done by adding different models such as gradient boosting, k-nearest neighbors, or deep learning models like neural networks.

2. Perform feature selection and feature engineering:
- Selecting the important features, or crafting new ones, enhances the models' ability to capture relevant information. In-depth feature engineering can develop new features or modify current ones in light of domain expertise or data-related insights.
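
Both suggestions could be combined along the following lines. This is a hedged sketch on synthetic data, not a tuned solution: the helper `with_selection`, the choice of `k=5`, and the particular base models are illustrative assumptions, and `SelectKBest` with `f_classif` is just one of several feature-selection options in `sklearn`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the heart-disease data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, random_state=0)

def with_selection(model, k=5):
    """Scale, keep the k features with the highest ANOVA F-score, then fit the model."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("model", model),
    ])

# Three base models from different families, each with its own feature selection.
diverse_stack = StackingClassifier(
    estimators=[
        ("lr", with_selection(LogisticRegression(max_iter=1000))),
        ("gb", with_selection(GradientBoostingClassifier(n_estimators=50, random_state=0))),
        ("knn", with_selection(KNeighborsClassifier(n_neighbors=7))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
diverse_stack.fit(X_demo, y_demo)

# Inspect how many features the fitted selector kept in one base pipeline.
selector = diverse_stack.named_estimators_["lr"].named_steps["select"]
n_selected = int(selector.get_support().sum())
print("features kept per base pipeline:", n_selected)
```

Putting the selector inside each pipeline means it is refitted on every training fold, so the selection step does not see the validation data.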
