# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Nick Nikolov

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [156]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [157]:
# Import dataset (1 mark)
df = pd.read_csv("diabetes.csv")

#Split into target and feature vectors
y = df.pop('Outcome')

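# Note: df.head without parentheses prints the bound method (the full frame repr shown below);
# df.head() would print only the first five rows
print(df.head)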
print(df.shape)
print(y.shape)

<bound method NDFrame.head of      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1      

### Questions (3 marks)

1. (1 mark) What is the source of your dataset? From Kaggle: https://www.kaggle.com/datasets/pentakrishnakishore/diabetes-csv?select=diabetes.csv 

1. (1 mark) Why did you pick this particular dataset? I'm often drawn to medical data, perhaps because of the complexity of the human body. Outcomes in humans seem very difficult to predict.

1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use? Kaggle had a few interesting datasets. I almost used a luxury watch price dataset, but I couldn't think of anything interesting to predict with it. Predicting diabetes seemed more useful.


## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [158]:
# Clean data (if needed)
#Check for any Null values

nulls = df.isnull().sum().sort_values(ascending=False)
print(nulls)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64


In [159]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
from sklearn.model_selection import train_test_split

# Hold out a test set; later cells refer to X_test and y_test
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why. 

There were no null values in this dataset. If there had been, I would have dropped the affected rows. I'm not sure which feature matters most for predicting diabetes, but glucose, for example, is likely a very important indicator, so a guessed glucose value could have a big effect on the results. Filling with zero is not an option either, since a glucose measurement of zero is medically implausible, and an average value would not reflect the individual patient. The same applies to BMI, blood pressure, age and other features, so dropping the row is the most appropriate choice. (Several rows do show 0 for Insulin and SkinThickness, which may be missing measurements recorded as zero and would be worth checking.)
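As a minimal sketch of the row-dropping approach described above, the example below treats zeros as missing in columns where a zero is implausible, then drops incomplete rows. The mini-frame `toy` and the choice of columns in `zero_as_missing` are hypothetical illustrations, not part of the assignment code or the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with a few columns named like the diabetes data
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 5],
    "Glucose": [148, 85, 0, 89, 137, 116],
    "Insulin": [0, 0, 94, 94, 168, 112],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0, 25.6],
})

# Assumption: a zero in these columns is implausible and stands for a
# missing measurement. A zero in Pregnancies is a valid value, so it is kept.
zero_as_missing = ["Glucose", "Insulin", "BMI"]

cleaned = toy.copy()
for col in zero_as_missing:
    cleaned[col] = cleaned[col].replace(0, np.nan)

# Count missing values per column, then drop every row with a gap
missing_per_column = cleaned.isnull().sum()
toy_complete = cleaned.dropna()

print(missing_per_column)
print(toy_complete.shape)
```

Dropping rows is simple but shrinks the training set, so it is worth checking how many rows survive before committing to it.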

2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types? 

All eight features are numerical, and the target (`Outcome`) is a binary category: the patient either has diabetes or does not. Because the features are on very different scales, scaling should be applied for scale-sensitive models such as SVC and logistic regression; tree-based models like random forests do not need it. Since every feature is numerical, no encoding is required.
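To illustrate what scaling does, here is a small sketch using `StandardScaler` on two features with very different ranges. The values in `X_toy` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on very different scales
X_toy = np.array([[148.0, 0.627],
                  [85.0, 0.351],
                  [183.0, 0.672],
                  [89.0, 0.167]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean 0 and standard deviation 1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```

Inside a `Pipeline`, the scaler is fit on the training folds only, which avoids leaking information from the validation folds.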


## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [160]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Placeholder steps; the grid search swaps in each classifier and preprocessing option below
pipe = Pipeline(steps=[('preprocessing', StandardScaler()),
                       ('classifier', LogisticRegression(max_iter=10))])

# One dictionary per model: SVC (with and without scaling), random forest, logistic regression
param_grid = [
    {'classifier': [SVC()],
     'classifier__C': [0.01, 0.1, 1.0, 10.0],
     'preprocessing': [StandardScaler(), None]},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__max_depth': [1, 3, 5, 7],
     'preprocessing': [None]},
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [1, 10, 100, 1000],
     'preprocessing': [None]},
]

grid = GridSearchCV(pipe, param_grid, cv=5)

In [161]:
grid.fit(X_train, y_train)

In [162]:
grid.best_estimator_

In [163]:
grid.best_params_

{'classifier': LogisticRegression(C=10, max_iter=1000),
 'classifier__C': 10,
 'preprocessing': None}

In [164]:
grid.best_score_

0.7638980509745128

In [165]:
print(f'Cross-Validation accuracy {grid.best_score_:.2f}')
print(f'Test accuracy {grid.score(X_test, y_test):.2f}')

Cross-Validation accuracy 0.76
Test accuracy 0.80


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset? 

My dataset calls for classification models: either the patient has diabetes or they do not. The target vector is binary rather than a continuous range of values.

1. (2 marks) Which models did you select for testing and why? 

I selected LogisticRegression, RandomForestClassifier and SVC. Logistic regression was chosen as the linear model, since linear models may perform better than non-linear models (such as random forests) on high-dimensional data. Random forests were chosen because they typically perform very well. SVC was chosen as a second non-linear model that works differently from random forests.
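As a rough check on how much work the grid search does, the sketch below counts the candidate settings in a grid with the same structure as the one in Step 3, using `ParameterGrid`. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same structure as the Step 3 grid: one dictionary per model
param_grid = [
    {'classifier': [SVC()],
     'classifier__C': [0.01, 0.1, 1.0, 10.0],
     'preprocessing': [StandardScaler(), None]},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__max_depth': [1, 3, 5, 7],
     'preprocessing': [None]},
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [1, 10, 100, 1000],
     'preprocessing': [None]},
]

# Each dictionary contributes the product of its list lengths:
# SVC 4 * 2 = 8, random forest 4, logistic regression 4
n_candidates = len(ParameterGrid(param_grid))

# With cv=5, every candidate is fit once per fold (plus one final refit)
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```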

1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset? 

The logistic regression model performed the best. I expected the random forest to perform best, since random forests typically perform very well. According to the notes, the main weakness of random forests is high-dimensional sparse data, but this dataset has only 8 features, so a random forest should have performed well. One possible explanation is that the relationship between these features and the outcome is close to linear; another is that the features simply don't predict diabetes well.

If the data were high-dimensional, the logistic regression model would be expected to perform better than random forests.
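One way to compare the models more directly is to look at the best cross-validation score per classifier in `cv_results_`, rather than only the overall winner. The sketch below does this on synthetic stand-in data from `make_classification` (not the diabetes dataset), with a smaller hypothetical grid so it runs quickly; the names `demo_pipe`, `demo_grid` and `demo_search` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the diabetes dataset): 300 rows, 8 features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipe = Pipeline([('preprocessing', StandardScaler()),
                      ('classifier', LogisticRegression(max_iter=1000))])

# Reduced, hypothetical grid: two settings per model
demo_grid = [
    {'classifier': [SVC()], 'classifier__C': [0.1, 1.0]},
    {'classifier': [RandomForestClassifier(n_estimators=25, random_state=0)],
     'classifier__max_depth': [3, 5]},
    {'classifier': [LogisticRegression(max_iter=1000)], 'classifier__C': [1, 10]},
]

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Label each candidate by its classifier class, then keep the best score per model
results = pd.DataFrame(demo_search.cv_results_)
results['model'] = results['param_classifier'].apply(lambda est: type(est).__name__)
best_per_model = (results.groupby('model')['mean_test_score']
                  .max()
                  .sort_values(ascending=False))
print(best_per_model)
```

If the per-model scores are close, the ranking may not be meaningful, and the simpler model is a reasonable choice.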


## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [166]:
# Calculate testing accuracy (1 mark)
from sklearn.metrics import recall_score

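# Refit the best model from Step 3 (LogisticRegression with C=10).
# Note: this pipeline applies StandardScaler, whereas the grid search selected 'preprocessing': None
pipe = Pipeline(steps=[('preprocessing', StandardScaler()), ('classifier', LogisticRegression(C=10, max_iter=1000))])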

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f'Recall Score {recall_score(y_test, y_pred):.2f}')

Recall Score 0.58



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 

Recall, since we want to detect as many of the true positive cases as possible. Missing a patient who has diabetes (a false negative) is the most costly error.
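To make the metric concrete, the sketch below computes recall by hand from a confusion matrix and compares it with `recall_score`. The labels in `y_true` and `y_hat` are hypothetical toy values, not results from the diabetes model.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels: 4 positive cases, 6 negative cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall = true positives / all actual positives
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_hat)

# Accuracy counts every correct prediction, positive or negative
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(recall_manual, recall_sklearn, accuracy)
```

In this toy case accuracy is higher than recall, which shows how a model can look acceptable on accuracy while still missing half of the positive cases.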

1. (1 mark) How do these results compare to those in part 3? Did this model generalize well? 

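The test accuracy (0.80) was slightly higher than the cross-validation accuracy from Step 3 (0.76), so in terms of accuracy the model generalized well. However, the recall was only 0.58, which is poor.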

1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

Definitely not. The best model was determined to be a LogisticRegression with C=10, but its recall was only 0.58. In a real-world setting, this means the model would miss roughly 42% of the patients who actually have diabetes. Since these decisions could have a lifelong impact on patients, it would be immoral to apply this particular model in the real world given the poor recall. Personally, I would want at least 80% diagnostic accuracy from my doctor, so I would expect at least that level of performance from this prediction model.

I spent a lot of time trying other types of models with different ranges of parameters, but the logistic regression model with a recall score of 0.58 was still the best I could find. I'm not sure why it performs so poorly. Maybe the features in the dataset simply don't predict diabetes very well.
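One possible improvement, sketched below under stated assumptions, is to have the grid search optimize recall directly (`scoring='recall'`) and to try `class_weight='balanced'`, instead of selecting by accuracy and measuring recall afterwards. The data here is synthetic (`make_classification`), not the diabetes dataset, so the scores say nothing about the real problem; the names `pipe_sketch` and `grid_sketch` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in data (NOT the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn,
                                          random_state=0)

pipe_sketch = Pipeline([('preprocessing', StandardScaler()),
                        ('classifier', LogisticRegression(max_iter=1000))])

# Search over regularization strength and class weighting,
# ranking candidates by cross-validated recall instead of accuracy
grid_sketch = GridSearchCV(
    pipe_sketch,
    {'classifier__C': [0.1, 1, 10],
     'classifier__class_weight': [None, 'balanced']},
    scoring='recall',
    cv=5,
)
grid_sketch.fit(X_tr, y_tr)

test_recall = recall_score(y_te, grid_sketch.predict(X_te))
print(grid_sketch.best_params_)
print(f'Test recall {test_recall:.2f}')
```

Optimizing recall alone can increase false positives, so precision should be checked alongside it before drawing conclusions.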


## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code? Lecture slides, tutorials (Scaling.ipynb, PipelineSteps.ipynb, Pipeline.ipynb, ApplyPipelines.ipynb), and the scikit-learn documentation (https://scikit-learn.org/stable/modules/grid_search.html and https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

1. In what order did you complete the steps? Numerical order. 

1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not? 

I used ChatGPT for a few code syntax questions (for example, "how do I remove one column from a pandas dataframe"). It saved me time I would otherwise have spent searching through Stack Overflow or the pandas documentation. For simple syntax questions, ChatGPT is very good.

1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

It took me a while to figure out the correct syntax for the parameter grid. I also spent a lot of time trying to figure out why my models were performing so poorly, and kept trying different inputs.


## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

It was fun searching through Kaggle to see which datasets I might find interesting. I also enjoyed asking friends for dataset ideas and hearing what they think machine learning is capable of doing.
