# Coding Homework 10: [Your Name]


- You can add new cells if you need to (with the "+" button above), but deleting cells will very likely cause your notebook to fail MarkUs autotesting (and you'd have to start over and re-enter your answers into a completely fresh version of the notebook to get things to work again...)
- In this homework, be careful not to rerun any cells that involve invoking a random number generator 

> TAs will mark this assignment by first checking ***MarkUs*** autotests for completion and general correctness, and then manually reviewing your written response to `Q6, Q8, Q10, Q11` and plotted figures for `Q2, Q7, Q9`
> - The following questions "automatically fail" during automated testing so that MarkUs exposes example answers for student review and consideration for these problems.  These "failed MarkUs tests" are not counted against the student: `Q6, Q8, Q10, Q11`


We begin by importing the dataset and the libraries we will use.

In [None]:
import pandas as pd
import numpy as np
import graphviz as gv
from sklearn import tree, datasets
from sklearn.model_selection import GridSearchCV, train_test_split, ShuffleSplit
from sklearn.metrics import accuracy_score, recall_score, make_scorer
from sklearn.inspection import PartialDependenceDisplay

cancer_data = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_data.data, columns = cancer_data.feature_names)
cancer_classes = pd.DataFrame(cancer_data.target).replace({0:'Malignant',1:'Benign'})

## Part 1: GridSearchCV 

GridSearchCV is a powerful tool provided by sklearn for *hyperparameter tuning*, the process of refitting a model with many values of a hyperparameter (such as `max_depth`) in order to optimize performance. By exhaustively searching through a predefined grid of hyperparameter values, GridSearchCV allows you to identify the combination of hyperparameter values that yields the best model performance. This automated approach streamlines the process of finding the right hyperparameters.
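
The grid search described above can be sketched roughly as follows. This is an illustration only: it uses the iris dataset rather than the homework data, and the grid values, the 5-fold setting, and the names `param_grid`, `search`, `best_depth`, and `best_score` are assumptions chosen for the example.

```python
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV

# Illustrative data only: the iris dataset, not the homework dataset.
X, y = datasets.load_iris(return_X_y=True)

# Candidate values for the hyperparameter being tuned (assumed for this sketch).
param_grid = {"max_depth": [1, 2, 3, 4, 5]}

search = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # metric used to rank the candidates
    cv=5,                # each candidate is scored with 5-fold cross-validation
)
search.fit(X, y)

# The candidate with the highest mean cross-validated accuracy, and that accuracy.
best_depth = search.best_params_["max_depth"]
best_score = search.best_score_
print(best_depth, round(best_score, 3))
```

Every candidate in the grid is fitted and scored on each split, so the cost grows with both the grid size and the number of splits.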

To practice using GridSearchCV we will be working with some breast cancer classification data available from sklearn. You might want to learn more about the dataset here:
- https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset
- https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Before you start on the questions, use the below cells to play around with the dataset a bit. 

Now you might agree that the dataset seems very complicated, so we will start by reducing it to a handful of columns.

### Q0: Preprocessing the dataset

We will begin by processing our data into a usable form. This should be familiar as you did essentially the same thing for last week's homework. Do the following:
- Make a new dataframe named `dataset_cleaned` that contains only the following columns from the `cancer_df`:
    - `mean radius`, `mean texture`, `mean perimeter`, `mean fractal dimension`
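
As a reminder of the general pattern, here is a small sketch of keeping only some columns of a dataframe. The frame `toy_df` and its column names are hypothetical and unrelated to the homework data.

```python
import pandas as pd

# Hypothetical toy frame standing in for a larger dataset.
toy_df = pd.DataFrame(
    {"height": [1.0, 2.0], "width": [3.0, 4.0], "depth": [5.0, 6.0]}
)

# Indexing with a list of column names returns a dataframe with only those columns;
# .copy() makes the result independent of the original frame.
wanted = ["height", "depth"]
toy_subset = toy_df[wanted].copy()
print(toy_subset.columns.tolist())
```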

In [None]:
np.random.seed(1959) # Do NOT change this line: it sets the "random number generation seed"

# Work here

### Q1: Use GridSearchCV to find the max_depth which optimizes your model for accuracy

Now use GridSearchCV to find the maximum depth in the range [0,20] which yields the most accurate decision tree (i.e. the decision tree which attains the highest mean accuracy across the cross-validation test splits).

In [None]:
scoring = {'accuracy': make_scorer(accuracy_score),
          'sensitivity': make_scorer(recall_score,pos_label='Malignant')} # This will provide the values you need for the remaining questions

rs = ShuffleSplit(n_splits=2,test_size=0.20, random_state=30259) # Do NOT change this line: it sets the "random number generation seed"
#Feb 03 1959 is 'The Day the Music Died' https://en.wikipedia.org/wiki/The_Day_the_Music_Died 

# Create and fit your GridSearchCV instance here


In [None]:
# Q1:
Q1 = None

### Q2: Use GraphViz to plot the best tree you found in Q1

In [None]:
dot_data = tree.export_graphviz(clf_GS_1.best_estimator_, out_file=None, 
                                feature_names=features, class_names=['Malignant', 'Benign'],
                                filled=True, rounded=True, special_characters=True)
graph = gv.Source(dot_data, format='SVG')
graph # You can comment this line and re-run so the figure doesn't render if MarkUs notebook renderer gives you an error of 
# "nbconvert failed: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
# It's your responsibility to check this: TAs can't provide manual marks if your notebook doesn't render in MarkUs

### Q3: What is the accuracy of the best decision tree you found in Q1?

Provide your answer as a decimal number with three significant digits.

In [None]:
# Q3:
Q3 = None

### Q4: What is the sensitivity for the most accurate decision tree you found in Q1?

You can extract the sensitivity from your GridSearchCV object by looking in the `cv_results_` field and considering the `mean_test_sensitivity` entry. Provide your answer as a decimal number with three significant digits.
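
The lookup in `cv_results_` might be sketched as below. This is not the homework solution: it assumes a two-class subset of the iris data with made-up labels `"pos"` and `"neg"`, and the names `X_demo`, `y_demo`, `splitter`, and `search` are hypothetical. With several scorers, `refit` is set to the name of the metric that should pick the best candidate.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score, recall_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Illustrative data: two of the three iris classes, relabelled as strings.
X, y = datasets.load_iris(return_X_y=True)
mask = y != 0
X_demo = X[mask]
y_demo = np.where(y[mask] == 1, "pos", "neg")

# Two named metrics; the dictionary keys become suffixes in cv_results_.
scoring = {
    "accuracy": make_scorer(accuracy_score),
    "sensitivity": make_scorer(recall_score, pos_label="pos"),
}
splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

search = GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3]},
    scoring=scoring,
    refit="accuracy",  # the metric that selects the best candidate
    cv=splitter,
)
search.fit(X_demo, y_demo)

# best_index_ is the row of cv_results_ belonging to the selected candidate.
best = search.best_index_
best_accuracy = search.cv_results_["mean_test_accuracy"][best]
best_sensitivity = search.cv_results_["mean_test_sensitivity"][best]
print(round(best_accuracy, 3), round(best_sensitivity, 3))
```

Reading both metrics at the same `best_index_` gives the sensitivity of the candidate that was chosen for its accuracy, which need not be the candidate with the highest sensitivity.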

In [None]:
# Q4:
Q4 = None

### Q5: Re-answer Q1, Q3, and Q4 if you replace accuracy with sensitivity as the important metric when using GridSearchCV.

In [None]:
rs = ShuffleSplit(n_splits=2,test_size=0.20, random_state=30259) # Do NOT change this line: it sets the "random number generation seed"

# Create and fit your second GridSearchCV instance here

In [None]:
# Q5:
Q5_max_depth = None
Q5_accuracy = None
Q5_sensitivity = None
Q5 = (Q5_max_depth, Q5_accuracy, Q5_sensitivity)

### Q6: Use your answers to Q1-Q4 and to Q5 to compare the outcomes when different metrics are prioritized.

#### Write a 1-2 sentence answer to this question in the markdown cell below
- Compare your response to the answer given in the *MarkUs* output.

> Put your answer here 

## Part 2: Mean Effects of a Variable on Classification

In the last homework you worked with `Feature Importance` to determine how important different features were to the classification model you developed. However, as you might have noticed, `Feature Importance` does not tell you much about the relationship between a feature and the classification model's predictions, beyond that a given feature might be important to the prediction process. In linear regression, you can read off the regression coefficients to determine the numerical relationship between a predictor and the output. For more complicated general models, it is hard to reduce the relationship between predictors and prediction to a single number, so we will use `Partial dependence plots` (PDPs) to try to understand these relationships better.

Here we will only be using one particular type of PDP, namely one-way PDPs for binary classification models (if you want to know what this means you can visit the documentation page below). A one-way PDP for a binary classification model shows the estimated probability of the model predicting the positive label against the value of some predictor variable, as estimated by predicting on some sample dataset. Here is an example of a PDP:


<center><img src="im/10/example_PDP.png"></center>

You can read the following off the above PDP (at least for the sample dataset): if `area error` is between 0 and 20 or so, the model predicts that a bit less than 40% of datapoints have the positive label; if `area error` is between 20 and 90 or so, the model predicts that a bit more than 40% of datapoints have the positive label; and if `area error` is above 90, the model predicts that all datapoints have the positive label.

To learn how to create partial dependence plots you should read the documentation:
- https://scikit-learn.org/stable/modules/generated/sklearn.inspection.PartialDependenceDisplay.html#sklearn.inspection.PartialDependenceDisplay.from_estimator

For more information you can also read this page:
 - https://scikit-learn.org/stable/modules/partial_dependence.html
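
To make the averaging behind a one-way PDP concrete, here is a minimal hand-rolled sketch. It assumes a two-class subset of the iris data and a shallow tree in place of the homework model; the helper `one_way_pdp` and the names `X_demo`, `y_demo`, and `grid` are hypothetical. `PartialDependenceDisplay.from_estimator` chooses its own grid, so its curves need not match this sketch point for point.

```python
import numpy as np
from sklearn import datasets, tree

# Illustrative data: two of the three iris classes, with 1 as the positive label.
X, y = datasets.load_iris(return_X_y=True)
mask = y != 0
X_demo = X[mask]
y_demo = (y[mask] == 2).astype(int)

model = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)


def one_way_pdp(model, X, feature_index, grid):
    """Average predicted probability of the positive class at each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value  # force every row to this feature value
        # Column 1 of predict_proba corresponds to model.classes_[1], the label 1.
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


feature_index = 2  # petal length in the iris data
grid = np.linspace(
    X_demo[:, feature_index].min(), X_demo[:, feature_index].max(), 20
)
pdp = one_way_pdp(model, X_demo, feature_index, grid)
print(np.round(pdp, 2))
```

Each entry of `pdp` is the height of the curve at the matching entry of `grid`; for a decision tree the curve is piecewise constant and can only change where the tree splits on that feature.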

### Q7: Create all the single-variable PDPs from the model you found in Q1.

> Hint: You can do this in one line, and use the full (cleaned) dataset for the sample required by PartialDependenceDisplay.from_estimator().

### Q8: Interpret the PDPs you made in Q7.

#### Write a 2-3 sentence answer to this question in the markdown cell below
- Compare your response to the answer given in the *MarkUs* output.
> - Hint: Which variables are most influential? 
> - Hint: How would you describe the relationships between the variables and predicted label? 
> - Hint: Can you relate anything in the PDPs to the decision tree itself?

> Put your answer here

### Q9: Create all the single-variable PDPs from the model you found in Q5.

> Hint: You can do this in one line, and use the full (cleaned) dataset for the sample required by PartialDependenceDisplay.from_estimator().

### Q10: Interpret the PDPs you made in Q9.

#### Write a 2-3 sentence answer to this question in the markdown cell below
- Compare your response to the answer given in the *MarkUs* output.
> - Hint: Which variables are most influential? 
> - Hint: How would you describe the relationships between the variables and predicted label? 
> - Hint: Can you relate anything in the PDPs to the decision tree itself?

> Put your answer here

### Q11: Compare how you interpret PDPs vs coefficients in linear regression

#### Write a 1-2 sentence answer to this question in the markdown cell below
- Compare your response to the answer given in the *MarkUs* output.
> - Hint: Compare the goals of classification models with the goals of linear regression.

> Put your answer here