### Review  / Questions [30 minutes]
1. See Nov08 TUT and Nov11 LEC
2. Review multiple linear regression and use the opportunity to introduce train test validation
    1. (For the same data) R-squared cannot decrease, and typically increases, with each additional predictor variable added to the model
    2. Adjusted R-squared attempts to "correct" this phenomenon based on a sort of "theoretical expectation" or "rule of thumb"
    3. Prediction performance could also be judged by how good R-squared is for "new" data that the model has not seen
        1. If R-squared is much lower for "new" data the model wasn't trained on than for "data the model has seen", there's a problem...
        2. The *Train-Test* framework gives us a way to construct "new" data to test in this way
            1. Watch out for variability in *Train-Test* scores across different random divisions of the data though!
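
The first two points above can be checked numerically. The following is a minimal sketch on simulated data (the seed, the sample size, and the helper names `r_squared` and `adjusted_r_squared` are assumptions made for this illustration, not part of the tutorial's analysis): predictors unrelated to the outcome are added one at a time, and R-squared never goes down even though the new predictors carry no real information.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)        # arbitrary seed for this illustration
n = 50
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(size=n)       # the outcome truly depends on x1 only
noise = rng.normal(size=(n, 5))       # five predictors unrelated to the outcome

r2_values, adj_r2_values = [], []
for k in range(6):                    # k = number of pure-noise predictors included
    X = np.column_stack([x1, noise[:, :k]])
    r2 = r_squared(X, y)
    r2_values.append(r2)
    adj_r2_values.append(adjusted_r_squared(r2, n, X.shape[1]))
    print(k, round(r2, 4), round(adj_r2_values[-1], 4))
```

The adjusted R-squared column includes a penalty for the number of predictors, so unlike plain R-squared it is free to go down when a predictor adds little.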

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
cancer_data = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_data.data, columns = cancer_data.feature_names)

In [8]:
# Notice that we're not using 'target' which is binary
# with 'target_names': array(['malignant', 'benign'], dtype='<U9')
cancer_data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [7]:
print(cancer_data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [11]:
# randomly split the data into 80% training data and 20% testing data
np.random.seed(131)
training_indices = cancer_df.sample(frac=0.80, replace=False).index.sort_values()
testing_indices = cancer_df.index[~cancer_df.index.isin(training_indices)]
# now we'll use part of the data to "train" the MLR and "test" it with the other part of the data 

In [13]:
# The outcome variable is continuous (not binary) and so are the predictor variables
MLR = smf.ols('Q("mean compactness") ~ Q("mean radius") + Q("mean concavity")', 
              data=cancer_df.loc[training_indices,:])
# Fit the multiple linear regression model (MLR)
MLR_fit = MLR.fit()
np.corrcoef(MLR_fit.predict(cancer_df.loc[training_indices,:]),
            cancer_df.loc[training_indices,"mean compactness"])[0,1]**2

0.7982160760289563

In [14]:
# "In sample" performance based on the "training data" (above) 
# is different from 
# "out of sample" performance based on the "testing data" (below) 
np.corrcoef(MLR_fit.predict(cancer_df.loc[testing_indices,:]),
            cancer_df.loc[testing_indices,"mean compactness"])[0,1]**2

0.7809867032981558

In [None]:
# The fact that the model fit as evaluated by R-squared (proportion of variation explained)
# is worse for "new" data as opposed to "data the model was trained on and has already seen"
# is generally what we'd expect to see in an analysis like this...

# The good news is that the "out of sample" predictive (proportion of variation explained) performance 
# is not much worse than the "in sample performance" which we'd typically interpret as suggesting that
# the model seems to represent the data fairly well
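
The warning about variability across random divisions of the data can be explored with a short loop. This is a sketch only: it swaps `statsmodels` for `sklearn.linear_model.LinearRegression` (which gives the same least squares fit for this model), the ten seeds are arbitrary, and the helper `squared_correlation` is a name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

cancer_data = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_data.data, columns=cancer_data.feature_names)

predictors = ["mean radius", "mean concavity"]
outcome = "mean compactness"

def squared_correlation(model, df):
    """Squared correlation between predictions and observed outcomes."""
    return np.corrcoef(model.predict(df[predictors]), df[outcome])[0, 1] ** 2

scores = []
for seed in range(10):  # ten arbitrary seeds, one 80/20 split per seed
    train = cancer_df.sample(frac=0.80, replace=False, random_state=seed)
    test = cancer_df.drop(train.index)
    model = LinearRegression().fit(train[predictors], train[outcome])
    scores.append((squared_correlation(model, train),
                   squared_correlation(model, test)))

scores = pd.DataFrame(scores, columns=["in_sample", "out_of_sample"])
print(scores.round(3))
print(scores.describe().loc[["min", "max"]].round(3))
```

Comparing the spread of the two columns gives a sense of how much a single train-test comparison depends on the particular split that was drawn.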

### Demo [45 minutes] 
> Concept only, we will reserve coding for lecture 
1. Introduce the basic concept of a decision tree
2. Explain the confusion matrix and what it means 

~3. Introduce concept of Sensitivity, Specificity, Accuracy, Precision~

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20230424141242/dcr.png" alt="Optional Title" style="width: 500px; height: auto;"/>


<img src="https://miro.medium.com/v2/resize:fit:667/1*3yGLac6F4mTENnj5dBNvNQ.jpeg" alt="Optional Title" style="width: 500px; height: auto;"/>


##### ~Sensitivity (True Positive Rate)~
~Sensitivity measures the proportion of actual positives that are correctly identified.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$~

##### ~Specificity (True Negative Rate)~
~Specificity measures the proportion of actual negatives that are correctly identified.
$$\text{Specificity} = \frac{TN}{TN + FP}$$~

##### ~Accuracy~
~Accuracy measures the proportion of true results (both true positives and true negatives) in the population.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$~

##### ~Precision (Positive Predictive Value)~
~Precision measures the proportion of positive identifications that were actually correct.
$$\text{Precision} = \frac{TP}{TP + FP}$$~


#### Communication [so this is being worked on / revised I'd say]


1. **[15 minutes]** Break into 5 groups of 5 and prepare a speech describing the purpose of train-test validation
   
2. **[20 minutes]** Which metric is the most important: Sensitivity, Specificity, Accuracy, or Precision? Use GPT to answer this question and then discuss your conclusion with your group members.



#### Homework  -- To be completed before lecture--

> Code and write all your answers in a python notebook (in code and markdown cells) and save your python jupyter notebook in your own account and "repo" on [github.com](github.com) and submit a link to that notebook through Quercus for assignment marking.

1. Understanding Decision Trees
    1. "Explain Decision Trees: Write a brief paragraph explaining what a decision tree is. Include in your response a description of how decision trees are used in data analysis, the type of problems they can solve, and how they make decisions at each node. Provide an example of a real-world application where decision trees might be particularly useful."

2. Importance of Model Evaluation Metrics
    1. "Evaluate the Importance of Different Model Evaluation Metrics: For each of the following metrics - Sensitivity, Specificity, Accuracy, and Precision - provide a scenario where focusing on that particular metric is crucial. Explain why that metric is important in your given scenario and how it influences the decision-making process."


#### Homework  -- To be completed after lecture--

> Code and write all your answers in a python notebook (in code and markdown cells) and save your python jupyter notebook in your own account and "repo" on [github.com](github.com) and submit a link to that notebook through Quercus for assignment marking.


### Q0: We begin by importing the dataset and the libraries we will use. For each import, write one comment briefly explaining the purpose of the import. Then play around with this data and provide some work demonstrating what the data is.
#### Remember if you don't know the answer, you can always use ChatGPT or other resources to help you figure things out. Just make sure you understand your answers by the time you submit them...

In [16]:
import pandas as pd
import numpy as np
from sklearn import tree, model_selection
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import graphviz as gv
from sklearn.metrics import accuracy_score, recall_score, make_scorer
from sklearn.inspection import PartialDependenceDisplay
ab = pd.read_csv("amazonbooks.csv", encoding="ISO-8859-1")

### Q1: Create a sequence of new dataframes `ab_reduced`, `ab_reduced_noNaN`, `ab_reduced_noNaN_train`, and `ab_reduced_noNaN_test` according to the following specifications.


1. `Weight_oz`, `Width`, and `Height` should be removed because they have a lot of missing values that would cause us to lose rows if they were kept in the analysis; but, we're going to be okay not using them for the analysis. 
2. All rows with `NaN` entries should be dropped.
3. `Pub year` and `NumPages` should be redefined to have the type `int`, and `Hard_or_Paper` should be redefined to have the type `category`.
4. Create an 80/20 split with 80% training set and 20% testing set.

#### To complete the final 80/20 split in a reproducible way, set a "random seed".

> Only remove rows with `NaN` entries once you've subset to the columns you're interested in. This will minimize potentially unnecessary data loss... Of course we might want to consider imputing missing data to further mitigate data loss, but the considerations for doing so are more advanced than the level of our course, so we'll not consider that for now. At any rate, `NaN` entries can't be used in their raw form with the `scikit-learn` methodologies below, so we do need to remove them to proceed with our analyses.
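
Why the order of operations matters can be seen on a small made-up example. The sketch below uses a hypothetical toy data frame (the column names `price`, `pages`, and `weight` are invented here and are not the `amazonbooks.csv` columns) in which one mostly-missing column would cost rows if `NaN` entries were dropped before that column was removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the amazonbooks data)
toy = pd.DataFrame({
    "price":  [10.0, 12.5, np.nan, 8.0, 15.0],
    "pages":  [200, 310, 150, np.nan, 420],
    "weight": [np.nan, np.nan, 9.0, np.nan, 14.0],  # mostly missing
})

# Dropping NaN rows first throws away rows because of a column we don't keep
dropped_first = toy.dropna().drop(columns=["weight"])

# Removing the unwanted column first keeps every row that is complete
# in the columns we actually care about
subset_first = toy.drop(columns=["weight"]).dropna()

print(len(dropped_first), len(subset_first))  # prints: 1 3
```

Only one row survives when `NaN` rows are dropped first, whereas three rows survive when the unused column is removed first.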

#### Remember if you don't know how to do something, you can always use ChatGPT or other resources to help you figure out what you need to do.

### Q2: Tell ChatGPT that you are about to fit a "scikit-learn" `DecisionTreeClassifier` model and ask what the following preparation steps are doing. Once you understand the answer, write a one to two sentence summary explaining these preparation steps in your own words (in the markdown cell below).

In [20]:
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']
X = ab_reduced_noNaN[['NumPages', 'Thick', 'List Price']]

### Q3: Train a classification tree `clf` using only the `List Price` variable to predict whether a book is a hard cover book or a paper cover book (with the `max_depth` parameter set to `2`).

#### Use default values for all other (tuning) parameters when instantiating the Decision Tree Classifier.

> - Hint 1: in the end you'll need to use `DecisionTreeClassifier(...).fit(...)` where the first `...` represents the details of instantiating a specific version of the `DecisionTreeClassifier` and the second `...` represents the arguments of the `.fit(...)` method of the `DecisionTreeClassifier(...)` object... asking ChatGPT about "`DecisionTreeClassifier` `.fit(...)`" can be helpful here; and, for a deeper and more accurate explanation you could find the `scikit-learn` "Decision Trees" documentation online...
> - Hint 2: should you use the `ab_reduced_noNaN` data, or the `ab_reduced_noNaN_train` data, or the `ab_reduced_noNaN_test` data to initially fit the classification tree? Why?

Visualize your decision tree using the `tree.plot_tree(clf)` function shown in the `sklearn` documentation [here](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#what-is-the-values-array-used-here) and [here](https://scikit-learn.org/stable/modules/tree.html)



Visualize your decision tree again, but this time make it more immediately readable using `graphviz`, which is demonstrated in the `sklearn` documentation [here](https://scikit-learn.org/stable/modules/tree.html#alternative-ways-to-export-trees) **once you've clicked the "Alternative ways to export trees" button**



And to limit the visualization itself we can add the `max_depth` parameter to our call of `export_graphviz`.
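
As a rough illustration of that last point, the sketch below fits a small tree on two features of the `sklearn` breast cancer data (an arbitrary choice for this example, not the books data) and exports it twice. The name `demo_clf` and the `random_state` value are assumptions made here.

```python
from sklearn import datasets, tree

cancer = datasets.load_breast_cancer()
feature_names = [str(name) for name in cancer.feature_names[:2]]
class_names = [str(name) for name in cancer.target_names]
X, y = cancer.data[:, :2], cancer.target   # first two features only

# A depth-3 tree fit purely for demonstration purposes
demo_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string
full_dot = tree.export_graphviz(demo_clf, out_file=None, filled=True,
                                feature_names=feature_names,
                                class_names=class_names)

# max_depth here limits only what is drawn, not the fitted tree itself;
# deeper nodes are replaced by "(...)" placeholders
shallow_dot = tree.export_graphviz(demo_clf, out_file=None, filled=True,
                                   max_depth=1,
                                   feature_names=feature_names,
                                   class_names=class_names)

print(len(full_dot.splitlines()), len(shallow_dot.splitlines()))
# In a notebook with the graphviz package available, the drawing could be
# rendered with: gv.Source(shallow_dot)   (using `import graphviz as gv`)
```

The fitted tree is unchanged by the `max_depth` argument of `export_graphviz`, so predictions are the same whichever version is drawn.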

### Q4: How many observations are in the training data set and the test data set?

> - Hint: a single observation consists of all the measurements made on a single entity. In Machine Learning, the "vector" of all values measured for a single entity makes up a single "observation", so this (typically) just corresponds to a row of a data frame.

### Q5: Assuming you correctly fit the classification tree with the `ab_reduced_noNaN_train` data as opposed to the `ab_reduced_noNaN` or the `ab_reduced_noNaN_test` data sets, why did you do this?

#### Write a one to two sentence answer to this question in the markdown cell below

### Q6: Train classification tree `clf2` with features `NumPages`, `Thick` and `List Price` (to again predict whether a book is hardback or paperback), and use `GridSearchCV` to find the best `max_depth`.

#### Use the same train/test split data used so far.
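
For orientation, here is a minimal sketch of the general `GridSearchCV` pattern on the `sklearn` breast cancer data rather than the books data. The grid of depths, the number of folds, the scoring choice, and the `random_state` values are all assumptions for this illustration.

```python
from sklearn import datasets, model_selection, tree

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    cancer.data, cancer.target, test_size=0.20, random_state=0)

# Try each candidate max_depth with 5-fold cross-validation on the training data
search = model_selection.GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# By default the best setting is refit on all of the training data
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)

print(search.best_params_, round(search.best_score_, 3))
print(round(test_accuracy, 3))
```

The testing data plays no part in choosing `max_depth` here; it is only used at the end to score the selected tree.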

### Q7: Use the testing dataset you created in Q1 to create confusion matrices for `clf` and `clf2`. Report the sensitivity (true positive rate), specificity (true negative rate) and accuracy for each of the trees/models.

#### Provide your answers as decimal numbers with three significant digits, such as `0.123` (and not as percentages like `12.3%`), and treat hard cover (`H`) as the positive response and prediction class. 

> Here are few things that might be helpful to ask ChatGPT: 
> - Hint 0: How can I use `np.round()`   
> - Hint 1: Does the `y_true` or `y_pred` parameter go first in the `confusion_matrix` function?  
> - Hint 2: How to label a confusion matrix in `sklearn`
> - Hint 3: The Confusion Matrices and Metrics explanations are just below; you can always ask ChatGPT to explain them in detail or give you some examples

<a id='cf'></a>
# Confusion Matrices and Metrics

- **Accuracy** is the proportion of cases that are correctly identified.
- **Sensitivity** is the proportion of actual positive cases which are correctly identified to be positive (as true positives)
    - **Sensitivity** is also known as **true positive rate (TPR)**
- **Specificity** is the proportion of actual negative cases which are correctly identified to be negative (as true negative)
    - **Specificity** is also known as **true negative rate (TNR)**
- **False positive rates (FPR)** are defined to be the proportion of actually negative cases which are incorrectly identified (as false positives)
- **False negative rates (FNR)** are defined to be the proportion of actually positive cases which are incorrectly identified (as false negatives)
    - *but notice how the FPR and FNR work in a sort of "flipped" manner in these definitions as they are defined with respect to the truth*

In formulas

\begin{align*}
\text{Accuracy}  & = {} (TP+TN)/(TP+TN+FP+FN)\\
TPR & = {} TP/(TP+FN) = 1-FNR \\
TNR & = {} TN/(TN+FP) = 1-FPR
\end{align*}

and you can read more and see a (greatly expanded) handy list of formulas at the following [wikipedia page.](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)
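
The formulas can be tried out on a small made-up example. In this sketch the `y_true` and `y_pred` arrays are hypothetical values chosen by hand; `confusion_matrix` places the true classes on the rows and the predicted classes on the columns, so `.ravel()` returns the counts in the order TN, FP, FN, TP when `labels=[0, 1]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true classes and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# y_true goes first, y_pred second
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)                 # TPR = 1 - FNR
specificity = tn / (tn + fp)                 # TNR = 1 - FPR
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(np.round([sensitivity, specificity, accuracy], 3))  # [0.75  0.667 0.7  ]
```

Here TP = 3, FN = 1, TN = 4, and FP = 2, which gives a sensitivity of 3/4, a specificity of 4/6, and an accuracy of 7/10.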


### Q8: Explain what is causing the differences between the two confusion matrices below, and why the two confusion matrices above (for `clf` and `clf2`) are better

#### Write a three to four sentence answer to this question in the markdown cell below

In [None]:
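# True classes: indicator of hard cover ('H'), built the same way as `y` in Q2
y_train = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])['H']
y_test = pd.get_dummies(ab_reduced_noNaN_test["Hard_or_Paper"])['H']
# Confusion matrix of `clf` on the training data
ConfusionMatrixDisplay(confusion_matrix(y_train, 
                                        clf.predict(ab_reduced_noNaN_train[['List Price']]), labels=[0, 1]), 
                       display_labels=["Paper","Hard"]).plot()
# Confusion matrix of `clf` on the testing data
ConfusionMatrixDisplay(confusion_matrix(y_test, 
                                        clf.predict(ab_reduced_noNaN_test[['List Price']]), labels=[0, 1]), 
                       display_labels=["Paper","Hard"]).plot()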

# Feature Importance

Compared to understanding the contribution of different covariates towards the final predicted values of multiple linear regression models (where you can just read off the equation to see how predictions work), the extent to which we do not understand the overall contributions of the different features to the final predictions from our decision trees should feel a bit off-putting. To remedy this we can use so-called **Feature Importance** heuristics to judge how relatively important the different features are in the final decision tree predictions. 


### Q9: Ask ChatGPT or search the internet for how to check feature importance in `scikit-learn`, then read the following paragraph and make sure you understand it 

> A decision tree is fit one step at a time. At each step of adding a new decision node splitting rule to the current tree structure, all possible decision rules for all possible variables are considered, and the rule that improves the prediction the most (as measured by the criterion of either "Gini impurity" or "Shannon entropy"), among those permitted by the tuning parameters of the decision tree, is added to the tree.  The overall "criterion" noted above improves with each new decision node splitting rule, so the improvement can thus be tracked and the contributions attributed to the feature upon which the decision node splitting rule is based.  This means the relative contribution of each feature to the overall explanatory power of the model can be calculated, and this is what the `.feature_importances_` attribute does. 
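
The bookkeeping described in this paragraph can be reproduced by hand from a fitted tree. The sketch below is an illustration under assumptions: it uses three features of the `sklearn` breast cancer data (not the books data), the names `demo_clf`, `gini`, and `manual` are introduced here, and it relies on the default "Gini impurity" criterion.

```python
import numpy as np
from sklearn import datasets, tree

cancer = datasets.load_breast_cancer()
features = ["mean radius", "mean texture", "mean smoothness"]
columns = [list(cancer.feature_names).index(name) for name in features]
X, y = cancer.data[:, columns], cancer.target

demo_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def gini(class_counts):
    """Gini impurity of a node given the counts of each class in it."""
    p = class_counts / class_counts.sum()
    return 1 - np.sum(p ** 2)

print(gini(np.array([5, 5])), gini(np.array([10, 0])))  # 0.5 (mixed) and 0.0 (pure)

# Attribute each split's impurity decrease to the feature it splits on
t = demo_clf.tree_
manual = np.zeros(len(features))
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf nodes have no splitting rule
        continue
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease
manual /= manual.sum()  # normalize so the importances sum to 1

for name, ours, theirs in zip(features, manual, demo_clf.feature_importances_):
    print(name, round(ours, 3), round(theirs, 3))
```

The hand-computed values agree with `.feature_importances_`, and both sets of importances sum to 1.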

### Q10: Which predictor variable is most important for making predictions according to `clf2`?

#### Visualize the *Feature Importances* and report the name of most important feature and its numeric *Feature Importance* value

> - Hint 0: `.feature_importances_` & `.feature_names_in_`
> - Hint 1: Visualize Feature Importances : )

In [None]:
Q10_most_important_feature = 
Q10_most_important_feature_percentage_score = 

### Q11: Describe the differences of interpreting coefficients in linear model regression versus feature importances in decision trees.

#### Write a couple of sentences or so in the markdown cell below to answer this question 

> Hint: linear model regression predicts continuous real-valued averages for a given configuration of covariate values (or, feature values, if we're using machine learning terminology instead of statistical terminology), whereas a binary classification model such as a binary classification tree predicts 0/1 ("yes" or "no") outcomes (and gives the probability of a 1 "yes" (or "success") outcome from which a 1/0 "yes"/"no" prediction can be made); but, this is not what is being asked here. This question is asking "what's the difference in the way we can interpret and understand how the predictor variables influence the predictions in linear model regression based on the coefficients versus in binary decision trees based on the Feature Importances?"

In [None]:
# Assignment questions would go here.