# STA130 TUT 08 (Nov15) <br> <u>Classification, Decision Tree with sklearn, and Confusion Matrix</u>


## Review / Questions [15 minutes]
1. Follow-up questions and clarifications regarding the content of the Nov08 TUT and Nov11 LEC
    > 1. (For the same data) R-squared never decreases, and in practice almost always increases, with each additional predictor variable added to the model
    > 2. Adjusted R-squared attempts to "correct" this phenomenon by penalizing the number of predictors, based on a sort of "theoretical expectation" or "rule of thumb"
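
The two review points can be checked numerically. The following is a minimal sketch on simulated data, not the course's own code: the seed, sample size, coefficients, and the helper names `r_squared` and `adjusted_r_squared` are assumptions made for illustration. It fits a least squares model with one real predictor, then adds a pure-noise predictor and compares R-squared with the common adjusted R-squared formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$.

```python
import numpy as np

rng = np.random.default_rng(130)  # arbitrary seed (assumption)
n = 200
x1 = rng.normal(size=n)                  # a predictor that really drives y
y = 2.0 + 1.5 * x1 + rng.normal(size=n)  # simulated outcome
noise = rng.normal(size=n)               # unrelated to y by construction


def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2_one = r_squared(x1.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x1, noise]), y)
adj_one = adjusted_r_squared(r2_one, n, p=1)
adj_two = adjusted_r_squared(r2_two, n, p=2)

print(f"1 predictor : R2={r2_one:.4f}, adjusted R2={adj_one:.4f}")
print(f"2 predictors: R2={r2_two:.4f}, adjusted R2={adj_two:.4f}")
```

R-squared with the noise predictor included cannot be lower than without it, even though the extra column carries no information about the outcome. Adjusted R-squared is always below R-squared here and may go up or down when the noise column is added.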

## Demo (Concept of Decision Tree and Confusion Matrix) [45 minutes]
1. **[15 of the 45 minutes]** Introduce **Train-Test** validation


In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
cancer_data = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_data.data, columns = cancer_data.feature_names)

In [None]:
# Notice that we're not using 'target' which is binary
# with 'target_names': array(['malignant', 'benign'], dtype='<U9')
cancer_data

In [None]:
print(cancer_data['DESCR'])

In [None]:
# randomly split the data into 80% training data and 20% testing data
np.random.seed(131)
training_indices = cancer_df.sample(frac=0.80, replace=False).index.sort_values()
testing_indices = cancer_df.index[~cancer_df.index.isin(training_indices)]
# now we'll use part of the data to "train" the MLR and "test" it with the other part of the data 

In [None]:
# The outcome variable is continuous (not binary) and so are the predictor variables
MLR = smf.ols('Q("mean compactness") ~ Q("mean radius") + Q("mean concavity")', 
              data=cancer_df.loc[training_indices,:])
# Fit the multiple linear regression model (MLR)
MLR_fit = MLR.fit()
np.corrcoef(MLR_fit.predict(cancer_df.loc[training_indices,:]),
            cancer_df.loc[training_indices,"mean compactness"])[0,1]**2

In [None]:
# "In sample" performance based on the "training data" (above) 
# is different from 
# "out of sample" performance based on the "testing data" (below) 
np.corrcoef(MLR_fit.predict(cancer_df.loc[testing_indices,:]),
            cancer_df.loc[testing_indices,"mean compactness"])[0,1]**2

In [None]:
# The fact that the model fit as evaluated by R-squared (proportion of variation explained)
# is worse for "new" data as opposed to "data the model was trained on and has already seen"
# is generally what we'd expect to see in an analysis like this...

# The good news is that the "out of sample" performance (proportion of variation explained)
# is not much worse than the "in sample" performance, which we'd typically interpret as suggesting that
# the model generalizes reasonably well to data it has not seen
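
A single 80/20 split depends on the random seed, so the size of the gap between "in sample" and "out of sample" performance varies from split to split. The following is a minimal sketch of repeating the comparison over several splits. It is an assumption-laden variant, not the notebook's own method: it swaps in sklearn's `train_test_split` and `LinearRegression` for the `sample`/`smf.ols` approach above, uses seeds 0 to 4 arbitrarily, and introduces a hypothetical helper `squared_correlation` that mirrors the `np.corrcoef(...)[0,1]**2` calculation used in the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer(as_frame=True)
X = cancer.data[["mean radius", "mean concavity"]]  # same predictors as the MLR above
y = cancer.data["mean compactness"]                 # same continuous outcome


def squared_correlation(model, X, y):
    """Squared correlation between predictions and observed outcomes."""
    return np.corrcoef(model.predict(X), y)[0, 1] ** 2


results = []
for seed in range(5):  # arbitrary seeds (assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)  # fit on training data only
    results.append(
        (
            seed,
            squared_correlation(model, X_train, y_train),  # "in sample"
            squared_correlation(model, X_test, y_test),    # "out of sample"
        )
    )

for seed, in_sample, out_of_sample in results:
    print(f"seed={seed}: in sample={in_sample:.3f}, out of sample={out_of_sample:.3f}")
```

If the two columns stay close across seeds, that supports the interpretation that the model is not badly overfit. With a small test set, an individual split can occasionally show out-of-sample performance above in-sample performance purely by chance.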

2. **[15 of the 45 minutes]** Introduce the **concept** of a decision tree; we will reserve coding for lecture.
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20230424141242/dcr.png" alt="Optional Title" style="width: 500px; height: auto;"/>
3. **[15 of the 45 minutes]** Introduce the **concept** of a confusion matrix; we will reserve coding for lecture.
<img src="https://miro.medium.com/v2/resize:fit:667/1*3yGLac6F4mTENnj5dBNvNQ.jpeg" alt="Optional Title" style="width: 500px; height: auto;"/>
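
For reference after the tutorial, the two concepts can be connected in a few lines of sklearn. This is only a rough sketch under stated assumptions, not the code that will be presented in lecture: the `max_depth=2` setting, the seed, and the choice to treat "malignant" (coded `0` in this dataset) as the "positive" class are all illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()  # target: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.20, random_state=130
)

# A shallow tree keeps the sequence of yes/no splits easy to read (assumption)
tree = DecisionTreeClassifier(max_depth=2, random_state=130)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are the true classes and columns are the predicted classes,
# both in the order given by labels=[0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Treating malignant (0) as "positive" (assumption):
# row 0 = truly malignant, row 1 = truly benign
tp, fn, fp, tn = cm.ravel()
print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```

Which class counts as "positive" is a convention. Swapping it exchanges the roles of false positives and false negatives, so it is worth stating explicitly whenever a confusion matrix is reported.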

## Communication **[40 minutes]**

1. **[15 minutes]** Break into 5 groups of 5 and prepare a speech describing the purpose of train-test validation
   
   
   
2. **[25 minutes]** Discuss the implications of false positives (Type I errors) and false negatives (Type II errors) in hypothesis testing. How do these types of errors impact the conclusions we can draw from a statistical test? Consider the following in your discussion:
> - What does a false positive (FP) represent in the context of hypothesis testing?
> - What does a false negative (FN) represent in the context of hypothesis testing?
> - How might the costs or consequences of these errors differ depending on the field of application (e.g., medical, environmental, manufacturing)?
-  **Note: This discussion will prepare you well for the pre-lecture homework!**

