# STA130 LEC Week 11 (Nov 18)

## Binary Classification Binary Decision Trees and Machine Learning

*Didn't get to the code at the end of the last lecture, but that was a complex demonstration so let's just restart.*

0. **Logistic Regression** 

    1. Makes **probability predictions** for **binary outcomes**
    2. The **train-test** versus **statistical hypothesis testing and inference**
    3. **Model Complexity** is number of **predictor variables** (and **interactions**)
    4. **Generalization** versus **Overfitting**


1. **Machine Learning** and **Regularization**

    1. **Binary Classification Binary Decision Trees**
        1. **Regularization Tuning Parameters** (or, technically, **stopping parameters**)
        2. Decision Tree Construction AKA **Model Fitting**
        3. What are **Decision Trees**?
            1. **Interactions**
            2. **Feature Space** _partitions_
            3. **Feature Importance**
            4. **Partial Dependency Plots**

    2. **Random Forests** (of **Bootstrapped Decision Trees**)


2. **Prediction**, **thresholding**, and different **Metrics**


3. **Self Evaluation \#1: what's the correlation of your understanding versus the truth of the following items?<br>AKA what's your 0%-100% (or, technically -100%-100%) understanding level for the following topics?**
    1. Bootstrapped Confidence Intervals
    2. "Coin Flipping" sampling distribution hypothesis testing for "paired samples"
    3. Calculating p-values based on observed statistics and "sampling distributions under the null"
    4. Correlation
    5. The normal "Simple Linear Regression" model
    6. Fitting Simple Linear Regression models
    7. Making predictions from linear models
    8. Using Simple Linear Regression to evaluate the evidence of association between two continuous variables
    9. Assessing the assumptions of Simple Linear Regression using residuals
    10. Hypothesis testing for two unpaired samples using a permutation test (as opposed to hypothesis testing based on differences for "paired samples")
    11. Hypothesis testing for two groups (unpaired samples) using indicator variables in Simple Linear Regression
    12. "Double" bootstrap confidence intervals estimating difference parameters for two groups (unpaired samples)


4. **Self Evaluation \#2: what's the correlation of your understanding versus the truth of the following items?<br>AKA what's your 0%-100% (or, technically -100%-100%) understanding level for the following topics?**

    1. Multiple Linear Regression versus Simple Linear Regression
    2. Binary indicator variables
    3. Categorical variables
    4. Interactions
    5. Multicollinearity versus Statistical Inference
    6. Multicollinearity versus Prediction
    7. Logistic Regression
    8. Classification versus Regression
    9. Machine Learning versus Statistical Inference
    10. Classification Decision Trees versus Multiple Linear Regression
    11. Classification Decision Trees versus Logistic Regression
    12. Model Complexity and Overfitting
    13. Model Complexity and Regularization Tuning Parameters
    
    
5. **Student Lecture Summary**

    


## 0. Restarting _Logistic Regression_ with _this new data set_

In [None]:
import pandas as pd 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

column_names = ["age", "workclass", "fnlwgt", "education", "education-num", 
                "marital-status", "occupation", "relationship", "race", "sex", 
                "capital-gain", "capital-loss", "hours-per-week", "native-country", 
                "income"]
data_raw = pd.read_csv(url, names=column_names, skipinitialspace=True)
data_use = data_raw.copy()
#data_use = data_use.drop(columns=['workclass', 'marital-status', 'occupation', 
#                                  'capital-gain', 'capital-loss', 'hours-per-week', 
#                                  'native-country', 'education-num', 'fnlwgt'])
display(data_use.head(), data_use.shape)

In [None]:
data_use.income.value_counts()

In [None]:
data_use.education.value_counts()

In [None]:
data_use.loc[data_use.education == 'Preschool', 'education'] = "<=6th"
data_use.loc[data_use.education == '1st-4th', 'education'] = "<=6th"
data_use.loc[data_use.education == '5th-6th', 'education'] = "<=6th"
data_use.education.value_counts()

In [None]:
data_use.workclass.value_counts()

In [None]:
data_use.loc[data_use.workclass == 'Without-pay', 'workclass'] = "?"
data_use.loc[data_use.workclass == 'Never-worked', 'workclass'] = "?"
data_use.workclass.value_counts()

In [None]:
data_use.occupation.value_counts()

In [None]:
data_use.loc[data_use.occupation == 'Armed-Forces', 'occupation'] = "?"
data_use.occupation.value_counts()


In [None]:
#data_use['workclass-occupation'] = data_use.workclass + " " + data_use.occupation
#data_use['workclass-occupation'].value_counts()
#for i,k in zip(data_use['workclass-occupation'].value_counts().index,data_use['workclass-occupation'].value_counts().values):
#    print(i, k)

## 0.2 The _train-test_ versus _statistical hypothesis testing and inference_

In [None]:
from sklearn import model_selection
import numpy as np

np.random.seed(130)
train, test = model_selection.train_test_split(data_use, train_size=0.8)

## 0. Logistic Regression

In [None]:
import statsmodels.formula.api as smf

formula = '''
I((income=='>50K').astype(int)) ~ scale(age) + I(scale(age)**2) + I(scale(age)**3)
                                + C(education, Treatment(reference='HS-grad'))
'''
logreg = smf.logit(formula, data=train)
logreg_fit = logreg.fit()
logreg_fit.summary()

## 0.1 Makes _probability predictions_ for _binary outcomes_
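A fitted logistic regression turns a linear predictor into a probability through the inverse logit, and a class prediction only appears once that probability is compared to a threshold (the `>0.5` used in the cells that follow). Here is a minimal sketch of that mapping. The `intercept` and `slope` values are hypothetical and are not taken from `logreg_fit`.

```python
import math

def inverse_logit(linear_predictor):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients for a single standardized predictor (illustration only)
intercept, slope = -1.0, 0.8

for x in (-2.0, 0.0, 1.25, 3.0):
    log_odds = intercept + slope * x
    probability = inverse_logit(log_odds)
    # The probability is the model's prediction; the class label is a separate thresholding step
    print(x, round(probability, 3), probability > 0.5)
```

The probability is the model output; the `True`/`False` label depends entirely on the chosen threshold.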

In [None]:
import numpy as np
np.corrcoef((train.income=='>50K'),(logreg_fit.predict(train)>0.5))#**2

In [None]:
from scipy import stats
stats.spearmanr((train.income=='>50K'),(logreg_fit.predict(train)>0.5))#[0]**2

## 0.2 The _train-test_ versus _statistical hypothesis testing and inference_

In [None]:
((train.income=='>50K')==(logreg_fit.predict(train)>0.5)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==(logreg_fit.predict(test)>0.5)).sum()/test.shape[0]

## 0.3 _Model Complexity_ is number of _predictor variables_ (and _interactions_)

In [None]:
train["sex"].value_counts()

In [None]:
train["marital-status"].value_counts()

In [None]:
train.relationship.value_counts()

In [None]:
train.race.value_counts()

In [None]:
formula = '''
I((income=='>50K').astype(int)) ~ scale(age) + I(scale(age)**2) + I(scale(age)**3)
                                + scale(Q("education-num")) 
                                + C(education, Treatment(reference='HS-grad'))
                                + C(Q("marital-status"), Treatment(reference='Married-civ-spouse')) 
                                + C(relationship, Treatment(reference='Husband'))
                                + C(sex, Treatment(reference='Male')) 
                                + C(race, Treatment(reference='White'))
                                + C(workclass) + C(occupation)
'''
logreg = smf.logit(formula, data=train)
logreg_fit = logreg.fit()
logreg_fit.summary()

## 0.4 **Generalization** versus **Overfitting**

In [None]:
((train.income=='>50K')==(logreg_fit.predict(train)>0.5)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==(logreg_fit.predict(test)>0.5)).sum()/test.shape[0]

In [None]:
formula = '''
I((income=='>50K').astype(int)) ~ scale(age) * I(scale(age)**2) 
                                * scale(Q("education-num")) 
                                * C(race, Treatment(reference='White'))
                                * C(sex, Treatment(reference='Male')) 
                                + C(education, Treatment(reference='HS-grad'))
                                + C(Q("marital-status"), Treatment(reference='Married-civ-spouse')) 
                                + C(relationship, Treatment(reference='Husband'))
                                + C(workclass) + C(occupation)
'''
logreg = smf.logit(formula, data=train)
logreg_fit = logreg.fit()
logreg_fit.summary()

## 0.1 Makes _probability predictions_ for _binary outcomes_
## 0.2 The _train-test_ versus _statistical hypothesis testing and inference_
## 0.3 _Model Complexity_ is number of _predictor variables_ (and _interactions_)
## 0.4 **Generalization** versus **Overfitting**

In [None]:
((train.income=='>50K')==(logreg_fit.predict(train)>0.5)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==(logreg_fit.predict(test)>0.5)).sum()/test.shape[0]

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((train.income=='>50K'), logreg_fit.predict(train)>0.5, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), logreg_fit.predict(test)>0.5, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

#### Accuracy
Accuracy measures the proportion of correct predictions (both true positives and true negatives) among all cases evaluated.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

#### Specificity (True Negative Rate)
Specificity measures the proportion of actual negatives that are correctly identified.
$$\text{Specificity} = \frac{TN}{TN + FP}$$

#### Sensitivity (True Positive Rate)
Sensitivity measures the proportion of actual positives that are correctly identified.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

#### Precision (Positive Predictive Value)
Precision measures the proportion of positive identifications that were actually correct.
$$\text{Precision} = \frac{TP}{TP + FP}$$

> - **Negative Predictive Value** is the "negative" version of **precision** $\frac{TN}{TN + FN}$
> - **False negative rates (FNR)** are defined to be the proportion of actually positive cases which are incorrectly identified (as false negatives) $FNR = FN/(TP+FN) = 1-TPR$
> - **False positive rates (FPR)** are defined to be the proportion of actually negative cases which are incorrectly identified (as false positives) $FPR = FP/(TN+FP) = 1-TNR$
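
The definitions above can be checked by hand on a small example. The sketch below counts TP, TN, FP, and FN directly and derives each rate from them. The function name `classification_metrics` and the toy labels are made up for illustration, and the code assumes every denominator is nonzero.

```python
def classification_metrics(y_true, y_pred):
    """Count the confusion-matrix cells for boolean labels and derive the usual rates."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # actual positive, predicted positive
    tn = sum((not t) and (not p) for t, p in pairs)  # actual negative, predicted negative
    fp = sum((not t) and p for t, p in pairs)        # actual negative, predicted positive
    fn = sum(t and (not p) for t, p in pairs)        # actual positive, predicted negative
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (TPR)
        "specificity": tn / (tn + fp),   # true negative rate (TNR)
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "fnr": fn / (tp + fn),           # 1 - sensitivity
        "fpr": fp / (tn + fp),           # 1 - specificity
    }

# Toy labels (illustration only): 4 actual positives, 6 actual negatives
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False, False]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Here TP=3, FN=1, FP=1, and TN=5, so sensitivity and precision are both 3/4 while specificity is 5/6.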
 


In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score
# in sklearn specificity is recall_score(y_true, y_pred, pos_label=0)
# while sensitivity recall_score(y_true, y_pred, pos_label=1) is the default 

print("In sample (training) sensitivity", recall_score(train.income=='>50K', logreg_fit.predict(train)>0.5, pos_label=True))
print("Out of sample (testing) sensitivity", recall_score(test.income=='>50K', logreg_fit.predict(test)>0.5, pos_label=True))
print("In sample (training) specificity", recall_score(train.income=='>50K', logreg_fit.predict(train)>0.5, pos_label=False))
print("Out of sample (testing) specificity", recall_score(test.income=='>50K', logreg_fit.predict(test)>0.5, pos_label=False))
print("In sample (training) precision", precision_score(train.income=='>50K', logreg_fit.predict(train)>0.5))
print("Out of sample (testing) precision", precision_score(test.income=='>50K', logreg_fit.predict(test)>0.5))

## 1. Machine Learning and Regularization


In [None]:
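# One-hot encode the predictors (every column except the last one, `income`)
X_train = pd.get_dummies(train.iloc[:,:-1]).astype(float)
# Encode the test set the same way, then align it to the training columns:
# dummy columns missing from test are filled with 0.0, and columns seen only in test are dropped
X_test = pd.get_dummies(test.iloc[:,:-1]).astype(float)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)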

## 1.1 Binary Classification Binary Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(max_depth=15, random_state=42)
clf.fit(X=X_train, y=(train.iloc[:,-1]=='>50K').astype(int))

plt.figure(figsize=(10,5), dpi=200)
plot_tree(clf, feature_names=X_train.columns.tolist(), 
          class_names=['<=50k','>50k'],
          filled=True, rounded=True)
plt.show()

## 1.1.1. _Regularization Tuning Parameters_ (or, technically, _stopping parameters_)

In [None]:
((train.income=='>50K')==clf.predict(X_train)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==clf.predict(X_test)).sum()/test.shape[0]

In [None]:
clf = DecisionTreeClassifier(max_depth=30, random_state=42)
clf.fit(X=X_train, y=(train.iloc[:,-1]=='>50K').astype(int))

plt.figure(figsize=(10,5), dpi=200)
plot_tree(clf, feature_names=X_train.columns.tolist(), 
          class_names=['<=50k','>50k'],
          filled=True, rounded=True)
plt.show()

In [None]:
((train.income=='>50K')==clf.predict(X_train)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==clf.predict(X_test)).sum()/test.shape[0]

In [None]:
DecisionTreeClassifier?

In [None]:
clf = DecisionTreeClassifier(max_depth=30, random_state=42, 
                             min_samples_leaf=30, 
                             min_samples_split=100)
clf.fit(X=X_train, y=(train.iloc[:,-1]=='>50K').astype(int))

plt.figure(figsize=(10,5), dpi=200)
plot_tree(clf, feature_names=X_train.columns.tolist(), 
          class_names=['<=50k','>50k'],
          filled=True, rounded=True)
plt.show()

In [None]:
((train.income=='>50K')==clf.predict(X_train)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==clf.predict(X_test)).sum()/test.shape[0]

## 1.1.2. Decision Tree Construction AKA _Model Fitting_
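
Fitting a classification tree is usually described as a greedy search: at each node, try candidate splits and keep the one that leaves the child nodes most "pure". The sketch below shows that search for a single numeric feature using Gini impurity, which is the default `criterion` of `DecisionTreeClassifier`. The helper names `gini` and `best_split` and the toy data are made up for illustration. This is not scikit-learn's actual implementation, which is considerably more optimized.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Greedy search for the threshold on one feature that minimizes
    the size-weighted Gini impurity of the two child nodes."""
    n = len(labels)
    best_threshold, best_impurity = None, gini(labels)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Toy data (illustration only): years of education and a 0/1 high-income outcome
education_num = [8, 9, 9, 10, 12, 13, 14, 16]
high_income = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(education_num, high_income)
print(threshold, impurity)
```

A full tree repeats this search over every feature, applies the winning split, and recurses into each child until a stopping parameter such as `max_depth`, `min_samples_split`, or `min_samples_leaf` prevents further splitting.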


## 1.1.3. What are **Decision Trees**?

### 1.1.3.1. Interactions

![](https://www.researchgate.net/publication/280032275/figure/fig6/AS:340436318212124@1458177751589/An-example-population-decision-tree-and-a-personalized-decision-path-Panel-a-gives-the.png)

### 1.1.3.2. Feature Space _partitions_

![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1528907338/regression-tree_g8zxq5.png)
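
One way to read both ideas (interactions and feature space partitions) is to write a small tree out as nested `if` statements. The sketch below is a hand-written depth-2 tree on two features. The function name `toy_tree_predict`, the thresholds, and the leaf probabilities are all hypothetical and were not fitted to the data in this notebook.

```python
def toy_tree_predict(age, education_num):
    """A hand-written depth-2 tree returning a made-up P(income > 50K) for each leaf."""
    if education_num <= 12.5:
        if age <= 30.5:
            return 0.05   # leaf: education_num <= 12.5 and age <= 30.5
        return 0.20       # leaf: education_num <= 12.5 and age > 30.5
    else:
        if age <= 30.5:
            return 0.15   # leaf: education_num > 12.5 and age <= 30.5
        return 0.60       # leaf: education_num > 12.5 and age > 30.5

# Each leaf is a rectangle in the (age, education_num) plane,
# and every point inside the same rectangle gets the same prediction
print(toy_tree_predict(25, 9), toy_tree_predict(29, 12))

# Interaction: the change in prediction from being older depends on education
older_effect_low_edu = toy_tree_predict(45, 9) - toy_tree_predict(25, 9)
older_effect_high_edu = toy_tree_predict(45, 14) - toy_tree_predict(25, 14)
print(older_effect_low_edu, older_effect_high_edu)
```

Because the second split sits inside a branch of the first, the effect of `age` differs between the two education branches, which is what an interaction means. No interaction term has to be specified in advance the way it does in a regression formula.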

### 1.1.3.3. Feature Importance

### 1.1.3.4. Partial Dependency Plots


In [None]:
#https://stackoverflow.com/questions/52771328/plotly-chart-not-showing-in-jupyter-notebook
import plotly.offline as pyo
# Set notebook mode to work in offline
pyo.init_notebook_mode()

In [None]:
import plotly.express as px

feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns.tolist(),
    'Importance': clf.feature_importances_
}).sort_values(by='Importance', ascending=False).reset_index()

fig = px.bar(feature_importance_df[:20], y='Feature', x='Importance', 
             title='Feature Importance')
fig.show()

In [None]:
from sklearn.inspection import PartialDependenceDisplay

# Column index 2 of X_train is 'education-num' (check with X_train.columns[2])
_ = PartialDependenceDisplay.from_estimator(clf, X_train, (2,))


## 1.2. **Random Forests** (of **Bootstrapped Decision Trees**)
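
The "bootstrapped" part of the name can be sketched in a few lines: resample the training rows with replacement, fit one tree per resample, and combine the trees by voting. The code below is a deliberately simplified illustration using one-split "stumps" on a single feature. It shows bagging only and leaves out the random feature subsetting that a random forest also performs. The function names `fit_stump`, `fit_bagged_stumps`, and `vote_fraction`, along with the toy data, are made up for illustration.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold that maximizes training accuracy for the rule 'predict 1 when x > threshold'."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(xs)):
        correct = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def fit_bagged_stumps(xs, ys, n_estimators, seed):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def vote_fraction(thresholds, x):
    """Fraction of stumps voting for class 1, used here as a probability-like score."""
    return sum(x > t for t in thresholds) / len(thresholds)

# Toy data (illustration only): one numeric feature with a noisy 0/1 outcome
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

thresholds = fit_bagged_stumps(xs, ys, n_estimators=200, seed=130)
for x in (0, 4.5, 5.5, 6.5, 11):
    print(x, vote_fraction(thresholds, x))
```

Each bootstrap sample yields a slightly different threshold, so the vote fraction changes gradually near the class boundary instead of jumping at a single cut point.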


In [None]:
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier?

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Fit 1000 bootstrapped Decision Trees; max_depth is left unlimited,
# so tree size is controlled by min_samples_leaf and min_samples_split
rfc = RandomForestClassifier(n_estimators=1000, random_state=1,
                             min_samples_leaf=10, min_samples_split=30)
rfc.fit(X=X_train, y=(train.iloc[:,-1]=='>50K').astype(int))

In [None]:
((train.income=='>50K')==rfc.predict(X_train)).sum()/train.shape[0]

In [None]:
((test.income=='>50K')==rfc.predict(X_test)).sum()/test.shape[0]

In [None]:
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns.tolist(),
    'Importance': rfc.feature_importances_
}).sort_values(by='Importance', ascending=False).reset_index()

fig = px.bar(feature_importance_df[:60], y='Feature', x='Importance', 
             title='Feature Importance',
              width=800, height=1200)
fig.show()

In [None]:
_ = PartialDependenceDisplay.from_estimator(rfc, X_train, (2,))


In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((train.income=='>50K'), clf.predict(X_train)==1.0, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((train.income=='>50K'), rfc.predict(X_train)==1.0, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), clf.predict(X_test)==1.0, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), rfc.predict(X_test)==1.0, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), rfc.predict_proba(X_test)[:,1]>0.5, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

## 2. _Prediction_, _thresholding_, and different _Metrics_

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), rfc.predict_proba(X_test)[:,1]>0.8, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(
    confusion_matrix((test.income=='>50K'), rfc.predict_proba(X_test)[:,1]>0.2, 
    labels=[False, True]), display_labels=['<=50K','>50K'])
_ = cm_disp.plot()
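
The confusion matrices at thresholds 0.8, 0.5, and 0.2 illustrate a general trade-off: lowering the threshold labels more cases as positive, which raises sensitivity and lowers specificity. The sketch below reproduces that pattern on a small made-up set of labels and predicted probabilities. The function name `rates_at_threshold` and the data are hypothetical and are not taken from `rfc`.

```python
def rates_at_threshold(y_true, prob, threshold):
    """Return (sensitivity, specificity) when predicting positive if prob > threshold."""
    pred = [p > threshold for p in prob]
    tp = sum(t == 1 and p for t, p in zip(y_true, pred))
    fn = sum(t == 1 and not p for t, p in zip(y_true, pred))
    tn = sum(t == 0 and not p for t, p in zip(y_true, pred))
    fp = sum(t == 0 and p for t, p in zip(y_true, pred))
    return tp / (tp + fn), tn / (tn + fp)

# Toy data (illustration only): 4 actual positives followed by 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prob = [0.95, 0.85, 0.60, 0.30, 0.70, 0.40, 0.25, 0.15, 0.10, 0.05]

results = {}
for threshold in (0.2, 0.5, 0.8):
    results[threshold] = rates_at_threshold(y_true, prob, threshold)
    sensitivity, specificity = results[threshold]
    print(threshold, round(sensitivity, 3), round(specificity, 3))
```

Which threshold is "best" depends on the relative cost of false positives and false negatives, so the choice is a decision about the metric and not something the model determines on its own.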