# ~Simple/Multiple Linear Regression~ Binary Classification<br>with Decision Trees

- Remember ***extrapolation predictions*** $\tilde y_i = \hat \beta_0 + \hat \beta_1 \tilde x_i$...?<br><br>

    - Train-Test Framework: "In Sample" versus "Out of Sample" performance
      
      ```python 
np.random.seed(130)
train_index = ab_noNaN.sample(frac=0.80).index.sort_values()
isin_train_index = ab_noNaN.index.isin(train_index)
test_index = ab_noNaN.index[~isin_train_index]
```

    - Model Complexity and the Train-Test Framework
    
        - Multiple Linear Regression Model Building with `statsmodels`

          ```python 
        # Formula syntax: Q("x 0"), C(x1), I(x2**2), C(x3) * x2, ...
        # C(x3) versus just x3? Training C(x3) versus testing C(x3)?
        modfit.model.exog # modfit.model.endog # modfit.model.formula
```

- `sklearn`
    
  ```python 
# Regression
from sklearn.linear_model import LinearRegression
y = df['continuous outcome']
X = pd.get_dummies(df[list_of_covariates], drop_first=True)
# Categorical variables with `k` levels only need `k-1` columns...
mod = LinearRegression(fit_intercept=True) # False if X already has a 1's column
mod.fit(X, y) # don't need to assign to a `modfit` object like `statsmodels`
mod.intercept_, mod.coef_, mod.predict(X) # `predict` needs the features to predict for
```
  ```python 
# Classification
from sklearn import tree
y = pd.get_dummies(df['binary outcome'], drop_first=True) 
# Classification generalizes to categorical outcomes... 
X = df[list_of_features].astype(float) # might need `pd.get_dummies`
tree_depth = 3 # Model complexity control (hypothetical example value)
clf = tree.DecisionTreeClassifier(max_depth=tree_depth)
clf.fit(X,y)
```

    - Decision Tree visualization with `graphviz`
        - Decision Trees already use high-order "interactions"...
        - Decision Trees therefore definitely need "complexity control"...
        
    - Confusion Matrices, Sensitivity, Specificity, Accuracy with 
        - `from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay`
        


In [2]:
import pandas as pd
ab = pd.read_csv("../../amazonbooks.csv", encoding="ISO-8859-1")#.dropna()
#print(ab.shape)
#ab.isnull().sum()
ab_noNaN = ab.drop(['Weight_oz','Width','Height'], axis=1).dropna()
ab_noNaN['Pub year'] = ab_noNaN['Pub year'].astype(int)
ab_noNaN['NumPages'] = ab_noNaN['NumPages'].astype(int)
ab_noNaN['Hard_or_Paper'] = ab_noNaN['Hard_or_Paper'].astype("category")
#print(ab_noNaN.shape)
#ab_noNaN.dtypes
ab_noNaN

|     | Title | Author | List Price | Amazon Price | Hard_or_Paper | NumPages | Publisher | Pub year | ISBN-10 | Thick |
|-----|-------|--------|------------|--------------|---------------|----------|-----------|----------|---------|-------|
| 0 | 1,001 Facts that Will Scare the S#*t Out of Yo... | Cary McNeal | 12.95 | 5.18 | P | 304 | Adams Media | 2010 | 1605506249 | 0.8 |
| 1 | 21: Bringing Down the House - Movie Tie-In: Th... | Ben Mezrich | 15.00 | 10.20 | P | 273 | Free Press | 2008 | 1416564195 | 0.7 |
| 2 | 100 Best-Loved Poems (Dover Thrift Editions) | Smith | 1.50 | 1.50 | P | 96 | Dover Publications | 1995 | 486285537 | 0.3 |
| 3 | 1421: The Year China Discovered America | Gavin Menzies | 15.99 | 10.87 | P | 672 | Harper Perennial | 2008 | 61564893 | 1.6 |
| 4 | 1493: Uncovering the New World Columbus Created | Charles C. Mann | 30.50 | 16.77 | P | 720 | Knopf | 2011 | 307265722 | 1.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 320 | Where the Sidewalk Ends | Shel Silverstein | 18.99 | 12.24 | H | 192 | HarperCollins | 2004 | 60572345 | 1.1 |
| 321 | White Privilege | Paula S. Rothenberg | 27.55 | 27.55 | P | 160 | Worth Publishers | 2011 | 1429233443 | 0.7 |
| 322 | Why I wore lipstick | Geralyn Lucas | 12.95 | 5.18 | P | 224 | St Martin's Griffin | 2005 | 031233446X | 0.7 |
| 323 | Worlds Together, Worlds Apart: A History of th... | Robert Tignor | 97.50 | 97.50 | P | 480 | W. W. Norton & Company | 2010 | 393934942 | 0.9 |


In [None]:
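import numpy as np
np.random.seed(130)
# 80/20 train-test split on the row index; later cells use both indexes
train_index = ab_noNaN.sample(frac=0.80).index.sort_values()
test_index = ab_noNaN.index[~ab_noNaN.index.isin(train_index)]
#train_index
#test_index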

In [7]:
#train_index
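
A minimal sketch of the train-test split on a toy data frame, to make the "in sample" versus "out of sample" partition concrete. The `toy` data and the `toy_*` names are made up for illustration and are not the Amazon books data. The sketch samples 80% of the rows for training, keeps the remaining rows for testing, and checks that no row lands in both parts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `ab_noNaN`
toy = pd.DataFrame({"x": np.arange(10)})

np.random.seed(130)  # makes the random 80% sample reproducible
toy_train_index = toy.sample(frac=0.80).index.sort_values()
toy_test_index = toy.index[~toy.index.isin(toy_train_index)]

# Sizes of the two parts
print(len(toy_train_index), len(toy_test_index))
# True when no row is in both the training and the testing part
print(toy_train_index.intersection(toy_test_index).empty)
```

Sorting the training index is optional; it only makes the selected rows easier to read.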

In [14]:
# import statsmodels.api as sm
import statsmodels.formula.api as smf

simple_linear_regression_model = smf.ols('Q("Amazon Price") ~ Q("List Price")', 
                                         data=ab_noNaN)
model_fit = simple_linear_regression_model.fit() # creates line of best fit
model_fit.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Q("Amazon Price") | R-squared: | 0.913 |
| Model: | OLS | Adj. R-squared: | 0.912 |
| Method: | Least Squares | F-statistic: | 3307.0 |
| Date: | Sun, 26 Nov 2023 | Prob (F-statistic): | 9.09e-170 |
| Time: | 11:38:19 | Log-Likelihood: | -867.62 |
| No. Observations: | 319 | AIC: | 1739.0 |
| Df Residuals: | 317 | BIC: | 1747.0 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.6676 | 0.341 | -7.825 | 0.000 | -3.338 | -1.997 |
| Q("List Price") | 0.8500 | 0.015 | 57.506 | 0.000 | 0.821 | 0.879 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 104.183 | Durbin-Watson: | 2.015 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1018.441 |
| Skew: | 1.036 | Prob(JB): | 7.05e-222 |
| Kurtosis: | 11.505 | Cond. No. | 38.2 |


# Demonstrate model building
# Demonstrate "out of sample" performance
# Demonstrate model complexity... *increasing model complexity*<br>*eventually degrades "out of sample" performance*
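
One way to see the complexity effect is a small simulation. This is a sketch on synthetic data, not the Amazon books models: the data-generating line, the noise level, and the polynomial degrees are all assumptions chosen for illustration. It fits least squares polynomials of increasing degree on a training part and reports $R^2$ both in sample and out of sample.

```python
import numpy as np

rng = np.random.default_rng(130)

# Synthetic data (an assumption for illustration): a straight line plus noise
x = rng.uniform(-1, 1, size=60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=60)

# First 40 points are "in sample" (training); the rest are "out of sample" (testing)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def r_squared(y_true, y_pred):
    """Proportion of variation explained: 1 - SSE/SST."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - sse / sst

results = {}
for degree in [1, 3, 9]:
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X_train = np.vander(x_train, degree + 1, increasing=True)
    X_test = np.vander(x_test, degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    results[degree] = (r_squared(y_train, X_train @ beta_hat),
                       r_squared(y_test, X_test @ beta_hat))
    print(degree, results[degree])  # (in sample R^2, out of sample R^2)
```

Because each larger model contains the smaller ones, the in sample $R^2$ cannot go down as terms are added. The out of sample $R^2$ has no such guarantee, and how soon it stops improving depends on the data and the split.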

# Transition to `sklearn`

\begin{align*}
E[y] ={}& \beta_0 + \beta_1 \times \text{ListPrice} + \beta_2 \times 1_{[\text{'P'}]}(\text{Hard_or_Paper})\; + \\{}& \quad\;\;\; \beta_3 \times \text{ListPrice}\times 1_{[\text{'P'}]}(\text{Hard_or_Paper}) + \beta_4 \times \text{ListPrice}^2
\end{align*}
- are hardback (the 'P' indicator is 0): $\beta_0 + \beta_1 \times \text{ListPrice} + \beta_4 \times \text{ListPrice}^2$
- not hardback (the 'P' indicator is 1): $(\beta_0+\beta_2) + (\beta_1+\beta_3) \times \text{ListPrice} + \beta_4 \times \text{ListPrice}^2$
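
The way the indicator term splits the model into two curves can be checked numerically. This is a sketch with made-up coefficient values (`b0` through `b4` are hypothetical, not fitted estimates); the indicator $1_{[\text{'P'}]}$ is 1 for paperbacks and 0 for hardbacks, as in the model equation.

```python
import math

# Hypothetical coefficient values, chosen only for illustration
b0, b1, b2, b3, b4 = 1.0, 0.8, -2.0, 0.1, 0.001

def expected_price(list_price, hard_or_paper):
    """E[y] from the model equation; the indicator is 1 for 'P' and 0 for 'H'."""
    is_paper = 1 if hard_or_paper == "P" else 0
    return (b0 + b1 * list_price + b2 * is_paper
            + b3 * list_price * is_paper + b4 * list_price ** 2)

list_price = 20.0
hard = expected_price(list_price, "H")
paper = expected_price(list_price, "P")

# Indicator at 0: only the baseline terms remain
print(math.isclose(hard, b0 + b1 * list_price + b4 * list_price ** 2))
# Indicator at 1: the intercept shifts by b2 and the ListPrice slope shifts by b3
print(math.isclose(paper, (b0 + b2) + (b1 + b3) * list_price + b4 * list_price ** 2))
```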

In [None]:
# scikit-learn: THE ML (machine learning) library 
from sklearn.linear_model import LinearRegression

# X features (ML) [covariates in stats]; y is the outcome
X = MLRM_6_fit.model.exog # X must be fully numeric
# or we'll get the error: `ValueError: could not convert string to float`
y = MLRM_6_fit.model.endog
mod = LinearRegression(fit_intercept=False)
mod.fit(X, y)
mod.intercept_, mod.coef_ # all sklearn gives is y-hats
# sklearn does not give any statistical analysis... because
# sklearn is a machine learning library only concerned with prediction...
# not statistical analysis...
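
A small sketch of why `fit_intercept=False` is the right choice when the design matrix already carries a 1's column, as `model.exog` does. The data here are synthetic and the names `x_sim`, `y_sim`, `X_sim`, and `mod_sim` are made up for illustration. The intercept estimate then shows up as the first entry of `coef_`, and the coefficients agree with a direct least squares solve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(130)

# Synthetic data (an assumption for illustration)
x_sim = rng.normal(size=50)
y_sim = 3.0 + 2.0 * x_sim + rng.normal(scale=0.1, size=50)

# A design matrix that already has a 1's column, like `model.exog`
X_sim = np.column_stack([np.ones_like(x_sim), x_sim])

mod_sim = LinearRegression(fit_intercept=False)
mod_sim.fit(X_sim, y_sim)
print(mod_sim.intercept_)  # 0.0, since no separate intercept was fit
print(mod_sim.coef_)       # [intercept estimate, slope estimate]

# The same least squares problem solved directly with numpy
beta_hat, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)
print(np.allclose(mod_sim.coef_, beta_hat))
```

With `fit_intercept=True` and the 1's column left in, the intercept would be represented twice, which is why the two settings should not be combined.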

In [None]:
MLRM_6_fit.summary()
MLRM_6_fit.summary().tables[1]

In [None]:
pd.get_dummies(ab_noNaN.loc[test_index,"Hard_or_Paper"])

In [None]:
y = pd.get_dummies(ab_noNaN["Hard_or_Paper"])['H']
y
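
A toy illustration of what `pd.get_dummies` produces for a two-level categorical outcome. The `cover` series is hypothetical and stands in for `ab_noNaN["Hard_or_Paper"]`. Depending on the pandas version the indicator columns may be boolean or 0/1, so the sketch casts to integers explicitly.

```python
import pandas as pd

# Hypothetical toy labels standing in for `ab_noNaN["Hard_or_Paper"]`
cover = pd.Series(["P", "H", "P", "P", "H"], dtype="category")

dummies = pd.get_dummies(cover)       # one indicator column per level: 'H' and 'P'
y_toy = dummies["H"].astype(int)      # 1 for hardback, 0 for paperback

print(list(dummies.columns))
print(y_toy.tolist())
```

Selecting the `'H'` column makes hardback the "positive" class (1) and paperback the "negative" class (0).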

In [None]:
ab_noNaN[['NumPages', 'Thick', 'List Price']]

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf

In [None]:
X = ab_noNaN[['NumPages', 'Thick', 'List Price']]

clf = clf.fit(X.loc[train_index,:], y[train_index])
# Can't do... `formula = "Hard_or_Paper ~ Height + Width"`

_ = tree.plot_tree(clf)

In [None]:
# _ = tree.plot_tree(clf)
import graphviz 
dot_data = tree.export_graphviz(clf, class_names=['Paperback', 'Hardback'], 
                                feature_names=X.columns,  
                                out_file=None, filled=True, rounded=True,
                                special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
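
If `graphviz` is not available, `sklearn.tree.export_text` gives a plain-text view of the same decision rules. This is a sketch on a tiny made-up data set (the `X_toy` values and labels are hypothetical, not the Amazon books data).

```python
from sklearn import tree

# Hypothetical toy features [NumPages, Thick] and labels (1 = hardback)
X_toy = [[100, 0.3], [150, 0.5], [300, 1.2], [400, 1.5], [120, 0.4], [350, 1.3]]
y_toy = [0, 0, 1, 1, 0, 1]

clf_toy = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf_toy.fit(X_toy, y_toy)

# Plain-text rendering of the fitted decision rules, one line per node
rules = tree.export_text(clf_toy, feature_names=["NumPages", "Thick"])
print(rules)
```

Each indented line is a split on one feature, and each leaf line reports the predicted class.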

In [None]:
# A confusion matrix is a way to evaluate classification predictions
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm_disp = ConfusionMatrixDisplay(confusion_matrix(y[train_index].values, 
                                                  clf.predict(X.loc[train_index,:]), 
                                                  labels=[0,1]),
                                 display_labels=['P','H'])
_ = cm_disp.plot()

# Confusion Matrices are often presented in terms of FP/FN(TP/TN)

|         |Pred P(-) |Pred H(+) |
|---------|----------|----------|
|True P(-)|TrueNegative(TN)|FalsePositive(FP)|       
|True H(+)|FalseNegative(FN)|TruePositive(TP)|

- **Sensitivity** [proportion of correct predictions in the actually positive class]:<br>TP/(TP+FN), e.g., up above our sensitivity is ?/? = ?% sensitivity
- **Specificity** [proportion of correct predictions in the actually negative class]:<br>TN/(TN+FP), e.g., up above our specificity is ?/? = ?% specificity
- **Accuracy** [overall proportion of correct predictions]:<br>(TP+TN)/(TP+TN+FP+FN)[=N], e.g., up above our accuracy is ?/? = ?%

Note that these should be "out of sample" metrics: they should measure performance on test data, not on the training data used to fit the model (which would give overly optimistic estimates of these metric scores)

https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion
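
The three formulas can be worked through with a small example. The counts below are hypothetical, chosen only to show the arithmetic; they are not the results from the tree fitted above.

```python
# Hypothetical counts for illustration only
TN, FP = 40, 5   # actually paperback (negative): predicted P, predicted H
FN, TP = 4, 15   # actually hardback (positive): predicted P, predicted H

sensitivity = TP / (TP + FN)                  # correct among actual positives
specificity = TN / (TN + FP)                  # correct among actual negatives
accuracy = (TP + TN) / (TP + TN + FP + FN)    # correct overall

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```

With `labels=[0,1]`, the array returned by `confusion_matrix` is laid out as `[[TN, FP], [FN, TP]]`, matching the table above, so the four counts can be read off with `.ravel()`.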

In [None]:
cm_disp = ConfusionMatrixDisplay(confusion_matrix(y[test_index].values, 
                                                  clf.predict(X.loc[test_index,:]), 
                                                  labels=[0,1]),
                                 display_labels=['P','H'])
_ = cm_disp.plot()

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=4)
clf

X = ab_noNaN[['NumPages', 'Thick', 'List Price']]

clf = clf.fit(X.loc[train_index,:], y[train_index])
# Can't do... `formula = "Hard_or_Paper ~ Height + Width"`

dot_data = tree.export_graphviz(clf, class_names=['Paperback', 'Hardback'], 
                                feature_names=X.columns,  
                                out_file=None, filled=True, rounded=True,
                                special_characters=True)  
graphviz.Source(dot_data) 

In [None]:
cm_disp = ConfusionMatrixDisplay(confusion_matrix(y[train_index].values, 
                                                  clf.predict(X.loc[train_index,:]), 
                                                  labels=[0,1]),
                                 display_labels=['P','H'])
_ = cm_disp.plot()

In [None]:
cm_disp = ConfusionMatrixDisplay(confusion_matrix(y[test_index].values, 
                                                  clf.predict(X.loc[test_index,:]), 
                                                  labels=[0,1]),
                                 display_labels=['P','H'])
_ = cm_disp.plot()
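
The train versus test comparison can be repeated across several values of `max_depth` to see complexity control at work. This is a sketch on synthetic data (the features, the noisy labels, and the names `X_sim`, `y_sim`, and `clf_d` are assumptions for illustration), so the numbers say nothing about the Amazon books data.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(130)

# Synthetic data (an assumption for illustration): noisy labels driven by one feature
X_sim = rng.normal(size=(300, 3))
y_sim = (X_sim[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_tr, y_tr = X_sim[:240], y_sim[:240]   # 80% "in sample"
X_te, y_te = X_sim[240:], y_sim[240:]   # 20% "out of sample"

scores = {}
for depth in [1, 2, 4, 8, None]:        # None lets the tree grow until leaves are pure
    clf_d = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf_d.fit(X_tr, y_tr)
    # `score` is accuracy: the overall proportion of correct predictions
    scores[depth] = (clf_d.score(X_tr, y_tr), clf_d.score(X_te, y_te))
    print(depth, scores[depth])  # (training accuracy, testing accuracy)
```

An unrestricted tree classifies its own training data perfectly here, which is exactly why training accuracy is not a trustworthy guide; the testing column is the one to compare across depths.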