# Supervised Machine Learning - Regression and Decision Trees

REMINDER: No class next week. The next class is on the 24th, and the assignment is due at the start of that class.

# Setup

In [0]:
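## install the pinned seaborn version (0.9.0), plus prophet and graphviz
!pip install seaborn==0.9.0
!pip install prophet
## graphviz Python bindings, needed to render the decision tree later on
!pip install graphviz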

In [4]:
## setup our environment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## date and timeseries forecasting tooling
import datetime
from prophet import Prophet   # the package installed above is named prophet (formerly fbprophet)

## machine learning/predictive analytics tools             
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier           # <-------- New imports start here!
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression 
from sklearn import metrics
import graphviz 
from sklearn import tree


## pandas print columns/rows option (100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

## set the styling for seaborn (dark)
sns.set_style("dark")

ModuleNotFoundError: No module named 'graphviz'

# The Dataset

https://www.kaggle.com/ronitf/heart-disease-uci

In [5]:
# bring in the dataset
heart = pd.read_csv("/Users/Kyle_Staples/Documents/GitHub/IS834/datasets/heart-1.csv")

In [6]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB


In [0]:
heart.head()

In [7]:
# quick countplot of the target variable counts
sns.countplot(x="target", data=heart)

<matplotlib.axes._subplots.AxesSubplot at 0x1a139f82b0>



---



# Logistic Regression 

<h3> Another Classification Method </h3>

![](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1534281070/linear_vs_logistic_regression_edxw03.png)

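Logistic regression is used when we want to classify a target variable (y) that has 2 categories.  Even though the target is numeric and almost always coded 0 / 1, a linear regression will not fit the data well, because its predictions are not limited to the range between 0 and 1.

Instead, we use a logistic regression, which estimates the probability that a record belongs to the category coded __1__, the target category of interest.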

For example:

- Did the customer default on their bank loan?
- Did the actor win the award?
- Does a patient have an illness?
- Does a customer return next year?
- Does a stock go up in price tomorrow?
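
To make the contrast with linear regression concrete, the sketch below passes a linear score through the logistic (sigmoid) function. The intercept and slope are made-up values for illustration only; they are not fitted to the heart dataset.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical intercept and slope, chosen only for illustration
b0, b1 = -4.0, 0.8

for x in [0, 5, 10]:
    linear_score = b0 + b1 * x           # unbounded, like a linear regression output
    probability = sigmoid(linear_score)  # always strictly between 0 and 1
    print(x, round(linear_score, 2), round(probability, 3))
```

A score of 0 maps to a probability of 0.5, large negative scores approach 0, and large positive scores approach 1, which is why the output can be read as the probability of the category coded 1.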



In [8]:
# instantiate the model
logreg = LogisticRegression()

In [9]:
# break out the dataset into X (features) and y (target)
X = heart.drop(columns=["target"])
y = heart.target

![](https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/08/1-16.png)

In [10]:
# train test splits
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size = .25, random_state=33, stratify=y)
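
The `stratify=y` argument asks for the same class balance in the training and test pieces as in the full dataset. A small sketch of that behaviour, using a synthetic stand-in with 303 rows and a similar class balance (the `_demo` names and the placeholder feature are assumptions for illustration, not the heart data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 303 rows, 165 of them in class 1
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33, stratify=y_demo
)

# with stratify, the share of class 1 is nearly identical in every piece
print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify`, a small dataset can end up with noticeably different class shares in the two pieces purely by chance, which makes the test score harder to compare against the baseline.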

In [11]:
# fit the model
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# what is the simple accuracy score? we use the test dataset
logreg.score(X_test, y_test)

0.8289473684210527

In [13]:
# what was the baseline percentage
y.mean()

0.5445544554455446

> Did we improve our ability to gain insight from the dataset?  If so, why?
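
One way to approach this question is to compare the model against a naive baseline that always guesses the most common class. The sketch below reuses the two numbers printed in the outputs above. Note that `y.mean()` was computed on the full dataset while the accuracy comes from the test set, so treat the comparison as approximate.

```python
# values copied from the outputs above
positive_rate = 0.5445544554455446    # y.mean(): share of records with target = 1
model_accuracy = 0.8289473684210527   # logreg.score(X_test, y_test)

# always guessing the most common class is right this often
baseline_accuracy = max(positive_rate, 1 - positive_rate)

# improvement over the baseline, in percentage points
lift = (model_accuracy - baseline_accuracy) * 100
print(f"baseline: {baseline_accuracy:.3f}  model: {model_accuracy:.3f}  lift: {lift:.1f} points")
```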

In [0]:
# generate the predictions
logreg_preds = logreg.predict(X_test)

In [0]:
# confusion matrix -- what we knew to be true, and what we predicted for the test set
cmatrix = metrics.confusion_matrix(y_test, logreg_preds)
cmatrix

In [0]:
# plot the confusion matrix as a heatmap (sns.heatmap returns a matplotlib Axes)
ax = sns.heatmap(cmatrix, annot=True, cmap="Blues")
ax.set(xlabel='Predicted', ylabel='Actual')
plt.show()

In [0]:
# classification report -- support is number of records/occurrences
print(metrics.classification_report(y_test, logreg_preds))
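
Every number in the classification report can be traced back to the four cells of the confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the results for the heart dataset.

```python
# hypothetical confusion matrix, laid out like sklearn's output:
# rows are actual classes (0, 1), columns are predicted classes (0, 1)
cm = [[40, 10],
      [5, 45]]

tn, fp = cm[0]   # actual 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = cm[1]   # actual 1: predicted 0 (false negative), predicted 1 (true positive)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the records predicted as 1, how many really were 1
recall = tp / (tp + fn)      # of the records that really were 1, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

These are the class 1 values; the report's row for class 0 applies the same formulas with the roles of the two classes swapped.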

In [0]:
# calculate the estimated probabilities for each class (target class = 1 is the second column)
logreg_y_probs = logreg.predict_proba(X_test)
logreg_y_probs[0:5]

In [0]:
# from above, we have two columns, and they align to the target values, 0, then 1,
# so we want the second column
logreg_y_probs = logreg_y_probs[:, 1]

In [0]:
# calculate the AUC (area under the ROC curve)
auc_logreg = metrics.roc_auc_score(y_test, logreg_y_probs)
auc_logreg

<img src="https://monosnap.com/image/YSv8XTNdP1U4kt5mOgVnjpSPqwjL1e.png">

Source = http://gim.unmc.edu/dxtests/roc3.htm
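
A useful way to read the AUC: it is the chance that a randomly chosen positive record receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it that way by brute force on hypothetical labels and scores (the function name `auc_by_pairs` is ours, not part of scikit-learn); on the same inputs it should agree with `metrics.roc_auc_score`.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(auc_by_pairs(y_true, y_score))
```

An AUC of 0.5 means the scores rank positives no better than a coin flip, while 1.0 means every positive is ranked above every negative.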

In [0]:
# last but not least, look at the model -- it's possible to get better output with statsmodels, but I'd rather we fit some models and explore accuracy right now
print(X.columns)
logreg.coef_
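
Each entry of `logreg.coef_` lines up with one column of `X`, and exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of target = 1 for a one-unit increase in that feature, holding the others fixed. A minimal sketch, where the coefficient values are invented for illustration and are not the fitted values:

```python
import math

# hypothetical stand-ins for list(X.columns) and logreg.coef_[0]
feature_names = ["age", "sex", "cp"]
coefs = [-0.01, -1.20, 0.85]

# pair each feature with the odds ratio implied by its coefficient
odds_ratios = {name: math.exp(coef) for name, coef in zip(feature_names, coefs)}

for name, coef in zip(feature_names, coefs):
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name:>5}: coef={coef:+.2f}  odds ratio={odds_ratios[name]:.2f}  ({direction} the odds of target = 1)")
```

Because the features are on different scales, the size of a raw coefficient alone does not say which feature matters most.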



---



## Logistic Regression Exercise

Using the sample dataset found here in Colab, fit a logistic regression to predict the value of the target variable.

You should:

- Consider whether the variables are appropriate to include in the model
- Calculate the accuracy and AUC of the model
- Determine whether the model performed better than guessing (the baseline)


Hint:

> The target variable is derived directly from the median_house_value column, so including that column as a feature will inflate your results

In [0]:
# load the data
house = pd.read_csv("sample_data/california_housing_train.csv")
house['target'] = (house['median_house_value'] >  house['median_house_value'].mean()).astype("int")



---



# Decision Trees

<img src="https://monosnap.com/image/77aOgM9OjPnPRPzg0e8F7OKsAdh41m.png">

In [0]:
# instantiate the model
dtree = DecisionTreeClassifier()

In [0]:
# fit the model to the data we have already split
dtree.fit(X_train, y_train)

In [0]:
# evaluate the accuracy of the model
dtree.score(X_test, y_test)

In [0]:
# calculate predictions
dtree_preds = dtree.predict(X_test)

In [0]:
# classification report -- support is number of records/occurrences
print(metrics.classification_report(y_test, dtree_preds))

In [0]:
# calculate the probabilities for the target class of 1, the second column
dtree_y_probs = dtree.predict_proba(X_test)[:, 1]

In [0]:
# calculate the auc
auc_tree = metrics.roc_auc_score(y_test, dtree_y_probs)
auc_tree

In [0]:
# plot the performance of the two
perf = pd.DataFrame({'model':["Logistic Regression", "Decision Tree"], 
                     'auc': [auc_logreg, auc_tree],
                     'accuracy': [logreg.score(X_test, y_test), dtree.score(X_test, y_test)]})
perf.index = perf['model']
perf.plot(kind="barh")

In [0]:
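# export the fitted tree to Graphviz DOT format, then render it to a PDF
dot_data = tree.export_graphviz(dtree, 
                                out_file=None, 
                                feature_names=list(X.columns),   # a plain list works across sklearn versions
                                filled=True,
                                rounded=True,
                                class_names=['0-No','1-Disease']) 
graph = graphviz.Source(dot_data) 
graph.render("Heart Disease")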

> The rendered file, Heart Disease.pdf, is saved in our working directory, where we can download or view it.
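
If graphviz is not available in your environment, scikit-learn can also print a fitted tree as plain text. The sketch below assumes scikit-learn 0.21 or newer (where `export_text` was added) and uses synthetic data with hypothetical feature names so that it runs on its own; in this notebook you would pass `dtree` and `list(X.columns)` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-in data; in the notebook this would be X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]   # hypothetical column names

# a shallow tree keeps the printed rules short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

# plain-text rendering of the splits, no graphviz needed
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

Each indented line is one split, and the `class:` lines at the ends of the branches show the prediction made at that leaf.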



---



## Decision Tree Exercise

Using the same California Housing dataset from the Logistic Regression exercise above:

- Make sure that you are using the original dataset.  You may need to re-read the dataset so that you are working with the original copy
- Consider whether the variables are appropriate to include in the model
- Calculate the accuracy and AUC of the model
- Determine whether the model performed better than guessing (the baseline)
- Determine whether the model performed better than the logistic regression from the previous exercise

In [0]:
# load the data
house = pd.read_csv("sample_data/california_housing_train.csv")
house['target'] = (house['median_house_value'] >  house['median_house_value'].mean()).astype("int")