In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Part 1: From linear to logistic regression...

We again use the `chimera_data.csv` file from earlier lectures. We will try to predict employee exit, using only a few key columns of the data set:

In [None]:
df = pd.read_csv("chimera_data.csv")
df = df[["boss_survey","salary","exit"]]

In [None]:
df.head()

In [None]:
np.mean(df.exit)

In [None]:
sns.relplot(x="boss_survey",y="salary",hue="exit",data=df)
plt.show()

We simplify this further, by looking only at the `boss_survey` result as an independent variable:

In [None]:
sns.relplot(x="boss_survey",y="exit",data=df)
plt.gca().invert_xaxis()
plt.show()

We can, of course, run a linear regression on this data. We will use scikit-learn here.

In [None]:
X=df[["boss_survey"]]
Y=df[["exit"]]

lm = LinearRegression().fit(X, Y) # Fit a linear regression with vector Y as dependent and matrix X as independent

print("Intercept = ",lm.intercept_) # Print the resultant model intercept 
print("Model coefficients = ", lm.coef_) # Print the resultant model coefficients (in order of variables in X)
print("R^2 =",lm.score(X,Y)) # Print the resultant model R-squared

We plot the result using scikit-learn's `predict` function together with `matplotlib`:

In [None]:
Y_pred=lm.predict(X)
sns.relplot(x="boss_survey",y="exit",data=df)
plt.plot(X,Y_pred,color="red")
plt.gca().invert_xaxis()
plt.show()

The predictions are rather problematic. Why?

## Univariate logistic regression

We use scikit-learn here to do Logistic Regression. The code looks very similar to Linear Regression.

In [None]:
X=df[["boss_survey"]]
y=df["exit"]

logm = LogisticRegression().fit(X, y) # Fit a logistic regression with vector y as dependent and matrix X as independent

print("Intercept = ",logm.intercept_) # Print the resultant model intercept 
print("Model coefficients = ", logm.coef_) # Print the resultant model coefficients (in order of variables in X)
print("Accuracy =",logm.score(X,y)) # Print the mean accuracy (for classifiers, .score returns accuracy, not R-squared)

To get the predictions, we proceed as with linear regression:

In [None]:
labels_pred=logm.predict(X)
sns.histplot(labels_pred,stat='percent')
plt.show()

The prediction above is based on a default threshold of 0.5 on the predicted probability of leaving, which is an arbitrary choice. We can, instead, look at that probability directly. For this, we use `.predict_proba(X)`. Be aware that this returns both sides (the probability of not leaving and the probability of leaving) for each employee:

In [None]:
probs_pred=logm.predict_proba(X)
print(probs_pred)
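The relationship between `.predict` and `.predict_proba` can be checked on a small example. The sketch below uses synthetic stand-in data (an assumption, not the `chimera_data.csv` file), and the variable names are illustrative only. It checks that each row of `.predict_proba` sums to 1 and that, for a binary logistic regression, `.predict` agrees with cutting the second column at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): one feature and a binary outcome.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-(0.5 + 2 * x_demo[:, 0])))
y_demo = (rng.random(500) < p_true).astype(int)

model = LogisticRegression().fit(x_demo, y_demo)
probs = model.predict_proba(x_demo)   # column 0: P(class 0), column 1: P(class 1)
labels = model.predict(x_demo)        # hard 0/1 labels

# Each row holds two complementary probabilities.
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)

# The hard labels correspond to thresholding P(class 1) at 0.5.
labels_from_probs = (probs[:, 1] > 0.5).astype(int)
same_labels = np.array_equal(labels, labels_from_probs)

print(rows_sum_to_one, same_labels)
```

This is why only the second column, `[:, 1]`, is needed when working with the probability of leaving.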

We can now plot the probability of leaving against the actual choices:

In [None]:
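order = np.argsort(X["boss_survey"].values) # sort employees by survey score so the curve is drawn left to right
X_plot = X["boss_survey"].values[order]
y_plot = probs_pred[order, 1] # probability of leaving, in the same order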
sns.relplot(x="boss_survey",y="exit",data=df)
plt.plot(X_plot, y_plot,color="red")
plt.gca().invert_xaxis()
plt.show()

## Multivariate logistic regression

Of course, we can use more than one explanatory variable:

In [None]:
y=df["exit"] #creating the dependent variable
X=df.drop(columns=["exit"]) #dropping the dependent variable to get a matrix of independent features

Let's run the logit model again:

In [None]:
logm = LogisticRegression().fit(X, y) # Fit a logistic regression with vector y as dependent and matrix X as independent

print("Intercept = ",logm.intercept_) # Print the resultant model intercept 
print("Model coefficients = ", logm.coef_) # Print the resultant model coefficients (in order of variables in X)
print("Accuracy =",logm.score(X,y)) # Print the mean accuracy (for classifiers, .score returns accuracy, not R-squared)
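To see how the intercept and coefficients turn into probabilities, the following sketch recomputes `.predict_proba` by hand. It uses synthetic data with two made-up features (an assumption; the names `X_demo` and `y_demo` are illustrative) and applies the logistic function to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): two features and a binary outcome.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 2))
true_logit = -1.0 + 1.5 * X_demo[:, 0] - 0.8 * X_demo[:, 1]
y_demo = (rng.random(400) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Linear score: intercept + coefficient_1 * feature_1 + coefficient_2 * feature_2
z = model.intercept_[0] + X_demo @ model.coef_[0]

# The logistic function maps the linear score to a probability in (0, 1).
p_manual = 1 / (1 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

A positive coefficient therefore raises the probability of the outcome "1" as the feature increases, and a negative one lowers it.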

# Part 2: Setting the threshold

So far, we have evaluated the model on the same data on which we trained it. Of course, this is not ideal. Here, we will use a training set to fit the model and a validation set to find the best threshold, then use this threshold to check how good our model is on the test data:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
Y = df[["exit"]] #creating the dependent variable (as a one-column dataframe, matching the `Y` used in the split below)
X = df.drop(columns=["exit"]) #dropping the dependent variable to get a matrix of independent features

We split the dataset into train (50%), test (25%) and validation (25%).

In [None]:
trainX, otherX, trainY, otherY = train_test_split(X, Y, test_size=0.5,random_state = 726)

In [None]:
trainX

In [None]:
trainY

In [None]:
otherX

In [None]:
otherY

In [None]:
testX, validationX, testY, validationY = train_test_split(otherX, otherY, test_size=0.5,random_state = 1592)

In [None]:
print(X.shape)
print(Y.shape)
print(trainX.shape)
print(trainY.shape)
print(validationX.shape)
print(validationY.shape)
print(testX.shape)
print(testY.shape)

We fit the model to the dataset using scikit-learn, **only on the training data**:

Note: `.values.ravel()` will turn the one-column dataframe `trainY` into a 1-dimensional array and avoid warnings. It's not strictly necessary with the current version of scikit-learn, but it may avoid issues in future versions.

In [None]:
logm = LogisticRegression()
logm.fit(trainX, trainY.values.ravel()) # Fit a logistic regression with vector Y as dependent and matrix X as independent
print(logm.intercept_)
print(logm.coef_)

As before, we get prediction probabilities (this time on the validation dataset). However, we only care about one side (the probability of having "1", that is, of leaving).

In [None]:
logm.predict_proba(validationX)

In [None]:
Y_probs=logm.predict_proba(validationX)[:,1]

We can now display the ROC curve. For this, we use `roc_curve` from `sklearn.metrics` (the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)). The function returns three arrays, all indexed in the same way. The first array contains the false positive rates, the second the true positive rates, and the third the corresponding thresholds.

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(validationY,Y_probs)
plt.plot(fpr, tpr, linewidth=4) # plt is matplotlib.pyplot, imported at the top
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
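Each point on the ROC curve can be reproduced by hand: for a given threshold, every observation with a score at or above the threshold is predicted positive, and the two rates follow. The sketch below checks this against `roc_curve` on synthetic labels and scores (an assumption; `rates_at` is a hypothetical helper written for this illustration).

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-in labels and scores (assumption): positives tend to score higher.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)
scores = 0.3 * y_true + 0.7 * rng.random(200)

fpr, tpr, thresholds = roc_curve(y_true, scores)

def rates_at(threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    predicted_positive = scores >= threshold
    fpr_t = predicted_positive[y_true == 0].mean()  # share of actual negatives flagged
    tpr_t = predicted_positive[y_true == 1].mean()  # share of actual positives flagged
    return fpr_t, tpr_t

manual = np.array([rates_at(t) for t in thresholds])
matches = np.allclose(manual[:, 0], fpr) and np.allclose(manual[:, 1], tpr)
print(matches)
```

Lowering the threshold flags more observations as positive, so both rates can only increase as we move along the curve from the bottom-left to the top-right.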

We can stake out the different points on the graph manually:

In [None]:
print("The threshold at index 10 is " + str(thresholds[10]))
print("The false positive rate at this threshold is " + str(fpr[10]))
print("The true positive rate at this threshold is " + str(tpr[10])) 

In [None]:
print("The threshold at index 200 is " + str(thresholds[200]))
print("The false positive rate at this threshold is " + str(fpr[200]))
print("The true positive rate at this threshold is " + str(tpr[200])) 

In [None]:
print("The threshold at index 800 is " + str(thresholds[800]))
print("The false positive rate at this threshold is " + str(fpr[800]))
print("The true positive rate at this threshold is " + str(tpr[800])) 

The AUC summarizes the quality of our model by measuring (roughly) how close we can get to a perfect model. In particular, it gives an idea of how far to the top-left we can get in our ROC:

In [None]:
roc_auc_score(validationY,Y_probs)
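The AUC is literally the area under the ROC curve. As a minimal sketch on synthetic labels and scores (an assumption, not the lecture data), the area can be computed from the ROC points with the trapezoidal rule and compared to `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in labels and scores (assumption): positives tend to score higher.
rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=300)
scores = 0.3 * y_true + 0.7 * rng.random(300)

fpr, tpr, _ = roc_curve(y_true, scores)

# Trapezoidal rule: sum of (width of each FPR step) * (average TPR over that step).
area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
auc_sklearn = roc_auc_score(y_true, scores)

print(area, auc_sklearn)
```

A model that ranks at random gives a curve along the diagonal and an AUC of about 0.5, while a model that separates the classes perfectly reaches the top-left corner and an AUC of 1.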

Now, we need to choose a threshold (note that the default used by scikit-learn when making predictions is 0.5). A natural threshold to choose is 0.1355 (why?)

In [None]:
chosen_threshold = np.min(thresholds[thresholds > 0.1355])
print(chosen_threshold)
threshold_idx = np.where(thresholds == chosen_threshold)[0][0]
print(threshold_idx)

The FPR and TPR at this threshold are:

In [None]:
print("At threshold  " + str(thresholds[threshold_idx]))
print("the false positive rate is " + str(fpr[threshold_idx]))
print("and the true positive rate is " + str(tpr[threshold_idx]))

With a threshold chosen, we can now make predictions (on the validation set) and display the confusion matrix:

In [None]:
Y_pred = np.where(Y_probs >= chosen_threshold, 1, 0) # roc_curve counts scores >= threshold as positive, so we use >= to match its rates
Y_pred

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(validationY,Y_pred)
print(cm)

In [None]:
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

We can now verify the false and true positive rates:

In [None]:
1-TN/(FP+TN)

In [None]:
TP/(TP+FN)
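The layout of the confusion matrix is easy to mix up, so here is a tiny hand-made example (the labels are invented for illustration). In scikit-learn, rows are the actual classes and columns are the predicted classes, so `.ravel()` on a binary matrix yields TN, FP, FN, TP in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (assumption): 6 actual negatives followed by 4 actual positives.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()   # row-major order: [[TN, FP], [FN, TP]]

fpr_demo = fp / (fp + tn)   # share of actual negatives predicted positive
tpr_demo = tp / (tp + fn)   # share of actual positives predicted positive

print(cm_demo)
print(fpr_demo, tpr_demo)
```

Here two of the six negatives are flagged (FPR of one third) and three of the four positives are caught (TPR of 0.75).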

We will discuss other methods for choosing thresholds in the tutorial.

# Part 3: Retraining the final model with training+validation, then testing it (time permitting)

In [None]:
trainX_final=pd.concat([trainX, validationX])
trainY_final=pd.concat([trainY, validationY])

We now train our model on `trainX_final` and `trainY_final` with threshold `0.1355` using scikit-learn:

In [None]:
logm = LogisticRegression().fit(trainX_final, trainY_final.values.ravel())

In [None]:
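Y_test_probs=logm.predict_proba(testX)[:,1] #probabilities on the held-out test data (not the validation data, which is now part of the training set)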
threshold = 0.1355
Y_test_pred=np.where(Y_test_probs > threshold, 1, 0) #predict the classes for test data based on the threshold found via the validation data

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(testY,Y_test_pred))
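The whole procedure of this part can be summarized in one self-contained sketch. It runs on synthetic data (an assumption), and `demo_threshold` is a hypothetical placeholder for a threshold that would, in practice, be chosen on the validation set beforehand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two features and a binary outcome.
rng = np.random.default_rng(4)
X_all = rng.normal(size=(1000, 2))
logit = -1.5 + 1.2 * X_all[:, 0] - 0.7 * X_all[:, 1]
y_all = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

# 50% train, then the remaining half split evenly into test and validation.
X_tr, X_other, y_tr, y_other = train_test_split(X_all, y_all, test_size=0.5, random_state=0)
X_te, X_val, y_te, y_val = train_test_split(X_other, y_other, test_size=0.5, random_state=1)

# Hypothetical threshold, assumed to have been picked using the validation set.
demo_threshold = 0.2

# Refit on training + validation data, then evaluate once on the untouched test data.
final_model = LogisticRegression().fit(np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
test_probs = final_model.predict_proba(X_te)[:, 1]
test_pred = (test_probs > demo_threshold).astype(int)

cm_test = confusion_matrix(y_te, test_pred, labels=[0, 1])
print(cm_test)
```

The test data is used only in the last step, so the resulting confusion matrix gives an honest picture of how the model and threshold perform on unseen observations.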

# Part 4: Using CART decision trees for classification (time permitting - we will look at this in the tutorial)

There are other ways to perform classification, such as decision trees. While we won't talk a lot about the underlying theory (you will cover this in your machine learning class), they can be quite a powerful tool for classification (and, actually, also for regression).

Compared to logistic regression, decision trees don't require any structural assumptions (remember, under logistic regression, we assume that the log-odds are a linear function of the features). However, greater flexibility comes at a cost: there are a bunch more options to choose from when using them. For our purposes, we will concentrate on `max_leaf_nodes`, which is often the option with the biggest impact.

In [None]:
from sklearn.tree import DecisionTreeClassifier
import graphviz as gp
from sklearn.tree import export_graphviz

The usage in Python is quite intuitive - we only need to replace `LogisticRegression` with `DecisionTreeClassifier` (and define the `max_leaf_nodes`):

In [None]:
classifier_DT = DecisionTreeClassifier(max_leaf_nodes = 4)
classifier_DT.fit(trainX, trainY.values.ravel())

One of the major advantages of decision trees is that they are quite intuitive. Let's take a look:

In [None]:
dot_data = export_graphviz(classifier_DT, feature_names = trainX.columns, filled = True, rounded = True, class_names=["No exit","Exit"])
graph = gp.Source(dot_data)
graph
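If `graphviz` is not installed, a plain-text view of the tree is a possible alternative. The sketch below uses `export_text` from `sklearn.tree` on synthetic data (an assumption; the feature names `feature_a` and `feature_b` are made up for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): the outcome depends on two thresholds.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 2))
y_demo = ((X_demo[:, 0] > 0.2) & (X_demo[:, 1] < 0.5)).astype(int)

tree_demo = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X_demo, y_demo)

# Each indented line is a split rule; leaves show the predicted class.
tree_text = export_text(tree_demo, feature_names=["feature_a", "feature_b"])
print(tree_text)
```

Reading the output from top to bottom follows the same sequence of yes/no questions as the graphical version.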

The usage is exactly as with the logistic regression. For example, we can get the probability of exit on the validation set:

In [None]:
Y_probs=classifier_DT.predict_proba(validationX)[:,1]

We can also plot the ROC curve:

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(validationY,Y_probs)
plt.plot(fpr, tpr, linewidth=4) # plt is matplotlib.pyplot, imported at the top
plt.show()

Finally, we can find the AUC to measure the quality of the model. We could, for example, vary `max_leaf_nodes` to get a higher AUC:

In [None]:
roc_auc_score(validationY,Y_probs)
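Varying `max_leaf_nodes` can be done with a simple loop. The following sketch runs on synthetic data (an assumption), and the candidate values are arbitrary. It fits one tree per setting on the training part and records the AUC on the validation part.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): two features and a binary outcome.
rng = np.random.default_rng(6)
X_all = rng.normal(size=(800, 2))
logit = -1.0 + 1.5 * X_all[:, 0] - 1.0 * X_all[:, 1]
y_all = (rng.random(800) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, test_size=0.5, random_state=0)

auc_by_leaves = {}
for n_leaves in [2, 4, 8, 16, 32]:   # arbitrary candidate values
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_tr, y_tr)
    val_probs = tree.predict_proba(X_val)[:, 1]
    auc_by_leaves[n_leaves] = roc_auc_score(y_val, val_probs)

for n_leaves, auc_value in auc_by_leaves.items():
    print(n_leaves, round(auc_value, 3))
```

More leaves do not always help: a very large tree can fit the training data closely and still score worse on the validation set.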

There are even more advanced classifiers, such as
```
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
```
They often bring better predictive power at the cost of more parameters to set. But the usage is pretty much the same as for logistic regression and CART, so feel free to try them out.
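As a minimal sketch of that shared interface, the code below fits both classifiers on synthetic data (an assumption) and computes a validation AUC for each. The settings `n_estimators=50` and `random_state=0` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two features and a binary outcome.
rng = np.random.default_rng(7)
X_all = rng.normal(size=(800, 2))
logit = -1.0 + 1.5 * X_all[:, 0] - 1.0 * X_all[:, 1]
y_all = (rng.random(800) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, test_size=0.5, random_state=0)

classifiers = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

auc_by_model = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)                          # same .fit call as before
    val_probs = clf.predict_proba(X_val)[:, 1]   # same .predict_proba call as before
    auc_by_model[name] = roc_auc_score(y_val, val_probs)

print(auc_by_model)
```

Because the interface is identical, the ROC curve, threshold selection and confusion matrix steps from Part 2 carry over unchanged.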

# Exercise: FP and FN

Give everyday life examples where it would be preferable to (i) have false positives rather than false negatives, or (ii) have false negatives rather than false positives.
