# Logistic Regression with one predictor

## Setup

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score

### Import data

In [None]:
# import the data and name the DataFrame df
LINK = 'https://raw.githubusercontent.com/kirenz/datasets/master/resume.csv'



In [None]:
# only select variables received_callback and honors 


### Data structure

In [None]:
# show data head

In [None]:
# show info

In [None]:
# show value counts of received callback


In [None]:
# show value counts of honors


### Variable lists

In [None]:
# define outcome variable as y_label

# create feature data

# create response


### Data split

Split the data into training and test sets with a test size of 30%. Use random state 1.
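
As a rough sketch of what this split could look like, the example below uses `train_test_split` on a small synthetic DataFrame. The synthetic data is only a stand-in for the resume data, and the assumption that both columns are coded 0/1 is ours, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the resume data (assumption: both columns are coded 0/1)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "honors": rng.integers(0, 2, size=200),
    "received_callback": rng.integers(0, 2, size=200),
})

# Outcome variable, feature data and response
y_label = "received_callback"
X = df.drop(columns=[y_label])
y = df[y_label]

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape, X_test.shape)
```

With 200 rows this gives 140 training rows and 60 test rows.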

## Model

### Select model

[LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) is an estimator with built-in cross-validation that automatically selects the best hyper-parameters (with the default settings, the regularization strength `C`).

- Use 5 folds and random state 0
- Call the new object `clf`
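
The following is a minimal sketch of such an estimator, fitted on synthetic data so that it runs on its own. The data-generating probabilities (0.2 and 0.5) are made up for illustration and are not estimates from the resume data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data (assumption): one binary predictor, binary outcome
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 1))
p = np.where(X[:, 0] == 1, 0.5, 0.2)  # outcome is more likely when x = 1
y = rng.binomial(1, p)

# 5-fold cross-validation over the default grid of 10 values for C
clf = LogisticRegressionCV(cv=5, random_state=0)
clf.fit(X, y)

print(clf.C_)          # regularization strength chosen by cross-validation
print(clf.intercept_)  # intercept on the log-odds scale
print(clf.coef_)       # slope on the log-odds scale
```

Because the model is refit on all of the data passed to `fit` once `C` has been chosen, no separate refitting step is needed.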

### Training

In [None]:
# train the model

In [None]:
# use the method .score to obtain accuracy for the training data


### Coefficients

In [None]:
# intercept

In [None]:
# slope

### Evaluation on test set

In [None]:
# Return the accuracy on the given test data and labels:


### Confusion matrix

In [None]:
# get confusion matrix for test data with ConfusionMatrixDisplay.from_estimator


### Classification report

In [None]:
# make predictions for test data. Call the object y_pred


In [None]:
# use classification report
# print(___(y_test, y_pred, target_names=['No', 'Yes'], zero_division=0))

*Note that recall is also sometimes called sensitivity or true positive rate.*

In binary classification, we are mainly interested in the results for the category we want to predict. In our case, these are the results for the label 'Yes'.

More measures:

- ``macro``: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

- ``weighted``: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance.
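
To make the difference between the two averages concrete, here is a small sketch that computes both by hand for precision. The labels are made up for illustration (8 true "No" and 2 true "Yes" cases).

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up imbalanced example: 8 true "No" (0) and 2 true "Yes" (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Precision for each label separately: 6/7 for label 0, 1/3 for label 1
per_label = precision_score(y_true, y_pred, average=None)

# Support = number of true instances of each label: [8, 2]
support = np.bincount(y_true)

macro = per_label.mean()                            # unweighted mean
weighted = np.average(per_label, weights=support)   # mean weighted by support

print(per_label, support)
print(round(macro, 3), round(weighted, 3))  # 0.595 0.752
```

The weighted average is pulled toward the precision of the majority label, which is why it is higher than the macro average here.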

Interpretation:

* High scores for both *precision* and *recall* show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

* The importance of precision vs recall depends on the use case at hand (and the costs associated with misclassification).

* A system with *high recall* but *low precision* returns many results, but most of its predicted labels are incorrect when compared to the true labels.

* A system with *high precision* but *low recall* is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the true labels.
  
* An ideal system with high precision and high recall will return many results, with most results labeled correctly. 
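
Precision and recall for the positive label can be read directly off the confusion matrix. The sketch below uses a few made-up labels to show the arithmetic.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = "No", 1 = "Yes"
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted "Yes" that are truly "Yes"
recall = tp / (tp + fn)     # share of true "Yes" that were found

print(tn, fp, fn, tp)       # 4 2 1 3
print(precision, recall)    # 0.6 0.75
```

Here the classifier finds three of the four true "Yes" cases (recall 0.75), but two of its five "Yes" predictions are wrong (precision 0.6).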

### ROC Curve

In [None]:
# use RocCurveDisplay.from_estimator for test data


### AUC Score

In [None]:
# calculate auc score with roc_auc_score
# ___(y_test, clf.decision_function(X_test))

Option 2 to obtain AUC:

In [None]:
y_score = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_score)
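
Both options should give the same AUC, because the predicted probability is a strictly increasing function of the decision function (the log-odds), and AUC depends only on the ranking of the scores. A self-contained sketch on synthetic data (the data and its probabilities are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one binary predictor, binary outcome
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 1))
y = rng.binomial(1, np.where(X[:, 0] == 1, 0.5, 0.2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = LogisticRegressionCV(cv=5, random_state=0).fit(X_train, y_train)

# Option 1: scores on the log-odds scale
auc_decision = roc_auc_score(y_test, clf.decision_function(X_test))

# Option 2: probability of the positive class
auc_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(auc_decision, auc_proba)
```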

### Change threshold

Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the person will receive a callback is 0.8) or convert it to a binary value (for example, this person will receive a callback, so we label them as "Yes").

A logistic regression model that returns 0.9 for a particular person is predicting that it is very likely that the person will receive a callback. In order to map a logistic regression value to a binary category (e.g., "Yes" or "No"), you must define a **classification threshold** (also called the decision threshold).

- A value above that threshold indicates "Yes", the person will get a callback
- A value below that threshold indicates "No", the person will not receive a callback
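
As a small illustration of this mapping, the sketch below applies two thresholds to a handful of made-up probabilities (the values are hypothetical, not model output).

```python
import numpy as np

# Hypothetical predicted probabilities of receiving a callback
proba_yes = np.array([0.05, 0.20, 0.30, 0.60, 0.90])

# A value above the threshold is labeled "Yes", otherwise "No"
labels_25 = np.where(proba_yes > 0.25, "Yes", "No")
labels_50 = np.where(proba_yes > 0.50, "Yes", "No")

print(labels_25)  # ['No' 'No' 'Yes' 'Yes' 'Yes']
print(labels_50)  # ['No' 'No' 'No' 'Yes' 'Yes']
```

Lowering the threshold from 0.5 to 0.25 turns one more person into a predicted "Yes".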

Note that the optimal classification threshold is problem-dependent and is therefore a value that you must tune (see [Google developers](https://developers.google.com/machine-learning/crash-course/classification/thresholding)).
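
One simple way to see why the threshold needs tuning is to compute precision and recall for several candidate values. The sketch below does this for made-up labels and probabilities. It illustrates the trade-off only and is not a recommended tuning procedure.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba_yes = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90])

results = {}
for threshold in (0.25, 0.50, 0.75):
    # Convert probabilities to 0/1 predictions at this threshold
    y_pred = (proba_yes > threshold).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this example, raising the threshold increases precision (0.50, 0.67, 1.00) while recall drops (1.00, 0.67, 0.33), so the right choice depends on which kind of error is more costly.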

In our example, we just demonstrate the process for one threshold. 

Use a specific threshold:

In [None]:
# obtain probabilities with predict_proba
# pred_proba = ___.___(X_test)

In [None]:
# set threshold to 0.25
# df_25 = pd.DataFrame({'y_pred': ___[:, 1] > .25})

In [None]:
# ConfusionMatrixDisplay.from_predictions(y_test, ___['y_pred']);

### Classification report

In [None]:
# print(classification_report(y_test, ___['y_pred'], target_names=['No', 'Yes'], zero_division=0))