# <font color='#31394d'> Practical Exercise: Logistic Regression </font>

In this module, we'll be exploring how to build a logistic regression model using scikit-learn. Remember, logistic regression is a binary classification algorithm, so instead of predicting a number (regression) or one of several labels (multi-class classification), we'll be predicting either true (1) or false (0). Let's start by importing our usual toolkit.

In [None]:
import pandas as pd
import numpy as np

## <font color='#31394d'> Breast Cancer Data </font>

We are going to use the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It is a dataset that contains measurements taken on breast cancer cell images. The goal of the dataset is to predict whether a cancer tumor is benign or malignant.


![title](media/breast_cancer.png)

In [None]:
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()
cancer_data.keys()

In [None]:
print(cancer_data["DESCR"])

This dataset is a binary classification problem, where the target variable indicates whether a tumor is malignant or benign, encoded as 0 and 1, respectively. The features are measurements taken from the cell images, mostly measures of the cell nuclei.

In [None]:
cancer_data.target_names

In [None]:
pd.Series(cancer_data["target"]).value_counts(normalize=True).sort_index()

In this dataset, 62.7% of tumors are benign and 37.3% are malignant.

There are lots of features in this dataset. Let's limit ourselves to just the "mean" features:

In [None]:
cancer_data.feature_names

In [None]:
df = pd.DataFrame(cancer_data["data"], columns=cancer_data.feature_names)

df = df[df.columns[df.columns.str.startswith('mean')]].copy()  # copy so that adding a column below does not trigger a SettingWithCopyWarning

df['target'] = cancer_data.target
print('The dataset has', df.shape[0], 'rows and', df.shape[1] - 1, 'features (plus the target column)')
df.head()

It's always a good idea to check that our variables are of the expected data types. Any non-numerical variables would need to be converted to numeric variables (e.g. via one-hot encoding) before we can fit a model.

In [None]:
df.dtypes
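
All of our features are already numeric, so no encoding is needed here. For reference, here is a minimal sketch of one-hot encoding on a made-up table. The `toy` DataFrame and its `grade` column are hypothetical and are not part of the cancer dataset.

```python
import pandas as pd

# Hypothetical toy data: 'grade' is a made-up categorical column
toy = pd.DataFrame({"size": [1.2, 3.4, 2.2], "grade": ["low", "high", "low"]})

# get_dummies replaces each listed categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["grade"], dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the new `grade_*` columns, which is the form a model in `sklearn` can work with.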

Let's split our dataset into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df.target, test_size=0.2, random_state=12345)


print('The training set has', X_train.shape[0], 'rows')
print('The test set has', X_test.shape[0], 'rows')

Shortly, we will be fitting our first logistic regression model. From an optimisation point of view, this involves finding the $\beta$ coefficients that minimise the cross-entropy loss function. The internal procedures that perform this operation work best if the features are *standardized*; that is, for each feature, we subtract its mean and divide by its standard deviation. This ensures that all features have the same scale, but doesn't change the nature of the relationship between the features and the target variable.

We can do this in `sklearn` by instantiating a `StandardScaler` object from the `preprocessing` submodule. We use the `fit_transform` method to compute the means and standard deviations for the training data and then apply the transformation to the training data. We then use the `transform` method to standardize the test data using the means and standard deviations estimated from the training set. 

🙋‍♀️ <font color='#eb3483'> Question: </font> Why do we not apply the `fit_transform` method on the test set?

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

In [None]:
pd.DataFrame(X_train, columns=df.columns[:-1]) # just so you can see what the transformed data looks like
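
To make the standardization step concrete, here is a small sketch that reproduces `transform` by hand. The arrays `toy_train` and `toy_test` are made-up stand-ins for a training and a test set; they are not taken from the cancer data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays standing in for a training set and a test set
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])
toy_test = np.array([[2.0, 300.0]])

toy_scaler = StandardScaler().fit(toy_train)

# Manual version: subtract the training mean and divide by the training standard deviation
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual_test = (toy_test - train_mean) / train_std

matches = np.allclose(toy_scaler.transform(toy_test), manual_test)
print(matches)
```

Note that the test row is scaled with the mean and standard deviation of the training rows only.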

## <font color='#31394d'> Logistic Regression in Scikit-Learn </font>

In this section, we will fit a logistic regression model and use it to make predictions. Take note of how similar the process is to linear regression. This is the case for all estimators in `sklearn` and is the primary reason for the module's popularity. Even though the models are very different, the training and prediction steps look almost identical in `sklearn`!



In [None]:
from sklearn.linear_model import LogisticRegression
#?LogisticRegression

<font color='#eb3483'> A side note... </font>

Note (from the help file) that `LogisticRegression` has a `penalty` argument that is set to `l2` by default. If you completed the bonus material on linear regression, you will know a bit about *regularization* - a technique that introduces some bias (a simpler model) in an attempt to reduce sampling variability (the bias-variance trade-off!). In linear regression, an `l2` penalty leads to ridge regression, while the `l1` penalty corresponds to the lasso. These penalties can also be applied to the loss function in logistic regression - the only difference is that the loss function is now cross-entropy rather than the residual sum of squares. In the `LogisticRegression` class, the strength of the penalty is controlled by the additional argument `C`, where smaller values correspond to a **stronger** penalty (it's the inverse of the $\lambda$ parameter that we discussed in the regularization notebook). The optimal value of `C` can be chosen by cross-validation. Importantly, if you are going to use a regularized model, then you **MUST** always standardize your features.
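
As a rough illustration of how `C` behaves, the sketch below fits the default `l2`-penalized model for three values of `C` and prints the size of the coefficient vector. The values of `C`, the use of all 30 features, and the `max_iter` setting are choices made only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

demo_data = load_breast_cancer()
X_demo = StandardScaler().fit_transform(demo_data.data)  # standardize before regularizing
y_demo = demo_data.target

coef_norms = {}
for C_value in [0.01, 1.0, 100.0]:  # illustrative values only
    model = LogisticRegression(C=C_value, max_iter=5000).fit(X_demo, y_demo)
    coef_norms[C_value] = np.linalg.norm(model.coef_)
    print(f"C={C_value}: coefficient norm = {coef_norms[C_value]:.2f}")
```

Smaller values of `C` should give a smaller coefficient norm, since the penalty pulls the coefficients towards zero.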

We are going to ignore the penalty for the rest of this notebook and fit a standard logistic regression model...

In [None]:
clf = LogisticRegression(penalty=None) # instantiate the model (scikit-learn >= 1.2; older versions expect penalty='none')

In [None]:
clf.fit(X=X_train, y=y_train) 
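
Fitting the model means minimising the cross-entropy loss mentioned earlier. As a sketch of what that loss measures, the snippet below computes it by hand on made-up labels and probabilities and compares the result with `sklearn.metrics.log_loss`. The arrays `toy_y` and `toy_p` are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities of class 1
toy_y = np.array([1, 0, 1, 1, 0])
toy_p = np.array([0.9, 0.2, 0.6, 0.8, 0.4])

# Cross-entropy: the average of -[y*log(p) + (1-y)*log(1-p)] over the observations
manual_loss = -np.mean(toy_y * np.log(toy_p) + (1 - toy_y) * np.log(1 - toy_p))

print(manual_loss, log_loss(toy_y, toy_p))
```

Confident predictions that turn out to be wrong contribute the largest terms to this average.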

## <font color='#31394d'> Interpretation </font>

We can examine the estimated coefficients in the usual way. Note that because we standardized our features, the absolute values of the coefficients give us a measure of feature importance:

In [None]:
clf.coef_

In [None]:
df.columns[:-1]

In [None]:
pd.Series(clf.coef_[0], index=df.columns[:-1]).sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x=df.columns[:-1], y=np.abs(clf.coef_[0]))
plt.xticks(rotation=90);


🚀 <font color='#eb3483'> Exercise: </font> Area and radius look like the most important features for classifying cancer tumors. How would you explain their coefficients to a medical practitioner who knows nothing about machine learning?

In [None]:
np.exp(clf.coef_[0])
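
The exponentiated coefficients are odds ratios. The sketch below checks this numerically on a made-up one-feature dataset: increasing the feature by one unit multiplies the odds of class 1 by `exp(coefficient)`. The data, the model `toy_model`, and the helper `odds_of_class_1` are all hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset, generated only for illustration
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 1))
toy_y = (toy_x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_model = LogisticRegression().fit(toy_x, toy_y)

def odds_of_class_1(x_value):
    """Return p / (1 - p), where p is the predicted probability of class 1 at x_value."""
    p = toy_model.predict_proba([[x_value]])[0, 1]
    return p / (1 - p)

# Ratio of the odds at x = 1 to the odds at x = 0
odds_ratio = odds_of_class_1(1.0) / odds_of_class_1(0.0)
print(odds_ratio, np.exp(toy_model.coef_[0, 0]))
```

In our notebook the features are standardized, so "one unit" means one standard deviation of the original feature.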

## <font color='#31394d'> Predictions </font>

Let's now make some predictions on the test data:

In [None]:
y_pred = clf.predict(X_test)
y_pred[:20]

We see that the `predict` method directly outputs the predicted class (0 or 1), assuming a classification threshold of 0.5. If we want to see the underlying probabilities, we can use the `predict_proba` method:

In [None]:
y_prob = clf.predict_proba(X_test)
y_prob[:20,:]
# why are there two columns of results here?
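
To see where these probabilities come from, here is a sketch on synthetic data (generated with `make_classification`, so not the cancer data). It applies the logistic function to the linear score returned by `decision_function` and compares the result with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, used only for illustration
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)
toy_model = LogisticRegression().fit(toy_X, toy_y)

linear_score = toy_model.decision_function(toy_X)  # intercept + features times coefficients
sigmoid = 1 / (1 + np.exp(-linear_score))          # logistic function of the linear score
toy_prob = toy_model.predict_proba(toy_X)

print(toy_model.classes_)                      # column order used by predict_proba
print(np.allclose(toy_prob[:, 1], sigmoid))    # second column is the probability of class 1
print(np.allclose(toy_prob.sum(axis=1), 1.0))  # the two columns in each row sum to 1
```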

🚀 <font color='#eb3483'> Exercise: </font> Using the estimated probabilities, confirm that the `predict` method does indeed use a 0.5 classification threshold.

In [None]:
# your code goes here

## <font color='#31394d'> Model Evaluation </font>

In binary classification, we have: 

- *Positive cases*: Cases labelled as 1 (benign cancers)
- *Negative cases*: Cases labelled as 0 (malignant cancers)

Since actual positive cases can be classified correctly as positive or incorrectly as negative, and actual negative cases can be classified incorrectly as positive or correctly as negative, we have four possible scenarios:

- *True Positives* (TP): benign cancers that are correctly classified as benign
- *False Positives* (FP): malignant cancers that are incorrectly classified as benign
- *True Negatives* (TN): malignant cancers that are correctly classified as malignant
- *False Negatives* (FN): benign cancers that are incorrectly classified as malignant

![title](media/classification_errors.png)

### <font color='#31394d'> **Confusion Matrix** </font>

We can use a confusion matrix to easily examine how a classifier has performed in each one of these categories. The `metrics` submodule in `sklearn` has a `confusion_matrix` function. NB: Read the documentation (`?confusion_matrix`) to see what the rows and columns represent.

In [None]:
from sklearn.metrics import confusion_matrix 


C = confusion_matrix(y_test, y_pred)


pd.DataFrame(C, index=['actual0','actual1'], columns=['pred0','pred1'])
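
As a small worked example, the sketch below counts the four outcomes on made-up labels. The lists `toy_true` and `toy_pred` are hypothetical; they use the same 0/1 encoding as our data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For 0/1 labels, rows are actual classes and columns are predicted classes,
# so flattening the matrix row by row gives TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)
```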



### <font color='#31394d'> Evaluation Metrics for Classification </font>

#### <font color='#eb3483'> Classification Accuracy </font>

Accuracy is a general measure of the model's performance. It simply measures the proportion of cases correctly classified.

$$\text{Accuracy}=\frac{\text{Number of correctly classified observations}}{\text{Total number of observations}}= \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$

In [None]:
(C[0,0]+C[1,1])/C.sum()

In [None]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)

#### <font color='#eb3483'> Precision </font>

Precision measures the accuracy of a model's positive predictions - when a model says a case is positive, how confident can we be that it is correct?

$$\text{Precision}=\frac{\text{Number of positive cases correctly classified}}{\text{Number of cases classified as positive}}= \frac{\text{TP}}{\text{TP}+\text{FP}}$$

<img src="media/precision_accuracy.png" style="width:30em;">

In [None]:
C[1,1]/(C[1,1]+C[0,1])

In [None]:
metrics.precision_score(y_test, y_pred)

#### <font color='#eb3483'> Recall / True Positive Rate (TPR) </font>
 
Recall gives us an idea of the model's ability to find (detect) the true positive cases.

$$\text{Recall}=\frac{\text{Number of positive cases correctly classified}}{\text{Number of positive cases}}= \frac{\text{TP}}{\text{TP}+\text{FN}}$$


![title](media/precision_recall.png)

In [None]:
C[1,1]/(C[1,1]+C[1,0])

In [None]:
metrics.recall_score(y_test, y_pred)
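
Remember that in this dataset the positive class (1) is *benign*. If we would rather report precision and recall for the malignant class, one option is the `pos_label` argument of `precision_score` and `recall_score`. The sketch below uses made-up labels (`toy_true` and `toy_pred` are hypothetical) to show the difference.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = benign, 0 = malignant, as in this dataset
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Default: class 1 (benign) is treated as the positive class
benign_precision = precision_score(toy_true, toy_pred)
benign_recall = recall_score(toy_true, toy_pred)
print(benign_precision, benign_recall)

# pos_label=0 treats class 0 (malignant) as the positive class instead
malignant_precision = precision_score(toy_true, toy_pred, pos_label=0)
malignant_recall = recall_score(toy_true, toy_pred, pos_label=0)
print(malignant_precision, malignant_recall)
```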

#### <font color='#eb3483'> ROC Curve and AUC </font>

The receiver operating characteristic [(ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve is used to evaluate how recall (TPR) and the false positive rate (FPR) vary as we change the classification threshold (that is, the threshold we apply to the probabilities to determine which class a case belongs to). We have already defined the TPR above. The FPR is defined as the proportion of actual negative cases that are incorrectly classified as positive:

$$\text{False positive rate} = \frac{\text{Number of negative cases incorrectly classified}}{\text{Total number of negative cases}} = \frac{\text{FP}}{\text{FP}+\text{TN}} $$

When the classification threshold is low, the FPR will be high since most cases (negative or positive) will be classified as positive. At the same time, TPR will be high since most of the positive cases will be classified as positive. The opposite is true when the classification threshold is high. The ROC curve summarizes this trade-off between correctly classifying the positive and negative cases as we vary the classification threshold.

Classifiers that perform well have ROC curves that "hug" the top left hand corner of the plot. Poor classifiers have ROC curves that are close to diagonal. We can therefore use the area under the curve (AUC) as a classifier evaluation metric: a value close to 1 is good and a value close to 0.5 is bad.

In [None]:
metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test)  # replaces plot_roc_curve, which was removed in scikit-learn 1.2

In [None]:
metrics.roc_auc_score(y_test, y_prob[:,1]) 
# NB: Give it the estimated *probabilities*, not the predicted classes
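
The threshold trade-off behind the ROC curve can be made concrete with a small sketch. Here we compute the TPR and FPR by hand at three thresholds, using made-up labels and probabilities (`toy_true`, `toy_prob`, and the threshold values are all hypothetical).

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
toy_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.1, 0.3, 0.45, 0.5, 0.6, 0.7, 0.8, 0.95])

rates = {}
for threshold in [0.2, 0.5, 0.9]:
    toy_pred = (toy_prob >= threshold).astype(int)   # classify as 1 at or above the threshold
    tp = np.sum((toy_pred == 1) & (toy_true == 1))
    fp = np.sum((toy_pred == 1) & (toy_true == 0))
    tpr = tp / np.sum(toy_true == 1)                 # recall
    fpr = fp / np.sum(toy_true == 0)
    rates[threshold] = (float(tpr), float(fpr))
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Each threshold gives one (FPR, TPR) point; the ROC curve traces these points across all possible thresholds.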

## <font color='#eb3483'> K-Nearest Neighbors for Classification </font>

🚀 <font color='#eb3483'> Exercise: </font> Repeat the above analysis but this time use a $K$-nearest neighbors classifier. Determine which model is best for this problem - logistic regression or a KNN classifier?

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
# your code goes here