In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Sklearn for Logistic Regression

The field of healthcare is being revolutionised by machine learning. For certain diseases, models trained on data from previous patients can predict whether a new patient has the disease. For some tasks, the performance of these diagnostic models has been reported to exceed that of human experts.


<img src="../images/assignments/doctor.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>

In this notebook you will use Scikit-Learn to perform Logistic Regression on a medical dataset.

We will use one of the datasets built into Scikit-Learn for practice purposes: [the Breast Cancer Wisconsin (Diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).

In [None]:
from sklearn import datasets
medical_data = datasets.load_breast_cancer()

A dataset from scikit-learn is a dictionary-like object that holds all the data and some metadata about the data. Let's investigate what information is available: 

In [None]:
medical_data.keys()

For instance, this dataset also contains the DESCR key, which we can use to access a description of the dataset.

In [None]:
print(medical_data['DESCR'])

We can inspect the data as a Pandas DataFrame.

In [None]:
medical_df = pd.DataFrame(medical_data.data, columns=medical_data.feature_names)
medical_df['class'] = medical_data.target
medical_df.head()

### Exercise: 

Perform some preliminary analysis on the dataset.

* How many patients is there data for?
* What datatypes does the dataset contain? Are there any missing values?
* How many predictive features are there? Will all the features be informative for the target?
* Which features do you think will be highly correlated? Investigate whether this is indeed the case.
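
One possible starting point for the questions above is sketched here. It assumes each row of the dataset corresponds to one patient sample, and it picks `mean radius` and `mean perimeter` as an illustrative pair of features that might be correlated; other pairs are worth checking too.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this cell runs on its own.
medical_data = datasets.load_breast_cancer()
medical_df = pd.DataFrame(medical_data.data, columns=medical_data.feature_names)
medical_df['class'] = medical_data.target

# Number of rows (samples) and predictive features (all columns except 'class').
n_patients, n_columns = medical_df.shape
n_features = n_columns - 1

# Datatypes present and the total count of missing values.
dtype_counts = medical_df.dtypes.value_counts()
n_missing = int(medical_df.isna().sum().sum())

# Pairwise Pearson correlation between the predictive features.
corr = medical_df.drop(columns='class').corr()
radius_perimeter_corr = corr.loc['mean radius', 'mean perimeter']

print(f"samples: {n_patients}, features: {n_features}, missing values: {n_missing}")
print(dtype_counts)
print(f"corr(mean radius, mean perimeter) = {radius_perimeter_corr:.3f}")
```

A heatmap of `corr` (for example with `sns.heatmap(corr)`) makes groups of correlated features easier to spot than the raw table.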


In [None]:
medical_df['class'].value_counts()

### Exercise:

For this dataset, the target vector indicates whether a patient's sample is malignant or benign.

- What values does the column contain?
- Considering the targets for this problem, would linear regression be a suitable modelling choice?

<img src="../images/assignments/linear.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>

## Logistic regression

Logistic regression is a common modelling choice for binary classification problems*.

<img src="../images/assignments/logistic.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>

**Outputs with more than two values can be modelled by multinomial logistic regression; when there are multiple ordinal categories, ordinal logistic regression can be used.*

### Exercise: 

What makes the logistic "s" shape more suitable for binary classification compared to the "line" in linear regression?
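
To explore this question numerically, the sketch below fits both a linear and a logistic model on a single feature. The choice of `mean radius` and the value `max_iter=1000` are assumptions made for illustration only.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression

medical_data = datasets.load_breast_cancer()

# Use one feature so the comparison matches the one-dimensional plots above.
feature_index = list(medical_data.feature_names).index('mean radius')
X_one = medical_data.data[:, [feature_index]]
y = medical_data.target

linear = LinearRegression().fit(X_one, y)
logistic = LogisticRegression(max_iter=1000).fit(X_one, y)

# Linear regression output is unbounded; logistic output is a probability.
linear_pred = linear.predict(X_one)
logistic_prob = logistic.predict_proba(X_one)[:, 1]

print(f"linear range:   [{linear_pred.min():.2f}, {linear_pred.max():.2f}]")
print(f"logistic range: [{logistic_prob.min():.2f}, {logistic_prob.max():.2f}]")
```

The linear model's outputs fall outside the interval from 0 to 1 for very small and very large radii, so they cannot be read as probabilities, whereas the logistic outputs always can.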

### Derivation of the logistic regression formula

In linear regression, we modelled the target $y$ as a linear function of the predictor variables $x_1, x_2, x_3,...$:

$$ y = w_0 + w_1 x_1 (+ w_2 x_2 + ...)$$

For logistic regression, we model $y$ as a probability value between 0 and 1. We assume a linear relationship between the predictor variables and the log-odds* (also called logit) of the event that $y=1$.

$$ \log \frac {y}{1-y} = w_0 + w_1 x_1 (+ w_2 x_2 + ...)$$

**In statistics, the logit function or log-odds is the logarithm of the odds $\frac{p}{1-p}$, where $p$ is a probability. It maps probability values in $(0,1)$ to $(-\infty,+\infty)$.*

Exponentiating both sides and solving for $y$ (algebraic manipulation only) shows that the probability $y$ is modelled by:

$$ y = S(w_0 + w_1 x_1 (+ w_2 x_2 + ...))$$

where $S$ is the sigmoid function, $S(x) = \frac{1}{1+e^{-x}} = \frac{e^{x}}{e^{x}+1}$.
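
The relationship between the logit and the sigmoid can be checked numerically. The helper names `sigmoid` and `logit` below are chosen for this sketch and are not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

# A few values of the linear term w_0 + w_1 x_1 + ...
z = np.linspace(-5, 5, 11)
p = sigmoid(z)

# Applying the logit to the sigmoid output recovers the linear term.
roundtrip_ok = np.allclose(logit(p), z)
print(p.round(3))
print("logit(sigmoid(z)) == z:", roundtrip_ok)
```

This confirms that the sigmoid is the inverse of the logit, which is the step that turns the log-odds equation into the expression for $y$.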

### Interpretability

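We achieve the logistic "s" shape by passing a linear regression model through a sigmoid function. This means logistic regression keeps much of the interpretability of linear regression: each coefficient $w_i$ is the change in the log-odds of $y=1$ for a one-unit increase in $x_i$, with the other features held fixed.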
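
As a small illustration of this interpretation, the sketch below fits a one-feature model and compares the odds at two inputs one unit apart. The feature `mean radius`, the inputs 14.0 and 15.0, and `max_iter=1000` are arbitrary choices for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

medical_data = datasets.load_breast_cancer()
feature_index = list(medical_data.feature_names).index('mean radius')
X_one = medical_data.data[:, [feature_index]]
y = medical_data.target

model = LogisticRegression(max_iter=1000).fit(X_one, y)
w1 = model.coef_[0, 0]

def odds(x):
    """Odds p / (1 - p) that y = 1 for a single feature value x."""
    p = model.predict_proba(np.array([[x]]))[0, 1]
    return p / (1.0 - p)

# Increasing the feature by one unit multiplies the odds by exp(w1).
odds_ratio = odds(15.0) / odds(14.0)
print(f"w1 = {w1:.3f}, exp(w1) = {np.exp(w1):.3f}, odds ratio = {odds_ratio:.3f}")
```

Because the log-odds are linear in the feature, the odds ratio for a one-unit step equals $e^{w_1}$ wherever the step is taken.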

## Assignment

Create the predictor variables `X` and the target variable `y`.

In [None]:
X = medical_data['data']
y = medical_data['target']

In [None]:
X

In [None]:
y

## 1. Split the data into a training set and a testing set.

We will train our model on the training set and then use the test set to evaluate the model.

Reserve 20% of the data for testing and use a random seed of 0.
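
A minimal sketch of this step, assuming scikit-learn's `train_test_split` is the intended tool (the variable names `X_train`, `X_test`, `y_train` and `y_test` are a common convention, not a requirement):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X = medical_data['data']
y = medical_data['target']

# test_size=0.2 reserves 20% of the rows; random_state=0 fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Passing `stratify=y` would additionally keep the class proportions equal in both sets; the assignment does not ask for it, so it is left out here.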

## 2. Importing the model

Look up how to import a simple [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html) model.

Import the model and then instantiate it.

## 3. Training the model

Fit the model to the train set.

The model may not converge in the default number of iterations (100). Update the `max_iter` parameter (using `.set_params`) and fit again until the model converges.
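
The sketch below shows one way this could look. The name `lr` and the value `max_iter=10000` are assumptions for illustration; a different limit may be needed, so the iteration count is compared with the limit after fitting.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'], test_size=0.2, random_state=0
)

# Instantiate with defaults, then raise the iteration limit before fitting.
lr = LogisticRegression()
lr.set_params(max_iter=10000)  # illustrative value, not a recommendation
lr.fit(X_train, y_train)

# n_iter_ holds the iterations actually used; reaching the limit suggests
# the solver stopped before converging.
iterations_used = int(lr.n_iter_[0])
stopped_early = iterations_used < lr.max_iter
print(f"iterations used: {iterations_used}, below limit: {stopped_early}")
```

Scaling the features before fitting usually reduces the number of iterations needed, which is worth keeping in mind when interpreting the coefficients later.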

## 4. Investigate the model's coefficients.

To do this you can use the code below (You'll have to swap `lr` for the name of your model)

In [None]:
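# Coefficients: one row per feature, indexed by feature name.
# `lr` is assumed to be your fitted LogisticRegression model.
coeff_df = pd.DataFrame(lr.coef_.ravel(), index=medical_data.feature_names, columns=['Coefficient'])
print(coeff_df)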

### Interpreting the model

a) What do the coefficients represent?

b) Are the coefficients comparable?

## 5. Model Predictions

Use your model to make predictions on the test set.



## 6. Evaluating Predictions

a) Print the accuracy of the model.

b) Print the classification report.

c) Plot the non-normalized and normalized confusion matrix.

In [None]:
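# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import ConfusionMatrixDisplay

# In this dataset class 0 is malignant and class 1 is benign.
labels = ['malignant', 'benign']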
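
The three evaluation steps could be sketched as follows. This version prints the confusion matrices as arrays rather than plotting them, and it assumes a model fitted as in the earlier steps with an illustrative `max_iter=10000`.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data.data, medical_data.target, test_size=0.2, random_state=0
)

lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

# a) Fraction of test samples classified correctly.
accuracy = accuracy_score(y_test, y_pred)

# b) Precision, recall and F1 per class; target_names is ['malignant', 'benign'].
report = classification_report(y_test, y_pred, target_names=medical_data.target_names)

# c) Raw counts, then each row divided by the number of true samples in that class.
cm = confusion_matrix(y_test, y_pred)
cm_normalized = confusion_matrix(y_test, y_pred, normalize='true')

print(f"accuracy: {accuracy:.3f}")
print(report)
print(cm)
print(cm_normalized.round(3))
```

To plot instead of print, recent scikit-learn versions provide `ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=medical_data.target_names, normalize='true')`; omit `normalize` for the raw counts.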



## 7. Interpreting the confusion matrix

Using the normalized confusion matrix, what are:
- The true positive rate
- The true negative rate
- The false positive rate
- The false negative rate

In the context of diagnostic models, what is the most important rate to improve?
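
The four rates can be read directly from the row-normalized confusion matrix. The sketch below assumes that "positive" means malignant, which is class 0 in this dataset; scikit-learn's own default would treat class 1 (benign) as positive, so the choice matters and is stated explicitly in the code.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data.data, medical_data.target, test_size=0.2, random_state=0
)
lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
# Assumption: class 0 (malignant) is the "positive" diagnosis.
cm_normalized = confusion_matrix(y_test, y_pred, labels=[0, 1], normalize='true')

true_positive_rate = cm_normalized[0, 0]   # malignant predicted as malignant
false_negative_rate = cm_normalized[0, 1]  # malignant predicted as benign
false_positive_rate = cm_normalized[1, 0]  # benign predicted as malignant
true_negative_rate = cm_normalized[1, 1]   # benign predicted as benign

print(f"TPR={true_positive_rate:.3f}  FNR={false_negative_rate:.3f}")
print(f"FPR={false_positive_rate:.3f}  TNR={true_negative_rate:.3f}")
```

Under this convention a false negative is a malignant sample reported as benign, which is worth bearing in mind when deciding which rate matters most.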

## Bonus

See if you can tune the hyper-parameters of your model so that the rate you identified in the previous question improves.
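
One possible approach is a cross-validated grid search that scores candidates on the chosen rate. Everything specific in the sketch below is an assumption made for illustration: recall on the malignant class (class 0) as the target rate, the grid values for `C` and `class_weight`, and the added `StandardScaler` step, which is not part of the assignment but helps the solver converge quickly.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data.data, medical_data.target, test_size=0.2, random_state=0
)

# Score each candidate by recall on class 0 (malignant).
malignant_recall = make_scorer(recall_score, pos_label=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
param_grid = {
    'logisticregression__C': [0.01, 0.1, 1, 10],
    'logisticregression__class_weight': [None, 'balanced'],
}

search = GridSearchCV(pipeline, param_grid, scoring=malignant_recall, cv=5)
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out test set.
test_recall = recall_score(y_test, search.predict(X_test), pos_label=0)
print("best parameters:", search.best_params_)
print(f"cross-validated recall: {search.best_score_:.3f}, test recall: {test_recall:.3f}")
```

Optimising a single rate can worsen the others, so it is sensible to re-check the full confusion matrix for the selected model.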