In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Sklearn for Logistic Regression

Machine Learning is revolutionising healthcare. For certain diseases, models can be trained on data from previous patients to identify whether a new patient has the disease. On some diagnostic tasks, the performance of these models has been reported to match or exceed that of human experts.


<img src="../images/assignments/doctor.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>

In this notebook you will use Scikit-Learn to perform Logistic Regression on a medical dataset.

We will use one of the datasets built into Scikit-Learn for practice purposes: [The Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).

In [2]:
from sklearn import datasets
medical_data = datasets.load_breast_cancer()

A dataset from scikit-learn is a dictionary-like object that holds all the data and some metadata about the data. Let's investigate what information is available: 

In [3]:
medical_data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

For instance, the `DESCR` key gives access to a description of the dataset.

In [4]:
print(medical_data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

We can inspect the data as a Pandas DataFrame.

In [5]:
medical_df = pd.DataFrame(medical_data.data, columns=medical_data.feature_names)
medical_df['class'] = medical_data.target
medical_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,class
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


### Exercise: 

Perform some preliminary analysis on the dataset.

* How many patients is there data for?
* What datatypes does the dataset contain? Are there any missing values?
* How many predictive features are there? Will all the features be informative for the target?
* Which features do you think will be highly correlated? Investigate whether this is indeed the case.
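
One possible way to start on these questions is sketched below. It rebuilds the DataFrame so the cell runs on its own. The choice of the three size-related columns in `size_cols` is an assumption made for illustration, not part of the exercise.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this cell is self-contained
medical_data = datasets.load_breast_cancer()
medical_df = pd.DataFrame(medical_data.data, columns=medical_data.feature_names)
medical_df['class'] = medical_data.target

# Number of rows (samples) and columns (30 features plus the class)
n_patients, n_columns = medical_df.shape
print(n_patients, n_columns)

# Datatypes present and total count of missing values
print(medical_df.dtypes.value_counts())
n_missing = int(medical_df.isna().sum().sum())
print(n_missing)

# Pairwise correlation between a few features we might expect to be related
# (illustrative choice of columns)
size_cols = ['mean radius', 'mean perimeter', 'mean area']
size_corr = medical_df[size_cols].corr()
print(size_corr)
```

`sns.heatmap(medical_df.corr())` gives a visual overview of all pairwise correlations if you prefer a plot.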


In [None]:
medical_df['class'].value_counts()

## Assignment

Create the predictor variables `X` and the target variable `y`.

In [None]:
X = medical_data['data']
y = medical_data['target']

In [None]:
X

In [None]:
y

## 1. Split the data into a training set and a testing set.

We will train our model on the training set and then use the test set to evaluate the model.

Reserve 20% of the data for testing and use a random seed of 0.
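
A minimal sketch of this step, assuming `train_test_split` from `sklearn.model_selection` is the intended tool. The names `X_train`, `X_test`, `y_train` and `y_test` are our own choice.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Reload the data so this cell is self-contained
medical_data = datasets.load_breast_cancer()
X = medical_data['data']
y = medical_data['target']

# test_size=0.2 reserves 20% for testing; random_state=0 fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```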

## 2. Importing the model

Look up how to import a simple [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html) model.

Import the model and then instantiate it.
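
For reference, the import and instantiation might look like the sketch below. The variable name `lr` is chosen to match the coefficient snippet later in this notebook.

```python
from sklearn.linear_model import LogisticRegression

# Instantiate with default settings; hyper-parameters can be changed later
lr = LogisticRegression()

# Inspect the default iteration limit
print(lr.get_params()['max_iter'])
```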

## 3. Training the model

Fit the model to the train set.

The model may not converge in the default number of iterations (100). Update the `max_iter` parameter (using `.set_params`) such that the model converges.
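
A sketch of fitting with a raised iteration limit follows. The value `10000` is an assumption, not a recommended setting: check `n_iter_` after fitting and raise the limit further if a `ConvergenceWarning` still appears.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

lr = LogisticRegression()

# Raise the iteration limit (10000 is an illustrative guess)
lr.set_params(max_iter=10000)
lr.fit(X_train, y_train)

# Number of iterations the solver actually used; if it equals max_iter,
# the solver hit the limit before converging
print(lr.n_iter_)
```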

## 4. Investigate the model's coefficients.

To do this, you can use the code below (you'll have to swap `lr` for the name of your model):

```python
#Coefficients
coeff_df = pd.DataFrame(lr.coef_.reshape(30,1),
                        medical_data.feature_names,
                        columns=['Coefficient'])
print(coeff_df)
```

### Interpreting the model

a) What do the coefficients represent?

b) Are the coefficients comparable?
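
As a hint for (b): each coefficient is the change in the log-odds of class 1 per unit increase in that feature, so its size depends on the feature's scale. One way to put the features on a common scale, sketched here under the assumption that standardisation is acceptable for this exercise, is to fit the model inside a pipeline with `StandardScaler`. The names `scaled_lr` and `scaled_coeff_df` are hypothetical.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

# Standardise each feature to zero mean and unit variance before fitting
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
scaled_coeff_df = pd.DataFrame(
    scaled_lr.named_steps['logisticregression'].coef_.reshape(30, 1),
    medical_data.feature_names,
    columns=['Coefficient'])
print(scaled_coeff_df.sort_values('Coefficient'))
```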

## 5. Model Predictions

Use your model to make predictions on the test set.
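
A minimal sketch of the prediction step. It repeats the split and fit so the cell runs on its own, and the `max_iter` value is again an assumption.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

# Hard class labels (0 or 1) for each test sample
y_pred = lr.predict(X_test)

# Class probabilities, one column per class, if you need more than labels
y_proba = lr.predict_proba(X_test)

print(y_pred[:10])
print(y_proba[:3])
```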



## 6. Evaluating Predictions

a) Print the accuracy of the model.

b) Print the classification report.
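
Both can be obtained from `sklearn.metrics`, as in the sketch below. Passing `target_names` is optional and only labels the rows of the report. The model setup is repeated so the cell is self-contained.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

lr = LogisticRegression(max_iter=10000)  # max_iter value is an assumption
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Fraction of test samples classified correctly
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

# Per-class precision, recall and F1
report = classification_report(
    y_test, y_pred, target_names=medical_data.target_names)
print(report)
```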

## 7. Interpreting the confusion matrix

Using the normalized confusion matrix, what are:
- The true positive rate
- The true negative rate
- The false positive rate
- The false negative rate

In the context of diagnostic models, what is the most important rate to improve?
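
The rates can be read off a row-normalized confusion matrix, as sketched below. Note that in this dataset label 0 is malignant and label 1 is benign. The sketch treats malignant as the "positive" class, which is an assumption about how you want to frame the diagnosis; swap the indices if you frame it the other way.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

lr = LogisticRegression(max_iter=10000)  # max_iter value is an assumption
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Rows are true labels, columns are predicted labels;
# normalize='true' makes each row sum to 1
cm = confusion_matrix(y_test, y_pred, labels=[0, 1], normalize='true')
print(cm)

# Treating malignant (label 0) as the positive class
tpr = cm[0, 0]  # malignant predicted malignant
fnr = cm[0, 1]  # malignant predicted benign
fpr = cm[1, 0]  # benign predicted malignant
tnr = cm[1, 1]  # benign predicted benign
print(tpr, tnr, fpr, fnr)
```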

## Bonus

See if you can tune the hyper-parameters of your model so that the rate you identified in the previous question improves.
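
One possible approach is a small grid search, sketched here. It assumes the rate of interest is recall on the malignant class (label 0). The grid values, the use of `StandardScaler`, and the name `malignant_recall` are illustrative choices and not the only way to do this.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

medical_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    medical_data['data'], medical_data['target'],
    test_size=0.2, random_state=0)

# Score candidates by recall on the malignant class (label 0)
malignant_recall = make_scorer(recall_score, pos_label=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Small illustrative grid: regularisation strength and class weighting
param_grid = {
    'logisticregression__C': [0.01, 0.1, 1, 10],
    'logisticregression__class_weight': [None, 'balanced'],
}

search = GridSearchCV(pipeline, param_grid, scoring=malignant_recall, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# Check the chosen model on the held-out test set
test_recall = recall_score(y_test, search.predict(X_test), pos_label=0)
print(test_recall)
```

Optimising a single rate can degrade the others, so it is worth re-checking the full confusion matrix for the selected model.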