## Healthy Blood Prediction

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.
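
The second point can be sketched with a hypothetical toy DataFrame (not the blood dataset). It illustrates that assigning a DataFrame to a second name does not create a copy, whereas `.copy()` does. The names `original`, `alias` and `safe` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
original = pd.DataFrame({"AST": [22.1, 30.5, 41.0]})

alias = original          # same object under a second name
alias["flag"] = 1         # this also adds the column to `original`

safe = original.copy()    # independent copy
safe["extra"] = 0         # `original` is not affected by this

print(list(original.columns))
print(list(safe.columns))
```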

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

This dataset was collected to help predict from a blood exam if a patient is healthy or has hepatitis C ([source](https://archive.ics.uci.edu/ml/datasets/HCV+data)). It contains the laboratory results from the blood examinations of patients and their diagnosis.

While diagnostic pathways are based on expert rules (if-then-else rules), machine learning algorithms can go beyond these methods and learn prediction rules directly from the data.

Our goal in this exercise is to implement a logistic regression model for prediction and understand some of its properties. In the second part of this exercise you will implement all methods by yourself, **without using the sklearn library**.

For reference, you can find the `sklearn` Logistic Regression documentation at [sklearn-logreg](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

## Preparing the dataset

First, load the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('data/blood.csv')
df.head()

The 'Category' column indicates the health status of the patient. 

In [None]:
df["Category"].value_counts()

#### 1. Create target variable

Create a binary variable called `healthy` that is one when the patient is healthy (category `'0=Blood Donor'`), and zero otherwise. Inspect how many samples are healthy and how many are not, using the `value_counts()` method.

Calculate the percentage of healthy patients in the dataset, and save it to the variable `perc_healthy`.
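
As a rough sketch of the idea, the example below builds a 0/1 column from a string comparison on a hypothetical toy DataFrame and uses the fact that the mean of a 0/1 column is the fraction of ones. The names `toy` and `toy_perc_healthy` are made up for this example, and the value is expressed out of 100 here, which is an assumption about what "percentage" means for the grader.

```python
import pandas as pd

# Hypothetical toy labels in the same format as the `Category` column
toy = pd.DataFrame({"Category": ["0=Blood Donor", "1=Hepatitis",
                                 "0=Blood Donor", "0=Blood Donor"]})

# The comparison gives booleans; astype(int) turns them into 1/0
toy["healthy"] = (toy["Category"] == "0=Blood Donor").astype(int)

# Count how many samples fall in each class
print(toy["healthy"].value_counts())

# The mean of a 0/1 column is the fraction of ones
toy_perc_healthy = toy["healthy"].mean() * 100
print(toy_perc_healthy)
```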

In [None]:
# Add your code below
# df["healthy"] = ...
# ...
# perc_healthy = ...


Once you have calculated `perc_healthy`, uncomment the cell below and print it out:

In [None]:
#perc_healthy

Note that the dataset is imbalanced, with a much higher proportion of healthy samples. This imbalance makes the task harder, as there is only a small number of samples from the unhealthy class to learn from.

#### Inspect the columns of the dataset with the `info()` method.

In [None]:
df.info()

#### 2. Clean the dataset 

Notice in `.info()` that the Non-Null values indicate that the dataset has missing values. Print the number of missing values for each column.


We must make a decision about how to correct them. Possible solutions are imputation, deletion of the problematic variables (columns) or deletion of problematic samples (rows). As there is a small number of samples with missing values (and for the sake of time), we will drop them. 

Create a clean DataFrame by removing the samples (rows) that have missing values, and save it to `df_clean`.

Turn the categorical variable `Sex` to a valid input representation. Use a binary representation: `True` if equal to `m`, `False` otherwise.

Finally, drop any columns that won't be used in the prediction model. Use `.info()` again to check that the new dataset is clean.

*Hint: use the `dropna()` and `drop` methods.*
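
The cleaning steps described above can be sketched on a hypothetical toy DataFrame. The data and the name `toy_clean` are made up, and which columns to drop in the real dataset is left to you; the example drops `Category` only to show the `drop` call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "Category": ["0=Blood Donor", "1=Hepatitis", "0=Blood Donor"],
    "Sex": ["m", "f", "m"],
    "AST": [22.1, np.nan, 30.5],
})

# Number of missing values in each column
print(toy.isna().sum())

# Drop rows with any missing value; copy so later edits do not touch `toy`
toy_clean = toy.dropna().copy()

# Binary representation: True if 'm', False otherwise
toy_clean["Sex"] = toy_clean["Sex"] == "m"

# Remove a column that will not be used as a model input
toy_clean = toy_clean.drop(columns=["Category"])
print(toy_clean)
```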

In [None]:
# Add your code below
# df_clean = ...
# df_clean["Sex"] = ...
# ...


Once you have calculated `df_clean`, uncomment and run the cell below:

In [None]:
# df_clean.head(3)

## Single variable model

We will start by analyzing a single variable column, which makes it easier to visualize the main principles of Logistic Regression. We choose the feature `AST` arbitrarily.

#### 3. Visualize how the target `healthy` depends on the variable `AST` with a scatter plot.

Use the `df.plot.scatter()` method and pass `ax=ax` and `c='k'` (color black).

Note that the `fig` and `ax` variables can be reused in cells below to plot over this figure.

In [None]:
fig,ax = plt.subplots()
# Add your code below
#  df.plot...


#### 4. Implement a function that creates and fits a logistic regression model given the input and output data. 

In [None]:
# Add your code below
# def logreg(X,Y):
#    ...
#    return model


#### 5. Fit a model named `model_AST` that predicts the status `healthy` from the variable `AST`. 

Note: for the test on KATE to pass, use the dataframe `df` (and not `df_clean`) to define the input features `X` and the labels `y` to train your model `model_AST`.

In [None]:
# Add your code below
# X = ...
# y = ...
# model_AST = ...


#### 6. Calculate the prediction probability for each data point. 

Using `model_AST`, calculate the prediction probabilities for all data points, saved as `y_prob`. 

Plot the result using `ax.scatter()`, which reuses the axes created above; ending the cell with `fig` then displays the updated figure. Use blue for this new plot, with the argument `color='tab:blue'`.

Note: the `tab` colors are the new standard matplotlib colors (the "Tableau Palette").

In [None]:
# Add your code below
# Y_prob = ...
# ax.scatter...
fig

#### 7. Calculate the deterministic predictions. 

Calculate the deterministic predictions in a variable `y_pred`, deciding that a sample is healthy if the output probability is larger than 50%.

Plot the predictions for each data point, again superimposed on the results above using `ax.scatter()`. Use small red markers, with the arguments `c='tab:red'` and `marker='.'`.
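
The relationship between probabilities and thresholded predictions can be sketched on hypothetical toy data. In scikit-learn, `predict_proba` returns one column per class, ordered as in `classes_`, so the probability of class 1 is the second column. The names `X_toy`, `y_toy`, `toy_model` and `hard` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D toy data: small x tends to be class 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# One column per class, ordered as in toy_model.classes_ (here 0, then 1)
proba = toy_model.predict_proba(X_toy)
p_class1 = proba[:, 1]

# Thresholding at 0.5 gives hard 0/1 decisions
hard = (p_class1 > 0.5).astype(int)
print(hard)
```

For a binary model, thresholding the class-1 probability at 0.5 should agree with `toy_model.predict(X_toy)`.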

In [None]:
# Add your code below
# y_pred = ...
# ...
fig

#### 8. Notice that your model is deciding a sample is healthy if `AST` is below a certain threshold. Calculate this threshold from the bias and weight parameters of the model. 
 
Draw the threshold as a vertical line, superimposed on the plots above using `ax.plot()`, and again in the color red. 

Note that this line represents a decision boundary: the model's decision is based on which side of it a sample lies.

*Hint: use the `model.intercept_` and `model.coef_` attributes as the bias and weight respectively, and note that at the decision boundary we have $w x + b = 0$, where $w$ is the weight and $b$ the bias.*

*Hint: to draw a vertical line at some value, `x`, use `ax.plot([x,x], [0,1])`.*
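
As a small worked example with hypothetical numbers: scikit-learn's logistic regression computes the probability as $\sigma(w x + b)$, with $b$ given by `intercept_`, so the probability is exactly 0.5 where $w x + b = 0$, that is at $x = -b/w$. The values of `w` and `b` below are made up.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical parameters; with scikit-learn they would be read from
# model.coef_[0, 0] (weight) and model.intercept_[0] (bias)
w = -0.25
b = 10.0

# The probability is 0.5 exactly where w * x + b = 0
x_thres = -b / w
print(x_thres, sigmoid(w * x_thres + b))
```

With a negative weight, values of `x` below `x_thres` get a probability above 0.5.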

In [None]:
# Add your code below
# ALS_thres = ...
# ax.plot...
fig

#### 9. Calculate and plot the ROC curve for the predictions. 

ROC curves summarise the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds.

ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Use the new axis for the plot, using `ax_roc.plot()`. Limit the plot range to valid values, between zero and one.
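
For orientation, `roc_curve` returns three arrays: the false positive rates, the true positive rates, and the thresholds at which they were computed. A minimal sketch with hypothetical labels and scores (not the blood data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each index i gives the rates obtained by predicting 1 when score >= thresholds[i]
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr)
print(tpr)
```

Plotting `tpr` against `fpr` gives the ROC curve; both arrays run from 0 to 1.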

In [None]:
from sklearn.metrics import roc_curve
fig_roc, ax_roc = plt.subplots()
# Add your code below
# ROC = ...
# ax_roc.plot...


## Analysis of multivariate logistic regression model

#### Now, we will use all the input variables.

Let's start by uncommenting the cell below and inspecting the first 3 lines of our `df_clean` DataFrame.

In [None]:
# df_clean.head(3)

#### 10. Create a logistic regression model using all input columns from the clean data frame. 

#### Optimization algorithms can perform better with normalized input variables. To normalize the input features of your model (`X`), create a new variable called `X_norm` and use `X_norm` when training your model called `full_model`.

*Hint:  Use the z-score transformation $$\frac{x - \mu}{\sigma}$$ with the  `mean()` and `std()` methods.*
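
A minimal sketch of the z-score transformation on a hypothetical toy DataFrame (the column names `a` and `b` are made up). Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `std()` uses `ddof=0`, so the two give slightly different scalings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features on very different scales
X_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                      "b": [10.0, 20.0, 30.0, 40.0]})

# Subtract each column's mean and divide by its standard deviation
X_toy_norm = (X_toy - X_toy.mean()) / X_toy.std()

print(X_toy_norm.mean())
print(X_toy_norm.std())
```

After the transformation, every column has mean 0 and standard deviation 1, and the two toy columns become identical because they differ only by a scale factor.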

In [None]:
# Add your code below
# X = ...
# X_norm = ...
# y = ...
# full_model = ...


#### 11. Calculate the prediction probabilities. 

#### Plot the histograms of the prediction probabilities, separately for healthy cases and for unhealthy cases. Do you think the model is any good?

Set the numbers of bins to 30 (`bins=30`), use the `step` histogram type (`histtype='step'`), and set the density mode to on (`density=True`).

In [None]:
# Add your code below
# Y_prob2 = ...
# plt.hist...
# ...


#### The ROC curve indicates the false positive and false negative error rates (the latter as one minus the true positive rate) for different choices of decision threshold, varying from zero to one.

#### 12. Calculate the ROC for the full model.

#### Superimpose the result over the ROC for the `AST` model, reusing the `ax_roc` axis above. 

####  Which model has better predictions?

In [None]:
# Add your code below
# ROC2 = ...
# ax_roc.plot...
fig_roc

#### 13. Calculate the confusion matrix for the predictions using a 50% threshold. What's the error probability for healthy cases (false negatives) and unhealthy cases (false positives)?

You can review the elements of a confusion matrix [here](https://www.nbshare.io/notebook/626706996/Learn-And-Code-Confusion-Matrix-With-Python/).

#### Notice that false positives and false negatives have very different probabilities. Think about why this is the case.

*Hint: Remember the imbalance in number of healthy and unhealthy samples.*
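
As a sketch on hypothetical labels, treating healthy (1) as the positive class: scikit-learn's `confusion_matrix` puts true classes on the rows and predicted classes on the columns, so for 0/1 labels `ravel()` yields the counts in the order TN, FP, FN, TP. The error rates are then normalised by the number of true cases in each class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: six positive (1) and two negative (0) cases
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted
tn, fp, fn, tp = cm.ravel()

fn_rate = fn / (fn + tp)   # fraction of true positives predicted as 0
fp_rate = fp / (fp + tn)   # fraction of true negatives predicted as 1
print(cm)
print(fn_rate, fp_rate)
```

In this toy case a single mistake in each class gives very different rates (1/6 versus 1/2), simply because the classes have different sizes.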

In [None]:
from sklearn.metrics import confusion_matrix
# Add your code below
# CM = ...
# false_neg = ...
# false_pos = ...


Once you have implemented the confusion matrix and the error probabilities for healthy and unhealthy cases, uncomment and run the cell below to inspect these values:

In [None]:
# print(CM)
# print(false_neg)
# print(false_pos)

#### 14. Calculate and print the decision probability threshold that would give the same error probability for false positives and false negatives. 

*Hint: Find where in the ROC curve false negatives become larger than false positives, using the `np.where()` function. Then use the third element of the ROC output (`ROC2[2]`), which tells the probability threshold at that point of the ROC curve.*
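
One possible reading of this hint, sketched on hypothetical arrays shaped like the output of `roc_curve` (the values are made up): the false negative rate is one minus the true positive rate, and the crossing point is the first index where the false positive rate reaches it. Since the ROC curve is only known at discrete points, this picks the first point at or past the crossing rather than an exact equality.

```python
import numpy as np

# Hypothetical ROC output, in the order (fpr, tpr, thresholds)
fpr = np.array([0.0, 0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0, 1.0])
thresholds = np.array([np.inf, 0.95, 0.85, 0.7, 0.4, 0.1])

fnr = 1 - tpr   # false negative rate

# First point where the false positive rate is at least the false negative rate
idx = np.where(fpr >= fnr)[0][0]
balanced_threshold = thresholds[idx]
print(idx, balanced_threshold)
```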

In [None]:
# Add your code below
# index_roc = ...
# decision_thres = ...


When you have calculated `decision_thres`, uncomment the cell below to print it out:

In [None]:
# decision_thres

#### 15. Mark over the ROC curve the point corresponding to the balanced error choice above.

#### Use `ax_roc.plot()` to plot over the ROC figure above. Plot a single marker by plotting the x and y values of the ROC curve in `ROC2` for the correct index. Use the plotting argument `'gx'` for a green X marker.

#### Also, calculate the ROC point for the 50% decision threshold considered above. Mark it as a red X marker in the same figure.

#### Which decision threshold is preferable? What would this preference depend on?

*Hint: Think about which is worse, false positives or false negatives.*

In [None]:
# Add your code below
# ROC_index = ...
# ax_roc.plot(...)
fig_roc

## Implementing the model from scratch

Most machine learning models are trained with gradient descent, updating model parameters with small changes in the direction that reduces the training loss. 

#### To learn in detail what goes on when we use libraries to train our model,  we will now implement the gradient descent algorithm from scratch and compare the results with the scikit-learn package model. 

#### We give you here the gradient and loss functions for the logistic regression model.

In [None]:
sigma = lambda u: 1/(1+np.exp(-u))
def gradient(w,x,y):
    y_prob = sigma(x@w)
    return np.dot(x.T, (y_prob - y))/len(y)

def loss(w,x,y):
    y_prob = sigma(x@w)
    return -(y*np.log(y_prob)+(1-y)*np.log(1-y_prob)).mean()

#### 16. Implement a function that takes the input and output of the logistic regression model and fits it using gradient descent. 

Use random Gaussian initialization (with unit variance), no regularization, a learning rate of 2.0, and 200 gradient descent steps. 

Also, calculate the total loss at each step. Return the final weight parameters and the total loss time series.
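
The update rule behind these instructions can be sketched as follows, on synthetic data rather than the blood dataset. The helper name `gradient_descent`, its parameters `lr`, `n_steps` and `seed`, and the use of `np.random.default_rng` are assumptions made for this illustration; the `sigma`, `gradient` and `loss` definitions are repeated from the cell above so that the block runs on its own.

```python
import numpy as np

# Same definitions as the notebook's sigma, gradient and loss
sigma = lambda u: 1 / (1 + np.exp(-u))

def gradient(w, x, y):
    y_prob = sigma(x @ w)
    return np.dot(x.T, (y_prob - y)) / len(y)

def loss(w, x, y):
    y_prob = sigma(x @ w)
    return -(y * np.log(y_prob) + (1 - y) * np.log(1 - y_prob)).mean()

def gradient_descent(x, y, lr=2.0, n_steps=200, seed=0):
    """Hypothetical helper: plain gradient descent from a Gaussian start."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x.shape[1])        # unit-variance initialization
    history = np.zeros(n_steps)
    for t in range(n_steps):
        history[t] = loss(w, x, y)         # loss before this update
        w = w - lr * gradient(w, x, y)     # step against the gradient
    return w, history

# Synthetic data: a constant column plus two standard-normal features
rng = np.random.default_rng(1)
x_toy = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y_toy = (rng.random(300) < sigma(x_toy @ w_true)).astype(float)

w_fit, history = gradient_descent(x_toy, y_toy)
print(w_fit)
print(history[0], history[-1])
```

Plotting `history` against the step index is a simple way to judge whether the loss is still decreasing or has flattened out.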

In [None]:
np.random.seed(0)
# Add your code below
# def fit(X,y):
#    ...
#    for t in ...
#      ...
#    return w, loss_t


#### 17. Fit the model to our clean and normalized inputs. 

#### Calculate and print the final weights, named `w`, and plot the loss over time, named `loss_t`. 

#### Does it learn over time? Has it nearly converged?

Remember to add a constant column to the input for the bias parameter.
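
A minimal sketch of why the constant column works, using hypothetical numbers: with a column of ones in the input, one entry of the weight vector acts as the bias, so `X @ w` already includes it. Placing the constant column first is a choice made here; placing it last works equally well as long as you remember which weight is the bias.

```python
import numpy as np

# Hypothetical toy inputs with two features
X_toy = np.array([[0.5, -1.2],
                  [1.5, 0.3],
                  [-0.7, 2.0]])

# Prepend a column of ones
X_bias = np.column_stack([np.ones(len(X_toy)), X_toy])

# Hypothetical weights: the first entry plays the role of the bias
w = np.array([0.1, 2.0, -1.0])

# X_bias @ w equals bias + features @ remaining weights
print(X_bias @ w)
print(w[0] + X_toy @ w[1:])
```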

In [None]:
# Add your code below
# X = ...
# Y = ...
# w, loss_t = ...
# print...
# plt...


#### 18. Calculate and plot the probability predictions histograms for healthy and unhealthy samples. 

#### Use the same histogram properties as above

#### Has this model learned well?

*Hint: the output probability of the logistic regression model is $p(y=1 \mid x) = \sigma(w^T x + b)$, where the bias $b$ is the weight of the constant column.*

In [None]:
# Add your code below
# y_prob3 = ...
# plt.hist...
# ...


#### 19. Compare the coefficients learned with our model and the scikit-learn model above. 

#### Make a scatter plot using `plt.scatter`, with the coefficient vector of each model as arguments. 

#### As a reference, plot a black dashed line on the main diagonal of the plot (i.e. where x=y). 

*Hint: Use `plt.plot([a,b],[a,b])` for some values `a` and `b`, and `'k--'` as argument.* 

#### Also, calculate the correlation between these two vectors using the function `np.corrcoef()`, save it as `correlation`, and print it. 

#### Are the two models similar? Why should they be different, similar or identical?
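
As a small sketch with hypothetical coefficient vectors: `np.corrcoef` returns a 2×2 correlation matrix for two 1-D inputs, and the correlation between them is the off-diagonal entry. The names `coef_a`, `coef_b` and `corr` are made up for this example.

```python
import numpy as np

# Hypothetical coefficient vectors from two models
coef_a = np.array([0.8, -1.5, 0.3, 2.1])
coef_b = np.array([0.7, -1.4, 0.4, 2.0])

# 2x2 matrix; entry [0, 1] is the correlation between the two vectors
corr_matrix = np.corrcoef(coef_a, coef_b)
corr = corr_matrix[0, 1]
print(corr)

# Endpoints for a diagonal reference line covering both vectors,
# e.g. for plt.plot([lo, hi], [lo, hi], 'k--')
lo = min(coef_a.min(), coef_b.min())
hi = max(coef_a.max(), coef_b.max())
print(lo, hi)
```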

In [None]:
# Add your code below
# plt.scatter...
# ... 
