*Credits: Applied Data Analysis (ADA) course at EPFL (https://dlab.epfl.ch/teaching/fall2020/cs401/)*

## Applied Machine Learning

Welcome to the last tutorial of the course, and congratulations on making it this far! In this tutorial, we will go through the main concepts learned during the course using a real-world use case.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from itertools import combinations 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sn

%matplotlib inline

### All you need is love… And a pet!

<img src="img/dataset-cover.jpg" width="920">

Here we are going to build a classifier to predict whether an animal from an animal shelter will be adopted or not (aac_intakes_outcomes.csv, available at: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/version/1#aac_intakes_outcomes.csv). You will be working with the following features:

1. *animal_type:* Type of animal. May be one of 'cat', 'dog', 'bird', etc.
2. *intake_year:* Year of intake
3. *intake_condition:* The intake condition of the animal. Can be one of 'normal', 'injured', 'sick', etc.
4. *intake_number:* The number of times the animal has been brought into the shelter. Values higher than 1 indicate the animal has been taken into the shelter on more than one occasion.
5. *intake_type:* The type of intake, for example, 'stray', 'owner surrender', etc.
6. *sex_upon_intake:* The sex of the animal and whether it had been spayed or neutered at the time of intake
7. *age_upon\_intake_(years):* The age of the animal upon intake represented in years
8. *time_in_shelter_days:* Numeric value denoting the number of days the animal remained at the shelter from intake to outcome.
9. *sex_upon_outcome:* The sex of the animal and whether it had been spayed or neutered at the time of outcome
10. *age_upon\_outcome_(years):* The age of the animal upon outcome represented in years
11. *outcome_type:* The outcome type. Can be one of ‘adopted’, ‘transferred’, etc.

### Data processing

First things first! Let's load the data into memory using Pandas:

In [None]:
# add your code here
original_data.head(5)

Let's check whether there are any missing values in the DataFrame. [This website](https://datatofish.com/check-nan-pandas-dataframe/) gives a great overview of the options you have for checking this. Try to print how many values are missing in each column. *Hint*: `isna`, `isnull`.
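
As a minimal sketch of the idea, the snippet below counts missing values per column on a small toy DataFrame. The column values are hypothetical and only illustrate the pattern; they are not taken from the shelter data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and a few missing entries
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", None, "Bird"],
    "outcome_type": ["Adoption", None, None, "Transfer"],
    "intake_number": [1.0, 2.0, 1.0, np.nan],
})

# isna() marks each missing cell with True; summing the booleans
# column by column gives the number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

`isnull` is an alias of `isna`, so both give the same result.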

In [None]:
# add your code here

Since the number of missing values is very small compared to the size of the data, and since most of the missing values are in the target variable `outcome_type`, we simply drop every row that contains a null value. *Hint*: to do this, you may want to use pandas' `dropna`.
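
A minimal sketch of this step on a toy DataFrame (hypothetical values), assuming the default behaviour of `dropna` is what we want:

```python
import pandas as pd

# Toy frame with one incomplete row (hypothetical values)
toy = pd.DataFrame({
    "intake_type": ["Stray", "Owner Surrender", "Stray"],
    "outcome_type": ["Adoption", None, "Transfer"],
})

# With default arguments, dropna() removes every row that
# contains at least one missing value and returns a new frame
toy_clean = toy.dropna()
print(len(toy), len(toy_clean))
```

Note that `dropna` returns a new DataFrame, so the result has to be assigned to a variable.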

In [None]:
print('The length of the data with all rows is : {}'.format(len(original_data)))
original_data = ...
print('The length of the data without the rows with nan value is: {}'.format(len(original_data)))

How many different values does the column _outcome\_type_ have? Print them:

In [None]:
# add your code here

In this task, we will focus only on whether the animal was adopted or not. Create the column _adopted_, which has the value 1 if the value of _outcome\_type_ is 'Adoption', and 0 otherwise. *Hint*: `apply`, `lambda`.
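
The sketch below shows the pattern on a toy column (hypothetical values). It uses `apply` with a `lambda`, and also a vectorized comparison that gives the same result.

```python
import pandas as pd

# Toy outcome column (hypothetical values)
toy = pd.DataFrame({"outcome_type": ["Adoption", "Transfer", "Adoption", "Euthanasia"]})

# apply() calls the lambda once per value and collects the results
toy["adopted"] = toy["outcome_type"].apply(lambda outcome: 1 if outcome == "Adoption" else 0)

# Equivalent vectorized form: compare, then cast the booleans to integers
adopted_vectorized = (toy["outcome_type"] == "Adoption").astype(int)
print(toy)
```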


In [None]:
data = original_data.copy()
data['adopted'] = ...

Now, drop the column _outcome\_type_, since we do not need it anymore. *Hint*: `drop`.

In [None]:
data = ...
data.head()

Select the data features (all columns but _adopted_) and the data label (_adopted_) for the task. After this, split the data into a training set (80%) and a test set (20%). You may use sklearn's function `train_test_split` with `random_state=42`. See the documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
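
As a rough illustration, here is an 80/20 split on toy data. The feature names `x1` and `x2` and the toy label values are placeholders, not columns of the shelter dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and label with 10 rows (hypothetical values)
toy_features = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
toy_label = pd.Series([0, 1] * 5, name="adopted")

# test_size=0.2 keeps 20% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_label, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

Features and labels are shuffled together, so the row index of `X_train` still matches the index of `y_train`.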

In [None]:
data_features = data.drop(columns=['adopted'])
data_label = data['adopted']

train_features, test_features, train_label, test_label = ...

print('Length of the train dataset : {}'.format(len(train_features)))
print('Length of the test dataset : {}'.format(len(test_label)))

The dataset contains categorical features. We need to convert them to a suitable numerical representation. We will use pandas' `get_dummies` function to apply a dummy-variable encoding.

In [None]:
train_categorical = ...
train_categorical.head()

We will do the same with the test set. However, we have to take into account that the features in the test set must be matched with the ones in the training set.
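
One possible way to match the columns, sketched on toy data, is to encode both sets with `get_dummies` and then `reindex` the test columns against the training columns. This is an assumption about how to do the alignment, not necessarily the approach intended by the exercise; the toy values are hypothetical.

```python
import pandas as pd

# Toy train/test frames (hypothetical values). The test set contains
# a category ("Other") that never appears in the training set.
toy_train = pd.DataFrame({"animal_type": ["Dog", "Cat", "Bird"], "intake_number": [1, 2, 1]})
toy_test = pd.DataFrame({"animal_type": ["Dog", "Other"], "intake_number": [3, 1]})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Keep exactly the training columns, in the training order:
# columns missing from the test set are filled with 0,
# columns that exist only in the test set are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```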

In [None]:
# Make sure we use only the features available in the training set
test_categorical = ...
test_categorical.head()

Let's normalize the values of each feature in the data to have mean 0 and variance 1. For this, we will use sklearn's `StandardScaler` class. See its documentation for details: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Declare an instance of the scaler and fit it with the training features.

In [None]:
scaler = ...
# fit the scaler

Now, normalize the training features. *Hint:* use `.transform()`

In [None]:
scaled_features = ...

# The output of the .transform() function is a numpy matrix. We transform it back to a DataFrame
train_features_std = pd.DataFrame(scaled_features, index=train_categorical.index, columns=train_categorical.columns)
train_features_std.head()

We will also normalize the test features with the same scaler, so that the mean and variance come from the training columns only. The test set stands in for unseen data: we should not look at its distribution, and instead assume that it is similar to that of the training set.
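
A minimal sketch of the fit-on-train, transform-both pattern, using small toy arrays (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test arrays with two features (hypothetical values)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# fit() learns the per-column mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# transform() applies those training statistics to any data set
train_std = scaler.transform(train)
test_std = scaler.transform(test)

print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std)
```

The standardized training columns have mean 0 and variance 1 by construction. The test columns generally do not, because they are scaled with the training statistics.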

In [None]:
scaled_features = ...

# The output of the .transform() function is a numpy matrix. We transform it back to a DataFrame
test_features_std = pd.DataFrame(scaled_features, index=test_categorical.index, columns=test_categorical.columns)
test_features_std.head()

### Training and evaluation phases

Since this is a classification task, we will make use of Logistic Regression.

Declare and train a Logistic Regression Classifier on your training set. For this, you can use the constructor `LogisticRegression` from sklearn. Check out further information in https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Set `max_iter=10000`.

In [None]:
logistic = ...
# train the model

Print the predicted probabilities obtained for the test set. You can use `.predict_proba()` for this:
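
The sketch below runs both steps, training and `predict_proba`, on synthetic data. The data-generating rule is an arbitrary assumption made only so that the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with an arbitrary labelling rule (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Declare the classifier, then train it
model = LogisticRegression(max_iter=10000)
model.fit(X, y)

# predict_proba returns one row per sample and one column per class,
# ordered as in model.classes_ (here: column 0 is class 0, column 1 is class 1)
proba = model.predict_proba(X[:5])
print(model.classes_)
print(proba)
```

Each row of the output sums to 1, and the second column is the predicted probability of the positive class.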

In [None]:
prediction_proba = ...
prediction_proba

Logistic Regression returns probabilities as predictions, so in order to arrive at a binary prediction, you need to put a threshold on the predicted probabilities. 
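
To make the effect of the threshold concrete, here is a small sketch with hand-written probabilities (hypothetical values, laid out like the output of `.predict_proba()`):

```python
import numpy as np

# Hypothetical output of predict_proba: column 0 = P(class 0), column 1 = P(class 1)
proba = np.array([[0.90, 0.10],
                  [0.40, 0.60],
                  [0.20, 0.80],
                  [0.55, 0.45]])
positive_proba = proba[:, 1]

# A sample is labelled positive when its positive-class probability exceeds the threshold
labels_at_05 = (positive_proba > 0.5).astype(int)  # [0, 1, 1, 0]
labels_at_07 = (positive_proba > 0.7).astype(int)  # [0, 0, 1, 0]
print(labels_at_05, labels_at_07)
```

Raising the threshold can only turn positive predictions into negative ones, never the reverse.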

The function below computes a confusion matrix given the true labels, the predicted probabilities, and the chosen decision threshold. It thresholds the probability of the positive class and then counts the true positives, true negatives, false positives, and false negatives.

After this, we will print the confusion matrix for a decision threshold of 0.5.

In [None]:
def compute_confusion_matrix(true_label, prediction_proba, decision_threshold): 
    
    # Get the predicted label based on the threshold chosen
    predict_label = (prediction_proba[:,1]>decision_threshold).astype(int)   
                                                                                                                       
    TP = np.sum(np.logical_and(predict_label==1, true_label==1))
    TN = np.sum(np.logical_and(predict_label==0, true_label==0))
    FP = np.sum(np.logical_and(predict_label==1, true_label==0))
    FN = np.sum(np.logical_and(predict_label==0, true_label==1))
    
    confusion_matrix = np.asarray([[TP, FP],
                                    [FN, TN]])
    return confusion_matrix


confusion_matrix_05 = compute_confusion_matrix(test_label, prediction_proba, 0.5)
confusion_matrix_05

Let's plot the confusion matrix (code complete):

In [None]:
def plot_confusion_matrix(confusion_matrix):
    [[TP, FP],[FN, TN]] = confusion_matrix
    label = np.asarray([['TP {}'.format(TP), 'FP {}'.format(FP)],
                        ['FN {}'.format(FN), 'TN {}'.format(TN)]])
    
    df_cm = pd.DataFrame(confusion_matrix, index=['Yes', 'No'], columns=['Positive', 'Negative']) 
    
    return sn.heatmap(df_cm, cmap='YlOrRd', annot=label, annot_kws={"size": 16}, cbar=False, fmt='')

plt.figure(figsize = (6,4)) 
ax = plot_confusion_matrix(confusion_matrix_05)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Confusion matrix for a 0.5 threshold')
plt.show()

The function below computes the accuracy, precision, recall, and F1-score with respect to the positive and the negative class. Complete the function by filling in the formulas for all these metrics.

After this, we will print all the scores for a decision threshold of 0.5.
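
As a worked example of the standard definitions, the sketch below computes the metrics from hypothetical counts. The numbers are made up for illustration and the variable names are not those of the exercise function.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40.0, 10.0, 20.0, 30.0

# Accuracy: fraction of all predictions that are correct
acc = (tp + tn) / (tp + fp + fn + tn)

# Positive class: precision = correct among predicted positives,
# recall = correct among actual positives
prec_pos = tp / (tp + fp)
rec_pos = tp / (tp + fn)
f1_pos = 2 * prec_pos * rec_pos / (prec_pos + rec_pos)

# Negative class: the same definitions with the roles of the classes swapped
prec_neg = tn / (tn + fn)
rec_neg = tn / (tn + fp)
f1_neg = 2 * prec_neg * rec_neg / (prec_neg + rec_neg)

print(acc, prec_pos, rec_pos, f1_pos, prec_neg, rec_neg, f1_neg)
```

The F1-score is the harmonic mean of precision and recall, so it is high only when both are high.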

In [None]:
def compute_all_score(confusion_matrix, t=0.5):
    [[TP, FP],[FN, TN]] = confusion_matrix.astype(float)
    
    accuracy = ...
    
    precision_positive = ...
    precision_negative = ...
    
    recall_positive = ...
    recall_negative = ...

    F1_score_positive = ...
    F1_score_negative = ...

    return [t, accuracy, precision_positive, recall_positive, F1_score_positive, precision_negative, recall_negative, F1_score_negative]


[t, accuracy, precision_positive, recall_positive, F1_score_positive, \
    precision_negative, recall_negative, F1_score_negative] = compute_all_score(confusion_matrix_05)

print("The accuracy of this model is {0:1.3f}".format(accuracy))
print("For the positive case, the precision is {0:1.3f}, the recall is {1:1.3f} and the F1 score is {2:1.3f}"\
      .format(precision_positive, recall_positive, F1_score_positive))
print("For the negative case, the precision is {0:1.3f}, the recall is {1:1.3f} and the F1 score is {2:1.3f}"\
      .format(precision_negative, recall_negative, F1_score_negative))

### Further visual analysis (code complete)

We will vary the value of the threshold in the range from 0 to 1 and visualize the value of accuracy, precision, recall, and F1-score (with respect to both classes) as a function of the threshold.

In [None]:
threshold = np.linspace(0, 1, 100)

The code below computes all the metrics for each of the threshold levels and stores them in a pandas DataFrame.

In [None]:
columns_score_name = ['Threshold', 'Accuracy', 'Precision P', 'Recall P', 'F1 score P', \
                                              'Precision N', 'Recall N', 'F1 score N']
threshold_score = pd.concat([pd.DataFrame([compute_all_score(compute_confusion_matrix(test_label, prediction_proba, t ),t)]\
                                             , columns=columns_score_name) for t in threshold], ignore_index=True)
threshold_score.set_index('Threshold', inplace=True)

threshold_score.head()

We will now plot the accuracy as a function of the threshold

In [None]:
threshold_score['Accuracy'].plot(grid=True).set_title('Accuracy')

We will now plot the rest of the metrics as a function of the threshold

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(15,7))

col_plot = ['Precision P', 'Recall P', 'F1 score P', 'Precision N', 'Recall N', 'F1 score N']

major_ticks = np.linspace(0,1,5)

for axe, col in zip(axs.flat, col_plot):
    threshold_score[col].plot(ax=axe, grid = True)
    axe.set_title(col)
    axe.set_xticks(major_ticks)    
    axe.grid(which='major', alpha=0.5)

What do you observe? What do you think a good value for the threshold might be?

### Feature analysis

Based on the trained Logistic Regression model, obtain the coefficients associated with each of the features. Check out the `coef_` attribute.

**Important**: the array must have 1 dimension only.

In [None]:
logistic_coefficients = logistic.coef_[0]
logistic_coefficients

We will create a DataFrame with the name of each feature and the coefficient associated with it (code complete):

In [None]:
tmp = []
for name, value in zip(train_features_std.columns, logistic_coefficients):
    tmp.append({"name": name, "value": value})
    
features_coef = pd.DataFrame(tmp)
features_coef.head()

Sort this DataFrame in ascending order by value. `sort_values`

In [None]:
features_coef = features_coef.sort_values(by=['value'])
features_coef.head()

Let's plot in a bar chart the coefficients of the logistic regression sorted by their contribution to the prediction.

In [None]:
plt.subplots(figsize=(5,7))
plt.barh(features_coef.name, features_coef.value, alpha=0.6)
plt.show()

How can you interpret this information? **Hint**: recall that
$$P(y=1 \mid x,\beta) = \frac{1}{1+\exp(-\beta^T x)}$$
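
One way to build intuition for this formula is to evaluate it numerically. The sketch below uses two hypothetical coefficients and, like the formula above, no intercept. It shows that increasing a feature by one unit multiplies the odds $P/(1-P)$ by $\exp(\beta_j)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two standardized features
beta = np.array([0.8, -0.5])

# All standardized features at their mean (0): the linear score is 0
x = np.array([0.0, 0.0])
p_base = sigmoid(beta @ x)

# Increase the first feature by one unit (one standard deviation after scaling)
x_up = x + np.array([1.0, 0.0])
p_up = sigmoid(beta @ x_up)

# Ratio of the odds after and before the change
odds_ratio = (p_up / (1 - p_up)) / (p_base / (1 - p_base))
print(p_base, p_up, odds_ratio, np.exp(beta[0]))
```

A positive coefficient therefore raises the predicted probability as the feature grows, a negative one lowers it, and the magnitude controls how strongly.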

In [None]:
#Insert your thoughts here#