#### Write a 500-1’000 word tutorial about the F1 score and how to use it in Scikit-learn with an example - you can choose the data and model. The target audience is familiar with Python and the sklearn library: estimators API (e.g. fit, score), related tools (e.g. pipelines), common classification models (e.g. logistic regression), but not yet with the precision and recall metrics.

Let's imagine we want to design a diagnostic kit for disease detection. After sampling a drop of blood from your fingertip, the result shown on the device is either 'yes' (positive) or 'no' (negative). However, the kit has a small detection error, so the diagnosis is sometimes incorrect. This gives four possible outcomes:

- The person has the disease and the kit correctly diagnoses having the disease --> true positive (TP)
- The person does not have the disease and the kit correctly diagnoses not having the disease --> true negative (TN)
- The person has the disease and the kit incorrectly diagnoses not having the disease --> false negative (FN)
- The person does not have the disease and the kit incorrectly diagnoses having the disease --> false positive (FP)
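
These four outcomes are usually counted in a confusion matrix. The sketch below uses sklearn's `confusion_matrix` on made-up labels: the two lists are purely illustrative and do not come from any real kit.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and kit results for ten people (1 = has the disease)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_kit = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_kit).ravel()
print("TP={}, TN={}, FN={}, FP={}".format(tp, tn, fn, fp))
```

Rows of the matrix correspond to the true class and columns to the predicted class, which is why the flattened order starts with the true negatives.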

Depending on the disease, your physician might decide to choose a different type of kit:

- Missing a case of tuberculosis (TB) could mean that you would be free to leave and could spread a potentially fatal disease to other people, and therefore the physician would prefer to have a kit that maximizes true positives and minimizes false negatives.
- Wrongly diagnosing a common cold could mean that you might need further medical examination at a certain cost, and therefore the physician might prefer to have a kit that maximizes true positives and minimizes false positives.

This simplified yet useful example introduces two key concepts in classification tasks:
- $\textrm{Recall} = \frac{TP}{TP+FN} \rightarrow$ proportion of actual positive cases that are correctly detected
- $\textrm{Precision} = \frac{TP}{TP+FP} \rightarrow$ proportion of positive predictions that are actually correct

These two metrics are interdependent. Let's go back to the TB example and assume that, for the sake of public health, we design a kit that always predicts that the person has the disease. This means that we would not miss any case (no false negatives!), and the recall of the kit would be exactly one. However, the prevalence of tuberculosis in Switzerland is 7 cases per 100k inhabitants, which means that we would be sending to hospital a huge number of people who do not actually have TB (false positives!), and hence our precision in this case would be close to zero.

The opposite would happen if we design a kit that never misclassifies a patient who does not have the disease (no false positives!), so the precision of this new kit would be exactly one. However, if out of those 7 actual TB cases we only manage to pick up one with the kit, there would be 6 undetected people at large (false negatives!) spreading a potentially fatal disease.
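
The trade-off can be checked with a few lines of arithmetic. This is a minimal sketch for a population of 100k people with 7 TB cases, as in the text; the helper functions and the names "kit A" and "kit B" are introduced here for illustration only.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positive cases that are detected."""
    return tp / (tp + fn)


population, sick = 100_000, 7

# Kit A always answers 'yes': every sick person is caught, everyone else is a false alarm
tp_a, fp_a, fn_a = sick, population - sick, 0
# Kit B never raises a false alarm but detects only one of the seven cases
tp_b, fp_b, fn_b = 1, 0, 6

kit_a = (precision(tp_a, fp_a), recall(tp_a, fn_a))
kit_b = (precision(tp_b, fp_b), recall(tp_b, fn_b))
print("Kit A: precision={:.5f}, recall={:.2f}".format(*kit_a))
print("Kit B: precision={:.2f}, recall={:.2f}".format(*kit_b))
```

Kit A reaches a recall of one with a precision of 7 in 100k, while kit B reaches a precision of one with a recall of one in seven.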

To combine precision and recall in a single metric, we can use the F1 score, which is the harmonic mean of precision and recall:
$$\textrm{F1 score} = 2 \times \frac{\textrm{precision} \times \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$
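
To see why a harmonic mean is used rather than a plain average, the formula can be written as a small function. `f1_from` is a hypothetical helper, not part of sklearn, and returning zero when both inputs are zero is a convention assumed here to avoid dividing by zero.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (taken as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.5, 0.5)   # both metrics are mediocre
lopsided = f1_from(1.0, 0.1)   # perfect precision, poor recall
arithmetic = (1.0 + 0.1) / 2   # plain average of the lopsided case

print("balanced F1 = {:.2f}".format(balanced))
print("lopsided F1 = {:.2f} (arithmetic mean would be {:.2f})".format(lopsided, arithmetic))
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot obtain a high F1 score by excelling at only one of the two metrics.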

From the limit values of precision and recall, we can see that the F1 score ranges from zero to one. This metric is especially useful in classification tasks with heavily imbalanced classes, where accuracy is not a reliable measure of the predictive power of the model. Think again about the TB case and imagine we design a kit that always predicts that the person does not have the disease. Given the prevalence in Switzerland, that kit would have a prediction accuracy of 99.993%, and yet it would be a very serious public health concern, since it would miss every single case (a recall of zero).
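
The gap between accuracy and F1 score for such an "always negative" kit can be reproduced with sklearn. This is a sketch on synthetic labels (100k people, 7 positives, matching the prevalence quoted above); `zero_division=0` is passed so that the undefined precision is reported as zero without a warning, which assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic population: 7 people with TB out of 100k
y_true = np.zeros(100_000, dtype=int)
y_true[:7] = 1

# The kit always answers 'no'
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positive predictions at all

print("accuracy = {:.5f}, F1 score = {:.2f}".format(acc, f1))
```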

So, the key take-away message: you need to choose the most suitable metric for the problem you are trying to solve! And you can now add the F1 score to your collection :-)

Let's go through a specific example using sklearn and the breast cancer Wisconsin (Diagnostic) Data Set that you can find in the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [48]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, ParameterGrid, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier

In [49]:
input_data = pd.read_csv('~/Desktop/Analytics/wdbc.txt', sep=',', header=None) # read the txt data with pandas
input_data.columns = ['ID', 'Diagnosis'] + list(np.arange(1, 31).astype('object')) # rename the columns
input_data['target'] = input_data.Diagnosis.map({'B' : 0, 'M' : 1}) # assign the positive class to the malignant tumour.

In [55]:
# Apply a simple random forest model and calculate the F1 score, precision and recall using sklearn. This ensemble model has been chosen because of the class imbalance in the input data.

# Drop the ID column as well: it is a sample identifier, not a feature
X = input_data.drop(columns=['ID', 'Diagnosis', 'target']).values
y = input_data.target.values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0) # create train and test sets

rf = RandomForestClassifier(n_estimators=500, max_depth=50) # random forest model

rf.fit(X_tr, y_tr) # fit the random forest model 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [56]:
# Once the model has been fitted, the F1 score, precision and recall can be calculated using sklearn's metrics module:

from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = rf.predict(X_te) # predict the test classes
f1 = f1_score(y_te, y_pred) # calculate the F1 score
precision = precision_score(y_te, y_pred) # calculate the precision
recall = recall_score(y_te, y_pred) # calculate the recall 

print('The random forest model has an F1 score of {:.2f}, precision of {:.2f}, and recall of {:.2f} for the test set'.format(f1, precision, recall))

The random forest model has an F1 score of 0.93, precision of 0.91, and recall of 0.95 for the test set
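
A single train/test split gives only one estimate of the F1 score. As a possible next step, the sketch below cross-validates the same kind of model with `scoring="f1"`. It makes a few assumptions that differ from the cells above: it loads sklearn's bundled copy of the dataset through `load_breast_cancer` instead of the local `wdbc.txt` file, it flips the bundled labels (where 0 means malignant) so that malignant is the positive class as in this tutorial, and it uses 100 trees with a fixed `random_state` to keep the run short and repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X = data.data
# The bundled labels use 0 = malignant, 1 = benign; flip them so that 1 = malignant
y = (data.target == 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# scoring="f1" computes the F1 score of the positive class (label 1) on each fold
scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print("F1 per fold:", scores.round(2))
print("Mean F1: {:.2f}".format(scores.mean()))
```

The same `scoring="f1"` string can be passed to `GridSearchCV` when hyperparameters should be tuned for the F1 score rather than for accuracy.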
