### Fine-tuning
---

**How good is your model?**


**Classification Metrics**

- Measuring model performance with accuracy:
    - Fraction of correctly classified samples;
    - Not always a useful metric;
    
**Class imbalance examples: Emails**

- Spam classification: 
    - 99% of emails are real and only 1% are spam. 
- We could build a classifier that labels ALL emails as real:
    - This model would be correct 99% of the time. 99% accurate!
    - But it would be horrible at actually classifying spam. 
    - It never predicts spam at all, so it completely fails at its original purpose. 

The situation in which one class is much more frequent than the other is called __class imbalance__; here, the class of real emails contains far more instances than the class of spam. This is a very common situation in practice and requires a more nuanced metric to assess the performance of our model. 
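To make the problem concrete, here is a minimal sketch of the "always real" classifier on made-up data with the same 99%/1% split. The label encoding (1 = spam, 0 = real) and the sample size of 1000 are assumptions chosen for illustration.

```python
# Toy labels: 1 = spam, 0 = real (assumed encoding). 1% of emails are spam.
y_true = [1] * 10 + [0] * 990

# A "classifier" that labels every email as real.
y_pred = [0] * len(y_true)

# Accuracy: fraction of correctly classified samples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual spam emails were caught?
spam_total = sum(t == 1 for t in y_true)
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy:    {accuracy:.2f}")
print(f"spam caught: {spam_caught} of {spam_total}")
```

The accuracy looks excellent, yet not a single spam email is detected, which is why accuracy alone is not enough here.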

**Diagnosing classification predictions**

Given a binary classifier, such as our spam email example, we can draw up a 2-by-2 matrix, called a __confusion matrix__, that summarizes predictive performance:

<img src="images/confusionmatrix.png" width="400" style="float:left"/>

Given any model, we can fill in the confusion matrix according to its predictions. 
- In the top left square (true positive), we have the number of spam emails correctly labeled;
- In the bottom right square (true negative), we have the number of real emails correctly labeled; 
- In the top right (false negative), the number of spam emails incorrectly labeled as real;
- In the bottom left (false positive), the number of real emails incorrectly labeled as spam. 
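The four squares can be filled in by counting, as in the sketch below. The helper `confusion_counts` and the toy label lists are hypothetical, and the layout (rows are actual classes, columns are predicted classes, spam first) is assumed from the description above.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fn, fp, tn), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # spam correctly labeled as spam
            else:
                fn += 1   # spam incorrectly labeled as real
        else:
            if predicted == positive:
                fp += 1   # real email incorrectly labeled as spam
            else:
                tn += 1   # real email correctly labeled as real
    return tp, fn, fp, tn


# Made-up labels for eight emails.
y_true = ["spam", "spam", "spam", "real", "real", "real", "real", "real"]
y_pred = ["spam", "spam", "real", "spam", "real", "real", "real", "real"]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Rows: actual spam, actual real. Columns: predicted spam, predicted real.
matrix = [[tp, fn],
          [fp, tn]]
print(matrix)
```

Note that scikit-learn's `confusion_matrix` orders rows and columns by sorted label value, so with 0/1 labels it returns `[[tn, fp], [fn, tp]]` rather than the layout used here.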

Usually, the "class of interest" is called the positive class. As we are trying to detect spam, this makes spam the positive class. Which class you call positive is really up to you. So why do we care about the confusion matrix? 

- Accuracy (it can be read off the confusion matrix): 
    - It's the sum of the diagonal divided by the total sum of the matrix.
    
        $\frac{tp+tn}{tp+tn+fp+fn}$
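As a small worked example of the formula, the sketch below computes accuracy from a 2-by-2 matrix of counts. The numbers are illustrative only, and the `[[tp, fn], [fp, tn]]` layout is an assumption.

```python
# Example counts laid out as [[tp, fn], [fp, tn]] (illustrative numbers only).
cm = [[2, 1],
      [1, 4]]

diagonal = cm[0][0] + cm[1][1]          # tp + tn
total = sum(sum(row) for row in cm)     # tp + fn + fp + tn
accuracy = diagonal / total

print(accuracy)  # 0.75
```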

**Metrics from the Confusion Matrix**

There are several other important metrics that we can easily calculate from the confusion matrix.
- Precision: the number of true positives divided by the total number of true positives and false positives. It's also called the positive predictive value, or PPV. (In our example, this is the number of correctly labeled spam emails divided by the total number of emails classified as spam.)
- Recall: the number of true positives divided by the total number of true positives and false negatives. It's also called sensitivity, hit rate, or true positive rate.
- F1-score: two times the product of precision and recall divided by the sum of precision and recall; in other words, the harmonic mean of precision and recall. 

Precision: $\frac{tp}{tp+fp}$ 

Recall: $\frac{tp}{tp+fn}$

F1-score: $2 \cdot \frac{precision \cdot recall}{precision + recall}$
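The three formulas translate directly into code. The sketch below uses a hypothetical helper, `precision_recall_f1`, with made-up counts, and it assumes none of the denominators is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0, and precision + recall > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 30 spam caught, 10 real emails flagged, 20 spam missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

Because F1 is a harmonic mean, it lands between precision and recall and sits closer to the smaller of the two.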

In [None]:
#Confusion matrix in scikit-learn
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#Instantiate our classifier
knn = KNeighborsClassifier(n_neighbors=8)

#Split the data into train and test sets
#(X and y are assumed to hold the features and target labels loaded earlier)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

#To compute the confusion matrix, we pass the test set labels 
#and the predicted labels to the function confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

For these scikit-learn classification metrics, the first argument is always the true labels and the second argument is always the predictions. 
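The argument order matters because most of these metrics are not symmetric. As a small sketch on made-up 0/1 labels, swapping the arguments of `precision_score` silently returns the recall instead:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

correct_order = precision_score(y_true, y_pred)   # true labels first
swapped_order = precision_score(y_pred, y_true)   # arguments swapped by mistake
recall = recall_score(y_true, y_pred)

print(f"precision (correct order): {correct_order:.3f}")
print(f"precision (swapped order): {swapped_order:.3f}")
print(f"recall:                    {recall:.3f}")
```

No error is raised in the swapped call, so the mistake is easy to miss.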

### Logistic regression and the ROC curve
---

Despite its name, logistic regression is used in classification problems, not regression problems. 

**Logistic regression for binary classification**
- Given one feature, logistic regression outputs a probability, _p_, that the observation belongs to class **"1"** of the target variable. 
- If the probability _p_ is greater than 0.5, we label the data as **"1"**;
- If the probability _p_ is less than 0.5, we label the data as **"0"**.


<img src="images/lineardescisionboundary.png" width="400" style="float:left"/>
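The thresholding rule above can be sketched in a few lines. The `weight` and `bias` values are hypothetical, not fitted to any data, and the probability comes from the standard logistic (sigmoid) function. Labeling an exact tie at 0.5 as "0" is an assumption of this sketch.

```python
import math


def predict_proba(x, weight, bias):
    """Logistic (sigmoid) function applied to a linear score of one feature."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def predict_label(x, weight, bias, threshold=0.5):
    """Label as 1 if p is greater than the threshold, otherwise 0."""
    return 1 if predict_proba(x, weight, bias) > threshold else 0


# Hypothetical parameters, chosen only for illustration.
weight, bias = 2.0, -1.0

for x in (-1.0, 0.5, 2.0):
    p = predict_proba(x, weight, bias)
    print(f"x={x:+.1f}  p={p:.3f}  label={predict_label(x, weight, bias)}")
```

With these parameters the probability is exactly 0.5 at `x = 0.5`, which is where the decision boundary sits for this one-feature model.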