# 4. Evaluation metrics for Classification

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('data-week-3.csv')

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

In [None]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn'] 

In [None]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

In [None]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()

# 4.2 Accuracy and Dummy Model

- Evaluate the model on different thresholds
- Check the accuracy of dummy baselines

In the last lesson we calculated that our model has an accuracy of 80% on the validation data. Now we want to know whether this is a good value or not.
Accuracy tells us the fraction of predictions that are correct.

What we did was check, for every customer in the validation dataset, whether the churn decision was correct or incorrect. This decision is based on our threshold of 0.5: a customer with a predicted value greater than or equal to 0.5 is treated as churning, and a customer with a value below the threshold as not churning.
We have 1409 customers in this dataset and we made 1132 correct decisions. That means the accuracy is 1132/1409 = 0.80.

In [None]:
len(y_val)
# Output: 1409

(y_val == churn_decision).sum()
# Output: 1132

1132 / 1409
# Output: 0.8034

(y_val == churn_decision).mean()
# Output: 0.8034

The question now is whether we chose a good value for that threshold. We can move the threshold and validate again to see whether the accuracy improves. We can use the linspace function of NumPy to get an array of thresholds (21 evenly spaced values from 0 to 1). For each of them we calculate the accuracy and then look for the best threshold value.

In [None]:
thresholds = np.linspace(0, 1, 21)
thresholds

# Output:        
# array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
#        0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])

In [None]:
scores = []

for t in thresholds:
    churn_decision = (y_pred >= t)
    score = (y_val == churn_decision).mean()
    print('%.2f %.3f' % (t, score))
    scores.append(score)

#scores

# Output: 
# 0.00 0.274
# 0.05 0.509
# 0.10 0.591
# 0.15 0.666
# 0.20 0.710
# 0.25 0.739
# 0.30 0.760
# 0.35 0.772
# 0.40 0.785
# 0.45 0.793
# 0.50 0.803
# 0.55 0.801
# 0.60 0.795
# 0.65 0.786
# 0.70 0.766
# 0.75 0.744
# 0.80 0.735
# 0.85 0.726
# 0.90 0.726
# 0.95 0.726
# 1.00 0.726

We see that 0.5 is indeed the best threshold in terms of accuracy.
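
Reading the best threshold off a printed list works for 21 values, but it can also be done programmatically. The following is a minimal sketch that assumes the accuracy values are the rounded numbers copied from the output above; `best_idx`, `best_t` and `best_score` are names introduced here for illustration only.

```python
import numpy as np

# Accuracy values copied from the printed output above, one per threshold
thresholds = np.linspace(0, 1, 21)
scores = [0.274, 0.509, 0.591, 0.666, 0.710, 0.739, 0.760, 0.772, 0.785,
          0.793, 0.803, 0.801, 0.795, 0.786, 0.766, 0.744, 0.735, 0.726,
          0.726, 0.726, 0.726]

best_idx = int(np.argmax(scores))      # position of the highest accuracy
best_t = thresholds[best_idx]
best_score = scores[best_idx]
print('best threshold: %.2f, accuracy: %.3f' % (best_t, best_score))
```

In the notebook itself the same two lines with `np.argmax` could be applied directly to the `scores` list built in the loop.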

In [None]:
# To get a nice representation, we can plot this.
# x-axis thresholds, y-axis score

plt.plot(thresholds,scores)

We used our own code to calculate the accuracy, but we can also use the existing accuracy_score function from the scikit-learn package.

In [None]:
from sklearn.metrics import accuracy_score

thresholds = np.linspace(0, 1, 21)
scores = []

for t in thresholds:
    score = accuracy_score(y_val, y_pred >= t)
    print('%.2f %.3f' % (t, score))
    scores.append(score)

# Output: 
# 0.00 0.274
# 0.05 0.509
# 0.10 0.591
# 0.15 0.666
# 0.20 0.710
# 0.25 0.739
# 0.30 0.760
# 0.35 0.772
# 0.40 0.785
# 0.45 0.793
# 0.50 0.803
# 0.55 0.801
# 0.60 0.795
# 0.65 0.786
# 0.70 0.766
# 0.75 0.744
# 0.80 0.735
# 0.85 0.726
# 0.90 0.726
# 0.95 0.726
# 1.00 0.726

There are two interesting values here: the first and the last one.
At threshold 0.0 the model predicts that every customer churns, so it is correct only for the 27% who actually do.
We know that the accuracy of our model is 80%, and the accuracy of a dummy model with threshold 1.0 is 73% (this model predicts all customers as not churning).
Why are we doing all this work only to gain 7 percentage points? That is the main issue with accuracy: it doesn't tell us how good the model is in this particular case.

In [None]:
from collections import Counter

Counter(y_pred >= 1.0)
# Output: Counter({False: 1409})

In [None]:
# Distribution of y_val
Counter(y_val)

# Output: Counter({0: 1023, 1: 386})

1023 / 1409
# Output: 0.7260468417317246

y_val.mean()
# Output: 0.2739531582682754

1 - y_val.mean()
# Output: 0.7260468417317246

We see that there are a lot more non-churning customers than churning ones: only 27% of customers are churning and 73% are non-churning.
That means the dummy model is incorrect for every churning customer but correct for every non-churning one, which already covers 73% of the data. This problem is called class imbalance: one class has a lot more examples than the other.
What this means here is that accuracy is not a good metric when dealing with class-imbalanced problems.
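
The point about class imbalance can be made concrete with a small helper that computes the accuracy of a model that always predicts the most frequent class. This is only a sketch: `majority_baseline_accuracy` is a hypothetical helper name, and `y_toy` is synthetic data built to have the same class counts as the validation set (1023 negatives, 386 positives).

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of a model that always predicts the most frequent class."""
    y = np.asarray(y)
    counts = np.bincount(y)          # number of examples per class label
    return counts.max() / len(y)

# Synthetic labels with the same class balance as the validation set above
y_toy = np.repeat([0, 1], [1023, 386])
baseline = majority_baseline_accuracy(y_toy)
print(round(baseline, 3))  # 0.726
```

Any model evaluated by accuracy on this data should be compared against this number rather than against 0.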

# 4.3 Confusion table
- Different types of errors and correct decisions
- Arranging them in a table

Here we'll talk about the confusion table (also called confusion matrix). It is a way of looking at the different errors and correct decisions that our binary classification model makes.
From the last lesson we know that we need a different way of evaluating the quality of our model, one that is not affected by class imbalance.

Let's look at all 4 cases that can happen.

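| Prediction g(xi) | Actual outcome | Decision | Name | Condition |
|---|---|---|---|---|
| < t: NEGATIVE (no churn) | 1. Customer didn't churn | correct | TRUE NEGATIVE (TN) | g(xi) < t & y = 0 |
| < t: NEGATIVE (no churn) | 2. Customer churned | incorrect | FALSE NEGATIVE (FN) | g(xi) < t & y = 1 |
| >= t: POSITIVE (churn) | 3. Customer didn't churn | incorrect | FALSE POSITIVE (FP) | g(xi) >= t & y = 0 |
| >= t: POSITIVE (churn) | 4. Customer churned | correct | TRUE POSITIVE (TP) | g(xi) >= t & y = 1 |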


In [None]:
# people who are going to churn
actual_positive = (y_val == 1)
# people who are not going to churn
actual_negative = (y_val == 0)

In [None]:
t = 0.5
predict_positive = (y_pred >= t)
predict_negative = (y_pred < t)

In [None]:
# We look at the cases where both values predict_positive & actual_positive are true
# This is what the "&" operator is doing --> this is logical AND
predict_positive & actual_positive
# Output: array([False, False, False, ..., False,  True,  True])

tp = (predict_positive & actual_positive).sum()
tp
# Output: 210
tn = (predict_negative & actual_negative).sum()
tn
# Output: 922

fp = (predict_positive & actual_negative).sum()
fp
# Output: 101
fn = (predict_negative & actual_positive).sum()
fn
# Output: 176

That was preparation for understanding the confusion table. The confusion table is a way to put all these values (tp, tn, fp, fn) into a single 2-by-2 table with 4 cells.
- In the columns of this table we have the predictions (NEGATIVE: g(xi) < t and POSITIVE: g(xi) >= t)
- In the rows we have the actual values (NEGATIVE: y = 0 and POSITIVE: y = 1)

Now we want to implement this confusion matrix in NumPy.

In [None]:
confusion_matrix = np.array([
    [tn, fp],
    [fn, tp]
])

confusion_matrix
# Output:
# array([[922, 101],
#        [176, 210]])

We see that we have more false negatives than false positives. False positives are customers who get the email even though they are not going to churn, so we lose some money by giving them the discount. False negatives are customers who don't get the email and leave, so again we lose some money. We want to avoid both situations.
Instead of absolute numbers we can also look at relative numbers.

In [None]:
(confusion_matrix / confusion_matrix.sum()).round(2)

# 4.4 Precision and Recall

Precision and recall are metrics for evaluating binary classification models.
Precision tells us the fraction of positive predictions that turned out to be correct: of all the customers we predicted as churning, how many actually churned.
Precision = true positives / # positive predictions = TP / (TP + FP)
Recall tells us the fraction of positive examples that were correctly identified: of all the customers who actually churned, how many we identified.
Recall = true positives / # positive observations = TP / (TP + FN)

In [None]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
accuracy
# Output: 0.8034066713981547

precision = tp / (tp + fp)
precision
# Output: 0.6752411575562701

# --> promotional email goes to 311 people, but only 210 of them are actually going to churn (--> 33% are mistakes)
tp + fp
# Output: 311

recall = tp / (tp + fn)
recall
# Output: 0.5440414507772021

# --> we failed to identify 46% of the customers who are churning
tp + fn
# Output: 386

When we look at the accuracy we might think the model performs quite well, but the values of precision and recall show that it is not that good for our purpose, which is identifying churning customers. For this purpose accuracy is not the best metric and can be misleading, especially when we have class imbalance as in this churn prediction problem. That's why it's useful to look at metrics like precision and recall.
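
Precision and recall both depend on the threshold, and moving it usually trades one for the other. The sketch below illustrates this on hypothetical toy data (not the churn dataset); `precision_recall` is a helper name introduced here, and the guard for an empty denominator is a design choice of this sketch rather than something from the lesson.

```python
import numpy as np

def precision_recall(y_true, scores, t):
    """Precision and recall for the decision rule `scores >= t`."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(scores) >= t
    tp = int((predicted & (y_true == 1)).sum())
    fp = int((predicted & (y_true == 0)).sum())
    fn = int((~predicted & (y_true == 1)).sum())
    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    return precision, recall

# Hypothetical toy labels and scores
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
s_toy = [0.1, 0.4, 0.35, 0.5, 0.65, 0.9, 0.2, 0.55]

p_low, r_low = precision_recall(y_toy, s_toy, 0.3)     # low threshold
p_high, r_high = precision_recall(y_toy, s_toy, 0.6)   # high threshold
print('t=0.3: precision=%.2f recall=%.2f' % (p_low, r_low))
print('t=0.6: precision=%.2f recall=%.2f' % (p_high, r_high))
```

On this toy data the lower threshold finds every positive example at the cost of some false alarms, while the higher threshold makes no false alarms but misses half of the positives.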

# 4.5 ROC Curve (Receiver Operating Characteristic)
ROC Curves are a way of describing the performance of a binary classification model.

Here we are interested in two numbers, that are computed from the values of the confusion matrix:
- FPR (False Positive Rate) = fraction of false positives among all negative examples = FP / (TN + FP)
    --> we want this as small as possible --> MINIMIZE FPR
- TPR (True Positive Rate) = fraction of true positives among all positive examples = TP / (FN + TP)
    --> we want this as high as possible --> MAXIMIZE TPR

In [None]:
tpr = tp / (tp + fn)
tpr
# Output: 0.5440414507772021
recall
# Output: 0.5440414507772021
# --> tpr = recall

fpr = fp / (fp + tn)
fpr
# Output: 0.09872922776148582

The ROC curve lets us look at these two rates for all possible thresholds.

In [None]:
scores = []
thresholds = np.linspace(0, 1, 101)

for t in thresholds:
    actual_positive = (y_val == 1)
    actual_negative = (y_val == 0)

    predict_positive = (y_pred >= t)
    predict_negative = (y_pred < t)

    tp = (predict_positive & actual_positive).sum()
    tn = (predict_negative & actual_negative).sum()

    fp = (predict_positive & actual_negative).sum()
    fn = (predict_negative & actual_positive).sum()

    scores.append((t, tp, tn, fp, fn))

scores

# Output: 
# [(0.0, 386, 0, 1023, 0),
# (0.01, 385, 110, 913, 1),
# (0.02, 384, 193, 830, 2),
# (0.03, 383, 257, 766, 3),
# (0.04, 381, 308, 715, 5),
# (0.05, 379, 338, 685, 7),
# (0.06, 377, 362, 661, 9),
# (0.07, 372, 382, 641, 14),
# (0.08, 371, 410, 613, 15),
# (0.09, 369, 443, 580, 17),
# (0.1, 366, 467, 556, 20),
# (0.11, 365, 495, 528, 21),
# (0.12, 365, 514, 509, 21),
# (0.13, 360, 546, 477, 26),
# (0.14, 355, 570, 453, 31),
# (0.15, 351, 588, 435, 35),
# (0.16, 347, 604, 419, 39),
# (0.17, 346, 622, 401, 40),
# (0.18, 344, 639, 384, 42),
# (0.19, 338, 654, 369, 48),
# (0.2, 333, 667, 356, 53),
# (0.21, 330, 682, 341, 56),
# (0.22, 323, 701, 322, 63),
# (0.23, 320, 710, 313, 66),
# (0.24, 316, 719, 304, 70),
# ...
# (0.96, 0, 1023, 0, 386),
# (0.97, 0, 1023, 0, 386),
# (0.98, 0, 1023, 0, 386),
# (0.99, 0, 1023, 0, 386),
# (1.0, 0, 1023, 0, 386)]

We end up with 101 confusion matrices evaluated for different thresholds. Let's turn that into a dataframe.

In [None]:
columns = ['threshold', 'tp', 'tn', 'fp', 'fn']
df_scores = pd.DataFrame(scores, columns=columns)
df_scores

In [None]:
# We can look at every tenth record by using the slice [::10]
# It goes from the first record to the last one in steps of 10.
df_scores[::10]

In [None]:
df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)
df_scores[::10]

In [None]:
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR')
plt.legend()

## Random model

In [None]:
np.random.seed(1)
y_rand = np.random.uniform(0, 1, size=len(y_val))
y_rand.round(3)
# Output: array([0.417, 0.72 , 0.   , ..., 0.774, 0.334, 0.089])

In [None]:
# Accuracy for our random model is around 50%
((y_rand >= 0.5) == y_val).mean()

Let's put the previously used code into a function.

In [None]:
def tpr_fpr_dataframe(y_val, y_pred):
    scores = []
    thresholds = np.linspace(0, 1, 101)

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predict_positive = (y_pred >= t)
        predict_negative = (y_pred < t)

        tp = (predict_positive & actual_positive).sum()
        tn = (predict_negative & actual_negative).sum()

        fp = (predict_positive & actual_negative).sum()
        fn = (predict_negative & actual_positive).sum()

        scores.append((t, tp, tn, fp, fn))

    columns = ['threshold', 'tp', 'tn', 'fp', 'fn']
    df_scores = pd.DataFrame(scores, columns=columns)

    df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
    df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)

    return df_scores

In [None]:
df_rand = tpr_fpr_dataframe(y_val, y_rand)
df_rand[::10]

In [None]:
plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR')
plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR')
plt.legend()

Let's look at one example threshold. The x-axis shows our thresholds, so we take t = 0.6. For this value both TPR and FPR are about 0.4.
The reason is that this model is essentially flipping a biased coin: regardless of the customer, it predicts non-churning with probability 60% and churning with probability 40%. So it flags about 40% of the churning customers (TPR = 0.4) and, equally, about 40% of the non-churning customers (FPR = 0.4). In other words, the model is incorrect for 40% of the non-churning customers.
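
The claim that a random model has TPR and FPR both close to 1 - t can be checked with a quick simulation. This is a sketch on synthetic data: the sample size, the seed, and the 27% positive rate are assumptions chosen here to roughly mirror the churn data, and the `_sim` names are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)          # seed chosen arbitrarily
n = 100_000
y_sim = (rng.uniform(0, 1, size=n) < 0.27).astype(int)   # roughly 27% positives
s_sim = rng.uniform(0, 1, size=n)                        # scores ignore the label

t = 0.6
pred = s_sim >= t
tpr_sim = (pred & (y_sim == 1)).sum() / (y_sim == 1).sum()
fpr_sim = (pred & (y_sim == 0)).sum() / (y_sim == 0).sum()
print('TPR=%.3f FPR=%.3f expected=%.3f' % (tpr_sim, fpr_sim, 1 - t))
```

Because the scores are independent of the labels, both rates land near 0.4 for t = 0.6, which is why the two lines in the plot above nearly coincide.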

## Ideal model
Now we want to talk about the ideal model that outputs the correct prediction for everyone. Let's implement that. First we need to know the number of negative examples (number of people who are not churning).

In [None]:
num_neg = (y_val == 0).sum()
num_pos = (y_val == 1).sum()
num_neg, num_pos

# Output: (1023, 386)

In [None]:
# We create y_ideal (as validation set) that first contains only negative observations and then it contains only positive observations.
# For that we use the np.repeat() function, so we want to create an array that first contains zeros and then ones.
# In our case here it should contain 1023 zeros and then 386 ones.
y_ideal = np.repeat([0, 1], [num_neg, num_pos])
y_ideal

# Output: array([0, 0, 0, ..., 1, 1, 1])

In [None]:
# Now we need to create our predictions that are just numbers between 0 and 1.
y_ideal_pred = np.linspace(0, 1, len(y_ideal))
y_ideal_pred

# Output: 
# array([0.00000000e+00, 7.10227273e-04, 1.42045455e-03, ...,
#       9.98579545e-01, 9.99289773e-01, 1.00000000e+00])

In [None]:
1 - y_val.mean()
# Output: 0.7260468417317246

In [None]:
accuracy_ideal = ((y_ideal_pred >= 0.726) == y_ideal).mean()
accuracy_ideal

# Output: 1.0

This model usually doesn't exist in reality, but it helps us benchmark the model that we have.

In [None]:
df_ideal = tpr_fpr_dataframe(y_ideal, y_ideal_pred)
df_ideal[::10]

In [None]:
plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR')
plt.legend()

What we see here is that TPR stays at 1 until the threshold reaches about 0.726 and only then starts to go down. So up to this threshold the model identifies all churning customers correctly. FPR, on the other hand, is above 0 for thresholds below 0.726, which means some non-churning customers are predicted as churning. From about 0.726 onwards FPR is 0 and every positive prediction is correct.
Take the threshold 0.4 as an example. The FPR is around 45%, so the model makes some mistakes. About 32.6% of all customers (0.726 - 0.4) are non-churning customers whose scores lie between 0.4 and 0.726; we predict them as churning even though they are not. Relative to the non-churning customers that is 0.326 / 0.726, or about 45%.
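
The numbers for the threshold 0.4 can be verified directly. The sketch below rebuilds the ideal labels and scores from the class counts reported earlier (1023 negatives, 386 positives) so that it runs on its own; `fpr_04` and `tpr_04` are names introduced here.

```python
import numpy as np

num_neg, num_pos = 1023, 386             # class counts from the validation set
y_ideal = np.repeat([0, 1], [num_neg, num_pos])
y_ideal_pred = np.linspace(0, 1, len(y_ideal))

t = 0.4
pred = y_ideal_pred >= t
fp = int((pred & (y_ideal == 0)).sum())  # non-churning customers flagged as churning
tp = int((pred & (y_ideal == 1)).sum())  # churning customers flagged as churning
fpr_04 = fp / num_neg
tpr_04 = tp / num_pos
print('FP=%d FPR=%.3f TPR=%.3f' % (fp, fpr_04, tpr_04))
```

At this threshold every churning customer is still caught, while a bit under half of the non-churning customers are flagged by mistake.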

## Putting everything together

Now let's plot all the models together so we can compare our model against the benchmarks.

In [None]:
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR')

#plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR')
#plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR')

plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR', color = 'black')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR', color = 'black')

plt.legend()

We see that our TPR is far from that of the ideal model; we want it as close as possible to 1.
We also see that our FPR is far from that of the ideal model.
Plotting against the threshold is not always intuitive, because the models have different best thresholds: for our model the best threshold is 0.5 (at least in terms of accuracy), while for the ideal model it is 0.726.
What we can do instead is plot TPR against FPR.
On the x-axis we'll have FPR and on the y-axis we'll have TPR.
To make it easier to understand, we can also add the benchmarks.

In [None]:
plt.figure(figsize=(5,5))

plt.plot(df_scores.fpr, df_scores.tpr, label='model')
plt.plot([0,1], [0,1], label='random')
#plt.plot(df_rand.fpr, df_rand.tpr, label='random')
#plt.plot(df_ideal.fpr, df_ideal.tpr, label='ideal')

plt.xlabel('FPR')
plt.ylabel('TPR')

plt.legend()

On the curve of the ideal model there is one important point: the so-called north star (the ideal spot) in the upper left corner, where TPR is 100% and FPR is 0%. This is the point we want to reach with our model. This is what a ROC curve looks like: we plot TPR against FPR and usually add the random baseline. We want our model's curve to be as close as possible to the ideal spot, which at the same time means as far as possible from the random baseline. If our model is close to the random baseline, it's not a good model.

In [None]:
# We can also use the ROC functionality of scikit learn package
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_val, y_pred)

plt.figure(figsize=(5,5))

plt.plot(fpr, tpr, label='Model')
plt.plot([0,1], [0,1], label='Random', linestyle='--')

plt.xlabel('FPR')
plt.ylabel('TPR')

plt.legend()

### What kind of information do we get from the ROC curve
Let's start in the lower left corner, where both TPR and FPR are 0. This happens at high thresholds like 1.0. Here we predict that every customer is non-churning, so TPR is 0 because we don't predict anyone as churning. FPR is 0 as well, because there are no false positives, only true negatives.
The curve starts in the lower left corner at threshold 1.0 and ends in the upper right corner at threshold 0.0. There our model has 100% TPR because we predict everyone as churning, so we identify all churning customers. But we also make a lot of mistakes: we incorrectly flag all the non-churning ones. That's why TPR = FPR = 100% there.
As we lower the threshold, we predict more customers as churning. That means TPR increases, but FPR increases at the same time.
Using the ROC curve we can see how the model behaves at different thresholds. Each point on the ROC curve is the TPR and FPR evaluated at a particular threshold. After plotting we see how far the curve is from the ideal spot and how far it is from the random baseline. It can also be used to compare different models, because it's easy to see which one is better (closer to the ideal spot is better, closer to the random baseline is worse).
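
The walk along the curve described above can be traced on a tiny example. The following sketch uses hypothetical toy labels and scores and a hand-picked list of thresholds; `roc_points` is a helper name introduced here and is not a scikit-learn function.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return (FPR, TPR) pairs, one per threshold, for the rule `scores >= t`."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    points = []
    for t in thresholds:
        pred = scores >= t
        tpr = (pred & (y_true == 1)).sum() / pos
        fpr = (pred & (y_true == 0)).sum() / neg
        points.append((float(fpr), float(tpr)))
    return points

# Hypothetical toy data
y_toy = [0, 0, 1, 0, 1, 1]
s_toy = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Sweep from a threshold above every score down to 0
points = roc_points(y_toy, s_toy, [1.0, 0.65, 0.5, 0.35, 0.0])
for fpr, tpr in points:
    print('FPR=%.2f TPR=%.2f' % (fpr, tpr))
```

The first point is the lower left corner (nobody predicted as positive), the last point is the upper right corner (everybody predicted as positive), and neither rate ever decreases as the threshold is lowered.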

There is a very useful metric derived from the ROC curve: AUC, the area under the curve.

# 4.6 ROC AUC
- Area under the ROC curve - useful metric
- Interpretation of AUC 

One way of quantifying how close we are to the ideal point is measuring the area under the ROC curve.
AUC = 0.5 for the random baseline and AUC = 1.0 for the ideal curve. That means our model should have an AUC somewhere between 0.5 and 1.0.
When AUC < 0.5, we've most likely made a mistake somewhere (for example, swapped the positive and negative labels). As a rough guide, AUC = 0.8 is good and 0.9 is great, but 0.6 is poor.
We can calculate AUC with the help of the scikit-learn package.

In [None]:
# auc is not specifically for roc curves, this is for any curve.
# It can calculate area under any curve.

from sklearn.metrics import auc
# auc needs values for x-axis and y-axis
auc(fpr, tpr)
# Output: 0.843850505725819

In [None]:
auc(df_scores.fpr, df_scores.tpr)
# Output: 0.8438732975754537

In [None]:
auc(df_ideal.fpr, df_ideal.tpr)
# Output: 0.9999430203759136
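
To see where the benchmark values 0.5 and 1.0 come from, the area can be computed by hand with the trapezoidal rule. This is a minimal sketch: `trapezoid_area` is a helper written for this example (it assumes the x values are sorted in ascending order), not the scikit-learn implementation.

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through (x, y), x sorted ascending."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

area_random = trapezoid_area([0, 1], [0, 1])        # diagonal baseline
area_ideal = trapezoid_area([0, 0, 1], [0, 1, 1])   # curve through the top-left corner
print(area_random, area_ideal)  # 0.5 1.0
```

The diagonal encloses half of the unit square, and a curve that passes through the upper left corner encloses all of it.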

In [None]:
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
auc(fpr, tpr)

# Output: 0.843850505725819

In [None]:
# There is a shortcut in scikit learn package
from sklearn.metrics import roc_auc_score

roc_auc_score(y_val, y_pred)
# Output: 0.843850505725819

## AUC Interpretation

AUC tells us the probability that a randomly selected positive example has a higher score than a randomly selected negative example.

In [None]:
neg = y_pred[y_val == 0]
pos = y_pred[y_val == 1]

In [None]:
import random
pos_ind = random.randint(0, len(pos) -1)
neg_ind = random.randint(0, len(neg) -1)

In [None]:
# We want to compare the score for this positive example with the score of the negative example
pos[pos_ind] > neg[neg_ind]
# Output: True

# So in this case this is true.

In [None]:
# We can do this 100,000 times and look at the performance
n = 100000
success = 0

for i in range(n):
    pos_ind = random.randint(0, len(pos) -1)
    neg_ind = random.randint(0, len(neg) -1)

    if pos[pos_ind] > neg[neg_ind]:
        success += 1

success / n

# Output: 0.84389
# That result is quite close to roc_auc_score(y_val, y_pred) = 0.843850505725819

Instead of writing the loop manually, we can vectorize the sampling with NumPy.

In [None]:
# Be careful np.random.randint(low, high, size, dtype) low is inclusive and high is exclusive
n = 50000

np.random.seed(1)
pos_ind = np.random.randint(0, len(pos), size=n)
neg_ind = np.random.randint(0, len(neg), size=n)
pos[pos_ind] > neg[neg_ind]
# Output: array([False,  True,  True, ...,  True,  True,  True])

(pos[pos_ind] > neg[neg_ind]).mean()
# Output: 0.84646


Because of this interpretation, AUC is quite popular as a way of measuring the performance of binary classification models.
It's quite intuitive, and we can use it to see how well our model orders positive and negative examples and how well it separates positive examples from negative examples.
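
Instead of sampling pairs at random, the same probability can be computed exactly by comparing every positive score with every negative score. The sketch below does this with NumPy broadcasting on hypothetical toy scores; `pairwise_auc` is a name introduced here, and counting ties as half a success is the usual convention assumed for this example.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos_scores = np.asarray(pos_scores, dtype=float)[:, None]   # column vector
    neg_scores = np.asarray(neg_scores, dtype=float)[None, :]   # row vector
    wins = (pos_scores > neg_scores).sum()
    ties = (pos_scores == neg_scores).sum()
    return float((wins + 0.5 * ties) / (pos_scores.size * neg_scores.size))

# Hypothetical toy scores
pos_toy = [0.4, 0.7, 0.9]
neg_toy = [0.1, 0.3, 0.6]
auc_toy = pairwise_auc(pos_toy, neg_toy)
print(round(auc_toy, 3))  # 0.889
```

Here 8 of the 9 possible pairs are ordered correctly. The comparison matrix has one cell per pair, so this approach is only practical when the number of positives times the number of negatives fits in memory.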

# 4.7 Cross-Validation
- Evaluating the same model on different subsets of data
- Getting the average prediction and the spread within predictions

In this lesson we'll talk about parameter tuning, the process of selecting the best parameter.
What we usually do is split our entire dataset into three parts (train, validation, test). We use the validation dataset to find the best parameters for training the model g(xi).
For now we forget about the test set and work with our full train dataset (train + validation).
Now we split this data into k parts. Let's say k = 3.

            FULL TRAIN
            1    2    3

We can take datasets 1 and 2, train our model on these two datasets, and validate on dataset 3. Then we compute the AUC on the validation data (3).

             TRAIN   VAL
            1    3    2

Next step is to train another model based on 1 and 3 and validate this model on dataset 2. Again compute the AUC on validation data (2)

             TRAIN   VAL
            2    3    1
            
Next step is to train another model based on 2 and 3 and validate this model on dataset 1. Again compute the AUC on validation data (1)

Then we get three AUC values. We can compute the mean and standard deviation of these values. The standard deviation shows how stable the model is, that is, how much the scores differ across folds.

K-Fold Cross-Validation is a way of evaluating the same model on different subsets of our dataset.
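
The splitting scheme described above can be sketched in a few lines of NumPy. This is an illustration of the idea only: `kfold_indices` is a hypothetical helper, and it will not reproduce the exact splits of scikit-learn's KFold.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=1):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)      # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, k=3))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each of the k iterations holds out a different part for validation and trains on the rest, so every record is used for validation exactly once.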

In [None]:
def train(df_train, y_train):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    return dv, model

In [None]:
dv, model = train(df_train, y_train)

In [None]:
# Note: in the video (at 5:22) Alexey uses df_train here by mistake; predict must use the dataframe passed in as df
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    # Use transform (not fit_transform): the vectorizer was already fitted on the training data
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [None]:
y_pred = predict(df_val, dv, model)
y_pred

# Output: array([0.00899722, 0.20451861, 0.2122173 , ..., 0.13639118, 0.79976555, 0.83740295])

Now we have the train and predict functions. Let's implement k-fold cross-validation.

In [113]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  

In [117]:
kfold.split(df_full_train)
# Output: <generator object _BaseKFold.split at 0x2838baf20>

train_idx, val_idx = next(kfold.split(df_full_train))
len(train_idx), len(val_idx)
# Output: (5070, 564)

len(df_full_train)
# Output: 5634


In [118]:
# We can use iloc to select a part of this dataframe
df_train = df_full_train.iloc[train_idx]
df_val = df_full_train.iloc[val_idx]

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
1814,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.70,258.35,0
5946,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.90,3160.55,1
3881,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
2389,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
611,4765-oxppd,female,0,yes,yes,9,yes,no,dsl,yes,...,yes,yes,no,no,month-to-month,no,mailed_check,65.00,663.05,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2763,2250-ivbwa,male,0,yes,yes,64,yes,no,fiber_optic,yes,...,no,no,no,no,month-to-month,no,electronic_check,81.05,5135.35,0
5192,3507-gasnp,male,0,no,yes,60,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.95,1189.90,0
3980,8868-wozgu,male,0,no,no,28,yes,yes,fiber_optic,no,...,yes,no,yes,yes,month-to-month,yes,electronic_check,105.70,2979.50,1
235,1251-krreg,male,0,no,no,2,yes,yes,dsl,no,...,no,no,no,no,month-to-month,yes,mailed_check,54.40,114.10,1


In [123]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [124]:
from sklearn.model_selection import KFold
!pip install tqdm
from tqdm.auto import tqdm

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  
scores = []

for train_idx, val_idx in tqdm(kfold.split(df_full_train)):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)


In [128]:
scores
# Output: 
# [0.8479398247539081,
# 0.8410581683168317,
# 0.8557214756739697,
# 0.8333552794008724,
# 0.8262717121588089,
# 0.8342657342657342,
# 0.8412569195701727,
# 0.8186669829222013,
# 0.8452349192233585,
# 0.8621054754462034]

print('%.3f +- %.3f' % (np.mean(scores), np.std(scores)))
# Output: 0.841 +- 0.012



We talked about parameter tuning. Our LogisticRegression model has a parameter C, which is the inverse of the regularization strength we talked about.
The default value is 1.0. We can add that parameter to our train function. If C is very small, the regularization is strong.
There is also an annoying convergence warning, which we can fix by setting max_iter to 1000.

In [138]:
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

In [133]:
dv, model = train(df_train, y_train, C=0.001)

In [144]:
# We can iterate over different values for C
# Cannot use 0.0 as C because of
# InvalidParameterError: The 'C' parameter of LogisticRegression must be a float in the range (0.0, inf]. Got 0.0 instead.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  

for C in [0.001, 0.01, 0.1, 0.5, 1, 5, 10]:
    
    scores = []

    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

# Output:
# C=0.001 0.826 +- 0.012
# C=0.01 0.840 +- 0.012
# C=0.1 0.841 +- 0.011
# C=0.5 0.841 +- 0.011
# C=1 0.840 +- 0.012
# C=5 0.841 +- 0.012
# C=10 0.841 +- 0.012



In [137]:
# The same loop with a tqdm progress bar over the values of C
# (requires the tqdm package to be installed)

from sklearn.model_selection import KFold
from tqdm.auto import tqdm

n_splits = 5

for C in tqdm([0.001, 0.01, 0.1, 0.5, 1, 5, 10]):   
    scores = []

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)  

    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))


After that we want to train our final model on the full train dataset and evaluate it on the test dataset.

In [145]:
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
auc
# Output: 0.8572386167896259


We can see that the AUC is a little bit better than what we saw in k-fold cross-validation, but not by much.
That shouldn't be a big surprise as long as the difference is small.

When should I use cross-validation and when should I use the usual hold-out validation? Most of the time a hold-out dataset works fine, especially when the dataset is quite large. If you have a smaller dataset, or you also want the standard deviation to understand how stable your model is and how much it varies across folds, then you can use cross-validation. For bigger datasets the number of splits could be two or three; for smaller datasets it could be 10 or so.

# 4.8 Summary
- Metric - a single number that describes the performance of a model
- Accuracy - fraction of correct answers; sometimes misleading
- Confusion table - a way to describe different types of errors and correct decisions and arrange them visually in a table
- Precision and recall are less misleading when we have class imbalance
- ROC Curve - a way to evaluate the performance at all thresholds; okay to use with imbalance
- AUC - in essence the AUC tells us how well our model separates positive and negative classes
- K-Fold CV - more reliable estimate for performance (mean + std)



# 4.9 Explore more
- Check the precision and recall of the dummy classifier that always predicts "FALSE"
- F1 score = 2 * P * R / (P + R)
- Evaluate precision and recall at different thresholds, plot P vs. R - this way you'll get the precision/recall curve (similar to ROC curve)
- Area under the PR curve is also a useful metric
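
As a starting point for the F1 item above, here is a small worked example. It is a sketch that plugs the counts from the confusion matrix in section 4.3 (threshold 0.5) into the formula from the list; nothing else is assumed.

```python
# Counts taken from the confusion matrix in section 4.3 (threshold 0.5)
tp, fp, fn = 210, 101, 176

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('precision=%.3f recall=%.3f f1=%.3f' % (precision, recall, f1))
# precision=0.675 recall=0.544 f1=0.603
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.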

Other projects:
- Calculate the metrics for datasets from the previous week  