# Evaluating Machine Learning Models with Confusion and Weighted Error Matrices

Confusion matrices are often used to evaluate the performance of machine learning models. True positives and true negatives count the cases the model classified correctly, while false positives and false negatives count the mistakes it made.

In most real scenarios, however, different types of errors have different costs. For example, when a model evaluates whether a patient has a disease, an incorrect prediction is usually better as a false positive than as a false negative: it is better to have a healthy patient receive treatment than to leave a sick patient without any help.

To account for the cost of each type of error in a confusion matrix, we can combine it with a weighted error matrix. Each cell of the error matrix holds the weight assigned to the corresponding cell of the confusion matrix, so each kind of error can be given its own cost.
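
As a minimal sketch of this idea, the snippet below weighs a made-up confusion matrix with a made-up error matrix. The counts and weights are hypothetical and chosen only for illustration; the layout assumes rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes
#                     pred 0  pred 1
confusion = np.array([[50, 10],   # true 0: 50 TN, 10 FP
                      [ 4, 36]])  # true 1:  4 FN, 36 TP

# Hypothetical weights: correct predictions cost 0, a FP costs 1, a FN costs 5
weights = np.array([[0, 1],
                    [5, 0]])

# Multiply cell by cell, then add everything up
total_cost = int(np.sum(confusion * weights))
print(total_cost)  # 10 * 1 + 4 * 5 = 30
```

With these numbers, the 4 false negatives contribute twice as much to the cost as the 10 false positives.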

In the next code cell we load a dataset with weather information for Australia. The objective for the models is to predict whether it will rain on the next day. Note that this is a modified version of the original "weatherAUS" dataset: it contains no information about wind conditions, city names or dates. Because the columns carry different kinds of information (temperature, humidity) with different value ranges, we standardize the data before using it to train models.

In [26]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# read file and transform 'No' and 'Yes' values to 0 and 1, so that we can use them in classifiers
with open('../data/raw/weatherAUS_modified.csv') as csv_file:
    # skip the header row
    csv_file.readline()
    dataset = [list(map(float, x.replace('\n', '').replace('No', '0').replace('Yes', '1').split(','))) for x in csv_file]

dataset = np.array(dataset)
x = dataset[:,:-1]
y = dataset[:,-1]

# center mean and set unit standard deviation
scaler = StandardScaler(with_mean=True, with_std=True)
x = scaler.fit_transform(x)
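
To illustrate what the scaling step computes, the sketch below standardizes a tiny made-up feature matrix by hand with NumPy. The values are hypothetical; the computation (column mean subtracted, then division by the population standard deviation) is meant to mirror what StandardScaler does with its default settings.

```python
import numpy as np

# Toy feature matrix: column 0 resembles temperatures, column 1 humidity percentages
features = np.array([[10.0, 40.0],
                     [20.0, 60.0],
                     [30.0, 80.0]])

# Subtract each column's mean and divide by its population standard deviation (ddof=0)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

After this step both columns live on the same scale, regardless of their original units.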

Besides evaluating a single model, it is also useful to compare its performance with that of other models, and the weighted error is even more useful for that. In this simple testbed, we train a simple model, Naive Bayes, and a more complex one, Random Forest. We use stratified cross-validation, which keeps the class proportions roughly equal across folds, so that we can evaluate the models more reliably while mitigating random effects.
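
The effect of stratification can be sketched on toy data. The labels below are made up (80 negatives, 20 positives), and the feature array is only a placeholder; the snippet checks the share of positives in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 80 negatives and 20 positives (20% positive)
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((len(labels), 1))  # placeholder; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positive samples in the test part of every fold
positive_rates = [labels[test_idx].mean() for _, test_idx in skf.split(features, labels)]
print(positive_rates)  # every fold keeps the 20% positive rate
```

Because both class counts divide evenly by the number of folds here, each test fold contains exactly 16 negatives and 4 positives.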

After training a model and predicting with it, we take the confusion matrix of the predictions, multiply each cell by the corresponding weight in the error matrix, and sum the results. The function reads the weights from the global variable error_cost, which is defined in a later cell. In this manner, we can attribute different weights to false positives and false negatives.

In [27]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# train and test a given model using stratified cross-validation with 10 folds
def test_model(model, _x, _y, n_splits=10):
    skf_10 = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    
    # lists for each split model cost and number of FPs and FNs
    cost_list = []
    fp_list = []
    fn_list = []
    for train_ind, test_ind in skf_10.split(_x, _y):
        # train and test the model on this cross-validation split,
        # indexing the function arguments rather than the global x and y
        x_train, x_test = _x[train_ind], _x[test_ind]
        y_train, y_test = _y[train_ind], _y[test_ind]
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)

        # get the confusion matrix for the model in this split
        cm = confusion_matrix(y_test, y_pred)
        
        # calculate cost as the sum of the element-wise product of the confusion matrix
        # and the error weight matrix (error_cost is a global defined before the call)
        cost = np.sum(cm * error_cost)
        cost_list.append(cost)
        fn_list.append(cm[1, 0])
        fp_list.append(cm[0, 1])
        
    return cost_list, fn_list, fp_list
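
The indices used above rely on how scikit-learn lays out a binary confusion matrix: rows are true classes and columns are predicted classes, so cm[0, 1] holds the false positives and cm[1, 0] the false negatives. A small sketch with made-up labels shows the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# For binary labels {0, 1}: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Flattening row by row yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 0 2 1 1
```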

For this weather prediction scenario, we follow the field standard for weighing errors. It is worse to predict no rain when it will rain, because people need to prepare for rain. In the opposite case, when rain is predicted but does not come, people who prepared for it can usually carry on with their plans regardless.

Therefore, we assign a weight of 1 to false positives (predicting rain for the next day when it stays dry) and a weight of 5 to false negatives. With this error matrix, even an overall better model like Random Forest can be surpassed by the simpler Naive Bayes. As the output below shows, Random Forest makes fewer mistakes in total (11,754 against 15,289) but commits more false negatives (8,572 against 6,725). Because false negatives are the more costly error in this scenario, their weight gives Random Forest the worse performance: Naive Bayes has the lower average cost per fold (4,218.9 against 4,604.2).

In [28]:
# error weight matrix with cost 1 for false positives, 5 for false negatives and 0 for TP/TN
error_cost = np.array([[0, 1], [5, 0]])

# instantiate and test model with default hyperparameters
gnb = GaussianNB()
nb_list = test_model(gnb, x, y)
print('Naive Bayes\' average cost: %f' % np.mean(nb_list[0]))
print('Number of false negatives with Naive Bayes: %d' % sum(nb_list[1]))
print('Number of false positives with Naive Bayes: %d' % sum(nb_list[2]))

# instantiate and test model with default hyperparameters
rf = RandomForestClassifier(random_state=0)
rf_list = test_model(rf, x, y)
print('Random Forest\'s average cost: %f' % np.mean(rf_list[0]))
print('Number of false negatives with Random Forest: %d' % sum(rf_list[1]))
print('Number of false positives with Random Forest: %d' % sum(rf_list[2]))

Naive Bayes' average cost: 4218.900000
Number of false negatives with Naive Bayes: 6725
Number of false positives with Naive Bayes: 8564
Random Forest's average cost: 4604.200000
Number of false negatives with Random Forest: 8572
Number of false positives with Random Forest: 3182
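
The reported averages can be checked by hand from the error counts above. The sketch below only redoes the arithmetic, with the weights 1 and 5 and the 10 folds used in this notebook; the variable names are illustrative.

```python
N_FOLDS = 10
FP_WEIGHT, FN_WEIGHT = 1, 5

# (false negatives, false positives) summed over all folds, copied from the output above
reported = {
    "Naive Bayes": (6725, 8564),
    "Random Forest": (8572, 3182),
}

average_costs = {}
total_errors = {}
for name, (fn, fp) in reported.items():
    total_errors[name] = fn + fp
    # weighted cost summed over all folds, divided by the number of folds
    average_costs[name] = (FP_WEIGHT * fp + FN_WEIGHT * fn) / N_FOLDS
    print(name, total_errors[name], average_costs[name])
```

This gives 15289 errors and an average cost of 4218.9 for Naive Bayes, against 11754 errors and 4604.2 for Random Forest, which matches the printed output: fewer errors overall, yet a higher weighted cost.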
