Cropley and Marrone report accuracy and confusion matrices, but no measures that account for class imbalance. Here I reverse-engineer some of those measures from the confusion matrices they published.
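
For reference, the per-class measures computed in the code that follows can be written out as below. This is a sketch of the standard definitions, assuming $C_{ij}$ counts the cases whose actual class is $i$ and whose predicted class is $j$ (rows as actual classes is my reading of the paper's tables, and it matches the `actual_class` index used in the code); $K$ is the number of classes.

```latex
P_k = \frac{C_{kk}}{\sum_{i} C_{ik}}, \qquad
R_k = \frac{C_{kk}}{\sum_{j} C_{kj}}, \qquad
F_{1,k} = \frac{2\,P_k R_k}{P_k + R_k}

P_{\text{micro}} = \frac{\sum_{k} C_{kk}}{\sum_{k}\sum_{i} C_{ik}}, \qquad
P_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K} P_k
```

Micro and macro recall follow the same pattern with row sums in place of column sums. In a single-label problem like this one, every case lands in exactly one cell, so micro precision, micro recall, and accuracy all coincide.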

In [44]:
import pandas as pd
import re
#from sklearn.metrics import precision_score, recall_score, f1_score

def calculate_class_metrics(confusion_matrix: pd.DataFrame):
    metrics = {}
    for class_name in confusion_matrix.columns:
        true_positives = confusion_matrix.loc[class_name, class_name]
        false_positives = confusion_matrix[class_name].sum() - true_positives
        false_negatives = confusion_matrix.loc[class_name].sum() - true_positives
        
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        
        f1 = 2 * (precision * recall) / (precision + recall)
        
        metrics[class_name] = {"precision": precision, "recall": recall, "F1": f1}

    return pd.DataFrame(metrics).T

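def raw_to_df(raw):
    """Parse a whitespace-delimited confusion matrix (rows = actual class) into a DataFrame."""
    rows = []
    for line in raw.split('\n'):
        # Turn the whitespace before each count into a comma, leaving multi-word labels intact
        line = re.sub(r'\s(\d)', r',\1', line.strip())
        rows.append(line.split(','))
    df = pd.DataFrame(rows)
    # Column labels (predicted class) reuse the row labels, in the same order
    df.columns = ['actual_class'] + df[0].tolist()
    df = df.set_index('actual_class')
    df = df.astype(int)
    return df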


def calculate_metrics(confusion_matrix: pd.DataFrame):
    metrics = {}
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    for class_name in confusion_matrix.columns:
        true_positives = confusion_matrix.loc[class_name, class_name]
        false_positives = confusion_matrix[class_name].sum() - true_positives
        false_negatives = confusion_matrix.loc[class_name].sum() - true_positives
        
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        
        f1 = 2 * (precision * recall) / (precision + recall)

        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
        
        metrics[class_name] = {"precision": precision, "recall": recall, "F1": f1}

    metrics_df = pd.DataFrame(metrics).T

    micro_precision = total_true_positives / (total_true_positives + total_false_positives)
    micro_recall = total_true_positives / (total_true_positives + total_false_negatives)
    micro_f1 = 2 * (micro_precision * micro_recall) / (micro_precision + micro_recall)

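    macro_precision = metrics_df["precision"].mean()
    macro_recall = metrics_df["recall"].mean()
    # Note: this macro F1 is the harmonic mean of macro precision and macro recall.
    # scikit-learn's f1_score(average="macro") instead averages the per-class F1 scores.
    macro_f1 = 2 * (macro_precision * macro_recall) / (macro_precision + macro_recall)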

    overall_metrics = pd.DataFrame({
        "micro": {"precision": micro_precision, "recall": micro_recall, "F1": micro_f1},
        "macro": {"precision": macro_precision, "recall": macro_recall, "F1": macro_f1},
    }).T

    return metrics_df, overall_metrics


One interesting, somewhat puzzling choice is that the authors narrowed the bins, dropping data at the cusp between classes. Perhaps they were imagining a use case where the test is scored with easily differentiated boundaries aligned with the test's banded norms, but the effect was to throw out the 'hard' data at the edges of the bands. Thankfully, their Model 5 ran the same exercise without dropping data, which makes it more sound and easier to evaluate. I'll start there.

In [62]:
model5 = '''A 9 0 0 0 0 0 0
B 1 10 1 0 0 0 0
C 0 0 19 1 0 0 0
D 0 0 3 6 1 0 0
E 0 0 0 0 7 1 0
F 0 0 0 1 0 3 0
G 0 0 0 0 0 0 2'''
df = raw_to_df(model5)
print("Overall test set size", df.values.sum())
print("Accuracy", df.values.diagonal().sum() / df.values.sum())
class_metrics, overall_metrics = calculate_metrics(df)
display(class_metrics)
display(overall_metrics)

Overall test set size 65
Accuracy 0.8615384615384616


   precision    recall        F1
A   0.900000  1.000000  0.947368
B   1.000000  0.833333  0.909091
C   0.826087  0.950000  0.883721
D   0.750000  0.600000  0.666667
E   0.875000  0.875000  0.875000
F   0.750000  0.750000  0.750000
G   1.000000  1.000000  1.000000


       precision    recall        F1
micro   0.861538  0.861538  0.861538
macro   0.871584  0.858333  0.864908
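
The macro F1 above is the harmonic mean of macro precision and macro recall. Another common convention (the one scikit-learn uses for `average="macro"`) is the unweighted mean of the per-class F1 scores. The sketch below computes both for the Model 5 matrix in plain Python, so the two can be compared; the helper names `per_class_prf` and `macro_f1_variants` are mine, not from the paper or the notebook.

```python
# Model 5 confusion matrix as entered above (rows = actual class, columns = predicted class).
MODEL5 = [
    [9, 0, 0, 0, 0, 0, 0],
    [1, 10, 1, 0, 0, 0, 0],
    [0, 0, 19, 1, 0, 0, 0],
    [0, 0, 3, 6, 1, 0, 0],
    [0, 0, 0, 0, 7, 1, 0],
    [0, 0, 0, 1, 0, 3, 0],
    [0, 0, 0, 0, 0, 0, 2],
]


def per_class_prf(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum: TP + FP
        actual_k = sum(matrix[k])                          # row sum: TP + FN
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_f1_variants(matrix):
    """Return (harmonic mean of macro P and macro R, mean of per-class F1)."""
    prf = per_class_prf(matrix)
    n = len(prf)
    macro_p = sum(p for p, _, _ in prf) / n
    macro_r = sum(r for _, r, _ in prf) / n
    harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)
    mean_f1 = sum(f for _, _, f in prf) / n
    return harmonic, mean_f1


harmonic, mean_f1 = macro_f1_variants(MODEL5)
print(f"harmonic mean of macro P/R: {harmonic:.6f}")
print(f"mean of per-class F1:       {mean_f1:.6f}")
```

For this matrix the two conventions differ only in the third decimal place, so the choice does not change the picture here, but it is worth stating which one is reported.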


As the output shows, the test set size is 65 here, which approximately matches their reported 15% of 414.

In [56]:
414*.15

62.099999999999994

Model 3: here they split the continuous variable into 5 bins. About a quarter of the data was dropped (48 test cases remain, against 65 for Model 5).

In [63]:
model3 = '''Very Low 9 0 0 0 0
Low 0 8 1 0 0
Medium 0 0 14 0 1
High 0 0 0 4 0
Very High 0 0 1 0 10'''
df = raw_to_df(model3)
print("Overall test set size", df.sum().sum())
class_metrics, overall_metrics = calculate_metrics(df)
display(class_metrics)
display(overall_metrics)

Overall test set size 48


           precision    recall        F1
Very Low    1.000000  1.000000  1.000000
Low         1.000000  0.888889  0.941176
Medium      0.875000  0.933333  0.903226
High        1.000000  1.000000  1.000000
Very High   0.909091  0.909091  0.909091


       precision    recall        F1
micro   0.937500  0.937500  0.937500
macro   0.956818  0.946263  0.951511


Model 4: here the continuous variable is binned into 7 classes. As seen below, about half of the test data (32 of 65 cases) was removed by the narrowed bins, and these are the harder-to-classify cases near the bin boundaries.

In [64]:
model4 = '''A 8 1 0 0 0 0 0
B 0 5 0 0 0 0 0
C 0 0 5 0 0 0 0
D 0 0 0 5 0 0 0
E 0 0 0 0 3 0 0
F 0 0 0 0 1 3 0
G 0 0 0 0 0 0 2'''
df = raw_to_df(model4)
print("Overall test set size", df.sum().sum())
class_metrics, overall_metrics = calculate_metrics(df)
display(class_metrics)
display(overall_metrics)

Overall test set size 33


   precision    recall        F1
A   1.000000  0.888889  0.941176
B   0.833333  1.000000  0.909091
C   1.000000  1.000000  1.000000
D   1.000000  1.000000  1.000000
E   0.750000  1.000000  0.857143
F   1.000000  0.750000  0.857143
G   1.000000  1.000000  1.000000


       precision    recall        F1
micro   0.939394  0.939394  0.939394
macro   0.940476  0.948413  0.944428
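
To put numbers on how much data the narrowed bins removed, here is a small sketch of the arithmetic. It assumes that Models 3 and 4 started from the same 65-case test split as Model 5 before any cases were dropped, which is my reading of the paper rather than something it states in these terms; `dropped_share` is a hypothetical helper name.

```python
BASELINE_N = 65  # Model 5 test set size (no data dropped)

# Test set sizes read off the confusion matrices above.
TEST_SET_SIZES = {
    "Model 3 (5 bins)": 48,
    "Model 4 (7 bins)": 33,
}


def dropped_share(n_kept, n_baseline=BASELINE_N):
    """Fraction of the baseline test set that is missing after bin narrowing."""
    return (n_baseline - n_kept) / n_baseline


for name, n_kept in TEST_SET_SIZES.items():
    print(f"{name}: kept {n_kept}/{BASELINE_N}, dropped {dropped_share(n_kept):.1%}")
```

Under that assumption, the higher scores for Models 3 and 4 are measured on an easier subset, so they are not directly comparable with Model 5's.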
