This notebook aims to fulfil the second milestone in the LiveProject "Anomaly Detection using scikit-learn" (https://www.manning.com/liveproject/using-scikit-learn).

The milestone requires downloading and performing some basic exploratory analysis of the UC Irvine thyroid disease dataset available at http://odds.cs.stonybrook.edu/thyroid-disease-dataset/. The problem is to determine whether a patient referred to the clinic is hypothyroid. Three classes are therefore defined: normal (not hypothyroid), hyperfunction and subnormal functioning. For outlier detection, 3772 training instances are used, with only 6 real-valued attributes. The hyperfunction class is treated as the outlier class and the other two classes as inliers, because hyperfunction is a clear minority class.

The proportion of outliers is low: 0.024655 or 2.466%.

# STEP 1: PRE-PROCESSING, TRAIN-TEST SPLIT, DEFINING CLASSIFICATION METRICS

In [1]:
import pandas as pd 
import numpy as np

import sklearn
from sklearn.svm import OneClassSVM
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

from collections import Counter

In [2]:
df = pd.read_csv('thyroid.csv')

In [3]:
df.head()

Unnamed: 0,V0,V1,V2,V3,V4,V5,y
0,0.774194,0.001132,0.137571,0.275701,0.295775,0.236066,0.0
1,0.247312,0.000472,0.279886,0.329439,0.535211,0.17377,0.0
2,0.494624,0.003585,0.22296,0.233645,0.525822,0.12459,0.0
3,0.677419,0.001698,0.156546,0.175234,0.333333,0.136066,0.0
4,0.236559,0.000472,0.241935,0.320093,0.333333,0.247541,0.0


In [4]:
df.tail()

Unnamed: 0,V0,V1,V2,V3,V4,V5,y
3767,0.817204,0.000113,0.190702,0.287383,0.413146,0.188525,0.0
3768,0.430108,0.002453,0.232448,0.287383,0.446009,0.17541,0.0
3769,0.935484,0.024528,0.160342,0.28271,0.375587,0.2,0.0
3770,0.677419,0.001472,0.190702,0.242991,0.323944,0.195082,0.0
3771,0.483871,0.003566,0.190702,0.212617,0.338028,0.163934,0.0


In [5]:
val = df['y'].value_counts()

In [6]:
print(val)

0.0    3679
1.0      93
Name: y, dtype: int64


In [7]:
print(val[1.0])

93


In [8]:
n_outliers = val[1.0]

In [9]:
frac_outliers = n_outliers/df.shape[0]

## Train-test Split

The next step is to split the sample into train and test sets. A stratified split such as the one below results in only 19 outliers in the test set. Since this particular algorithm is trained on non-outlying observations only, the 74 outliers that land in the training split would go unused, which seems like a waste of outlier data. Additionally, 19 observations is a very small set on which to base accuracy calculations: each of these observations shifts the outlier detection rate by more than 5 percentage points. Therefore, the analysis will be done for two versions of the test set: for comparison with other outlier-detection algorithms, the stratified set will be used first; then, for a more reliable analysis of the accuracy of outlier detection, a test set containing all 93 outlying observations will be used.
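
The granularity argument can be checked with a few lines of arithmetic. This is a minimal sketch using only the outlier counts quoted in this notebook (19 in the stratified test set, 93 in total); the helper name `step_per_observation` is made up for illustration.

```python
# Outlier counts taken from the notebook: 19 in the stratified test set,
# 93 in the full data set.
def step_per_observation(n_outliers):
    """Change in sensitivity (percentage points) when one outlier flips class."""
    return 100.0 / n_outliers

stratified_step = step_per_observation(19)
all_outliers_step = step_per_observation(93)

print(f"19 outliers: {stratified_step:.2f} points per observation")
print(f"93 outliers: {all_outliers_step:.2f} points per observation")
```

With 93 outliers a single observation moves sensitivity by about one percentage point, which is why the larger test set gives a finer-grained estimate.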

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['V0', 'V1', 'V2', 'V3', 'V4', 'V5']], df['y'], test_size=0.2, random_state=1, stratify=df['y'])

In [16]:
y_test.value_counts()

0.0    736
1.0     19
Name: y, dtype: int64

In [17]:
# Removing outliers from the training sample

# Create new matrices of outliers
X_outliers = X_train[y_train==1]
y_outliers = y_train[y_train==1]

# Redefine X_train and y_train in terms of the inliers only
X_train = X_train[y_train==0]
y_train = y_train[y_train==0]

In [18]:
y_train.value_counts()

0.0    2943
Name: y, dtype: int64

Alternative version: split the inliers randomly but, instead of wasting the outliers from the training split, move them to the test set. This should give us a more accurate measure of prediction accuracy.

In [19]:
# X_train remains unchanged
# X_test and y_test change

# Concatenate the outliers back in
y_test1 = pd.concat([y_test, y_outliers])
X_test1 = pd.concat([X_test, X_outliers])

In [20]:
y_test.head()

1124    0.0
1675    0.0
1567    0.0
2904    0.0
2311    0.0
Name: y, dtype: float64

In [21]:
y_test1.head()

1124    0.0
1675    0.0
1567    0.0
2904    0.0
2311    0.0
Name: y, dtype: float64

In [22]:
y_test.tail()

2494    0.0
969     0.0
2324    0.0
2562    0.0
2314    0.0
Name: y, dtype: float64

In [23]:
y_test1.tail()

1336    1.0
1275    1.0
1633    1.0
1057    1.0
2209    1.0
Name: y, dtype: float64

In [24]:
X_test.head()

Unnamed: 0,V0,V1,V2,V3,V4,V5
1124,0.784946,0.002642,0.190702,0.25,0.361502,0.183607
1675,0.83871,0.009623,0.185009,0.245327,0.399061,0.167213
1567,0.268817,0.003019,0.1926,0.32243,0.399061,0.213115
2904,0.483871,0.004151,0.156546,0.184579,0.375587,0.130574
2311,0.204301,0.003019,0.128083,0.219626,0.375587,0.155361


In [25]:
X_test1.head()

Unnamed: 0,V0,V1,V2,V3,V4,V5
1124,0.784946,0.002642,0.190702,0.25,0.361502,0.183607
1675,0.83871,0.009623,0.185009,0.245327,0.399061,0.167213
1567,0.268817,0.003019,0.1926,0.32243,0.399061,0.213115
2904,0.483871,0.004151,0.156546,0.184579,0.375587,0.130574
2311,0.204301,0.003019,0.128083,0.219626,0.375587,0.155361


In [26]:
X_test1.tail()

Unnamed: 0,V0,V1,V2,V3,V4,V5
1336,0.55914,0.345283,0.128083,0.028037,0.525822,0.014754
1275,0.290323,0.262264,0.080645,0.079439,0.333333,0.062295
1633,0.376344,0.041509,0.014231,0.004673,0.384977,0.003279
1057,0.483871,0.018302,0.042694,0.081776,0.248826,0.080328
2209,0.182796,0.901887,0.086338,0.100467,0.521127,0.052459


In [27]:
# Recode inliers to 1, outliers to -1 (SVM, Elliptic Envelope, iForest, LOF)
# Apparently all outlier detection algorithms in scikit-learn use +1 for inliers and -1 for outliers, see https://amueller.github.io/aml/03-unsupervised-learning/03-outlier-detection.html#elliptic-envelope

y_test[y_test==1]=-1
y_test[y_test==0]=1

y_test1[y_test1==1]=-1
y_test1[y_test1==0]=1

## Classification Metrics

In [28]:
# Model Evaluation
# This function is custom-defined instead of using the one from the template, because the labels in SVM are -1,1

def custom_classification_metrics_function(y_test, test_pred):
    
    # Confusion matrix
    confusion_matrix_test_object = confusion_matrix(y_test, test_pred, labels=[1,-1])
    
    # Initialize a dictionary to store the metrics we are interested in
    metrics_dict = Counter()

    metrics_dict['TN'] = confusion_matrix_test_object[0][0]
    metrics_dict['TP'] = confusion_matrix_test_object[1][1]
    metrics_dict['FN'] = confusion_matrix_test_object[1][0]
    metrics_dict['FP'] = confusion_matrix_test_object[0][1]
    
    # Sensitivity
    metrics_dict['Sensitivity'] = float("{0:.4f}".format(metrics_dict['TP']/(metrics_dict['TP']+metrics_dict['FN'])))
    
    # Specificity
    metrics_dict['Specificity'] = float("{0:.4f}".format(metrics_dict['TN']/(metrics_dict['TN']+metrics_dict['FP'])))
    
    # Precision
    metrics_dict['Precision'] = float("{0:.4f}".format(metrics_dict['TP']/(metrics_dict['TP']+metrics_dict['FP'])))
    
    # Recall
    metrics_dict['Recall'] = float("{0:.4f}".format(metrics_dict['TP']/(metrics_dict['TP']+metrics_dict['FN'])))
    
    # Accuracy
    metrics_dict['Accuracy']  = float("{0:.4f}".format(accuracy_score(y_test, test_pred)))

    # The following are more useful than the accuracy
    metrics_dict['F1'] = float("{0:.4f}".format(f1_score(y_test, test_pred, pos_label=-1)))
    
    return metrics_dict

In [29]:
def model_evaluation(X_train, X_test, y_test, model, description):
    model.fit(X_train)
    y_hat = model.predict(X_test)
    classification_metrics = custom_classification_metrics_function(y_test, y_hat)
    print(classification_metrics)
    metrics = pd.DataFrame(classification_metrics, index=[description])
    return metrics
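
The index convention in `custom_classification_metrics_function` is easy to get wrong, so it may help to check it on a toy example. With `labels=[1,-1]`, row 0 of the confusion matrix holds the true inliers and row 1 the true outliers, while the columns follow the same order for the predictions. The sketch below is a pure-Python counterpart of that mapping, not the scikit-learn implementation; the function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, inlier=1, outlier=-1):
    """Count TN, FP, FN, TP with the outlier label treated as the positive class."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == inlier and pred == inlier:
            counts["TN"] += 1   # matrix position [0][0]
        elif truth == inlier and pred == outlier:
            counts["FP"] += 1   # matrix position [0][1]
        elif truth == outlier and pred == inlier:
            counts["FN"] += 1   # matrix position [1][0]
        elif truth == outlier and pred == outlier:
            counts["TP"] += 1   # matrix position [1][1]
    return counts

# Toy labels (invented for this check): four inliers followed by two outliers
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
toy_counts = confusion_counts(y_true, y_pred)
print(toy_counts)
```

One inlier is flagged (FP) and one outlier is missed (FN), so each of the four cells is exercised at least once.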

# MILESTONE 2: ONE-CLASS SVM

### Define the Models

In [30]:
# First specification: RBF Kernel
model0 = OneClassSVM(kernel='rbf', nu=4*frac_outliers) # Why 4*frac_outliers though?
model1 = OneClassSVM(kernel='poly', degree=2, nu=4*frac_outliers)
model2 = OneClassSVM(kernel='poly', degree=3, nu=4*frac_outliers)
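
One way to read the `nu` setting: the scikit-learn documentation describes `nu` as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Because the models here are trained on inliers only, a `nu` of about 0.1 suggests that roughly a tenth of inliers may end up outside the learned boundary. The sketch below compares `nu` with the false-positive rates from the stratified results reported later in this notebook; it is a plausibility check under that reading, not an explanation of why the factor of 4 was chosen.

```python
n_outliers, n_total = 93, 3772           # counts from df['y'].value_counts()
frac_outliers = n_outliers / n_total
nu = 4 * frac_outliers

n_test_inliers = 736                     # inliers in the stratified test set
false_positives = {"RBF": 74, "Poly 2": 81, "Poly 3": 79}
fp_rates = {name: fp / n_test_inliers for name, fp in false_positives.items()}

print(f"nu = {nu:.4f}")
for name, rate in fp_rates.items():
    print(f"{name}: false-positive rate = {rate:.4f}")
```

The false-positive rates sit close to `nu`, which is consistent with the specificity of just under 90% seen for all three kernels.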

### Version 1: Stratified Sample

In [32]:
# Accuracy metrics

# Initialise the metrics_all dataframe
metrics_all = model_evaluation(X_train, X_test, y_test, model0, 'RBF, stratified')

metrics = model_evaluation(X_train, X_test, y_test, model1, 'Poly 2, stratified')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test, y_test, model2, 'Poly 3, stratified')
metrics_all = pd.concat([metrics_all, metrics])

Counter({'TN': 662, 'FP': 74, 'TP': 16, 'FN': 3, 'Specificity': 0.8995, 'Accuracy': 0.898, 'Sensitivity': 0.8421, 'Recall': 0.8421, 'F1': 0.2936, 'Precision': 0.1778})
Counter({'TN': 655, 'FP': 81, 'FN': 12, 'TP': 7, 'Specificity': 0.8899, 'Accuracy': 0.8768, 'Sensitivity': 0.3684, 'Recall': 0.3684, 'F1': 0.1308, 'Precision': 0.0795})
Counter({'TN': 657, 'FP': 79, 'FN': 12, 'TP': 7, 'Specificity': 0.8927, 'Accuracy': 0.8795, 'Sensitivity': 0.3684, 'Recall': 0.3684, 'F1': 0.1333, 'Precision': 0.0814})


In [33]:
metrics_all

Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333


### Version 2: All Outliers Used in the Test Set

In [34]:
# Accuracy metrics
# Note that only the test sets are different in this case; the train set is the same

metrics = model_evaluation(X_train, X_test1, y_test1, model0, 'RBF, test outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model1, 'Poly 2, test outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model2, 'Poly 3, test outliers')
metrics_all = pd.concat([metrics_all, metrics])

Counter({'TN': 662, 'TP': 82, 'FP': 74, 'FN': 11, 'Specificity': 0.8995, 'Accuracy': 0.8975, 'Sensitivity': 0.8817, 'Recall': 0.8817, 'F1': 0.6586, 'Precision': 0.5256})
Counter({'TN': 655, 'FP': 81, 'FN': 57, 'TP': 36, 'Specificity': 0.8899, 'Accuracy': 0.8335, 'Sensitivity': 0.3871, 'Recall': 0.3871, 'F1': 0.3429, 'Precision': 0.3077})
Counter({'TN': 657, 'FP': 79, 'FN': 57, 'TP': 36, 'Specificity': 0.8927, 'Accuracy': 0.8359, 'Sensitivity': 0.3871, 'Recall': 0.3871, 'F1': 0.3462, 'Precision': 0.313})


In [35]:
metrics_all

Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462


Save the metrics for future use

In [36]:
metrics_all.to_csv('metrics.csv')  # keep the index: it holds the model descriptions

### Discussion

This is very interesting: specificity, sensitivity, accuracy and recall do not change much but the F1 score and precision change dramatically as a result of having more outliers in our test set.

This stems from the fact that precision is defined as the ratio of true positives to all observations classed as positive (TP + FP). For each model, the number of false positives is the same whether we use all 93 outliers in the test set or only a subset of them, because the inliers in the test set do not change. However, when all outliers are used in the test set, the number of true positives increases roughly fivefold (from 16 to 82 for the RBF model). Since the F1 score is the harmonic mean of precision and recall, it too is affected.
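
The effect can be reproduced directly from the confusion counts printed above. The sketch below recomputes precision, recall and F1 for the RBF model on both test sets; the helper name `precision_recall_f1` is introduced here for illustration only.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) with outliers as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# RBF one-class SVM counts copied from the notebook output
stratified = precision_recall_f1(tp=16, fp=74, fn=3)
all_outliers = precision_recall_f1(tp=82, fp=74, fn=11)

for name, (p, r, f1) in [("stratified", stratified), ("all outliers", all_outliers)]:
    print(f"{name}: precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```

The false-positive count is 74 in both cases, so the entire change in precision comes from the larger number of true positives.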

This explains why we are meant to use stratification for test/train - a test set which reflects the properties of the underlying data in its proportion of positives and negatives will give us the most accurate measure of a model's performance.

However, from the point of view of the actual application, such as thyroid detection, we may wish to use all outlying observations in the test set in order to test the performance of our model against the widest range of outliers and to understand the characteristics of those anomalies which remain undetected. Given that in the case of thyroid illness a false positive outcome is much less costly than a false negative, we could use a metric such as sensitivity instead.

The changes in precision and F1 score resulting from including more outliers in the data highlight the limitations of these metrics when applied to highly imbalanced datasets. A metric such as sensitivity is invariant to the class imbalance (since it only focuses on those observations which are actually positive) and it can give a better measure of meeting the project's objectives - in this case reliably detecting abnormal thyroid function.
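
To see the dependence on class balance in isolation, precision can be written in terms of sensitivity, specificity and the share of outliers in the test set: precision = sens * p / (sens * p + (1 - spec) * (1 - p)). The sketch below holds the RBF model's stratified rates fixed and varies only the outlier share `p`. The function name is made up for illustration, and the second figure is an implied value under fixed rates, not a result from the notebook (the observed sensitivity on the larger set was slightly different).

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision implied by fixed sensitivity/specificity at a given outlier share."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Rates of the RBF model on the stratified test set (16/19 and 662/736)
sens, spec = 16 / 19, 662 / 736

implied = {
    "stratified, 19 of 755": precision_from_rates(sens, spec, 19 / 755),
    "all outliers, 93 of 829": precision_from_rates(sens, spec, 93 / 829),
}
for label, value in implied.items():
    print(f"{label}: implied precision = {value:.4f}")
```

Sensitivity and specificity are inputs here and never change, yet precision roughly triples when the outlier share rises from about 2.5% to about 11%.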

# MILESTONE 3: ELLIPTIC ENVELOPE

In [39]:
from sklearn.covariance import EllipticEnvelope

In [40]:
# Defining the models

model3 = EllipticEnvelope(contamination=frac_outliers)
model4 = EllipticEnvelope(contamination=2*frac_outliers)
model5 = EllipticEnvelope(contamination=4*frac_outliers)

In [41]:
# Following the approach in https://machinelearningmastery.com/one-class-classification-algorithms/, the model is fit on the inliers (majority class) only

metrics = model_evaluation(X_train, X_test, y_test, model3, 'EE, c=frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test, y_test, model4, 'EE, c=2*frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test, y_test, model5, 'EE, c=4*frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

Counter({'TN': 726, 'TP': 18, 'FP': 10, 'FN': 1, 'Specificity': 0.9864, 'Accuracy': 0.9854, 'Sensitivity': 0.9474, 'Recall': 0.9474, 'F1': 0.766, 'Precision': 0.6429})
Counter({'TN': 715, 'FP': 21, 'TP': 18, 'FN': 1, 'Specificity': 0.9715, 'Accuracy': 0.9709, 'Sensitivity': 0.9474, 'Recall': 0.9474, 'F1': 0.6207, 'Precision': 0.4615})
Counter({'TN': 674, 'FP': 62, 'TP': 18, 'FN': 1, 'Sensitivity': 0.9474, 'Recall': 0.9474, 'Accuracy': 0.9166, 'Specificity': 0.9158, 'F1': 0.3636, 'Precision': 0.225})


In [42]:
metrics_all

Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462
"EE, c=frac_outliers",726,18,1,10,0.9474,0.9864,0.6429,0.9474,0.9854,0.766
"EE, c=2*frac_outliers",715,18,1,21,0.9474,0.9715,0.4615,0.9474,0.9709,0.6207
"EE, c=4*frac_outliers",674,18,1,62,0.9474,0.9158,0.225,0.9474,0.9166,0.3636


#### Discussion

The model has very high sensitivity: only 1 of the 19 outliers in the test set was incorrectly classified. As expected, specificity, precision and the F1 score all fall as the contamination parameter increases. This is because a larger proportion of the test set is classified as outliers, resulting in more false positives and a deterioration in these three metrics. At the same time, since all but 1 of the outliers are already correctly classified when the contamination parameter is set to frac_outliers, increasing that parameter can only marginally improve the detection of positives, while it significantly reduces the accuracy of detection of negatives.

Importantly, the performance of this algorithm relies on accurately specifying the contamination parameter. While this is possible with labelled data, it would not be possible with unlabelled data, leading to a possible deterioration in performance if an ad hoc heuristic such as 1% or 5% is used in the absence of known proportions.

For reference, below is an example of the model's fit when the actual share of outliers in the test set is significantly higher than the fraction specified in model fitting. As the example of model3 shows, sensitivity can suffer when the fraction of outliers is underestimated (in this case by a factor of roughly 4.5: outliers make up 93 of the 829 observations in X_test1, about 11.2%, against a contamination of about 2.5%).
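
The size of the mismatch can be checked from the counts reported earlier in the notebook. This is a small arithmetic sketch; the variable names `test1_fraction` and `ratio` are introduced here for illustration.

```python
n_outliers, n_total = 93, 3772
frac_outliers = n_outliers / n_total             # contamination passed to model3

n_test_inliers = 736                             # inliers in the stratified test set
test1_fraction = n_outliers / (n_test_inliers + n_outliers)  # outlier share of X_test1

ratio = test1_fraction / frac_outliers
print(f"contamination passed to model3: {frac_outliers:.4f}")
print(f"outlier share of X_test1:       {test1_fraction:.4f}")
print(f"ratio:                          {ratio:.2f}")
```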

In [47]:
metrics = model_evaluation(X_train, X_test1, y_test1, model3, 'EE test outliers, c=frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model4, 'EE test outliers, c=2*frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model5, 'EE test outliers, c=4*frac_outliers')
metrics_all = pd.concat([metrics_all, metrics])

metrics_all

Counter({'TN': 726, 'TP': 81, 'FN': 12, 'FP': 10, 'Specificity': 0.9864, 'Accuracy': 0.9735, 'Precision': 0.8901, 'F1': 0.8804, 'Sensitivity': 0.871, 'Recall': 0.871})
Counter({'TN': 716, 'TP': 87, 'FP': 20, 'FN': 6, 'Specificity': 0.9728, 'Accuracy': 0.9686, 'Sensitivity': 0.9355, 'Recall': 0.9355, 'F1': 0.87, 'Precision': 0.8131})
Counter({'TN': 674, 'TP': 89, 'FP': 62, 'FN': 4, 'Sensitivity': 0.957, 'Recall': 0.957, 'Accuracy': 0.9204, 'Specificity': 0.9158, 'F1': 0.7295, 'Precision': 0.5894})


Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462
"EE, c=frac_outliers",726,18,1,10,0.9474,0.9864,0.6429,0.9474,0.9854,0.766
"EE, c=2*frac_outliers",715,18,1,21,0.9474,0.9715,0.4615,0.9474,0.9709,0.6207
"EE, c=4*frac_outliers",674,18,1,62,0.9474,0.9158,0.225,0.9474,0.9166,0.3636
"EE test outliers, c=frac_outliers",726,81,12,10,0.871,0.9864,0.8901,0.871,0.9735,0.8804


# MILESTONE 4: ISOLATION FOREST

In [48]:
from sklearn.ensemble import IsolationForest

In [49]:
# Define the models
model6 = IsolationForest(n_estimators = 50)
model7 = IsolationForest(n_estimators = 100)
model8 = IsolationForest(n_estimators = 200)

In [50]:
# The model is fit on the inliers (majority class) only (see liveProject instructions and https://machinelearningmastery.com/one-class-classification-algorithms/)

metrics = model_evaluation(X_train, X_test, y_test, model6, 'iF, n_e=50')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test, y_test, model7, 'iF, n_e=100')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test, y_test, model8, 'iF, n_e=200')
metrics_all = pd.concat([metrics_all, metrics])

Counter({'TN': 667, 'FP': 69, 'TP': 19, 'Sensitivity': 1.0, 'Recall': 1.0, 'Accuracy': 0.9086, 'Specificity': 0.9062, 'F1': 0.3551, 'Precision': 0.2159, 'FN': 0})
Counter({'TN': 676, 'FP': 60, 'TP': 19, 'Sensitivity': 1.0, 'Recall': 1.0, 'Accuracy': 0.9205, 'Specificity': 0.9185, 'F1': 0.3878, 'Precision': 0.2405, 'FN': 0})
Counter({'TN': 674, 'FP': 62, 'TP': 19, 'Sensitivity': 1.0, 'Recall': 1.0, 'Accuracy': 0.9179, 'Specificity': 0.9158, 'F1': 0.38, 'Precision': 0.2346, 'FN': 0})


In [51]:
metrics_all

Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462
"EE, c=frac_outliers",726,18,1,10,0.9474,0.9864,0.6429,0.9474,0.9854,0.766
"EE, c=2*frac_outliers",715,18,1,21,0.9474,0.9715,0.4615,0.9474,0.9709,0.6207
"EE, c=4*frac_outliers",674,18,1,62,0.9474,0.9158,0.225,0.9474,0.9166,0.3636
"EE test outliers, c=frac_outliers",726,81,12,10,0.871,0.9864,0.8901,0.871,0.9735,0.8804


#### Discussion

In this case, the sensitivity of the estimator is 100%: all outliers are correctly classified. However, the number of false positives is very high compared to the elliptic envelope with the contamination fraction set to the fraction of outliers. Thus, specificity and precision are lower. The values of precision show that a larger number of trees per forest (set by the n_estimators parameter) does not improve precision significantly: the positive predictive value is still only 20-25%, meaning that only between 1 in 5 and 1 in 4 of the observations predicted as positive are actually positive.

Importantly, the Isolation Forest algorithm is the first algorithm in this notebook which has not required specifying a parameter dependent on the fraction of outliers. As a result, the key evaluation metrics such as sensitivity should remain invariant to the number of outliers in the test data. The estimation below investigates this.
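
A short sketch of why no outlier fraction is needed. In the Isolation Forest formulation, the anomaly score is s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of an observation across trees and c(n) is the average path length for a subsample of size n. The assumptions here are that scikit-learn's default `contamination='auto'` flags observations whose score in this form exceeds 0.5, and that the default subsample size is 256 when more than 256 training rows are available. The code illustrates the formula only and is not scikit-learn's implementation.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s = 2^(-E[h(x)] / c(n)); values near 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

subsample = 256  # assumed subsample size per tree
c = average_path_length(subsample)
print(f"c({subsample}) = {c:.2f}")
for depth in (4.0, c, 14.0):
    print(f"mean depth {depth:5.2f} -> score {anomaly_score(depth, subsample):.3f}")
```

Under these assumptions the decision rule depends only on how quickly an observation is isolated relative to c(n), not on how many outliers the data are believed to contain.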

In [53]:
metrics = model_evaluation(X_train, X_test1, y_test1, model6, 'iF test outliers, n_e=50')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model7, 'iF test outliers, n_e=100')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation(X_train, X_test1, y_test1, model8, 'iF test outliers, n_e=200')
metrics_all = pd.concat([metrics_all, metrics])

metrics_all

Counter({'TN': 680, 'TP': 92, 'FP': 56, 'FN': 1, 'Sensitivity': 0.9892, 'Recall': 0.9892, 'Accuracy': 0.9312, 'Specificity': 0.9239, 'F1': 0.7635, 'Precision': 0.6216})
Counter({'TN': 675, 'TP': 92, 'FP': 61, 'FN': 1, 'Sensitivity': 0.9892, 'Recall': 0.9892, 'Accuracy': 0.9252, 'Specificity': 0.9171, 'F1': 0.748, 'Precision': 0.6013})
Counter({'TN': 679, 'TP': 92, 'FP': 57, 'FN': 1, 'Sensitivity': 0.9892, 'Recall': 0.9892, 'Accuracy': 0.93, 'Specificity': 0.9226, 'F1': 0.7603, 'Precision': 0.6174})


Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462
"EE, c=frac_outliers",726,18,1,10,0.9474,0.9864,0.6429,0.9474,0.9854,0.766
"EE, c=2*frac_outliers",715,18,1,21,0.9474,0.9715,0.4615,0.9474,0.9709,0.6207
"EE, c=4*frac_outliers",674,18,1,62,0.9474,0.9158,0.225,0.9474,0.9166,0.3636
"EE test outliers, c=frac_outliers",726,81,12,10,0.871,0.9864,0.8901,0.871,0.9735,0.8804


As expected, precision and the F1 score change significantly due to the change in the class balance. Sensitivity remains close to 100% (92 of the 93 outliers are detected), reflecting the strong performance of this model in detecting anomalous thyroid function.

# MILESTONE 5: LOCAL OUTLIER FACTOR

In [64]:
from sklearn.neighbors import LocalOutlierFactor

In [67]:
# Define the models
model9 = LocalOutlierFactor(n_neighbors = 3, novelty=True)
model10 = LocalOutlierFactor(n_neighbors = 10, novelty=True)
model11 = LocalOutlierFactor(n_neighbors = 20, novelty=True)
model12 = LocalOutlierFactor(n_neighbors = 50, novelty=True)

In [69]:
# We need to redefine our model fit function due to an error with feature names

def model_evaluation1(X_train, X_test, y_test, model, description):
    model.fit(X_train.values)
    y_hat = model.predict(X_test.values)  # plain arrays for both fit and predict, so feature names stay consistent
    classification_metrics = custom_classification_metrics_function(y_test, y_hat)
    print(classification_metrics)
    metrics = pd.DataFrame(classification_metrics, index=[description])
    return metrics

In [70]:
# The model is fit on the inliers (majority class) only, as detailed in the liveProject instructions

metrics = model_evaluation1(X_train, X_test, y_test, model9, 'LOF, n_n=3')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation1(X_train, X_test, y_test, model10, 'LOF, n_n=10')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation1(X_train, X_test, y_test, model11, 'LOF, n_n=20')
metrics_all = pd.concat([metrics_all, metrics])

metrics = model_evaluation1(X_train, X_test, y_test, model12, 'LOF, n_n=50')
metrics_all = pd.concat([metrics_all, metrics])

Counter({'TN': 692, 'FP': 44, 'TP': 11, 'FN': 8, 'Specificity': 0.9402, 'Accuracy': 0.9311, 'Sensitivity': 0.5789, 'Recall': 0.5789, 'F1': 0.2973, 'Precision': 0.2})
Counter({'TN': 700, 'FP': 36, 'TP': 14, 'FN': 5, 'Specificity': 0.9511, 'Accuracy': 0.9457, 'Sensitivity': 0.7368, 'Recall': 0.7368, 'F1': 0.4058, 'Precision': 0.28})
Counter({'TN': 709, 'FP': 27, 'TP': 15, 'FN': 4, 'Specificity': 0.9633, 'Accuracy': 0.9589, 'Sensitivity': 0.7895, 'Recall': 0.7895, 'F1': 0.4918, 'Precision': 0.3571})
Counter({'TN': 699, 'FP': 37, 'TP': 15, 'FN': 4, 'Specificity': 0.9497, 'Accuracy': 0.9457, 'Sensitivity': 0.7895, 'Recall': 0.7895, 'F1': 0.4225, 'Precision': 0.2885})


In [71]:
metrics_all

Unnamed: 0,TN,TP,FN,FP,Sensitivity,Specificity,Precision,Recall,Accuracy,F1
"RBF, stratified",662,16,3,74,0.8421,0.8995,0.1778,0.8421,0.898,0.2936
"Poly 2, stratified",655,7,12,81,0.3684,0.8899,0.0795,0.3684,0.8768,0.1308
"Poly 3, stratified",657,7,12,79,0.3684,0.8927,0.0814,0.3684,0.8795,0.1333
"RBF, test outliers",662,82,11,74,0.8817,0.8995,0.5256,0.8817,0.8975,0.6586
"Poly 2, test outliers",655,36,57,81,0.3871,0.8899,0.3077,0.3871,0.8335,0.3429
"Poly 3, test outliers",657,36,57,79,0.3871,0.8927,0.313,0.3871,0.8359,0.3462
"EE, c=frac_outliers",726,18,1,10,0.9474,0.9864,0.6429,0.9474,0.9854,0.766
"EE, c=2*frac_outliers",715,18,1,21,0.9474,0.9715,0.4615,0.9474,0.9709,0.6207
"EE, c=4*frac_outliers",674,18,1,62,0.9474,0.9158,0.225,0.9474,0.9166,0.3636
"EE test outliers, c=frac_outliers",726,81,12,10,0.871,0.9864,0.8901,0.871,0.9735,0.8804


# Discussion

While accuracy is relatively high, this mainly reflects the low number of false positives among the inliers, which form the majority class. Sensitivity (the accuracy of detecting outliers) is considerably lower than for IsolationForest and EllipticEnvelope, ranging from about 58% to 79%. Performance is particularly poor when only 3 nearest neighbours are used.

Overall, IsolationForest appears to be the most accurate in detecting outliers (highest sensitivity). At the same time, it maintains a specificity of over 90%. The performance of the EllipticEnvelope is very similar: although sensitivity differs by approx. 5 percentage points, this reflects the small number of outliers in the test set, as the one misclassified outlier counts as approx. 5% of the outlier sample. At the same time, the EllipticEnvelope achieves a higher specificity. It is instructive to compare the performance of IF and EE when more outliers are used in the test set. In this case the performance of IF remains almost completely unchanged, while the sensitivity of EE falls for some parameters. In particular, if the contamination parameter in EE severely underestimates the fraction of outliers, sensitivity is more significantly affected (it drops from 0.9474 to 0.871 for c=frac_outliers), whereas specificity stays at 0.9864.
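
The comparison between the two test sets can be made explicit for sensitivity, using the (TP, FN) counts printed in the earlier cells. This is a small sketch; only two of the fitted models are included and the dictionary layout is chosen for illustration.

```python
# (TP, FN) pairs copied from the printed confusion counts
counts = {
    "EE, c=frac_outliers": {"stratified": (18, 1), "all outliers": (81, 12)},
    "iF, n_e=50":          {"stratified": (19, 0), "all outliers": (92, 1)},
}

def sensitivity(tp, fn):
    """Share of actual outliers that were flagged as outliers."""
    return tp / (tp + fn)

changes = {}
for model, sets in counts.items():
    before = sensitivity(*sets["stratified"])
    after = sensitivity(*sets["all outliers"])
    changes[model] = after - before
    print(f"{model}: {before:.4f} -> {after:.4f} (change {after - before:+.4f})")
```

Because IsolationForest is refitted without a fixed random seed in this notebook, its counts may vary slightly between runs; the figures above are those shown in the printed output.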
