# Local Outlier Factor method trained on barrier, distance, chi-1_fa and S-2_fa variables

In [1]:
import matplotlib.pyplot as plt 
import csv 
import pandas as pd 
from sklearn.ensemble import IsolationForest
from sklearn import preprocessing
from sklearn.neighbors import LocalOutlierFactor
import seaborn as sns

Read the CSV data into a pandas DataFrame "distance_barrier", print the first five rows, and plot barrier against distance in a scatterplot:

In [None]:
distance_barrier = pd.read_csv("data/vaskas_features_properties_smiles_filenames.csv", usecols=[1, 26, 90, 91])
print(distance_barrier.head())
distance_barrier.plot(kind='scatter', x='distance', y='barrier')

In [None]:
distance_barrier.plot(kind='scatter', x='chi-1_fa', y='barrier')


In [None]:
distance_barrier.plot(kind='scatter', x='S-2_fa', y='barrier')

In [None]:
distance_barrier.info()

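Define the input variables for the Local Outlier Factor model and fit it to the data. The contamination parameter sets the expected fraction of outliers (here 0.01, i.e. 1%). Store the anomaly scores (continuous variable, sign-flipped so that larger values are more anomalous) and the anomaly label (discrete variable: 1 for inliers, -1 for outliers).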

In [None]:
anomaly_inputs = ['distance', 'barrier', 'S-2_fa', 'chi-1_fa']
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
distance_barrier['anomaly'] = lof.fit_predict(distance_barrier[anomaly_inputs])
distance_barrier['anomaly_scores'] = lof.negative_outlier_factor_ * -1
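
To illustrate how the contamination setting, the discrete labels and the continuous scores relate, here is a minimal sketch on synthetic data. The toy array `X` and the isolated point at (10, 10) are assumptions made for illustration and are not part of the Vaska's dataset; only the `LocalOutlierFactor` calls mirror the cell above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: one Gaussian cluster plus a single isolated point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([cluster, [[10.0, 10.0]]])

lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof_demo.fit_predict(X)               # 1 = inlier, -1 = outlier
scores = -lof_demo.negative_outlier_factor_    # larger = more anomalous

# The labels follow from thresholding the raw factor at offset_,
# which scikit-learn derives from the contamination value.
flagged = lof_demo.negative_outlier_factor_ < lof_demo.offset_

print("labels present:", sorted(set(labels.tolist())))
print("number flagged:", int((labels == -1).sum()))
print("isolated point flagged:", labels[-1] == -1)
```

Scores close to 1 indicate a point whose local density is similar to that of its neighbors; the contamination value only moves the cut-off between the two labels and does not change the scores themselves.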

In [None]:
distance_barrier.info()

In [None]:
distance_barrier.loc[:, ['distance', 'barrier', 'chi-1_fa', 'S-2_fa', 'anomaly_scores', 'anomaly'] ]

Define a function that uses Seaborn to plot outliers and inliers in side-by-side scatterplots, where the anomaly scores are color coded and a further variable is mapped to marker size.

In [None]:
def outlier_plot(data, outlier_method_name, x_var, y_var, h_var, s_var, xaxis_limits=(0, 1), yaxis_limits=(0, 1)):
    # The features are used as-is here (no normalization has been applied).
    print(f'Outlier Method: {outlier_method_name} (distance vs barrier vs chi-1_fa vs S-2_fa)')

    # Count the points labelled as outliers (-1) and inliers (1).
    n_outliers = len(data[data['anomaly'] == -1])
    n_inliers = len(data[data['anomaly'] == 1])
    print(f"Number of anomalous values: {n_outliers}")
    print(f"Number of non-anomalous values: {n_inliers}")
    print(f"Total number of values: {len(data)}")

    # One panel per anomaly label; hue and marker size encode two further variables.
    g = sns.relplot(data=data, x=x_var, y=y_var, col='anomaly', hue=h_var, size=s_var)
    g.fig.suptitle(f'Outlier method: {outlier_method_name} (distance vs barrier vs chi-1_fa vs S-2_fa)', y=1.10, fontweight='bold')
    g.set(xlim=xaxis_limits, ylim=yaxis_limits)
    # Columns are ordered by label value, so the first panel holds the outliers (-1).
    axes = g.axes.flatten()
    axes[0].set_title(f"Outliers\n{n_outliers} points")
    axes[1].set_title(f"Inliers\n{n_inliers} points")
    return g

In [None]:
outlier_plot(distance_barrier, "Local Outlier Factor", "distance", "barrier", "anomaly_scores", "S-2_fa", [0.8, 1.1], [0, 30])

In [None]:
outlier_plot(distance_barrier, "Local Outlier Factor", "chi-1_fa", "barrier", "anomaly_scores", "S-2_fa", [200, 1100], [0, 30])

In [None]:
outlier_plot(distance_barrier, "Local Outlier Factor", "S-2_fa", "barrier", "anomaly_scores", "chi-1_fa", [25, 150], [0, 30])

The suitability of an outlier method is sensitive to the underlying data distribution. The Local Outlier Factor method seems to handle clustered data distributions better than the Isolation Forest model. The Local Outlier Factor model compares the local density around each point with that of its k nearest neighbors, so it depends on the distance metric. However, when fitted to multidimensional data, some features (chi-1_fa and S-2_fa) seem to influence the anomaly score more than others (distance). The features are not normalized here, and features with larger numerical ranges contribute more to the distances, so I will check whether explicit normalization helps (or whether normalization is already included in the model fitting). Pre-print with an extensive benchmark of outlier methods on different datasets: https://arxiv.org/abs/2305.00735
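
One possible way to run the normalization check is sketched below. It standardizes each feature with `StandardScaler` before fitting and compares the resulting labels with those from the raw features. The DataFrame `demo` is synthetic (uniform random values in ranges roughly matching the axis limits used in the plots above) and is only an assumption for illustration; the choice of `StandardScaler` over other scalers is also an assumption, not something the analysis above prescribes. To my knowledge, scikit-learn's `LocalOutlierFactor` applies its distance metric to the inputs as given, so any scaling has to be done explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real data: same column names, random values.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "distance": rng.uniform(0.8, 1.1, n),
    "barrier": rng.uniform(0, 30, n),
    "S-2_fa": rng.uniform(25, 150, n),
    "chi-1_fa": rng.uniform(200, 1100, n),
})
features = ["distance", "barrier", "S-2_fa", "chi-1_fa"]

# Standardize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(demo[features])

# Fit LOF once on the raw features and once on the standardized ones.
labels_raw = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(demo[features])
labels_scaled = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(scaled)

# Fraction of points that receive the same label in both fits.
agreement = float(np.mean(labels_raw == labels_scaled))
print(f"Label agreement between raw and scaled fits: {agreement:.3f}")
```

If the two fits flag noticeably different points on the real data, that would suggest the wide-range features were dominating the neighbor distances in the unscaled model.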