# PyOD - Histogram-Based Outlier Score (HBOS)

## Loading the data

We load the libraries and the data:

The usual ones (pandas, numpy).

scikit-learn functions for splitting, preprocessing and metrics.

The HBOS model from PyOD.

In [1]:
import pandas as pd
import numpy as np
from time import time

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import cohen_kappa_score

from pyod.models.hbos import HBOS

random_state = np.random.RandomState(42)

In [2]:
df = pd.read_csv('../../../Datasets/Dataset_2.csv',index_col='fecha')
df.index = pd.to_datetime(df.index)
df.head(2)

fecha,FormacionNIRHumedadPV,FibraticPredNIRHumedadPV,Hum_Pred,Etapa2MWHumedadPV,ExtractorVelocidadPV,FormacionAlturaMantaPV,FormadoraVelocidadPV,FormadoraSiloNivel,SiloFibraNivel,SiloFibraVelocidadPV,...,ScalperReservaIzqPosPV,FormacionNIRPH,FormacionNIRHumedadPV_std,FibraticPredNIRHumedadPV_std,Hum_Pred_std,Etapa2MWHumedadPV_std,Negro,CurvaCola,Congelado,Hum
2021-02-10 10:00:00,0.0,6.465569,10.92092,13.14157,50.55861,120.965,37.6,70.74133,12.59019,26.1,...,66.029085,0.0,0.0,0.022282,0.163959,0.221663,1,0,1,0
2021-02-10 10:01:00,0.0,6.355772,10.604865,12.412745,50.55929,112.285,29.6,72.317965,9.990133,21.4,...,68.50844,0.0,0.0,0.029155,0.060918,0.151328,1,0,1,0


The `Negro`, `Congelado` and `Hum` anomaly flags are merged into a single `Anomalia` column: I sum the three columns and set the result to one wherever the sum is non-zero (at least one of them recorded an anomaly) and to zero otherwise.

Rows flagged by `Anomalia` are then dropped, together with those four columns, so that `CurvaCola` is the only anomaly label left and the metrics are computed for it alone.

In [3]:
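# Combine the three other anomaly flags into a single 0/1 column
df['Anomalia'] = df['Negro'] + df['Congelado'] + df['Hum']
df['Anomalia'] = df['Anomalia'].map(lambda x: 1 if x!=0 else 0)

# Drop the rows with any of those anomalies, then the flag columns themselves
df = df.drop(df[df['Anomalia']==1].index)
df = df.drop(['Negro','Congelado','Hum','Anomalia'], axis=1)

# CurvaCola is the only anomaly label kept as the target
lista_anomalias = ['CurvaCola']

# Every remaining column is used as an input attribute
atributos = df.columns.drop(lista_anomalias)
len(atributos)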

22
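The flag-combining step can also be written without a lambda. The sketch below is only an illustration on a small made-up frame: the column names come from the notebook, while the values and the names `toy`, `otras_anomalias` and `limpio` are invented here. It builds the same 0/1 column with `any(axis=1)` and filters with a boolean mask.

```python
import pandas as pd

# Toy data: column names match the notebook, values are invented.
toy = pd.DataFrame({
    "Negro":     [0, 1, 0, 0],
    "Congelado": [0, 0, 0, 1],
    "Hum":       [0, 0, 0, 1],
    "CurvaCola": [1, 0, 0, 0],
})

otras_anomalias = ["Negro", "Congelado", "Hum"]

# 1 if any of the three flags is set on that row, otherwise 0.
toy["Anomalia"] = toy[otras_anomalias].ne(0).any(axis=1).astype(int)

# Keep only the rows free of those anomalies, then drop the helper columns.
limpio = toy.loc[toy["Anomalia"] == 0].drop(columns=otras_anomalias + ["Anomalia"])
print(limpio)
```

Filtering with a boolean mask removes exactly the flagged rows. Dropping by index label, in contrast, removes every row that shares a label with a flagged row, which matters if the timestamp index contains duplicates.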

## Selecting a subset of the dataset for shorter training runs

In [4]:
#df = df.iloc[3000:60000,:]

## Preparation

I split the data into train and test sets and standardize them:

I put the attributes in X and the anomaly label in Y, then call train_test_split, which samples rows at random across the whole time range. No `stratify` argument is passed, so the anomaly proportion in each partition is only approximately that of the full dataset.

The split is train-test (80-20); no separate validation set is created in this notebook.

Once the data is split, I fit the StandardScaler() on the training set and apply it to both the training and test sets.

In [5]:
# Separate the input attributes from the output labels
X = df.loc[:, atributos]
Y = df.loc[:, lista_anomalias]

# Proportion of outliers present (informational; not passed to the model below)
proporcion_outliers = round(np.count_nonzero(Y) / len(Y),3)

# Split into train and test (80-20)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = random_state)

# Standardize: fit on the training set only, then transform both sets
standarizer = StandardScaler()
standarizer.fit(X_train)
X_train_standarized = standarizer.transform(X_train)
X_test_standarized = standarizer.transform(X_test)
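If the anomaly proportion should be identical in both partitions, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on synthetic data, not the notebook's dataset; `X_demo`, `y_demo` and the other names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in for the notebook's data: 1000 rows, 5% anomalies.
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the anomaly share equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit the scaler on the training partition only, then reuse it for test,
# so no information from the test rows leaks into the scaling.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when anomalies are rare, because an unstratified random split can leave the test set with noticeably more or fewer anomalies than the training set.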

In [6]:
X

Unnamed: 0_level_0,FormacionNIRHumedadPV,FibraticPredNIRHumedadPV,Hum_Pred,Etapa2MWHumedadPV,ExtractorVelocidadPV,FormacionAlturaMantaPV,FormadoraVelocidadPV,FormadoraSiloNivel,SiloFibraNivel,SiloFibraVelocidadPV,...,SierrasAnchoPV,ScalperPosPV,ScalperReservaMediaPV,ScalperReservaDerPosPV,ScalperReservaIzqPosPV,FormacionNIRPH,FormacionNIRHumedadPV_std,FibraticPredNIRHumedadPV_std,Hum_Pred_std,Etapa2MWHumedadPV_std
fecha,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2021-02-10 16:24:00,7.800000,6.454529,12.314780,15.733730,78.541285,144.23000,52.8,69.99834,39.99929,24.0,...,2492.260,383.49000,68.861110,68.861110,68.861110,5.680000,0.209343,0.091988,0.215115,0.082579
2021-02-10 16:34:00,10.720000,7.315059,7.544955,10.172350,77.805810,144.24000,49.0,79.99815,39.99929,27.6,...,2492.270,368.25500,72.978775,72.978775,72.978775,5.600000,0.040575,0.093870,0.329550,0.836433
2021-02-10 16:45:00,11.900000,9.257516,10.707760,8.764314,69.160100,144.24000,46.5,84.99940,39.99929,24.8,...,2492.260,357.02000,73.274790,73.274790,73.274790,5.640000,0.265850,0.148466,0.639368,0.075495
2021-02-10 16:47:00,12.550000,8.276335,9.088927,11.422155,72.124070,144.24000,44.6,84.99940,49.99762,23.8,...,2492.260,358.33000,71.411195,71.411195,71.411195,5.600000,0.098494,0.114497,0.217076,0.605878
2021-02-10 16:50:00,11.290000,8.855763,11.416160,14.635690,71.233220,144.24000,46.9,84.99940,49.99762,25.1,...,2492.260,377.63000,68.851880,68.851880,68.851880,5.650000,0.037717,0.025316,0.244615,0.628972
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-24 01:08:00,8.010856,8.116809,9.678380,12.812745,70.140915,84.87000,65.4,84.99973,39.99929,35.0,...,2491.430,328.00750,52.196620,52.196620,52.196620,5.545441,0.021566,0.051716,0.048957,0.205136
2021-12-24 01:09:00,8.066128,8.155981,9.782755,13.050980,70.152370,84.87000,65.1,84.99973,39.99929,35.0,...,2491.440,328.77500,53.229580,53.229580,53.229580,5.558593,0.018150,0.033550,0.082396,0.259534
2021-12-24 01:10:00,8.077693,8.190700,9.663385,12.535290,70.086910,84.86499,64.4,84.99973,39.99929,35.0,...,2491.430,329.40990,53.327510,53.327510,53.327510,5.554653,0.011538,0.016949,0.066463,0.282053
2021-12-24 01:11:00,8.034976,8.220307,10.015770,13.541960,70.158910,84.87000,63.9,84.99973,29.60058,35.0,...,2491.430,328.89000,52.571210,52.571210,52.571210,5.549932,0.025811,0.021008,0.117752,0.221532


## Training

I first define the list of metrics to analyze and the anomaly to evaluate.

A single HBOS model is trained with its default hyperparameters on the standardized training set; no hyperparameter loop is run.

I then predict the labels for the test set, compute each metric against the true `CurvaCola` labels, and store the results in a dataframe for display. The cell also reports the elapsed time.
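For intuition about what the detector computes, here is a rough NumPy sketch of the histogram-based idea. It is a simplified illustration written for this notebook, not PyOD's implementation: the function name `hbos_scores` and the way out-of-range values are handled are assumptions of the sketch, and PyOD's `HBOS` differs in details.

```python
import numpy as np

def hbos_scores(X_train, X_new, n_bins=10, alpha=0.1):
    """Simplified static-bin HBOS sketch: higher score = more anomalous."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    scores = np.zeros(X_new.shape[0])
    for j in range(X_train.shape[1]):
        # Equal-width histogram of feature j, normalised to a density.
        hist, edges = np.histogram(X_train[:, j], bins=n_bins, density=True)
        # Map each new value to its bin using the interior edges; values
        # outside the training range fall into the first or last bin
        # (an assumption of this sketch).
        idx = np.clip(np.digitize(X_new[:, j], edges[1:-1]), 0, n_bins - 1)
        # Sparse bins have low density and so add a large contribution;
        # alpha keeps the logarithm finite for empty bins.
        scores += -np.log2(hist[idx] + alpha)
    return scores

rng = np.random.RandomState(0)
train = rng.normal(size=(500, 2))
new = np.array([[0.0, 0.0], [3.5, -3.5]])
demo_scores = hbos_scores(train, new)
print(demo_scores)
```

Each feature is scored independently and the contributions are summed, which is why the method is fast but cannot see anomalies that only show up in the combination of several features.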

In [7]:
t0 = time()

# Set the parameters
metricas_list = ['roc_auc','accuracy','precision','kappa','sensibilidad','especificidad']
anomalia = 'CurvaCola'

# Training (HBOS with its default hyperparameters)
clf = HBOS()
clf.fit(X_train_standarized)

# Prediction (hard 0/1 labels)
Y_pred = clf.predict(X_test_standarized)

# Metrics (all computed on the hard 0/1 labels)
roc_auc = roc_auc_score(Y_test[anomalia], Y_pred)
accuracy = accuracy_score(Y_test[anomalia],Y_pred)
precision = precision_score(Y_test[anomalia],Y_pred)
kappa = cohen_kappa_score(Y_test[anomalia],Y_pred)
sensibilidad = recall_score(Y_test[anomalia],Y_pred)
especificidad = recall_score(Y_test[anomalia],Y_pred, pos_label=0)

valores = [roc_auc,accuracy,precision,kappa,sensibilidad,especificidad]
metricas = pd.DataFrame(valores)
metricas.index = metricas_list
metricas.columns = [anomalia]

# Elapsed time
t1 = time()
duration = round(t1 - t0, ndigits=4)
print('Tiempo: ', duration)

Tiempo:  2.5223
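The cell above passes the hard 0/1 labels from `predict` to `roc_auc_score`. ROC AUC is normally computed from continuous scores, and with hard labels it collapses to a single operating point. The sketch below uses made-up scores (no PyOD model involved; all names and values are invented for illustration) to show how much the two numbers can differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

# Synthetic ground truth (10% anomalies) and made-up anomaly scores
# in which anomalies tend to score higher than normal points.
y_true = np.zeros(1000, dtype=int)
y_true[:100] = 1
scores = rng.normal(size=1000) + 1.5 * y_true

# Hard labels: flag the top 10% of scores, analogous to thresholding.
threshold = np.percentile(scores, 90)
labels = (scores > threshold).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

In the notebook, the analogous change would be to pass `clf.decision_function(X_test_standarized)` as the second argument of `roc_auc_score`; this would alter the reported value, so it is offered as a suggestion rather than as a description of what the notebook did.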


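I inspect the metrics on the test set. A ROC AUC of about 0.51 and a kappa of about 0.02 indicate that the detector is barely better than chance for `CurvaCola`; the accuracy of 0.83 mostly reflects the large share of normal samples that are correctly left unflagged.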

In [8]:
metricas.T

,roc_auc,accuracy,precision,kappa,sensibilidad,especificidad
CurvaCola,0.509429,0.830272,0.102804,0.017534,0.119991,0.898867
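All six values in the table can be derived from the four confusion-matrix counts. The sketch below shows the relationships using invented counts chosen to resemble the table (they are not the notebook's actual counts), and the helper name `metrics_from_confusion` is hypothetical. It makes no attempt to guard against empty classes.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive the notebook's label-based metrics from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    sensibilidad = tp / (tp + fn)    # recall of the positive class
    especificidad = tn / (tn + fp)   # recall of the negative class
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # With hard 0/1 predictions, ROC AUC reduces to balanced accuracy.
    roc_auc = (sensibilidad + especificidad) / 2
    return {
        "roc_auc": roc_auc,
        "accuracy": accuracy,
        "precision": precision,
        "kappa": kappa,
        "sensibilidad": sensibilidad,
        "especificidad": especificidad,
    }

# Invented counts: 100 true anomalies among 1100 samples.
demo = metrics_from_confusion(tp=12, fp=100, fn=88, tn=900)
for nombre, valor in demo.items():
    print(nombre, round(valor, 4))
```

The same relationship is visible in the table above: the reported `roc_auc` of 0.509429 is the mean of `sensibilidad` (0.119991) and `especificidad` (0.898867), as expected when ROC AUC is computed from hard labels.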


In [9]:
# Store the predictions next to the true labels and save them to disk
Y_test['Y_pred'] = Y_pred
Y_test.to_csv('Resultados/HBOS_completo.csv')