<br>

## __Exercício: Detecção de Anomalias__

<br>

__1:__

Utilizando a classe DetectorAnomalias criada ao longo do módulo, __vamos avaliar um detector de anomalias.__

O dataset utilizado pode ser importado através da função getData. 

Nesse conjunto de dados, possuímos 6 variáveis explicativas, $X_1, .., X_6$ e uma variável com a marcação se a instância é uma anomalia ou não.

Utilizando a __metodolodia__ discutida ao longo do módulo, __teste diferentes modelos (variando o limiar $\epsilon$)__ a fim de encontrar o que __melhor fita os dados.__

Justifique as escolhas do $\epsilon$, bem como quais as métricas de performance abordadas. 

<br>

__2:__ 

Aborde o problema num contexto de aprendizado supervisionado, ou seja, treine modelos de classificação binária com o objetivo de detectar anomalias.

Compare os resultados entre as metodologias.

In [1]:
import pandas as pd 
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

In [2]:
class DetectorAnomalias():
    
    def __init__(self, epsilon):
        self.epsilon = epsilon
        
    def fit(self, X):
        medias = X.mean(axis = 0)
        desvios = X.std(axis = 0)
        gaussianas = [st.norm(loc = m, scale = d) for m, d in zip(medias, desvios)]  
        self.gaussianas = gaussianas
        self.X = X
        
    def prob(self, x):
        p = 1
        for i in range(self.X.shape[1]):
            gaussiana_i = self.gaussianas[i]
            x_i = x[i]
            p *= gaussiana_i.pdf(x_i)
        return p
    
    def isAnomaly(self, x):
        return int(np.where(self.prob(x) < self.epsilon, 1, 0))

In [3]:
def getData():
    return pd.read_csv("dataframe_anomalias_exercicio.csv")

In [4]:
df = getData()
df

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,7.731153,23.299155,-0.367453,4.715372,9.306179,16.780965,0.0
1,11.466833,16.943695,-0.245131,7.060311,10.462826,19.821289,0.0
2,11.501272,20.196011,1.206049,-4.957189,7.771262,19.100079,0.0
3,10.893921,16.072385,2.738045,-3.684228,7.373334,23.225524,0.0
4,10.091706,19.253894,0.996895,-9.504052,8.883988,17.903298,0.0
...,...,...,...,...,...,...,...
10095,11.192286,18.451987,-0.953650,-14.362996,10.875826,17.056541,0.0
10096,12.014177,19.461815,1.985099,-7.119190,11.079922,17.582755,0.0
10097,10.745460,18.175951,0.206037,-1.897015,9.888329,17.963324,0.0
10098,9.893969,22.333270,-1.465981,4.137382,7.690620,21.570097,0.0


In [5]:
df.anomalia.value_counts()

0.0    10046
1.0       54
Name: anomalia, dtype: int64

In [6]:
grupo = df.groupby(df.anomalia)

In [7]:
df0 = grupo.get_group(0)
df1 = grupo.get_group(1)

In [8]:
df0 = df0.reset_index(drop = True)
df1 = df1.reset_index(drop = True)

In [9]:
from random import sample

In [10]:
li0 = list(range(1,len(df0)))
li1 = list(range(1,len(df1)))

In [11]:
index_tr0 = sample(li0, 6000)

In [12]:
df_treino = df0.loc[index_tr0].reset_index(drop = True)

In [13]:
df_treino

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,8.021768,19.527491,3.761158,4.035694,10.620308,20.227034,0.0
1,9.453442,19.576260,0.123265,3.682848,11.700878,20.253846,0.0
2,12.429006,17.299178,0.002719,0.536989,10.190113,19.336599,0.0
3,11.346914,16.428516,-0.672503,-0.860574,11.165897,18.886485,0.0
4,9.003950,17.269056,0.158642,1.759902,10.054462,18.596760,0.0
...,...,...,...,...,...,...,...
5995,12.333911,17.885234,-0.304686,5.144954,9.548418,17.764709,0.0
5996,11.070579,18.197645,0.124504,-1.174700,9.521245,22.199427,0.0
5997,8.351500,19.005128,3.009145,9.765791,11.268913,21.645014,0.0
5998,9.385479,19.603972,-1.063330,3.497499,10.884038,20.503741,0.0


In [14]:
index = list(set(li0).difference(set(index_tr0)))

dff = df0.loc[index].reset_index(drop = True)

In [15]:
li00 = list(range(1,len(dff)))

index_val0 = sample(li00, 2000)

df_val0 = dff.loc[index_val0].reset_index(drop = True)

In [16]:
index_te0 = list(set(li00).difference(set(index_val0)))

df_te0 = dff.loc[index_te0].reset_index(drop = True)

In [17]:
df_val0

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,9.862924,19.250776,-2.046898,6.073527,12.619636,19.703083,0.0
1,11.180302,20.048893,2.440144,-0.878462,10.831853,20.013264,0.0
2,10.273715,17.920410,-1.271954,8.577676,10.796878,22.329412,0.0
3,11.467493,20.203737,-0.112250,-6.004046,8.660776,21.156406,0.0
4,9.747215,17.978310,-2.445553,-3.524296,10.394536,20.760453,0.0
...,...,...,...,...,...,...,...
1995,8.740407,18.643051,0.838998,0.988866,12.429012,20.941799,0.0
1996,9.331929,17.993790,-3.425505,3.136812,10.591264,18.545947,0.0
1997,12.274159,18.280107,0.334560,1.209965,8.692174,21.981409,0.0
1998,11.861998,20.923822,-2.241031,6.294579,11.317817,19.935223,0.0


In [18]:
df_te0

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,8.862190,19.875165,2.070148,-1.932622,10.467065,19.543920,0.0
1,9.567872,20.397633,-1.682410,-1.147919,12.141437,19.832751,0.0
2,7.956286,18.894773,-3.298700,0.355945,8.539450,22.248524,0.0
3,10.771028,20.677702,-0.419413,6.884561,11.062811,19.247882,0.0
4,11.547111,17.461590,0.493503,-7.141786,11.187736,23.499564,0.0
...,...,...,...,...,...,...,...
2039,10.233547,17.751671,1.337545,4.437117,10.456328,22.042726,0.0
2040,8.883213,17.716090,0.292724,9.684588,7.604798,19.775801,0.0
2041,7.079235,21.928641,-1.197625,1.576940,8.612625,21.630574,0.0
2042,9.407711,20.570520,1.711892,-2.890427,10.592882,16.764366,0.0


In [19]:
index_val1 = sample(li1, 27)
index_te1 = list(set(li1).difference(set(index_val1)))

df_val1 = df1.loc[index_val1].reset_index(drop = True)
df_te1 = df1.loc[index_te1].reset_index(drop = True)

In [20]:
df_val1

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,11.223595,26.039011,2.243705,0.70159,6.607744,22.748339,1.0
1,6.96986,14.830034,-1.827199,-7.902797,11.742536,21.890941,1.0
2,13.962011,20.703721,-0.061895,3.199113,7.59084,25.395016,1.0
3,11.270489,18.837693,-3.809331,-10.864376,7.451519,23.458497,1.0
4,13.773683,16.764427,-2.412264,-12.90271,9.263291,15.939811,1.0
5,11.21971,24.109488,2.243495,16.90197,8.664476,22.160678,1.0
6,6.43336,17.081117,1.502474,-13.403646,12.694445,23.093357,1.0
7,6.829255,17.688066,1.590936,-10.378061,6.010466,22.321078,1.0
8,8.084153,14.697817,3.922621,-3.364086,12.393389,19.779215,1.0
9,10.805454,20.369868,0.117241,-6.435825,15.383893,16.110497,1.0


In [21]:
df_te1

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,9.468373,18.743048,-1.266223,-20.056818,7.917381,20.201706,1.0
1,11.973881,22.486348,-0.066925,-11.879938,3.599944,21.292253,1.0
2,8.261131,24.689891,-2.712199,8.932249,7.601956,17.465838,1.0
3,8.095411,23.513102,1.064261,-5.193106,13.857783,16.186253,1.0
4,9.184504,19.685476,3.338574,-19.833892,12.73458,19.271336,1.0
5,8.981139,17.423745,2.613908,11.433214,13.209214,24.505147,1.0
6,7.937358,15.67669,-1.105316,-8.051221,12.786681,24.214592,1.0
7,11.904203,19.13528,-6.294481,3.052595,10.735405,18.864197,1.0
8,10.368988,18.696348,-4.438455,15.599589,10.17047,19.646169,1.0
9,10.611336,25.116263,0.595267,10.820618,14.600131,21.635058,1.0


In [22]:
df_validacao = pd.concat([df_val0, df_val1]).reset_index(drop = True)
df_teste = pd.concat([df_te0, df_te1]).reset_index(drop = True)

In [23]:
df_validacao

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,9.862924,19.250776,-2.046898,6.073527,12.619636,19.703083,0.0
1,11.180302,20.048893,2.440144,-0.878462,10.831853,20.013264,0.0
2,10.273715,17.920410,-1.271954,8.577676,10.796878,22.329412,0.0
3,11.467493,20.203737,-0.112250,-6.004046,8.660776,21.156406,0.0
4,9.747215,17.978310,-2.445553,-3.524296,10.394536,20.760453,0.0
...,...,...,...,...,...,...,...
2022,8.432902,22.478172,1.751484,2.096249,15.467202,22.166471,1.0
2023,10.846806,14.630948,2.396958,-0.727713,7.667084,16.256074,1.0
2024,12.121978,16.749393,-1.833823,1.826815,8.243195,25.631909,1.0
2025,8.130799,14.620713,2.796031,12.409311,10.310444,18.687043,1.0


In [24]:
df_teste

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,8.862190,19.875165,2.070148,-1.932622,10.467065,19.543920,0.0
1,9.567872,20.397633,-1.682410,-1.147919,12.141437,19.832751,0.0
2,7.956286,18.894773,-3.298700,0.355945,8.539450,22.248524,0.0
3,10.771028,20.677702,-0.419413,6.884561,11.062811,19.247882,0.0
4,11.547111,17.461590,0.493503,-7.141786,11.187736,23.499564,0.0
...,...,...,...,...,...,...,...
2065,6.975882,22.759423,-2.553305,-12.657738,11.015048,15.069125,1.0
2066,8.557670,18.662596,-1.026912,7.093296,4.326196,18.718360,1.0
2067,13.258870,20.618047,1.997706,11.950572,8.354106,24.949376,1.0
2068,7.814796,16.092293,2.194788,11.302675,10.816714,23.580067,1.0


## Parte 1

In [25]:
X_treino = df_treino.drop(['anomalia'], axis = 1)
X_val = df_validacao.drop(['anomalia'], axis = 1)
X_teste = df_teste.drop(['anomalia'], axis = 1)

In [75]:
y_treino = df_treino['anomalia']
y_val = df_validacao['anomalia']
y_teste = df_teste['anomalia']

In [48]:
from sklearn.metrics import roc_auc_score

In [49]:
ann = DetectorAnomalias(epsilon = 0.001)
ann.fit(X_treino)

In [50]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [51]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.5

In [52]:
ann = DetectorAnomalias(epsilon = 0.0001)
ann.fit(X_treino)

In [53]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [54]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.5

In [55]:
ann = DetectorAnomalias(epsilon = 0.00001)
ann.fit(X_treino)

In [56]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [57]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.6795

In [58]:
ann = DetectorAnomalias(epsilon = 0.000001)
ann.fit(X_treino)

In [59]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [60]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.9177500000000001

In [61]:
ann = DetectorAnomalias(epsilon = 0.0000001)
ann.fit(X_treino)

In [62]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [63]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.9864999999999999

In [64]:
ann = DetectorAnomalias(epsilon = 0.00000001)
ann.fit(X_treino)

In [65]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [66]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.99975

In [67]:
ann = DetectorAnomalias(epsilon = 0.000000001)
ann.fit(X_treino)

In [68]:
lista_anomalia = []
for x in X_val.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [69]:
roc_auc_score(y_true = y_val, y_score = lista_anomalia)

0.5555555555555556

In [70]:
#aplicando melhor resultado na base de teste
ann = DetectorAnomalias(epsilon = 0.00000001)
ann.fit(X_treino)

In [72]:
lista_anomalia = []
for x in X_teste.to_numpy():
    lista_anomalia.append( ann.isAnomaly(x) )

In [73]:
roc_auc_score(y_true = y_teste, y_score = lista_anomalia)

0.99926614481409

O melhor $\epsilon$ encontrado foi '0.00000001', teve boa performance no df de validação e manteve a performance no df de teste.

## Parte 2

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

In [86]:
index_tr1 = sample(li1, 30)
index_di = list(set(li1).difference(set(index_tr1)))

df_tr1 = df1.loc[index_tr1].reset_index(drop = True)
df_di = df1.loc[index_di].reset_index(drop = True)

df_val1, df_te1 = df_di[:12], df_di[12:]

In [87]:
df_treino = pd.concat([df_treino, df_tr1]).reset_index(drop = True)
df_validacao = pd.concat([df_val0, df_val1]).reset_index(drop = True)
df_teste = pd.concat([df_te0, df_te1]).reset_index(drop = True)

In [88]:
Xtreino, ytreino = df_treino.drop(['anomalia'], axis = 1), df_treino['anomalia']

Xval, yval = df_validacao.drop(['anomalia'], axis = 1), df_validacao['anomalia']

Xteste, yteste = df_teste.drop(['anomalia'], axis = 1), df_teste['anomalia']

In [92]:
lr = LogisticRegression()

lr.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = lr.predict(Xval))

0.5

In [93]:
knn = KNeighborsRegressor(n_neighbors = 20)

knn.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = knn.predict(Xval))

0.6649999999999999

In [94]:
knn = KNeighborsRegressor(n_neighbors = 50)

knn.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = knn.predict(Xval))

0.6630416666666666

In [95]:
knn = KNeighborsRegressor(n_neighbors = 150)

knn.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = knn.predict(Xval))

0.821375

In [96]:
knn = KNeighborsRegressor(n_neighbors = 300)

knn.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = knn.predict(Xval))

0.8529374999999999

In [97]:
dtr = DecisionTreeRegressor(max_depth = 20)

dtr.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = dtr.predict(Xval))

0.7485

In [98]:
dtr = DecisionTreeRegressor(max_depth = 50)

dtr.fit(Xtreino, ytreino)

roc_auc_score(y_true = yval, y_score = dtr.predict(Xval))

0.7063333333333334

Quando comparados os modelos "não supervisionado" e "supervisionados" é fácil perceber que o primeiro modelo se sai bem melhor, pois não depende da variável target, que neste caso não ajuda muito, já que temos pouca representatividade de anomalias.