# Clustering Exercise: Anomaly Detection

One use of clustering algorithms is anomaly detection, that is, detecting observations that do not follow normal behavior. If the goal of clustering is to find groups of similar elements, then elements that are not similar to any group can be considered anomalous.
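To make the idea concrete, here is a minimal sketch on made-up data, assuming a density-based algorithm such as scikit-learn's `DBSCAN`, which assigns the label `-1` to points that belong to no cluster. The synthetic points and the values of `eps` and `min_samples` are illustrative assumptions, not part of the exercise data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two tight synthetic groups plus one isolated point (all values are made up).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
isolated = np.array([[10.0, -10.0]])
X = np.vstack([group_a, group_b, isolated])

# Points that are not density-reachable from any cluster get the label -1.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
is_outlier = labels == -1

print(labels)
print(X[is_outlier])  # the isolated point
```

The isolated point is not close to either group, so it ends up outside every cluster and is flagged as an anomaly.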

For this exercise we will use a [credit card transactions dataset](https://www.kaggle.com/arjunbhasin2013/ccdata), where each observation is a different customer.

Our goal is to implement a model that groups the observations appropriately and finds the potential outliers, that is, those customers whose activity is suspected of being fraudulent or erroneous. Solving this exercise properly requires some research of your own rather than strictly following what was taught in the course.

**Hints:**

- We have covered a clustering algorithm that not only assigns elements to valid clusters but also classifies elements as outliers.

- For the hyperparameter search, a good place to look is [ParameterSampler](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterSampler.html).

In [1]:
import pandas as pd

df = pd.read_csv("./data/CC GENERAL.csv")

In [2]:
df.head()

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


Each observation (row) holds aggregated information about a different customer: their credit card balance, the number of purchases made, the number of cash withdrawals from an ATM, and so on.

In [3]:
df.dtypes

CUST_ID                              object
BALANCE                             float64
BALANCE_FREQUENCY                   float64
PURCHASES                           float64
ONEOFF_PURCHASES                    float64
INSTALLMENTS_PURCHASES              float64
CASH_ADVANCE                        float64
PURCHASES_FREQUENCY                 float64
ONEOFF_PURCHASES_FREQUENCY          float64
PURCHASES_INSTALLMENTS_FREQUENCY    float64
CASH_ADVANCE_FREQUENCY              float64
CASH_ADVANCE_TRX                      int64
PURCHASES_TRX                         int64
CREDIT_LIMIT                        float64
PAYMENTS                            float64
MINIMUM_PAYMENTS                    float64
PRC_FULL_PAYMENT                    float64
TENURE                                int64
dtype: object

In [4]:
customer_ids = df.CUST_ID
df = df.drop(columns="CUST_ID")

In [5]:
df.head()

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


# Importing DBSCAN

In [24]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [27]:
df = df.fillna(0)
df_normalizado = pd.DataFrame(StandardScaler().fit_transform(df))

In [28]:
estimador_dbscan = DBSCAN()
etiquetas_dbscan = estimador_dbscan.fit_predict(df_normalizado)
pd.Series(etiquetas_dbscan).value_counts()

-1     6627
 0     1948
 10      60
 2       34
 15      30
 7       23
 14      23
 8       14
 6       13
 3       11
 29      10
 21       9
 5        9
 9        8
 12       8
 1        8
 26       8
 17       7
 11       7
 27       7
 19       7
 4        6
 13       6
 23       6
 28       5
 30       5
 16       5
 24       5
 22       5
 25       5
 20       5
 18       5
 35       5
 31       5
 32       4
 34       4
 33       3
dtype: int64

The label -1 marks the anomalies (the points DBSCAN treats as noise). With the default hyperparameters, 6,627 of the 8,950 observations are labeled as anomalies, which is far too many: the defaults are clearly not suited to this data. We therefore search for better hyperparameters for this problem.
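A quick way to quantify this is the share of points labeled `-1`. The helper below is a minimal sketch; the name `noise_fraction` and the example labels are hypothetical. Applied to the counts above, the same calculation gives 6627 / 8950, roughly 74% noise.

```python
import numpy as np


def noise_fraction(labels):
    """Fraction of points that the clustering labeled as noise (-1)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))


# Made-up labels: 2 of the 8 points are noise.
example_labels = [-1, 0, 0, 1, -1, 0, 0, 0]
print(noise_fraction(example_labels))  # 0.25
```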

In [29]:
from scipy.stats import randint
from scipy.stats import uniform 

distribucion_parametros = {
    "eps": uniform(0,5),
    "min_samples": randint(2, 20),
    "p": randint(1, 3),
}
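The ranges these distributions cover are easy to misread, so here is a small sketch that draws a few candidate configurations from the same kind of dictionary. The variable names and `random_state=0` are assumptions added for illustration. In SciPy, `uniform(0, 5)` means `loc=0, scale=5`, and `randint(low, high)` excludes the upper bound.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "eps": uniform(0, 5),           # continuous values in [0, 5]
    "min_samples": randint(2, 20),  # integers from 2 to 19
    "p": randint(1, 3),             # Minkowski power: 1 (Manhattan) or 2 (Euclidean)
}

# Draw five random configurations; random_state is fixed only for repeatability.
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for sample in samples:
    print(sample)
```

Each element of `samples` is a plain dictionary that can be unpacked into the estimator with `**`.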

Neither HDBSCAN nor DBSCAN implements `predict`, so scikit-learn's built-in randomized search cannot be used with them. Instead, we write our own search loop with ParameterSampler.
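A hand-written search also needs a way to score each labeling. The search below passes all labels to `silhouette_score`, which treats the noise label `-1` as just another cluster. One possible alternative, sketched here under that assumption, is to score only the points that were actually clustered. The helper name `silhouette_without_noise` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def silhouette_without_noise(X, labels):
    """Silhouette over clustered points only; nan if fewer than 2 clusters remain."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1
    if len(set(labels[mask].tolist())) < 2:
        return float("nan")
    return float(silhouette_score(X[mask], labels[mask]))


# Made-up data: two tight groups and one isolated point.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([0.0, 0.0], 0.1, size=(20, 2)),
    rng.normal([5.0, 5.0], 0.1, size=(20, 2)),
    [[10.0, -10.0]],
])
labels_demo = DBSCAN(eps=1.0, min_samples=3).fit_predict(X_demo)
score = silhouette_without_noise(X_demo, labels_demo)
print(score)
```

Returning `nan` instead of raising keeps the search loop simple, but the `nan` entries then have to be filtered out before ranking the results.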

In [33]:
import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.metrics import silhouette_score

muestras = 30
iteraciones = 3
porcentaje = 0.7  # Each iteration clusters a random 70% subsample of the data
resultados_busqueda = []
lista_parametros = list(ParameterSampler(distribucion_parametros, n_iter=muestras))


for parametro in lista_parametros:
    parametros_resultado = []  # Scores for this configuration across all iterations
    for i in range(iteraciones):
        muestra = df_normalizado.sample(frac=porcentaje)
        estimador_dbscan = DBSCAN(**parametro)  # ** unpacks the dict into keyword arguments
        etiquetas_dbscan = estimador_dbscan.fit_predict(muestra)
        try:
            parametros_resultado.append(silhouette_score(muestra, etiquetas_dbscan))
        except ValueError:  # silhouette_score raises when there is only one label
            pass
    # nan marks configurations for which no iteration produced a valid score
    puntuacion_media = np.mean(parametros_resultado) if parametros_resultado else np.nan
    resultados_busqueda.append([puntuacion_media, parametro])
            



In [36]:
sorted(resultados_busqueda, key=lambda x: x[0], reverse=True)[:5]

[[nan, {'eps': 0.07830642371426777, 'min_samples': 14, 'p': 1}],
 [0.7545729997826311, {'eps': 4.93016469924362, 'min_samples': 5, 'p': 1}],
 [0.7519183739582335, {'eps': 4.851680761970842, 'min_samples': 5, 'p': 1}],
 [0.7378142265804657, {'eps': 4.8498589836253, 'min_samples': 11, 'p': 2}],
 [0.7333186211004177, {'eps': 4.9247205038090245, 'min_samples': 8, 'p': 1}]]

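Ignoring the first entry, whose score is `nan` because no valid silhouette score could be computed for that configuration, the best hyperparameter configuration is: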

[0.7545729997826311, {'eps': 4.93016469924362, 'min_samples': 5, 'p': 1}]
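Sorting a list that contains `nan` is unreliable, because every comparison with `nan` is false, which is why the `nan` entry appears first above. A minimal sketch of a safer selection, assuming results are stored as `[score, params]` pairs as in this notebook; the function name `best_result` and the demo values are hypothetical.

```python
import math


def best_result(results):
    """Return the [score, params] pair with the highest non-nan score."""
    valid = [r for r in results if not math.isnan(r[0])]
    return max(valid, key=lambda r: r[0])


# Made-up results in the same [score, params] shape used by the search.
demo_results = [
    [float("nan"), {"eps": 0.1, "min_samples": 14}],
    [0.75, {"eps": 4.9, "min_samples": 5}],
    [0.73, {"eps": 4.8, "min_samples": 8}],
]
print(best_result(demo_results))  # [0.75, {'eps': 4.9, 'min_samples': 5}]
```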


In [37]:
mejores_parametros = {'eps': 4.93016469924362, 'min_samples': 5, 'p': 1}
estimador_dbscan = DBSCAN(**mejores_parametros)
etiquetas_dbscan = estimador_dbscan.fit_predict(df_normalizado)
pd.Series(etiquetas_dbscan).value_counts()

 0    8890
-1      55
 1       5
dtype: int64

As we can see, there are 55 anomalies (the observations labeled -1).
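Since `customer_ids` was saved before dropping `CUST_ID` and no rows were removed or reordered afterwards, the labels line up with the IDs by position. The sketch below shows the lookup on made-up stand-ins; the `_demo` names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-ins for customer_ids and etiquetas_dbscan, with made-up values.
customer_ids_demo = pd.Series(["C1", "C2", "C3", "C4"])
labels_demo = np.array([0, -1, 0, -1])

# A boolean mask selects the IDs of the observations labeled as noise.
anomalous_ids = customer_ids_demo[labels_demo == -1]
print(anomalous_ids.tolist())  # ['C2', 'C4']
```

In the notebook itself, the equivalent expression would be `customer_ids[etiquetas_dbscan == -1]`.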

In [40]:
def resumen_cluster(cluster_id):
    # Mean of every feature over the rows assigned to this cluster
    cluster = df[etiquetas_dbscan == cluster_id]
    resumen = cluster.mean().to_dict()
    resumen["cluster_id"] = cluster_id
    return resumen

def comparar_clusters(*cluster_ids):
    resumenes = []
    for cluster_id in cluster_ids:
        resumenes.append(resumen_cluster(cluster_id))
    return pd.DataFrame(resumenes).set_index("cluster_id").T

In [41]:
comparar_clusters(-1, 0, 1)

cluster_id,-1,0,1
BALANCE,6416.236768,1533.399657,3446.747277
BALANCE_FREQUENCY,0.95686,0.87673,0.963636
PURCHASES,15675.169455,912.881443,206.582
ONEOFF_PURCHASES,10427.974727,531.828127,164.762
INSTALLMENTS_PURCHASES,5247.194727,381.355524,41.82
CASH_ADVANCE,5691.050549,945.303656,8827.835377
PURCHASES_FREQUENCY,0.828347,0.488434,0.180303
ONEOFF_PURCHASES_FREQUENCY,0.558544,0.200307,0.109091
PURCHASES_INSTALLMENTS_FREQUENCY,0.701066,0.362509,0.089394
CASH_ADVANCE_FREQUENCY,0.227686,0.134145,0.893939
