# Fraudulent transaction detection

<a id='contents' />

## Table of contents

1. [Loading libraries and processing data](#loading)
2. [Model training](#training)
3. [SageMaker Endpoint](#endpoint)
4. [Model review](#revision)
5. [Sending traffic to the endpoint](#traffic)
6. [Resource cleanup](#clean)

<a id='loading' />

## Loading libraries and processing data
[(back to top)](#contents)

Load the libraries:

In [38]:
import numpy as np 
import pandas as pd
import awswrangler as wr
from sklearn.model_selection import train_test_split
import boto3
import os
import sagemaker
from sagemaker import get_execution_role, RandomCutForest
from sagemaker.model_monitor import DataCaptureConfig
import seaborn as sns
import matplotlib.pyplot as plt

session = sagemaker.Session()

Import the data from S3 with AWS Data Wrangler (`awswrangler`):

In [39]:
bucket='<YOUR-BUCKET-NAME>'
prefix='fraud-classifier'

In [40]:
data=wr.s3.read_csv(f"s3://{bucket}/{prefix}/train/base/creditcard.csv")

Explore the data:

In [41]:
print(data.columns)
data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


| statistic | Time | V1 | V2 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 94813.859575 | 3.91956e-15 | 5.688174e-16 | -3.660161e-16 | -1.206049e-16 | 88.349619 | 0.001727 |
| std | 47488.145955 | 1.958696 | 1.651309 | 0.4036325 | 0.3300833 | 250.120109 | 0.041527 |
| min | 0.0 | -56.40751 | -72.71573 | -22.56568 | -15.43008 | 0.0 | 0.0 |
| 25% | 54201.5 | -0.9203734 | -0.5985499 | -0.07083953 | -0.05295979 | 5.6 | 0.0 |
| 50% | 84692.0 | 0.0181088 | 0.06548556 | 0.001342146 | 0.01124383 | 22.0 | 0.0 |
| 75% | 139320.5 | 1.315642 | 0.8037239 | 0.09104512 | 0.07827995 | 77.165 | 0.0 |
| max | 172792.0 | 2.45493 | 22.05773 | 31.6122 | 33.84781 | 25691.16 | 1.0 |


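Define the model features and labels:

In [42]:
feature_columns = data.columns[:-1]  # every column except 'Class'
features = data[feature_columns].values.astype('float32')
# 'Class' is the fraud label; the train/test split expects it as `labels`
labels = data['Class'].values.astype('float32')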

<a id='training' />

## Model training
[(back to top)](#contents)

We split the dataset into training and test sets, holding out 10% of the rows for testing.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

#### Anomaly detection model (Random Cut Forest)

Random Cut Forest is an unsupervised learning model that tries to identify anomalous observations based on the defined features.

**Hyperparameters:**

- num_samples_per_tree: The number of observations sampled for each tree. As a rule of thumb, 1/num_samples_per_tree should approximate the ratio (estimated number of anomalies)/(number of normal data points).

- num_trees: The number of trees in the random cut forest. Each tree is a separate model built from a different sample of the data. The full RCF model averages the anomaly scores predicted by the individual trees.
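
As a rough check of the rule of thumb for `num_samples_per_tree`, the sketch below plugs in the fraud rate reported by `describe()` earlier (mean of `Class` ≈ 0.001727). The helper name `suggest_samples_per_tree` is hypothetical and not part of the SageMaker SDK, and rounding to a power of two is only a convenience assumed here.

```python
import math


def suggest_samples_per_tree(anomaly_fraction: float) -> dict:
    """Hypothetical helper: turn an anomaly fraction into a candidate num_samples_per_tree."""
    # Ratio of anomalous points to normal points
    ratio = anomaly_fraction / (1.0 - anomaly_fraction)
    # The rule of thumb asks for 1/num_samples_per_tree to be close to that ratio
    raw = 1.0 / ratio
    # Rounding to the nearest power of two is an assumption made for convenience
    nearest_pow2 = 2 ** round(math.log2(raw))
    return {"ratio": ratio, "raw": raw, "nearest_power_of_two": nearest_pow2}


# 0.001727 is the mean of the 'Class' column in this dataset
suggestion = suggest_samples_per_tree(0.001727)
print(suggestion)
```

Under these assumptions the raw estimate lands in the high 500s, so a value of 512 is in the right neighborhood.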

We also need to specify some additional parameters:

- EC2 instance type: Where the model training will run.
- S3 bucket: The bucket and specific path that contain the data.
- Role: The AWS IAM role.

Recommended instance types for RCF: ml.m4, ml.c4, or ml.c5

**Important:** To tune the model's hyperparameters in SageMaker, we need a test dataset with anomaly labels. Since this notebook does not pass one to the training job, the hyperparameters are set explicitly.

In [7]:
from sagemaker import RandomCutForest

# General training job configuration (SageMaker Python SDK v1 argument names;
# SDK v2 renames train_instance_count/train_instance_type to instance_count/instance_type)
rcf = RandomCutForest(role=get_execution_role(),
                      train_instance_count=1,
                      train_instance_type='ml.c4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

In [8]:
rcf.fit(rcf.record_set(X_train))

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-17 18:52:25 Starting - Starting the training job...
2020-07-17 18:52:35 Starting - Launching requested ML instances......
2020-07-17 18:53:52 Starting - Preparing the instances for training......
2020-07-17 18:54:53 Downloading - Downloading input data
2020-07-17 18:54:53 Training - Downloading the training image.........
2020-07-17 18:56:21 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from numpy.testing.nosetester import import_nose
  from numpy.testing.decorators import setastest
[07/17/2020 18:56:25 INFO 140510002829120] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'_ftp_port': 8999, u'num_samples_per_tree': 256, u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'_kvstore': u'dist_async', u'force_dense': u'true', u'epochs'

<a id='endpoint' />

## SageMaker Endpoint 
[(back to top)](#contents)

Once the model is trained, we deploy it and run predictions on the test set.

In [36]:
prefix = 'monitoring'

s3_capture_upload_path = 's3://{}/{}/datacapture'.format(bucket, prefix)
print(s3_capture_upload_path)

rcf_predictor = rcf.deploy(
    endpoint_name='random-cut-forest-endpoint-4',
    initial_instance_count=1,
    instance_type='ml.c4.xlarge',
    data_capture_config=DataCaptureConfig(
                                enable_capture=True,
                                sampling_percentage=100,
                                destination_s3_uri=s3_capture_upload_path))

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


s3://fraud-bucket-processed-events/monitoring/datacapture
-----------------!

### Data serialization/deserialization

We can send data to the inference endpoint in a variety of formats. This example uses CSV. Other formats, such as JSON and RecordIO Protobuf, are also available. We set `csv_serializer` and `json_deserializer` when configuring the inference endpoint.

In [49]:
from sagemaker.predictor import csv_serializer, json_deserializer

rcf_predictor.content_type = 'text/csv'
rcf_predictor.serializer = csv_serializer
rcf_predictor.accept = 'application/json'
rcf_predictor.deserializer = json_deserializer
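
To make the serializer/deserializer pair less abstract, here is a minimal standard-library sketch of what a CSV request body and a JSON response amount to. The function names `rows_to_csv` and `parse_scores` are hypothetical, the response shape simply mirrors the `scores`/`score` keys this notebook reads from the endpoint, and the numeric values are made up.

```python
import json


def rows_to_csv(rows):
    """Serialize feature rows into a CSV request body, one record per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_scores(response_body):
    """Extract anomaly scores from a JSON body shaped like {"scores": [{"score": ...}]}."""
    return [item["score"] for item in json.loads(response_body)["scores"]]


# Two made-up records with three features each
payload = rows_to_csv([[0.0, -1.36, 149.62], [1.0, 1.19, 2.69]])

# Made-up response, used only to illustrate the parsing step
fake_response = '{"scores": [{"score": 1.02}, {"score": 0.87}]}'

print(payload)
print(parse_scores(fake_response))
```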

<a id='revision' />

## Model review
[(back to top)](#contents)

With the model deployed to the endpoint, we can run predictions:

In [50]:
def predict_rcf(current_predictor, data, rows=500):
    """Score `data` in batches of at most `rows` records and return a 1-D array of anomaly scores."""
    # Split into enough batches that none exceeds `rows` records
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]
        predictions.append(array_preds)

    return np.concatenate([np.array(batch) for batch in predictions])
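
To see how this batching behaves without calling a live endpoint, the sketch below swaps in a stub predictor. `FakePredictor` and `predict_in_batches` are hypothetical names; the stub returns each row's sum as its "score" purely so the results can be checked, and the splitting rule is the same one `predict_rcf` uses.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in for the deployed predictor; scores each row by its sum."""

    def __init__(self):
        self.batch_sizes = []

    def predict(self, array):
        # Record the batch size so the splitting behavior can be inspected
        self.batch_sizes.append(len(array))
        return {"scores": [{"score": float(row.sum())} for row in array]}


def predict_in_batches(predictor, data, rows=500):
    # Same splitting rule as predict_rcf: floor(n / rows) + 1 batches
    n_batches = int(data.shape[0] / float(rows) + 1)
    scores = []
    for batch in np.array_split(data, n_batches):
        scores.extend(s["score"] for s in predictor.predict(batch)["scores"])
    return np.array(scores)


# 1200 synthetic rows with 3 features each
data = np.arange(1200 * 3, dtype="float32").reshape(1200, 3)
stub = FakePredictor()
scores = predict_in_batches(stub, data, rows=500)
print(stub.batch_sizes, scores.shape)
```

Because the batch count is rounded up past `n / rows`, `np.array_split` produces near-equal batches that stay at or below `rows` records rather than one full batch followed by a small remainder.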

Anomaly score inference:

In [52]:
scores=predict_rcf(rcf_predictor, X_test)

In [72]:
results_test=pd.DataFrame(X_test, columns=feature_columns)
results_test['score'] = scores

Print the points whose scores lie more than 3 standard deviations above the mean score (roughly the 99.9th percentile if the scores were normally distributed).

In [75]:
score_mean = results_test['score'].mean()
score_std = results_test['score'].std()
score_cutoff = score_mean + 3*score_std

anomalies = results_test[results_test['score'] > score_cutoff]
anomalies

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,score
32,141655.0,-29.942972,-25.831781,-16.227512,6.690679,-20.787846,13.085694,17.256622,-9.161746,5.003041,...,-2.494699,-0.660297,-8.537816,0.400804,-0.643023,0.496903,6.267709,-2.765070,3502.110107,1.498158
103,1657.0,1.241253,-0.107798,0.432991,-0.121891,-0.790021,-0.980616,-0.218998,-0.019842,0.196839,...,-0.061766,-0.267594,0.076490,0.568786,0.134627,0.934921,-0.096797,-0.007563,2.310000,1.233508
190,1422.0,-1.805552,1.671304,0.619620,-0.976160,-0.458580,-0.548156,0.134674,0.681750,0.606681,...,-0.257455,-0.561867,-0.026061,-0.027794,-0.031312,0.321154,0.580264,0.345711,1.000000,1.248522
222,1757.0,-0.640106,-0.014125,1.009867,-1.866220,0.000452,-0.746637,0.434067,-0.063813,-1.653938,...,0.381225,0.930371,-0.067886,0.052395,0.018214,-0.325211,0.327599,0.199766,70.849998,1.265753
253,878.0,-2.512192,0.317838,1.270333,1.214258,1.179903,-0.412233,1.002716,-0.774680,1.656602,...,-0.655258,0.242368,-0.135074,0.104261,0.211102,-0.287307,0.335129,0.328212,18.799999,1.335678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28357,171914.0,1.970018,-0.355816,-1.323578,0.299358,-0.038320,-0.627062,0.025980,-0.065628,0.586317,...,-0.212821,-0.617320,0.228779,-0.423555,-0.262033,0.284253,-0.090964,-0.075231,34.990002,1.224628
28388,172393.0,-0.685289,1.520622,-0.630377,-0.325479,-0.022094,0.077774,-1.396767,-5.504335,-0.387496,...,-2.781250,0.226990,-0.121431,-0.110813,0.947480,-0.013367,0.075546,0.288095,72.400002,1.381463
28408,784.0,1.269172,0.444892,0.172031,1.022692,0.015860,-0.732037,0.297024,-0.281962,-0.183246,...,0.018769,0.163191,-0.159069,0.072121,0.777835,-0.310421,0.021357,0.019546,5.070000,1.334098
28410,7248.0,-0.426112,1.317521,1.863215,1.506461,-0.328416,-0.703216,0.419872,-0.391180,1.060302,...,-0.388986,-0.517449,0.070642,0.648413,-0.731423,0.307208,0.166221,0.060570,1.290000,1.221526
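
The three-sigma cutoff used above can be compared with a plain percentile threshold. The sketch below does this on synthetic, normally distributed scores, which is an assumption made only for illustration: real RCF scores need not be normal, so the flagged fraction on the actual test set may differ.

```python
import numpy as np

# Synthetic scores drawn from a normal distribution (illustrative assumption)
rng = np.random.default_rng(42)
synthetic_scores = rng.normal(loc=1.0, scale=0.1, size=100_000)

# Same rule as in the notebook: mean plus three standard deviations
cutoff = synthetic_scores.mean() + 3 * synthetic_scores.std()
flagged_fraction = (synthetic_scores > cutoff).mean()

# 99th percentile for comparison
p99 = np.quantile(synthetic_scores, 0.99)

print(f"cutoff={cutoff:.4f}  p99={p99:.4f}  flagged={flagged_fraction:.5f}")
```

For normal data the three-sigma rule is stricter than the 99th percentile and flags on the order of 0.1% of points. NumPy's `std` defaults to the population formula while pandas' defaults to the sample formula; at this sample size the difference is negligible.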


<a id='traffic' />

## Sending traffic to the endpoint
[(back to top)](#contents)

HTTP requests are sent to the API Gateway using the `generate_endpoint_traffic.py` script. The API Gateway invokes the Lambda function, which in turn invokes the model endpoint to run inference.

In [7]:
from threading import Thread
from generate_endpoint_traffic import generate_traffic

thread = Thread(target = generate_traffic, args=[np.copy(X_test)])
thread.start()
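
The contents of `generate_endpoint_traffic.py` are not shown in this notebook. As a purely hypothetical sketch of what such a traffic generator might look like, the code below posts one CSV row per request using only the standard library. The URL, the function names, the CSV payload format, and the delay are all assumptions, not the actual script.

```python
import time
import urllib.request

# Placeholder URL (assumption); replace with the real API Gateway invoke URL
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/predict"


def build_request(row, url=API_URL):
    """Build a POST request carrying one record as a CSV line (assumed payload format)."""
    body = ",".join(str(value) for value in row).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/csv"}, method="POST"
    )


def generate_traffic_sketch(rows, send=urllib.request.urlopen, delay_seconds=0.1):
    """Send one request per row, pausing between requests."""
    for row in rows:
        send(build_request(row))
        time.sleep(delay_seconds)


# Dry run: collect the requests in a list instead of sending them over the network
sent = []
generate_traffic_sketch([[0.0, 1.5], [2.0, -0.5]], send=sent.append, delay_seconds=0)
print([request.data for request in sent])
```

Passing the `send` callable as a parameter keeps the sketch testable offline; with the default `urllib.request.urlopen` it would issue real HTTP calls.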

<a id='clean' />

## Resource cleanup
[(back to top)](#contents)

To delete the endpoint, uncomment and run the following code:

In [None]:
# rcf_predictor.delete_endpoint()