# Credit Card Fraud Detection
## Anonymized credit card transactions labeled as fraudulent or genuine
https://www.kaggle.com/dalpozz/creditcardfraud

Goal: Train a classifier to detect fraudulent transactions.

An estimated 100 **billion** credit card transactions are made each year.


Download the dataset:

In [None]:
!wget https://github.com/israelzuniga/dlatam-bigdata-workshop/raw/master/notebooks/data/creditcardfraud.zip

In [None]:
!unzip creditcardfraud.zip

Define constants to reuse throughout the Spark app:

In [None]:
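CSV_PATH = "./creditcard.csv"        # CSV extracted from creditcardfraud.zip
APP_NAME = "Random Forest Example"   # name shown in the Spark UI
SPARK_URL = "local[*]"               # run locally, using all available cores
RANDOM_SEED = 13579                  # fixed seed for reproducible split and training
TRAINING_DATA_RATIO = 0.7            # 70% training, 30% test
RF_NUM_TREES = 3                     # number of trees in the forest
RF_MAX_DEPTH = 4                     # maximum depth of each tree
RF_NUM_BINS = 32                     # not referenced below; training uses RF_MAX_BINS
RF_MAX_BINS = 32                     # maximum bins used to discretize features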


Create the Spark session and load the CSV:

In [None]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder \
    .appName(APP_NAME) \
    .master(SPARK_URL) \
    .getOrCreate()

# Read the CSV with a header row and let Spark infer the column types
df = spark.read \
    .options(header="true", inferSchema="true") \
    .csv(CSV_PATH)

print("Rows: %d" % df.count())


Once the CSV is loaded, we convert each row into a `LabeledPoint` (a label plus a dense feature vector) and split the data into training and test sets:

https://spark.apache.org/docs/1.1.0/mllib-data-types.html

https://spark.apache.org/docs/2.1.1/api/python/pyspark.mllib.html#module-pyspark.mllib.regression

In [None]:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# The last column is the label; all preceding columns are the features
transformed_df = df.rdd.map(lambda row: LabeledPoint(row[-1], Vectors.dense(row[0:-1])))

splits = [TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO]
training_data, test_data = transformed_df.randomSplit(splits, RANDOM_SEED)

print("Training set rows: %d" % training_data.count())
print("Test set rows: %d" % test_data.count())


Train a RandomForest classifier:

In [None]:
from pyspark.mllib.tree import RandomForest
from time import time

start_time = time()

# An empty categoricalFeaturesInfo means every feature is treated as continuous
model = RandomForest.trainClassifier(
    training_data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=RF_NUM_TREES, featureSubsetStrategy="auto", impurity="gini",
    maxDepth=RF_MAX_DEPTH, maxBins=RF_MAX_BINS, seed=RANDOM_SEED)

end_time = time()
elapsed_time = end_time - start_time
print("Time to train model: %.3f seconds" % elapsed_time)


Generate predictions and evaluate the classifier's performance:

In [None]:
predictions = model.predict(test_data.map(lambda x: x.features))
labels_and_predictions = test_data.map(lambda x: x.label).zip(predictions)
acc = labels_and_predictions.filter(lambda x: x[0] == x[1]).count() / float(test_data.count())
print("Model accuracy: %.3f%%" % (acc * 100))
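
Accuracy alone can be misleading when one class is rare: a model that labels every transaction as genuine would still score very high. The following is a minimal sketch of complementary metrics, assuming the (label, prediction) pair counts are gathered with `labels_and_predictions.countByValue()`. The helper name `confusion_metrics` and the sample counts are hypothetical and are not results from this notebook.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return float(numerator) / denominator if denominator else 0.0


def confusion_metrics(pair_counts):
    """Compute metrics for the positive (fraud = 1.0) class.

    pair_counts maps (label, prediction) tuples to counts, for example the
    dict returned by labels_and_predictions.countByValue().
    """
    tp = pair_counts.get((1.0, 1.0), 0)  # fraud predicted as fraud
    fp = pair_counts.get((0.0, 1.0), 0)  # genuine predicted as fraud
    fn = pair_counts.get((1.0, 0.0), 0)  # fraud predicted as genuine
    tn = pair_counts.get((0.0, 0.0), 0)  # genuine predicted as genuine
    total = tp + fp + fn + tn

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, total),
        # Accuracy of a trivial model that always predicts "genuine"
        "baseline_accuracy": _ratio(tn + fp, total),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for illustration only (not output from this dataset)
example_counts = {(0.0, 0.0): 9980, (0.0, 1.0): 5, (1.0, 0.0): 10, (1.0, 1.0): 5}
example_metrics = confusion_metrics(example_counts)
```

In the notebook, passing `labels_and_predictions.countByValue()` instead of `example_counts` would apply the same calculation to the actual test set. Comparing `accuracy` with `baseline_accuracy` shows how much of the score comes from the class imbalance alone.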


In [None]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

start_time = time()

# BinaryClassificationMetrics expects (score, label) pairs, so swap the order
predictions_and_labels = labels_and_predictions.map(lambda x: (x[1], x[0]))
metrics = BinaryClassificationMetrics(predictions_and_labels)
print("Area under Precision/Recall (PR) curve: %.3f%%" % (metrics.areaUnderPR * 100))
print("Area under Receiver Operating Characteristic (ROC) curve: %.3f%%" % (metrics.areaUnderROC * 100))

end_time = time()
elapsed_time = end_time - start_time
print("Time to evaluate model: %.3f seconds" % elapsed_time)
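
Because `model.predict` returns hard 0/1 predictions rather than probabilities, a ROC curve built from them has a single operating point between (0, 0) and (1, 1). Under that assumption the trapezoidal area reduces to (1 + TPR - FPR) / 2, which the sketch below computes. The function name `single_point_auc` and the sample rates are hypothetical.

```python
def single_point_auc(tpr, fpr):
    """Area under a ROC curve passing through (0, 0), (fpr, tpr) and (1, 1).

    tpr is the true positive rate (recall) and fpr the false positive rate
    of a classifier that only outputs hard 0/1 predictions.
    """
    # Triangle from (0, 0) to (fpr, tpr)
    left_area = fpr * tpr / 2.0
    # Trapezoid from (fpr, tpr) to (1, 1)
    right_area = (1.0 - fpr) * (tpr + 1.0) / 2.0
    return left_area + right_area


# Hypothetical rates for illustration only (not output from this dataset)
example_auc = single_point_auc(tpr=0.8, fpr=0.1)
```

With a very low false positive rate, this value is driven almost entirely by recall on the fraud class, so it carries little more information than recall itself. Scoring with class probabilities instead of hard labels would give a curve with many operating points.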
