<a href="https://colab.research.google.com/github/omaralvaradobaubap/Aplicaciones-Financieras/blob/main/Semana2_1_Aps_Financieras8_Intro_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MaxMitre/Aplicaciones-Financieras/blob/main/Semana2/1_Intro_Keras.ipynb)

# Problem

This week we will use the same dataset for both sessions. In this first session we analyze a fraud problem in a simple way, that is, with a simple binary classifier, which also serves to introduce some tools.

The dataset is available at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The data contain credit card transactions made in September 2013 by cardholders in Europe. They cover 2 days, during which there were 492 frauds out of 284,807 transactions.

It contains only numeric variables. Features V1 to V28 are principal components obtained with PCA; "Time" and "Amount" are the only columns left untransformed.

The "Time" column contains the seconds elapsed between each transaction and the first transaction in the dataset. "Amount" is the transaction amount, and "Class" is the target variable: 1 for fraud and 0 otherwise.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Reading the data with a different method than we are used to

In [None]:
import numpy as np

# Data available at https://www.kaggle.com/mlg-ulb/creditcardfraud/

all_features = []
all_targets = []
with open('/content/drive/MyDrive/Cruso-ApsFinancieras/semana3/creditcard.csv') as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)
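
The manual parsing above can be tried without the real file. The sketch below runs the same steps (skip the header, split on commas, strip quotes, convert types) on a tiny in-memory CSV. The three-row text and the names `csv_text`, `features_demo` and `targets_demo` are made up for illustration and are not part of the dataset.

```python
import io
import numpy as np

# Hypothetical miniature CSV with a quoted header and a quoted label column.
csv_text = '"Time","V1","Amount","Class"\n0,-1.36,149.62,"0"\n1,1.19,2.69,"1"\n'

rows, labels = [], []
with io.StringIO(csv_text) as f:
    header = f.readline().strip()  # first line is the header, not data
    for line in f:
        fields = line.strip().split(",")
        # Every column except the last is a numeric feature.
        rows.append([float(v.replace('"', "")) for v in fields[:-1]])
        # The last column is the label; wrapping it in a list yields shape (n, 1).
        labels.append([int(fields[-1].replace('"', ""))])

features_demo = np.array(rows, dtype="float32")
targets_demo = np.array(labels, dtype="uint8")
print(features_demo.shape)  # (2, 3)
print(targets_demo.shape)   # (2, 1)
```

Because each label is appended as a one-element list, the targets come out as a column of shape (n, 1) rather than a flat vector of shape (n,).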

In [None]:
features

In [None]:
features.shape

In [None]:
targets.shape

In [None]:
import pandas as pd

en_dataframe = pd.read_csv('/content/drive/MyDrive/Cruso-ApsFinancieras/semana3/creditcard.csv')

In [None]:
en_dataframe

In [None]:
features2 = np.array(en_dataframe.iloc[:, :-1])
targets2 = np.array(en_dataframe.iloc[:,-1])

In [None]:
features2

In [None]:
np.allclose(features, features2, rtol=0.0000001)

In [None]:
targets2.shape

In [None]:
targets.shape

In [None]:
targets

In [None]:
targets2

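## Watch the dimensions: they can cause strange problems

Here we exhaust the RAM on purpose, only to show what can go wrong; we will not run this line afterwards. `targets` has shape (n, 1) while `targets2` has shape (n,), so comparing them broadcasts to an (n, n) array.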

In [None]:
# If uncommented, this exceeds our available RAM
#np.allclose(targets, targets2)

In [None]:
# This variable nearly fills our RAM; we delete it below
aux = (targets[:100000] == targets2[:100000])

In [None]:
targets

In [None]:
targets2

In [None]:
aux

In [None]:
aux.shape

In [None]:
del aux
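
The memory problem in this section comes from NumPy broadcasting. A minimal sketch with small, made-up arrays (`col` and `flat` stand in for `targets` and `targets2`) shows how a column compared with a flat vector produces a full pairwise matrix:

```python
import numpy as np

# Hypothetical toy stand-ins for the two target arrays.
col = np.array([[0], [1], [0]], dtype="uint8")  # shape (3, 1), like `targets`
flat = np.array([0, 1, 0])                      # shape (3,), like `targets2`

# (3, 1) compared with (3,) broadcasts both operands to (3, 3).
pairwise = (col == flat)
print(pairwise.shape)  # (3, 3)

# Reshaping the flat array into a column gives an element-wise comparison.
elementwise = (col == flat.reshape(-1, 1))
print(elementwise.shape)  # (3, 1)

# Size of the pairwise boolean array for the full dataset, at 1 byte per element.
n = 284_807
print(f"{n * n / 1e9:.1f} GB")  # 81.1 GB
```

The same arithmetic explains why even the first 100,000 rows are costly: 100,000 x 100,000 booleans take about 10 GB.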

# Correction

The correct way to carry out this comparison is to make sure that the objects being compared have matching dimensions.

In [None]:
targets.shape

In [None]:
targets2.shape

In [None]:
features2 = np.array(en_dataframe.iloc[:, :-1])
targets2 = np.array(en_dataframe.iloc[:,-1]).reshape(-1, 1)  # Change here

In [None]:
np.allclose(targets, targets2, rtol=1e-05)

In [None]:
print(targets.shape)
print(targets2.shape)

# Preparing the validation set

We will do it manually (that is, not randomly): the last 20% of the rows, in file order, become the validation set.

In [None]:
features, targets

In [None]:
num_val_samples = int(len(features) * 0.2)
num_val_samples

In [None]:
num_val_samples = int(len(features) * 0.2)

# 80% of the data for training
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]

# 20% of the data for validation
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))

In [None]:
num_val_samples
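
The slicing used for the split can be checked on a small ordered array. This is only a sketch with made-up data (`data` stands in for the rows of `features`):

```python
import numpy as np

data = np.arange(10)            # hypothetical stand-in for 10 ordered rows
num_val = int(len(data) * 0.2)  # 2 validation rows

train = data[:-num_val]  # everything except the last `num_val` rows
val = data[-num_val:]    # only the last `num_val` rows

print(train)  # [0 1 2 3 4 5 6 7]
print(val)    # [8 9]
```

Note that this pattern relies on `num_val` being at least 1: with `num_val == 0`, `data[:-0]` is an empty array rather than the whole array.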

# Analyzing class imbalance

In [None]:
counts = np.bincount(train_targets[:, 0])
print(f"Number of 1 values in the training sample: {counts[1]} ({100 * float(counts[1]) / len(train_targets):.2f}% of total)")

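We will assign the weights in a balanced way. This is the same "balanced" heuristic that scikit-learn offers as `class_weight="balanced"`; the resulting values are passed to Keras through the `class_weight` argument. Manually it is computed as: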

In [None]:
total_counts = np.bincount(train_targets[:, 0])
n_samples = len(train_targets)

weight_for_0 = n_samples / (total_counts[0]*2)
weight_for_1 = n_samples / (total_counts[1]*2)

#counts = np.bincount(train_targets[:, 0])

#weight_for_0 = 1.0 / counts[0]
#weight_for_1 = 1.0 / counts[1]

print(f"{weight_for_0: .6f}")
print(f"{weight_for_1: .4f}")

We can see that the weights are in a ratio of roughly 500 to 1, with the fraud class receiving the larger weight.
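
A small worked example may make the formula easier to follow. With made-up labels (98 zeros and 2 ones), the weight of each class is `n_samples / (n_classes * count)`, so both classes end up with the same total weight:

```python
import numpy as np

# Hypothetical labels: 98 negatives and 2 positives.
y = np.array([0] * 98 + [1] * 2)

counts = np.bincount(y)                     # samples per class: [98, 2]
n_samples, n_classes = len(y), len(counts)

# Balanced weights: inversely proportional to class frequency.
weights = n_samples / (n_classes * counts)

print(weights[1])            # 25.0
print(weights * counts)      # total weight per class: [50. 50.]
print(weights[1] / weights[0])
```

The ratio between the two weights equals the ratio between the class counts (here 98 / 2 = 49), which is why the ratio in the notebook reflects how rare fraud is in the training set.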

# Standardizing the data

In [None]:
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean
std = np.std(train_features, axis=0)
train_features /= std
val_features /= std

In [None]:
df_chiquito = np.array([[1,2,3,4], [4,5,6,7], [7,8,9,10]])
df_chiquito

In [None]:
np.mean(df_chiquito, axis=1)
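
The two cells in this section differ in the axis they use. As a sketch with made-up numbers (`train` and `val` are hypothetical), `axis=0` gives one statistic per column, which is what standardization needs, while `axis=1` gives one value per row. The training statistics are reused for the validation data:

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # hypothetical data
val = np.array([[2.0, 40.0]])

# Per-column statistics, computed on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_std = (train - mean) / std
val_std = (val - mean) / std  # validation reuses the training mean and std

print(train_std.mean(axis=0))  # approximately [0. 0.]
print(train_std.std(axis=0))   # approximately [1. 1.]

# axis=1 averages across columns instead, giving one value per row.
print(train.mean(axis=1))      # [ 5.5 11.  16.5]
```

This sketch creates new arrays instead of using `-=` and `/=`. In the notebook, `train_features` and `val_features` are slices (views) of `features`, so the in-place operations also modify `features`.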

# Binary classification model

## Slicing arrays

In [None]:
arreglo = np.array([[1,2,3], [4,5,6], [5,6,7], [7,8,9]])
arreglo

In [None]:
arreglo[:, 1:]

In [None]:
# Remove the column corresponding to time
train_features = train_features[:, 1:]
val_features = val_features[:, 1:]

In [None]:
train_features.shape

In [None]:
train_features.shape[-1]

In [None]:
from tensorflow import keras
import tensorflow as tf

model = keras.Sequential(
    [
        keras.layers.Dense(256, activation="relu", input_shape=(train_features.shape[-1],)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3), ###
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3), ###
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()
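
The parameter counts reported by `model.summary()` can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes 29 input features (the 30 feature columns of the CSV minus "Time"); `dense_params` is a helper written for this example.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a dense layer."""
    return n_in * n_out + n_out

# Assumed layer widths: 29 inputs, three hidden layers of 256, one output.
layer_sizes = [29, 256, 256, 256, 1]

params = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(params)       # [7680, 65792, 65792, 257]
print(sum(params))  # 139521
```

Dropout layers add no parameters, so the networks with and without dropout have the same count.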

# Training the model with class weights


In [None]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(0.01), loss="binary_crossentropy", metrics=metrics # 0.01 is the learning rate
)

# Save training checkpoints (one file per epoch)
callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]

# Class weights used to "balance" the classes
class_weight = {0: weight_for_0, 1: weight_for_1}

history = model.fit(
    train_features,
    train_targets,
    batch_size=2048, # Number of samples used for each weight update (the batch size)
    epochs=30, # Number of full passes over the training data
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets), # Only used to monitor progress; these data are not used to update the weights
    class_weight=class_weight,
)
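
As a rough illustration of what `class_weight` does during training, each sample's loss is multiplied by the weight of its class before averaging. The sketch below is a NumPy approximation under that assumption, not the exact Keras implementation: the helper `weighted_bce` and the toy numbers are hypothetical, and the framework may normalize the batch average differently.

```python
import numpy as np

def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Mean binary cross-entropy with a per-class multiplier (illustrative only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))

# One legitimate transaction and one fraud, both scored as "probably not fraud".
y_true = [0, 1]
y_prob = [0.1, 0.1]

plain = weighted_bce(y_true, y_prob, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y_true, y_prob, {0: 0.5, 1: 250.0})
print(plain)
print(weighted)
```

With the larger weight on class 1, the missed fraud dominates the loss, which pushes the optimizer to reduce false negatives.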

In [None]:
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(y = history.history['loss'], name = 'loss'))
fig.add_trace(go.Scatter(y = history.history['val_loss'], name = 'val_loss'))
fig.update_layout(
    title = 'Model loss (with Dropout layers)',
    xaxis_title = 'Epoch',
    yaxis_title = 'Loss (binary cross-entropy)'
)
fig.show()

In [None]:
from tensorflow import keras
import tensorflow as tf

model = keras.Sequential(
    [
        keras.layers.Dense(256, activation="relu", input_shape=(train_features.shape[-1],)),
        keras.layers.Dense(256, activation="relu"),
        #keras.layers.Dropout(0.3), ###
        keras.layers.Dense(256, activation="relu"),
        #keras.layers.Dropout(0.3), ###
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()

In [None]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(0.01), loss="binary_crossentropy", metrics=metrics # 0.01 is the learning rate
)

# Save training checkpoints (one file per epoch)
callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]

# Class weights used to "balance" the classes
class_weight = {0: weight_for_0, 1: weight_for_1}

history = model.fit(
    train_features,
    train_targets,
    batch_size=2048, # Number of samples used for each weight update (the batch size)
    epochs=30, # Number of full passes over the training data
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets), # Only used to monitor progress; these data are not used to update the weights
    class_weight=class_weight,
)

In [None]:
# This cell is run after training a new network that has no dropout
fig = go.Figure()
fig.add_trace(go.Scatter(y = history.history['loss'], name = 'loss'))
fig.add_trace(go.Scatter(y = history.history['val_loss'], name = 'val_loss'))
fig.update_layout(
    title = 'Model loss (without Dropout layers)',
    xaxis_title = 'Epoch',
    yaxis_title = 'Loss (binary cross-entropy)'
)
fig.show()

In [None]:
tf.keras.utils.plot_model(
    model,
    #to_file="model.png",
    show_shapes=True,
    show_dtype=False,
    show_layer_names=True,
    #dpi=96,
)

In [None]:
# Disable scientific notation when printing
np.set_printoptions(suppress=True)

In [None]:
y_train_pred = model.predict(train_features)
y_train_pred

In [None]:
#y_train_pred = model.predict(train_features)
y_train_pred[y_train_pred < 0.5] = 0
y_train_pred[y_train_pred >= 0.5] = 1

In [None]:
y_test_pred = model.predict(val_features)
y_test_pred[y_test_pred < 0.5] = 0
y_test_pred[y_test_pred >= 0.5] = 1

In [None]:
len(y_test_pred[:, 0])

In [None]:
y_test_pred[:, 0].sum()
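
The thresholding cells above overwrite the predicted probabilities in place, which is why the probabilities have to be predicted again later for the precision-recall score. A sketch of an alternative that keeps both arrays, using made-up probabilities (`probs` is hypothetical):

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n, 1), like the result of model.predict.
probs = np.array([[0.03], [0.5], [0.97], [0.49]])

# The comparison creates a new array; `probs` is left untouched.
labels = (probs >= 0.5).astype("uint8")

print(labels[:, 0])       # [0 1 1 0]
print(int(labels.sum()))  # number of samples flagged as fraud: 2
print(probs[:, 0])        # original probabilities are still available
```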

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

In [None]:
ConfusionMatrixDisplay.from_predictions(train_targets, y_train_pred, cmap=plt.cm.Greens)

In [None]:
# Restore the model weights saved at epoch "X"
# BE CAREFUL WHEN RUNNING THIS CELL

#model.load_weights("/content/fraud_model_at_epoch_25.h5")

In [None]:
ConfusionMatrixDisplay.from_predictions(val_targets, y_test_pred, cmap=plt.cm.Greens)

In [None]:
from sklearn.metrics import average_precision_score

In [None]:
y_test_prob = model.predict(val_features)

In [None]:
y_test_prob

In [None]:
auprc = average_precision_score(val_targets, y_test_prob)
auprc
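
To see what this score measures, average precision can be approximated by hand: rank the samples by predicted score and average the precision obtained at each true positive. The function `average_precision_sketch` below is a hypothetical helper that assumes no tied scores; it is not the scikit-learn implementation.

```python
import numpy as np

def average_precision_sketch(y_true, y_score):
    """Mean of the precision values measured at each positive, ranked by score."""
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    order = np.argsort(-y_score)  # highest score first
    hits = y_true[order]          # 1 where the ranked sample is a true positive
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Toy example: positives are ranked 1st and 3rd.
ap = average_precision_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(ap, 4))  # 0.8333
```

Unlike accuracy, this score uses the predicted probabilities directly, so it does not depend on the 0.5 threshold.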

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(train_targets, y_train_pred))

In [None]:
print(classification_report(val_targets, y_test_pred))
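
The precision and recall shown in the report come from the four cells of the confusion matrix. A minimal sketch with made-up labels (`y_true` and `y_pred` here are hypothetical, not the model's output):

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1])  # hypothetical ground truth
y_pred = np.array([0, 0, 1, 0, 1, 0])  # hypothetical predictions

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # frauds caught
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # frauds missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # legitimate, correctly passed

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of frauds that were flagged

print(tp, fp, fn, tn)     # 1 1 1 3
print(precision, recall)  # 0.5 0.5
```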

In [None]:
clase_1 = 500
clase_2 = 280000

In [None]:
clase_2/(clase_2 + clase_1)
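
One way to read the ratio above is as the accuracy of a trivial classifier that always predicts the majority class. The sketch below uses the same round numbers and is only an illustration under that reading:

```python
import numpy as np

n_fraud, n_legit = 500, 280_000  # same round numbers as the cell above

y_true = np.concatenate([np.ones(n_fraud, dtype=int), np.zeros(n_legit, dtype=int)])
y_pred = np.zeros_like(y_true)   # always predict "not fraud"

accuracy = float((y_true == y_pred).mean())
recall = float(y_pred[y_true == 1].mean())  # share of frauds that were flagged

print(f"{accuracy:.4f}")  # 0.9982
print(recall)             # 0.0
```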

# Exercises:

* What would happen to the model if we did not use the "class_weight" parameter?

* Can we recover only the best epoch of our training?

* Why does the confusion matrix not seem to shed much light on what is happening?

 - Load the data using Pandas
 - Perform the train/test split using the 'train_test_split' function from sklearn

In [None]:
# Exercise code; you may create more cells

