# Credic Card Fraud

Este notebook tem como objetivo treinar e avaliar um modelo de classificação binária utilizando o dataset público de detecção de fraudes em transações de cartão de crédito (Kaggle – Credit Card Fraud Detection).

O dataset contém 284.807 transações reais, das quais apenas 492 são fraudes (uma relação muito parecida com o projeto que estamos fazendo no CREA), configurando um problema desbalanceado. As variáveis principais são:

* V1–V28: features anonimizadas via PCA;
* Time e Amount: tempo e valor da transação;
* Class rótulo binário (0 = transação legítima, 1 = fraude).

O modelo implementado é um Sequential em Keras com uma única camada `Dense` e ativação `sigmoid`, equivalente a uma regressão logística. O treinamento é feito com:

* Otimizador: Adam
* Função de perda: Binary Crossentropy
* Métricas: Accuracy e F1-score

Após o treino (50 épocas, batch size = 10), o modelo é avaliado no conjunto de teste, reportando accuracy e F1-score O projeto discute o impacto do desbalanceamento e apresenta possíveis melhorias como ajuste de threshold, uso de class_weight, reamostragem e métricas mais robustas como AUC-PR.

## Modo de uso:
Como o arquivo de Fraude é muito grande, ele não foi anexadao nesse notebook. Caso queira rodar o código baixe o arquivo .zip em https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud e coloque o arquivo dentro da pasta [data](./data/).


In [1]:
import numpy as np
import pandas as pd
import zipfile, pathlib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
pathlib.Path("data").mkdir(exist_ok=True)
with zipfile.ZipFile("data/creditcardfraud.zip") as z:
    z.extractall("data")

# Agora o CSV estará em data/creditcard.csv
import pandas as pd
df = pd.read_csv("data/creditcard.csv")
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
#separar X/y
y = df["Class"].values.astype(np.int32)
X = df.drop(columns=["Class"]).copy()

In [5]:
#padronizar 'Time' e 'Amount' (V1–V28 já são PCA-normalizados)
scaler = StandardScaler()
X[["Time", "Amount"]] = scaler.fit_transform(X[["Time", "Amount"]])

X = X.values

In [6]:
#split estratificado (desbalanceamento!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

In [7]:
#métrica F1 custom para Keras (acumulada por época)
class F1Score(keras.metrics.Metric):
    def __init__(self, threshold=0.5, name="f1", **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold
        self.tp = self.add_weight(name="tp", initializer="zeros")
        self.fp = self.add_weight(name="fp", initializer="zeros")
        self.fn = self.add_weight(name="fn", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred > self.threshold, tf.float32)
        tp = tf.reduce_sum(tf.cast(tf.logical_and(tf.equal(y_true,1), tf.equal(y_pred,1)), tf.float32))
        fp = tf.reduce_sum(tf.cast(tf.logical_and(tf.equal(y_true,0), tf.equal(y_pred,1)), tf.float32))
        fn = tf.reduce_sum(tf.cast(tf.logical_and(tf.equal(y_true,1), tf.equal(y_pred,0)), tf.float32))
        self.tp.assign_add(tp)
        self.fp.assign_add(fp)
        self.fn.assign_add(fn)

    def result(self):
        precision = self.tp / (self.tp + self.fp + keras.backend.epsilon())
        recall = self.tp / (self.tp + self.fn + keras.backend.epsilon())
        return 2.0 * (precision * recall) / (precision + recall + keras.backend.epsilon())

    def reset_states(self):
        for var in self.variables:
            var.assign(0.0)

In [8]:
#class weights para combater o desbalanceamento
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

In [10]:
#modelo sequencial com 1 Dense(sigmoid)
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", F1Score()]  # accuracy + F1 custom
)

In [12]:
#treina o modeloo por 50 épocas, batch_size=10
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=10,
    validation_split=0.1,
    class_weight=class_weights,
    verbose=1
)

Epoch 1/50
[1m20506/20506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 2ms/step - accuracy: 0.9828 - f1: 0.0169 - loss: 0.1441 - val_accuracy: 0.9788 - val_f1: 0.0179 - val_loss: 0.1054
Epoch 2/50
[1m20506/20506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 2ms/step - accuracy: 0.9776 - f1: 0.0158 - loss: 0.1615 - val_accuracy: 0.9818 - val_f1: 0.0200 - val_loss: 0.0996
Epoch 3/50
[1m20506/20506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m47s[0m 2ms/step - accuracy: 0.9815 - f1: 0.0182 - loss: 0.1299 - val_accuracy: 0.9773 - val_f1: 0.0169 - val_loss: 0.1109
Epoch 4/50
[1m20506/20506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m74s[0m 2ms/step - accuracy: 0.9796 - f1: 0.0157 - loss: 0.1600 - val_accuracy: 0.9778 - val_f1: 0.0169 - val_loss: 0.1112
Epoch 5/50
[1m20506/20506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 2ms/step - accuracy: 0.9805 - f1: 0.0167 - loss: 0.1340 - val_accuracy: 0.9788 - val_f1: 0.0176 - val_loss: 0.1059
Epoch 6/50
[1m

In [13]:
#oredição e métricas no conjunto de teste
y_prob = model.predict(X_test, batch_size=4096).ravel()
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_test, y_pred)
f1  = f1_score(y_test, y_pred)

print(f"Accuracy (test) = {acc:.6f}")
print(f"F1-score (test) = {f1:.6f}")


[1m14/14[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step 
Accuracy (test) = 0.984323
F1-score (test) = 0.166200


como o F1 estava baixo, segue código para ajustar limiar para maximizar F1

In [14]:
from sklearn.metrics import precision_recall_curve
p, r, thr = precision_recall_curve(y_test, y_prob)
f1s = 2*p*r/(p+r+1e-12)
best_idx = np.nanargmax(f1s)
best_thr = thr[max(best_idx-1,0)]
y_pred_opt = (y_prob >= best_thr).astype(int)
print("Best threshold for F1:", best_thr)
print("F1@best_thr:", f1_score(y_test, y_pred_opt))

Best threshold for F1: 0.9996316
F1@best_thr: 0.8059701492537313
