# Splitting Data

Before any preprocessing, I set aside a test set to assess the model's effectiveness. The model must generalize well and avoid both underfitting, where it performs poorly on training and validation data alike, and overfitting, where it adapts too closely to the training data and struggles with new data. Splitting before standardization ensures that the test data remains unseen during training and validation, which simulates a real-world scenario.
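As a minimal sketch of why the split comes first, the snippet below fits a `StandardScaler` on the training part only and reuses it on the test part. The data is synthetic and purely illustrative, and `stratify=y` is an optional assumption that the split used in this notebook does not make.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the fraud dataset)
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)  # roughly 5% positives

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training part only, then reuse it on the test part
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately 0 by construction
print(X_te_scaled.mean(axis=0))  # close to 0, but generally not exactly 0
```

The test statistics never influence the scaler, so the test set behaves like data the model has not seen.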

In [29]:

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow 
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.io as pio
import plotly.offline as pyo
from plotly.subplots import make_subplots
import plotly as ply

# Data visualization
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import missingno

# Statistical tests
from scipy.stats import chi2_contingency, mannwhitneyu
import scipy.stats as stats

# Machine learning models and utilities
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.metrics import (roc_curve, auc, confusion_matrix, log_loss, roc_auc_score,
                             precision_score, recall_score, f1_score, make_scorer)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

# Classifiers and ensemble methods
from imblearn.ensemble import BalancedRandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
file_path = "data/fraud_data.xlsx"

df = pd.read_excel(file_path)
df.head()

Unnamed: 0,score_1,score_2,score_3,score_4,score_5,score_6,pais,score_7,produto,categoria_produto,score_8,score_9,score_10,entrega_doc_1,entrega_doc_2,entrega_doc_3,data_compra,valor_compra,score_fraude_modelo,fraude
0,4,0.7685,94436.24,20.0,0.444828,1.0,BR,5,Máquininha Corta Barba Cabelo Peito Perna Pelo...,cat_8d714cd,0.883598,240.0,102.0,1,,N,2020-03-27 11:51:16,5.64,66,0
1,4,0.755,9258.5,1.0,0.0,33.0,BR,0,Avental Descartavel Manga Longa - 50 Un. Tnt ...,cat_64b574b,0.376019,4008.0,0.0,1,Y,N,2020-04-15 19:58:08,124.71,72,0
2,4,0.7455,242549.09,3.0,0.0,19.0,AR,23,Bicicleta Mountain Fire Bird Rodado 29 Alumini...,cat_e9110c5,0.516368,1779.0,77.0,1,,N,2020-03-25 18:13:38,339.32,95,0
3,4,0.7631,18923.9,50.0,0.482385,18.0,BR,23,Caneta Delineador Carimbo Olho Gatinho Longo 2...,cat_d06e653,0.154036,1704.0,1147.0,1,,Y,2020-04-16 16:03:10,3.54,2,0
4,2,0.7315,5728.68,15.0,0.0,1.0,BR,2,Resident Evil Operation Raccoon City Ps3,cat_6c4cfdc,0.855798,1025.0,150.0,1,,N,2020-04-02 10:24:45,3.53,76,0


In [3]:
def split_df(df):
    df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
    return df_train, df_test

df_train, df_test = split_df(df)

X_train = df_train.drop('fraude', axis=1)
y_train = df_train['fraude']

X_test = df_test.drop('fraude', axis=1)
y_test = df_test['fraude']

# Baseline Model

First, we look at the score distribution of the current model. The model is far from ideal: the scores of fraudulent and legitimate transactions overlap substantially.

In [4]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=df[df['fraude'] == 0]['score_fraude_modelo'],
                            nbinsx=50, name='Not Fraud', marker=dict(color='blue'),
                            opacity=0.7))

fig.add_trace(go.Histogram(x=df[df['fraude'] == 1]['score_fraude_modelo'],
                            nbinsx=50, name='Fraud', marker=dict(color='red'),
                            opacity=0.7))

fig.update_layout(title="Score Distribution by Fraud Label",
                  xaxis_title="Score",
                  yaxis_title="Percent",
                  barmode='overlay')

# Normalize each histogram so the two classes are comparable despite the imbalance
fig.update_traces(histnorm='percent')

fig.show()  


We will now assess other metrics such as AUC, confusion matrix, and the like.

* AUC (Area Under the Curve): AUC is a metric that evaluates the performance of a classification model, particularly for binary classification. It represents the area under the Receiver Operating Characteristic (ROC) curve, providing a measure of how well the model distinguishes between classes. A higher AUC indicates better model performance.

* Confusion Matrix: A confusion matrix is a table that summarizes the performance of a classification algorithm. It shows the count of true positive, true negative, false positive, and false negative predictions. The diagonal elements represent correct predictions, while off-diagonal elements indicate misclassifications. The matrix is a valuable tool for understanding the model's accuracy, precision, recall, and other performance metrics.
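To make the two definitions concrete, here is a small sketch on made-up labels and scores (the arrays and the cut-off of 50 are illustrative assumptions, not values from the dataset). It computes the AUC from the raw scores and the confusion matrix from thresholded predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and scores (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([10, 40, 55, 80, 60, 90])

# AUC uses the raw scores: it is the share of (fraud, non-fraud) pairs
# in which the fraud case receives the higher score
auc_value = roc_auc_score(y_true, scores)

# The confusion matrix needs hard predictions, so a threshold is required
y_pred = (scores >= 50).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(auc_value)       # 7 of the 8 pairs are ordered correctly -> 0.875
print(tn, fp, fn, tp)  # 2 2 0 2
```

Note that the AUC does not depend on any threshold, while the confusion matrix changes with every threshold choice.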

In [5]:
fpr, tpr, thresholds = roc_curve(df['fraude'], df['score_fraude_modelo'])
# Store the result under a new name so the imported `auc` function is not overwritten
roc_auc = auc(fpr, tpr)
print("ROC AUC: ", roc_auc)

ROC AUC:  0.726275487251462


In [6]:
fig = px.line(x=fpr, y=tpr, labels={"x": "False Positive Rate", "y": "True Positive Rate"},
              title='ROC Curve',
              line_shape='linear')

fig.add_shape(type='line', line=dict(dash='dash', color='red'),
              x0=0, x1=1, y0=0, y1=1)

fig.add_annotation(x=0.5, y=0.5, text=f'AUC = {roc_auc:.2f}', showarrow=False)

fig.show()


Finally, we find the threshold that maximizes profit and plot the related metrics. The purchase value is assumed to be expressed in a single currency unit.
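The profit rule implemented in the next cell can be written compactly. Here $s_i$ is `score_fraude_modelo`, $v_i$ is `valor_compra`, $y_i$ is `fraude`, and a transaction is blocked when $s_i \ge t$. The 10% margin on legitimate purchases is the assumption stated in the code.

$$
\text{profit}(t) \;=\; 0.10 \sum_{i:\, s_i < t,\; y_i = 0} v_i \;-\; \sum_{i:\, s_i < t,\; y_i = 1} v_i
$$

The first sum is the revenue from approved legitimate transactions and the second is the loss from approved fraudulent ones.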

In [7]:
def calculate_profit_metrics(decision_df, blocked_col, target_col, amount_col):
    """
    Calculates fraud losses, revenues, and profit based on the given decision dataframe.

    Parameters:
    - decision_df (pd.DataFrame): Dataframe containing decision data
    - blocked_col (str): Name of the column indicating if a transaction was blocked
    - target_col (str): Name of the column indicating if a transaction was a fraud
    - amount_col (str): Name of the column indicating the transaction amount

    Returns:
    pd.Series: Series containing summed values of fraud losses, revenues, and profit
    """

    # Cast to boolean so that `~` is a logical NOT even when the columns hold 0/1 integers
    blocked = decision_df[blocked_col].astype(bool)
    is_fraud = decision_df[target_col].astype(bool)

    # Fraud losses: Amount lost due to undetected fraud (not blocked but is fraud)
    decision_df["fraud_losses"] = (~blocked & is_fraud) * decision_df[amount_col]

    # Revenues: Amount earned from legitimate transactions (not blocked and not fraud)
    decision_df["revenues"] = (
        (~blocked & ~is_fraud) * decision_df[amount_col] * 0.1
    )  # Assuming 10% revenue from legitimate transactions

    # Profit: Revenues minus Fraud losses
    decision_df["profit"] = decision_df["revenues"] - decision_df["fraud_losses"]

    return decision_df[["fraud_losses", "revenues", "profit"]].sum()


target_col = "fraude"
prediction_col = "score_fraude_modelo"
amount_col = "valor_compra" 
blocked_col = "blocked"

possible_thresholds = np.arange(1, 100, 1)
all_decisions = []

for threshold in possible_thresholds:
    # Block every transaction whose score reaches the threshold
    decisions = df_test.assign(blocked=df_test[prediction_col] >= threshold)
    all_decisions.append(
        calculate_profit_metrics(decisions, blocked_col, target_col, amount_col)
    )

threshold_evaluation = (
    pd.concat(all_decisions, axis=1, keys=list(possible_thresholds))
    .T.rename_axis("threshold")
    .reset_index()
)
threshold_evaluation

Unnamed: 0,threshold,fraud_losses,revenues,profit
0,1,1181.23,3242.109,2060.879
1,2,1890.99,3917.883,2026.893
2,3,1969.12,4829.258,2860.138
3,4,2288.36,5726.876,3438.516
4,5,2692.96,6743.685,4050.725
...,...,...,...,...
94,95,84125.99,111680.516,27554.526
95,96,87603.24,112932.551,25329.311
96,97,90925.22,113735.367,22810.147
97,98,94238.31,114322.705,20084.395


In [8]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=threshold_evaluation['threshold'], y=threshold_evaluation['fraud_losses'],
                         mode='lines', name='Fraud Losses'))

fig.add_trace(go.Scatter(x=threshold_evaluation['threshold'], y=threshold_evaluation['revenues'],
                         mode='lines', name='Revenues'))

fig.add_trace(go.Scatter(x=threshold_evaluation['threshold'], y=threshold_evaluation['profit'],
                         mode='lines', name='Profit'))

fig.update_layout(title='Profit Metrics vs. Threshold',
                  xaxis_title='Threshold',
                  yaxis_title='Amount',
                  legend_title='Metrics')

fig.show()


In [9]:
best_threshold = threshold_evaluation.loc[threshold_evaluation["profit"].idxmax(), "threshold"]
best_decision_anterior = calculate_profit_metrics(df_test.assign(blocked=lambda df_test: df_test[prediction_col] >= best_threshold), blocked_col,
                                                target_col, amount_col)

## print the results of the best threshold in a dataframe
best_decision = best_decision_anterior.to_frame().T
best_decision["threshold"] = best_threshold
best_decision.T.rename_axis("metric").rename(columns={0: "$"}).reset_index()

Unnamed: 0,metric,$
0,fraud_losses,25353.32
1,revenues,80329.995
2,profit,54976.675
3,threshold,73.0


In [10]:
## compute the profit ratio (profit as a share of revenue)
profit_ratio = best_decision["profit"] / best_decision["revenues"]


With the current model, about 25 thousand is lost to fraud, revenue is about 80 thousand, and the resulting profit is about 55 thousand. Note, however, that these figures are computed on the test sample only and do not reflect the entire dataset.

In [11]:
def plot_confusion_matrix(cm):
    """
    Plots a confusion matrix.

    Parameters:
    - cm (np.array): Confusion matrix
    """

    colorscale = [[0, 'lightblue'], [1, 'darkblue']]

    fig = go.Figure(data=go.Heatmap(z=cm, colorscale=colorscale,
                                    x=['Predicted 0', 'Predicted 1'],
                                    y=['True 0', 'True 1'],
                                    hoverongaps=False,
                                    text=cm,
                                    hoverinfo="text"))

    fig.update_layout(title="Confusion Matrix",
                    xaxis_title="Predicted Labels",
                    yaxis_title="True Labels",
                    width=400,  # Set the width of the plot
                    height=400)  # Set the height of the plot

    # Add annotations
    for i in range(len(cm)):
        for j in range(len(cm[0])):
            fig.add_annotation(x=j, y=i, text=str(cm[i][j]),
                            font=dict(color='white', size=12),
                            showarrow=False)

    fig.show()


df_test = df_test.assign(predicted=(df_test['score_fraude_modelo'] >= best_threshold).astype(int))
cm = confusion_matrix(df_test['fraude'], df_test['predicted'])
plot_confusion_matrix(cm)

In [12]:
## get the number of true positives, false positives, true negatives, and false negatives
tn, fp, fn, tp = cm.ravel()

## calculate fraud rate (share of fraud among approved transactions) and approval rate
fraud_rate = round(fn / (fn + tn), 2)
approval_rate = round((fn + tn) / (tp + fp + tn + fn), 2)

## print the results
print("Fraud rate: ", fraud_rate)
print("Approval rate: ", approval_rate)

Fraud rate:  0.02
Approval rate:  0.74


As seen above, our **fraud rate** is 2%, and the **approval rate** is 74%!
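For reference, the two rates computed in the cell above follow directly from the confusion matrix. Approved transactions are those predicted as 0, that is, the false negatives plus the true negatives:

$$
\text{fraud rate} = \frac{FN}{FN + TN}, \qquad
\text{approval rate} = \frac{FN + TN}{TP + FP + TN + FN}
$$

So the fraud rate here is the share of fraud among approved transactions, not the share of fraud in the whole sample.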

Now, let's explore additional metrics of the model, such as log loss, AUC, and others.

In [13]:
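## calculate log loss, AUC, precision, recall, and F1 score
## note: log loss and AUC are computed here on the thresholded 0/1 predictions, not on the raw scores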
log_loss_score = log_loss(df_test['fraude'], df_test['predicted'])
auc_score = roc_auc_score(df_test['fraude'], df_test['predicted'])

precision = precision_score(df_test['fraude'], df_test['predicted'])
recall = recall_score(df_test['fraude'], df_test['predicted'])
f1 = f1_score(df_test['fraude'], df_test['predicted'])

## print the results
print("Log loss: ", log_loss_score)
print("AUC: ", auc_score)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

Log loss:  8.61203024977306
AUC:  0.7192877255456416
Precision:  0.13430315625405898
Recall:  0.6727391021470397
F1 score:  0.22390645300996104
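Because log loss and AUC are designed for probabilities or scores, feeding them hard 0/1 predictions changes what they measure. The sketch below illustrates the effect on made-up values (the arrays are illustrative assumptions, not data from this notebook).

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([0, 0, 0, 1, 1])
proba = np.array([0.10, 0.30, 0.45, 0.40, 0.80])

# Thresholding at 0.5 turns the probabilities into hard labels
hard = (proba >= 0.5).astype(float)

loss_proba = log_loss(y_true, proba)
loss_hard = log_loss(y_true, hard)    # each wrong hard label is penalized as a near-certain mistake
auc_proba = roc_auc_score(y_true, proba)
auc_hard = roc_auc_score(y_true, hard)  # ties between 0/1 labels discard ranking information

print(loss_proba, loss_hard)
print(auc_proba, auc_hard)
```

In this toy case the log loss on hard labels is far larger than on probabilities, and the AUC drops because the ranking within each predicted class is lost.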


# Preprocessing

## Notes
* The column `valor_compra` refers to the purchase amount and is in a single unit (e.g., Dollar).
* There is no additional fraud cost beyond what has been mentioned.
* None of the columns introduced into the model should cause data leakage - meaning all this data is calculated/received before the "Fraud" event occurs.

## For preprocessing, I chose to:
* Exclude the `score_fraude_modelo` column, which is the baseline model and should not be used.
* Exclude the `data_compra` column to avoid degrading the model over time.
* Exclude the `produto` column due to its high cardinality (more than 8 thousand categories).
* Retain the categories in `categoria_produto` that together account for 80% of fraud cases (837 categories) and group the rest as "Outros".
* Limit the `pais` column to BR and AR (which together make up more than 90% of the entire distribution) and group the others as "Outros".
* Fill missing score values with the median, as they do not follow a normal distribution.
* Create a feature `is_missing` indicating which values of `entrega_doc_2` are null.
* Treat null values in `entrega_doc_2` as 0, meaning not delivered.
* Apply target encoding to the variable `categoria_produto` due to its high cardinality.
* Apply one-hot encoding to the remaining categorical variables.
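Target encoding replaces each category with the mean of the target for that category, which can leak label information if the mean is computed on the same rows it encodes. One common way to limit this is out-of-fold encoding, sketched below under stated assumptions: the helper `oof_target_encode` and the toy frames are hypothetical and are not the encoder used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train, test, col, target, n_splits=5, seed=42):
    """Return (train_encoded, test_encoded) Series for one categorical column."""
    global_mean = train[target].mean()
    train_enc = pd.Series(np.nan, index=train.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Category means come from the other folds only
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        train_enc.iloc[enc_idx] = train.iloc[enc_idx][col].map(means).to_numpy()

    # Test rows use means learned from the full training set, never test labels
    full_means = train.groupby(col)[target].mean()
    test_enc = test[col].map(full_means)

    # Categories unseen during fitting fall back to the global training mean
    return train_enc.fillna(global_mean), test_enc.fillna(global_mean)

# Toy data (hypothetical); the test frame has no target column at all
train = pd.DataFrame({"cat": list("aabbccaabb"),
                      "fraude": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]})
test = pd.DataFrame({"cat": ["a", "b", "zzz"]})

tr_enc, te_enc = oof_target_encode(train, test, "cat", "fraude")
print(te_enc.tolist())  # [0.25, 0.25, 0.4]
```

The unseen category `zzz` receives the global training mean of 0.4, and no test label is ever read.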

In [32]:
class ColumnDropper(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.drop(columns=['data_compra', 'produto', 'score_fraude_modelo', 'categoria_produto'])

class DataProcessor(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()

        # Create the 'is_missing' column: 1 for missing values, 0 otherwise
        X_copy['is_missing'] = X_copy['entrega_doc_2'].isnull().astype(int)

        # Treat missing values as 'N', then convert 'Y' to 1 and 'N' to 0
        X_copy['entrega_doc_2'] = X_copy['entrega_doc_2'].fillna('N').map({'Y': 1, 'N': 0})

        # Keep BR and AR, group every other country as 'Outros'
        X_copy['pais'] = X_copy['pais'].map({'BR': 'BR', 'AR': 'AR'}).fillna('Outros')
        X_copy['entrega_doc_3'] = X_copy['entrega_doc_3'].map({'Y': 1, 'N': 0})

        return X_copy

class ScoreImputer(BaseEstimator, TransformerMixin):
    
    def __init__(self, strategy="median"):
        self.strategy = strategy
        self.imputers = {}
    
    def fit(self, X, y=None):
        cols = ['score_2', 'score_3', 'score_4', 'score_5', 'score_6', 'score_7', 'score_8', 'score_9', 'score_10']
        self.imputers = {col: SimpleImputer(strategy=self.strategy).fit(X[[col]]) for col in cols}
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        for col, imputer in self.imputers.items():
            X_copy[col] = imputer.transform(X_copy[[col]])
        return X_copy

class OneHotFeatureEncoder(BaseEstimator, TransformerMixin):
    
    def __init__(self, columns=None):
        self.columns = columns
        # `sparse_output` replaces the deprecated `sparse` argument (scikit-learn >= 1.2)
        self.encoder = OneHotEncoder(sparse_output=False)
    
    def fit(self, X, y=None):
        if self.columns is None:
            self.columns = X.select_dtypes(include=['object']).columns.tolist()
        self.encoder.fit(X[self.columns])
        return self
    
    def transform(self, X):
        onehot_data = self.encoder.transform(X[self.columns])
        columns_encoded = self.encoder.get_feature_names_out(self.columns)
        X = X.drop(self.columns, axis=1)
        X[columns_encoded] = onehot_data
        return X

class KFoldTargetEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, colnames='categoria_produto', target_name='fraude', n_fold=5, verbosity=True, discard_original_col=False):
        self.colnames = colnames
        self.target_name = target_name
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discard_original_col = discard_original_col

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):

        if self.target_name not in X.columns or self.colnames not in X.columns:
            raise ValueError("Both target_name and colnames should be columns in X")

        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        mean_of_target = X[self.target_name].mean()

        # Note: the encoding is the per-category target mean of the frame passed in;
        # no folds are used, so X must contain the target column
        col_mean_name = f"{self.colnames}_Kfold_Target_Enc"
        X[col_mean_name] = X.groupby(self.colnames)[self.target_name].transform('mean')
        X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)

        if self.discard_original_col:
            X = X.drop(self.colnames, axis=1)

        return X

def pipeline(model):
    
    # Build the pipeline
    pipe = Pipeline([
        ("dropper", ColumnDropper()),
        ("processor", DataProcessor()),
        ("imputer", ScoreImputer()),
        ("onehot", OneHotFeatureEncoder()),
        ('classifier', model)
    ])
    
    return pipe
    


In [27]:
def create_df_product_category(df, fraud_df, fraud_threshold=80):
    """
    Create df_product_category DataFrame with information about each category.

    Parameters:
    - df: DataFrame containing the original data
    - fraud_df: DataFrame containing fraud data
    - fraud_threshold: Cumulative percentage threshold for fraud

    Returns:
    - df_product_category: DataFrame with information about each category
    """

    # Count the amount of items for each category
    numbers_of_items_for_category = df['categoria_produto'].value_counts().reset_index()
    numbers_of_items_for_category.rename(columns={'count': 'amount_of_items'}, inplace=True)

    # Count the percentage of fraud for each category
    percentage_of_fraud_for_category = (fraud_df['categoria_produto'].value_counts(normalize=True) * 100).reset_index()
    percentage_of_fraud_for_category.rename(columns={'proportion': 'percentage_of_fraud'}, inplace=True)

    # Count the amount of fraud for each category
    amount_of_fraud_for_category = fraud_df['categoria_produto'].value_counts().reset_index()
    amount_of_fraud_for_category.rename(columns={'count': 'amount_of_frauds'}, inplace=True)

    # Merge amount of items, percentage of fraud, and amount of fraud for each category
    df_product_category = pd.merge(numbers_of_items_for_category, percentage_of_fraud_for_category, on='categoria_produto')
    df_product_category = pd.merge(df_product_category, amount_of_fraud_for_category, on='categoria_produto', how='left')

    # Add the cumulative sum of percent of fraud
    df_product_category['cumsum_%_frauds'] = df_product_category['percentage_of_fraud'].cumsum()

    # Filter the cumulative sum <= fraud_threshold
    df_product_category_filtered = df_product_category[df_product_category['cumsum_%_frauds'] <= fraud_threshold]

    print(len(df_product_category_filtered), 'categories represent', fraud_threshold, '% of fraud')

    return df_product_category_filtered


# Build a DataFrame with information about each category
df_product_category = create_df_product_category(df, df[df['fraude'] == 1])

# Keep the selected categories and mark the remaining ones as "Outros"
selected_categories = df_product_category['categoria_produto']

# Work on a copy so the original DataFrame stays untouched
df_copy = df.copy()
df_copy.loc[~df_copy["categoria_produto"].isin(selected_categories), "categoria_produto"] = "Outros"

# Split the DataFrame into training and test sets
df_train, df_test = split_df(df_copy)

# Target-encode the product category because of its high cardinality
target_encoder = KFoldTargetEncoder()
df_train = target_encoder.fit_transform(df_train)
df_test = target_encoder.transform(df_test)

# Separate features and labels for training
X_train = df_train.drop('fraude', axis=1)
y_train = df_train['fraude']

# Separate features and labels for testing
X_test = df_test.drop('fraude', axis=1)
y_test = df_test['fraude']


837 categories represent 80 % of fraud


# Training the models

Our goal is to maximize the ROC AUC score.

In [37]:
# Define the models
BRC = BalancedRandomForestClassifier(random_state=1234)
XGB = XGBClassifier(scale_pos_weight=19, random_state=1234)
LGB = LGBMClassifier(class_weight='balanced', random_state=1234)
DTC = DecisionTreeClassifier(class_weight='balanced', random_state=1234)

# Collect the models in a dictionary to make iteration easier
models = {
    "Balanced RF": BRC,
    "Light GBM": LGB,
    "XGBoost": XGB,
    "Decision Tree": DTC
}

results = []

# Cross-validate each model and collect the results
for name, model in models.items():
    pipe = pipeline(model)
    
    kfold = KFold(n_splits=5, random_state=42, shuffle=True)
    cv_results = cross_val_score(pipe, X_train, y_train, cv=kfold, scoring='roc_auc')
    results.append(cv_results)
    msg = f"{name}: {cv_results.mean():.4f} (±{cv_results.std():.4f})"
    print(msg)

# Turn the list of score arrays into a DataFrame
df_results = pd.DataFrame({'Model': [name for name in models.keys() for _ in range(5)],
                            'ROC AUC': [score for scores in results for score in scores]})

# Plot the model comparison with Plotly
fig = px.box(df_results, x='Model', y='ROC AUC',
             title='Classification Model Comparison', width=800, height=500)
fig.update_layout(xaxis=dict(tickangle=-45, tickmode='array', tickvals=list(models.keys())))
fig.show()


Balanced RF: 0.8515 (±0.0050)
[LightGBM] [Info] Number of positive: 4754, number of negative: 91246
[LightGBM] [Info] Total Bins 2411
[LightGBM] [Info] Number of data points in the train set: 96000, number of used features: 19
[LightGBM] [Info] Number of positive: 4741, number of negative: 91259
[LightGBM] [Info] Total Bins 2412
...

Among the tested models, the `Decision Tree` performed the worst. `Balanced RF` and `Light GBM` achieved the best results, with `XGBoost` showing similar performance. I chose to proceed with `Light GBM` due to its consistency and fast processing.

# Hyperparameter Tuning

I will use `RandomizedSearchCV` for hyperparameter tuning. Unlike the traditional grid search (`GridSearchCV`), which evaluates every combination of hyperparameters, `RandomizedSearchCV` samples a fixed number of combinations at random from the specified search space.
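As a minimal sketch of the idea, the snippet below runs `RandomizedSearchCV` with ROC AUC scoring on synthetic data. A `DecisionTreeClassifier` stands in for the chosen model only so that the example depends on scikit-learn alone; the search space, `n_iter=10`, and `cv=3` are illustrative assumptions, not the settings used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (hypothetical)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Distributions to sample from, rather than a fixed grid
param_distributions = {
    "max_depth": randint(2, 10),         # integers from 2 to 9
    "min_samples_leaf": randint(1, 50),  # integers from 1 to 49
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # only 10 sampled combinations are evaluated
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is what makes the approach practical for models with many hyperparameters.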