# Introduction

Credit card fraud is a top priority for financial institutions because of the constant threat it poses to consumers and businesses. In Brazil alone, Serasa Experian reports that approximately 12.1 million people fell victim to financial fraud over the last 12 months, with total losses of 1.8 billion reais.

Detecting fraud effectively is a significant challenge because legitimate and fraudulent transactions often look alike. The wide range of purchase values and locations where fraud can occur makes pattern recognition harder still. This difficulty produces two kinds of errors: false positives, such as the preventive blocking of a legitimate consumer's card, and false negatives, where fraudulent transactions go undetected.

To address this issue, it is crucial to continually enhance security systems, adopt advanced detection technologies, and invest in more robust authentication methods. Collaboration between financial institutions, regulators, and technology companies plays a pivotal role in developing more effective solutions to mitigate the risks associated with credit card fraud.

# Objectives

The objective of this project is to conduct an exploratory data analysis and build machine learning models capable of accurately detecting fraudulent transactions. To achieve this, advanced data analysis and machine learning techniques will be employed to identify patterns and anomalies in the data, along with data balancing techniques. Additionally, it will be crucial to assess the effectiveness of the constructed models, both in terms of accuracy in fraud detection and the minimization of false positives.

# Business Understanding

Fraudulent transactions involve the illicit acquisition of goods and services using stolen payment information. As credit card usage continues to rise, both online and offline, the associated fraud also escalates. Businesses need to understand this threat, its various forms, and the metrics used to monitor and combat it.

Some Key Performance Indicators (KPIs) are important to understand:
* **Acceptance**: The volume of transactions accepted after authorization and screening.
* **Challenges**: Transactions flagged as potentially fraudulent and subject to manual review.
* **Denials**: Payment requests rejected by the acquirer or identified as fraud before processing.
* **Chargebacks**: Transactions identified as fraud by the acquirer or contested by the customer.
* **False Positives**: Legitimate customer transactions incorrectly blocked as fraud.

Effective fraud management hinges on monitoring these KPIs diligently. Elevated false positive rates, for instance, can lead to lost sales and frustrated customers. Therefore, beyond safeguarding against fraud, companies must ensure their fraud detection solutions don't deter legitimate customers.
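To make the false positive KPI concrete, the sketch below computes the share of legitimate transactions that were blocked. The function name `false_positive_rate` and the toy labels are illustrative assumptions, not part of the original project.

```python
import numpy as np


def false_positive_rate(y_true, blocked):
    """Share of legitimate transactions (label 0) that were blocked.

    y_true:  1 for confirmed fraud, 0 for legitimate.
    blocked: True when the transaction was denied as suspected fraud.
    """
    y_true = np.asarray(y_true)
    blocked = np.asarray(blocked, dtype=bool)
    legit = y_true == 0
    if legit.sum() == 0:
        return 0.0  # no legitimate transactions, so nothing to block wrongly
    return float((blocked & legit).sum() / legit.sum())


# Hypothetical toy data: three legitimate purchases, one of them blocked.
y_true = [0, 0, 0, 1, 1]
blocked = [True, False, False, True, False]
fpr = false_positive_rate(y_true, blocked)
print(f"False positive rate: {fpr:.2%}")
```

A lower value means fewer legitimate customers are turned away, which is the trade-off the paragraph above describes.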

For this project, the company is known to earn 10% of the purchase value for each correctly approved payment and to lose 100% of the value of each approved fraudulent payment. The following metrics are also relevant:

* **Fraud Rate**: fraudulent transactions approved / total transactions approved
* **Approval Rate**: total transactions approved / transactions received
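The two rates and the 10% / 100% payoff rule can be sketched together as below. This is a minimal illustration under stated assumptions: a transaction is treated as approved when `score_fraude_modelo` is below a cutoff, and the function name `business_metrics`, the `threshold` value, and the toy data are hypothetical. Only the column names come from the dataset.

```python
import pandas as pd


def business_metrics(df, threshold, margin=0.10):
    """Approval rate, fraud rate, and profit for a given score cutoff.

    Assumption: a transaction is approved when `score_fraude_modelo`
    is strictly below `threshold`.
    """
    approved = df[df["score_fraude_modelo"] < threshold]
    n_received = len(df)
    n_approved = len(approved)

    # Approval rate: approved transactions / received transactions.
    approval_rate = n_approved / n_received if n_received else 0.0
    # Fraud rate: approved frauds / approved transactions.
    fraud_rate = float(approved["fraude"].mean()) if n_approved else 0.0

    is_fraud = approved["fraude"] == 1
    # Earn `margin` of each legitimate approved purchase ...
    gain = margin * approved.loc[~is_fraud, "valor_compra"].sum()
    # ... and lose the full value of each approved fraud.
    loss = approved.loc[is_fraud, "valor_compra"].sum()

    return {
        "approval_rate": approval_rate,
        "fraud_rate": fraud_rate,
        "profit": float(gain - loss),
    }


# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "score_fraude_modelo": [10, 40, 80, 95],
    "valor_compra": [100.0, 50.0, 200.0, 30.0],
    "fraude": [0, 1, 0, 1],
})
metrics = business_metrics(toy, threshold=90)
print(metrics)
```

Sweeping `threshold` over the 0 to 100 range would show how approval rate, fraud rate, and profit move against each other.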



In [6]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow 

# Data visualization
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import missingno

# Statistical tests
from scipy.stats import chi2_contingency, mannwhitneyu
import scipy.stats as stats

# Machine learning models and utilities
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.metrics import (roc_curve, auc, confusion_matrix, log_loss, roc_auc_score,
                             precision_score, recall_score, f1_score, make_scorer)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

# Classifiers and ensemble methods
from imblearn.ensemble import BalancedRandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

In [11]:
file_path = "data/fraud_data.xlsx"

df = pd.read_excel(file_path)
df.head()

Unnamed: 0,score_1,score_2,score_3,score_4,score_5,score_6,pais,score_7,produto,categoria_produto,score_8,score_9,score_10,entrega_doc_1,entrega_doc_2,entrega_doc_3,data_compra,valor_compra,score_fraude_modelo,fraude
0,4.0,0.7685,94436.24,20.0,0.444828,1.0,BR,5.0,Máquininha Corta Barba Cabelo Peito Perna Pelo...,cat_8d714cd,0.883598,240.0,102.0,1.0,,N,2020-03-27 11:51:16,5.64,66.0,0.0
1,4.0,0.755,9258.5,1.0,0.0,33.0,BR,0.0,Avental Descartavel Manga Longa - 50 Un. Tnt ...,cat_64b574b,0.376019,4008.0,0.0,1.0,Y,N,2020-04-15 19:58:08,124.71,72.0,0.0
2,4.0,0.7455,242549.09,3.0,0.0,19.0,AR,23.0,Bicicleta Mountain Fire Bird Rodado 29 Alumini...,cat_e9110c5,0.516368,1779.0,77.0,1.0,,N,2020-03-25 18:13:38,339.32,95.0,0.0
3,4.0,0.7631,18923.9,50.0,0.482385,18.0,BR,23.0,Caneta Delineador Carimbo Olho Gatinho Longo 2...,cat_d06e653,0.154036,1704.0,1147.0,1.0,,Y,2020-04-16 16:03:10,3.54,2.0,0.0
4,2.0,0.7315,5728.68,15.0,0.0,1.0,BR,2.0,Resident Evil Operation Raccoon City Ps3,cat_6c4cfdc,0.855798,1025.0,150.0,1.0,,N,2020-04-02 10:24:45,3.53,76.0,0.0


In [12]:
print(df.shape)

(150000, 20)


## Dataset Dictionary

- **`score_1` to `score_10`**:
  - Description: Credit bureau scores.
  - Purpose: Used to assess the buyer's reliability.
  
- **`pais` (`Country`)**:
  - Description: Country where the purchase was made.

- **`produto` (`Product`)**:
  - Description: Specific item acquired on the e-commerce platform.

- **`categoria_produto` (`Product Category`)**:
  - Description: Classification of the product within the e-commerce platform.

- **`entrega_doc_1` to `entrega_doc_3` (`Document Delivery Indicators`)**:
  - Description: Indicators of document delivery at the account creation stage.
  - Values: 
    - **0** or **N**: Did not deliver.
    - **1** or **Y**: Delivered.
    - Blank: Considered as not delivered.

- **`score_fraude_modelo` (`Fraud Model Score`)**:
  - Description: Probability, provided by the current model, of a purchase being fraudulent.
  - Values: Ranges from 0 to 100. The closer to 100, the higher the model's confidence that the transaction is fraudulent.

- **`fraude` (`Fraud`)**:
  - Description: Verification of the authenticity of the purchase.
  - Values:
    - **0**: Legitimate transaction.
    - **1**: Fraudulent transaction.
  - Note: This information is confirmed a few days after the transaction to ensure accuracy.
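
The encoding rules listed for the document delivery indicators (0 or N, 1 or Y, blank treated as not delivered) could be applied as in the sketch below. The helper `normalize_doc_flags`, the `DELIVERED_VALUES` set, and the toy frame are illustrative assumptions. Only the column names come from the dataset.

```python
import pandas as pd

DOC_COLUMNS = ["entrega_doc_1", "entrega_doc_2", "entrega_doc_3"]
# Text forms treated as "delivered"; everything else, including blanks, is 0.
DELIVERED_VALUES = {"1", "1.0", "Y"}


def normalize_doc_flags(df, columns=DOC_COLUMNS):
    """Return a copy of `df` with the document indicators mapped to 0/1."""
    out = df.copy()
    for col in columns:
        # Compare on a cleaned text form so 1, 1.0, "Y" and "y" all match.
        as_text = out[col].astype(str).str.strip().str.upper()
        out[col] = as_text.isin(DELIVERED_VALUES).astype(int)
    return out


# Hypothetical toy rows mixing numeric, Y/N, and blank values.
toy = pd.DataFrame({
    "entrega_doc_1": [1.0, 1.0, 0.0],
    "entrega_doc_2": [None, "Y", "N"],
    "entrega_doc_3": ["N", "Y", None],
})
clean = normalize_doc_flags(toy)
print(clean)
```

Mapping all three columns to the same 0/1 encoding keeps later imputation and encoding steps from treating `N`, `0`, and blank as different categories.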
