<a href="https://colab.research.google.com/github/leandrofigueiraalmeida/BP-Rossmann-Sales-Model/blob/main/LFA_Fraudes_TransacoPagamentos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Challenge: Fraud Detection in Payment Transactions


### Developer: Leandro Figueira de Almeida
### Date: 09/07/2023
### Linkedin: https://www.linkedin.com/in/leandro-figueira-de-almeida/
### Phone: 19 - 98181 5364
### Email: leandro.figueira.almeida@gmail.com

# 1° What's the Business Problem

> You are a Data Scientist at **Hotmart**, whose mission is to transform content creators into entrepreneurs.

> Among its various tools for content creators, Hotmart offers a **Payment Gateway Service**.

> See that payment page where we put the **purchase transaction data**?

> Your goal is to build a **predictive machine that detects potentially fraudulent transactions**.



# 2° Exploratory Data Analysis

* Imports

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Libraries used to build the application
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

* Loading Data

In [3]:
# Load the data
df =  pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Repos/Deteccao_Fraude_Transacoes_Pagamento/fraud.csv')

* Checking Data Information

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
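The `+` in `memory usage: 534.0+ MB` indicates that pandas did not measure the strings stored in the `object` columns. The sketch below uses a toy frame (an illustration, not the real dataset; the values are copied from the sample rows) to show how `memory_usage(deep=True)` accounts for them.

```python
import pandas as pd

# Toy frame with one numeric column and one string (object) column.
toy = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0],
    "nameOrig": pd.Series(
        ["C1231006815", "C1666544295", "C1305486145"], dtype=object
    ),
})

# Shallow count: each object cell is counted as a pointer only.
shallow_bytes = int(toy.memory_usage(deep=False).sum())
# Deep count: also measures the Python string objects themselves.
deep_bytes = int(toy.memory_usage(deep=True).sum())

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

On the full frame, `df.info(memory_usage="deep")` should report the deep figure directly, at the cost of a slower call.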


* Reading Data

In [5]:
df.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


* Data Dimensions

In [6]:
print('Number of Rows: {}'.format(df.shape[0]))
print('Number of Cols: {}'.format(df.shape[1]))

Number of Rows: 6362620
Number of Cols: 11


* Data Types

In [7]:
df.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

* Check NA

In [8]:
# Count missing values in each column
df.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
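The cell above counts missing values only. Duplicate rows can be counted in the same spirit with `DataFrame.duplicated()`. This is a minimal sketch on a toy frame in which the second row is deliberately duplicated for illustration; it is not data from the real file.

```python
import pandas as pd

# Toy frame: the second row repeats the first on purpose.
toy = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 9839.64, 181.0],
})

# Missing values per column, as in the cell above.
missing_per_column = toy.isna().sum()
# duplicated() marks every repeat of an earlier row as True.
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```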

* Descriptive Statistics

In [9]:
# Descriptive statistics of the numeric variables
df.describe()

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


* Exploring the correlations of the variables with the target variable

In [10]:
# Check the correlation of each numeric variable with the target
# numeric_only=True skips the string columns (type, nameOrig, nameDest)
correlation = df.corr(numeric_only=True)
print(correlation["isFraud"].sort_values(ascending=False))


isFraud           1.000000
amount            0.076688
isFlaggedFraud    0.044109
step              0.031578
oldbalanceOrg     0.010154
newbalanceDest    0.000535
oldbalanceDest   -0.005885
newbalanceOrig   -0.008148
Name: isFraud, dtype: float64


* Exploring the Spearman correlations among all variables

In [11]:
df.corr(method='spearman', numeric_only=True)


,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.000836,-0.006145,-0.010716,-0.004526,-0.005315,0.020819,0.002122
amount,0.000836,1.0,0.047642,-0.070543,0.595401,0.670118,0.03606,0.002653
oldbalanceOrg,-0.006145,0.047642,1.0,0.80318,0.024034,-0.008188,0.03943,0.002463
newbalanceOrig,-0.010716,-0.070543,0.80318,1.0,0.044433,-0.094429,-0.028031,0.002662
oldbalanceDest,-0.004526,0.595401,0.024034,0.044433,1.0,0.935802,-0.017141,-0.001644
newbalanceDest,-0.005315,0.670118,-0.008188,-0.094429,0.935802,1.0,-0.005182,-0.001743
isFraud,0.020819,0.03606,0.03943,-0.028031,-0.017141,-0.005182,1.0,0.044109
isFlaggedFraud,0.002122,0.002653,0.002463,0.002662,-0.001644,-0.001743,0.044109,1.0
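The first correlation cell uses the default Pearson method, while the matrix above uses Spearman, which ranks the values first and so captures any monotonic relationship, not only linear ones. The sketch below contrasts the two on made-up data (`x` and `y` are illustrative columns, not taken from the dataset).

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association on the raw values.
pearson = toy.corr(method="pearson").loc["x", "y"]
# Spearman measures association between the ranks of the values.
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

With heavily skewed columns such as `amount` and the balance fields, the two methods can give noticeably different pictures, which is one reason to look at both.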


* Exploring Transaction Type "type"

In [12]:
# Explore the transaction type column "type"
print(df.type.value_counts())

CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64
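A natural follow-up is to count fraud cases per transaction type. This sketch uses only the five rows printed by `df.head()` earlier, so the resulting table illustrates the technique and says nothing about the full dataset.

```python
import pandas as pd

# The five rows shown by df.head() (type and isFraud columns only).
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
})

# Rows: transaction type. Columns: isFraud value. Cells: row counts.
fraud_by_type = pd.crosstab(sample["type"], sample["isFraud"])
print(fraud_by_type)
```

On the real frame, the equivalent call would be `pd.crosstab(df["type"], df["isFraud"])`, run before `type` is converted to integer codes.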


In [13]:
# Count the transactions of each type
# (named type_counts to avoid shadowing the built-in type())
type_counts = df["type"].value_counts()

# Dashboard

In [14]:
# Explore the transaction type "type" and build a donut chart
import plotly.express as px

# Named type_counts to avoid shadowing the built-in type()
type_counts = df["type"].value_counts()
transactions = type_counts.index
quantity = type_counts.values

figure = px.pie(values=quantity,
                names=transactions,
                hole=0.5,
                title="Distribution of Transaction Type")
figure.show()

* Install DATAPREP

In [15]:
# Install the package
!pip install dataprep


* DATAPREP Report

In [16]:
# Automated report
from dataprep.eda import create_report
create_report(df)


invalid value encountered in sqrt


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.



Dataset overview

0,1
Number of Variables,11
Number of Rows,6.3626e+06
Missing Cells,0
Missing Cells (%),0.0%
Duplicate Rows,0
Duplicate Rows (%),0.0%
Total Size in Memory,1.6 GB
Average Row Size in Memory,263.4 B
Variable Types,Numerical: 6  Categorical: 5

Dataset insights

0,1
amount is skewed,Skewed
oldbalanceOrg is skewed,Skewed
newbalanceOrig is skewed,Skewed
oldbalanceDest is skewed,Skewed
newbalanceDest is skewed,Skewed
nameOrig has a high cardinality: 6353307 distinct values,High Cardinality
nameDest has a high cardinality: 2722362 distinct values,High Cardinality
isFraud has constant length 1,Constant Length
isFlaggedFraud has constant length 1,Constant Length
oldbalanceOrg has 2102449 (33.04%) zeros,Zeros
newbalanceOrig has 3609566 (56.73%) zeros,Zeros
oldbalanceDest has 2704388 (42.5%) zeros,Zeros
newbalanceDest has 2439433 (38.34%) zeros,Zeros

Variable: step

0,1
Approximate Distinct Count,743
Approximate Unique (%),0.0%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,243.3972
Minimum,1
Maximum,743

0,1
Minimum,1
5-th Percentile,16
Q1,155
Median,238
Q3,334
95-th Percentile,483
Maximum,743
Range,742
IQR,179

0,1
Mean,243.3972
Standard Deviation,142.332
Variance,20258.39
Sum,1548600000.0
Skewness,0.3752
Kurtosis,0.3291
Coefficient of Variation,0.5848

Variable: type

0,1
Approximate Distinct Count,5
Approximate Unique (%),0.0%
Missing,0
Missing (%),0.0%
Memory Size,460796185

0,1
Mean,7.4224
Standard Deviation,0.532
Median,7.0
Minimum,5.0
Maximum,8.0

0,1
1st row,PAYMENT
2nd row,PAYMENT
3rd row,TRANSFER
4th row,CASH_OUT
5th row,PAYMENT

0,1
Count,43589101
Lowercase Letter,0
Space Separator,0
Uppercase Letter,43589101
Dash Punctuation,0
Decimal Number,0

Variable: amount

0,1
Approximate Distinct Count,5316900
Approximate Unique (%),83.6%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,179861.9035
Minimum,0
Maximum,9.2446e+07

0,1
Minimum,0.0
5-th Percentile,2387.1
Q1,13692.08
Median,76838.4084
Q3,211719.86
95-th Percentile,549310.3608
Maximum,92446000.0
Range,92446000.0
IQR,198027.78

0,1
Mean,179861.9035
Standard Deviation,603858.2315
Variance,364640000000.0
Sum,1144400000000.0
Skewness,30.9939
Kurtosis,1797.9553
Coefficient of Variation,3.3573

Variable: nameOrig

0,1
Approximate Distinct Count,6353307
Approximate Unique (%),99.9%
Missing,0
Missing (%),0.0%
Memory Size,480265340

0,1
Mean,10.4823
Standard Deviation,0.6041
Median,11.0
Minimum,5.0
Maximum,11.0

0,1
1st row,C1231006815
2nd row,C1666544295
3rd row,C1305486145
4th row,C840083671
5th row,C2048537720

0,1
Count,6362620
Lowercase Letter,0
Space Separator,0
Uppercase Letter,6362620
Dash Punctuation,0
Decimal Number,60332420

Variable: oldbalanceOrg

0,1
Approximate Distinct Count,1845844
Approximate Unique (%),29.0%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,833883.1041
Minimum,0
Maximum,5.9585e+07

0,1
Minimum,0.0
5-th Percentile,0.0
Q1,0.0
Median,14811.81
Q3,111571.9468
95-th Percentile,6146800.0
Maximum,59585000.0
Range,59585000.0
IQR,111571.9468

0,1
Mean,833883.1041
Standard Deviation,2888200.0
Variance,8341900000000.0
Sum,5305700000000.0
Skewness,5.2491
Kurtosis,32.9649
Coefficient of Variation,3.4636

Variable: newbalanceOrig

0,1
Approximate Distinct Count,2682586
Approximate Unique (%),42.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,855113.6686
Minimum,0
Maximum,4.9585e+07

0,1
Minimum,0.0
5-th Percentile,0.0
Q1,0.0
Median,0.0
Q3,153089.2068
95-th Percentile,6309000.0
Maximum,49585000.0
Range,49585000.0
IQR,153089.2068

0,1
Mean,855113.6686
Standard Deviation,2924000.0
Variance,8550100000000.0
Sum,5440800000000.0
Skewness,5.1769
Kurtosis,32.067
Coefficient of Variation,3.4195

Variable: nameDest

0,1
Approximate Distinct Count,2722362
Approximate Unique (%),42.8%
Missing,0
Missing (%),0.0%
Memory Size,480261705

0,1
Mean,10.4818
Standard Deviation,0.6048
Median,11.0
Minimum,2.0
Maximum,11.0

0,1
1st row,M1979787155
2nd row,M2044282225
3rd row,C553264065
4th row,C38997010
5th row,M1230701703

0,1
Count,6362620
Lowercase Letter,0
Space Separator,0
Uppercase Letter,6362620
Dash Punctuation,0
Decimal Number,60328785

Variable: oldbalanceDest

0,1
Approximate Distinct Count,3614697
Approximate Unique (%),56.8%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,1.1007e+06
Minimum,0
Maximum,3.5602e+08

0,1
Minimum,0.0
5-th Percentile,0.0
Q1,0.0
Median,139473.76
Q3,965166.31
95-th Percentile,5552400.0
Maximum,356020000.0
Range,356020000.0
IQR,965166.31

0,1
Mean,1100700.0
Standard Deviation,3399200.0
Variance,11554000000000.0
Sum,7003300000000.0
Skewness,19.9218
Kurtosis,948.6734
Coefficient of Variation,3.0882

Variable: newbalanceDest

0,1
Approximate Distinct Count,3555499
Approximate Unique (%),55.9%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Memory Size,101801920
Mean,1.225e+06
Minimum,0
Maximum,3.5618e+08

0,1
Minimum,0.0
5-th Percentile,0.0
Q1,0.0
Median,221960.7
Q3,1139300.0
95-th Percentile,5819500.0
Maximum,356180000.0
Range,356180000.0
IQR,1139300.0

0,1
Mean,1225000.0
Standard Deviation,3674100.0
Variance,13499000000000.0
Sum,7794200000000.0
Skewness,19.3523
Kurtosis,862.1558
Coefficient of Variation,2.9993

Variable: isFraud

0,1
Approximate Distinct Count,2
Approximate Unique (%),0.0%
Missing,0
Missing (%),0.0%
Memory Size,419932920

0,1
Mean,1
Standard Deviation,0
Median,1
Minimum,1
Maximum,1

0,1
1st row,0
2nd row,0
3rd row,1
4th row,1
5th row,0

0,1
Count,0
Lowercase Letter,0
Space Separator,0
Uppercase Letter,0
Dash Punctuation,0
Decimal Number,6362620

Variable: isFlaggedFraud

0,1
Approximate Distinct Count,2
Approximate Unique (%),0.0%
Missing,0
Missing (%),0.0%
Memory Size,419932920

0,1
Mean,1
Standard Deviation,0
Median,1
Minimum,1
Maximum,1

0,1
1st row,0
2nd row,0
3rd row,0
4th row,0
5th row,0

0,1
Count,0
Lowercase Letter,0
Space Separator,0
Uppercase Letter,0
Dash Punctuation,0
Decimal Number,6362620


# 3° Data Preparation

* Change types

In [17]:
# Convert the transaction type from string to an integer code
df["type"] = df["type"].map({"CASH_OUT": 1, "PAYMENT": 2,
                                 "CASH_IN": 3, "TRANSFER": 4,
                                 "DEBIT": 5})
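`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a new transaction type would silently become a missing value. The sketch below shows one way to detect that. The `TYPE_CODES` name and the `"REFUND"` type are hypothetical and used only for illustration.

```python
import pandas as pd

# Same codes as the mapping above, stored under a hypothetical name.
TYPE_CODES = {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}

# Toy frame; "REFUND" is a made-up type that the mapping does not cover.
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "REFUND"]})
codes = toy["type"].map(TYPE_CODES)

# Values that map() could not translate come back as NaN.
unmapped = toy.loc[codes.isna(), "type"].unique().tolist()
print("Unmapped types:", unmapped)
```

An empty list here means every row received a code. For the five types listed by `value_counts()` earlier, that is the expected outcome.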


In [18]:
# Map the target to readable labels so the output is easier to interpret
df["isFraud"] = df["isFraud"].map({0: "No Fraud", 1: "Fraud"})
print(df.head())

   step  type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1     2   9839.64  C1231006815       170136.0       160296.36   
1     1     2   1864.28  C1666544295        21249.0        19384.72   
2     1     4    181.00  C1305486145          181.0            0.00   
3     1     1    181.00   C840083671          181.0            0.00   
4     1     2  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest   isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0  No Fraud               0  
1  M2044282225             0.0             0.0  No Fraud               0  
2   C553264065             0.0             0.0     Fraud               0  
3    C38997010         21182.0             0.0     Fraud               0  
4  M1230701703             0.0             0.0  No Fraud               0  


* Displaying the Transformed Data

In [19]:
df

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,2,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,No Fraud,0
1,1,2,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,No Fraud,0
2,1,4,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,Fraud,0
3,1,1,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,Fraud,0
4,1,2,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,No Fraud,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,1,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,Fraud,0
6362616,743,4,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,Fraud,0
6362617,743,1,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,Fraud,0
6362618,743,4,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,Fraud,0


* Data Describe

In [21]:
# Summary statistics of each column
df.describe().T

,count,mean,std,min,25%,50%,75%,max
step,6362620.0,243.3972,142.332,1.0,156.0,239.0,335.0,743.0
type,6362620.0,2.055307,0.9808966,1.0,1.0,2.0,3.0,5.0
amount,6362620.0,179861.9,603858.2,0.0,13389.57,74871.94,208721.5,92445520.0
oldbalanceOrg,6362620.0,833883.1,2888243.0,0.0,0.0,14208.0,107315.2,59585040.0
newbalanceOrig,6362620.0,855113.7,2924049.0,0.0,0.0,0.0,144258.4,49585040.0
oldbalanceDest,6362620.0,1100702.0,3399180.0,0.0,0.0,132705.665,943036.7,356015900.0
newbalanceDest,6362620.0,1224996.0,3674129.0,0.0,0.0,214661.44,1111909.0,356179300.0
isFlaggedFraud,6362620.0,2.514687e-06,0.001585775,0.0,0.0,0.0,0.0,1.0


* Target Variable

In [22]:
# Inspect the distribution of the target
df.isFraud.value_counts()

No Fraud    6354407
Fraud          8213
Name: isFraud, dtype: int64
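These counts show a strong class imbalance. The arithmetic below, using only the two counts printed above, gives the fraud rate and the accuracy that a trivial model would reach by always answering "No Fraud". That second number is a useful baseline when reading accuracy figures later.

```python
# Counts copied from the value_counts() output above.
no_fraud = 6_354_407
fraud = 8_213
total = no_fraud + fraud

# Share of fraudulent transactions in the dataset.
fraud_rate = fraud / total
# Accuracy of a model that always predicts "No Fraud".
baseline_accuracy = no_fraud / total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'No Fraud': {baseline_accuracy:.4%}")
```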

In [23]:
# Separate the explanatory variables (x) from the target variable (y)
from sklearn.model_selection import train_test_split
x = np.array(df[["type", "amount", "oldbalanceOrg", "newbalanceOrig"]])
y = np.array(df["isFraud"])  # 1-D array, the shape scikit-learn expects for labels

# 4° Predictive Machine for Fraud Detection

* Data Training and Predictive Machine

In [24]:
# Train the predictive machine (a decision tree classifier)
from sklearn.tree import DecisionTreeClassifier
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)
model = DecisionTreeClassifier()
model.fit(xtrain, ytrain)
#print(model.score(xtest, ytest))
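The split above is purely random. With so few fraud rows, one option is to pass `stratify=y` so that the train and test sets keep the same fraud proportion. This is a sketch on synthetic data (the arrays are made up, with a 1% fraud rate chosen for illustration) and not the split used to produce the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10,000 rows, 4 features, 1% labelled "Fraud".
rng = np.random.default_rng(42)
x = rng.normal(size=(10_000, 4))
y = np.array(["Fraud"] * 100 + ["No Fraud"] * 9_900)

# stratify=y keeps the class proportions equal in both subsets.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.10, random_state=42, stratify=y
)

train_rate = float(np.mean(ytrain == "Fraud"))
test_rate = float(np.mean(ytest == "Fraud"))
print(f"Fraud share in train: {train_rate:.4f}, in test: {test_rate:.4f}")
```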

In [25]:
# Make predictions on the test data
y_pred = model.predict(xtest)

In [26]:
y_pred

array(['No Fraud', 'No Fraud', 'No Fraud', ..., 'No Fraud', 'No Fraud',
       'No Fraud'], dtype=object)

# 5° Evaluating the Predictive Machine

* Evaluate Model

In [27]:
# Evaluate the predictive machine (model)
print('Classification Report Metrics: \n', classification_report(ytest, y_pred))
print('Accuracy: \n', accuracy_score(ytest, y_pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, y_pred))

Classification Report Metrics: 
               precision    recall  f1-score   support

       Fraud       0.91      0.88      0.89       817
    No Fraud       1.00      1.00      1.00    635445

    accuracy                           1.00    636262
   macro avg       0.95      0.94      0.95    636262
weighted avg       1.00      1.00      1.00    636262

Accuracy: 
 0.999732814469511
Confusion Matrix: 
 [[   722     95]
 [    75 635370]]
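The metrics for the `Fraud` class can be recomputed by hand from the confusion matrix. scikit-learn orders the labels alphabetically (`Fraud`, then `No Fraud`), with true labels in rows and predictions in columns. The variable names below are illustrative.

```python
# Values copied from the confusion matrix printed above.
tp, fn = 722, 95        # true Fraud: detected / missed
fp, tn = 75, 635_370    # true No Fraud: false alarms / correctly cleared

# Of everything flagged as Fraud, how much really was fraud.
precision = tp / (tp + fp)
# Of all real fraud, how much was detected.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Share of all test rows classified correctly.
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.6f}")
```

The accuracy is dominated by the 635,445 legitimate rows, so the recall of 0.88 (95 of 817 fraud cases missed) is the more informative figure for this problem.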



### New Predictions

#### Case 1

In [29]:
# prediction
#features = [type, amount, oldbalanceOrg, newbalanceOrig]
features = np.array([[4, 9000.60, 9000.60, 0.0]])
print(model.predict(features))

['Fraud']


#### Case 2

In [30]:
# prediction
#features = [type, amount, oldbalanceOrg, newbalanceOrig]
features = np.array([[2, 5000, 5000, 0.0]])
print(model.predict(features))

['No Fraud']


#### Case 3

In [31]:
# prediction
#features = [type, amount, oldbalanceOrg, newbalanceOrig]
features = np.array([[1, 5000, 5000, 0.0]])
print(model.predict(features))

['No Fraud']
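The three cases above pass the transaction type as a bare integer, which is easy to get wrong. The sketch below builds the feature row from a readable type name instead. The helper `encode_transaction` and the constant `TYPE_CODES` are hypothetical names introduced here; the codes match the mapping applied in the data preparation step.

```python
# Same integer codes as the mapping used in data preparation.
TYPE_CODES = {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}


def encode_transaction(tx_type, amount, old_balance_orig, new_balance_orig):
    """Return one feature row: [type, amount, oldbalanceOrg, newbalanceOrig]."""
    if tx_type not in TYPE_CODES:
        raise ValueError(f"Unknown transaction type: {tx_type!r}")
    return [
        TYPE_CODES[tx_type],
        float(amount),
        float(old_balance_orig),
        float(new_balance_orig),
    ]


# Case 1 from above, written with the type name instead of the code 4.
features = [encode_transaction("TRANSFER", 9000.60, 9000.60, 0.0)]
print(features)
```

The resulting list can be passed to the trained model as `model.predict(np.array(features))`.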
