# DETECTION OF FRAUDULENT BANK CARD TRANSACTIONS USING MACHINE LEARNING

## `Master's Degree in Data Science`

# <font color='orange'>CUNEF</font>

## MASTER'S THESIS (TFM) ASSIGNMENT

 ### Ignacio González García-Valdés

#### -------------------

#### Target Variable 

The target variable for this dataset is `is_fraud`.

#### Assignment Objective

The dataset appears to be synthetic, designed to mimic a financial institution's card transaction records. Based on the available data, the primary goal of this assignment is to detect fraudulent transactions.

The main objective is to develop a model that can assess the probability of fraud for a given transaction. To achieve this, a thorough analysis of the data is necessary, including profiling, feature engineering, variable selection, and transformation. These steps are essential to gain insights into the underlying patterns and characteristics of fraudulent transactions.

Careful exploration, feature engineering, and variable analysis should yield descriptors that let the model identify fraudulent transactions accurately and reliably.

#### Notebook Objective

This notebook, "02_Data_Preprocessing," prepares the dataframe for the subsequent modeling phase. It drops identifier-like columns, encodes the categorical variables, splits the data into training and test sets, balances the training set by undersampling, and scales the features so that the data meet the requirements of the chosen model.

#### --------------------

#### Libraries

In [1]:
# Standard library
import warnings

# Core data and plotting libraries
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling utilities
import sklearn
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Third-party encoders and resampling
import category_encoders as ce
from category_encoders import TargetEncoder
from imblearn.under_sampling import RandomUnderSampler


pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)

sns.set(rc = {'figure.figsize':(20,10)})

#### Parametrisation

In [2]:
csv_path = '../Data/Processed/fraud_dataset_EDA.csv'
csv_outpath = '../Data/Processed/fraud_dataset_scaled.csv'
seed = 123456
sampling_strategy = 1
beta = 2
test_size = 0.30

#### Warnings

In [3]:
import warnings

#### --------------------

In [4]:
fraud_df = pd.read_csv(csv_path)
fraud_df

Unnamed: 0,cc_num,merchant,category,amt,city,state,zip,lat,long,city_pop,job,trans_num,unix_time,merch_lat,merch_long,is_fraud,trans_month,trans_dayofweek,trans_hour,trans_year,trans_week,cust_age
0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0,1,Tuesday,0,2019,1,30.0
1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0,1,Tuesday,0,2019,1,40.0
2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Malad City,ID,83252,42.1808,-112.2620,4154,Nature conservation officer,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0,1,Tuesday,0,2019,1,56.0
3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.00,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0,1,Tuesday,0,2019,1,51.0
4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0,1,Tuesday,0,2019,1,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1852389,30560609640617,fraud_Reilly and Sons,health_fitness,43.77,Luray,MO,63453,40.4931,-91.8912,519,Town planner,9b1f753c79894c9f4b71f04581835ada,1388534347,39.946837,-91.333331,0,12,Thursday,23,2020,53,54.0
1852390,3556613125071656,fraud_Hoppe-Parisian,kids_pets,111.84,Lake Jackson,TX,77566,29.0393,-95.4401,28739,Futures trader,2090647dac2c89a1d86c514c427f5b91,1388534349,29.661049,-96.186633,0,12,Thursday,23,2020,53,21.0
1852391,6011724471098086,fraud_Rau-Robel,kids_pets,86.88,Burbank,WA,99323,46.1966,-118.9017,3684,Musician,6c5b7c8add471975aa0fec023b2e8408,1388534355,46.658340,-119.715054,0,12,Thursday,23,2020,53,39.0
1852392,4079773899158,fraud_Breitenberg LLC,travel,7.99,Mesa,ID,83643,44.6255,-116.4493,129,Cartographer,14392d723bb7737606b2700ac791b7aa,1388534364,44.470525,-117.080888,0,12,Thursday,23,2020,53,55.0


### Data Preprocessing

- Categorical encoding
- Mean encoding
- Splitting the train and test data sets
- Undersampling the training set
- Feature scaling

In [5]:
fraud_df.dtypes.to_frame(name='Type').T.style.set_properties(**{'background-color': 'deepskyblue'})

Unnamed: 0,cc_num,merchant,category,amt,city,state,zip,lat,long,city_pop,job,trans_num,unix_time,merch_lat,merch_long,is_fraud,trans_month,trans_dayofweek,trans_hour,trans_year,trans_week,cust_age
Type,int64,object,object,float64,object,object,int64,float64,float64,int64,object,object,int64,float64,float64,int64,int64,object,int64,int64,int64,float64


In [6]:
cols = ['merchant', 'city', 'state', 'job', 'cc_num', 'trans_num', 'unix_time']

for col in cols:
    num_unique_values = fraud_df[col].nunique()
    print(f'Number of unique values for {col}: {num_unique_values}\n')


Number of unique values for merchant: 693

Number of unique values for city: 906

Number of unique values for state: 51

Number of unique values for job: 497

Number of unique values for cc_num: 999

Number of unique values for trans_num: 1852394

Number of unique values for unix_time: 1819583



In [7]:
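# Drop identifier-like columns: trans_num is unique per row, unix_time is nearly
# unique, and cc_num identifies the card rather than describing the transaction.
drop_cols = ['cc_num', 'trans_num', 'unix_time']

fraud_df = fraud_df.drop(columns=drop_cols)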
fraud_df.head()

Unnamed: 0,merchant,category,amt,city,state,zip,lat,long,city_pop,job,merch_lat,merch_long,is_fraud,trans_month,trans_dayofweek,trans_hour,trans_year,trans_week,cust_age
0,"fraud_Rippin, Kub and Mann",misc_net,4.97,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",36.011293,-82.048315,0,1,Tuesday,0,2019,1,30.0
1,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,49.159047,-118.186462,0,1,Tuesday,0,2019,1,40.0
2,fraud_Lind-Buckridge,entertainment,220.11,Malad City,ID,83252,42.1808,-112.262,4154,Nature conservation officer,43.150704,-112.154481,0,1,Tuesday,0,2019,1,56.0
3,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,47.034331,-112.561071,0,1,Tuesday,0,2019,1,51.0
4,fraud_Keeling-Crist,misc_pos,41.96,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,38.674999,-78.632459,0,1,Tuesday,0,2019,1,32.0


In [9]:
X = fraud_df.drop('is_fraud', axis=1)
Y = fraud_df['is_fraud']

#### One-hot encoding

In [10]:
cols_to_one_hot_encode = ['category', 'trans_dayofweek']

In [11]:
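# dtype=int keeps the indicator columns as 0/1 integers (newer pandas versions default to booleans).
X = pd.get_dummies(X, columns=cols_to_one_hot_encode, dtype=int)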

In [12]:
X.head()

Unnamed: 0,merchant,amt,city,state,zip,lat,long,city_pop,job,merch_lat,merch_long,trans_month,trans_hour,trans_year,trans_week,cust_age,category_entertainment,category_food_dining,category_gas_transport,category_grocery_net,category_grocery_pos,category_health_fitness,category_home,category_kids_pets,category_misc_net,category_misc_pos,category_personal_care,category_shopping_net,category_shopping_pos,category_travel,trans_dayofweek_Friday,trans_dayofweek_Monday,trans_dayofweek_Saturday,trans_dayofweek_Sunday,trans_dayofweek_Thursday,trans_dayofweek_Tuesday,trans_dayofweek_Wednesday
0,"fraud_Rippin, Kub and Mann",4.97,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",36.011293,-82.048315,1,0,2019,1,30.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
1,"fraud_Heller, Gutmann and Zieme",107.23,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,49.159047,-118.186462,1,0,2019,1,40.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,fraud_Lind-Buckridge,220.11,Malad City,ID,83252,42.1808,-112.262,4154,Nature conservation officer,43.150704,-112.154481,1,0,2019,1,56.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,"fraud_Kutch, Hermiston and Farrell",45.0,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,47.034331,-112.561071,1,0,2019,1,51.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,fraud_Keeling-Crist,41.96,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,38.674999,-78.632459,1,0,2019,1,32.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
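
To make the effect of `pd.get_dummies` concrete, the sketch below applies it to a small hypothetical frame (the toy values are illustrative and not taken from the dataset). It also shows one possible way to align a second frame to the reference columns with `reindex`. That alignment step is an assumption about how unseen data might be handled; this notebook encodes before splitting and so does not need it.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11],
    "category": ["misc_net", "grocery_pos", "misc_net"],
})

# One indicator column per distinct category value; dtype=int keeps 0/1 output.
encoded = pd.get_dummies(toy, columns=["category"], dtype=int)
print(encoded.columns.tolist())

# New data containing a category that was not present in the reference frame.
new = pd.DataFrame({"amt": [45.0], "category": ["travel"]})
new_encoded = pd.get_dummies(new, columns=["category"], dtype=int)

# Align to the reference columns: indicator columns for unseen categories
# are dropped, and missing indicator columns are filled with 0.
new_aligned = new_encoded.reindex(columns=encoded.columns, fill_value=0)
print(new_aligned)
```

Each row has exactly one active indicator per encoded variable, which is why the 14 `category_*` and 7 `trans_dayofweek_*` columns above are mostly zeros.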


#### Mean encoder

In [14]:
cols_to_mean_encode = ['merchant', 'city', 'state', 'job']

In [15]:
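# Replace each category with the fraud rate observed for it.
# Caution: the means are computed on the full dataset, before the train/test split,
# so test-set labels leak into these features and evaluation may look optimistic.
for col in cols_to_mean_encode:
    mean_encode = fraud_df.groupby(col)['is_fraud'].mean()
    X[col] = X[col].map(mean_encode)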

In [16]:
X.head()

Unnamed: 0,merchant,amt,city,state,zip,lat,long,city_pop,job,merch_lat,merch_long,trans_month,trans_hour,trans_year,trans_week,cust_age,category_entertainment,category_food_dining,category_gas_transport,category_grocery_net,category_grocery_pos,category_health_fitness,category_home,category_kids_pets,category_misc_net,category_misc_pos,category_personal_care,category_shopping_net,category_shopping_pos,category_travel,trans_dayofweek_Friday,trans_dayofweek_Monday,trans_dayofweek_Saturday,trans_dayofweek_Sunday,trans_dayofweek_Thursday,trans_dayofweek_Tuesday,trans_dayofweek_Wednesday
0,0.013575,4.97,0.003758,0.004521,28654,36.0788,-81.1781,3495,0.00332,36.011293,-82.048315,1,0,2019,1,30.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
1,0.009921,107.23,0.00216,0.00466,99160,48.8878,-118.2105,149,0.002472,49.159047,-118.186462,1,0,2019,1,40.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,0.001893,220.11,0.010884,0.004107,83252,42.1808,-112.262,4154,0.021534,43.150704,-112.154481,1,0,2019,1,56.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,0.002416,45.0,0.020188,0.004106,59632,46.2306,-112.1138,1939,0.005461,47.034331,-112.561071,1,0,2019,1,51.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,0.003057,41.96,0.004449,0.006538,24433,38.4207,-79.4629,99,0.004449,38.674999,-78.632459,1,0,2019,1,32.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
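
The mean-encoding loop in this section computes the category means from the full dataset, so each encoded value includes information from rows that later land in the test set. A minimal sketch of a train-only variant follows, on hypothetical toy data. The helper names `fit_mean_encoding` and `apply_mean_encoding`, and the choice to fall back to the global training fraud rate for unseen categories, are assumptions for illustration and not the notebook's method.

```python
import pandas as pd

def fit_mean_encoding(train_col: pd.Series, train_target: pd.Series):
    """Return (per-category target mean, global target mean) from training data only."""
    means = train_target.groupby(train_col).mean()
    return means, train_target.mean()

def apply_mean_encoding(col: pd.Series, means: pd.Series, fallback: float) -> pd.Series:
    """Map categories to their training mean; unseen categories get the fallback."""
    return col.map(means).fillna(fallback)

# Toy training and test frames (hypothetical values).
train = pd.DataFrame({"state": ["NC", "NC", "WA", "WA"], "is_fraud": [1, 0, 0, 0]})
test = pd.DataFrame({"state": ["NC", "ID"]})

# Learn the encoding from the training rows only, then apply it to both splits.
means, global_rate = fit_mean_encoding(train["state"], train["is_fraud"])
train_enc = apply_mean_encoding(train["state"], means, global_rate)
test_enc = apply_mean_encoding(test["state"], means, global_rate)
print(test_enc.tolist())
```

In the toy test frame, `ID` never appears in training, so it receives the global training rate instead of a missing value.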


#### Splitting the train and test data sets

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed, stratify=Y)
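
Passing `stratify=Y` keeps the fraud rate the same in both splits, which matters when the positive class is rare. The sketch below checks this on hypothetical toy labels (1,000 rows with 10% positives); the sizes are assumptions chosen to make the proportions easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels (hypothetical): 100 positives out of 1,000 rows.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=123456, stratify=y_toy
)

# Both splits should show a positive rate close to the overall 10%.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```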

In [20]:
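# Balance the training set by randomly undersampling the majority (non-fraud) class.
# Only the training split is resampled; the test split keeps its original class distribution.
rus = RandomUnderSampler(random_state=42)
X_train_res, Y_train_res = rus.fit_resample(X_train, Y_train)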

print('Data   : ', X_train_res.shape)
print('Labels : ', Y_train_res.shape)

Data   :  (13512, 37)
Labels :  (13512,)
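
With its default settings, `RandomUnderSampler` reduces the majority class to the size of the minority class, so the 13,512 rows reported here should correspond to 6,756 rows per class. The sketch below reproduces that idea with pandas only, on hypothetical toy data. The function `undersample_majority` is a hand-rolled illustration and not imblearn's implementation.

```python
import pandas as pd

def undersample_majority(X: pd.DataFrame, y: pd.Series, seed: int = 0):
    """Randomly sample every class down to the minority-class size (1:1 ratio)."""
    n_minority = int(y.value_counts().min())
    # Sample n_minority labels from each class, without replacement.
    sampled = y.groupby(y).sample(n=n_minority, random_state=seed)
    return X.loc[sampled.index], sampled

# Toy data (hypothetical): 2 fraud rows and 8 non-fraud rows.
X_toy = pd.DataFrame({"amt": range(10)})
y_toy = pd.Series([1, 1] + [0] * 8)

X_bal, y_bal = undersample_majority(X_toy, y_toy, seed=42)
print(y_bal.value_counts().to_dict())
```

Undersampling discards most of the non-fraud rows, which is the trade-off for a balanced training set.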


#### Min-Max Scaler

In [21]:
# Fit the scaler on the undersampled training data only, then reuse it on the test set.
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_res), columns=X_train_res.columns, index=X_train_res.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
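
Min-max scaling maps each feature to `(x - min) / (max - min)`, using the minimum and maximum learned from the data passed to `fit`. Because the scaler here is fitted on the training data only, test values outside the training range fall outside [0, 1]. The sketch below illustrates this with hypothetical toy numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test values 5 and 40 lie outside the training range and map to -0.25 and 1.5. If a bounded range is required downstream, `MinMaxScaler(clip=True)` is one option to consider.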