# Kaggle Fraud Dataset  
Source: [kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction](https://www.kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction/data)

| Feature | Description |
|---------|-------------|
|**step**| Unit of simulated time; 1 step is 1 hour of real-world time. There are 744 steps in total (described by the source as a 30-day simulation, although 744 hours is 31 days).|
|**type**| Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.|
|**amount**| Amount of the transaction in local currency.|
|**nameOrig**| Customer who initiated the transaction.|
|**oldbalanceOrg**| Originator's balance before the transaction.|
|**newbalanceOrig**| Originator's balance after the transaction.|
|**nameDest**| Customer who receives the transaction.|
|**oldbalanceDest**| Recipient's balance before the transaction. No information is available for recipients whose ID starts with M (merchants).|
|**newbalanceDest**| Recipient's balance after the transaction. No information is available for recipients whose ID starts with M (merchants).|
|**isFraud**| Marks transactions made by fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them: they transfer the funds to another account and then cash out of the system.|
|**isFlaggedFraud**| The business model aims to control massive transfers from one account to another and flags illegal attempts. In this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.|
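
The `isFlaggedFraud` rule described above can be sketched as a simple boolean mask. This is only an illustration on a small made-up frame: the rows, the `FLAG_THRESHOLD` constant and the `rule_flag` column are assumptions for the example, and whether the dataset's own `isFlaggedFraud` column follows this rule exactly is not checked here.

```python
import pandas as pd

# Threshold taken from the feature description (more than 200,000 per transfer).
FLAG_THRESHOLD = 200_000

# Synthetic rows for illustration only; they are not taken from Fraud.csv.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 181.0, 300_000.0, 9_839.64],
})

# Flag only TRANSFER rows whose amount exceeds the threshold.
toy["rule_flag"] = (
    (toy["type"] == "TRANSFER") & (toy["amount"] > FLAG_THRESHOLD)
).astype(int)

print(toy)
```

Comparing such a rule-based column with the dataset's `isFlaggedFraud` values would be one way to check how closely the description matches the data.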

In [24]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt, ticker as mticker

from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

from feature_engine import encoding as ce

from xgboost import XGBClassifier 

## Load data

In [17]:
df = pd.read_csv('Fraud.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


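| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|------|------|--------|----------|---------------|----------------|----------|----------------|----------------|---------|----------------|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |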


In [18]:
# Imbalance of target 
print(df['isFraud'].value_counts(dropna=False))
print(df['isFraud'].value_counts(normalize=True, dropna=False))

0    6354407
1       8213
Name: isFraud, dtype: int64
0    0.998709
1    0.001291
Name: isFraud, dtype: float64
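
With only about 0.13% positives, plain accuracy is a weak yardstick, and boosted-tree models are often given a class weight. As a rough sketch, the counts printed above can be turned into a majority-class baseline and a negative-to-positive ratio, which is the commonly suggested starting value for XGBoost's `scale_pos_weight` parameter. The variable names are illustrative, and the notebook itself does not say whether it uses this setting.

```python
# Class counts copied from the value_counts() output above.
n_legit = 6_354_407
n_fraud = 8_213
n_total = n_legit + n_fraud

# Accuracy of a model that always predicts "not fraud".
baseline_accuracy = n_legit / n_total

# Negative-to-positive ratio, a common starting point for
# XGBClassifier(scale_pos_weight=...).
scale_pos_weight = n_legit / n_fraud

print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"scale_pos_weight:  {scale_pos_weight:.1f}")
```

Because the always-negative baseline already scores close to 1.0 on accuracy, metrics such as precision, recall or average precision are usually more informative for this target.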


In [19]:
df.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [20]:
df.groupby('type')['type'].count()

type
CASH_IN     1399284
CASH_OUT    2237500
DEBIT         41432
PAYMENT     2151495
TRANSFER     532909
Name: type, dtype: int64

In [11]:
# Keep only the first two characters of the originator ID
df['nameOrig_tf'] = df['nameOrig'].str[:2]
df.groupby('nameOrig_tf')['nameOrig_tf'].count()

nameOrig_tf
C1    3290968
C2     766936
C3     327938
C4     329176
C5     330009
C6     329312
C7     330560
C8     328926
C9     328795
Name: nameOrig_tf, dtype: int64

In [12]:
# Keep only the first two characters of the recipient ID (C = customer, M = merchant)
df['nameDest_tf'] = df['nameDest'].str[:2]
df.groupby('nameDest_tf')['nameDest_tf'].count()

nameDest_tf
C1    2177198
C2     509697
C3     216675
C4     216903
C5     216484
C6     216051
C7     221434
C8     218371
C9     218312
M1    1113250
M2     258735
M3     111298
M4     111116
M5     111343
M6     110943
M7     111760
M8     111350
M9     111700
Name: nameDest_tf, dtype: int64

In [15]:
# Reload the raw data so the exploratory *_tf helper columns are not used as features
df = pd.read_csv('Fraud.csv')
X_train, X_test, y_train, y_test = train_test_split(df.drop(['isFraud', 'isFlaggedFraud'], axis=1),
                                                    df['isFraud'],
                                                    test_size=0.2,
                                                    stratify=df['isFraud'],
                                                    random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5090096, 9), (1272524, 9), (5090096,), (1272524,))

In [16]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

0    0.998709
1    0.001291
Name: isFraud, dtype: float64
0    0.998709
1    0.001291
Name: isFraud, dtype: float64


## Set up preprocessing pipeline

In [26]:
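def preprocessing(X):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()

    # Reduce the IDs to their two-character prefix (e.g. 'C1', 'M2')
    X['nameOrig'] = X['nameOrig'].str[:2]
    X['nameDest'] = X['nameDest'].str[:2]

    return X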

In [29]:
fraud_pipeline = Pipeline([

    # Transform nameOrig and nameDest into less granular features.
    # Each step must be a (name, transformer) tuple; listing the name and the
    # transformer as two separate items raises
    # "ValueError: too many values to unpack (expected 2)".
    ('orig_dest_transformation', FunctionTransformer(preprocessing)),

    # # categorical encoding
    # ('encoder_categorical', ce.OrdinalEncoder(encoding_method='ordered',
    #                                           variables=['nameOrig', 'nameDest'])),

    # Extreme Gradient Boosting model (not added yet)

])

In [30]:
X_train = fraud_pipeline.fit_transform(X_train, y_train)
X_test = fraud_pipeline.transform(X_test)
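
One way the remaining steps might be filled in is sketched below on the five `df.head()` rows shown earlier. It uses only scikit-learn, which is an assumption made to keep the example self-contained: `OrdinalEncoder` inside a `ColumnTransformer` stands in for the `feature_engine` encoder, and `DecisionTreeClassifier` stands in for `XGBClassifier`. The names `shorten_ids`, `toy` and `sketch_pipeline` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier


def shorten_ids(X):
    """Return a copy with nameOrig/nameDest reduced to their two-character prefix."""
    X = X.copy()
    for col in ("nameOrig", "nameDest"):
        X[col] = X[col].str[:2]
    return X


# The five rows from df.head(), restricted to a few columns.
toy = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145",
                 "C840083671", "C2048537720"],
    "nameDest": ["M1979787155", "M2044282225", "C553264065",
                 "C38997010", "M1230701703"],
    "isFraud": [0, 0, 1, 1, 0],
})
X = toy.drop(columns="isFraud")
y = toy["isFraud"]

categorical = ["type", "nameOrig", "nameDest"]

sketch_pipeline = Pipeline([
    # Every step is a (name, estimator) tuple.
    ("shorten_ids", FunctionTransformer(shorten_ids)),
    # Encode the string columns as integers; pass numeric columns through.
    ("encode", ColumnTransformer(
        [("ordinal",
          OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
          categorical)],
        remainder="passthrough",
    )),
    # Stand-in model; XGBClassifier could be placed here instead.
    ("model", DecisionTreeClassifier(random_state=42)),
])

sketch_pipeline.fit(X, y)
pred = sketch_pipeline.predict(X)
print(pred.tolist())
```

With the model as the last step, the pipeline is fitted once with `fit` and then used through `predict`, so the separate `fit_transform` and `transform` calls on the training and test sets are no longer needed.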