# XGBoost to predict fraud on cryptocurrency (Ethereum) transactions
### Guilherme Chaveiro Soares
This is a small project I developed in one of the classes of the Multivariate Data Analysis course in my Data Science postgraduate program.

Data extracted from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset/data  
The dataset contains rows of known fraud and valid transactions made over Ethereum.

In [None]:
# importing the libraries and functions
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import joblib
import sklearn
import xgboost
import kagglehub
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler #for scaling
from sklearn import metrics #for evaluation metrics
from sklearn.model_selection import GridSearchCV
from joblib import dump, load #for saving and loading the trained model
import warnings
warnings.filterwarnings('ignore')

### Importing dataset

In [None]:
#importing the dataset
path = kagglehub.dataset_download("vagifa/ethereum-frauddetection-dataset")
df = pd.read_csv(f"{path}/transaction_dataset.csv")

Using Colab cache for faster access to the 'ethereum-frauddetection-dataset' dataset.


In [None]:
df.shape

(9841, 51)

In [None]:
# target variable
df.FLAG.value_counts()  # the classes are imbalanced

FLAG,count
0,7662
1,2179
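
To put a number on the imbalance, here is a minimal sketch that uses only the two counts printed above. The variable names are illustrative and not part of the original notebook. The negatives-to-positives ratio is a common starting value for XGBoost's `scale_pos_weight`.

```python
# Class counts copied from the value_counts() output above
n_valid = 7662  # FLAG == 0
n_fraud = 2179  # FLAG == 1

# Share of fraudulent rows in the whole dataset
fraud_share = n_fraud / (n_valid + n_fraud)

# Common heuristic for scale_pos_weight: number of negatives / number of positives
suggested_scale_pos_weight = n_valid / n_fraud

print(f"fraud share: {fraud_share:.1%}")
print(f"negatives per positive: {suggested_scale_pos_weight:.2f}")
```

The ratio comes out at about 3.5, which falls between the values 1 and 5 tried in the grid search later on.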


In [None]:
# checking the data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9841 entries, 0 to 9840
Data columns (total 51 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   Unnamed: 0                                            9841 non-null   int64  
 1   Index                                                 9841 non-null   int64  
 2   Address                                               9841 non-null   object 
 3   FLAG                                                  9841 non-null   int64  
 4   Avg min between sent tnx                              9841 non-null   float64
 5   Avg min between received tnx                          9841 non-null   float64
 6   Time Diff between first and last (Mins)               9841 non-null   float64
 7   Sent tnx                                              9841 non-null   int64  
 8   Received Tnx                                          9841

### Selecting the variables

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Index,Address,FLAG,Avg min between sent tnx,Avg min between received tnx,Time Diff between first and last (Mins),Sent tnx,Received Tnx,Number of Created Contracts,...,ERC20 min val sent,ERC20 max val sent,ERC20 avg val sent,ERC20 min val sent contract,ERC20 max val sent contract,ERC20 avg val sent contract,ERC20 uniq sent token name,ERC20 uniq rec token name,ERC20 most sent token type,ERC20_most_rec_token_type
0,0,1,0x00009277775ac7d0d59eaad8fee3d10ac6c805e8,0,844.26,1093.71,704785.63,721,89,0,...,0.0,16831000.0,271779.92,0.0,0.0,0.0,39.0,57.0,Cofoundit,Numeraire
1,1,2,0x0002b44ddb1476db43c868bd494422ee4c136fed,0,12709.07,2958.44,1218216.73,94,8,0,...,2.260809,2.260809,2.260809,0.0,0.0,0.0,1.0,7.0,Livepeer Token,Livepeer Token
2,2,3,0x0002bda54cb772d040f779e88eb453cac0daa244,0,246194.54,2434.02,516729.3,2,10,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,,XENON
3,3,4,0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e,0,10219.6,15785.09,397555.9,25,9,0,...,100.0,9029.231,3804.076893,0.0,0.0,0.0,1.0,11.0,Raiden,XENON
4,4,5,0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89,0,36.61,10707.77,382472.42,4598,20,1,...,0.0,45000.0,13726.65922,0.0,0.0,0.0,6.0,27.0,StatusNetwork,EOS


In [None]:
df.columns = [col.lower() for col in df.columns] # turning the column names into lowercase

In [None]:
df.columns

Index(['unnamed: 0', 'index', 'address', 'flag', 'avg min between sent tnx',
       'avg min between received tnx',
       'time diff between first and last (mins)', 'sent tnx', 'received tnx',
       'number of created contracts', 'unique received from addresses',
       'unique sent to addresses', 'min value received', 'max value received ',
       'avg val received', 'min val sent', 'max val sent', 'avg val sent',
       'min value sent to contract', 'max val sent to contract',
       'avg value sent to contract',
       'total transactions (including tnx to create contract',
       'total ether sent', 'total ether received',
       'total ether sent contracts', 'total ether balance',
       ' total erc20 tnxs', ' erc20 total ether received',
       ' erc20 total ether sent', ' erc20 total ether sent contract',
       ' erc20 uniq sent addr', ' erc20 uniq rec addr',
       ' erc20 uniq sent addr.1', ' erc20 uniq rec contract addr',
       ' erc20 avg time between sent tnx', ' erc20 

In [None]:
# dropping the target (flag), the categorical columns and the identifier columns that carry no signal
drop_cols = ['unnamed: 0',
             'index',
             'address',
             ' erc20_most_rec_token_type',
             ' erc20 most sent token type',
             'flag']

In [None]:
unique_values = df.nunique()  # some columns are constant (a single unique value)
print(unique_values)

unnamed: 0                                              9841
index                                                   4729
address                                                 9816
flag                                                       2
avg min between sent tnx                                5013
avg min between received tnx                            6223
time diff between first and last (mins)                 7810
sent tnx                                                 641
received tnx                                             727
number of created contracts                               20
unique received from addresses                           256
unique sent to addresses                                 258
min value received                                      4589
max value received                                      6302
avg val received                                        6767
min val sent                                            4719
max val sent            

In [None]:
constant_cols = unique_values[unique_values == 1].index.tolist()
drop_cols.extend(constant_cols)

In [None]:
print(drop_cols)

['unnamed: 0', 'index', 'address', ' erc20_most_rec_token_type', ' erc20 most sent token type', 'flag', ' erc20 avg time between sent tnx', ' erc20 avg time between rec tnx', ' erc20 avg time between rec 2 tnx', ' erc20 avg time between contract tnx', ' erc20 min val sent contract', ' erc20 max val sent contract', ' erc20 avg val sent contract']
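
One detail worth keeping in mind about the constant-column check: `nunique()` ignores missing values by default, so a column holding a single value plus some NaNs is also counted as constant. A small sketch on a made-up toy frame (the column names `a` to `d` are assumptions for illustration):

```python
import pandas as pd

# Toy frame: 'b' and 'c' are constant, 'd' has one distinct value plus a missing entry
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 0, 0],
    "c": ["x", "x", "x"],
    "d": [5.0, None, 5.0],
})

n_unique = toy.nunique()  # dropna=True by default, so NaN is not counted
toy_constant_cols = n_unique[n_unique == 1].index.tolist()
print(toy_constant_cols)
```

If columns like `d` should be kept, `toy.nunique(dropna=False)` would count the missing value as its own category.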


In [None]:
features = df.columns[~df.columns.isin(drop_cols)].tolist()

In [None]:
df.loc[:, features].info()  # only a few columns have missing values (roughly 800 each)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9841 entries, 0 to 9840
Data columns (total 38 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   avg min between sent tnx                              9841 non-null   float64
 1   avg min between received tnx                          9841 non-null   float64
 2   time diff between first and last (mins)               9841 non-null   float64
 3   sent tnx                                              9841 non-null   int64  
 4   received tnx                                          9841 non-null   int64  
 5   number of created contracts                           9841 non-null   int64  
 6   unique received from addresses                        9841 non-null   int64  
 7   unique sent to addresses                              9841 non-null   int64  
 8   min value received                                    9841

### Creating the Preprocessing and the ML Pipelines

In [None]:
class PipeSteps(BaseEstimator, TransformerMixin):
    def __init__(self, columns=[]):
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        X = X.copy()
        return X

In [None]:
class SelectColumns(PipeSteps):
    def transform(self, X):
        X = X.copy()
        return X[self.columns]

In [None]:
class FillData(PipeSteps):
    def fit(self, X, y = None):
        self.median = { col: X[col].median() for col in self.columns }
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].fillna(self.median[col])
        return X

In [None]:
class DataScaler(PipeSteps):
    def fit(self, X, y = None):
        self.scaler = StandardScaler()
        self.scaler.fit(X[self.columns])
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns] = self.scaler.transform(X[self.columns])
        return X

In [None]:
preprocessing_pipeline = Pipeline(
    [('feature_selection', SelectColumns(features)),
     ('fill_missing', FillData(features)),
     ('standard_scaling', DataScaler(features))]
)
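
For comparison, the fill-and-scale steps can also be approximated with scikit-learn's built-in transformers. This is only a sketch on toy arrays, not the pipeline used in this notebook. It assumes every column passed in is numeric, and it shows that the medians are learned on the training data and then reused on new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in stand-ins for the custom FillData and DataScaler steps
builtin_prep = Pipeline([
    ("fill_missing", SimpleImputer(strategy="median")),
    ("standard_scaling", StandardScaler()),
])

# Toy data: one missing value in training, one in the "new" row
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[np.nan, 20.0]])

builtin_prep.fit(X_train_toy)                       # medians and scaling learned here
X_test_prepared = builtin_prep.transform(X_test_toy)  # reused here, nothing is refit

learned_medians = builtin_prep.named_steps["fill_missing"].statistics_
print(learned_medians)
print(X_test_prepared)
```

The missing value in the new row is replaced by the training median of its column, which is the same behaviour `FillData` implements by storing the medians in `fit`.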

In [None]:
final_pipe = Pipeline(
    [('preprocessing', preprocessing_pipeline),
     ('learning', XGBClassifier(random_state = 42, eval_metric = 'auc', objective = 'binary:logistic'))]
)
# preprocessing and learning are kept in separate pipelines so another ML model can be swapped in later

In [None]:
param_grid = {
    'learning__n_estimators': [100, 300, 500],
    'learning__max_depth': [3, 5, 7],
    'learning__scale_pos_weight': [1, 5, 10]  # for imbalanced classes
}

grid_search = GridSearchCV(final_pipe, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)
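
To get a feel for the cost of this search before running it, the sketch below counts the candidate combinations with `ParameterGrid`. The grid is copied from the cell above; `n_folds` mirrors `cv=3`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so the snippet runs on its own
param_grid = {
    'learning__n_estimators': [100, 300, 500],
    'learning__max_depth': [3, 5, 7],
    'learning__scale_pos_weight': [1, 5, 10],
}

n_folds = 3  # matches cv=3 in GridSearchCV
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_folds  # cross-validation fits, before the final refit

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

After fitting, `grid_search.best_params_` and `grid_search.best_score_` report which combination won and its mean cross-validated ROC AUC.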

### Data Preparation, Fitting and Evaluating the Model

In [None]:
# Predictor variables
X = df[features]
# Target variable
y = df['flag']

In [None]:
# Splitting the dataset: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
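
Because the target is imbalanced, one option (not used in this notebook) is a stratified split, which keeps the fraud share the same in both parts. A minimal sketch on toy data with an assumed 80/20 class mix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 80 valid (0) and 20 fraud (1) rows
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(f"fraud share in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```

Applied here, that would mean adding `stratify=y` to the `train_test_split` call, which would also change the results reported below.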

In [None]:
X_train.shape

(7380, 38)

In [None]:
X_test.shape

(2461, 38)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
predictions = grid_search.predict(X_test)
predictions

array([1, 1, 0, ..., 0, 0, 0])

In [None]:
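# Note: roc_auc_score receives hard class labels here, so the value equals balanced accuracy.
# Passing grid_search.predict_proba(X_test)[:, 1] instead would give a threshold-free AUC.
auc_score = metrics.roc_auc_score(y_test, predictions)
print(f'AUC on test dataset: {auc_score:,.2%}')

AUC on test dataset: 97.95%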
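
A short sketch of why the input to `roc_auc_score` matters, using made-up labels and scores. With hard 0/1 predictions the ROC curve has a single operating point, and the area reduces to balanced accuracy. With probabilities it measures how well positives are ranked above negatives.

```python
import numpy as np
from sklearn import metrics

# Toy ground truth and toy predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.9])
labels = (scores >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

auc_from_scores = metrics.roc_auc_score(y_true, scores)
auc_from_labels = metrics.roc_auc_score(y_true, labels)
bal_acc = metrics.balanced_accuracy_score(y_true, labels)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

In this notebook the scores would come from `grid_search.predict_proba(X_test)[:, 1]`, the predicted probability of the fraud class.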


### Deploying the model

I used an example row provided in the data science postgraduate course I'm taking. I don't know where the instructor got it, but I'll use it here to test the saved model.

In [None]:
# Saving the model
dump(grid_search, 'xgb_final_model.joblib')

['xgb_final_model.joblib']

In [None]:
# Importing the new data
new_data = pd.read_csv('novos_dados.csv')
new_data

Unnamed: 0,avg min between sent tnx,avg min between received tnx,time diff between first and last (mins),sent tnx,received tnx,number of created contracts,unique received from addresses,unique sent to addresses,min value received,max value received,...,erc20 uniq sent addr.1,erc20 uniq rec contract addr,erc20 min val rec,erc20 max val rec,erc20 avg val rec,erc20 min val sent,erc20 max val sent,erc20 avg val sent,erc20 uniq sent token name,erc20 uniq rec token name
0,2570.59,3336.01,30572.7,8,3,0,2,4,0.1,40.0,...,0.0,1.0,600.0,600.0,600.0,0.0,0.0,0.0,0.0,1.0


In [None]:
loaded_model = load('xgb_final_model.joblib')

In [None]:
deployment = loaded_model.predict(new_data)

In [None]:
deployment  # 0 means the model classifies this example as probably not fraud, yay!

array([0])
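
The first step of the saved pipeline selects columns by exact name, so new data has to carry the same column names as `features`, including the leading spaces some of them have. Below is a sketch of a check that could run before `predict`. The helper `missing_features` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_features(frame: pd.DataFrame, expected: list[str]) -> list[str]:
    """Hypothetical helper: return the expected column names absent from frame."""
    return [col for col in expected if col not in frame.columns]

# A few real feature names from this dataset; note the leading space in the last one
expected_features = ["sent tnx", "received tnx", " total erc20 tnxs"]

# Toy new data where the leading space was lost, e.g. after manual editing of the CSV
toy_new_data = pd.DataFrame({
    "sent tnx": [8],
    "received tnx": [3],
    "total erc20 tnxs": [1],
})

missing = missing_features(toy_new_data, expected_features)
print(missing)
```

Also note that the joblib file stores references to the custom classes (`PipeSteps`, `SelectColumns`, `FillData`, `DataScaler`), so they must be defined or importable in the session that calls `load`.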

Notes to self: I still want to try the model on different data, apply PCA, treat the outliers differently, evaluate with more metrics, try SMOTE to balance the classes of the target variable, and fit the model with different train/test split proportions.
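
As a starting point for the "more metrics" item, here is a sketch on made-up labels showing the confusion matrix, precision and recall. In the notebook, `y_test` and `predictions` would take the place of the toy arrays.

```python
import numpy as np
from sklearn import metrics

# Toy labels and predictions (made up for illustration)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)

# Precision: how many flagged rows are really fraud
precision = metrics.precision_score(y_true_toy, y_pred_toy)
# Recall: how many fraud rows were caught
recall = metrics.recall_score(y_true_toy, y_pred_toy)

print(cm)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

For fraud detection, recall on the fraud class is often the number to watch, since a missed fraud usually costs more than a false alarm.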

In [None]:
%load_ext watermark
%watermark -v -m

Python implementation: CPython
Python version       : 3.12.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.6.105+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [None]:
%watermark --iversions

sklearn   : 1.6.1
joblib    : 1.5.2
kagglehub : 0.3.13
matplotlib: 3.10.0
xgboost   : 3.0.5
numpy     : 2.0.2
pandas    : 2.2.2

