## This dataset contains rows of known fraud and valid transactions made over Ethereum.

### Task: EDA & Prediction of Fraud/Valid Transaction

Here is a description of the columns of the dataset:
* Index: the index number of a row
* Address: the address of the ethereum account
* FLAG: whether the transaction is fraud or not
* Avg min between sent tnx: Average time between sent transactions for account in minutes
* Avgminbetweenreceivedtnx: Average time between received transactions for account in minutes
* TimeDiffbetweenfirstand_last(Mins): Time difference between the first and last transaction
* Sent_tnx: Total number of sent normal transactions
* Received_tnx: Total number of received normal transactions
* NumberofCreated_Contracts: Total Number of created contract transactions
* UniqueReceivedFrom_Addresses: Total Unique addresses from which account received transactions
* UniqueSentTo_Addresses20: Total Unique addresses to which account sent transactions
* MinValueReceived: Minimum value in Ether ever received
* MaxValueReceived: Maximum value in Ether ever received
* AvgValueReceived5: Average value in Ether ever received
* MinValSent: Minimum value of Ether ever sent
* MaxValSent: Maximum value of Ether ever sent
* AvgValSent: Average value of Ether ever sent
* MinValueSentToContract: Minimum value of Ether sent to a contract
* MaxValueSentToContract: Maximum value of Ether sent to a contract
* AvgValueSentToContract: Average value of Ether sent to contracts
* TotalTransactions(IncludingTnxtoCreate_Contract): Total number of transactions
* TotalEtherSent: Total Ether sent for account address
* TotalEtherReceived: Total Ether received for account address
* TotalEtherSent_Contracts: Total Ether sent to Contract addresses
* TotalEtherBalance: Total Ether Balance following enacted transactions
* TotalERC20Tnxs: Total number of ERC20 token transfer transactions
* ERC20TotalEther_Received: Total ERC20 token received transactions in Ether
* ERC20TotalEther_Sent: Total ERC20 token sent transactions in Ether
* ERC20TotalEtherSentContract: Total ERC20 token transfer to other contracts in Ether
* ERC20UniqSent_Addr: Number of ERC20 token transactions sent to Unique account addresses
* ERC20UniqRec_Addr: Number of ERC20 token transactions received from Unique addresses
* ERC20UniqRecContractAddr: Number of ERC20 token transactions received from Unique contract addresses
* ERC20AvgTimeBetweenSent_Tnx: Average time between ERC20 token sent transactions in minutes
* ERC20AvgTimeBetweenRec_Tnx: Average time between ERC20 token received transactions in minutes
* ERC20AvgTimeBetweenContract_Tnx: Average time between ERC20 token contract transactions
* ERC20MinVal_Rec: Minimum value in Ether received from ERC20 token transactions for account
* ERC20MaxVal_Rec: Maximum value in Ether received from ERC20 token transactions for account
* ERC20AvgVal_Rec: Average value in Ether received from ERC20 token transactions for account
* ERC20MinVal_Sent: Minimum value in Ether sent from ERC20 token transactions for account
* ERC20MaxVal_Sent: Maximum value in Ether sent from ERC20 token transactions for account
* ERC20AvgVal_Sent: Average value in Ether sent from ERC20 token transactions for account
* ERC20UniqSentTokenName: Number of Unique ERC20 tokens transferred
* ERC20UniqRecTokenName: Number of Unique ERC20 tokens received
* ERC20MostSentTokenType: Most sent token for account via ERC20 transaction
* ERC20MostRecTokenType: Most received token for account via ERC20 transactions

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold, cross_validate, learning_curve, RandomizedSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, plot_confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

import xgboost as xgb
import lightgbm as lgb

import os

In [2]:
ds = pd.read_csv('../input/ethereum-frauddetection-dataset/transaction_dataset.csv', index_col=[0])
ds.drop(columns='Index', inplace=True)

In [5]:
# sampling some observations
print(ds.shape)
ds.sample(3)

(9841, 49)


Unnamed: 0,Address,FLAG,Avg min between sent tnx,Avg min between received tnx,Time Diff between first and last (Mins),Sent tnx,Received Tnx,Number of Created Contracts,Unique Received From Addresses,Unique Sent To Addresses,min value received,max value received,avg val received,min val sent,max val sent,avg val sent,min value sent to contract,max val sent to contract,avg value sent to contract,total transactions (including tnx to create contract,total Ether sent,total ether received,total ether sent contracts,total ether balance,Total ERC20 tnxs,ERC20 total Ether received,ERC20 total ether sent,ERC20 total Ether sent contract,ERC20 uniq sent addr,ERC20 uniq rec addr,ERC20 uniq sent addr.1,ERC20 uniq rec contract addr,ERC20 avg time between sent tnx,ERC20 avg time between rec tnx,ERC20 avg time between rec 2 tnx,ERC20 avg time between contract tnx,ERC20 min val rec,ERC20 max val rec,ERC20 avg val rec,ERC20 min val sent,ERC20 max val sent,ERC20 avg val sent,ERC20 min val sent contract,ERC20 max val sent contract,ERC20 avg val sent contract,ERC20 uniq sent token name,ERC20 uniq rec token name,ERC20 most sent token type,ERC20_most_rec_token_type
7451,0xc879acd7998aa249dac1af34da580701a1de0adb,0,23.85,0.25,72.05,3,2,0,2,3,5.787453,95.212547,50.5,2.0,71.447169,33.666152,0.0,0.0,0.0,5,100.998456,101.0,0.0,0.001544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
4970,0x8334a533f0c3f904ca59061fae649a8c596b09ac,0,51.52,21.94,394796.12,5930,4070,16,30,2942,0.0,399.999559,1.847418,0.0,600.0,3.614308,0.0,0.0,0.0,10016,21432.84476,7518.989385,0.0,-13913.85537,50.0,1419735.6,347.001827,0.0,1.0,38.0,0.0,45.0,0.0,0.0,0.0,0.0,0.0,1380482.0,28974.19591,347.001827,347.001827,347.001827,0.0,0.0,0.0,1.0,43.0,UnikoinGold,Beauty Coin
2281,0x3b964b4b9b85960693598f978dbed7dacda07944,0,0.0,30339.75,273057.78,0,9,1,2,0,0.036506,1.000251,0.340709,0.0,0.0,0.0,0.0,0.0,0.0,10,0.0,3.066385,0.0,3.066385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0


In [None]:
# column names
display(ds.columns)

# description of numeric columns
display(ds.describe())

# Non-null count and dtype of each column
ds.info()  # prints directly and returns None, so it is not wrapped in display()

### Categorical columns

In [None]:
# How many addresses appear more than once?
_, uniq_idx, counts = np.unique(ds['Address'].to_numpy(), return_index=True, return_counts=True)
non_unique_addresses_idx = uniq_idx[counts > 1]  # position of the first occurrence of each repeated address
print("non-unique addresses count: {}".format(len(non_unique_addresses_idx)), end='\n\n')
# What are the flags of the non-unique addresses?
non_unique_addresses_flags = ds.iloc[non_unique_addresses_idx]['FLAG']
print("flags of non-unique addresses: ", end='')
print(*non_unique_addresses_flags)

ds.drop(columns='Address', inplace=True)
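
The same duplicate check can also be sketched with pandas alone. The snippet below runs on a made-up toy frame (the addresses and flags are hypothetical, not taken from the dataset) and additionally checks whether every repeated address carries a single, consistent flag.

```python
import pandas as pd

# Hypothetical toy data: two addresses appear more than once
toy = pd.DataFrame({
    "Address": ["0xaa", "0xbb", "0xaa", "0xcc", "0xbb", "0xaa"],
    "FLAG":    [0,      1,      0,      0,      1,      0],
})

# keep=False marks every occurrence of a repeated address, not only the later ones
dup_mask = toy["Address"].duplicated(keep=False)

# number of distinct addresses that occur more than once
n_repeated_addresses = toy.loc[dup_mask, "Address"].nunique()

# True when each repeated address has exactly one distinct FLAG value
flags_consistent = toy[dup_mask].groupby("Address")["FLAG"].nunique().eq(1).all()

print("repeated addresses:", n_repeated_addresses)
print("flags consistent within each address:", bool(flags_consistent))
```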

The columns ' ERC20 most sent token type' and ' ERC20_most_rec_token_type' hold token names. Most tokens occur only once, so these columns are unlikely to help with fraud detection and are dropped.

In [None]:
display(np.unique(ds[' ERC20 most sent token type'].astype(str)))
display(np.unique(ds[' ERC20_most_rec_token_type'].astype(str)))

ds.drop(columns=[' ERC20 most sent token type', ' ERC20_most_rec_token_type'], inplace=True)
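
One way to quantify "most tokens occur only once" is to count how often each distinct value appears. This is a minimal sketch on a hypothetical Series (the token names are invented); on the real data the same two lines would be applied to the token columns before they are dropped.

```python
import pandas as pd

# Hypothetical token names; None stands for a missing entry
tokens = pd.Series(["TokenA", "TokenB", "TokenA", "TokenC", "TokenD", "TokenE", None])

counts = tokens.value_counts()          # missing entries are ignored by default
singleton_share = (counts == 1).mean()  # fraction of distinct tokens seen exactly once

print(counts)
print("share of tokens that occur once: {:.0%}".format(singleton_share))
```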

### Numerical columns

In [None]:
# are classes balanced?
print('class : count : percent')
print('0     : {}  : {:.2%}'.format(sum(ds['FLAG']==0), sum(ds['FLAG']==0)/len(ds['FLAG']) ))
print('1     : {}  : {:.2%}'.format(sum(ds['FLAG']==1), sum(ds['FLAG']==1)/len(ds['FLAG']) ))

sns.heatmap(ds.iloc[:,:1])
plt.show()

The dataset is imbalanced, so we must keep this in mind when choosing the model metric: accuracy alone can look good even when fraud cases are missed.
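
To see why the metric matters, consider a classifier that always predicts "valid". The sketch below uses hypothetical labels (78 valid, 22 fraud; these numbers are an assumption for illustration, not the dataset's actual split) and computes accuracy and F1 by hand.

```python
import numpy as np

# Hypothetical imbalanced labels: 0 = valid, 1 = fraud
y_true = np.array([0] * 78 + [1] * 22)
y_pred = np.zeros_like(y_true)  # trivial model: always predict "valid"

accuracy = (y_true == y_pred).mean()

# F1 for the fraud class, from true positives, false positives and false negatives
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print("accuracy = {:.2f}, f1 = {:.2f}".format(accuracy, f1))
```

The trivial model reaches a high accuracy while its F1 for the fraud class is zero, which is why F1 is tracked alongside accuracy.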

Delete the columns that hold only zeros.

In [None]:
ds.drop(columns=[' ERC20 avg time between rec tnx', ' ERC20 avg time between rec 2 tnx', ' ERC20 avg time between contract tnx',
                 ' ERC20 min val sent contract', ' ERC20 max val sent contract', ' ERC20 avg val sent contract', ' ERC20 avg time between sent tnx'], inplace=True)
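
The list above is written out by hand. A possible way to find such columns programmatically is sketched below on a toy frame with hypothetical column names; it treats a column as "only zeros" when all of its non-missing entries equal zero, which is an assumption about how missing values should be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "only_zeros":    [0.0, 0.0, 0.0, 0.0],
    "zeros_and_nan": [0.0, np.nan, 0.0, 0.0],
    "informative":   [0.0, 1.5, 0.0, 2.0],
})

# missing entries are ignored, so a column of zeros and NaN also counts as zero-only
zero_only_columns = [col for col in toy.columns if (toy[col].dropna() == 0).all()]
reduced = toy.drop(columns=zero_only_columns)

print("zero-only columns:", zero_only_columns)
print("remaining columns:", list(reduced.columns))
```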

In [None]:
# missing values
missing_values = ds.isna()
missing_percent = missing_values.sum() / ds.shape[0] * 100
missing_df = pd.DataFrame([missing_values.sum(), missing_percent], ['count', 'percent'])
display(missing_df.sort_values(by='percent', axis=1, ascending=False))
missing_df.sort_values(by='percent', axis=1, ascending=False).to_csv('missing.csv')

sns.heatmap(missing_values, cbar=False, cmap='magma')
plt.show()

Missing values appear to be strongly associated with fraud cases.

In [None]:
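# positional indices of the rows flagged as fraud
fraud_rows = np.where(ds['FLAG'] == 1)[0]
# missing-value count per column, restricted to fraud rows (last 20 columns)
print(ds.iloc[fraud_rows, :].isna().sum()[-20:])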

As expected, every missing value falls in a fraud row. This means that almost 40% of fraud rows have missing values.
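
A compact way to compare the two classes is the share of rows with at least one missing value per `FLAG` value. The sketch below uses a toy frame with hypothetical feature names and values, so the printed shares are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing values occur only in rows with FLAG == 1
toy = pd.DataFrame({
    "FLAG":      [0, 0, 0, 1, 1, 1],
    "feature_a": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
    "feature_b": [0.1, 0.2, 0.3, np.nan, 0.5, 0.6],
})

# True for rows that contain at least one missing feature value
rows_with_nan = toy.drop(columns="FLAG").isna().any(axis=1)

# the mean of a boolean mask per class is the share of affected rows in that class
share_by_class = rows_with_nan.groupby(toy["FLAG"]).mean()
print(share_by_class)
```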

In [None]:
missing_columns = ds.columns[ds.isna().sum() > 0]

In [None]:
# correlation
corr = ds.corr()
plt.figure(figsize=(20,12))
sns.heatmap(np.abs(corr), cmap='coolwarm')
plt.show()

## Data preprocessing

Only numeric features remain, so preprocessing is limited to imputing missing values with the column mean and scaling.

In [None]:
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

X = ds.drop(columns='FLAG').to_numpy()
y = ds['FLAG'].to_numpy()

random_permutation = np.random.permutation(len(X))
X = X[random_permutation]
y = y[random_permutation]

In [None]:
# stratify=y keeps the share of fraud rows the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

X_train = preprocessing_pipeline.fit_transform(X_train)
X_test = preprocessing_pipeline.transform(X_test)
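
To make the two preprocessing steps concrete, the sketch below reproduces by hand, in plain NumPy, what mean imputation followed by standard scaling computes. The arrays are hypothetical, and the point is that the column means and standard deviations come from the training split only and are then reused on the test split.

```python
import numpy as np

# Hypothetical toy splits with missing values
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_toy = np.array([[np.nan, 40.0]])

# step 1: mean imputation, statistics taken from the training split
col_mean = np.nanmean(X_train_toy, axis=0)
X_train_imp = np.where(np.isnan(X_train_toy), col_mean, X_train_toy)
X_test_imp = np.where(np.isnan(X_test_toy), col_mean, X_test_toy)

# step 2: standardisation with the training mean and population std (ddof=0)
mu = X_train_imp.mean(axis=0)
sigma = X_train_imp.std(axis=0)
X_train_scaled = (X_train_imp - mu) / sigma
X_test_scaled = (X_test_imp - mu) / sigma

print(X_train_scaled)
print(X_test_scaled)
```

The scaled training columns have zero mean and unit variance, while the test row is only shifted and rescaled with the training statistics.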

## Model selection

In [None]:
def evaluate_models(X, y, models, cv):
    f1_scores = dict()
    acc_scores = dict()
    
    for i, model in enumerate(models):
        clf_pipeline = make_pipeline(preprocessing_pipeline, model)
        results = cross_validate(clf_pipeline, X, y, cv=cv, scoring=['f1', 'accuracy'], n_jobs=-1)
        avg_f1 = np.mean(results['test_f1'])
        avg_acc = np.mean(results['test_accuracy'])
        
        model_name = model.__class__.__name__
        f1_scores[model_name] = avg_f1
        acc_scores[model_name] = avg_acc
        print('{}-of-{}: {} f1={}, acc={}'.format(i+1, len(models), model_name, avg_f1, avg_acc))
    return f1_scores, acc_scores

In [None]:
cv = StratifiedKFold(5, shuffle=True, random_state=42)

classifiers = [
    LogisticRegression(random_state=42),
    KNeighborsClassifier(),
    RandomForestClassifier(random_state=42),
    lgb.LGBMClassifier(random_state=42),
    xgb.XGBClassifier(random_state=42),
    SVC(random_state=42),
    AdaBoostClassifier(random_state=42),
    GaussianNB(),
    MLPClassifier(random_state=42),
]
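
`StratifiedKFold` is used here because the classes are imbalanced: it keeps the class shares the same in every fold. A small sketch on hypothetical labels (80 valid, 20 fraud, chosen only so the numbers divide evenly) shows the effect.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 valid (0) and 20 fraud (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# share of fraud labels in the validation part of each fold
fold_fraud_share = [y_toy[test_idx].mean() for _, test_idx in cv_toy.split(X_toy, y_toy)]
print(fold_fraud_share)
```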

In [None]:
f1_scores, acc_scores = evaluate_models(X, y, classifiers, cv)

In [None]:
def visualize_scores(f1_scores, acc_scores):
    x = np.arange(len(f1_scores))
    width = 0.45
    
    f1_values = list(f1_scores.values())
    acc_values = list(acc_scores.values())
    
    plt.figure(figsize=(15, 8)).tight_layout()
    plt.bar(x - width / 2, f1_values, width, label='f1')
    plt.bar(x + width / 2, acc_values, width, label='accuracy')
    
    for index, value in enumerate(x - width / 2):
        plt.text(value, f1_values[index], '{:.3}'.format(f1_values[index]),
                 verticalalignment='bottom', horizontalalignment='center', fontsize=10)

    for index, value in enumerate(x + width / 2):
        plt.text(value, acc_values[index], '{:.3}'.format(acc_values[index]),
                 verticalalignment='bottom', horizontalalignment='center', fontsize=10)    
    
    classifiers_names = f1_scores.keys()
    plt.xticks(x, classifiers_names, rotation=40, horizontalalignment='right', fontsize=10)
    plt.legend()

visualize_scores(f1_scores, acc_scores)

### XGBClassifier hyperparameter tuning

In [None]:
xgb_parameters = {
    'xgbclassifier__n_estimators': range(1000, 4001, 1000),
    'xgbclassifier__gamma': [0, 0.5, 1],
    'xgbclassifier__max_depth': [5, 6, 7]
}

xgb_pipeline = make_pipeline(preprocessing_pipeline, xgb.XGBClassifier(random_state=42))
# xgb_pipeline.steps
xgb_grid_search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=xgb_parameters,
    scoring = 'f1',
    n_iter = 12,
    n_jobs = -1,
    cv = 5,
    random_state=42
)

xgb_grid_search.fit(X, y)
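
A quick sanity check on how much of the grid the randomized search covers: the sketch below copies the XGBoost grid from the cell above into a separate, hypothetically named dict, counts the possible combinations, and compares that with `n_iter`.

```python
from math import prod

# copy of the XGBoost grid defined above
xgb_grid = {
    'xgbclassifier__n_estimators': range(1000, 4001, 1000),
    'xgbclassifier__gamma': [0, 0.5, 1],
    'xgbclassifier__max_depth': [5, 6, 7],
}
n_iter = 12  # number of sampled settings, as passed to RandomizedSearchCV

n_combinations = prod(len(values) for values in xgb_grid.values())
coverage = n_iter / n_combinations

print("{} combinations, {} sampled ({:.0%} of the grid)".format(n_combinations, n_iter, coverage))
```

With 5-fold cross-validation, each sampled setting is fitted five times, so `n_iter` directly controls the run time.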

In [None]:
#grid_search.grid_scores_
display(xgb_grid_search.best_score_)
display(xgb_grid_search.best_params_)

In [None]:
LGBM_parameters = {
        # LightGBM requires 0 < bagging_fraction <= 1, so 0 is not a valid candidate
        'lgbmclassifier__bagging_fraction': [0.2, 0.5, 0.8, 1],
        'lgbmclassifier__feature_fraction': [0.5, 0.8],
        'lgbmclassifier__max_depth': [6, 10, 13, 16, 20],
        'lgbmclassifier__min_data_in_leaf': range(40, 180, 20),
        'lgbmclassifier__num_leaves': range(500, 2500, 300)
}

LGBM_pipeline = make_pipeline(preprocessing_pipeline, lgb.LGBMClassifier(random_state=42))
LGBM_grid_search = RandomizedSearchCV(
    LGBM_pipeline,
    param_distributions=LGBM_parameters,
    scoring = 'f1',
    n_iter = 60,
    n_jobs = -1,
    cv = 5,
    random_state=42
)

LGBM_grid_search.fit(X, y)

In [None]:
display(LGBM_grid_search.best_score_)
display(LGBM_grid_search.best_params_)

In [None]:
RFC_parameters = {
        'randomforestclassifier__n_estimators': range(50, 1050, 100),
        'randomforestclassifier__max_depth': range(50, 300, 20),
        'randomforestclassifier__min_samples_split': [2, 5, 10],
        'randomforestclassifier__min_samples_leaf': [1, 2, 4],
        'randomforestclassifier__bootstrap': [True, False]
}

RFC_pipeline = make_pipeline(preprocessing_pipeline, RandomForestClassifier(random_state=42))
RFC_grid_search = RandomizedSearchCV(
    RFC_pipeline,
    param_distributions=RFC_parameters,
    scoring = 'f1',
    n_iter = 24,
    n_jobs = -1,
    cv = 5,
    random_state=42
)

RFC_grid_search.fit(X, y)

In [None]:
display(RFC_grid_search.best_score_)
display(RFC_grid_search.best_params_)

## Best model evaluation

In [None]:
best_model = lgb.LGBMClassifier(num_leaves=1400, min_data_in_leaf=100, max_depth=10,
    feature_fraction=0.8, bagging_fraction=0.5, random_state=42)

best_model.fit(X_train, y_train)

predictions = best_model.predict(X_test)

In [None]:
print("f1 score = {}".format(f1_score(y_test, predictions)))

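# ROC AUC ranks examples by score, so it is computed from the predicted fraud probability rather than hard labels
print("ROC AUC score = {}".format(roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])))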

print("accuracy score = {}".format(accuracy_score(y_test, predictions)))

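# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
plt.show()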
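
A note on ROC AUC: the metric measures how well scores rank fraud rows above valid rows, so it depends on whether it is fed probabilities or hard 0/1 predictions. The sketch below uses a hand-written pairwise AUC (a hypothetical helper, not a library function) on made-up scores to show that the two inputs can give different values.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (fraud, valid) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical labels and predicted fraud probabilities
y_toy = np.array([0, 0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.6, 0.45, 0.9])
hard = (proba >= 0.5).astype(int)  # thresholded 0/1 predictions

auc_from_proba = pairwise_auc(y_toy, proba)
auc_from_labels = pairwise_auc(y_toy, hard)

print("AUC from probabilities: {:.3f}".format(auc_from_proba))
print("AUC from hard labels:   {:.3f}".format(auc_from_labels))
```

Thresholding collapses many distinct scores into ties, which discards ranking information.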

In [None]:
def plot_features_importance(feature_importance):
    column_names = ds.drop(columns='FLAG').columns

    df_feature_importance = pd.DataFrame(sorted(zip(feature_importance, column_names)),
                                       columns=['Importance value', 'Feature'])
    df_feature_importance = df_feature_importance.sort_values('Importance value', ascending=False)

    plt.figure(figsize=(9, 7)).tight_layout()
    sns.barplot(y="Feature", x="Importance value", data=df_feature_importance)
    plt.show()

plot_features_importance(best_model.feature_importances_)