# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > TABLE OF CONTENTS<br><div>  

* [IMPORT](#1)
* [INTRODUCTION](#2)
* [PREPROCESSING](#3)
* [MODEL TRAINING](#4)  
* [EVALUATION](#5)

<a id="1"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > IMPORT<br><div> 

In [90]:
%%time

import warnings
warnings.filterwarnings('ignore')
import copy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import uniform, randint
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.model_selection import RandomizedSearchCV,train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier

CPU times: total: 0 ns
Wall time: 997 µs


<a id="2"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > INTRODUCTION<br><div> 

**Dataset**
Ransomware attacks are among the most dangerous crimes in the cryptocurrency market. Because these attacks are hard to fight after the fact, early detection of ransomware is necessary.
The dataset, downloaded from https://archive.ics.uci.edu/dataset/526/bitcoinheistransomwareaddressdataset, was built from the entire Bitcoin transaction graph from January 2009 to December 2018. Using a time interval of 24 hours, the dataset authors extracted daily transactions on the network and formed the Bitcoin graph. Network edges that transfer less than 0.3 bitcoin were filtered out, since ransom amounts are rarely below this threshold.

**Features**
* address: String. Bitcoin address.
* year: Integer. Year.
* day: Integer. Day of the year. 1 is the first day, 365 is the last day.
* length: Integer.
* weight: Float.
* count: Integer.
* looped: Integer.
* neighbors: Integer.
* income: Integer. Satoshi amount (1 bitcoin = 100 million satoshis).
* label: Category String. Name of the ransomware family (e.g., Cryptxxx, cryptolocker etc.) or white (i.e., not known to be ransomware).

**Assumption**
We assume that fraudulent transactions are few in number: because the Bitcoin network is relatively new, only a limited number of transactions have been reported as fraud.

<a id="3"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > PREPROCESSING<br><div> 
    
**Key tasks**
1. Data import
2. Feature engineering
3. Under-sampling the datasets appropriately

In [63]:
# import data
df=pd.read_csv("data/BitcoinHeistData.csv")

In [64]:
df.shape

(2916697, 10)

In [65]:
df.head(20)

Unnamed: 0,address,year,day,length,weight,count,looped,neighbors,income,label
0,111K8kZAEnJg245r2cM6y9zgJGHZtJPy6,2017,11,18,0.008333333,1,0,2,100050000.0,princetonCerber
1,1123pJv8jzeFQaCV4w644pzQJzVWay2zcA,2016,132,44,0.0002441406,1,0,1,100000000.0,princetonLocky
2,112536im7hy6wtKbpH1qYDWtTyMRAcA2p7,2016,246,0,1.0,1,0,2,200000000.0,princetonCerber
3,1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7,2016,322,72,0.00390625,1,0,2,71200000.0,princetonCerber
4,1129TSjKtx65E35GiUo4AYVeyo48twbrGX,2016,238,144,0.07284841,456,0,1,200000000.0,princetonLocky
5,112AmFATxzhuSpvtz1hfpa3Zrw3BG276pc,2016,96,144,0.084614,2821,0,1,50000000.0,princetonLocky
6,112E91jxS2qrQY1z78LPWUWrLVFGqbYPQ1,2016,225,142,0.002088519,881,0,2,100000000.0,princetonCerber
7,112eFykaD53KEkKeYW9KW8eWebZYSbt2f5,2016,324,78,0.00390625,1,0,2,100990000.0,princetonCerber
8,112FTiRdJjMrNgEtd4fvdoq3TC33Ah5Dep,2016,298,144,2.302828,4220,0,2,80000000.0,princetonCerber
9,112GocBgFSnaote6krx828qaockFraD8mp,2016,62,112,3.72529e-09,1,0,1,50000000.0,princetonLocky


In [66]:
df.describe()

Unnamed: 0,year,day,length,weight,count,looped,neighbors,income
count,2916697.0,2916697.0,2916697.0,2916697.0,2916697.0,2916697.0,2916697.0,2916697.0
mean,2014.475,181.4572,45.00859,0.5455192,721.6446,238.5067,2.206516,4464889000.0
std,2.257398,104.0118,58.98236,3.674255,1689.676,966.3217,17.91877,162686000000.0
min,2011.0,1.0,0.0,3.606469e-94,1.0,0.0,1.0,30000000.0
25%,2013.0,92.0,2.0,0.02148438,1.0,0.0,1.0,74285590.0
50%,2014.0,181.0,8.0,0.25,1.0,0.0,2.0,199998500.0
75%,2016.0,271.0,108.0,0.8819482,56.0,0.0,2.0,994000000.0
max,2018.0,365.0,144.0,1943.749,14497.0,14496.0,12920.0,49964400000000.0


In [67]:
df["label"].value_counts()

white                          2875284
paduaCryptoWall                  12390
montrealCryptoLocker              9315
princetonCerber                   9223
princetonLocky                    6625
montrealCryptXXX                  2419
montrealNoobCrypt                  483
montrealDMALockerv3                354
montrealDMALocker                  251
montrealSamSam                      62
montrealCryptoTorLocker2015         55
montrealGlobeImposter               55
montrealGlobev3                     34
montrealGlobe                       32
montrealWannaCry                    28
montrealRazy                        13
montrealAPT                         11
paduaKeRanger                       10
montrealFlyper                       9
montrealXTPLocker                    8
montrealXLockerv5.0                  7
montrealVenusLocker                  7
montrealCryptConsole                 7
montrealEDA2                         6
montrealJigSaw                       4
paduaJigsaw              

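Only 41,413 of the 2,916,697 rows (about 1.4%) carry a ransomware label; the rest are `white`. The classes are therefore heavily imbalanced, which is consistent with the assumption above and is the reason for the under-sampling step below.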
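
The imbalance can be measured directly from the label column. A minimal sketch on toy data (the labels `familyA` and `familyB` and their counts are made up for illustration; on the real data one would pass `df["label"]` instead):

```python
import pandas as pd

# Toy labels standing in for df["label"]; the counts are invented.
labels = pd.Series(["white"] * 97 + ["familyA"] * 2 + ["familyB"] * 1)

# Share of each label, largest first.
shares = labels.value_counts(normalize=True)

# Share of everything that is not 'white' (the ransomware class as a whole).
ransomware_share = (labels != "white").mean()

print(shares)
print(f"ransomware share: {ransomware_share:.2%}")
```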

In [68]:
df.isnull().sum()

address      0
year         0
day          0
length       0
weight       0
count        0
looped       0
neighbors    0
income       0
label        0
dtype: int64

In [69]:
# keep a copy in case we need the original data
old_df=copy.deepcopy(df)
# drop features that are not related to fraud
df.drop(columns=["address","year","day"],inplace=True)
X=df.drop(columns=["label"])
y=df["label"]

In [70]:
new_df=pd.DataFrame()
grouped=df.groupby("label")
new_df["num_of_instances"]=grouped.size()

In [71]:
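# per-label sample standard deviation of each feature
# (a label with a single row has no defined std, so NaN is replaced by 0)
for col in X.columns:
    new_df[f"{col}_std"]=grouped[col].std().fillna(0)
    # new_df[f"{col}_min"]=grouped[col].min()
    # new_df[f"{col}_max"]=grouped[col].max()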

In [72]:
new_df=new_df.reset_index()
new_df.head()

Unnamed: 0,label,num_of_instances,length_std,weight_std,count_std,looped_std,neighbors_std,income_std
0,montrealAPT,11,73.145434,0.600596,2721.94673,2434.040815,1.439697,365411500.0
1,montrealComradeCircle,1,0.0,0.0,0.0,0.0,0.0,0.0
2,montrealCryptConsole,7,68.747987,0.410909,1428.059373,0.0,0.0,8181253.0
3,montrealCryptXXX,2419,58.187904,0.434143,1684.393374,476.538479,0.858628,66096040.0
4,montrealCryptoLocker,9315,50.731602,1.555608,868.917734,533.66012,3.670793,12781160000.0


In [73]:
# Build the undersampled dataset
def MakeRndUndSmpl(df):
    """
    Build a smaller dataset by randomly under-sampling the majority ('white') class.

    Input-->   df - pd.DataFrame with a 'label' column
    Returns--> Shuffled, under-sampled dataframe
    """
    # Keep every ransomware sample (any label other than 'white').
    black_df = df.loc[df['label'] != 'white']
    # Candidate majority-class samples.
    white_df = df.loc[df['label'] == 'white']
    # Draw 158,587 white samples so that, together with the 41,413 ransomware
    # samples, the result has 200,000 rows. The two classes are NOT equal in size.
    white_df = white_df.sample(n=158587,random_state=42)

    # Join both parts and shuffle the rows.
    undersampled_df = pd.concat([black_df, white_df]).sample(frac=1,random_state=50)

    return undersampled_df
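
The same idea can be written without a hard-coded sample count. The sketch below is a hypothetical variant, not the function used in this notebook: `undersample_majority`, its parameters, and the toy dataframe are all assumptions made for illustration.

```python
import pandas as pd

def undersample_majority(df, label_col="label", majority="white",
                         total_rows=10, seed=42):
    """Keep every minority row and sample majority rows up to total_rows."""
    minority = df[df[label_col] != majority]
    majority_rows = df[df[label_col] == majority]
    n_majority = total_rows - len(minority)
    if not 0 <= n_majority <= len(majority_rows):
        raise ValueError("total_rows is incompatible with the class sizes")
    sampled = majority_rows.sample(n=n_majority, random_state=seed)
    # Concatenate, then shuffle so the classes are interleaved.
    return pd.concat([minority, sampled]).sample(frac=1, random_state=seed)

# Toy data: 16 'white' rows and 4 rows of a made-up ransomware family.
toy = pd.DataFrame({
    "income": range(20),
    "label": ["white"] * 16 + ["familyA"] * 4,
})
small = undersample_majority(toy, total_rows=10)
print(small["label"].value_counts())
```

Deriving the majority sample size from a target row count keeps the function usable if the input data changes.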

In [74]:
sampled_df = MakeRndUndSmpl(df)

In [75]:
sampled_df.head(20)

Unnamed: 0,length,weight,count,looped,neighbors,income,label
526724,14,0.00390625,1,0,2,3216000000.0,white
2825235,144,0.1969697,1805,0,2,55783520.0,white
7558,14,0.015625,1,0,2,635531100.0,montrealCryptoLocker
1141393,6,1.0,1,0,2,292510000.0,white
1484054,142,0.06571111,2568,0,2,136000000.0,white
2702804,144,0.102503,4100,0,1,699980900.0,white
359530,0,0.5,1,0,2,4107789000.0,white
9936,6,0.5,1,0,2,120000000.0,montrealCryptXXX
2670910,0,0.09090909,1,0,2,300000000.0,white
2811264,144,0.11939,5734,0,2,7992815000.0,white


**Label Encoding**

We do label encoding of:
> * White label: 0
> * Ransomware: 1
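
As a small illustration of this mapping, here is a sketch on a toy series (the four example labels are taken from the family names listed earlier; the series itself is made up):

```python
import pandas as pd

labels = pd.Series(["white", "montrealCryptoLocker", "white", "princetonLocky"])

# 'white' -> 0, any ransomware family -> 1
binary = (labels != "white").astype("int64")
print(binary.tolist())  # [0, 1, 0, 1]
```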

In [76]:
# Create a label encoder object
label_encoder = LabelEncoder()

# Encode the labels as integer codes
encoded = label_encoder.fit_transform(sampled_df['label'])

# Look up the code of 'white' once, then map 'white' to 0 and every other label to 1
white_code = label_encoder.transform(['white'])[0]
sampled_df['label'] = (encoded != white_code).astype('int64')

sampled_df['label']

526724     0
2825235    0
7558       1
1141393    0
1484054    0
          ..
1363905    0
1934713    0
2481475    0
2039010    0
880386     0
Name: label, Length: 200000, dtype: int64

In [77]:
sampled_df.columns

Index(['length', 'weight', 'count', 'looped', 'neighbors', 'income', 'label'], dtype='object')

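We calculate the **Z-score** of every column and keep only the rows whose Z-scores all lie within a threshold of ±3. Note that the model below is trained on `sampled_df`; `filtered_df` is not used in the later cells.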
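
Written out, with $\bar{x}_j$ the mean and $s_j$ the sample standard deviation of column $j$ (pandas `Series.std` uses the $n-1$ denominator by default), and reading the threshold of 3 from the filter in the cells that follow:

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
$$

$$
\text{row } i \text{ is kept} \iff |z_{ij}| \le 3 \ \text{ for every column } j
$$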

In [78]:
z_score_df=pd.DataFrame()
for col in sampled_df.columns:
    z_score_df[f"{col}_z_score"]=(sampled_df[col]-sampled_df[col].mean())/sampled_df[col].std()
z_score_df.head()

Unnamed: 0,length_z_score,weight_z_score,count_z_score,looped_z_score,neighbors_z_score,income_z_score,label_z_score
526724,-0.513441,-0.114174,-0.424426,-0.234473,-0.009561,-0.00492,-0.511015
2825235,1.695481,-0.075626,0.677535,-0.234473,-0.009561,-0.028981,-0.511015
7558,-0.513441,-0.111834,-0.424426,-0.234473,-0.009561,-0.024567,1.956881
1141393,-0.649374,0.084711,-0.424426,-0.234473,-0.009561,-0.027178,-0.511015
1484054,1.661498,-0.101834,1.143608,-0.234473,-0.009561,-0.02837,-0.511015


In [79]:
# Keep only the rows whose z-scores all lie within [-3, 3]
filtered_df = z_score_df[(z_score_df.abs() <= 3).all(axis=1)]

# Display the first filtered rows
filtered_df.head()

Unnamed: 0,length_z_score,weight_z_score,count_z_score,looped_z_score,neighbors_z_score,income_z_score,label_z_score
526724,-0.513441,-0.114174,-0.424426,-0.234473,-0.009561,-0.00492,-0.511015
2825235,1.695481,-0.075626,0.677535,-0.234473,-0.009561,-0.028981,-0.511015
7558,-0.513441,-0.111834,-0.424426,-0.234473,-0.009561,-0.024567,1.956881
1141393,-0.649374,0.084711,-0.424426,-0.234473,-0.009561,-0.027178,-0.511015
1484054,1.661498,-0.101834,1.143608,-0.234473,-0.009561,-0.02837,-0.511015


**Train-test split**

In [86]:
X=sampled_df.drop("label",axis=1)
y=(sampled_df["label"]>0).astype('int')

In [87]:
# 0.67 as training and 0.33 for test
X_train, X_test,y_train, y_test = train_test_split(X,y ,
                                   random_state=42, 
                                   test_size=0.33, 
                                   shuffle=True)

**Standardization**

We fit a **standard scaler** (zero mean, unit variance per feature) on the training data only, and then apply the same transformation to the test data.
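
To make the train-only fitting concrete, here is a minimal NumPy sketch of the same computation on made-up numbers. It assumes the scikit-learn convention of the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature matrices: three training rows, one test row.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

# Statistics come from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse train mean/std, no refit on test

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.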

In [88]:
scaler=StandardScaler()
scaler.fit(X_train)

X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)

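# keep the original row index so the scaled features stay aligned with y_train / y_test
X_train_scaled=pd.DataFrame(X_train_scaled,columns=X_train.columns,index=X_train.index)
X_test_scaled=pd.DataFrame(X_test_scaled,columns=X_test.columns,index=X_test.index)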

<a id="4"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > MODEL TRAINING<br><div> 

I use the **XGBoost Classifier** to train on our dataset. 
**Random search** is performed to optimize the hyperparameters of the classifier.
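
When reading a search space built from `scipy.stats` distributions, note that `uniform(loc, scale)` covers `[loc, loc + scale]` and `randint(low, high)` excludes `high`. The sketch below checks this by sampling; the sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import randint, uniform

# Draw from the same kinds of distributions used for a random search.
lr = uniform(0.01, 0.3).rvs(size=1000, random_state=0)  # floats in [0.01, 0.31]
depth = randint(1, 10).rvs(size=1000, random_state=0)   # integers 1..9

print(lr.min(), lr.max())
print(depth.min(), depth.max())
```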

In [89]:
%%time

# Define the XGBoost classifier
model = XGBClassifier()

# Define the hyperparameter search space
# scipy conventions: randint(low, high) samples integers in [low, high);
# uniform(loc, scale) samples floats in [loc, loc + scale]
param_space = {
    'n_estimators': randint(100, 1000),      # 100..999
    'max_depth': randint(1, 10),             # 1..9
    'learning_rate': uniform(0.01, 0.3),     # 0.01..0.31
    'subsample': uniform(0.6, 0.4),          # 0.6..1.0
    'colsample_bytree': uniform(0.6, 0.4)    # 0.6..1.0
}

# Perform random search
# This might take a few minutes to run
random_search = RandomizedSearchCV(model, param_distributions=param_space, n_iter=10, cv=5)
random_search.fit(X_train_scaled, y_train)

# Print the best parameters and the best mean cross-validated score
# (accuracy, the default scoring for a classifier)
print("Best parameters found: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)

Best parameters found:  {'colsample_bytree': 0.9854319067318575, 'learning_rate': 0.13440068332107247, 'max_depth': 6, 'n_estimators': 427, 'subsample': 0.6859256996142484}
Best score:  0.8510223880597015


<a id="5"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > EVALUATION<br><div> 

I use the following metrics for evaluation:
> * accuracy
> * precision
> * recall
> * f1 score
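
For reference, these four metrics can be computed from the confusion counts alone. The sketch below does so in plain Python on made-up predictions; `binary_metrics` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

With imbalanced classes, accuracy can stay high while recall on the positive class is low, which is why precision, recall and F1 are reported alongside it.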

In [91]:
# Get the best model from random search
best_model = random_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test_scaled)

# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Accuracy: 0.85
Precision: 0.74
Recall: 0.44
F1 Score: 0.55


**Methodology used:**
> Suleiman Ali Alsaif, "Machine Learning-Based Ransomware Classification of Bitcoin Transactions", Applied Computational Intelligence and Soft Computing,
> vol. 2023, Article ID 6274260, 10 pages, 2023. https://doi.org/10.1155/2023/6274260