--------------
**Enhancement of Transaction Dataset**

The purpose of this notebook is to enhance the transaction feature dataset with class codes and labels, and to remove the masked (normalised) features, as they cannot be validated. The enhanced dataset supports exploratory work to find patterns that are unique to illicit transactions compared to licit transactions.

The Elliptic++ dataset is an enhancement of the Elliptic1 dataset. It includes transaction information that can be validated using blockchain wallet search websites. Sources:

Elliptic1 Paper and Dataset:
https://arxiv.org/pdf/1908.02591
https://www.kaggle.com/datasets/ellipticco/elliptic-data-set/data

Elliptic++ Paper and Dataset:
https://arxiv.org/pdf/2306.06108
https://github.com/git-disl/EllipticPlusPlus

--------------

In [1]:
# Data cleaning and manipulation
import pandas as pd
import numpy as np
from pandas_gbq import to_gbq

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)


-------------------

Read in Transaction Datasets

-------------------

Dataset 1: The first dataset is txn_classes, which contains the masked transactions and a class code for each. Class labels need to be added to this dataset.

Dataset 2: The second dataset is txn_features, which contains masked (normalised) and unmasked features. The adjusted or masked features, which come from the original Elliptic1 dataset, will be removed, and the class codes and labels will be added. The unmasked features were added in the Elliptic++ dataset and can be validated using blockchain wallet searches.

In [2]:
%%bigquery df_txn_classes
-- Read in txn_classes table (the cell magic must be the first line of the cell)
select * from `sixth-legend-440110-g7.txn_data.txn_classes`;

In [3]:
%%bigquery df_txn_features
-- Read in txn_features table (the cell magic must be the first line of the cell)
select * from `sixth-legend-440110-g7.txn_data.txn_features`;

In [4]:
print(df_txn_classes.shape)
print(df_txn_features.shape)

(203769, 2)
(203769, 184)


-------------------

Add in class labels to the txn_classes dataset

-------------------

In [5]:
df_txn_classes.head(1)

Unnamed: 0,txId,class
0,3205536,1


In [6]:
# Map classes to a name
df_txn_classes['class_label'] = df_txn_classes['class'].map({1: 'Illicit', 2: 'Licit', 3: 'Unknown'})
print(df_txn_classes.shape)
df_txn_classes.head(1)

(203769, 3)


Unnamed: 0,txId,class,class_label
0,3205536,1,Illicit
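
Because `Series.map` returns NaN for any code missing from the dictionary, it can be worth confirming that every row received a label. The following is a minimal sketch on a toy frame; `toy_classes`, `CLASS_LABELS`, and the txId values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_txn_classes (txId values are made up for illustration)
toy_classes = pd.DataFrame({"txId": [101, 102, 103, 104], "class": [1, 2, 3, 2]})

# Same code-to-label mapping as used in the notebook
CLASS_LABELS = {1: "Illicit", 2: "Licit", 3: "Unknown"}
toy_classes["class_label"] = toy_classes["class"].map(CLASS_LABELS)

# Series.map yields NaN for codes absent from the dictionary, so count them
n_unmapped = int(toy_classes["class_label"].isna().sum())

# Distribution of labels, useful as a quick sanity check
label_counts = toy_classes["class_label"].value_counts().to_dict()
print(n_unmapped, label_counts)
```

If `n_unmapped` is non-zero on the real table, the class column may hold codes (or a dtype, such as strings) that the dictionary keys do not match.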


-------------------

In the df_txn_features dataset, remove masked features and add the class code and labels.

-------------------

In [7]:
df_txn_features.head(1)

Unnamed: 0,txId,Time step,Local_feature_1,Local_feature_2,Local_feature_3,Local_feature_4,Local_feature_5,Local_feature_6,Local_feature_7,Local_feature_8,Local_feature_9,Local_feature_10,Local_feature_11,Local_feature_12,Local_feature_13,Local_feature_14,Local_feature_15,Local_feature_16,Local_feature_17,Local_feature_18,Local_feature_19,Local_feature_20,Local_feature_21,Local_feature_22,Local_feature_23,Local_feature_24,Local_feature_25,Local_feature_26,Local_feature_27,Local_feature_28,Local_feature_29,Local_feature_30,Local_feature_31,Local_feature_32,Local_feature_33,Local_feature_34,Local_feature_35,Local_feature_36,Local_feature_37,Local_feature_38,Local_feature_39,Local_feature_40,Local_feature_41,Local_feature_42,Local_feature_43,Local_feature_44,Local_feature_45,Local_feature_46,Local_feature_47,Local_feature_48,Local_feature_49,Local_feature_50,Local_feature_51,Local_feature_52,Local_feature_53,Local_feature_54,Local_feature_55,Local_feature_56,Local_feature_57,Local_feature_58,Local_feature_59,Local_feature_60,Local_feature_61,Local_feature_62,Local_feature_63,Local_feature_64,Local_feature_65,Local_feature_66,Local_feature_67,Local_feature_68,Local_feature_69,Local_feature_70,Local_feature_71,Local_feature_72,Local_feature_73,Local_feature_74,Local_feature_75,Local_feature_76,Local_feature_77,Local_feature_78,Local_feature_79,Local_feature_80,Local_feature_81,Local_feature_82,Local_feature_83,Local_feature_84,Local_feature_85,Local_feature_86,Local_feature_87,Local_feature_88,Local_feature_89,Local_feature_90,Local_feature_91,Local_feature_92,Local_feature_93,Aggregate_feature_1,Aggregate_feature_2,Aggregate_feature_3,Aggregate_feature_4,Aggregate_feature_5,Aggregate_feature_6,Aggregate_feature_7,Aggregate_feature_8,Aggregate_feature_9,Aggregate_feature_10,Aggregate_feature_11,Aggregate_feature_12,Aggregate_feature_13,Aggregate_feature_14,Aggregate_feature_15,Aggregate_feature_16,Aggregate_feature_17,Aggregate_feature_18,Aggregate_feature_19,Aggregate_feature_20,Aggregate_feature_21,Aggregate_feature_22,Aggregate_feature_23,Aggregate_feature_24,Aggregate_feature_25,Aggregate_feature_26,Aggregate_feature_27,Aggregate_feature_28,Aggregate_feature_29,Aggregate_feature_30,Aggregate_feature_31,Aggregate_feature_32,Aggregate_feature_33,Aggregate_feature_34,Aggregate_feature_35,Aggregate_feature_36,Aggregate_feature_37,Aggregate_feature_38,Aggregate_feature_39,Aggregate_feature_40,Aggregate_feature_41,Aggregate_feature_42,Aggregate_feature_43,Aggregate_feature_44,Aggregate_feature_45,Aggregate_feature_46,Aggregate_feature_47,Aggregate_feature_48,Aggregate_feature_49,Aggregate_feature_50,Aggregate_feature_51,Aggregate_feature_52,Aggregate_feature_53,Aggregate_feature_54,Aggregate_feature_55,Aggregate_feature_56,Aggregate_feature_57,Aggregate_feature_58,Aggregate_feature_59,Aggregate_feature_60,Aggregate_feature_61,Aggregate_feature_62,Aggregate_feature_63,Aggregate_feature_64,Aggregate_feature_65,Aggregate_feature_66,Aggregate_feature_67,Aggregate_feature_68,Aggregate_feature_69,Aggregate_feature_70,Aggregate_feature_71,Aggregate_feature_72,in_txs_degree,out_txs_degree,total_BTC,fees,size,num_input_addresses,num_output_addresses,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total
0,386141483,31,-0.173,-0.0663,1.0186,-0.0469,-0.024,-0.113,0.2427,-0.1636,-0.1695,-0.0497,-0.1659,2.4592,2.4153,-0.0227,-0.0133,-0.0574,-0.1711,-0.1729,-0.1763,0.6659,0.3206,-0.1397,-0.1489,-0.0801,-0.1557,-0.0108,-0.0121,-0.1397,-0.1489,-0.0801,-0.1557,-0.0107,-0.012,-0.0247,-0.0313,-0.023,-0.0262,0.0014,0.0015,-0.2271,-0.2393,-0.0753,-0.2348,0.0375,0.0434,-0.2272,-0.2432,-0.0979,-0.2359,0.0366,0.0423,0.3954,0.1848,-0.2326,0.3091,0.0488,0.053,-0.0392,-0.1729,-0.1631,-0.1609,-1.1294,-0.7558,-0.0391,-0.1729,-0.1631,-0.1609,-1.1191,-0.7558,-0.017,-0.03,-0.0176,-0.015,-2.0661,-1.2847,-0.0954,-0.2618,-0.2484,-0.2619,-1.2735,-0.8187,-0.0591,-0.2621,-0.2549,-0.259,-1.2912,-0.8419,-0.2939,-0.2057,-0.1516,-0.0677,0.0495,0.5802,-0.1711,-0.2031,-0.1167,-0.1933,2.5368,2.5323,0.4768,0.4756,-0.0553,0.4772,2.7816,2.7693,-0.1693,-0.1405,0.0625,-0.1484,3.0184,3.0225,-1.0963,-0.7536,0.9147,-0.9592,-3.6758,-3.6739,-0.1164,-0.1766,-0.1373,-0.1525,-0.0261,-0.0277,-0.0931,-0.1267,-0.0704,-0.1153,3.8785,3.8767,-0.1249,-0.2004,-0.1887,-0.2152,-0.0477,-0.0482,-13.0934,-11.6606,-0.3018,-12.8684,0.1831,0.1827,-0.1791,-0.4614,-0.4247,-0.4462,0.143,0.1425,-1.5119,-2.4666,-1.0599,-2.2861,0.1856,0.1855,-0.2405,-0.624,-0.5771,-0.6262,0.2411,0.2414,-0.2161,-0.1259,-0.1312,-0.2698,-0.1206,-0.1198,,,,,,,,,,,,,,,,,


In [9]:
# Subset the transaction features by dropping every masked column,
# i.e. any column whose name contains 'Aggregate_feature_' or 'Local_feature_'
columns_to_remove = [col for col in df_txn_features.columns if 'Aggregate_feature_' in col or 'Local_feature_' in col]
df_txn_features = df_txn_features.drop(columns=columns_to_remove)
print(df_txn_features.shape)

(203769, 19)
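
The resulting column count follows from the header shown earlier: 184 columns minus 93 local and 72 aggregate features leaves 19. As a small illustration of the same filtering step, the sketch below uses prefix matching with `str.startswith` on a toy frame. The frame, its values, and the names `toy_features` and `MASKED_PREFIXES` are assumptions made for the example only.

```python
import pandas as pd

# Toy stand-in for df_txn_features with one masked column of each kind
toy_features = pd.DataFrame({
    "txId": [101, 102],
    "Time step": [1, 1],
    "Local_feature_1": [-0.17, 0.02],
    "Aggregate_feature_1": [0.05, -0.30],
    "fees": [0.001, 0.002],
})

# str.startswith accepts a tuple, so both prefixes are tested in one call
MASKED_PREFIXES = ("Local_feature_", "Aggregate_feature_")
masked_cols = [c for c in toy_features.columns if c.startswith(MASKED_PREFIXES)]

# Drop the masked columns; the unmasked ones keep their original order
toy_unmasked = toy_features.drop(columns=masked_cols)
kept = list(toy_unmasked.columns)
print(masked_cols, kept)
```

Prefix matching is slightly stricter than the substring test used in the notebook cell, which only matters if some other column name happens to contain one of these strings in the middle.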


In [10]:
# add in class code and labels
df_txn_features_clean = pd.merge(df_txn_features, df_txn_classes, how = 'left', on = 'txId')

In [11]:
print(df_txn_features_clean.shape)
df_txn_features_clean.head(1)

(203769, 21)


Unnamed: 0,txId,Time step,in_txs_degree,out_txs_degree,total_BTC,fees,size,num_input_addresses,num_output_addresses,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total,class,class_label
0,386141483,31,,,,,,,,,,,,,,,,,,3,Unknown
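
The unchanged row count (203769) suggests that each txId matched at most one class row. A hedged way to make that assumption explicit is to let pandas check it during the merge, using the `validate` and `indicator` parameters of `pd.merge`. The sketch below runs on toy frames with made-up values; it is not the notebook's own code.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only)
toy_features = pd.DataFrame({"txId": [101, 102, 103], "fees": [0.001, 0.002, None]})
toy_classes = pd.DataFrame({"txId": [101, 102, 103], "class": [1, 2, 3]})

merged = pd.merge(
    toy_features,
    toy_classes,
    how="left",
    on="txId",
    validate="one_to_one",  # raises MergeError if txId repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows flagged 'left_only' are transactions that found no class row
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Remove the helper column once the check is done
merged = merged.drop(columns="_merge")
print(n_unmatched, merged.shape)
```

With a left join, duplicate keys on the right side would silently multiply rows, so failing fast through `validate` is usually preferable to noticing a changed shape afterwards.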


In [14]:
# Move the class code and label columns so that they start at column index 2
columns_to_move = ['class','class_label']
target_position = 2  # 0-based index for the third position

# Pop the columns in the desired order
for col in columns_to_move:
    col_data = df_txn_features_clean.pop(col)
    df_txn_features_clean.insert(target_position, col, col_data)
    target_position += 1  # Increment target position for the next column

In [16]:
print(df_txn_features_clean.shape)
df_txn_features_clean.head(1)

(203769, 21)


Unnamed: 0,txId,Time step,class,class_label,in_txs_degree,out_txs_degree,total_BTC,fees,size,num_input_addresses,num_output_addresses,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total
0,386141483,31,3,Unknown,,,,,,,,,,,,,,,,,
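
The pop-and-insert loop modifies the DataFrame in place. An alternative that leaves the original untouched is to build the new column order as a list and select with it. This is a minimal sketch on a toy frame, with `toy`, `cols_to_move`, and `new_order` being names assumed for the example.

```python
import pandas as pd

# Toy frame with the class columns at the end, as they are after the merge
toy = pd.DataFrame({
    "txId": [101],
    "Time step": [1],
    "fees": [0.001],
    "size": [225.0],
    "class": [1],
    "class_label": ["Illicit"],
})

cols_to_move = ["class", "class_label"]
target_position = 2  # 0-based index at which the moved columns should start

# Build the new order: leading columns, moved columns, then everything else
remaining = [c for c in toy.columns if c not in cols_to_move]
new_order = remaining[:target_position] + cols_to_move + remaining[target_position:]

# Selecting with a list of names returns a reordered copy
toy_reordered = toy[new_order]
print(list(toy_reordered.columns))
```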


-------------------

Save as a new table in BigQuery

-------------------

In [17]:
# Define your project ID and table ID
project_id = 'sixth-legend-440110-g7'
table1_id = 'txn_data.txn_features_clean'

In [18]:
# Save DataFrame to BigQuery
to_gbq(df_txn_features_clean, table1_id, project_id=project_id, if_exists='replace')



100%|██████████| 1/1 [00:00<00:00, 7489.83it/s]


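Check that the data uploaded to the new table matches the DataFrame. BigQuery does not guarantee row order without an ORDER BY clause, so the first row returned may differ from the first row of the local DataFrame; the shape should be identical.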

In [19]:
%%bigquery df_txn_features_clean
-- Read the cleaned transaction features table back from BigQuery
select * from `sixth-legend-440110-g7.txn_data.txn_features_clean`;

In [20]:
print(df_txn_features_clean.shape)
df_txn_features_clean.head(1)

(203769, 21)


Unnamed: 0,txId,Time step,class,class_label,in_txs_degree,out_txs_degree,total_BTC,fees,size,num_input_addresses,num_output_addresses,in_BTC_min,in_BTC_max,in_BTC_mean,in_BTC_median,in_BTC_total,out_BTC_min,out_BTC_max,out_BTC_mean,out_BTC_median,out_BTC_total
0,184703182,1,1,Illicit,0.0,1.0,0.0541,0.001,225.0,1.0,2.0,0.0551,0.0551,0.0551,0.0551,0.0551,0.0101,0.0441,0.0271,0.0271,0.0541
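
Matching shapes do not by themselves show that the values survived the round trip. One possible stricter check is to sort both frames by txId and compare them with `pandas.testing.assert_frame_equal`. The sketch below simulates the download with a reordered copy of a toy frame; `same_rows`, `local_df`, and `downloaded_df` are hypothetical names, and it assumes the downloaded table is stored under a different variable than the local DataFrame (the cell above reuses the same name).

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Toy local frame; the reversed copy simulates rows returned in another order
local_df = pd.DataFrame({
    "txId": [386141483, 184703182],
    "class": [3, 1],
    "fees": [None, 0.001],
})
downloaded_df = local_df.iloc[::-1].reset_index(drop=True)


def same_rows(a: pd.DataFrame, b: pd.DataFrame, key: str = "txId") -> bool:
    """Return True if both frames hold the same rows, ignoring row order."""
    a_sorted = a.sort_values(key).reset_index(drop=True)
    b_sorted = b.sort_values(key).reset_index(drop=True)
    try:
        # check_dtype=False tolerates dtype changes introduced by the upload
        assert_frame_equal(a_sorted, b_sorted, check_dtype=False)
        return True
    except AssertionError:
        return False


round_trip_ok = same_rows(local_df, downloaded_df)
print(round_trip_ok)
```

`assert_frame_equal` treats NaN values in the same position as equal, which matters here because many transactions have no unmasked features.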


Enhancement of the transaction dataset is complete.