# Exploratory Data Analysis (EDA)

In this notebook, we perform exploratory data analysis to understand:

- class distributions
- missing values
- data quality issues

The data is from [rphunter](https://figshare.com/s/e6e7cce0b6574770e7ce).

## Imports & Configuration

In [None]:

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import os
import seaborn as sns
import re
from sklearn.preprocessing import MultiLabelBinarizer
from pathlib import Path
import subprocess

%matplotlib inline
sns.set(style="whitegrid")


In [76]:
PATH = "../../data/external/rphunter"


## Load Dataset

In [77]:
total_df = pd.read_excel(os.path.join(PATH, "Rug-Pull-Incidents.xlsx"), engine='openpyxl', sheet_name="Total")
experiment_df = pd.read_excel(os.path.join(PATH, "Rug-Pull-Incidents.xlsx"), engine='openpyxl', sheet_name="Experiment")
normal_list = os.listdir(os.path.join(PATH, "Normal-Bytecode"))
rug_list = os.listdir(os.path.join(PATH, "Rug-Bytecode"))

## Initial Data Check

In [78]:
print(f"total_df shape: {total_df.shape} | experiment_df shape: {experiment_df.shape}")
print(f"Number of normal bytecode files: {len(normal_list)} | Number of rug bytecode files: {len(rug_list)}")

total_df shape: (1047, 8) | experiment_df shape: (645, 8)
Number of normal bytecode files: 1675 | Number of rug bytecode files: 645


In [79]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1047 entries, 0 to 1046
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           1047 non-null   object
 1   Chain                  1046 non-null   object
 2   Address                1035 non-null   object
 3   Open Source            1047 non-null   object
 4   Sale Restrict          291 non-null    object
 5   Variable Manipulation  160 non-null    object
 6   Balance Tamper         436 non-null    object
 7   Source                 1046 non-null   object
dtypes: object(8)
memory usage: 65.6+ KB


In [80]:
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645 entries, 0 to 644
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           645 non-null    object
 1   Chain                  645 non-null    object
 2   Address                645 non-null    object
 3   Open Source            645 non-null    object
 4   Sale Restrict          204 non-null    object
 5   Variable Manipulation  145 non-null    object
 6   Balance Tamper         190 non-null    object
 7   Source                 644 non-null    object
dtypes: object(8)
memory usage: 40.4+ KB


In [81]:
total_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/BeosinAlert/status/1681240...
1,PokémonFi,BSC,0X2753DCE37A7EDB052A77832039BCC9AA49AD8B25,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
2,PokémonFi,BSC,0X0AA5CAE4D1C9230543542E998E04EA795EEDF738,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
3,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,,,Hidden Mint/Burn,https://x.com/PeckShieldAlert/status/156196749...
4,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/PeckShieldAlert/status/155...


In [82]:
experiment_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/BeosinAlert/status/1681240...
1,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,,,Hidden Mint/Burn,https://x.com/PeckShieldAlert/status/156196749...
2,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/PeckShieldAlert/status/155...
3,DHE,BSC,0X11CBC781DADAAD13FC3A361772C80B1C027820AF,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1539031...
4,ElonMVP,BSC,0X3E597EA168A85AA2AE5E2C4333665BCD875ED10F,Yes,Address Restrict,,,https://twitter.com/PeckShieldAlert/status/153...


In [83]:
total_df.describe()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
count,1047,1046,1035,1047,291,160,436,1046
unique,1029,15,1035,2,15,10,9,1025
top,IDO rug pulls,BSC,0xD2fC424fF0196c3Fb2b3D2A3b29dDEe94e025aee,Yes,Address Restrict,Modifiable Tax Rate,Hidden Mint/Burn,https://hacked.slowmist.io/search/
freq,5,522,1,916,140,80,334,6


In [84]:
experiment_df.describe()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
count,645,645,645,645,204,145,190,644
unique,638,7,645,2,14,10,7,631
top,Apache NFT SalesRoom,BSC,0xabe776435f7459e2f5ba773bfb753ed19a053dd0,Yes,Address Restrict,Modifiable Tax Rate,Hidden Mint/Burn,https://hacked.slowmist.io/search/
freq,2,324,1,643,92,69,132,6


## Labels Distribution

### Chain

In [85]:
total_df['Chain'].value_counts()

Chain
BSC          522
ETH          400
Polygon       84
Arbitrum      11
Fantom        11
Avax           4
BASE           2
Cronos         2
Heco           2
AVAX           2
KuCoin         2
Cchain         1
OP.ETH         1
SnowTrace      1
Blast          1
Name: count, dtype: int64

In [86]:
experiment_df['Chain'].value_counts()

Chain
BSC         324
ETH         294
Polygon      15
Arbitrum      8
Fantom        2
OP.ETH        1
BASE          1
Name: count, dtype: int64

### Open Source

In [87]:
total_df['Open Source'].value_counts()

Open Source
Yes    916
No     131
Name: count, dtype: int64

In [88]:
experiment_df['Open Source'].value_counts()

Open Source
Yes    643
No       2
Name: count, dtype: int64

### Sale Restrict

In [89]:
total_df['Sale Restrict'].value_counts()

Sale Restrict
Address Restrict                                           140
Amount Restrict                                             90
Address Restrict,Amount Restrict                            36
Address Restrict,TimeStamp Restrict                          7
\nAddress Restrict,Amount Restrict                           4
Address Restrict,Amount Restrict,TimeStamp Restrict          3
TimeStamp Restrict                                           2
\nAddress Restrict                                           2
Amount Restrict,TimeStamp Restrict                           1
Modifiable External Call                                     1
Address Restrict                                             1
Address Restrict,Address Restrict*3                          1
\n\nAddress Restrict,Amount Restrict,TimeStamp Restrict      1
\nAmount Restrict                                            1
TimeStanp Restrict                                           1
Name: count, dtype: int64

In [90]:
experiment_df['Sale Restrict'].value_counts()

Sale Restrict
Address Restrict                                           92
Amount Restrict                                            62
Address Restrict,Amount Restrict                           27
Address Restrict,TimeStamp Restrict                         7
\nAddress Restrict,Amount Restrict                          4
Address Restrict,Amount Restrict,TimeStamp Restrict         3
TimeStamp Restrict                                          2
Modifiable External Call                                    1
Amount Restrict,TimeStamp Restrict                          1
Address Restrict                                            1
Address Restrict,Address Restrict*3                         1
\nAddress Restrict                                          1
\n\nAddress Restrict,Amount Restrict,TimeStamp Restrict     1
\nAmount Restrict                                           1
Name: count, dtype: int64

### Variable Manipulation

In [91]:
total_df['Variable Manipulation'].value_counts()

Variable Manipulation
Modifiable Tax Rate                                   80
Modifiable External Call                              46
Modifiable Tax Address                                18
Modifiable Tax Rate,Modifiable Tax Address             6
Modifibale Tax Rate                                    4
\nModifiable Tax Rate                                  2
Modifuable Tax Rate                                    1
Modifibale Tax Rate,Modifiable Tax Address             1
Modifiable Tax Address,Hidden Balance Modification     1
\n                                                     1
Name: count, dtype: int64

In [92]:
experiment_df['Variable Manipulation'].value_counts()

Variable Manipulation
Modifiable Tax Rate                             69
Modifiable External Call                        43
Modifiable Tax Address                          17
Modifiable Tax Rate,Modifiable Tax Address       6
Modifibale Tax Rate                              4
\nModifiable Tax Rate                            2
Modifuable Tax Rate                              1
Modifibale Tax Rate,Modifiable Tax Address       1
Modifiable Tax Address,Hidden Balance Modify     1
\n                                               1
Name: count, dtype: int64

### Balance Tamper

In [93]:
total_df['Balance Tamper'].value_counts()

Balance Tamper
Hidden Mint/Burn                                  334
Hidden Balance Modification                        54
\nHidden Mint/Burn                                 35
Hidden Mint/Burn,Hidden Balance Modification        6
\n                                                  2
\n\nHidden Mint/Burn                                2
Hidden Mint/Eurn                                    1
Hidden Mint/Burn茂录聦Hidden Balance Modification      1
\n\n                                                1
Name: count, dtype: int64

In [94]:
experiment_df['Balance Tamper'].value_counts()

Balance Tamper
Hidden Mint/Burn                               132
Hidden Balance Modify                           46
Hidden Mint/Burn,Hidden Balance Modify           5
\nHidden Mint/Burn                               4
Hidden Mint/Burn脙炉脗录脗聦Hidden Balance Modify      1
\n\nHidden Mint/Burn                             1
\n                                               1
Name: count, dtype: int64

## Cleaning

In [95]:
# check if 'Address' column exists in Noremal-Bytecode and Rug-Bytecode directories or not
normal_bytecode_df = pd.DataFrame({'Address': normal_list})
rug_bytecode_df = pd.DataFrame({'Address': rug_list})
normal_bytecode_df['Address'] = normal_bytecode_df['Address'].apply(lambda x: x.split('.')[0])
rug_bytecode_df['Address'] = rug_bytecode_df['Address'].apply(lambda x: x.split('.')[0])
print(f"Normal Bytecode Addresses: {normal_bytecode_df['Address'].nunique()}")
print(f"Rug Bytecode Addresses: {rug_bytecode_df['Address'].nunique()}")

Normal Bytecode Addresses: 1675
Rug Bytecode Addresses: 645


In [96]:
normal_bytecode_df.head()

Unnamed: 0,Address
0,0x6B466B0232640382950c45440Ea5b630744eCa99
1,0x4E15361FD6b4BB609Fa63C81A2be19d873717870
2,0xa95c4f2e0d6455637f67F655Da4AFAe5d50d859B
3,0x35dd2ebf20746C6e658fac75cd80D4722fae62f6
4,0x264Dc2DedCdcbb897561A57CBa5085CA416fb7b4


In [97]:
print(f"Number of addresses in total_df that are in normal_bytecode_df: {total_df[total_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in total_df that are in normal_bytecode_df: 2


In [98]:
print(f"Number of addresses in experiment_df that are in normal_bytecode_df: {experiment_df[experiment_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in experiment_df that are in normal_bytecode_df: 0


In [99]:
print(f"Number of addresses in total_df that are in rug_bytecode_df: {total_df[total_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in total_df that are in rug_bytecode_df: 634


In [100]:
print(f"Number of addresses in experiment_df that are in rug_bytecode_df: {experiment_df[experiment_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in experiment_df that are in rug_bytecode_df: 644


In [101]:
empty_cond = ((total_df['Balance Tamper'].isna() | (total_df['Balance Tamper'].str.strip() == '')) &
    (total_df['Variable Manipulation'].isna() | (total_df['Variable Manipulation'].str.strip() == '')) &
    (total_df['Sale Restrict'].isna() | (total_df['Sale Restrict'].str.strip() == '')))

empty_or_nan_rows = total_df[empty_cond].shape[0]

print(f"Number of rows with empty or NaN values in 'Balance Tamper', 'Variable Manipulation', and 'Sale Restrict': {empty_or_nan_rows}")

Number of rows with empty or NaN values in 'Balance Tamper', 'Variable Manipulation', and 'Sale Restrict': 359


In [102]:
# remove empty_or_nan_rows from total_df
total_df = total_df[~empty_cond]

total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 688 entries, 0 to 688
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           688 non-null    object
 1   Chain                  688 non-null    object
 2   Address                688 non-null    object
 3   Open Source            688 non-null    object
 4   Sale Restrict          291 non-null    object
 5   Variable Manipulation  160 non-null    object
 6   Balance Tamper         435 non-null    object
 7   Source                 688 non-null    object
dtypes: object(8)
memory usage: 48.4+ KB


In [103]:
total_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/BeosinAlert/status/1681240...
1,PokémonFi,BSC,0X2753DCE37A7EDB052A77832039BCC9AA49AD8B25,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
2,PokémonFi,BSC,0X0AA5CAE4D1C9230543542E998E04EA795EEDF738,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
3,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,,,Hidden Mint/Burn,https://x.com/PeckShieldAlert/status/156196749...
4,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/PeckShieldAlert/status/155...


In [104]:
total_df.drop_duplicates()
experiment_df.drop_duplicates()
total_df.dropna(subset=['Address'], inplace=True)
experiment_df.dropna(subset=['Address'], inplace=True)

In [105]:
empty_cond_exp = ((experiment_df['Balance Tamper'].isna() | (experiment_df['Balance Tamper'].str.strip() == '')) &
    (experiment_df['Variable Manipulation'].isna() | (experiment_df['Variable Manipulation'].str.strip() == '')) &
    (experiment_df['Sale Restrict'].isna() | (experiment_df['Sale Restrict'].str.strip() == '')))
empty_or_nan_rows_exp = experiment_df[
    empty_cond_exp
].shape[0]

print(f"Number of rows with empty or NaN values in 'Balance Tamper', 'Variable Manipulation', and 'Sale Restrict': {empty_or_nan_rows_exp}")

Number of rows with empty or NaN values in 'Balance Tamper', 'Variable Manipulation', and 'Sale Restrict': 229


In [106]:
experiment_df = experiment_df[~empty_cond_exp]
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 415
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           416 non-null    object
 1   Chain                  416 non-null    object
 2   Address                416 non-null    object
 3   Open Source            416 non-null    object
 4   Sale Restrict          204 non-null    object
 5   Variable Manipulation  145 non-null    object
 6   Balance Tamper         190 non-null    object
 7   Source                 416 non-null    object
dtypes: object(8)
memory usage: 29.2+ KB


In [107]:
experiment_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/BeosinAlert/status/1681240...
1,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,,,Hidden Mint/Burn,https://x.com/PeckShieldAlert/status/156196749...
2,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,,Modifiable Tax Rate,Hidden Mint/Burn,https://twitter.com/PeckShieldAlert/status/155...
3,DHE,BSC,0X11CBC781DADAAD13FC3A361772C80B1C027820AF,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1539031...
4,ElonMVP,BSC,0X3E597EA168A85AA2AE5E2C4333665BCD875ED10F,Yes,Address Restrict,,,https://twitter.com/PeckShieldAlert/status/153...


In [108]:
print(f"Number of addresses in total_df that are in normal_bytecode_df: {total_df[total_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in total_df that are in normal_bytecode_df: 0


In [109]:
print(f"Number of addresses in experiment_df that are in normal_bytecode_df: {experiment_df[experiment_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower())].shape[0]}")

Number of addresses in experiment_df that are in normal_bytecode_df: 0


In [110]:
print(f"Number of addresses in total_df that are in rug_bytecode_df: {total_df[total_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())].shape[0]}")
total_df.info()

Number of addresses in total_df that are in rug_bytecode_df: 415
<class 'pandas.core.frame.DataFrame'>
Index: 688 entries, 0 to 688
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           688 non-null    object
 1   Chain                  688 non-null    object
 2   Address                688 non-null    object
 3   Open Source            688 non-null    object
 4   Sale Restrict          291 non-null    object
 5   Variable Manipulation  160 non-null    object
 6   Balance Tamper         435 non-null    object
 7   Source                 688 non-null    object
dtypes: object(8)
memory usage: 48.4+ KB


In [112]:
print(f"Number of addresses in experiment_df that are in rug_bytecode_df: {experiment_df[experiment_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())].shape[0]}")
experiment_df.info()

Number of addresses in experiment_df that are in rug_bytecode_df: 415
<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 415
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Project Name           416 non-null    object
 1   Chain                  416 non-null    object
 2   Address                416 non-null    object
 3   Open Source            416 non-null    object
 4   Sale Restrict          204 non-null    object
 5   Variable Manipulation  145 non-null    object
 6   Balance Tamper         190 non-null    object
 7   Source                 416 non-null    object
dtypes: object(8)
memory usage: 29.2+ KB


In [113]:
address_is_not_in_normal_n_rug = total_df[~total_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower()) & ~total_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())]
address_is_not_in_normal_n_rug

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
1,PokémonFi,BSC,0X2753DCE37A7EDB052A77832039BCC9AA49AD8B25,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
2,PokémonFi,BSC,0X0AA5CAE4D1C9230543542E998E04EA795EEDF738,Yes,Address Restrict,,,https://twitter.com/CertiKAlert/status/1562555...
12,NEKOGOLD,BSC,0X4534A3DF5BFCEDAABD2F3F557271F42B7BD57543,No,,,Hidden Mint/Burn,https://twitter.com/PeckShieldAlert/status/153...
17,NFTflow,ETH,0X253954D29386E174ED4BC69902391A8ED3FD51CA,Yes,,,Hidden Mint/Burn,https://de.fi/rekt-database/nftflow
19,Lucky star Currency Token,BSC,0X2B3559C3DBDB294CBB71F2B30A693F4C6BE6132D,Yes,Address Restrict,,Hidden Balance Modification,https://x.com/BeosinAlert/status/1711268989450...
...,...,...,...,...,...,...,...,...
672,Yearn Prometheus,ETH,0X528FF33BF5BF96B5392C10BC4748D9E9FB5386B2,Yes,Address Restrict,,,https://de.fi/rekt-database/Yearn Prometheus
673,Goblin Town,ETH,0XBB7F05AA2DD33425EA0848CDA8E4EA54718C6336,Yes,Address Restrict,,,https://de.fi/rekt-database/Goblin Town
674,NoVASHIBA.COM,BSC,0X68CA1321BF1BF6B243F57C5EDBEE62304FD8CB30,No,Address Restrict,,,https://de.fi/rekt-database/NoVASHIBA.COM
675,Moonwalk,BSC,0XE64F4A5C364A92AB62E27076907D55334609C8AD,Yes,Address Restrict,,,https://de.fi/rekt-database/Moonwalk


In [115]:
address_is_not_in_normal_n_rug_exp = experiment_df[~experiment_df['Address'].str.lower().isin(normal_bytecode_df['Address'].str.lower()) & ~experiment_df['Address'].str.lower().isin(rug_bytecode_df['Address'].str.lower())]
address_is_not_in_normal_n_rug_exp

Unnamed: 0,Project Name,Chain,Address,Open Source,Sale Restrict,Variable Manipulation,Balance Tamper,Source
414,Squid Game,BSC,0X9531C509A24CEEC710529645FC347341FF9F15EA,Yes,,Modifiable External Call,,https://de.fi/rekt-database/Squid Game


In [141]:
import shutil

# Set source search root and destination folder
project_root = Path.home() / "Dev/master/dissertation"  # update if needed
destination_dir = Path.home() / "Dev/master/dissertation/workspace/ml/data/external/rphunter/Rug-Bytecode"

# Combine both address lists
all_addresses = pd.concat([
    address_is_not_in_normal_n_rug_exp['Address'],
    address_is_not_in_normal_n_rug['Address']
]).drop_duplicates().str.lower().str.strip()

for address in all_addresses:
    print(f"🔍 Searching for: {address}.hex")

    result = subprocess.run(
        ['find', str(project_root), '-iname', f'*{address}*.hex'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )


    found_paths = [p for p in result.stdout.strip().split('\n') if p]
    for path in found_paths:
        src_path = Path(path)
        dest_path = destination_dir / src_path.name
        print(f"📦 Copying: {src_path} → {dest_path}")
        shutil.copy(src_path, dest_path)

🔍 Searching for: 0x9531c509a24ceec710529645fc347341ff9f15ea.hex
🔍 Searching for: 0x2753dce37a7edb052a77832039bcc9aa49ad8b25.hex
📦 Copying: /Users/napatcholthaipanich/Dev/master/dissertation/workspace/ml/data/external/crpwarner/groundtruth/hex/0x2753dcE37A7eDB052a77832039bcc9aA49Ad8b25.hex → /Users/napatcholthaipanich/Dev/master/dissertation/workspace/ml/data/external/rphunter/Rug-Bytecode/0x2753dcE37A7eDB052a77832039bcc9aA49Ad8b25.hex
📦 Copying: /Users/napatcholthaipanich/Dev/master/dissertation/EarlyDeliverable/sandbox/CRPWarner/complete_dataset/CRPWarner/dataset/groundtruth/hex/0x2753dcE37A7eDB052a77832039bcc9aA49Ad8b25.hex → /Users/napatcholthaipanich/Dev/master/dissertation/workspace/ml/data/external/rphunter/Rug-Bytecode/0x2753dcE37A7eDB052a77832039bcc9aA49Ad8b25.hex
🔍 Searching for: 0x0aa5cae4d1c9230543542e998e04ea795eedf738.hex
🔍 Searching for: 0x4534a3df5bfcedaabd2f3f557271f42b7bd57543.hex
🔍 Searching for: 0x253954d29386e174ed4bc69902391a8ed3fd51ca.hex
🔍 Searching for: 0x2b3559

## Encode Label

In [304]:
total_df = pd.read_excel(os.path.join(PATH, "Rug-Pull-Incidents.xlsx"), engine='openpyxl', sheet_name="Total")
experiment_df = pd.read_excel(os.path.join(PATH, "Rug-Pull-Incidents.xlsx"), engine='openpyxl', sheet_name="Experiment")

In [305]:
# drop 'Address' column from total_df and experiment_df
total_df.dropna(subset=['Address'],axis=0, inplace=True)
experiment_df.dropna(subset=['Address'],axis=0, inplace=True)

In [306]:
# drop duplicate rows
total_df.drop_duplicates(inplace=True)
experiment_df.drop_duplicates(inplace=True)

In [307]:
# drop missing labels of 'Balance Tamper', 'Variable Manipulation', and 'Sale Restrict' from total_df and experiment_df
missing_condition = ((total_df['Balance Tamper'].isna() | (total_df['Balance Tamper'].str.strip() == '')) &
                     (total_df['Variable Manipulation'].isna() | (total_df['Variable Manipulation'].str.strip() == '')) &
                     (total_df['Sale Restrict'].isna() | (total_df['Sale Restrict'].str.strip() == '')))
total_df = total_df[~missing_condition]

missing_condition_exp = ((experiment_df['Balance Tamper'].isna() | (experiment_df['Balance Tamper'].str.strip() == '')) &
                         (experiment_df['Variable Manipulation'].isna() | (experiment_df['Variable Manipulation'].str.strip() == '')) &
                         (experiment_df['Sale Restrict'].isna() | (experiment_df['Sale Restrict'].str.strip() == '')))
experiment_df = experiment_df[~missing_condition_exp]

### Balance Tamper

In [308]:
total_df['Balance Tamper'].value_counts()

Balance Tamper
Hidden Mint/Burn                                  334
Hidden Balance Modification                        54
\nHidden Mint/Burn                                 35
Hidden Mint/Burn,Hidden Balance Modification        6
\n\nHidden Mint/Burn                                2
\n                                                  2
Hidden Mint/Eurn                                    1
Hidden Mint/Burn茂录聦Hidden Balance Modification      1
Name: count, dtype: int64

In [309]:
experiment_df['Balance Tamper'].value_counts()

Balance Tamper
Hidden Mint/Burn                               132
Hidden Balance Modify                           46
Hidden Mint/Burn,Hidden Balance Modify           5
\nHidden Mint/Burn                               4
Hidden Mint/Burn脙炉脗录脗聦Hidden Balance Modify      1
\n\nHidden Mint/Burn                             1
\n                                               1
Name: count, dtype: int64

In [310]:
def clean_balance_tamper(value):
    if pd.isna(value):
        return ''

    # Strip whitespace and linebreaks
    value = value.strip()

    # Replace multiple linebreaks or strange characters with a comma
    value = re.sub(r'[\n\r]+', ',', value)
    value = re.sub(r'[茂录聦]', ',', value)  # Remove strange unicode artifacts
    value = re.sub(r'[脙炉脗录脗聦]', ',', value)  # Remove strange unicode artifacts

    # Correct known typos
    value = value.replace("Hidden Mint/Eurn", "Hidden Mint/Burn")
    value = value.replace("Hidden Balance Modify", "Hidden Balance Modification")

    # Split by comma and normalize each part
    parts = [p.strip() for p in value.split(',') if p.strip()]

    # Deduplicate and sort for consistency
    parts = sorted(set(parts))

    return list(set(parts))

# Apply to both dataframes
total_df['Balance Tamper'] = total_df['Balance Tamper'].apply(clean_balance_tamper)
experiment_df['Balance Tamper'] = experiment_df['Balance Tamper'].apply(clean_balance_tamper)

# Optional: show cleaned value counts
mlb = MultiLabelBinarizer()
tamper_encoded = pd.DataFrame(mlb.fit_transform(total_df['Balance Tamper']),
                              columns=mlb.classes_,
                              index=total_df.index)

total_df = pd.concat([total_df.drop(columns=['Balance Tamper']), tamper_encoded], axis=1)

amper_encoded = pd.DataFrame(mlb.fit_transform(experiment_df['Balance Tamper']),
                              columns=mlb.classes_,
                              index=experiment_df.index)
experiment_df = pd.concat([experiment_df.drop(columns=['Balance Tamper']), tamper_encoded], axis=1)

In [311]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 688 entries, 0 to 688
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Project Name                 688 non-null    object
 1   Chain                        688 non-null    object
 2   Address                      688 non-null    object
 3   Open Source                  688 non-null    object
 4   Sale Restrict                291 non-null    object
 5   Variable Manipulation        160 non-null    object
 6   Source                       688 non-null    object
 7   Hidden Balance Modification  688 non-null    int64 
 8   Hidden Mint/Burn             688 non-null    int64 
dtypes: int64(2), object(7)
memory usage: 69.9+ KB


In [312]:
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 689 entries, 0 to 688
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Project Name                 416 non-null    object 
 1   Chain                        416 non-null    object 
 2   Address                      416 non-null    object 
 3   Open Source                  416 non-null    object 
 4   Sale Restrict                204 non-null    object 
 5   Variable Manipulation        145 non-null    object 
 6   Source                       416 non-null    object 
 7   Hidden Balance Modification  688 non-null    float64
 8   Hidden Mint/Burn             688 non-null    float64
dtypes: float64(2), object(7)
memory usage: 53.8+ KB


### Sale Restrict

In [313]:
total_df['Sale Restrict'].value_counts()

Sale Restrict
Address Restrict                                           140
Amount Restrict                                             90
Address Restrict,Amount Restrict                            36
Address Restrict,TimeStamp Restrict                          7
\nAddress Restrict,Amount Restrict                           4
Address Restrict,Amount Restrict,TimeStamp Restrict          3
TimeStamp Restrict                                           2
\nAddress Restrict                                           2
Amount Restrict,TimeStamp Restrict                           1
Modifiable External Call                                     1
Address Restrict                                             1
Address Restrict,Address Restrict*3                          1
\n\nAddress Restrict,Amount Restrict,TimeStamp Restrict      1
\nAmount Restrict                                            1
TimeStanp Restrict                                           1
Name: count, dtype: int64

In [314]:
experiment_df['Sale Restrict'].value_counts()

Sale Restrict
Address Restrict                                           92
Amount Restrict                                            62
Address Restrict,Amount Restrict                           27
Address Restrict,TimeStamp Restrict                         7
\nAddress Restrict,Amount Restrict                          4
Address Restrict,Amount Restrict,TimeStamp Restrict         3
TimeStamp Restrict                                          2
Modifiable External Call                                    1
Amount Restrict,TimeStamp Restrict                          1
Address Restrict                                            1
Address Restrict,Address Restrict*3                         1
\nAddress Restrict                                          1
\n\nAddress Restrict,Amount Restrict,TimeStamp Restrict     1
\nAmount Restrict                                           1
Name: count, dtype: int64

In [315]:
def extract_sale_restrict_labels(value):
    if pd.isna(value) or not str(value).strip():
        return []

    value = re.sub(r'[\n\r]+', ',', value)

    # Fix known typos
    value = value.replace("TimeStanp Restrict", "TimeStamp Restrict")

    # Remove suffix like "*3" (e.g. Address Restrict*3)
    value = re.sub(r'\*[\d]+', '', value)

    # Split and clean
    parts = [p.strip() for p in value.split(',') if p.strip()]

    # Normalize duplicates
    parts = sorted(set(parts))
    return parts

# Apply label extraction
total_df['Sale Restrict'] = total_df['Sale Restrict'].apply(extract_sale_restrict_labels)
experiment_df['Sale Restrict'] = experiment_df['Sale Restrict'].apply(extract_sale_restrict_labels)

# Binarize
mlb_sale = MultiLabelBinarizer()

sale_encoded = pd.DataFrame(mlb_sale.fit_transform(total_df['Sale Restrict']),
                            columns=mlb_sale.classes_,
                            index=total_df.index)

# Drop original and concat
total_df = pd.concat([total_df.drop(columns=['Sale Restrict']), sale_encoded], axis=1)

# Do same for experiment_df
exp_sale_encoded = pd.DataFrame(mlb_sale.transform(experiment_df['Sale Restrict']),
                                columns=mlb_sale.classes_,
                                index=experiment_df.index)

experiment_df = pd.concat([experiment_df.drop(columns=['Sale Restrict']), exp_sale_encoded], axis=1)


In [316]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 688 entries, 0 to 688
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Project Name                 688 non-null    object
 1   Chain                        688 non-null    object
 2   Address                      688 non-null    object
 3   Open Source                  688 non-null    object
 4   Variable Manipulation        160 non-null    object
 5   Source                       688 non-null    object
 6   Hidden Balance Modification  688 non-null    int64 
 7   Hidden Mint/Burn             688 non-null    int64 
 8   Address Restrict             688 non-null    int64 
 9   Amount Restrict              688 non-null    int64 
 10  Modifiable External Call     688 non-null    int64 
 11  TimeStamp Restrict           688 non-null    int64 
dtypes: int64(6), object(6)
memory usage: 86.0+ KB


In [317]:
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 689 entries, 0 to 688
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Project Name                 416 non-null    object 
 1   Chain                        416 non-null    object 
 2   Address                      416 non-null    object 
 3   Open Source                  416 non-null    object 
 4   Variable Manipulation        145 non-null    object 
 5   Source                       416 non-null    object 
 6   Hidden Balance Modification  688 non-null    float64
 7   Hidden Mint/Burn             688 non-null    float64
 8   Address Restrict             689 non-null    int64  
 9   Amount Restrict              689 non-null    int64  
 10  Modifiable External Call     689 non-null    int64  
 11  TimeStamp Restrict           689 non-null    int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 70.0+ KB


### Variable Manipulation

In [318]:
total_df['Variable Manipulation'].value_counts()

Variable Manipulation
Modifiable Tax Rate                                   80
Modifiable External Call                              46
Modifiable Tax Address                                18
Modifiable Tax Rate,Modifiable Tax Address             6
Modifibale Tax Rate                                    4
\nModifiable Tax Rate                                  2
Modifuable Tax Rate                                    1
Modifibale Tax Rate,Modifiable Tax Address             1
Modifiable Tax Address,Hidden Balance Modification     1
\n                                                     1
Name: count, dtype: int64

In [319]:
experiment_df['Variable Manipulation'].value_counts()

Variable Manipulation
Modifiable Tax Rate                             69
Modifiable External Call                        43
Modifiable Tax Address                          17
Modifiable Tax Rate,Modifiable Tax Address       6
Modifibale Tax Rate                              4
\nModifiable Tax Rate                            2
Modifuable Tax Rate                              1
Modifibale Tax Rate,Modifiable Tax Address       1
Modifiable Tax Address,Hidden Balance Modify     1
\n                                               1
Name: count, dtype: int64

In [320]:
def extract_variable_manipulation_labels(value):
    if pd.isna(value) or not str(value).strip():
        return []

    value = re.sub(r'[\n\r]+', ',', value)

    # Correct typos
    value = value.replace("Modifibale Tax Rate", "Modifiable Tax Rate")
    value = value.replace("Modifuable Tax Rate", "Modifiable Tax Rate")
    value = value.replace("Hidden Balance Modify", "")  # strip unrelated label

    # Split and clean
    parts = [p.strip() for p in value.split(',') if p.strip()]

    # Deduplicate
    parts = sorted(set(parts))
    return parts

# Apply function
total_df['Variable Manipulation'] = total_df['Variable Manipulation'].apply(extract_variable_manipulation_labels)
experiment_df['Variable Manipulation'] = experiment_df['Variable Manipulation'].apply(extract_variable_manipulation_labels)

# Binarize
mlb_var = MultiLabelBinarizer()
var_encoded = pd.DataFrame(mlb_var.fit_transform(total_df['Variable Manipulation']),
                           columns=mlb_var.classes_,
                           index=total_df.index)

# Merge and drop
total_df = pd.concat([total_df.drop(columns=['Variable Manipulation']), var_encoded], axis=1)

# Repeat for experiment_df
exp_var_encoded = pd.DataFrame(mlb_var.transform(experiment_df['Variable Manipulation']),
                               columns=mlb_var.classes_,
                               index=experiment_df.index)

experiment_df = pd.concat([experiment_df.drop(columns=['Variable Manipulation']), exp_var_encoded], axis=1)

In [321]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 688 entries, 0 to 688
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Project Name                 688 non-null    object
 1   Chain                        688 non-null    object
 2   Address                      688 non-null    object
 3   Open Source                  688 non-null    object
 4   Source                       688 non-null    object
 5   Hidden Balance Modification  688 non-null    int64 
 6   Hidden Mint/Burn             688 non-null    int64 
 7   Address Restrict             688 non-null    int64 
 8   Amount Restrict              688 non-null    int64 
 9   Modifiable External Call     688 non-null    int64 
 10  TimeStamp Restrict           688 non-null    int64 
 11  Hidden Balance Modification  688 non-null    int64 
 12  Modifiable External Call     688 non-null    int64 
 13  Modifiable Tax Address       688 non-nul

In [322]:
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 689 entries, 0 to 688
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Project Name                 416 non-null    object 
 1   Chain                        416 non-null    object 
 2   Address                      416 non-null    object 
 3   Open Source                  416 non-null    object 
 4   Source                       416 non-null    object 
 5   Hidden Balance Modification  688 non-null    float64
 6   Hidden Mint/Burn             688 non-null    float64
 7   Address Restrict             689 non-null    int64  
 8   Amount Restrict              689 non-null    int64  
 9   Modifiable External Call     689 non-null    int64  
 10  TimeStamp Restrict           689 non-null    int64  
 11  Hidden Balance Modification  689 non-null    int64  
 12  Modifiable External Call     689 non-null    int64  
 13  Modifiable Tax Address   

In [323]:
total_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Source,Hidden Balance Modification,Hidden Mint/Burn,Address Restrict,Amount Restrict,Modifiable External Call,TimeStamp Restrict,Hidden Balance Modification.1,Modifiable External Call.1,Modifiable Tax Address,Modifiable Tax Rate
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,https://twitter.com/BeosinAlert/status/1681240...,0,1,0,0,0,0,0,0,0,1
1,PokémonFi,BSC,0X2753DCE37A7EDB052A77832039BCC9AA49AD8B25,Yes,https://twitter.com/CertiKAlert/status/1562555...,0,0,1,0,0,0,0,0,0,0
2,PokémonFi,BSC,0X0AA5CAE4D1C9230543542E998E04EA795EEDF738,Yes,https://twitter.com/CertiKAlert/status/1562555...,0,0,1,0,0,0,0,0,0,0
3,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,https://x.com/PeckShieldAlert/status/156196749...,0,1,0,0,0,0,0,0,0,0
4,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,https://twitter.com/PeckShieldAlert/status/155...,0,1,0,0,0,0,0,0,0,1


In [324]:
experiment_df.head()

Unnamed: 0,Project Name,Chain,Address,Open Source,Source,Hidden Balance Modification,Hidden Mint/Burn,Address Restrict,Amount Restrict,Modifiable External Call,TimeStamp Restrict,Hidden Balance Modification.1,Modifiable External Call.1,Modifiable Tax Address,Modifiable Tax Rate
0,GMETA,BSC,0X93023F1D3525E273F291B6F76D2F5027A39BF302,Yes,https://twitter.com/BeosinAlert/status/1681240...,0.0,1.0,0,0,0,0,0,0,0,1
1,Sudorare,ETH,0X5404EFAFDD8CC30053069DF2A1B0C4BA881B3E1E,Yes,https://x.com/PeckShieldAlert/status/156196749...,0.0,0.0,0,0,0,0,0,0,0,0
2,DRAC Network,ETH,0X10F6F2B97F3AB29583D9D38BABF2994DF7220C21,Yes,https://twitter.com/PeckShieldAlert/status/155...,0.0,0.0,0,0,0,0,0,0,0,1
3,DHE,BSC,0X11CBC781DADAAD13FC3A361772C80B1C027820AF,Yes,https://twitter.com/CertiKAlert/status/1539031...,0.0,1.0,1,0,0,0,0,0,0,0
4,ElonMVP,BSC,0X3E597EA168A85AA2AE5E2C4333665BCD875ED10F,Yes,https://twitter.com/PeckShieldAlert/status/153...,0.0,1.0,1,0,0,0,0,0,0,0


### Filter only have bytecode file

In [325]:
normal_list = [f.replace('.txt', "") for f in os.listdir(os.path.join(PATH, "Normal-Bytecode"))]
rug_list = [f.replace('.txt', "") for f in os.listdir(os.path.join(PATH, "Rug-Bytecode"))]
combined_list = list(set(normal_list + rug_list))
hex_df = pd.DataFrame({'Address': combined_list})

In [326]:
hex_address_set = set(hex_df['Address'].str.lower())
total_df = total_df[total_df['Address'].str.lower().isin(hex_address_set)]
experiment_df = experiment_df[experiment_df['Address'].str.lower().isin(hex_address_set)]

In [329]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 415 entries, 0 to 688
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Project Name                 415 non-null    object
 1   Chain                        415 non-null    object
 2   Address                      415 non-null    object
 3   Open Source                  415 non-null    object
 4   Source                       415 non-null    object
 5   Hidden Balance Modification  415 non-null    int64 
 6   Hidden Mint/Burn             415 non-null    int64 
 7   Address Restrict             415 non-null    int64 
 8   Amount Restrict              415 non-null    int64 
 9   Modifiable External Call     415 non-null    int64 
 10  TimeStamp Restrict           415 non-null    int64 
 11  Hidden Balance Modification  415 non-null    int64 
 12  Modifiable External Call     415 non-null    int64 
 13  Modifiable Tax Address       415 non-nul

In [330]:
experiment_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 415 entries, 0 to 415
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Project Name                 415 non-null    object 
 1   Chain                        415 non-null    object 
 2   Address                      415 non-null    object 
 3   Open Source                  415 non-null    object 
 4   Source                       415 non-null    object 
 5   Hidden Balance Modification  414 non-null    float64
 6   Hidden Mint/Burn             414 non-null    float64
 7   Address Restrict             415 non-null    int64  
 8   Amount Restrict              415 non-null    int64  
 9   Modifiable External Call     415 non-null    int64  
 10  TimeStamp Restrict           415 non-null    int64  
 11  Hidden Balance Modification  415 non-null    int64  
 12  Modifiable External Call     415 non-null    int64  
 13  Modifiable Tax Address   

## Save the clean file

In [334]:
SAVE_PATH = f"{Path.cwd().parents[1]}/data/interim/rphunter"
total_df.to_csv(f"{SAVE_PATH}/total.csv")
experiment_df.to_csv(f"{SAVE_PATH}/experiment.csv")