# Guidance on how to access our data 

We store our data on KiltHub, the publicly accessible institutional repository hosted by Carnegie Mellon University Library. 
KiltHub intends to provide permanent access to the content. 
See the [official policy document](https://guides.library.cmu.edu/kilthub/policies) for details.

## How to restore and access the database?
1. Install the latest version of PostgreSQL, which we highly recommend for accessing our dataset. 

2. Configure the database. Use the following command to decompress the file and restore the table in your PostgreSQL database.
```
gunzip -c address_poisoning_ethereum.sql.gz | psql -h localhost -U USERNAME -d DATABASE 
```
Replace `USERNAME` and `DATABASE` in the command above with your PostgreSQL user name and database name.

3. (To run `example.ipynb`) Install the Python packages (`pip install -r requirements.txt`). We recommend using a Python virtual environment. 

4. Run the Jupyter notebook to learn how to compute the basic statistics (e.g., the number of poisoning transfers). 
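
The restore in step 2 can also be driven from Python, which may be convenient on systems without `gunzip`. The following is a minimal sketch, not part of the released code: the helper names `psql_args` and `restore_dump` are hypothetical, and it assumes `psql` is on your `PATH` and can authenticate without an interactive prompt (for example through a `~/.pgpass` file).

```python
import gzip
import shutil
import subprocess


def psql_args(username, database, host="localhost"):
    """Build the same psql invocation as the shell command in step 2."""
    return ["psql", "-h", host, "-U", username, "-d", database]


def restore_dump(dump_path, args):
    """Stream the decompressed dump into the stdin of the given command.

    Returns the exit code of the command (0 means success).
    """
    with gzip.open(dump_path, "rb") as dump:
        proc = subprocess.Popen(args, stdin=subprocess.PIPE)
        try:
            # Copy in chunks so the whole dump never has to fit in memory.
            shutil.copyfileobj(dump, proc.stdin)
        finally:
            proc.stdin.close()
        return proc.wait()


# Example call (USERNAME and DATABASE are placeholders, as in step 2):
# restore_dump("address_poisoning_ethereum.sql.gz", psql_args("USERNAME", "DATABASE"))
```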


In [1]:
# import libraries
import json
import psycopg2
import pandas as pd

# save your PostgreSQL credentials in postgre.json file 
try:
    with open("./postgre.json", "r", encoding="utf-8") as credential:
        json_credential = json.load(credential)

except FileNotFoundError:
    print("Please create a postgre.json file with your PostgreSQL credentials.")
    print("The file should contain the following keys: host, database, username, password.")
    print('{"host": "localhost", "database": "your_database_name", "username": "your_username", "password": "your_password"}')
    exit()

HOST = json_credential["host"]
DATABASE = json_credential["database"]
USERNAME = json_credential["username"]
PASSWORD = json_credential["password"]

# connect to your database
conn = psycopg2.connect(database=DATABASE, user=USERNAME, host=HOST, password=PASSWORD)
c = conn.cursor()
print('Successfully connected to the database')

# name of the table restored from the SQL dump (used as the table name in the queries below)
db_name = 'address_poisoning_ethereum'

Successfully connected to the database
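
The cell above expects a `postgre.json` file with the keys `host`, `database`, `username`, and `password`. As a convenience, the sketch below writes a placeholder file and checks that all keys are present. The helper names `write_credentials_template` and `missing_keys` are hypothetical and not part of the notebook; the placeholder values must be replaced with your own credentials.

```python
import json
from pathlib import Path

# Keys read by the notebook's first cell.
REQUIRED_KEYS = ("host", "database", "username", "password")


def write_credentials_template(path="postgre.json"):
    """Write a placeholder credentials file; never overwrite an existing one."""
    target = Path(path)
    if target.exists():
        return False
    template = {
        "host": "localhost",
        "database": "your_database_name",
        "username": "your_username",
        "password": "your_password",
    }
    target.write_text(json.dumps(template, indent=4), encoding="utf-8")
    return True


def missing_keys(path="postgre.json"):
    """Return the required keys that are absent from the credentials file."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    return [key for key in REQUIRED_KEYS if key not in data]
```

Calling `write_credentials_template()` once and then editing the file by hand keeps the password out of the notebook itself.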


## Dataset (address_poisoning_ethereum)
Each data entry corresponds to one token transfer related to address poisoning on Ethereum: an intended, tiny, zero-value, counterfeit token, or payoff transfer. 
The type and description of each column are listed in the table below.  

| Column | Type | Description |
|--------|------|-------------|
| block_number | integer | The block number of the transfer |
| tx_hash | character(66) | The transaction hash of the transfer |
| addr | character(42) | The contract address of the token transferred |
| topics_from_addr | character(42) | The sender |
| topics_to_addr | character(42) | The receiver |
| is_sender_victim | boolean | True if the sender is a victim |
| value | real | The value transferred |
| value_usd | real | The value converted to USD at the time of the transfer |
| intended_transfer | boolean | True if the transfer is an intended transfer |
| tiny_transfer | boolean | True if the transfer is a tiny transfer |
| zero_value_transfer | boolean | True if the transfer is a zero-value transfer |
| counterfeit_token_transfer | boolean | True if the transfer is a counterfeit token transfer |
| payoff_transfer | boolean | True if the transfer is a payoff transfer (confirmed) |
| payoff_transfer_unconfirmed | boolean | True if the transfer is potentially a payoff transfer (more investigation required) |
| is_not_categorized | boolean | True if the transfer is not labeled for any categories |
| intended_addr | character(42) | The intended address of the transfer |
| num_first_matched_digits | integer | The number of leading digits that match the intended address |
| num_last_matched_digits | integer | The number of trailing digits that match the intended address |

Remarks
- `payoff_transfer` does not cover every payoff transfer we captured. We confirm each transfer flagged in `payoff_transfer_unconfirmed` on Etherscan by checking that a poisoning transfer exists between the intended transfer and the payoff transfer. The final results are in `payoff_transfers_ethereum.csv`.
- Most of the uncategorized transfers (`is_not_categorized = True`) are tiny transfers that exceed the threshold (>= 10 USD). Only 8,651 of the 34,905,969 transfers (0.025%) are uncategorized. 
- `num_first_matched_digits` (`num_last_matched_digits`) is the number of leading (trailing) digits that the lookalike address shares with the intended address. For an intended transfer, the pair is (20, 20). 
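
To make the matched-digit columns concrete, here is a minimal sketch of how such counts could be computed. It is an illustration, not the code used to build the dataset. It assumes a case-insensitive comparison of the 40 hex digits after the `0x` prefix and a cap of 20 on each side, which is our guess at why an intended transfer is reported as (20, 20). The function name `matched_digits` is hypothetical.

```python
def _hex_body(address):
    """Lowercase the address and drop a leading '0x' if present."""
    address = address.lower()
    return address[2:] if address.startswith("0x") else address


def _common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def matched_digits(lookalike, intended, cap=20):
    """Return (leading, trailing) matched hex digits, each capped at `cap`.

    The cap is an assumption chosen so that identical addresses give (20, 20).
    """
    a, b = _hex_body(lookalike), _hex_body(intended)
    first = _common_prefix_len(a, b)
    last = _common_prefix_len(a[::-1], b[::-1])
    return min(first, cap), min(last, cap)


# Lookalike and intended address from the first row of payoff_transfers_ethereum.csv
print(matched_digits(
    "0xa7Bf48749D2E4aA29e3209879956b9bAa9E90570",
    "0xa7B4BAC8f0f9692e56750aEFB5f6cB5516E90570",
))  # (3, 6)
```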

## Attack statistics
We calculate the number of poisoning transactions, poisoning transfers (tiny, zero-value, counterfeit), and counterfeit token contracts.

In [2]:
# get the basic attack statistics 
c.execute(f"SELECT count(distinct(tx_hash)) FROM {db_name} WHERE ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
num_poisoning_txs = results[0][0]
print('We have {} poisoning transactions'.format(num_poisoning_txs))

c.execute(f"SELECT count(*) FROM {db_name} WHERE ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
num_poisoning_transfer = results[0][0]
print('We have {} poisoning transfers'.format(num_poisoning_transfer))

c.execute(f"SELECT count(*) FROM {db_name} WHERE tiny_transfer = True;")
results = c.fetchall()
num_tiny_transfer = results[0][0]
print('We have {} tiny transfers'.format(num_tiny_transfer))

c.execute(f"SELECT count(*) FROM {db_name} WHERE zero_value_transfer = True;")
results = c.fetchall()
num_zero_value_transfer = results[0][0]
print('We have {} zero-value transfers'.format(num_zero_value_transfer))

c.execute(f"SELECT count(*) FROM {db_name} WHERE counterfeit_token_transfer = True;")
results = c.fetchall()
num_counterfeit_token_transfer = results[0][0]
print('We have {} counterfeit token transfers'.format(num_counterfeit_token_transfer))

c.execute(f"SELECT count(distinct(addr)) FROM {db_name} WHERE (counterfeit_token_transfer = True);")
results = c.fetchall()
num_counterfeit_tokens = results[0][0] 
print('We have {} counterfeit token contracts'.format(num_counterfeit_tokens))


We have 1691529 poisoning transactions
We have 17365954 poisoning transfers
We have 308881 tiny transfers
We have 7185298 zero-value transfers
We have 9871775 counterfeit token transfers
We have 6280 counterfeit token contracts
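
The six counts can also be gathered in a single pass over the table with conditional aggregation. The sketch below is one possible alternative, not the notebook's method. The function `poisoning_counts` is hypothetical and accepts any DB-API cursor, so it should work with the psycopg2 cursor `c` used in the notebook. The demo uses an in-memory SQLite table with made-up rows purely so that it runs without a PostgreSQL server.

```python
import sqlite3


def poisoning_counts(cursor, table):
    """Compute the basic attack statistics with one query."""
    cursor.execute(
        f"""
        SELECT COUNT(DISTINCT tx_hash),
               COUNT(*),
               SUM(CASE WHEN tiny_transfer THEN 1 ELSE 0 END),
               SUM(CASE WHEN zero_value_transfer THEN 1 ELSE 0 END),
               SUM(CASE WHEN counterfeit_token_transfer THEN 1 ELSE 0 END),
               COUNT(DISTINCT CASE WHEN counterfeit_token_transfer THEN addr END)
        FROM {table}
        WHERE tiny_transfer OR zero_value_transfer OR counterfeit_token_transfer
        """
    )
    keys = (
        "poisoning_transactions", "poisoning_transfers", "tiny_transfers",
        "zero_value_transfers", "counterfeit_token_transfers",
        "counterfeit_token_contracts",
    )
    return dict(zip(keys, cursor.fetchone()))


def demo_counts():
    """Run the query on a toy SQLite table (stand-in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE toy (tx_hash TEXT, addr TEXT, tiny_transfer INTEGER, "
        "zero_value_transfer INTEGER, counterfeit_token_transfer INTEGER)"
    )
    rows = [  # made-up rows, not taken from the dataset
        ("0xaa", "0xtokenA", 1, 0, 0),
        ("0xaa", "0xtokenB", 0, 1, 0),
        ("0xbb", "0xfake1", 0, 0, 1),
        ("0xbb", "0xfake1", 0, 0, 1),
        ("0xcc", "0xtokenA", 0, 0, 0),  # not a poisoning transfer
    ]
    cur.executemany("INSERT INTO toy VALUES (?, ?, ?, ?, ?)", rows)
    return poisoning_counts(cur, "toy")
```

With the notebook's connection, the call would be `poisoning_counts(c, db_name)`.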


We also calculate the number of victims and lookalike addresses.

In [3]:
# get the number of victims and lookalike addresses
# when the sender is the victim
c.execute(f"SELECT distinct(topics_from_addr) FROM {db_name} WHERE (is_sender_victim = True) AND ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
victim_addr_sender = [i[0] for i in results]

# when the receiver is the victim
c.execute(f"SELECT distinct(topics_to_addr) FROM {db_name} WHERE (is_sender_victim = False) AND ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
victim_addr_receiver = [i[0] for i in results]

# combine the two lists 
victim_addr = list(set(victim_addr_sender + victim_addr_receiver))
num_victim_addr = len(victim_addr)
print('We have {} victim addresses'.format(num_victim_addr))

# when the sender is the lookalike address
c.execute(f"SELECT distinct(topics_from_addr) FROM {db_name} WHERE (is_sender_victim = False) AND ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
lookalike_addr_sender = [i[0] for i in results]

# when the receiver is the lookalike address
c.execute(f"SELECT distinct(topics_to_addr) FROM {db_name} WHERE (is_sender_victim = True) AND ((tiny_transfer = True) OR (zero_value_transfer = True) OR (counterfeit_token_transfer = True));")
results = c.fetchall()
lookalike_addr_receiver = [i[0] for i in results]

# combine the two lists
lookalike_addr = list(set(lookalike_addr_sender + lookalike_addr_receiver))
num_lookalike_addr = len(lookalike_addr)
print('We have {} lookalike addresses'.format(num_lookalike_addr))


We have 1330948 victim addresses
We have 6492215 lookalike addresses
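
The four queries above rely on one rule: when `is_sender_victim` is true, the sender is the victim and the receiver is the lookalike address; otherwise the roles are swapped. A minimal sketch of that rule on in-memory rows follows. The helper names and the toy addresses are hypothetical.

```python
def split_roles(from_addr, to_addr, is_sender_victim):
    """Return (victim, lookalike) for one poisoning transfer."""
    if is_sender_victim:
        return from_addr, to_addr
    return to_addr, from_addr


def count_roles(rows):
    """Count distinct victims and lookalike addresses.

    `rows` is an iterable of (topics_from_addr, topics_to_addr, is_sender_victim)
    tuples for poisoning transfers only.
    """
    victims, lookalikes = set(), set()
    for from_addr, to_addr, is_sender_victim in rows:
        victim, lookalike = split_roles(from_addr, to_addr, is_sender_victim)
        victims.add(victim)
        lookalikes.add(lookalike)
    return len(victims), len(lookalikes)


# Toy rows with made-up addresses
toy_rows = [
    ("0xV1", "0xL1", True),   # victim V1 sends to lookalike L1
    ("0xL2", "0xV1", False),  # lookalike L2 sends to victim V1
    ("0xL1", "0xV2", False),  # lookalike L1 sends to victim V2
]
print(count_roles(toy_rows))  # (2, 2)
```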


## Payoff transfers

In [4]:
# import the payoff transfers (sorted by the amount of loss in descending order)
df_eth_payoff = pd.read_csv('payoff_transfers_ethereum.csv') 
df_bsc_payoff = pd.read_csv('payoff_transfers_bsc.csv')
df_eth_payoff.head()

| Unnamed: 0 | block_number | tx_hash | addr | topics_from_addr | topics_to_addr | value | value_usd | is_sender_victim | num_first_matched_digits | num_last_matched_digits | intended_addr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17818298 | 0x08255ca0e42a872559437141fa46980e66d907f76689... | 0xdAC17F958D2ee523a2206206994597C13D831ec7 | 0x407e4B78C16D22b21B5dDcd0B0f7AD9Cb07b9Cbc | 0xa7Bf48749D2E4aA29e3209879956b9bAa9E90570 | 20000000000000.0 | 20000000.0 | True | 3 | 6 | 0xa7B4BAC8f0f9692e56750aEFB5f6cB5516E90570 |
| 1 | 16332750 | 0x2dea87ff34b0d4712d59e582e810d1a8396e469ce783... | 0xdAC17F958D2ee523a2206206994597C13D831ec7 | 0x2f10e2358594fD72c7d9e2EEa137Ef1222c03bD5 | 0x61b697FE7F51d1D6C8299A450cdb15d1C0074c5C | 3800000000000.0 | 3800000.0 | True | 3 | 5 | 0x61b9d3646140C74Fb6D85002d48B019aa7174c5c |
| 2 | 17022527 | 0x11ac26acce0620c50731a94b528419771211779903d6... | 0xdAC17F958D2ee523a2206206994597C13D831ec7 | 0x6636cEc1dFAEeADAAEdA5C5b2a7540578Fd451f6 | 0x1cbB23dBB1649de39D3E95DAA0AD1838269B758a | 3554610000000.0 | 3554610.2 | True | 7 | 6 | 0x1CbB23DFbB023D72a1e85f923c2FF8E1299B758a |
| 3 | 16648806 | 0x0911bd6713493e9ab75ef82cc909114218996f0e717b... | 0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48 | 0x081714D70d61d80b078eF0dC88022E08dD53236E | 0x74C9bdBea7eAf3d78f13Faa71caE94d38560E1cA | 2030000000000.0 | 2030000.0 | True | 3 | 5 | 0x74C32c78900c23B6773603D320d8061b8450E1cA |
| 4 | 16896975 | 0x48e591d562a5098527c0de850ba44ce2101472643418... | 0xdAC17F958D2ee523a2206206994597C13D831ec7 | 0x49D7E0cE8EcB4562E4222d07647871a780BBd1e9 | 0xbb2EDba85dec700927A8478f70Dee42EdD619455 | 2000000000000.0 | 2000000.0 | True | 3 | 5 | 0xBB217c93D4bb15aacc68aE4A3aB7d24Ed4219455 |



## How to verify the payoff transfers on Etherscan (BscScan)? 
Payoff transfers are the final step of the attack, in which victims mistakenly send funds to the attackers. 
We can verify the authenticity of these transfers discovered by our algorithm on [Etherscan](https://etherscan.io/) or [BscScan](https://bscscan.com/) through the following steps:

0. Go to [Site Settings](https://etherscan.io/settings) and toggle off ``Ignore Tokens with Poor Reputation`` and ``Zero-Value Token Transfers`` so that poisoning transfers are displayed. 
1. Look at each row in ``payoff_transfers_ethereum.csv`` (already sorted by the amount of loss in descending order). The column names match those of the database table above. Paste `tx_hash` into the Etherscan search bar. 
2. On the transaction page, click the *From* address (i.e., the victim) of the token transfer event to open the victim's address page. 
3. Click the *Token Transfer (ERC-20)* tab.
4. Find the payoff transfer. For example, in the first case on Ethereum, look for block 17,818,298. Sometimes you need to click `(View all)` to display all transfers.  
5. Identify the poisoning transfer from/to the lookalike address (`topics_to_addr` in the CSV file). In the first case, the counterfeit token transfer (with a value of 10,000,000) took place in block 17,778,047. 
6. Find the intended transfer from the victim to the intended address, whose first and last few digits match those of the lookalike address. In the first case, a transfer of 10,000,000 USDT was sent from the victim to the intended address in block 17,778,044.
7. Confirm that the intended, poisoning, and payoff transfers happened in that order. 
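
The final ordering check can be expressed as a comparison of block numbers. The sketch below is a simplified illustration with a hypothetical function name. It compares block numbers only, so transfers that fall in the same block would additionally need their position within the block to be compared.

```python
def is_consistent_attack(intended_block, poisoning_block, payoff_block):
    """True if the intended, poisoning, and payoff transfers are in order.

    Ties are accepted here because block numbers alone cannot order
    transfers inside one block.
    """
    return intended_block <= poisoning_block <= payoff_block


# Block numbers of the first Ethereum case described above
print(is_consistent_attack(17_778_044, 17_778_047, 17_818_298))  # True
```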

In [5]:
# calculate the total loss, min, median, avg, max, std, count, and the number of victims
def calculate_stats(df):
    total_loss = df['value_usd'].sum()
    min_loss = df['value_usd'].min()
    median_loss = df['value_usd'].median()
    avg_loss = df['value_usd'].mean()
    max_loss = df['value_usd'].max()
    std_loss = df['value_usd'].std()
    num_payoff = df.shape[0]
    num_victims = df['topics_from_addr'].nunique()

    return {
        'total_loss': total_loss,
        'min_loss': min_loss,
        'median_loss': median_loss,
        'avg_loss': avg_loss,
        'max_loss': max_loss,
        'std_loss': std_loss,
        'num_payoff': num_payoff,
        'num_victims': num_victims
    }

eth_payoff_stats = calculate_stats(df_eth_payoff)
print('Ethereum Payoff Stats:')
print(json.dumps(eth_payoff_stats, indent=4))
bsc_payoff_stats = calculate_stats(df_bsc_payoff)
print('BSC Payoff Stats:')
print(json.dumps(bsc_payoff_stats, indent=4))

Ethereum Payoff Stats:
{
    "total_loss": 79344411.66702148,
    "min_loss": 0.001,
    "median_loss": 2160.0,
    "avg_loss": 45652.71097066829,
    "max_loss": 20000000.0,
    "std_loss": 514141.78036441305,
    "num_payoff": 1738,
    "num_victims": 1502
}
BSC Payoff Stats:
{
    "total_loss": 4490804.320314925,
    "min_loss": 1e-08,
    "median_loss": 85.0,
    "avg_loss": 1163.7222908305066,
    "max_loss": 279489.4,
    "std_loss": 7681.496967139977,
    "num_payoff": 4895,
    "num_victims": 4004
}
