## Getting the data
We obtain the data using Google BigQuery.
To obtain the deposit and withdrawal transactions, run the following queries:

### Withdrawals 
SELECT * FROM `bigquery-public-data.crypto_ethereum.transactions` WHERE `to_address` 
IN (SELECT `address` FROM `tornado_cash_transactions.tornadocontracts`) AND SUBSTR(`input`, 1, 10) = "0x21a0adb6";

### Deposits
SELECT * FROM `bigquery-public-data.crypto_ethereum.transactions` WHERE `to_address` 
IN (SELECT `address` FROM `tornado_cash_transactions.tornadocontracts`) AND SUBSTR(`input`, 1, 10) = "0xb214faa5";

References here:
https://github.com/Phread420/tornado_bigquery/blob/main/Notes.md
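
Both queries select transactions by the first 10 characters of `input`, that is, the `0x` prefix plus the 4-byte function selector. As a minimal sketch, the same check can be reproduced locally in Python on rows that have already been downloaded. The helper name `classify_tornado_call` and the sample inputs are hypothetical and are not part of the original queries.

```python
# Selectors copied from the two queries above.
WITHDRAW_SELECTOR = "0x21a0adb6"
DEPOSIT_SELECTOR = "0xb214faa5"


def classify_tornado_call(input_data: str) -> str:
    """Return 'withdraw', 'deposit' or 'other' based on the function selector."""
    # SUBSTR(input, 1, 10) is 1-indexed in BigQuery; the Python equivalent is input[:10].
    selector = input_data[:10].lower()
    if selector == WITHDRAW_SELECTOR:
        return "withdraw"
    if selector == DEPOSIT_SELECTOR:
        return "deposit"
    return "other"


# Hypothetical inputs, padded with zero bytes purely for illustration.
print(classify_tornado_call("0xb214faa5" + "00" * 32))  # deposit
print(classify_tornado_call("0x21a0adb6" + "00" * 32))  # withdraw
```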

These results are stored in the following files:
- withdraw_transactions.csv
- deposit_transactions.csv

After obtaining those two files, the next step is to get the "recipient_address" of each withdrawal. If the user withdraws via a relayer, the "from_address" field shows the relayer's account instead of the recipient's. The recipient address is encoded in the "input" field of the withdrawal transactions.

The process to do so is in the following notebook:
https://github.com/lambdaclass/tornado_cash_anonymity_tool/blob/main/notebooks/complete_withdraw_data_set.ipynb
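
For orientation only, here is a rough sketch of how a recipient address could be sliced out of the "input" hex string. It assumes the pool's withdraw function takes its arguments in the order (proof, root, nullifier hash, recipient, relayer, fee, refund), so that the recipient is the fourth 32-byte ABI word after the selector. Check that layout against the contract ABI before relying on it. The linked notebook remains the reference process, and `extract_recipient_address` is a hypothetical name.

```python
SELECTOR_HEX_LEN = 2 + 8   # "0x" prefix plus the 4-byte selector
WORD_HEX_LEN = 64          # each ABI word is 32 bytes, i.e. 64 hex characters
RECIPIENT_WORD_INDEX = 3   # assumption: words are (proof offset, root, nullifier hash, recipient, ...)


def extract_recipient_address(input_data: str) -> str:
    """Return the recipient address (lowercase, 0x-prefixed) from a withdraw input."""
    start = SELECTOR_HEX_LEN + RECIPIENT_WORD_INDEX * WORD_HEX_LEN
    word = input_data[start:start + WORD_HEX_LEN]
    if len(word) != WORD_HEX_LEN:
        raise ValueError("input is too short to contain a recipient word")
    # An address occupies the low 20 bytes (last 40 hex characters) of the word.
    return "0x" + word[-40:].lower()


# Synthetic input built only for illustration; it is not a real transaction.
fake_recipient = "ab" * 20
fake_input = (
    "0x21a0adb6"
    + "00" * 31 + "e0"             # word 0: offset of the dynamic proof bytes
    + "11" * 32                    # word 1: root
    + "22" * 32                    # word 2: nullifier hash
    + "00" * 12 + fake_recipient   # word 3: recipient, left-padded to 32 bytes
)
print(extract_recipient_address(fake_input))
```

This sketch returns lowercase addresses, which already match the case used by BigQuery.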

Note that the recipient addresses obtained in this DataFrame are in upper case, while the addresses coming from BigQuery are in lower case, so the case must be normalized before running the heuristic.

## Second heuristic - Preliminary implementation

### Description

If there is a deposit and a withdraw transaction with **unique** gas prices (e.g., 3.1415926 Gwei), then we consider the deposit and the withdraw transactions linked. The corresponding deposit transaction can be removed from any other withdraw transaction’s anonymity set.
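
To make the rule concrete, the following toy example applies it to a handful of made-up transactions using only the standard library. The hashes and gas prices are hypothetical, and the timestamp check used later in the notebook is left out to keep the illustration short.

```python
from collections import Counter

# Hypothetical toy data: (transaction hash, gas price in wei).
# 3141592600 wei is the 3.1415926 Gwei from the description.
deposits = [("d1", 3141592600), ("d2", 1000000000), ("d3", 1000000000)]
withdrawals = [("w1", 3141592600), ("w2", 1000000000)]

# A gas price is unique among deposits when it appears exactly once.
deposit_price_counts = Counter(price for _, price in deposits)
unique_deposit_by_price = {
    price: tx_hash for tx_hash, price in deposits if deposit_price_counts[price] == 1
}

# Link each withdrawal whose gas price matches a unique-price deposit.
links = {
    w_hash: unique_deposit_by_price[price]
    for w_hash, price in withdrawals
    if price in unique_deposit_by_price
}
print(links)  # {'w1': 'd1'}

# The linked deposit leaves the anonymity set of every other withdrawal.
linked_deposits = set(links.values())
anonymity_set_w2 = [d_hash for d_hash, _ in deposits if d_hash not in linked_deposits]
print(anonymity_set_w2)  # ['d2', 'd3']
```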

In [3]:
# Import relevant packages
import pandas as pd
from tqdm import tqdm

In [4]:
# Load transactions data

withdraw_transactions_df = pd.read_csv("../data/lighter_complete_withdraw_txs.csv")
# Change recipient_address to lowercase.
withdraw_transactions_df["recipient_address"] = withdraw_transactions_df["recipient_address"].str.lower()
# Change block_timestamp field to be a timestamp object.
withdraw_transactions_df["block_timestamp"] = withdraw_transactions_df["block_timestamp"].apply(pd.Timestamp)

deposit_transactions_df = pd.read_csv("../data/lighter_complete_deposit_txs.csv")
# Change block_timestamp field to be a timestamp object.
deposit_transactions_df["block_timestamp"] = deposit_transactions_df["block_timestamp"].apply(pd.Timestamp)

# tornado_addresses = pd.read_csv("../data/tornadocontracts_abi.csv", names=['address', 'contract_currency', 'value', '4'])

### Function summary: filter_by_unique_gas_price

Filters a transaction DataFrame, leaving only the transactions (rows) that have unique gas_price within all transactions.

In [30]:
def filter_by_unique_gas_price(transactions_df):
    
    # Count the appearances of each gas price in the transactions DataFrame.
    
    gas_prices_count = transactions_df["gas_price"].value_counts()
    
    # Filter the gas prices that are unique, i.e., the ones with a count equal to 1.
    # The gas prices are stored in unique_gas_prices
    
    unique_gas_prices = gas_prices_count[gas_prices_count == 1].keys()
    
    # Return the transactions that verify having these unique gas price values.
    
    return transactions_df[transactions_df["gas_price"].isin(unique_gas_prices)]   
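
As an aside, the same filter can be sketched with `Series.duplicated(keep=False)`, which flags every row whose value occurs more than once. The function name and toy DataFrame below are hypothetical. One difference to keep in mind: a single row with a missing gas_price would be kept by this variant, whereas `value_counts` ignores missing values by default.

```python
import pandas as pd


def filter_by_unique_gas_price_vectorized(transactions_df: pd.DataFrame) -> pd.DataFrame:
    # duplicated(keep=False) marks all rows whose gas_price appears more than once.
    is_repeated = transactions_df["gas_price"].duplicated(keep=False)
    return transactions_df[~is_repeated]


# Hypothetical toy data: gas price 10 appears twice, 20 and 30 once each.
toy_df = pd.DataFrame({"hash": ["a", "b", "c", "d"], "gas_price": [10, 20, 10, 30]})
print(filter_by_unique_gas_price_vectorized(toy_df)["hash"].tolist())  # ['b', 'd']
```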

### Function summary: filter_by_unique_gas_price_by_pool

Filters a transaction DataFrame, leaving only the transactions (rows) that have unique gas_price within all transactions of the same Tornado Cash pool.

In [37]:
def filter_by_unique_gas_price_by_pool(transactions_df):
    
    # Count the appearances of each gas price by pool in the transactions DataFrame.
    
    gas_prices_count = transactions_df[["gas_price", "tornado_cash_address"]].value_counts()
    
    # Filter the (gas_price, pool) pairs that are unique, i.e., the ones with a count equal to 1.
    # The pairs are stored in unique_gas_prices_by_pool.
    
    unique_gas_prices_by_pool = pd.DataFrame(gas_prices_count[gas_prices_count == 1])
    
    # Build a set of (gas_price, tornado_cash_address) tuples to filter efficiently.
    
    tuple_set = set([(row.Index[0], row.Index[1]) for row in unique_gas_prices_by_pool.itertuples()])
    
    # Filter the transactions passed as argument, leaving only the ones that have a unique gas price within their pool.
    
    return pd.DataFrame(filter(lambda iter_tuple: (iter_tuple.gas_price, iter_tuple.tornado_cash_address) in tuple_set, transactions_df.itertuples()))

In [29]:
filter_by_unique_gas_price_by_pool(deposit_transactions_df)

Unnamed: 0,Index,_1,_2,hash,nonce,transaction_index,from_address,to_address,value,gas,...,receipt_root,receipt_status,block_timestamp,block_number,block_hash,max_fee_per_gas,max_priority_fee_per_gas,transaction_type,receipt_effective_gas_price,tornado_cash_address
0,0,0,0,0xcf97c470a56d96625c7240d3004ae2abd9141d7ffc43...,4,10,0xb050dec5a9010f8b77a3962369b7bc737d3ed4a5,0x4736dcf1b7a3d580672cce6e7c65cd5cc9cfba9d,0,1200000,...,,1,2020-11-02 17:47:30+00:00,11179130,0x21d86cba454fea4f7e43c68763d4cffec101b6145546...,,,,56000000000,0x4736dcf1b7a3d580672cce6e7c65cd5cc9cfba9d
1,2,2,10,0x7baf0a76f35c1dece97fff883aa7174454bed460b1ba...,240,171,0x8c4c44fd06f7f98f08bf6a9ca156cec9ee1f31f8,0xfd8610d20aa15b7b2e3be39b396a1bc3516c7144,0,800000,...,,0,2021-01-06 19:04:40+00:00,11602841,0x7c5f21ea2a92f5182ce8648f152b6fb3b4379096309d...,,,,105000000000,0xfd8610d20aa15b7b2e3be39b396a1bc3516c7144
2,4,4,12,0xbd83053f8afa7777f54a4aca6b8e112fa31b888922dc...,7,63,0x6c6e4816ecfa4481472ff88f32a3e00f2eaa95a1,0x12d66f87a04a9e220743712ce6d9bb1b5616b8fc,100000000000000000,800000,...,,0,2020-05-27 03:30:44+00:00,10145408,0x837b3482443f027f6f045644bf002243f72304686015...,,,,30838446643,0x12d66f87a04a9e220743712ce6d9bb1b5616b8fc
3,8,8,130,0xba7d56fea776705a937d912674cc56cf3ea71485c8fb...,2,69,0x3a456bc9083bfe147719504aee8f296eb7355ee1,0x0836222f2b2b24a3f36f98668ed8f0b38d1a872f,0,1200000,...,,1,2020-08-15 11:34:23+00:00,10664413,0x4075362cd93950767b2e5c3cd6765810b2973a2dd073...,,,,100000000000,0x0836222f2b2b24a3f36f98668ed8f0b38d1a872f
4,10,10,132,0x6c416af65ea3a4bc096663c94f5b1fb0cba91607f617...,3,51,0x27972d10f153099b3649ea8546a11d91315455e5,0x0836222f2b2b24a3f36f98668ed8f0b38d1a872f,0,1200000,...,,1,2020-09-26 08:10:08+00:00,10937092,0x32f0f0fd04d3af8210d2eb956fdec21e09ebff5a5520...,,,,71302125000,0x0836222f2b2b24a3f36f98668ed8f0b38d1a872f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9829,97360,97360,181260,0xb16040d97e3c75a137c2a826a2c5307d58612a0f881f...,19,382,0xb12814fdbbdc8c8f9c2741bde788813a90588be6,0x722122df12d4e14e13ac3b6895a86e84145b6967,100000000000000000000,1019350,...,,1,2021-09-07 03:08:40+00:00,13176145,0xf1b67ec807e6be55bc6995e31c3ee4bf8cb6e6610754...,8.255631e+10,1.220177e+09,2.0,81846727568,0xa160cdab225685da1d56aa342ad8841c3b53f291
9830,97361,97361,181261,0x29d727360e68b35743d73d95f344ed885e2a9e33887f...,3,173,0xdcaaa4717c46840e6bdc1e067478210b0758055c,0x722122df12d4e14e13ac3b6895a86e84145b6967,100000000000000000000,200000,...,,0,2021-09-12 17:59:59+00:00,13212438,0xa50afdfc6c73d19a886575e40b8a88e8f00d76635a6e...,6.413296e+10,1.188590e+09,2.0,53892307010,0xa160cdab225685da1d56aa342ad8841c3b53f291
9831,97362,97362,181265,0x9c5e4b8b996349237a9df035f9e38dbd69f3bb280ecc...,8,44,0x8cf22ec65677475aa41b0537b98ccee63ceca973,0x722122df12d4e14e13ac3b6895a86e84145b6967,10000000000000000000,1024570,...,,1,2021-08-29 00:56:36+00:00,13117282,0x2c3d7ef9d4e85085b72b86807278679e07b078d85996...,1.287044e+11,5.910977e+09,2.0,71609345538,0x910cbd523d972eb0a6f4cae4618ad62622b39dbf
9832,97363,97363,181266,0xeae7c8d21bb97391f21b7f3443ac51429092e0290fe6...,189,276,0x68661550f759a41d65a8eefd3e47dd606ca76ae2,0x722122df12d4e14e13ac3b6895a86e84145b6967,100000000000000000000,1008130,...,,1,2021-10-17 01:38:32+00:00,13432575,0xc59f24dbf6573c5c17f16d08657c78dea20377230f11...,6.838974e+10,1.301710e+09,2.0,51523487883,0xa160cdab225685da1d56aa342ad8841c3b53f291


A test of the function applied to our deposit transactions data.
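
A shorter way to express the per-pool filter, offered as a sketch rather than a drop-in replacement, is `DataFrame.duplicated` with a `subset` of the two columns. Unlike rebuilding the frame from tuples, this keeps the original column names and index. The function name and toy data are hypothetical.

```python
import pandas as pd


def filter_by_unique_gas_price_by_pool_vectorized(transactions_df: pd.DataFrame) -> pd.DataFrame:
    # A row is repeated when another row shares both its gas price and its pool.
    is_repeated = transactions_df.duplicated(
        subset=["gas_price", "tornado_cash_address"], keep=False
    )
    return transactions_df[~is_repeated]


# Hypothetical toy data: gas price 10 is repeated inside pool p1 but unique inside pool p2.
toy_pool_df = pd.DataFrame(
    {
        "hash": ["a", "b", "c", "d"],
        "gas_price": [10, 10, 10, 20],
        "tornado_cash_address": ["p1", "p2", "p1", "p1"],
    }
)
print(filter_by_unique_gas_price_by_pool_vectorized(toy_pool_df)["hash"].tolist())  # ['b', 'd']
```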

### Function summary: same_gas_price_heuristic

This function receives a particular withdraw transaction and a DataFrame with the unique gas price deposits.

It returns a tuple:
* $(True, deposit$ $hash)$ when a deposit transaction with the same gas price as the withdrawal transaction is found.
* $(False, None)$ when such a deposit is not found.

In [31]:
def same_gas_price_heuristic(withdrawal_transaction, unique_gas_price_deposit_df):
    
    # Iterate over each deposit transaction of unique_gas_price_deposit_df
    for deposit_row in unique_gas_price_deposit_df.itertuples():
        
        # When a deposit transaction with the same gas price as the withdrawal transaction is found, and
        # it also satisfies having an earlier timestamp than it, the tuple (True, deposit_hash) is returned.
        if (withdrawal_transaction.gas_price == deposit_row.gas_price) and (withdrawal_transaction.block_timestamp > deposit_row.block_timestamp):
            return (True, deposit_row.hash)
        
    return (False, None)

### Function summary: same_gas_price_heuristic_by_pool

This function receives a particular withdraw transaction and a DataFrame with the unique gas price deposits.

It returns a tuple:
* $(True, deposit$ $hash)$ when a deposit transaction with the same gas price and belonging to the same pool (for example, 1 ETH) as the withdrawal transaction is found.
* $(False, None)$ when such a deposit is not found.

In [38]:
def same_gas_price_heuristic_by_pool(withdrawal_transaction, unique_gas_price_deposit_df):
    
    # Iterate over each deposit transaction of unique_gas_price_deposit_df
    for deposit_row in unique_gas_price_deposit_df.itertuples():
        
        # When a deposit transaction with the same gas price as the withdrawal transaction is found, and
        # it also satisfies having an earlier timestamp than it, the tuple (True, deposit_hash) is returned.
        # There is an additional condition, the tornado cash pools of both transactions must match.
        if (withdrawal_transaction.gas_price == deposit_row.gas_price) and (withdrawal_transaction.block_timestamp > deposit_row.block_timestamp) and (withdrawal_transaction.tornado_cash_address == deposit_row.tornado_cash_address):
            return (True, deposit_row.hash)
        
    return (False, None)
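
Because each (gas price, pool) pair appears at most once among the filtered deposits, the linear scan can be replaced by a dictionary lookup. The sketch below illustrates the idea with plain named tuples standing in for the rows produced by `itertuples()`. The names `Tx`, `build_deposit_index` and `lookup_linked_deposit` are hypothetical, and integer timestamps are used only to keep the toy data short.

```python
from collections import namedtuple

# Stand-in for the rows yielded by DataFrame.itertuples().
Tx = namedtuple("Tx", ["hash", "gas_price", "tornado_cash_address", "block_timestamp"])


def build_deposit_index(unique_gas_price_deposits):
    """Map (gas_price, pool) to the single deposit row carrying that pair."""
    return {(d.gas_price, d.tornado_cash_address): d for d in unique_gas_price_deposits}


def lookup_linked_deposit(withdrawal, deposit_index):
    """Return (True, deposit_hash) or (False, None), like the loop-based heuristic."""
    deposit = deposit_index.get((withdrawal.gas_price, withdrawal.tornado_cash_address))
    # The deposit must also be strictly earlier than the withdrawal.
    if deposit is not None and withdrawal.block_timestamp > deposit.block_timestamp:
        return (True, deposit.hash)
    return (False, None)


# Hypothetical toy data.
toy_deposits = [Tx("d1", 5, "p1", 100), Tx("d2", 5, "p2", 100)]
toy_index = build_deposit_index(toy_deposits)
print(lookup_linked_deposit(Tx("w1", 5, "p1", 200), toy_index))  # (True, 'd1')
print(lookup_linked_deposit(Tx("w2", 5, "p1", 50), toy_index))   # (False, None)
```

With real data, the index would be built once from `unique_gas_price_deposit_df.itertuples()` and reused for every withdrawal.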

### Function summary: apply_same_gas_price_heuristic

Applies the heuristic to the whole withdraw_transactions DataFrame. Returns a dictionary mapping linked withdrawal and deposit transaction hashes.

In [16]:
def apply_same_gas_price_heuristic(deposit_transactions_df, withdraw_transactions_df):
    
    # Get deposit transactions with unique gas prices.
    
    unique_gas_price_deposits = filter_by_unique_gas_price(deposit_transactions_df)
    
    # Initialize an empty dictionary to store the linked transactions.
    
    withdrawal_to_deposit = {}
    
    # Iterate over the withdraw transactions.
    for withdraw_row in tqdm(withdraw_transactions_df.itertuples(), total=withdraw_transactions_df.shape[0], mininterval=10):     
        # Apply heuristic for the given withdraw transaction.
        same_gas_deposit_hash = same_gas_price_heuristic(withdraw_row, unique_gas_price_deposits)
        
        # When a deposit transaction matching the withdraw transaction gas price is found, add
        # the linked transactions to the dictionary.
        if same_gas_deposit_hash[0]:
            withdrawal_to_deposit[withdraw_row.hash] = same_gas_deposit_hash[1]

    # Return the linked transactions dictionary.
    return withdrawal_to_deposit

### Function summary: apply_same_gas_price_heuristic_by_pool

Applies the heuristic to the whole withdraw_transactions DataFrame, also filtering by the particular pool. Returns a dictionary mapping linked withdrawal and deposit transaction hashes.

In [30]:
def apply_same_gas_price_heuristic_by_pool(deposit_transactions_df, withdraw_transactions_df):
    
    # Get deposit transactions with unique gas prices.
    
    unique_gas_price_deposits = filter_by_unique_gas_price_by_pool(deposit_transactions_df)
    
    # Initialize an empty dictionary to store the linked transactions.
    
    withdrawal_to_deposit = {}
    
    # Iterate over the withdraw transactions.
    for withdraw_row in tqdm(withdraw_transactions_df.itertuples(), total=withdraw_transactions_df.shape[0], mininterval=10):     
        # Apply heuristic for the given withdraw transaction.
        same_gas_deposit_hash = same_gas_price_heuristic_by_pool(withdraw_row, unique_gas_price_deposits)
        
        # When a deposit transaction matching the withdraw transaction gas price is found, add
        # the linked transactions to the dictionary.
        if same_gas_deposit_hash[0]:
            withdrawal_to_deposit[withdraw_row.hash] = same_gas_deposit_hash[1]

    # Return the linked transactions dictionary.
    return withdrawal_to_deposit
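
Both apply functions scan the unique-price deposits once per withdrawal. As a possible alternative, the linking can be sketched as a single pandas merge on the gas price (optionally plus the pool), followed by the timestamp check. This relies on the deposits having already been filtered so that each key is unique. The function name `link_by_merge`, its `keys` parameter and the toy data are hypothetical.

```python
import pandas as pd


def link_by_merge(unique_gas_price_deposits, withdraw_transactions_df, keys=("gas_price",)):
    """Return a DataFrame of linked (withdraw_hashes, deposit_hashes) pairs."""
    keys = list(keys)
    deposits = unique_gas_price_deposits[keys + ["hash", "block_timestamp"]].rename(
        columns={"hash": "deposit_hashes", "block_timestamp": "deposit_timestamp"}
    )
    withdrawals = withdraw_transactions_df[keys + ["hash", "block_timestamp"]].rename(
        columns={"hash": "withdraw_hashes", "block_timestamp": "withdraw_timestamp"}
    )
    # Pair every withdrawal with the deposit sharing its key, if there is one.
    merged = withdrawals.merge(deposits, on=keys, how="inner")
    # Keep only pairs where the deposit happened strictly before the withdrawal.
    is_later = merged["withdraw_timestamp"] > merged["deposit_timestamp"]
    return merged.loc[is_later, ["withdraw_hashes", "deposit_hashes"]].reset_index(drop=True)


# Hypothetical toy data; the deposits already have unique gas prices.
toy_deposits_df = pd.DataFrame({
    "hash": ["d1", "d2"],
    "gas_price": [5, 7],
    "tornado_cash_address": ["p1", "p2"],
    "block_timestamp": pd.to_datetime(["2021-01-01", "2021-01-05"]),
})
toy_withdrawals_df = pd.DataFrame({
    "hash": ["w1", "w2", "w3", "w4"],
    "gas_price": [5, 7, 9, 5],
    "tornado_cash_address": ["p1", "p2", "p1", "p2"],
    "block_timestamp": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-04", "2021-01-06"]),
})

global_links = link_by_merge(toy_deposits_df, toy_withdrawals_df)
pool_links = link_by_merge(
    toy_deposits_df, toy_withdrawals_df, keys=("gas_price", "tornado_cash_address")
)
print(global_links)
print(pool_links)
```

In this toy data, w1 and w4 both link to d1 when matching on gas price alone, while only w1 links to d1 when the pool must also match. The withdrawal w2 is never linked because its matching deposit d2 is later than the withdrawal.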

We run the heuristic on the entire withdrawal dataset.

In [18]:
linked_transactions = apply_same_gas_price_heuristic(deposit_transactions_df, withdraw_transactions_df)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83782/83782 [44:09<00:00, 31.62it/s]


In [19]:
linked_transactions = pd.DataFrame.from_dict({"withdraw_hashes": linked_transactions.keys(), "deposit_hashes": linked_transactions.values()})
linked_transactions

Unnamed: 0,withdraw_hashes,deposit_hashes
0,0x6b1ffed49640dce414b1110fb8111c6e6d5cae0707db...,0x84508c35b90983284f9605117333ed9c379be64bf82f...
1,0x0f742f92162f7a30c07d314a891acebae3baf229a667...,0x5018fe2c5bcc2a062357edf5f367a9237c7b332a9376...
2,0xf6b761a8d27dddf0710b4cc2c0c5f8a1d740b5500cd6...,0x3156e68753b1d12cd68da86a45f1f2b96b48c73c3d98...
3,0xc5205147de3a60cb321c005e59a8ea0004bcccb8389b...,0x8c72a5f48918f1c6a79ce219e042ddb81f1a2442138a...
4,0xc40d7044612eae0ab3986df553b0043374516332f535...,0x6a082db2993c2beaa1134cd3f266d9b79b0cfad43dbc...
...,...,...
1120,0x9668053f20bc95ef17dc427ef913883cefc0a639bbc2...,0x0360baaa220ad16d11edc64123f17dbdbc5547904d50...
1121,0x73fd2f663dc86d8bb252726dc7cbf9af5ff6eb786a0e...,0xb61a3104f74c1af05c8b8742288da600cee5744ad4af...
1122,0x155c67c52f442d8e311f8dcfeb802db1fe0539c8b624...,0xfde60cd9b4b53885ae6bdc1ac39b5760232634a595e7...
1123,0x92af7fcb7e26b1fbaf166347f144f4ed2b536c3724e1...,0x79ae9cc02ab3a954ad50bcedce6761ab4d7ea4760361...


Now we run the heuristic that filters by pool, again applying it to the entire withdraw dataset.

In [34]:
linked_transactions_by_pool = apply_same_gas_price_heuristic_by_pool(deposit_transactions_df, withdraw_transactions_df)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83782/83782 [57:32<00:00, 24.27it/s]


In [36]:
linked_transactions_by_pool_df = pd.DataFrame.from_dict({"withdraw_hashes": linked_transactions_by_pool.keys(), "deposit_hashes": linked_transactions_by_pool.values()})
linked_transactions_by_pool_df

Unnamed: 0,withdraw_hashes,deposit_hashes
0,0xb0835cab1d39dfcc6e43f5382adb9a14eff1b6594066...,0x7734d2f258455fbf845ce3ddb889a711c0493a0fc93a...
1,0xb286bf92448d1551fd3be5fb97709d1015086118437d...,0xe33a7ab84bb4d31b487529e42f869b5f2310d9aae9c8...
2,0x3ef076bef26c1d3a8fd871ecbd125f98c2bb85d4fb4b...,0x4da8899d4764c3a8d64f95f1bf0c67b61dc00fed3381...
3,0xda1eb6593a724f270d6668932f874546a0e98317ff76...,0xd382985d27002cbd8c803b324a42dcc93e39955fb9a4...
4,0xb958620f035bca2c1776deeef03f642c3bfeb40b7da9...,0x52712f9e22e5b0c822d88ce78b16130d91064fd7f1ad...
...,...,...
1258,0x4611c3b8eb6b90d267295aa822878a4200e5257fa381...,0x1af19bf59b0a2c3877a784d7c0e9d879e0550236bb25...
1259,0x88e3e2d091efaa07f3ca7bae94d7cd6314ca68127d6b...,0x773d0041752549dbffa735b3a5778746789adcdce51b...
1260,0x2516a26c3ff0a0583c72b9b5666c36370789843ff85e...,0x658be4fc6bf021744611067c36319be83f49f011fa7a...
1261,0x9e1a17c0286f15358687ae7cb4f206ccfa14a08b5384...,0x1cdf20fc3fcf4ef61a3e3e45e579611e1070a0de8771...
