## Heuristic 3 - Transactions outside TCash

### Overview
The main goal of this heuristic is to link Ethereum accounts which interacted with TCash by inspecting Ethereum transactions outside it. 

This is done by constructing two sets, one with the unique TCash deposit addresses ($S_{D}$) and one with the unique TCash withdraw addresses ($S_{W}$), and then querying for transactions between addresses of the two sets.

Withdraw transactions are inspected one by one, searching for Ethereum transactions between the withdraw address and any of the deposit addresses. When such a transaction is found, the withdraw transaction is linked with all the deposits that those deposit addresses made before the withdrawal.
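
As a rough illustration of the linking step described above, here is a minimal sketch on toy data. All names (`link_withdraw`, `outside_neighbors`, `deposits_by_address`), addresses, hashes and timestamps are hypothetical, and the strict "earlier than" comparison is an assumption about how "before the withdrawal" is meant.

```julia
# Toy data; every name and value here is hypothetical.
# Deposits made by each deposit address, as (deposit hash, timestamp) pairs.
deposits_by_address = Dict(
    "addr_a" => [("0xd1", 10), ("0xd2", 50)],
    "addr_b" => [("0xd3", 20)],
)

# Deposit addresses each withdraw address transacted with outside TCash.
outside_neighbors = Dict("addr_w" => ["addr_a", "addr_b"])

# Link one withdraw to every earlier deposit of its outside-TCash neighbors.
function link_withdraw(withdraw_address, withdraw_timestamp, neighbors, deposits)
    linked = String[]
    for deposit_address in get(neighbors, withdraw_address, String[])
        for (deposit_hash, timestamp) in get(deposits, deposit_address, Tuple{String,Int}[])
            # Assumption: only deposits strictly earlier than the withdraw count.
            timestamp < withdraw_timestamp && push!(linked, deposit_hash)
        end
    end
    return linked
end

links = link_withdraw("addr_w", 30, outside_neighbors, deposits_by_address)
```

With a withdraw at timestamp 30, only the deposits made at timestamps 10 and 20 are linked; the one at 50 happened afterwards and is skipped.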


### Data
The public BigQuery database was queried as follows:

```
INSERT `tornado_cash_transactions.transactions_between_withdraw_and_deposit_addresses` 
SELECT * FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE 
    (
       (`from_address` IN ( SELECT `from_address` FROM `tornado_cash_transactions.deposit_addresses`))
       AND 
       (`to_address` IN ( SELECT `withdraw_address` FROM `tornado_cash_transactions.withdraw_addresses`))
    )
    OR
    (
       (`from_address` IN (  SELECT `withdraw_address` FROM `tornado_cash_transactions.withdraw_addresses`))
       AND 
       (`to_address` IN ( SELECT `from_address` FROM `tornado_cash_transactions.deposit_addresses`))
    )
```

The query returns every column of the matching rows, but only two of them are used here, **from_address** and **to_address**. Each row corresponds to a transaction, outside TCash, between a TCash deposit address and a TCash withdraw address.
From this table, we want to know which of the two addresses made the deposit and which one made the withdrawal. In this way, we are able to link the corresponding deposit and withdraw transactions.

For example, consider this entry from the resulting table,

| from_address  | to_address  |
|---------------|-------------|
| address1      | address2    |

Suppose that `address1` is an address that withdrew from TCash and `address2` one that made a deposit. Then, we want to transform said entry into one with columns **deposit_address** and **withdraw_address** like so,

| deposit_address | withdraw_address |
|-----------------|------------------|
| address2        | address1         |

With this new table, linking TCash transactions is straightforward: we can build a mapping from each withdraw address to the deposit addresses it interacted with.


### Some definitions 
A problem arises when an address belongs to both sets of TCash addresses, $S_{D}$ and $S_{W}$.
We say an address is of type `D` when it belongs to $S_{D}$ and not to $S_{W}$.
Likewise, an address is of type `W` when it belongs to $S_{W}$ and not to $S_{D}$.
Finally, an address that belongs to both sets is of type `DW`.
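
The three types are simply set differences and an intersection. A minimal sketch with toy sets (the addresses `"a"` to `"d"` and the names `type_D`, `type_W`, `type_DW` are made up for illustration):

```julia
# Hypothetical toy sets standing in for S_D and S_W.
S_D = Set(["a", "b", "c"])   # addresses that deposited
S_W = Set(["c", "d"])        # addresses that withdrew

type_D  = setdiff(S_D, S_W)    # in S_D only
type_W  = setdiff(S_W, S_D)    # in S_W only
type_DW = intersect(S_D, S_W)  # in both sets
```

Every address that interacted with TCash falls into exactly one of the three resulting sets.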

For outside-TCash transactions of type `D -> W` (i.e., a transaction from a *D* type address to a *W* type address) or `W -> D` (i.e., a transaction from a *W* type address to a *D* type address), transforming the entry to the new table is trivial.

Transactions of type `DW -> W`, `DW -> D`, `W -> DW` and `D -> DW` are also straightforward to transform, because the address that belongs to only one set fixes the role of the other. For example, consider again this entry of the Ethereum transactions table,


| from_address  | to_address  |
|---------------|-------------|
| address1      | address2    |


Suppose now that `address1` is of type `DW` and `address2` of type `D`. Then `address2` is trivially placed in the `deposit_address` column and, by elimination, `address1` goes in the `withdraw_address` column,


| deposit_address | withdraw_address |
|-----------------|------------------|
| address2        | address1         |


When a transaction is of type `DW -> DW`, it cannot be known which address deposited and which one withdrew, so both combinations are considered. For the same entry, the resulting table is as follows,

| deposit_address | withdraw_address |
|-----------------|------------------|
| address1        | address2         |
| address2        | address1         |


Then, deposits of `address1` are linked to withdraws of `address2` and deposits of `address2` are linked to withdraws of `address1`.
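
All seven cases can also be expressed with two membership tests. The sketch below is a compact alternative formulation on toy sets, not the notebook's implementation; the function name `deposit_withdraw_rows` and the addresses are hypothetical.

```julia
# Hypothetical toy sets: "x" and "y" are of type DW, "d1" of type D, "w1" of type W.
S_D = Set(["d1", "x", "y"])
S_W = Set(["w1", "x", "y"])

# Return the (deposit_address, withdraw_address) rows implied by one
# outside-TCash transaction between `from` and `to`.
function deposit_withdraw_rows(from, to, S_D, S_W)
    rows = Tuple{String,String}[]
    # `from` can be the depositor and `to` the withdrawer ...
    from in S_D && to in S_W && push!(rows, (from, to))
    # ... and/or the other way around.
    to in S_D && from in S_W && push!(rows, (to, from))
    return rows
end

rows_D_W   = deposit_withdraw_rows("d1", "w1", S_D, S_W)  # D -> W
rows_DW_D  = deposit_withdraw_rows("x", "d1", S_D, S_W)   # DW -> D
rows_DW_DW = deposit_withdraw_rows("x", "y", S_D, S_W)    # DW -> DW
```

Only the `DW -> DW` case satisfies both tests, which is why it yields two rows while every other case yields one.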


### Results data structure
The results of this heuristic are returned as a dictionary where each element has a TCash withdraw transaction hash as its key and a list of the linked TCash deposit transaction hashes as its value. For example,

```
    '0x4858': ['0x2fad', '0x750a']
```
This would mean that withdraw transaction `0x4858` is linked to `0x2fad` and `0x750a` deposit transactions.
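
In Julia, such a structure could be held in a `Dict` from strings to vectors of strings. The sketch below assumes hashes are stored as `String`s and reuses the shortened hashes from the example above; the variable name `links` is hypothetical.

```julia
# Withdraw transaction hash => linked deposit transaction hashes.
links = Dict{String, Vector{String}}()

# get! returns the existing vector for the key, or stores and returns a new one.
push!(get!(links, "0x4858", String[]), "0x2fad")
push!(get!(links, "0x4858", String[]), "0x750a")

linked_deposits = links["0x4858"]
```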

In [24]:
using CSV
using DataFrames
using ProgressBars
using JSON


In [32]:
function load_addresses_and_pools_to_deposits_json(filepath)
    # Each JSON entry has a "key" ([address, pool]) and a "value"
    # (a list of [deposit_hash, timestamp] pairs).
    raw_dict_list = JSON.parsefile(filepath)
    addresses_and_pools_to_deposits = Dict{@NamedTuple{address::String, pool::String}, Vector{@NamedTuple{deposit_hash::String, timestamp::String}}}()

    for dic in ProgressBar(raw_dict_list, printing_delay=1)
        key = (address=dic["key"][1], pool=dic["key"][2])
        addresses_and_pools_to_deposits[key] = [(deposit_hash=l[1], timestamp=l[2]) for l in dic["value"]]
    end

    return addresses_and_pools_to_deposits
end

load_addresses_and_pools_to_deposits_json (generic function with 1 method)

In [35]:
deposit_txs = CSV.read("../data/lighter_complete_deposit_txs.csv", DataFrame)
withdraw_txs = CSV.read("../data/lighter_complete_withdraw_txs.csv", DataFrame)

const unique_deposit_addresses = Set(deposit_txs[!, :from_address])
const unique_withdraw_addresses = Set(withdraw_txs[!, :recipient_address])

addresses_and_pools_to_deposits_dict = load_addresses_and_pools_to_deposits_json("../data/addresses_and_pools_to_deposits.json")

outside_tcash_txs = CSV.read("../data/transactions_between_deposit_and_withdraw_addresses.csv", DataFrame)
address_and_withdraw_df = outside_tcash_txs[!, [:from_address, :to_address]];

100.0%┣███████████████████████████████████┫ 31.9k/31.9k [00:01<00:00, 28.3kit/s]


### Data preprocessing
From the data obtained in the query, we want to filter out repeated and permuted transactions, since they don't provide any new information. Once a transaction between a withdraw address and a deposit address is found, the two addresses are considered linked regardless of the direction of the transaction and of the number of times they interacted.

In [48]:
function filter_repeated_and_permuted(address_and_withdraw_df)
    # Storing each pair as an unordered Set makes (a, b) and (b, a) identical,
    # and the outer Set drops repeated pairs.
    filtered_addresses_set = Set{Set}()

    for row in eachrow(address_and_withdraw_df)
        push!(filtered_addresses_set, Set([row.from_address, row.to_address]))
    end

    filtered_addresses_set
end

function dataframe_from_set_of_sets(set_of_sets)
    df = DataFrame(address_1=[], address_2=[])
    for set in set_of_sets
        push!(df, collect(set))
    end
    df
end

function preprocess_data(address_and_withdraw_df)
    # A pair with a single element is a transaction from an address to itself; drop it.
    set = filter(x -> length(x) == 2, filter_repeated_and_permuted(address_and_withdraw_df))
    dataframe_from_set_of_sets(set)
end

preprocess_data (generic function with 1 method)

In [49]:
clean_addresses = preprocess_data(address_and_withdraw_df)

| Row | address_1 (Any) | address_2 (Any) |
|-----|-----------------|-----------------|
| 1   | 0x0c63d55a244657f5606d62856bd9f1ff227c05f2 | 0x0e54db73f82bd9fde34ebce53ea83bd197e9044c |
| 2   | 0xd9ee088c6ca2a90d6f0d059af17c2ec2c908bb0f | 0xc73ef94bc339a2cb9a1b67820af46bf47484a1ed |
| 3   | 0xbf7c205febae32f7874b28b9f371fe522e1fd97a | 0xe5b5df72187f7d867973615f5e1144b7a95b495f |
| 4   | 0x35f081bdf4740ffa8a56ff98e4b971fbcb7d82a7 | 0x09fe8f71f8e14b3d6b6456fbafaaef4a27f042cd |
| 5   | 0xa8308e994d180ca87c6a784fcb8612dec9ede03d | 0x46ba0af6bc60e6fabd9957744c057d031c720ace |
| 6   | 0xf62e92b2452d8a0fbb2c4b03424d679c86660001 | 0xf94571dbdff33446dabd17040cd6236b0d2c2545 |
| 7   | 0xce91fddab3c544b59ebac665a7635561043a7def | 0x865ec62a7f46aab0976ad22573fcf319c3f939ce |
| 8   | 0x134b9eab4aa4c1489687c18c10d7338656fde32d | 0x68a99f89e475a078645f4bac491360afe255dff1 |
| 9   | 0xcd1690b5ae49b4bd1ac5d201dccb461887a76dcd | 0x8a83716acd66d9e1fb18c9b79540b72e04f80ac0 |
| 10  | 0xc77fa6c05b4e472feee7c0f9b20e70c5bf33a99b | 0x4e1ce0b96fc37f81f5508c6608687af4f78f23b2 |


### Outside TCash transactions classification
These functions classify each address according to the definitions above,
and then classify the type of each transaction made between the addresses outside TCash.

In [50]:
# To classify the addresses by their inclusion in the unique_deposit_addresses and
# the unique_withdraw_addresses sets.

function is_D_type(address)
    address ∈ unique_deposit_addresses && address ∉ unique_withdraw_addresses
end

function is_W_type(address)
    address ∉ unique_deposit_addresses && address ∈ unique_withdraw_addresses
end

function is_DW_type(address)
    address ∈ unique_deposit_addresses && address ∈ unique_withdraw_addresses
end

is_DW_type (generic function with 1 method)

In [51]:
# To classify outside TCash transactions, based on the classification of addresses.

function is_D_W_tx(from_address, to_address)
    is_D_type(from_address) && is_W_type(to_address)
end

function is_W_D_tx(from_address, to_address)
    is_W_type(from_address) && is_D_type(to_address)
end

function is_D_DW_tx(from_address, to_address)
    is_D_type(from_address) && is_DW_type(to_address)
end

function is_DW_D_tx(from_address, to_address)
    is_DW_type(from_address) && is_D_type(to_address)
end

function is_W_DW_tx(from_address, to_address)
    is_W_type(from_address) && is_DW_type(to_address)
end

function is_DW_W_tx(from_address, to_address)
    is_DW_type(from_address) && is_W_type(to_address)
end

function is_DW_DW_tx(from_address, to_address)
    is_DW_type(from_address) && is_DW_type(to_address)
end 

is_DW_DW_tx (generic function with 1 method)

### Function description: map_withdraw2deposit_interactions_outside_tcash
This function receives the clean addresses data, transforms it into the `deposit_address` / `withdraw_address` table described in the Data section, and returns a dictionary that maps each withdraw address to the deposit addresses it interacted with.

In [104]:
function map_withdraw2deposit_interactions_outside_tcash(clean_addresses_df)

    # One row per link: column 1 is the deposit address, column 2 the withdraw address.
    deposit_and_withdraw_matrix = Array{String}(undef, 0, 2)

    for row in ProgressBar(eachrow(clean_addresses_df), printing_delay=1)
        a1, a2 = row.address_1, row.address_2

        if is_D_W_tx(a1, a2) || is_D_DW_tx(a1, a2) || is_DW_W_tx(a1, a2)
            # address_1 deposited and address_2 withdrew.
            deposit_and_withdraw_matrix = vcat(deposit_and_withdraw_matrix, [a1 a2])

        elseif is_W_D_tx(a1, a2) || is_W_DW_tx(a1, a2) || is_DW_D_tx(a1, a2)
            # address_2 deposited and address_1 withdrew.
            deposit_and_withdraw_matrix = vcat(deposit_and_withdraw_matrix, [a2 a1])

        elseif is_DW_DW_tx(a1, a2)
            # Ambiguous case: keep both combinations.
            deposit_and_withdraw_matrix = vcat(deposit_and_withdraw_matrix, [a1 a2])
            deposit_and_withdraw_matrix = vcat(deposit_and_withdraw_matrix, [a2 a1])
        else
            error("The transaction between $a1 and $a2 is not of any of the types: D_W, W_D, D_DW, DW_D, W_DW, DW_W, DW_DW")
        end
    end

    D_W_df = DataFrame(deposit_and_withdraw_matrix, ["deposit_address", "withdraw_address"])

    # Map each withdraw address to the deposit addresses it interacted with.
    dict = Dict()
    for row in eachrow(D_W_df)
        if haskey(dict, row.withdraw_address)
            push!(dict[row.withdraw_address], row.deposit_address)
        else
            dict[row.withdraw_address] = [row.deposit_address]
        end
    end

    return dict
end

map_withdraw2deposit_interactions_outside_tcash (generic function with 1 method)

In [106]:
waddr2daddr = map_withdraw2deposit_interactions_outside_tcash(clean_addresses)

100.0%┣████████████████████████████████████┫ 11.2k/11.2k [00:02<00:00, 5.6kit/s]


Dict{Any, Any} with 8055 entries:
  "0x50892e106095f415b4b12… => ["0xb3d76302aecdf0683ad3b39ccb56508a066c243d", "…
  "0xdf93a32c083207cd1ea00… => ["0x8b01d375e274213c860ef6ac013dbdd5286cd816"]
  "0xd1ccc07177c0c27ab78cf… => ["0x4ea0d6576e606778cc9dcc329d06ec70c3906cc2"]
  "0x6996c90cedd6b7ef51971… => ["0x4de6c05654503d0c54d44d68493308fbb5b0a886"]
  "0x1d62ca769fcf94d24484b… => ["0x70631b7376f4956185dac1b9cb4e9f83ccbc2764"]
  "0xab17da946b4ee971e6cd9… => ["0x32ef7405421aed9cd879c6f7059fa150d0aeef88"]
  "0x5d57f2e5f61b484eadc14… => ["0x94c8fe79a01d10cc1f56a8107e67d38e3fd74754"]
  "0xb232e7c376462dcc96004… => ["0xd36cd37f6488d87fb41c6a525e603e9c4a49f565"]
  "0x1cbfd11c477bb948742ef… => ["0x96feaff0673a2b6afe65eb456d35d347d658469f", "…
  "0xfed2cb2342f345bab5e55… => ["0x7bf4f7b96f010e6859ef4b0d947a07094546555d"]
  "0xd099ff6d9f00ce2d35bd6… => ["0x04bce19720ede47ddd3b44e6159686f0fffe0034"]
  "0x06b3a0712f26c7e72dc37… => ["0x9bde7bf4d5b13ef94373ced7c8ee0be59735a298"]
  "0x447791d58f691687f88

### Function description: first_neighbors_heuristic
Given a withdraw transaction, this function checks whether its address has interacted with any deposit address outside TCash. If it has, it fetches all the deposits made by those deposit addresses and links them to the withdraw transaction.