In [2]:
import pandas as pd
import numpy as np

# Filtering

First, we introduce the following definitions:

- Dust-creating transactions: those with at least one dust output.
- Dust-consuming transactions: those with at least one dust input.

Our data set contains 2,114,335 transactions, each of which is dust-creating, dust-consuming, or both (a single transaction can create and consume dust at the same time).
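
As a minimal sketch of the two definitions, the snippet below classifies a handful of toy transactions. It assumes the `txId` and `amount` columns used later in this notebook and treats an amount between 1 and 545 as dust, matching the bounds in the filtering cell. The toy data and the helper name `dust_tx_ids` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: dust means 1 <= amount <= 545, the same bounds
# used in the filtering cell of this notebook.
DUST_MIN, DUST_MAX = 1, 545

def dust_tx_ids(df):
    """Return the sorted unique txIds with at least one dust row in df (hypothetical helper)."""
    is_dust = df.amount.between(DUST_MIN, DUST_MAX)  # inclusive on both ends
    return np.sort(df.loc[is_dust, 'txId'].unique())

# Toy data, made up for illustration only.
toy_outputs = pd.DataFrame({'txId': [1, 1, 2, 3], 'amount': [500, 10_000, 546, 1]})
toy_inputs = pd.DataFrame({'txId': [2, 3, 4], 'amount': [100, 545, 0]})

dust_creating = dust_tx_ids(toy_outputs)   # transactions with a dust output
dust_consuming = dust_tx_ids(toy_inputs)   # transactions with a dust input
both = np.intersect1d(dust_creating, dust_consuming)
print(dust_creating, dust_consuming, both)
```

In this toy case, one transaction falls into both sets, which is why the data set is counted as the union of the two.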

In [3]:
inputs = pd.read_csv('data/txs_inputs.csv')
outputs = pd.read_csv('data/txs_outputs.csv')
addrId2Hash = pd.read_csv('data/txs_addr_map.csv')
numTxs = len(np.union1d(inputs.txId.unique(), outputs.txId.unique()))
print(numTxs)

2114335


**Goal**

The goal of this notebook is to filter the data set, keeping only the transactions related to dust attacks, whether on the attacker's side or the victim's side.

First, we classify the input addresses using the Entity-Address data set
(https://github.com/Maru92/EntityAddressBitcoin/blob/master/README.md) enriched with Satoshi Dice addresses.

In [3]:
#addrHash2Label = pd.read_csv('data/labels.csv', names=['label', 'addrHash'])
#df = pd.DataFrame({'addrId' : inputs.address.unique()})
#df = df.merge(addrId2Hash, how='left', on='addrId')
#df = df.merge(addrHash2Label, how='left', on='addrHash')
#df

We import the data set of classified input addresses.

In [4]:
inputAddrs = pd.read_csv('data/txs_addr_input.csv')
inputAddrs = inputAddrs.fillna('Unknown')
inputAddrs

Unnamed: 0,addrId,addrHash,label
0,118901,1GmpnvUTTw1nxCtCurV8ASNAozKs4TtSev,Unknown
1,118902,166LcMzBUP2tNDhNDinNAM2pvo9A7ckUBt,Unknown
2,118918,1GwrmsmaULujeabYvSyHGofyuuSrUKzDKU,Unknown
3,118919,1CpDrFsoHmmviU6LYK9o3vND1BME5VPTQ4,Unknown
4,118920,1DcvhbDdWCxpnRdmFyMACxUbKX1FRX3zA8,Unknown
...,...,...,...
2125850,29114529,1CyRZHHPH5QvSLo6bVjfNn8M16AXVDBe7Z,Unknown
2125851,35034396,1ByN1TqM6FPJ5NUCCyHH2bqUyVPMuK6Hfk,Unknown
2125852,292018547,1PHYyWACWNMSEdf2hG4EZQ4DM2YjAVUGej,Unknown
2125853,292018534,1GAcmdKTAfBEqexBWJzzjt9NnCTz4fFCne,Unknown


Next, we compute the set of all transactions with at least one known (labeled) address among their inputs.

In [5]:
tmp = inputs.rename(columns={'address':'addrId'})
tmp = tmp.merge(inputAddrs, how='left', on='addrId')
knownTxs = tmp[tmp.label != 'Unknown'].txId.unique()
print('N. of known transactions: {}'.format(len(knownTxs)))

N. of known transactions: 1550843


We examine the 10 services that appear in the most known transactions. Percentages are relative to the full data set.

In [19]:
x = pd.DataFrame({'numTxs': tmp[tmp.label != 'Unknown'].groupby('label').txId.nunique()}).reset_index()
x = x.sort_values('numTxs', ascending=False)
x['percentage'] = ((x['numTxs'] * 100) / numTxs).round(3)
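x = x.head(10).copy()  # copy so that adding a column does not modify a view
# Descriptions are listed in the sorted row order (most transactions first).
x['description'] = [
    "On-chain betting service", "Faucet service",
    "On-chain betting service", "On-chain betting service",
    "On-chain betting service", "Online black market",
    "On-chain betting service", "Online wallet service",
    "On-chain betting service", "Exchange service",
]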
x = x[['label', 'description', 'numTxs', 'percentage']]
print(f'Total number of transactions: {np.sum(x.numTxs)}\nPercentage over full data set: {np.sum(x.percentage)}')
x

Total number of transactions: 1535873
Percentage over full data set: 72.64


Unnamed: 0,label,description,numTxs,percentage
205,SatoshiDice,On-chain betting service,1464813,69.28
230,ePay.info,Faucet service,43259,2.046
69,BtcDice.com,On-chain betting service,8500,0.402
132,DiceOnCrack.com,On-chain betting service,7114,0.336
31,Betcoins.net,On-chain betting service,3218,0.152
213,SilkRoadMarketplace,Online black market,2877,0.136
50,Bitcoin-Roulette.com,On-chain betting service,1742,0.082
158,Instawallet.org,Online wallet service,1516,0.072
46,BitZino.com,On-chain betting service,1497,0.071
82,Cex.io,Exchange service,1337,0.063


To identify transactions that are related to dust attacks, we take the data set of all dust-creating and dust-consuming transactions and keep only those that satisfy both of the following conditions.

1. The transaction does not contain the address of a known entity (e.g., on-chain services, mining pools, etc.) among its inputs.
2. At least one of the following is true:
    - The transaction has at least one dust output that is not associated with the `OP_RETURN` scripting instruction.
    - The transaction has at least one dust input that was not created in a transaction from a known entity.

In [6]:
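# Condition 2a: dust outputs (amount between 1 and 545), excluding
# scriptType 4 (OP_RETURN), in transactions with no known input address.
filteredOut = outputs[(outputs.amount >= 1) & (outputs.amount <= 545)
                      & (outputs.scriptType != 4)
                      & (~(outputs.txId.isin(knownTxs)))]
# Condition 2b: dust inputs in transactions with no known input address,
# where the dust was not created by a transaction from a known entity.
filteredIn = inputs[(inputs.amount >= 1) & (inputs.amount <= 545)
                   & (~(inputs.txId.isin(knownTxs)))
                   & (~(inputs.prevTxId.isin(knownTxs)))]
filteredTxs = np.union1d(filteredOut.txId.unique(), filteredIn.txId.unique())
# Save the IDs of the transactions to keep to a text file, one per line.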
np.savetxt('data/txs_filtered_ids.csv', filteredTxs, fmt='%d', delimiter='\n')
print('Size of filtered data set: {} TXs'.format(len(filteredTxs)))

Size of filtered data set: 387330 TXs
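
To sanity-check the filtering conditions on a small case, the sketch below wraps the same logic in a function and runs it on toy data. The function name `filter_dust_txs`, the toy transactions, and the reading of `scriptType` 4 as `OP_RETURN` are assumptions made for illustration. The column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def filter_dust_txs(inputs, outputs, known_txs, op_return_type=4):
    """Return sorted txIds satisfying the two filtering conditions (hypothetical helper).

    Assumption: scriptType == op_return_type marks OP_RETURN outputs.
    """
    # Dust outputs that are not OP_RETURN, in transactions without known inputs.
    out_ok = outputs[outputs.amount.between(1, 545)
                     & (outputs.scriptType != op_return_type)
                     & ~outputs.txId.isin(known_txs)]
    # Dust inputs in transactions without known inputs, whose dust was
    # not created by a transaction from a known entity.
    in_ok = inputs[inputs.amount.between(1, 545)
                   & ~inputs.txId.isin(known_txs)
                   & ~inputs.prevTxId.isin(known_txs)]
    return np.union1d(out_ok.txId.unique(), in_ok.txId.unique())

# Toy data, made up for illustration; transactions 10 and 20 play the known entities.
toy_outputs = pd.DataFrame({
    'txId':       [1,   2,   10,  3],
    'amount':     [300, 300, 300, 50_000],
    'scriptType': [1,   4,   1,   1],
})
toy_inputs = pd.DataFrame({
    'txId':     [3,   4,   10,  2],
    'amount':   [400, 400, 400, 90_000],
    'prevTxId': [1,   10,  1,   5],
})

kept = filter_dust_txs(toy_inputs, toy_outputs, known_txs=[10, 20])
print(kept)
```

Transaction 1 is kept for its non-`OP_RETURN` dust output and transaction 3 for its dust input. Transaction 2 has only an `OP_RETURN` dust output, transaction 4 spends dust created by a known entity, and transaction 10 is itself a known transaction, so all three are dropped.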
