# Blockchain analysis

If you are new to Bitcoin and the blockchain, you can start by watching the following video.

In [4]:
import IPython.display
IPython.display.HTML('<iframe width="750" height="430" src="https://www.youtube.com/embed/Lx9zgZCMqXE?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

## Basic setup

Here we import the `pyspark` module and set up a `SparkSession`. By default, we use a `SparkSession` running locally with four worker threads (`local[4]`). We are dealing with small data, so it doesn't make sense to run against a cluster, but `local[4]` can be replaced with the master URL of a Spark cluster.


In [5]:
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder \
                    .master("local[4]") \
                    .config("spark.driver.memory", "4g") \
                    .getOrCreate()

sc = spark.sparkContext

## Loading the data

To obtain the graph representing the transactions in the Bitcoin network, we need to load the set of nodes representing the wallets (fingerprints of the public keys) and the set of edges representing the transactions. For this example we will use two Parquet files that were generated from the blockchain data by this [converter](https://github.com/Jiri-Kremser/bitcoin-insights/tree/master/parquet-converter).

In [6]:
raw_nodes = spark.read.load("/tmp/nodes.parquet") \
                      .withColumnRenamed("_1", "id") \
                      .withColumnRenamed("_2", "Wallet")
raw_nodes.show(5)

+---+--------------------+
| id|              Wallet|
+---+--------------------+
|  0|bitcoinaddress_93...|
|  1|bitcoinaddress_4D...|
|  2|bitcoinaddress_BE...|
|  3|bitcoinaddress_4B...|
|  4|bitcoinaddress_44...|
+---+--------------------+
only showing top 5 rows



As you can see, each record in the wallet column contains a string `bitcoinaddress_<hash>`, where the hash is the actual address of the wallet. Let's remove the redundant prefix.

In [7]:
nodes = raw_nodes.withColumn("Wallet", regexp_replace("Wallet", "bitcoinaddress_", "")).cache()
nodes.show(5)

+---+--------------------+
| id|              Wallet|
+---+--------------------+
|  0|9303DBB4C75A56057...|
|  1|4D3826A813A4B4E9B...|
|  2|BECC6154EEF33464E...|
|  3|4B5E0300F11C2932F...|
|  4|44730B80C9D5EF65D...|
+---+--------------------+
only showing top 5 rows
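The pattern `bitcoinaddress_` used above is not anchored, so `regexp_replace` removes it wherever it occurs in the string. For this dataset that makes no difference, but anchoring the pattern with `^` is a slightly safer habit. A minimal sketch of the anchored variant using Python's `re` module (the sample strings and the helper name `strip_prefix` are made up for illustration):

```python
import re

# Anchored pattern: only a prefix at the very start of the string is removed.
PREFIX = re.compile(r"^bitcoinaddress_")

def strip_prefix(value):
    """Return the value without a leading 'bitcoinaddress_' prefix."""
    return PREFIX.sub("", value)

# Hypothetical sample values, not taken from the dataset.
samples = ["bitcoinaddress_9303DBB4", "9303DBB4", "x_bitcoinaddress_9303DBB4"]
cleaned = [strip_prefix(s) for s in samples]
print(cleaned)
```

The same anchored pattern could be passed to `regexp_replace` as `"^bitcoinaddress_"`.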



We can also verify that these addresses are real on https://blockchain.info/address/.

Example:
 1. get a random wallet address
 1. create the link from the address

In [9]:
random_address = nodes.rdd.takeSample(False, 1)[0][1]
IPython.display.Markdown('link of the random wallet: https://blockchain.info/address/' + random_address)

link of the random wallet: https://blockchain.info/address/B4F37D584299E5B1B1C4CF24C463DBBF524DDC2D
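Before building such a link, it can help to check that the value really looks like a wallet hash. A minimal sketch, assuming every wallet in the dataset is a 40-character uppercase hexadecimal string like the ones printed in this notebook (that format is inferred from the sample output, and both helper names are hypothetical):

```python
import re

# Assumed format: exactly 40 uppercase hexadecimal characters.
WALLET_HEX = re.compile(r"[0-9A-F]{40}")

def looks_like_wallet_hash(value):
    """True if the whole string matches the assumed wallet format."""
    return WALLET_HEX.fullmatch(value) is not None

def address_url(value):
    """Build the lookup link, refusing values that do not match the format."""
    if not looks_like_wallet_hash(value):
        raise ValueError("unexpected wallet format: %r" % value)
    return "https://blockchain.info/address/" + value

url = address_url("B4F37D584299E5B1B1C4CF24C463DBBF524DDC2D")
print(url)
```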

In [15]:
raw_edges = spark.read.load("/tmp/edges.parquet") \
                      .withColumnRenamed("srcId", "src") \
                      .withColumnRenamed("dstId", "dst") \
                      .cache()
raw_edges.show(5)
raw_edges.count()

+------+------+-----+
|   src|   dst| attr|
+------+------+-----+
|150102|107378|input|
|470403|107378|input|
|232249| 97703|input|
|539070| 97703|input|
|131174|176711|input|
+------+------+-----+
only showing top 5 rows



2087249

## Data cleansing

Remove the self-loops.

In [22]:
edges = raw_edges.filter("src != dst")
edges.show(5)
edges.count()

+------+------+-----+
|   src|   dst| attr|
+------+------+-----+
|150102|107378|input|
|470403|107378|input|
|232249| 97703|input|
|539070| 97703|input|
|131174|176711|input|
+------+------+-----+
only showing top 5 rows



2081796
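Comparing the two counts, the filter dropped 2087249 - 2081796 = 5453 self-loops. The same filtering logic on a toy edge list in plain Python, as a sketch (the edges below are made up):

```python
# Toy (src, dst) pairs; (2, 2) and (3, 3) are self-loops.
raw = [(1, 2), (2, 2), (3, 1), (3, 3), (2, 1)]

# Keep only edges whose endpoints differ, mirroring filter("src != dst").
edges_no_loops = [(src, dst) for src, dst in raw if src != dst]
removed = len(raw) - len(edges_no_loops)

print(edges_no_loops, removed)
```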

## Constructing the graph representation

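Spark contains an API for graph processing called [GraphX](https://spark.apache.org/graphx/), which comes with multiple built-in algorithms such as PageRank and exposes a [Pregel API](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api). GraphX itself has no Python bindings, so in this notebook we use the DataFrame-based `graphframes` package, which offers similar functionality from PySpark.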

In [24]:
from graphframes import GraphFrame

g = GraphFrame(nodes, edges).cache()

#### Get the top 10 wallets with respect to the transaction count

First, we sort the nodes by `inDegree`, which corresponds to the number of transactions received.

In [8]:
vertexInDegrees = g.inDegrees
vertexInDegrees.join(nodes, vertexInDegrees.id == nodes.id) \
               .drop("id") \
               .orderBy("inDegree", ascending=False) \
               .take(10)

[Row(inDegree=121885, Wallet=u'F4B004C3CA2E7F96F9FC5BCA767708967AF67A44'),
 Row(inDegree=69445, Wallet=u'E1BB16A26D591FD766C1B23FAEC067301AFA8A07'),
 Row(inDegree=68626, Wallet=u'5605C6DC8A9672A014225BAC565DB25BCEC649A2'),
 Row(inDegree=5763, Wallet=u'66A731E9FEB460A3A48B5C56BB635B3A409DBD56'),
 Row(inDegree=3100, Wallet=u'17EBC6064CB035D84DC177CE763D2C93FD5556C6'),
 Row(inDegree=2380, Wallet=u'7B1DD94268DF8E11BA27AB1C99C61E914C717246'),
 Row(inDegree=2338, Wallet=u'A84458A0E8F009B3780CAC779873D298CF22BE8A'),
 Row(inDegree=2113, Wallet=u'EB9A27CF65B32AABA6FD08D7C4DE65F4C7CF9361'),
 Row(inDegree=2080, Wallet=u'EFB0CD7206CCF14275643097970DB8A38497B61D'),
 Row(inDegree=2002, Wallet=u'9B3A7C3A61712270055C1E4AC6FF2704D3ACB1E0')]

Then we sort by `outDegree`, which corresponds to the number of transactions sent.

In [9]:
vertexOutDegrees = g.outDegrees
senders = vertexOutDegrees.join(nodes, vertexOutDegrees.id == nodes.id) \
                          .drop("id") \
                          .orderBy("outDegree", ascending=False)
senders.take(10)

[Row(outDegree=2870, Wallet=u'7B1DD94268DF8E11BA27AB1C99C61E914C717246'),
 Row(outDegree=1965, Wallet=u'CBD3148D93C205F862599C080556A2A531146A3B'),
 Row(outDegree=1952, Wallet=u'9B3A7C3A61712270055C1E4AC6FF2704D3ACB1E0'),
 Row(outDegree=1694, Wallet=u'580740AC5A1B84C24D7EDB3F3AFC635BDFFB587B'),
 Row(outDegree=1675, Wallet=u'826AC9812C2BE5FE349A61ACD393F1F38E8C7D2D'),
 Row(outDegree=1590, Wallet=u'17EBC6064CB035D84DC177CE763D2C93FD5556C6'),
 Row(outDegree=1577, Wallet=u'EFB0CD7206CCF14275643097970DB8A38497B61D'),
 Row(outDegree=1568, Wallet=u'A84458A0E8F009B3780CAC779873D298CF22BE8A'),
 Row(outDegree=1559, Wallet=u'64B7D877B3DB608FD62EC8034770FD0AEFE9975A'),
 Row(outDegree=1559, Wallet=u'5445DAC9A29FCDE8C7C896CF063923FBED4F5D2C')]
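To make the two measures concrete, here is a minimal plain-Python sketch that computes in-degree and out-degree on a toy edge list with `collections.Counter` (the node names are made up). Note that repeated edges between the same pair of nodes are counted separately, so a degree counts edges, not distinct counterparties.

```python
from collections import Counter

# Toy (src, dst) pairs; ("a", "b") appears twice on purpose.
toy_edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d"), ("a", "b")]

# In-degree counts incoming edges, out-degree counts outgoing edges.
in_degree = Counter(dst for _, dst in toy_edges)
out_degree = Counter(src for src, _ in toy_edges)

# Equivalent of orderBy(..., ascending=False) followed by take(1).
top_receivers = in_degree.most_common(1)
print(top_receivers, out_degree["a"])
```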

You can verify on blockchain.info that the actual number of transactions is lower than what we have just calculated. This stems from how Bitcoin works: a transaction spends the whole amount of the outputs it consumes, and whatever is left after the payment is sent back to the sender as change. We are oversimplifying here; you can find the details [here](https://en.bitcoin.it/wiki/Transaction) or [here](https://bitcoin.stackexchange.com/questions/9007/why-are-there-two-transaction-outputs-when-sending-to-one-address).
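The change mechanism can be illustrated with a little arithmetic. The sketch below is a simplified model, not the converter's logic: the function name, the labels and the amounts are all hypothetical, and values are in satoshis to avoid floating-point rounding.

```python
def build_outputs(input_values, payment, fee):
    """Split the spent inputs into a payment output and a change output."""
    total_in = sum(input_values)
    change = total_in - payment - fee
    if change < 0:
        raise ValueError("inputs do not cover payment plus fee")
    outputs = [("recipient", payment)]
    if change > 0:
        # The remainder goes back to the sender as a second output.
        outputs.append(("sender_change", change))
    return outputs

# Hypothetical example: spend one 1 BTC output to pay 0.3 BTC with a small fee.
outputs = build_outputs([100_000_000], payment=30_000_000, fee=10_000)
print(outputs)
```

A single payment therefore tends to produce more than one output, which is one reason why edge counts in the graph can exceed the number of payments a wallet owner actually made.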

#### Find cycles of length 2

In [25]:
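# Motif for true cycles of length 2 (a -> b and b -> a); kept commented out:
# motifs = g.find("(a)-[]->(b); (b)-[]->(a)")
# motifs.count()
# motifs.show(5)

# Sanity check: edges whose two endpoints are the same vertex (self-loops).
# These were removed during data cleansing, so the count should be 0.
motifs = g.find("(a)-[]->(b)") \
          .filter("a = b")
motifs.count()
# motifs.show(5)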

0
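For intuition, a cycle of length 2 is a pair of wallets with an edge in each direction. A minimal plain-Python sketch on a toy edge list (the helper `two_cycles` and the data are hypothetical); it reports each pair once, whereas a motif query over `(a)` and `(b)` would likely match both orderings of the same pair:

```python
def two_cycles(edges):
    """Return sorted unordered pairs (u, v) that have edges u->v and v->u."""
    edge_set = set(edges)
    pairs = {
        tuple(sorted((src, dst)))
        for src, dst in edge_set
        if src != dst and (dst, src) in edge_set
    }
    return sorted(pairs)

# Toy data: 1<->2 and 3<->4 are reciprocal, (5, 5) is a self-loop and is ignored.
toy = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 5)]
pairs = two_cycles(toy)
print(pairs)
```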

#### Resource-consuming algorithms

In [10]:
# PageRank on the full graph fails with an out-of-memory error:
# results = g.pageRank(resetProbability=0.15, maxIter=1)
# results.vertices.select("id", "pagerank").show()
# results.edges.select("src", "dst", "weight").show()

# g.labelPropagation(maxIter)

# It would be nice to display the top-ranked vertices:
# display(ranks.vertices.orderBy(ranks.vertices.pagerank.desc()).limit(20))

## Visualization of a sub-graph

Our data contains a lot of transactions (2,087,249 transactions among 546,651 wallets), so let's show only a small fraction of the transaction graph. We will show all the outgoing transactions of a particular Bitcoin address.

In [86]:
from pyspark.sql.functions import col
import random

# feel free to use any address that is present in the dataset
address = senders.take(1000)[999].Wallet

sub_graph = g.find("(src)-[e]->(dst)") \
             .filter(col('src.Wallet') == address)
    
def node_to_dict(r):
    return {
        'id': r[0],
        'label': r[1],
        'x': random.uniform(0,1),
        'y': random.uniform(0,1),
        'size': random.uniform(0.2,1)
    }

sub_nodes = sub_graph.select("dst.id", "dst.Wallet").distinct()
sub_edges = sub_graph.select("e.src", "e.dst")

# list() is needed on Python 3, where map() returns an iterator without append()
target_nodes_dict = list(map(node_to_dict, sub_nodes.collect()))

def edge_to_dict(i, r):
    return {
        'id': i,
        'source': r[0],
        'target': r[1]
    }

sub_edges_dict = [edge_to_dict(i, r) for i, r in enumerate(sub_edges.collect())]

target_nodes_dict.append({
    'id': sub_edges.first()['src'],
    'label': address,
    'color': '#999',
    'x': -1,
    'y': 0.5,
    'size': 2
})

Now we are ready to show the data using the [sigma.js](http://sigmajs.org) library.

In [65]:
%%javascript
require.config({
    paths: {
        sigmajs: 'https://cdnjs.cloudflare.com/ajax/libs/sigma.js/1.2.0/sigma.min'
    }
});

require(['sigmajs']);

<IPython.core.display.Javascript object>

In [87]:
from IPython.display import display, HTML
from string import Template
import json

# Read the JavaScript template; the file handle is closed automatically.
with open('js/sigma-graph.js', 'r') as f:
    js_text_template = Template(f.read())

graph_data = { 'nodes': target_nodes_dict, 'edges': sub_edges_dict }

js_text = js_text_template.substitute({'graph_data': json.dumps(graph_data),
                                       'container': 'graph-div'})

html_template = Template('''
<div id="graph-div" style="height:400px"></div>
<script> $js_text </script>
''')

HTML(html_template.substitute({'js_text': js_text}))
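The contents of `js/sigma-graph.js` are not shown here, so the following is only a sketch of how `string.Template` fills the `$graph_data` and `$container` placeholders. The one-line JavaScript template and the `drawGraph` function in it are hypothetical stand-ins for the real file.

```python
import json
from string import Template

# Hypothetical stand-in for the contents of js/sigma-graph.js.
js_template = Template(
    "drawGraph(document.getElementById('$container'), $graph_data);"
)

graph = {"nodes": [{"id": 1, "label": "A"}], "edges": []}

# json.dumps produces a literal that is also valid JavaScript.
js = js_template.substitute(graph_data=json.dumps(graph), container="graph-div")
print(js)

# A literal dollar sign (for example jQuery's $) must be escaped as $$.
escaped = Template("$$('#graph-div')").substitute()
print(escaped)
```

One thing to watch for with this approach is that any unescaped `$` in the JavaScript file is treated as a placeholder, and `substitute` raises an error if it cannot resolve it.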

In [106]:
# Build a sub-graph from the selected sender and all of its recipients.
sub_vertices = sub_nodes.union(sub_graph.select("src.id", "src.Wallet").distinct())
sub_g = GraphFrame(sub_vertices, sub_edges).cache()

# PageRank is feasible on this small sub-graph, unlike on the full graph.
results = sub_g.pageRank(resetProbability=0.15, maxIter=3)

+------+-------------------+
|    id|           pagerank|
+------+-------------------+
| 74400|  0.151003937007874|
|211607|  0.151003937007874|
|108012|  0.151003937007874|
|291616|  0.151003937007874|
|535629|  0.151003937007874|
|479632|  0.151003937007874|
|110032|  0.151003937007874|
|232437|  0.151003937007874|
| 19644|  0.151003937007874|
|257649|  0.151003937007874|
|494852|  0.151003937007874|
|206454|  0.151003937007874|
|456064|  0.151003937007874|
|528466|0.15501968503937008|
|104068|0.15501968503937008|
|272072|  0.151003937007874|
| 22087|  0.151003937007874|
|234499|  0.151003937007874|
|124502|  0.151003937007874|
|264106|0.15501968503937008|
+------+-------------------+
only showing top 20 rows

+------+--------------------+-----------------+
|    id|              Wallet|         pagerank|
+------+--------------------+-----------------+
|337408|07060F98D74A94FD7...|             0.15|
|182139|27762B34711B5F815...|0.151003937007874|
|123745|99E3296C89710331D...|0.1510039

In [109]:
results.vertices.orderBy(results.vertices.pagerank.asc()).limit(20).show()
results.edges.select("src", "dst", "weight").show()

+------+--------------------+-----------------+
|    id|              Wallet|         pagerank|
+------+--------------------+-----------------+
|337408|07060F98D74A94FD7...|             0.15|
|123745|99E3296C89710331D...|0.151003937007874|
| 19644|25F6E93F316D37E5F...|0.151003937007874|
|535629|52C616969321918CD...|0.151003937007874|
|291616|36476D0F87E6E0ABB...|0.151003937007874|
| 74400|C6EAF8EC21D454299...|0.151003937007874|
|108012|D0539D54960842FAE...|0.151003937007874|
|211607|8DDF38700813D734F...|0.151003937007874|
|232437|2AA1A99E592391756...|0.151003937007874|
|257649|486A931FDEEB1DE4E...|0.151003937007874|
| 54138|96C498A9E4A614048...|0.151003937007874|
|206454|EE32ADD1F1AFE00F9...|0.151003937007874|
|494852|99D8033F57AF6098D...|0.151003937007874|
|479632|F0C94074ED7F7068E...|0.151003937007874|
|456064|CA642C960EBE18C67...|0.151003937007874|
|110032|901E078965FE13206...|0.151003937007874|
|272072|14BA659DC996E5178...|0.151003937007874|
| 22087|78079040588D3E924...|0.151003937
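The ranks printed above can be reproduced by hand. A minimal sketch, assuming the unnormalized update `rank = reset + (1 - reset) * sum(rank[src] / outDegree[src])` over incoming edges, an initial rank of 1.0, and a sender with 127 outgoing edges (all three are assumptions inferred from the printed values, not taken from the notebook):

```python
import math
from collections import Counter

def pagerank_unnormalized(edges, reset=0.15, iterations=3):
    """Plain-Python PageRank without normalizing the ranks to sum to 1."""
    nodes = {n for edge in edges for n in edge}
    out_degree = Counter(src for src, _ in edges)
    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            # Each edge carries an equal share of the source's rank.
            incoming[dst] += rank[src] / out_degree[src]
        rank = {n: reset + (1 - reset) * incoming[n] for n in nodes}
    return rank

# Star-shaped toy graph: node 0 sends 5 edges to node 1 and one edge
# to each of nodes 2..123, which gives 127 outgoing edges in total.
toy_edges = [(0, 1)] * 5 + [(0, t) for t in range(2, 124)]
rank = pagerank_unnormalized(toy_edges)
print(rank[0], rank[1], rank[2])
```

Under these assumptions the sender, which receives nothing, stays at the reset value, a wallet paid once ends up slightly above it, and a wallet paid five times gets five times that increment.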

In [None]:
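labeled = g.labelPropagation(maxIter=3)
# labelPropagation returns a DataFrame of vertices with an added `label` column;
# a GraphFrame itself has no show() method.
labeled.show()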

In [None]:
g.vertices.show()