# Blockchain analysis


## Basic setup

Here we import the `pyspark` module and set up a `SparkSession`. By default we use a `SparkSession` running locally with four worker threads (`local[4]`); to run against a cluster, replace `local[4]` with the URL of the Spark master.


In [None]:
spark

In [None]:
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import *

spark = SparkSession.builder \
                    .master("local[4]") \
                    .config("spark.driver.memory", "4g") \
                    .getOrCreate()
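
The master URL is the only part of this setup that changes between a laptop and a cluster. A minimal sketch of how that choice could be factored out, assuming a standalone Spark master on its default port 7077. `master_url` is a hypothetical helper, not part of `pyspark`:

```python
def master_url(host=None, port=7077, threads=4):
    """Return a Spark master URL (hypothetical helper, for illustration only).

    Without a host, run Spark locally with `threads` worker threads;
    otherwise point at a standalone master listening on `host:port`.
    """
    if host is None:
        return "local[{}]".format(threads)
    return "spark://{}:{}".format(host, port)


print(master_url())            # local[4]
print(master_url("10.0.0.5"))  # spark://10.0.0.5:7077
```

The returned string could be passed to `SparkSession.builder.master(...)` in place of the literal `"local[4]"`.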

## Loading the data

To obtain the graph representing the transactions in the Bitcoin network, we need to load sets of nodes representing the addresses (fingerprints of the public keys), transactions and blocks. We also need the set of edges representing the relations between these entities. For this example we will use the following parquet files, which were generated from the blockchain data by this [converter](https://github.com/Jiri-Kremser/bitcoin-insights/tree/master/parquet-converter).

In [None]:
addresses = spark.read.load("/tmp/addresses.parquet")
addresses.show(5)

blocks = spark.read.load("/tmp/blocks.parquet")
blocks.show(5)

transactions = spark.read.load("/tmp/transactions.parquet")
transactions.show(5)

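Unify the address, block and transaction nodes into a single DataFrame of vertices with `id` and `type` columns. Addresses are tagged `A` and transactions `T`; `union` matches columns by position, so block rows keep whatever their second column holds in `type`.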

In [None]:
allNodes = addresses.withColumn("type", lit("A")) \
                    .union(blocks) \
                    .union(transactions.withColumn("type", lit("T"))) \
                    .withColumnRenamed("address", "id")

allNodes.show()
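
Spark's `union` lines columns up by position, not by name, which is why the block rows end up with a non-letter value in `type`. A plain-Python sketch of that behaviour with made-up rows. The `(hash, timestamp)` layout for blocks is an assumption inferred from the later filter on `block.type`, and Spark's type reconciliation is ignored here:

```python
# Made-up rows; the real schemas come from the parquet files.
addresses = [("addr1", "A")]        # (address, type) after withColumn
blocks = [("block1", 1356998400)]   # assumed (hash, timestamp) layout
transactions = [("tx1", "T")]       # (hash, type) after withColumn

# Column names are taken from the first DataFrame, with address renamed to id.
columns = ("id", "type")

# A positional union concatenates the rows under those names.
all_nodes = [dict(zip(columns, row)) for row in addresses + blocks + transactions]

for node in all_nodes:
    print(node)
```

The block row carries its timestamp in the `type` slot, which the later cells rely on when they filter blocks by date.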

In [None]:
raw_edges = spark.read.load("/tmp/edges.parquet") \
                      .cache()
raw_edges.show(5)
raw_edges.count()

## Constructing the graph representation

In [None]:
from graphframes import *

g = GraphFrame(allNodes, raw_edges).cache()

#### Get the vertex degrees

In [None]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

update_type = udf(lambda t: t if t == 'A' or t == 'T' else 'B', StringType())


vertexDegreesAndIds = g.inDegrees.join(g.outDegrees, "id").join(g.vertices, "id")
vertexDegrees = vertexDegreesAndIds.drop("id") \
                                   .withColumn('type', update_type(col("type")))

vertexDegrees.sort(desc("inDegree")).show(2, False)
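
GraphFrames' `inDegrees` and `outDegrees` only list vertices that have at least one incoming or outgoing edge, so the inner join above keeps only vertices that have both. A plain-Python sketch of the same computation on made-up edges:

```python
from collections import Counter

# Made-up edges as (src, dst) pairs.
edges = [("a1", "tx1"), ("a2", "tx1"), ("tx1", "a3"), ("b1", "tx1")]

in_degree = Counter(dst for _, dst in edges)
out_degree = Counter(src for src, _ in edges)

# Inner join on the vertex id: only vertices present in both counters survive.
degrees = {
    v: (in_degree[v], out_degree[v])
    for v in in_degree.keys() & out_degree.keys()
}
print(degrees)  # {'tx1': (3, 1)}
```

Pure sources such as `a1` and pure sinks such as `a3` drop out of the result, which matters when interpreting the averages computed from it.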

#### Calculate some basic statistics

In [None]:
vertexDegrees.groupBy("type") \
             .agg(avg(col("inDegree")), stddev(col("inDegree")), \
                  avg(col("outDegree")), stddev(col("outDegree"))).show()
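
For reference, the same aggregation can be sketched with the standard library on made-up rows. Spark's `stddev` is the sample standard deviation (an alias of `stddev_samp`), which matches `statistics.stdev`:

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up (type, inDegree) rows.
rows = [("A", 1), ("A", 3), ("T", 2), ("T", 4), ("T", 6)]

# Group the degrees by node type.
by_type = defaultdict(list)
for node_type, in_degree in rows:
    by_type[node_type].append(in_degree)

# statistics.stdev uses the n - 1 denominator (sample standard deviation).
stats = {t: (mean(v), stdev(v)) for t, v in by_type.items()}
print(stats)
```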

#### Find some patterns in the graph
Motif finding in GraphFrames uses a simple ASCII-art-like DSL, and because it is all DataFrame based, the queries are optimized by Spark's Catalyst optimizer.

In [None]:
motifs = g.find("(address)-[e1]->(tx);(block)-[e2]->(tx);(tx)-[e3]->(dstAddress)") \
                       .filter("address.type = 'A'") \
                       .filter("tx.type = 'T'") \
                       .filter("e1.value < 10000") \
                       .filter("block.type > unix_timestamp('2013-01-01 00:00:00')") \
                       .filter("block.type < unix_timestamp('2014-01-01 00:00:00')") \
                       .filter("dstAddress.type = 'A'") \
                       .filter("e3.value < 10000") \
                       .cache()
motifs.selectExpr("e1.src as src_address" ,"e1.value as src_value",
                  "e3.value as dst_value", "e3.dst as dst_address").show()
motifs.count()
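
To make the pattern concrete, here is a plain-Python sketch of a reduced version of the motif, `(address)-[e1]->(tx);(tx)-[e3]->(dstAddress)`, on made-up edges. It leaves out the block pattern and the type filters and only illustrates what is being matched, not how GraphFrames evaluates it:

```python
from collections import defaultdict

# Made-up edges as (src, dst, value) triples.
edges = [
    ("a1", "tx1", 500),
    ("a2", "tx1", 20000),
    ("tx1", "a3", 400),
    ("tx1", "a4", 50000),
]

# Index edges by source so the second hop is a lookup, not a scan.
by_src = defaultdict(list)
for src, dst, value in edges:
    by_src[src].append((dst, value))

# Chain two edges through a shared middle vertex, keeping values below 10000.
matches = [
    (src, v1, v3, dst)
    for src, tx, v1 in edges
    if v1 < 10000
    for dst, v3 in by_src.get(tx, [])
    if v3 < 10000
]
print(matches)  # [('a1', 500, 400, 'a3')]
```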

# Visualization methods

Our data contain a lot of nodes and edges, so let's show only a small fraction of the transaction graph. We will show all the input addresses and all the outputs of one particular Bitcoin transaction.

In [None]:
from pyspark.sql.functions import col
import random

vertexOutDegrees = g.outDegrees
txs = vertexOutDegrees.join(allNodes, vertexOutDegrees.id == allNodes.id) \
                               .filter(col("type") == "T") \
                               .orderBy("outDegree", ascending=False)

# feel free to use any tx that is present in the dataset
tx = txs.take(1000)[800].id

inputs = g.find("(input)-[e]->(tx)") \
             .filter(col("tx.id") == tx) \
             .filter(col("input.type") == "A")
        
outputs = g.find("(tx)-[e]->()") \
             .filter(col("tx.id") == tx)

def node_to_dict(r):
    return {
        'id': r[0],
        'label': '<font color="red">' + r[0] + '</font>',
        'type': r[1],
        'color': '#090' if r[1] == "in" else '#900',
        'x': random.uniform(0,1) + (-1 if r[1] == "in" else 1),
        'y': random.uniform(0,1),
        'size': random.uniform(0.2,1)
    }

sub_nodes = inputs.select(concat(lit("in"), col("e.src")), lit("in")) \
                  .withColumnRenamed("src", "id") \
                  .union(outputs.select(concat(lit("out"), col("e.dst")), \
                                        lit("out")).withColumnRenamed("dst", "id")) \
                  .distinct()
        
sub_edges = inputs.select(concat(lit("in"), col("e.src")), lit(tx)) \
                  .union(outputs.select(lit(tx), concat(lit("out"), col("e.dst"))))

# materialize as a list (map is lazy in Python 3) so the tx node can be appended below
target_nodes_dict = list(map(node_to_dict, sub_nodes.collect()))


def edge_to_dict(i, r):
    return {
        'id': i,
        'source': r[0],
        'target': r[1]
    }

sub_edges_dict = [edge_to_dict(i, r) for i, r in enumerate(sub_edges.collect())]

target_nodes_dict.append({
    'id': tx,
    'label': '<font color="red">' + tx + '</font>',
    'color': '#999',
    'x': 0,
    'y': 0.5,
    'size': 2
})
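
Before handing the data to a renderer it can be worth checking that every edge endpoint exists as a node, since graph libraries commonly reject or silently drop dangling edges. A minimal sketch with made-up data; `missing_endpoints` is a hypothetical helper:

```python
def missing_endpoints(nodes, edges):
    """Return edge endpoints that have no matching node id (hypothetical helper)."""
    node_ids = {n["id"] for n in nodes}
    endpoints = {e[k] for e in edges for k in ("source", "target")}
    return sorted(endpoints - node_ids)


nodes = [{"id": "inA"}, {"id": "outB"}, {"id": "tx"}]
edges = [
    {"id": 0, "source": "inA", "target": "tx"},
    {"id": 1, "source": "tx", "target": "outC"},  # "outC" has no node entry
]
print(missing_endpoints(nodes, edges))  # ['outC']
```

Applied to the notebook's data, `missing_endpoints(target_nodes_dict, sub_edges_dict)` would be expected to return an empty list.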

## Sigmajs library

Now we are ready to show the data using the [sigmajs](http://sigmajs.org) library.

In [None]:
%%javascript
require.config({
    paths: {
        sigmajs: 'https://cdnjs.cloudflare.com/ajax/libs/sigma.js/1.2.0/sigma.min',
        force: 'https://unpkg.com/3d-force-graph@1.31.1/dist/3d-force-graph.min'
    }
});

require(['sigmajs']);

In [None]:
from IPython.display import HTML
from string import Template
import json

# read the JavaScript template that draws the graph with sigma.js
with open('js/sigma-graph.js', 'r') as f:
    js_text_template = Template(f.read())

graph_data = { 'nodes': target_nodes_dict, 'edges': sub_edges_dict }

js_text = js_text_template.substitute({'graph_data': json.dumps(graph_data),
                                       'container': 'graph-div'})

html_template = Template('''
<div id="graph-div" style="height:400px"></div>
<script> $js_text </script>
''')

HTML(html_template.substitute({'js_text': js_text}))
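
The templating step relies on `string.Template`, which replaces `$name` placeholders and requires a literal dollar sign to be written as `$$`. A small sketch of the mechanism; the JavaScript snippet and its `render` function are hypothetical stand-ins for the contents of `js/sigma-graph.js`:

```python
import json
from string import Template

# Hypothetical stand-in for the contents of js/sigma-graph.js.
js_source = "var price = '$$5'; render($graph_data, '$container');"

graph = {"nodes": [{"id": "tx"}], "edges": []}

# substitute() raises KeyError if a placeholder has no value.
js_text = Template(js_source).substitute(
    graph_data=json.dumps(graph),
    container="graph-div",
)
print(js_text)
```

This is worth keeping in mind when the embedded JavaScript uses `$` itself, for example with jQuery.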

## ForceGraph3D library

The same sub-graph can also be rendered in 3D with the [3d-force-graph](https://github.com/vasturiano/3d-force-graph) library.

In [None]:
graph_data = { 'nodes': target_nodes_dict, 'links': sub_edges_dict }
json_data = json.dumps(graph_data)

html_template = Template('''
<div id="3d-graph"></div>
<script>
require(['force'], function(ForceGraph3D) {

    const open = (node) => {
      console.log(node.type);
      let id;
      if (node.type === 'out') {
        id = 'address/' + node.id.substring(3);
      } else if (node.type === 'in') {
        id = 'address/' + node.id.substring(2);
      } else {
        id = 'tx/' + node.id;
      }
      window.open('https://blockchain.info/' + id , '_blank');
    }

    const Graph = ForceGraph3D()(document.getElementById('3d-graph'))
                                         .width(980)
                                         .height(700)
                                         .onNodeClick(open)
                                         .cameraPosition({ z: 200 })
                                         .backgroundColor('#fff')
                                         .graphData($json_data)
                                         .nodeLabel('label')
});
</script>
''')

HTML(html_template.substitute({'json_data': json_data}))

In [None]:
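sample_data = raw_edges.sample(False, 0.0015).collect()
# links between the sampled vertices; the first two edge columns are src and dst
foo_dict = [{'source': row[0], 'target': row[1]} for row in sample_data]
# one node entry per distinct endpoint (list comprehensions also work in Python 3)
foo_dict2 = [{'id': i} for i in sorted({v for row in sample_data for v in (row[0], row[1])})]
foo_dict2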

In [None]:
graph_data = { 'nodes': foo_dict2, 'links': foo_dict }
json_data = json.dumps(graph_data)

html_template = Template('''
<div id="3d-graph2"></div>
<script>
require(['force'], function(ForceGraph3D) {

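    // nodes in this sample only carry an 'id', so it doubles as the label;
    // the click handler from the previous cell is not in scope here
    const Graph = ForceGraph3D()(document.getElementById('3d-graph2'))
                                         .width(980)
                                         .height(700)
                                         .cameraPosition({ z: 2000 })
                                         .backgroundColor('#fff')
                                         .graphData($json_data)
                                         .nodeLabel('id')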
});
</script>
''')

HTML(html_template.substitute({'json_data': json_data}))


In [None]:
transactions.join(vertexDegreesAndIds, transactions.hash == vertexDegreesAndIds.id) \
            .filter(col("inDegree") > 10) \
            .filter(col("outDegree") > 10) \
            .select("id") \
            .show(1, False)

## Networkx & matplotlib
Alternatively, we can visualize the sub-graph using `networkx` and `matplotlib`.

In [None]:
import networkx as nx
G = nx.Graph()
G.add_edges_from(sub_edges.collect())

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
options = {
    'node_color': 'g',
    'node_size': 70,
    'width': 0.2,
    'node_shape': 'd',
    'with_labels': False,
    'font_size': 5,
}
nx.draw(G, **options)

We can also draw a random sample of the whole graph using a random layout.

In [None]:
import hashlib

def node_hash(s):
    # map an id string to a stable integer taken from the first 24 hex digits of its MD5
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:24], 16)

sample_data = raw_edges.sample(False, 0.0004).collect()

G2 = nx.Graph()
G2.add_edges_from((node_hash(row[0]), node_hash(row[1])) for row in sample_data)

options = {
    'node_color': 'g',
    'node_size': 1,
    'width': 0.05,
    'node_shape': 'o',
    'vmin': 100.1,
    'vmax': 10.1,
    'with_labels': False,
}
nx.draw_random(G2, **options)

In [None]:
nx.draw_shell(G2, **options)

The same sample can also be drawn using the spring layout algorithm.

**Warning: this may take a couple of minutes to render.**

In [None]:
nx.draw_spring(G2, **options)

Unfortunately, the plotting mechanism in `networkx` does not support zooming, but it is possible to export the data and explore them with tools like `Gephi`.
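
One simple export route is a CSV edge table. The sketch below assumes Gephi's spreadsheet importer, which reads edge tables with `Source` and `Target` columns; `write_edge_csv` is a hypothetical helper and the edges are made up:

```python
import csv
import io

# Made-up edges; in the notebook these could come from sub_edges.collect().
edges = [("inA", "tx"), ("tx", "outB")]


def write_edge_csv(edges, handle):
    """Write (source, target) pairs as a CSV edge table (hypothetical helper)."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# An in-memory buffer keeps the example self-contained; a real export
# would pass a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```

`networkx` also provides `nx.write_gexf(G, path)`, which writes a graph in the GEXF format that Gephi opens directly.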