# Blockchain analysis
todo: this

## Basic setup

Here we will import the `pyspark` module and set up a `SparkSession`. By default, we'll use a `SparkSession` running locally with a single worker thread (`local[1]`); we're dealing with small data, so it doesn't make sense to run against a cluster.


In [None]:
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.master("local[1]").getOrCreate()
sc = spark.sparkContext

## Loading the data

To obtain the graph representing the transactions in the Bitcoin network, we need to load the set of nodes representing the wallets (fingerprints of the public keys) and the set of edges representing the transactions. For this example we will use two parquet files that were generated from the blockchain data by this [converter](https://github.com/Jiri-Kremser/bitcoin-insights/tree/master/parquet-converter).

In [None]:
raw_nodes = spark.read.load("/data/nodes.parquet") \
                      .withColumnRenamed("_1", "id") \
                      .withColumnRenamed("_2", "Wallet")
raw_nodes.show(5)

As you can see, each record in the wallet column contains a string `bitcoinaddress_<hash>`, where the hash is the actual address of the wallet. Let's remove the redundant prefix.

In [None]:
nodes = raw_nodes.withColumn("Wallet", regexp_replace("Wallet", "bitcoinaddress_", "")).cache()
nodes.show(5)
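
The pattern passed to `regexp_replace` above is unanchored, so it would also strip the prefix if it appeared in the middle of a value. Below is a minimal plain-Python sketch of the same cleanup with an anchored pattern. The sample strings and the `strip_prefix` helper are hypothetical and are not taken from the dataset.

```python
import re

PREFIX = "bitcoinaddress_"

def strip_prefix(wallet: str) -> str:
    """Remove the prefix only when it starts the string."""
    return re.sub(r"^" + re.escape(PREFIX), "", wallet)

# Hypothetical sample values, used only for illustration.
samples = ["bitcoinaddress_1ExampleHashA", "1ExampleHashB"]
print([strip_prefix(s) for s in samples])
```

In Spark, the same effect should be achievable by passing `"^bitcoinaddress_"` as the pattern to `regexp_replace`.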

We can also verify that these addresses are real on https://blockchain.info/address/.

Example:
 * get a random address

In [None]:
random_address = nodes.rdd.takeSample(False, 1)[0][1]
random_address

 * create the link from the address: https://blockchain.info/address/{{random_address}} 
 
 (todo: http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/python-markdown/readme.html)
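
Until the markdown templating is in place, the link can be built in a code cell instead. This is a minimal sketch; the helper names `address_url` and `address_markdown` and the sample address are hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://blockchain.info/address/"

def address_url(address: str) -> str:
    """Build the lookup URL for a wallet address."""
    return BASE_URL + quote(address, safe="")

def address_markdown(address: str) -> str:
    """Render the address as a markdown link."""
    return f"[{address}]({address_url(address)})"

# Hypothetical address, used only for illustration.
print(address_markdown("1ExampleAddress"))
```

In the notebook, `IPython.display.Markdown(address_markdown(random_address))` should render the result as a clickable link.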

In [None]:
edges = spark.read.load("/data/edges.parquet") \
                  .withColumnRenamed("srcId", "src") \
                  .withColumnRenamed("dstId", "dst") \
                  .cache()
edges.show(5)
edges.count()

## Constructing the graph representation

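Spark contains an API for graph processing called [GraphX](https://spark.apache.org/graphx/). It comes with multiple built-in algorithms such as PageRank and uses the [Pregel API](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api). In this notebook we use the `graphframes` package, a DataFrame-based graph API. A `GraphFrame` expects a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns, which is why the columns were renamed while loading the data.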

In [None]:
from graphframes import GraphFrame

g = GraphFrame(nodes, edges).cache()

#### Get the top 10 wallets with respect to the transaction count

First, by sorting the nodes by `inDegree`, which corresponds to the number of transactions received.

In [None]:
vertexInDegrees = g.inDegrees
vertexInDegrees.join(nodes, vertexInDegrees.id == nodes.id) \
               .drop("id") \
               .orderBy("inDegree", ascending=False) \
               .take(10)

Then by sorting the nodes by `outDegree`, which corresponds to the number of transactions sent.

In [None]:
vertexOutDegrees = g.outDegrees
senders = vertexOutDegrees.join(nodes, vertexOutDegrees.id == nodes.id) \
                          .drop("id") \
                          .orderBy("outDegree", ascending=False)
senders.take(10)

You can verify on blockchain.info that the actual number of transactions is lower than what we have just calculated. This stems from how Bitcoin works: when sending BTC from one wallet to another, a transaction spends the whole amount of its inputs, and the remainder is sent back to the sender as change. We are oversimplifying here; you can find the details [here](https://en.bitcoin.it/wiki/Transaction) or [here](https://bitcoin.stackexchange.com/questions/9007/why-are-there-two-transaction-outputs-when-sending-to-one-address).
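
To make the two degree counts concrete, here is a plain-Python sketch on a hypothetical toy edge list. It only illustrates the counting; it is not how GraphFrames computes `inDegrees` and `outDegrees` internally.

```python
from collections import Counter

# Hypothetical toy edge list of (src, dst) wallet ids.
# The pair (1, 2) appears twice: each edge is counted separately.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (1, 2)]

# in-degree: how many edges point at a wallet (transactions received)
in_degree = Counter(dst for _, dst in edges)
# out-degree: how many edges leave a wallet (transactions sent)
out_degree = Counter(src for src, _ in edges)

print(in_degree.most_common(2))
print(out_degree.most_common(2))
```

Because every edge is counted, repeated edges between the same two wallets raise both degrees.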

#### Find cycles of length 2

In [None]:
motifs = g.find("(a)-[]->(b); (b)-[]->(a)")
# motifs.count()
motifs.show(5)
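
The motif above matches pairs of wallets that have sent transactions to each other. The same idea can be sketched in plain Python on a hypothetical toy edge set. The `two_cycles` helper is an illustration only; unlike a symmetric pattern match, it reports each pair once and ignores self-loops.

```python
# Hypothetical toy edges as (src, dst) pairs.
edges = {(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 5)}

def two_cycles(edge_set):
    """Return each reciprocal pair once, as a tuple (a, b) with a < b."""
    return sorted(
        (a, b) for a, b in edge_set
        if a < b and (b, a) in edge_set
    )

print(two_cycles(edges))
```

Since the pattern is symmetric in `a` and `b`, expect the motif query to list a given pair in both orientations.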

#### Resource consuming foo

In [None]:
# this fails with OOM error
#results = g.pageRank(resetProbability=0.15, maxIter=1)
#results.vertices.select("id", "pagerank").show()
#results.edges.select("src", "dst", "weight").show()

# g.labelPropagation(maxIter)

# this would be nice display(ranks.vertices.orderBy(ranks.vertices.pagerank.desc()).limit(20))
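
Since the `pageRank` call above is disabled because of the out-of-memory error, here is a textbook power-iteration sketch on a hypothetical four-node graph to show what `resetProbability` and `maxIter` control. It is not GraphFrames' implementation. The scores here are normalized to sum to 1, which need not match GraphFrames' scaling.

```python
def pagerank(edges, reset_probability=0.15, max_iter=20):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_rank = {v: base for v in nodes}
        for src in nodes:
            if out_links[src]:
                # Each node splits its damped rank among its targets.
                share = (1 - reset_probability) * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: a 3-cycle plus node 4 pointing into it.
toy_edges = [(1, 2), (2, 3), (3, 1), (4, 1)]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Node 4 has no incoming edges, so it only ever receives the reset share and ends up with the lowest score.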

## Visualization of a sub-graph

Our data contains many transactions (2 087 249 transactions among 546 651 wallets), so let's show only a small fraction of the transaction graph. We will show all the outgoing transactions of a particular Bitcoin address.

In [None]:
from pyspark.sql.functions import col
import random

# feel free to use any address that is present in the dataset
address = senders.take(1000)[999].Wallet

# all edges whose source vertex is the chosen wallet
sub_graph = g.find("(src)-[e]->(dst)") \
             .filter(col('src.Wallet') == address)

def node_to_dict(r):
    # place each receiving wallet at a random position with a random size
    return {
        'id': r.dst['id'],
        'label': r.dst['Wallet'],
        'x': random.uniform(0,1),
        'y': random.uniform(0,1),
        'size': random.uniform(0.2,1)
    }

# build a real list (in Python 3, map() returns an iterator without append())
target_nodes = [node_to_dict(r) for r in sub_graph.select("dst").distinct().collect()]

def edge_to_dict(i, r):
    # access the edge fields by name rather than by position
    return {
        'id': i,
        'source': r.e['src'],
        'target': r.e['dst']
    }

sub_edges = [edge_to_dict(i, r) for i, r in enumerate(sub_graph.select("e").collect())]

# add the sending wallet itself, placed to the left of the receivers
target_nodes.append({
    'id': sub_edges[0]['source'],
    'label': address,
    'color': '#999',
    'x': -1,
    'y': 0.5,
    'size': 2
})

Now we are ready to show the data using the [sigma.js](http://sigmajs.org) library.

In [None]:
%%javascript
require.config({
    paths: {
        sigmajs: 'https://cdnjs.cloudflare.com/ajax/libs/sigma.js/1.2.0/sigma.min'
    }
});

require(['sigmajs']);

In [None]:
from IPython.display import HTML
from string import Template
import json

# read the JavaScript template that draws the graph
with open('js/sigma-graph.js', 'r') as f:
    js_text_template = Template(f.read())

graph_data = { 'nodes': target_nodes, 'edges': sub_edges }

# fill in the graph data and the id of the target <div>
js_text = js_text_template.substitute({'graph_data': json.dumps(graph_data),
                                       'container': 'graph-div'})

html_template = Template('''
<div id="graph-div" style="height:400px"></div>
<script> $js_text </script>
''')

HTML(html_template.substitute({'js_text': js_text}))
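
The contents of `js/sigma-graph.js` are not shown in this notebook. The sketch below uses a hypothetical one-line stand-in for that file to illustrate how `string.Template` fills the `$graph_data` and `$container` placeholders. The `new sigma({...})` call is an assumption based on the sigma.js 1.x constructor, and the graph data is made up.

```python
import json
from string import Template

# Hypothetical stand-in for the contents of js/sigma-graph.js.
js_template = Template(
    "new sigma({ graph: $graph_data, container: '$container' });"
)

# Made-up graph in the same shape as the node and edge dicts built earlier.
graph_data = {
    "nodes": [
        {"id": 1, "label": "sender", "x": -1, "y": 0.5, "size": 2},
        {"id": 2, "label": "receiver", "x": 0.3, "y": 0.7, "size": 1},
    ],
    "edges": [{"id": 0, "source": 1, "target": 2}],
}

# json.dumps produces a JavaScript-compatible object literal.
js_text = js_template.substitute(
    graph_data=json.dumps(graph_data), container="graph-div"
)
print(js_text)
```

One thing to watch for: any literal `$` in the JavaScript file (for example a jQuery call) has to be written as `$$`, otherwise `substitute` treats it as a placeholder.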