# Analyse Blockchain with GraphX

_Trying to identify interesting addresses in the blockchain transaction graph_

## Basic setup

Here we create the Spark session that is needed for all further DataFrame processing.


In [ ]:
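import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
                    .master("local[4]")
                    .getOrCreate()

// Enables the $"column" syntax and .toDF on RDDs used in later cells
import spark.implicits._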

## Check the data on disk
The graph data is stored on disk as two Parquet files: one with the vertices and one with the edges.

In [ ]:
:sh du -h /tmp/nodes.parquet

In [ ]:
:sh du -h /tmp/edges.parquet

## Load the data

In [ ]:
val rawNodes = spark.read.load("/tmp/nodes.parquet")
rawNodes.show(5, false)

#### Number of vertices

In [ ]:
rawNodes.count

### Clean the data

In [ ]:
import org.apache.spark.sql.functions.regexp_replace

val nodes = rawNodes.na.drop()
                    .withColumnRenamed("_1", "id")
                    .withColumnRenamed("_2", "address")
                    .withColumn("address", regexp_replace($"address", "bitcoinaddress_", ""))
nodes.show(5, false)

In [ ]:
val edges = spark.read.load("/tmp/edges.parquet")
                      .drop($"attr")
edges.show(5)

#### Number of edges

In [ ]:
edges.count()

## Create the graph
The GraphX library works with RDDs, so we first need to convert the DataFrames to RDDs.

In [ ]:
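import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Convert the DataFrame rows to the typed RDDs GraphX expects:
// (id, address) pairs for vertices and attribute-less edges (source id, destination id).
val nodesRdd: RDD[(VertexId, String)] = nodes.rdd.map(row => (row.getLong(0), row.getString(1)))
val edgesRdd: RDD[Edge[Option[String]]] = edges.rdd.map(row => Edge(row.getLong(0), row.getLong(1), Option.empty[String]))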


In [ ]:
val graph = Graph(nodesRdd, edgesRdd)

## Calculate the Page Rank

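This may take a couple of minutes, depending on the size of the data. The argument `0.001` is the convergence tolerance: the computation stops once the ranks change by less than this amount. The implementation of the algorithm is described [here](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.PageRank$).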

In [ ]:
val ranks = graph.pageRank(0.001)
                 .vertices
                 .toDF("id", "rank")

ranks.show
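
To make the idea behind the ranking more concrete, here is a minimal, textbook-style sketch of PageRank by power iteration on a tiny in-memory edge list. It is not the GraphX implementation; the object name `PageRankSketch`, the toy graph, and the choice of starting every rank at 1.0 are assumptions made for illustration. It only shows how rank flows along edges until the values stop changing by more than the tolerance.

```scala
object PageRankSketch {
  /** Textbook PageRank on a small edge list (hypothetical helper, not the GraphX code).
    * Assumes a non-empty edge list. */
  def pageRank(edges: Seq[(Long, Long)],
               resetProb: Double = 0.15,
               tol: Double = 0.001): Map[Long, Double] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    var ranks = vertices.map(_ -> 1.0).toMap
    var delta = Double.MaxValue
    while (delta > tol) {
      // Each vertex splits its current rank evenly among its outgoing edges
      val contribs = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      val next = vertices
        .map(v => v -> (resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0)))
        .toMap
      // Stop when no rank moved by more than the tolerance
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
    }
    ranks
  }
}

// Toy graph: 1 -> 3, 2 -> 3, 3 -> 1
val toyRanks = PageRankSketch.pageRank(Seq((1L, 3L), (2L, 3L), (3L, 1L)))
```

In the toy graph, vertex 3 receives rank from two other vertices and ends up with the highest score, while vertex 2 has no incoming edges and keeps only the reset probability.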

Now we can sort the vertices by their calculated page ranks.

In [ ]:
import org.apache.spark.sql.functions.desc

val sortedRanks = ranks.join(nodes, "id")
                       .sort(desc("rank"))

sortedRanks.show(5, false)

In [ ]:
// Pick the address column by name rather than by position
val top10 = sortedRanks.take(10).map(_.getAs[String]("address"))

top10

top10: Array[String] = Array(C825A1ECF2A6830C4401620C3A16F1995057C2AB, DE21D51F82F065DF011CFB3CDCE09C6F71FC716B, D63066643AFA128CE4BEBB2523242ADF5F07A0A9, AA3750AA18B8A0F3F0590731E1FAB934856680CF, 4FA170CFDE2372AC91D479F989DC4DB5AA8D47E0, 9A4E5250E56CA29765635022FB11624116B226BE, 200413B74F3B34198333778C79AF1728AC9A912A, 7773B5B0576CCC2FC79E94098B7D879CCE8BB377, 7C154ED1DC59609E3D26ABB2DF2EA3D587CD8C41, 9B71CA50A249F283DCE5848A6259EFDD2E47FA4B)

### Helper functions

A Bitcoin address is essentially a hash, or fingerprint, of the public key. Internally, the blockchain stores addresses as `hash160` values with no redundancy. However, humans tend to make mistakes, so to mitigate the risk of sending money to the wrong address because of a typo, there is also a form of the address that includes a checksum. It is possible to convert between the two forms.
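
The conversion from a `hash160` value to the checksummed form can also be done locally. The following is a minimal sketch, assuming the classic Base58Check encoding for pay-to-public-key-hash addresses (version byte `0x00`, checksum taken from the first four bytes of a double SHA-256). The object name `Base58Check` and the function `hash160ToAddress` are hypothetical helpers introduced here, not part of the notebook or of any library.

```scala
import java.security.MessageDigest

object Base58Check {
  private val Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

  private def sha256(bytes: Array[Byte]): Array[Byte] =
    MessageDigest.getInstance("SHA-256").digest(bytes)

  private def hexToBytes(hex: String): Array[Byte] =
    hex.grouped(2).map(Integer.parseInt(_, 16).toByte).toArray

  // Base58 encoding: leading zero bytes become '1', the rest is a base-58 number
  private def base58(bytes: Array[Byte]): String = {
    val leadingZeros = bytes.takeWhile(_ == 0).length
    var n = BigInt(1, bytes)
    val sb = new StringBuilder
    while (n > 0) {
      val (q, r) = n /% 58
      sb.append(Alphabet(r.toInt))
      n = q
    }
    ("1" * leadingZeros) + sb.reverse.toString
  }

  /** Assumes a 40-character hex hash160 and, by default, the mainnet P2PKH version byte. */
  def hash160ToAddress(hash160Hex: String, version: Byte = 0x00): String = {
    val payload = version +: hexToBytes(hash160Hex)
    val checksum = sha256(sha256(payload)).take(4) // the redundancy that catches typos
    base58(payload ++ checksum)
  }
}

val localAddress = Base58Check.hash160ToAddress("C825A1ECF2A6830C4401620C3A16F1995057C2AB")
```

Doing the conversion locally avoids one HTTP request per address, at the cost of only covering the address type assumed above.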

We will use the `blockchain.info` API to fetch some useful information about the top ten addresses from our PageRank calculation. To do that, we need to define a couple of helper functions.

In [ ]:
import scala.io.Source.fromURL

def makeFunc(path: String)(param: String) = 
  fromURL(s"https://blockchain.info/q/$path/$param").mkString

def hashToAddress = makeFunc("hashtoaddress") _
def balance = makeFunc("addressbalance") _
def totalReceived = makeFunc("getreceivedbyaddress") _
def totalSent = makeFunc("getsentbyaddress") _
def firstSeen = makeFunc("addressfirstseen") _
val rawJson = (addr: String) => fromURL(s"https://blockchain.info/rawaddr/$addr?limit=0").mkString

val parseJson = (jsonStr: String) => {
  val result = scala.util.parsing.json.JSON.parseFull(jsonStr)
  result match {
    case Some(hash: Map[String, Any]) => List("address", "total_received", "total_sent", "final_balance", "n_tx")
                                              .map(x => hash(x))
    case _ => Nil
  }
}

val getInfo = rawJson.andThen(parseJson)
val satoshi2BTC = (input: Double) => input / 1.0E8

// Exchange rate (USD per BTC), hard-coded from https://blockchain.info/ticker when the notebook was written
val btcInUsd = 4279.92
val BTC2USD = (input: Double) => input * btcInUsd
val toUSD = satoshi2BTC.andThen(BTC2USD)
// Formats amounts as currency using the JVM's default locale
val formatter = java.text.NumberFormat.getCurrencyInstance
val toReadable = toUSD.andThen(formatter.format(_))
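
As a quick sanity check of the unit conversion, the sketch below repeats the two conversion functions in isolation (so it runs without the rest of the notebook) and applies them to round amounts. The API reports amounts in satoshi, and 1 BTC is 100,000,000 satoshi; the exchange rate is the same hard-coded value as above, and the `check`-prefixed names are only used for this illustration.

```scala
// Stand-alone copies of the conversion helpers, renamed to avoid clashing with the cell above
val checkRate = 4279.92                                   // USD per BTC, hard-coded as above
val checkSatoshi2BTC = (input: Double) => input / 1.0E8   // 1 BTC = 100,000,000 satoshi
val checkBTC2USD = (input: Double) => input * checkRate
val checkToUSD = checkSatoshi2BTC.andThen(checkBTC2USD)

val oneBtcInUsd = checkToUSD(1.0E8)   // exactly one bitcoin, expressed in satoshi
val halfBtcInUsd = checkToUSD(5.0E7)  // half a bitcoin
```

Composing the two steps with `andThen` keeps each conversion factor in one place, so updating the exchange rate only touches a single value.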

Now, let's apply the `getInfo` function to our top 10 addresses.

In [ ]:
val top10detailed = top10.map(getInfo)
top10detailed

And present the results in an HTML table.

In [ ]:
<table>
  <tr><td><b>Address</b></td><td><b>Total Received (USD)</b></td>
  <td><b>Total Sent (USD)</b></td><td><b>Balance (USD)</b></td><td><b>Transactions</b></td></tr>
{
top10detailed.map(record => {
  val address = record(0)
  val totalRcv = toReadable(record(1).toString.toDouble)
  val totalSnt = toReadable(record(2).toString.toDouble)
  val balance = toReadable(record(3).toString.toDouble)
  val txNumber = record(4)
  <tr><td><a href={"https://blockchain.info/address/" + address}>{address}</a></td>
  <td>{totalRcv}</td>
  <td>{totalSnt}</td>
  <td>{balance}</td>
  <td>{txNumber}</td>
  </tr>
})
}
</table>

We can also display the detailed information about any given Bitcoin address.

In [ ]:
def displayAddress(address: String) = <iframe 
  width="1024" frameborder="0" height="630" 
  src={"http://bitcoinwhoswho.com/address/" + address}></iframe>

displayAddress("1MFXYK1XucKFfhPhW9HDHD3vsM9BKey4qm")

displayAddress: (address: String)scala.xml.Elem
res318: scala.xml.Elem = <iframe width="1024" frameborder="0" height="630" src="http://bitcoinwhoswho.com/address/1MFXYK1XucKFfhPhW9HDHD3vsM9BKey4qm"></iframe>
