# Data collection procedure

### Step 1. Block information

a. Get the most recent block hash (BlockID) from [https://blockchain.info/blocks](https://blockchain.info/blocks)
- Example: [3140f440245d6db71e374eae1e73c595d1ee566590ed7e086aace8fda22434f4](3140f440245d6db71e374eae1e73c595d1ee566590ed7e086aace8fda22434f4)

b. For that BlockID, extract all of the TransactionIDs
- Example: [8088eeadbb0c6cbc6cc87ffacc05045f50195bd3837ec392a894465693578b57](8088eeadbb0c6cbc6cc87ffacc05045f50195bd3837ec392a894465693578b57)

c. For each TransactionID, follow the link to "Show scripts & coinbase"
- Example: [https://blockchain.info/tx/8088eeadbb0c6cbc6cc87ffacc05045f50195bd3837ec392a894465693578b57?show_adv=true](https://blockchain.info/tx/8088eeadbb0c6cbc6cc87ffacc05045f50195bd3837ec392a894465693578b57?show_adv=true)

### Step 2. Transaction information

For each TransactionID, extract the following:

a. Vector of input addresses
- Example: [13znVNnEg8qmorjD8dktpdPomuymqKScZE](13znVNnEg8qmorjD8dktpdPomuymqKScZE)

b. Vector of input amounts
- Example: 0.39706

c. Vector of output addresses
- Example: [18NzmGzcY9HsWsQdr5PmmTTYoX1GuCqJu9](18NzmGzcY9HsWsQdr5PmmTTYoX1GuCqJu9)

d. Vector of output amounts
- Example: 2.90466

### Step 3. Create transactions list

Store outputs from the above in a convenient list structure.

In [466]:
# For now the list can include all of the transactions in the most
# recent block. Then we can add transactions from other blocks as needed.
# For example, if there were two blocks, each with two transactions, we
# would have something like this

In [467]:
# Structure summary of the object
# str(blocks, vec.len=1, nchar.max=20)
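Step 3 describes the target structure only in words, so here is a minimal sketch of what a two-block, two-transaction list might look like. Every identifier, address, and amount below is a made-up placeholder, and `make_tx` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: build one transaction record from placeholder values.
make_tx <- function(id) {
  list(tx              = id,
       senderAddress   = c("addr_in_1", "addr_in_2"),
       senderAmount    = c(0.5, 1.5),
       receiverAddress = "addr_out_1",
       receiverAmount  = 2.0)
}

# Two blocks, each holding two transactions (all values are placeholders).
blocks <- list(
  block_1 = list(make_tx("tx_1a"), make_tx("tx_1b")),
  block_2 = list(make_tx("tx_2a"), make_tx("tx_2b"))
)

# Compact summary of the nested structure
str(blocks, vec.len = 1, nchar.max = 20)
```

Keeping one list element per transaction means new blocks can be appended later without changing the shape of existing entries.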

# Environment setup

To download and install conda, open a terminal window and run the following commands

In [468]:
# wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# chmod 771 Miniconda3-latest-Linux-x86_64.sh
# ./Miniconda3-latest-Linux-x86_64.sh
# source ~/.bashrc

Then install the following packages

In [469]:
# conda install r-xml
# conda install r-rvest
# conda install r-stringr
# conda install r-jsonlite

Then we should be able to load the packages

In [470]:
library(XML)
library(jsonlite)
library(rvest)
library(stringr)
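If one of the `library()` calls fails, it can help to see which packages are missing before retrying the install. The following is a small sketch using base R's `requireNamespace`; the variable names `pkgs` and `not_installed` are introduced here for illustration.

```r
# Packages this notebook loads
pkgs <- c("XML", "jsonlite", "rvest", "stringr")

# requireNamespace() returns TRUE/FALSE instead of raising an error
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not_installed <- pkgs[!installed]
not_installed
```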

# Building a web crawler

Read in the block information in HTML format

In [471]:
doc <- read_html("https://blockchain.info/block/0000000000000000008d8e5c2cce7ac3ad991e4d8564e2e4ce98cb36a8c473d7")

In [472]:
# htmlParse(doc)

Get all of the transaction IDs (hashes) for the block


In [473]:
tx <- html_nodes(doc,".txdiv .hash-link") %>% html_attr('href')  

Construct the link to each transaction's information

In [474]:
tx_link  <-  paste0('https://blockchain.info',tx) %>% paste0('?show_adv=true')
# remove '/tx/' prefix from Hash
tx  <- str_replace(string=tx, pattern='/tx/', replacement='')

In [475]:
# head(tx)
# head(tx_link)
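To see what the link construction produces without fetching a page, here is an offline sketch. It assumes the scraped `href` values look like `/tx/<hash>`, reuses two transaction hashes that appear elsewhere in this notebook, and uses base R's `sub` in place of `str_replace` so that it runs without stringr. The `_demo` names are illustrative.

```r
# Assumed shape of the scraped href attributes: '/tx/<hash>'
href_demo <- c(
  "/tx/8088eeadbb0c6cbc6cc87ffacc05045f50195bd3837ec392a894465693578b57",
  "/tx/c656081142f3989cf659beb976de74e62854be68d7fca9cc5572713a59d7bbaf"
)

# Full URL of each transaction page, with scripts & coinbase shown
tx_link_demo <- paste0("https://blockchain.info", href_demo, "?show_adv=true")

# Bare transaction hash: drop the leading '/tx/' prefix
tx_demo <- sub("^/tx/", "", href_demo)

tx_link_demo
tx_demo
```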

In [476]:
# I didn't need this, but may be useful later
# Create session at the transaction URL
# tx1 <- html_session('https://blockchain.info/tx/c656081142f3989cf659beb976de74e62854be68d7fca9cc5572713a59d7bbaf')
# Browse to 'Show scripts & coinbase' the single transaction
# scripts1 <- follow_link(tx1, 'Show scripts & coinbase')  %>% read_html()

## Get information for a transaction

In [477]:
# Choose a transaction to work with
tx1 <- tx[1]
tx_link1 <- tx_link[1]
scripts <- read_html(tx_link1)

# tmstmp: date and time the transaction was received, as a character string
# senderAddress: character vector of addresses bitcoin were sent from
# senderAmount: numeric vector of bitcoin amounts sent
# receiverAddress: character vector of addresses bitcoin were sent to
# receiverAmount: numeric vector of bitcoin amounts received

# Extract html element and store the text
tmstmp <- html_nodes(scripts, '.col-sm-6:nth-child(1) tr:nth-child(4) td+ td') %>% html_text()

# Remove unwanted strings from the text
tmstmp <- str_replace(tmstmp,'\\s+','')
tmstmp <- str_replace_all(tmstmp,'\n','')

# Extract html element and store the text
senderAddress <- html_nodes(scripts, '.hidden-phone .tag-address') %>% html_text()

# Remove unwanted strings from the text
senderAddress <- senderAddress[senderAddress != 'Output']

# If senderAddress is empty, set it to 'No Inputs (Newly Generated Coins)'
# For some reason this was hard to extract automatically..
if(length(senderAddress)==0) {
    senderAddress <- 'No Inputs (Newly Generated Coins)'
    }

# Extract html element and store the text
senderAmount <- html_nodes(scripts, '.tag+ span') %>% html_text()
# Remove unwanted strings from the text and convert to numeric
senderAmount <- as.numeric(str_replace(senderAmount,' BTC', ''))

# Extract html element and store the text
receiverAddress <- html_nodes(scripts, '.tx-arrow-col+ .stack-mobile a') %>% html_text()
# Remove unwanted strings from the text
receiverAddress <- receiverAddress[!is.na(receiverAddress) & receiverAddress != 'Spent']

# Extract html element and store the text
receiverAmount <- html_nodes(scripts, '.pull-right span') %>% html_text()
# Remove unwanted strings from the text and convert to numeric
receiverAmount <- as.numeric(str_replace(receiverAmount,' BTC', ''))

In [478]:
tmstmp
senderAddress
senderAmount
receiverAddress
receiverAmount
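The amount cleaning above strips the `' BTC'` suffix and converts the remainder with `as.numeric`. Below is a slightly more defensive sketch of the same idea in base R. `parse_btc` is a hypothetical helper, and the thousands-separator case is an assumption about how large amounts might be formatted on the page, not something confirmed here; `as.numeric` returns `NA` for strings that contain commas, which is why they are removed first.

```r
# Hypothetical helper: turn strings like '0.39706 BTC' into numbers.
parse_btc <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)   # drop thousands separators, if any
  x <- sub("\\s*BTC\\s*$", "", x)       # drop the trailing unit
  as.numeric(trimws(x))                 # trim stray whitespace and convert
}

parse_btc(c("0.39706 BTC", "2.90466 BTC", "1,234.5 BTC"))
```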

For each transaction we'll store a list like the following

In [479]:
# list(tx=tx1,
#      tx_link=tx_link1,
#      tmstmp=tmstmp,
#      senderAddress=senderAddress,
#      senderAmount=senderAmount,
#      receiverAddress=receiverAddress,
#      receiverAmount=receiverAmount)

We can convert the list to json using toJSON from the jsonlite package

In [480]:
# toJSON(list(tx=tx1,
#      tx_link=tx_link1,
#      senderAddress=senderAddress,
#      senderAmount=senderAmount,
#      receiverAddress=receiverAddress,
#      receiverAmount=receiverAmount))
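One detail worth checking when serializing amounts: `toJSON` in jsonlite rounds numbers to 4 decimal digits by default, which would truncate values such as 0.39706. The sketch below uses a placeholder record (the `tx` value is made up; the amounts are the example values from Step 2) and passes `digits = NA` to keep full precision and `auto_unbox = TRUE` to write length-one vectors as scalars.

```r
library(jsonlite)

# Placeholder record; 'tx_demo' is not a real transaction ID
record <- list(tx = "tx_demo",
               senderAmount = 0.39706,
               receiverAmount = c(2.90466, 0.1))

# Default settings: numbers are rounded to 4 decimal digits
json_default <- toJSON(record, auto_unbox = TRUE)

# digits = NA keeps the maximum available precision
json_full <- toJSON(record, auto_unbox = TRUE, digits = NA)

# Round trip back into an R list
back_default <- fromJSON(json_default)
back_full <- fromJSON(json_full)

json_full
```

Comparing `back_default$senderAmount` with `back_full$senderAmount` shows the effect of the rounding.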

# Putting the code in a loop

This loop iterates over the vector of URLs, one per transaction, and extracts each transaction's information

In [481]:
# tx_link

We'll use the lapply function, a standard R idiom that applies a function to each element and collects the results in a list, in place of an explicit for loop

In [None]:
system.time(
block <- lapply(X=seq_along(tx_link), FUN=function(i){
    
    cat(paste0(i,'\n'))
        
    scripts <- read_html(tx_link[i])
    
    # tmstmp: date and time the transaction was received, as a character string
    # senderAddress: character vector of addresses bitcoin were sent from
    # senderAmount: numeric vector of bitcoin amounts sent
    # receiverAddress: character vector of addresses bitcoin were sent to
    # receiverAmount: numeric vector of bitcoin amounts received
    
    # Extract html element and store the text
    tmstmp <- html_nodes(scripts, '.col-sm-6:nth-child(1) tr:nth-child(4) td+ td') %>% html_text()

    # Remove unwanted strings from the text
    tmstmp <- str_replace(tmstmp,'\\s+','')
    tmstmp <- str_replace_all(tmstmp,'\n','')

    # Extract html element and store the text
    senderAddress <- html_nodes(scripts, '.hidden-phone .tag-address') %>% html_text()

    # Remove unwanted strings from the text
    senderAddress <- senderAddress[senderAddress != 'Output']

    # If senderAddress is empty, set it to 'No Inputs (Newly Generated Coins)'
    # For some reason this was hard to extract automatically..
    if(length(senderAddress)==0) {
        senderAddress <- 'No Inputs (Newly Generated Coins)'
        }

    # Extract html element and store the text
    senderAmount <- html_nodes(scripts, '.tag+ span') %>% html_text()
    # Remove unwanted strings from the text and convert to numeric
    senderAmount <- as.numeric(str_replace(senderAmount,' BTC', ''))

    # Extract html element and store the text
    receiverAddress <- html_nodes(scripts, '.tx-arrow-col+ .stack-mobile a') %>% html_text()
    # Remove unwanted strings from the text
    receiverAddress <- receiverAddress[!is.na(receiverAddress) & receiverAddress != 'Spent']

    # Extract html element and store the text
    receiverAmount <- html_nodes(scripts, '.pull-right span') %>% html_text()
    # Remove unwanted strings from the text and convert to numeric
    receiverAmount <- as.numeric(str_replace(receiverAmount,' BTC', ''))
    # Remove zeros
    receiverAmount <- receiverAmount[receiverAmount!=0]
    
    
    list(tx = tx[i],
         tx_link = tx_link[i],
         tmstmp = tmstmp,
         senderAddress = senderAddress, 
         senderAmount = senderAmount,
         receiverAddress = receiverAddress,
         receiverAmount = receiverAmount)

}))
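A loop that makes one web request per transaction will stop at the first failed request. One possible way to make it more tolerant is to wrap each fetch in `tryCatch` and pause between requests. The sketch below shows the pattern with a stand-in fetch function, so it runs without network access; `safe_apply`, `fake_fetch`, and the `pause` argument are all hypothetical names, not part of the notebook's code.

```r
# Hypothetical wrapper: apply 'fetch' to each URL, recording errors
# instead of aborting the whole loop.
safe_apply <- function(urls, fetch, pause = 0) {
  lapply(seq_along(urls), function(i) {
    Sys.sleep(pause)  # optional delay between requests
    tryCatch(
      fetch(urls[i]),
      error = function(e) list(tx_link = urls[i], error = conditionMessage(e))
    )
  })
}

# Stand-in for the real scraping function: fails on URLs containing 'bad'
fake_fetch <- function(u) {
  if (grepl("bad", u)) stop("request failed")
  list(tx_link = u)
}

res <- safe_apply(c("ok_1", "bad_2", "ok_3"), fake_fetch)
str(res)
```

In the real loop, `fake_fetch` would be replaced by a function containing the `read_html` call and the extraction steps, and failed entries could be retried afterwards by filtering on the `error` field.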

## Store the block of transactions in a table

First load tidyr and dplyr, which are used to reshape the block's nested structure

In [None]:
library(tidyr)
library(dplyr)

In [None]:
# df_nested <- data.frame(Reduce(rbind, block))

In [None]:
# df <- df_nested %>% rowwise %>% do(expand.grid(., stringsAsFactors=FALSE))
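The commented-out lines above expand each transaction with `expand.grid`. Here is a base R sketch of that idea on placeholder data, without tidyr or dplyr. Note the assumption it makes explicit: `expand.grid` pairs every sender with every receiver, so each row is a candidate sender/receiver pair within a transaction, not a confirmed flow of funds. The names `block_demo`, `rows`, and `df_demo` are illustrative.

```r
# Placeholder block with two transactions (all values are made up)
block_demo <- list(
  list(tx = "tx_a", senderAddress = c("in_1", "in_2"), receiverAddress = "out_1"),
  list(tx = "tx_b", senderAddress = "in_3", receiverAddress = c("out_2", "out_3"))
)

# One small data frame per transaction: all sender x receiver combinations
rows <- lapply(block_demo, function(t) {
  expand.grid(tx = t$tx,
              senderAddress = t$senderAddress,
              receiverAddress = t$receiverAddress,
              stringsAsFactors = FALSE)
})

# Stack the per-transaction tables into one long table
df_demo <- do.call(rbind, rows)
df_demo
```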