# Overview

This is an effort to programmatically analyze token whitepapers through a simple annotation framework built on the tagging feature of [hypothes.is](https://hypothes.is).

I confine the vocabulary to the most salient concepts within the abstract and conclusion, on the assumption that the crux of the authors' arguments is summarized in these sections.

Each concept is highlighted and tagged with the label *concept*. We can then use these manual tags to algorithmically tag the concept's other appearances, both in the remaining sections of the document and in other documents. This cross-document annotation should reveal similarities, differences, and links between the constraints and solutions addressed in each paper.
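As a rough illustration of the cross-document idea, the sketch below counts how often each concept occurs in each document and lists the concepts that every document shares. The function names and the two text snippets are hypothetical stand-ins, not part of the notebook's data.

```python
import re

def concept_counts(concepts, documents):
    """Count how often each concept occurs in each document (case-insensitive)."""
    counts = {}
    for concept in concepts:
        # Escape the concept so hyphens and dots are matched literally
        pattern = re.compile(re.escape(concept), re.IGNORECASE)
        counts[concept] = {
            name: len(pattern.findall(text)) for name, text in documents.items()
        }
    return counts

def shared_concepts(counts):
    """Return the concepts that occur at least once in every document."""
    return sorted(c for c, per_doc in counts.items() if all(per_doc.values()))

# Hypothetical snippets standing in for two whitepapers
documents = {
    'paper_a': 'Nodes broadcast each transaction to the network. A node timestamps transactions.',
    'paper_b': 'Each transaction is validated by a committee; no proof-of-work is required.',
}
counts = concept_counts(['transaction', 'node', 'proof-of-work'], documents)
print(counts['transaction'])    # {'paper_a': 2, 'paper_b': 1}
print(shared_concepts(counts))  # ['transaction']
```

Concepts that appear in only one document point to differences between the papers, while shared concepts are candidates for links.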

In [167]:
from IPython.display import IFrame
from hypothesis import HypothesisApi
from pandas.io.json import json_normalize
import re
import os
import requests
import pandas as pd
import textract

## Contents
- [Get annotations](#Get-annotations)
- [Normalize annotation data](#Normalize-annotation-data)
- Analysis
  - [Abstract](#Abstract)
  - [Introduction](#Introduction)
  - [Conclusion](#Conclusion)
  - [Body](#Body)
- [Update annotations](#Update-annotations)

### Get annotations

I've created a simple wrapper for the [hypothes.is API](https://h.readthedocs.io/en/latest/api-reference/). We use this class to get all the annotations on the Bitcoin whitepaper, first confining the search to the group *Token Whitepapers* and then to annotations whose document URI contains *bitcoin*.

In [4]:
hypothesis = HypothesisApi()

hypothesis.search_by_group_name(
    'Token Whitepapers', 
    params={
        'limit': 200,
        'uri.parts': 'bitcoin'
    }
)

[{'updated': '2019-02-05T18:24:49.608813+00:00',
  'group': 'LALjzxP7',
  'target': [{'source': 'https://bitcoin.org/bitcoin.pdf',
    'selector': [{'type': 'TextPositionSelector', 'end': 1280, 'start': 1275},
     {'exact': 'proof',
      'prefix': '  longestproof-of-work chain as ',
      'type': 'TextQuoteSelector',
      'suffix': ' of what happened while they wer'}]}],
  'links': {'json': 'https://hypothes.is/api/annotations/UrMJWilzEemJkTc4McKOYw',
   'html': 'https://hypothes.is/a/UrMJWilzEemJkTc4McKOYw',
   'incontext': 'https://hyp.is/UrMJWilzEemJkTc4McKOYw/bitcoin.org/bitcoin.pdf'},
  'tags': ['requirement'],
  'text': '',
  'created': '2019-02-05T18:24:49.608813+00:00',
  'uri': 'https://bitcoin.org/bitcoin.pdf',
  'flagged': False,
  'user_info': {'display_name': None},
  'moderation': {'flagCount': 0},
  'user': 'acct:malcolmjmr@hypothes.is',
  'hidden': False,
  'document': {'title': ['bitcoin.pdf']},
  'id': 'UrMJWilzEemJkTc4McKOYw',
  'permissions': {'read': ['group:LAL

The result of our query is a list of annotations and their various details. The annotation objects are fairly hierarchical and require preprocessing to flatten and normalize the data.

In [5]:
hypothesis.previous_searches

[({'limit': 200, 'uri.parts': 'bitcoin', 'group': 'LALjzxP7'}, 27)]
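The recorded parameters suggest what the wrapper sends to the API once it has resolved the group name to an ID. The sketch below is my assumption of what that request looks like against the public search endpoint, not the wrapper's actual code; `build_search_url` and `auth_headers` are hypothetical helpers.

```python
from urllib.parse import urlencode

API_ROOT = 'https://api.hypothes.is/api'

def build_search_url(group_id, uri_part, limit=200):
    """Build a search URL restricted to one group and one URI keyword."""
    params = {'group': group_id, 'uri.parts': uri_part, 'limit': limit}
    return f'{API_ROOT}/search?{urlencode(params)}'

def auth_headers(api_token):
    """Private groups need an API token, sent as a bearer token."""
    return {'Authorization': f'Bearer {api_token}'}

url = build_search_url('LALjzxP7', 'bitcoin')
print(url)
# https://api.hypothes.is/api/search?group=LALjzxP7&uri.parts=bitcoin&limit=200
```

A GET request to this URL with the headers above should return a JSON object whose `rows` key holds the annotation list and whose `total` key holds the match count, which is presumably the `27` recorded in `previous_searches`.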

### Normalize annotation data

In [254]:
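# Flatten the nested annotation dicts into one row per annotation
annotations = json_normalize(hypothesis.last_search_results)
for col in ['created', 'updated']:
    annotations[col] = pd.to_datetime(annotations[col])

# The highlighted text is the 'exact' field of the second selector (the TextQuoteSelector)
annotations['concept'] = annotations.target.apply(lambda t: t[0]['selector'][1]['exact'].strip().lower())
# Join multiple tags into one comma-separated label
annotations['tag'] = annotations.tags.apply(lambda t: ','.join(t))
annotations[['concept', 'tag']].sort_values('tag')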

| | concept | tag |
|---|---|---|
| 23 | financial institution | problem |
| 6 | attack | problem |
| 19 | double-spend | problem |
| 0 | proof | requirement |
| 1 | rejoin | requirement |
| 2 | leave | requirement |
| 24 | onlinepayments | requirement |
| 21 | trust | requirement |
| 25 | electronic cash | requirement |
| 17 | transaction | requirement,problem |
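The normalization cell reads the quoted text from `selector[1]`, which works as long as every annotation lists its selectors in the same order. A slightly more defensive sketch, assuming only the annotation structure shown in the search output, looks the selector up by its `type` instead. `quote_from_annotation` is a hypothetical helper, not part of the wrapper.

```python
def quote_from_annotation(annotation):
    """Return the normalized quoted text of an annotation, or None if it has no quote."""
    for target in annotation.get('target', []):
        for selector in target.get('selector', []):
            if selector.get('type') == 'TextQuoteSelector':
                return selector['exact'].strip().lower()
    return None

# Trimmed copy of the first annotation from the search output
example = {
    'target': [{
        'source': 'https://bitcoin.org/bitcoin.pdf',
        'selector': [
            {'type': 'TextPositionSelector', 'end': 1280, 'start': 1275},
            {'type': 'TextQuoteSelector', 'exact': 'proof',
             'prefix': '  longestproof-of-work chain as ',
             'suffix': ' of what happened while they wer'},
        ],
    }],
    'tags': ['requirement'],
}
print(quote_from_annotation(example))  # proof

# An annotation whose target carries no selector yields None instead of raising
print(quote_from_annotation({'target': [{'source': 'https://bitcoin.org/bitcoin.pdf'}]}))  # None
```

With the DataFrame above, this could be applied as `annotations.target.apply(lambda t: quote_from_annotation({'target': t}))`.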


In [8]:
annotations['links.incontext'][0]

'https://hyp.is/UrMJWilzEemJkTc4McKOYw/bitcoin.org/bitcoin.pdf'

In [247]:
DATA_DIR = 'tmp'

# URL of the annotated document, taken from the first annotation's target
file_url = annotations['target'][0][0]['source']

def download(url):
    """Download url into DATA_DIR (skipping existing files) and return the local path."""
    os.makedirs(DATA_DIR, exist_ok=True)

    filename = url.split('/')[-1]
    filepath = os.path.join(DATA_DIR, filename)

    if os.path.exists(filepath):
        print(f'File already downloaded: {filepath}')
        return filepath

    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page to disk
    with open(filepath, 'wb') as f:
        f.write(r.content)
    print(f'Successfully saved file: {filepath}')
    return filepath

In [188]:
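filename = download(file_url)
# Extract the PDF text and lowercase it so concept matching is case-insensitive
raw_text = textract.process(filename=filename).decode('utf-8').lower()
# Join wrapped lines, replace line-ending hyphens with a space, and collapse double spaces
text = raw_text.replace(' \n',' ').replace('-\n',' ').replace('  ',' ')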

File already downloaded: tmp/bitcoin.pdf


### Abstract

In [260]:
abstract = raw_text.split('abstract. ')[1].split('\n\n')[0]
abstract

" a purely peer-to-peer version of electronic cash would allow online \npayments to be sent directly from one party to another without going through a \nfinancial institution.  digital signatures provide part of the solution, but the main \nbenefits are lost if a trusted third party is still required to prevent double-spending. \nwe propose a solution to the double-spending problem using a peer-to-peer network. \nthe network timestamps transactions by hashing them into an ongoing chain of \nhash-based proof-of-work, forming a record that cannot be changed without redoing \nthe proof-of-work.  the longest chain not only serves as proof of the sequence of \nevents witnessed, but proof that it came from the largest pool of cpu power.  as \nlong as a majority of cpu power is controlled by nodes that are not cooperating to \nattack the network, they'll generate the longest chain and outpace attackers.  the \nnetwork itself requires minimal structure.  messages are broadcast on a best effort

In [261]:
annotations.groupby('tag').concept.count()

tag
problem                  3
requirement              6
requirement,problem      1
requirement,solution     1
solution                12
solution,problem         2
solution,requirement     2
Name: concept, dtype: int64

In [193]:
annotations[annotations.tag.str.contains('requirement')].concept

0                 proof
1                rejoin
2                 leave
3               message
5               network
7                  node
17          transaction
21                trust
24       onlinepayments
25    electronic   cash
Name: concept, dtype: object

Trusted online payments with electronic cash. Nodes can message transactions to the network and provide proof to other nodes, whether or not they leave and rejoin the network (that is, whether or not they keep a dedicated connection).

In [194]:
annotations[annotations.tag.str.contains('problem')].concept

6                    attack
8                  majority
17              transaction
19             double-spend
20              third party
23    financial institution
Name: concept, dtype: object

All of this without the network being attacked through double-spending or a majority, and without the need for third-party financial institutions.

In [195]:
annotations[annotations.tag.str.contains('solution')].concept

3                message
4              broadcast
5                network
7                   node
8               majority
9              cpu power
10                   cpu
11                  pool
12         longest chain
13                record
14         proof-of-work
15                 chain
16                  hash
18             timestamp
20           third party
22    digital signatures
26          peer-to-peer
Name: concept, dtype: object

Nodes broadcast messages with digital signatures to a peer-to-peer network. Nodes create a hash as proof of work (dedicated CPU power). Those with the majority of pooled CPU power determine the record. They timestamp each message and hash it (giving it a unique identifier), and that hash is linked to the chain hash. The longest chain is the one nodes trust.


### Introduction

In [217]:
introduction = text.split('introduction')[1].split('\n\n')[1]
introduction

'commerce on the internet has come to rely almost exclusively on financial institutions serving as trusted third parties to process electronic payments. while the system works well enough for most transactions, it still suffers from the inherent weaknesses of the trust based model. completely non-reversible transactions are not really possible, since financial institutions cannot avoid mediating disputes. the cost of mediation increases transaction costs, limiting the minimum practical transaction size and cutting off the possibility for small casual transactions, and there is a broader cost in the loss of ability to make non-reversible payments for non reversible services. with the possibility of reversal, the need for trust spreads. merchants must be wary of their customers, hassling them for more information than they would otherwise need. a certain percentage of fraud is accepted as unavoidable. these costs and payment uncertainties can be avoided in person by using physical curren

In [203]:
# re.escape makes each concept match literally rather than as a regex pattern
annotations['concept_freq_intro'] = annotations.concept.apply(lambda c: len(re.findall(re.escape(c), introduction)))


### Conclusion

In [224]:
conclusion = text.split('conclusion')[1].split('\n\n')[1]
conclusion

'we have proposed a system for electronic transactions without relying on trust. we started with the usual framework of coins made from digital signatures, which provides strong control of ownership, but is incomplete without a way to prevent double-spending. to solve this, we proposed a peer-to-peer network using proof-of-work to record a public history of transactions that quickly becomes computationally impractical for an attacker to change if honest nodes control a majority of cpu power. the network is robust in its unstructured simplicity. nodes work all at once with little coordination. they do not need to be identified, since messages are not routed to any particular place and only need to be delivered on a best effort basis. nodes can leave and rejoin the network at will, accepting the proof-of-work chain as proof of what happened while they were gone. they vote with their cpu power, expressing their acceptance of valid blocks by working on extending them and rejecting invalid 

Todo:
- Manually annotate conclusion

### Body

In [255]:
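# The body is everything between the end of the introduction paragraph
# and the start of the conclusion paragraph. Both are exact substrings of text,
# so plain string search avoids treating them as regex patterns.
body_start = text.index(introduction) + len(introduction)
body_end = text.index(conclusion)
body = text[body_start:body_end]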

In [272]:
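# Look sections up explicitly instead of through locals()
sections = {'abstract': abstract, 'introduction': introduction, 'body': body, 'conclusion': conclusion}
for section_name, section in sections.items():
    col_name = f'concept_freq_{section_name}'
    # re.escape makes each concept match literally rather than as a regex pattern
    annotations[col_name] = annotations.concept.apply(lambda c: len(re.findall(re.escape(c), section)))
annotations['concept_freq_total'] = annotations.concept.apply(lambda c: len(re.findall(re.escape(c), text)))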


In [273]:
rel_cols = ['tag'] + [c for c in annotations.columns if 'concept' in c]
annotations[rel_cols].set_index('concept').sort_values('concept_freq_total', ascending=False)


| concept | tag | concept_freq_abstract | concept_freq_introduction | concept_freq_body | concept_freq_conclusion | concept_freq_total |
|---|---|---|---|---|---|---|
| transaction | requirement,problem | 1 | 7 | 59 | 2 | 69 |
| hash | solution | 2 | 0 | 47 | 0 | 52 |
| node | solution,requirement | 2 | 2 | 31 | 3 | 38 |
| proof | requirement | 6 | 2 | 16 | 3 | 27 |
| chain | solution | 4 | 0 | 22 | 1 | 27 |
| attack | problem | 2 | 1 | 21 | 1 | 25 |
| network | solution,requirement | 5 | 0 | 13 | 3 | 21 |
| proof-of-work | solution | 3 | 0 | 12 | 2 | 17 |
| trust | requirement | 1 | 6 | 5 | 1 | 14 |
| timestamp | solution | 1 | 1 | 11 | 0 | 14 |
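These frequencies come from plain substring matching, so a short concept is also counted inside longer ones: every *proof-of-work* contributes to *proof*. If that overlap turns out to matter, one possible refinement is to count only standalone occurrences. The sketch below is an assumption about how that might be done, using a made-up sample sentence and hypothetical helper names.

```python
import re

def count_substring(concept, text):
    """Count every occurrence, including those inside longer words or compounds."""
    return len(re.findall(re.escape(concept), text))

def count_standalone(concept, text):
    """Count occurrences not attached to other word characters or hyphens.

    A trailing plural 's' is tolerated so that 'nodes' still counts for 'node'.
    """
    pattern = rf'(?<![\w-]){re.escape(concept)}s?(?![\w-])'
    return len(re.findall(pattern, text))

# Made-up sample text, lowercased like the notebook's extracted text
sample = ('the proof-of-work chain serves as proof of the sequence of events. '
          'proofs from honest nodes extend the chain.')

print(count_substring('proof', sample))          # 3
print(count_standalone('proof', sample))         # 2
print(count_standalone('proof-of-work', sample)) # 1
```

The trade-off is that stricter matching misses other inflections (for example *timestamped* for *timestamp*), so both counts may be worth keeping side by side.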


### Update annotations 

Todo:
- add "bitcoin" and "concept" to annotations in abstract
- download whitepaper PDFs to tmp folder
- add annotations for each occurrence in the rest of the document and other whitepapers
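For the last item, a minimal sketch of how each occurrence might be turned into an annotation body, assuming the same `target` and `selector` layout that the search results use. `occurrence_payloads` is a hypothetical helper, and the 32-character context length is inferred from the prefix and suffix in the search output.

```python
import re

CONTEXT_CHARS = 32  # prefix/suffix length seen in the search output

def occurrence_payloads(concept, text, uri, group_id, tags):
    """Build one annotation body per occurrence of concept in text."""
    payloads = []
    for match in re.finditer(re.escape(concept), text):
        start, end = match.span()
        payloads.append({
            'uri': uri,
            'group': group_id,
            'tags': list(tags),
            'text': '',
            'target': [{
                'source': uri,
                'selector': [
                    {'type': 'TextPositionSelector', 'start': start, 'end': end},
                    {'type': 'TextQuoteSelector',
                     'exact': text[start:end],
                     'prefix': text[max(0, start - CONTEXT_CHARS):start],
                     'suffix': text[end:end + CONTEXT_CHARS]},
                ],
            }],
        })
    return payloads

sample_text = 'nodes accept the proof-of-work chain as proof of what happened'
payloads = occurrence_payloads(
    'proof', sample_text, 'https://bitcoin.org/bitcoin.pdf', 'LALjzxP7', ['concept', 'bitcoin']
)
print(len(payloads))  # 2
print(payloads[1]['target'][0]['selector'][0])
# {'type': 'TextPositionSelector', 'start': 40, 'end': 45}
```

Two caveats apply. Character offsets in text extracted with textract will not necessarily match the offsets hypothes.is computes in the rendered PDF, so the quote selector is likely the more reliable anchor. The quote would also need to come from the original-case text rather than the lowercased copy. Actually creating the annotations would additionally require an authenticated POST to the API's annotations endpoint and suitable group permissions, which this sketch leaves out.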
