# A Hitchhiker's Guide to ATP

## CID

### Theory

Link rot exists because a URL tells you where to find content by addressing the entity that hosts it. 
If that entity goes away, the link breaks. 

Enter the [CID](https://github.com/multiformats/cid), or **Content IDentifier**.

CIDs are a way to address content itself. If you have a CID, you can verify that the content you have is the content you want. This is
generally useful in distributed systems because you can ask for content and *anyone* can return it to you. They could lie and give 
you something else, but the CID lets you verify cryptographically.

This is useful to ATP for the same reason. All shared content has an associated, canonical CID. **No service creates 
this CID; it is a property of the content itself.**


### Applied

If you don't want to use Python but have a CID, check out IPFS's [CID inspector](https://cid.ipfs.tech/#zdpuAx7GYAybGShxy9wvkK5eJt6a5G47tz5z5yeFcDqChfYE3).

In [70]:
from multiformats.cid import CID  # pip install multiformats
from multiformats.multihash import get as get_hash_func

Imagine we have some data,

In [None]:
# The original data
hw_data = b"hello world"

Now imagine a magic blob store keyed by the sha2-256 digest of this data. So long as we all agree to use sha2-256 everywhere, this would work: 
"Here is my digest, give me the data." But that's kind of a fairy tale, because we don't all agree and technology changes for good reasons. 
So we need something a bit more robust. 
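
As a rough sketch of that fairy tale, here is a toy in-memory blob store keyed by a bare sha2-256 hex digest. The `BLOBS` dict and the `put_blob` / `get_blob` helpers are hypothetical names used only for illustration; they are not part of ATP or any library.

```python
import hashlib

# Hypothetical in-memory "magic blob store": hex digest -> content bytes.
BLOBS = {}

def put_blob(data: bytes) -> str:
    """Store data under its sha2-256 digest and return that digest."""
    digest = hashlib.sha256(data).hexdigest()
    BLOBS[digest] = data
    return digest

def get_blob(digest: str) -> bytes:
    """Fetch data by digest and check that it really hashes to that digest."""
    data = BLOBS[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("content does not match the requested digest")
    return data

blob_key = put_blob(b"hello world")
print(blob_key)
print(get_blob(blob_key))
```

Note that nothing in the key says which hash function produced it. That is the gap the rest of this section fills.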

A decent start would be to use a [multihash](https://github.com/multiformats/multihash), 
which is a self-describing hash,

```
multihash ::= <hash function code><digest size><hash function output>
```

In [109]:
hw_digest = get_hash_func("sha2-256").digest(hw_data).hex()
hw_digest

'1220b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'
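
The `1220` prefix in that output is the self-describing part. As a minimal sketch using only `hashlib`, the same multihash can be assembled by hand, assuming the multicodec table's code `0x12` for sha2-256 and a 32-byte digest. The variable names are mine.

```python
import hashlib

SHA2_256_CODE = 0x12  # multicodec table entry for sha2-256

raw_digest = hashlib.sha256(b"hello world").digest()

# <hash function code><digest size><hash function output>
# Both prefix values are below 128, so each fits in a single varint byte.
manual_multihash = bytes([SHA2_256_CODE, len(raw_digest)]) + raw_digest
print(manual_multihash.hex())
```

`0x12` names the function, `0x20` (32) gives the digest length in bytes, and the remainder is the plain sha2-256 digest.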

But that doesn't solve the problem completely. Now we can negotiate the shared
algorithm, but how is the digest encoded as text? Is it hex? Base64? Base58? Something else?
There is still ambiguity. Moreover, I've been assuming we're just sharing raw bytes. But
what if we were sharing JSON documents? We also need some way of agreeing how 
the content itself is encoded.
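
To make the text-encoding ambiguity concrete, here is a small stdlib-only sketch that renders one digest three ways. The variable names are illustrative.

```python
import base64
import hashlib

raw = hashlib.sha256(b"hello world").digest()

as_hex = raw.hex()
as_b64 = base64.b64encode(raw).decode("ascii")
as_b32 = base64.b32encode(raw).decode("ascii")

# Three different strings, one underlying digest.
for label, text in [("hex", as_hex), ("base64", as_b64), ("base32", as_b32)]:
    print(label, text)

# Each decodes back to the same bytes, but only if you know which base was used.
same = bytes.fromhex(as_hex) == base64.b64decode(as_b64) == base64.b32decode(as_b32)
print(same)
```

Without a marker saying which base is in play, a receiver has to guess.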



This motivates the CID. Expanded, a CID is,

```<cid> ::= <cid-version><multicodec><multihash>```

where,

- `<cid-version>` refers to the version of the CID spec used
- `<multicodec>` refers to the codec used to encode the content ([there are lots of them](https://github.com/multiformats/multicodec/blob/master/table.csv))
- `<multihash>` is the self-describing hash of the content (hash function code, digest size, and digest)


In [108]:
hw_cid = CID(
    base="base58btc", 
    version=1,
    codec="raw", 
    digest=hw_digest
)

hw_cid.human_readable


'base58btc - cidv1 - raw - (sha2-256 : 256 : B94D27B9934D3E08A52E52D7DA7DABFAC484EFE37A5380EE9088F7ACE2EFCDE9)'

In [97]:
assert hw_cid.digest.hex() == hw_digest

Now, I can ask the magic blob store if it has the data for the following CID instead,

In [94]:
str(hw_cid)  # base58btc encoded, which saves some bytes and removes some ambiguity

'zb2rhj7crUKTQYRGCRATFaQ6YFLTde2YzdqbbhAASkL9uRDXn'
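
The string above can be rebuilt by hand. This is a sketch under stated assumptions: CID version `0x01`, multicodec `0x55` for `raw`, multihash prefix `0x12 0x20` for sha2-256, and `z` as the multibase prefix for base58btc. The `base58btc` helper is my own minimal implementation, not the library's.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    """Encode bytes with the Bitcoin base58 alphabet."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is written as a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

multihash_bytes = bytes([0x12, 0x20]) + hashlib.sha256(b"hello world").digest()
cid_bytes = bytes([0x01, 0x55]) + multihash_bytes  # <cid-version><multicodec><multihash>

manual_cid = "z" + base58btc(cid_bytes)  # 'z' marks base58btc
print(manual_cid)
```

Every layer announces itself: the first character names the base, the next bytes name the CID version and codec, and the multihash names its own hash function and length.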

The magic blob store may decode the CID for some reason (maybe they don't use base58btc internally),

In [96]:
CID.decode(str(hw_cid))

CID('base58btc', 1, 'raw', '1220b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9')

but in any case it returns the data we requested, which we can validate. 

In [99]:
returned_data = b"hello world"

hw_cid.hashfun.digest(returned_data).hex() == hw_digest

True

Extrapolate from here:

If you have a CID, you can get the data *from anyone who might have it*. 

- Trust isn't part of the process. They provide, you validate.
- Any particular provider can fail, so long as someone retains the data and provides it to you in some way.
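
A minimal sketch of those two points, assuming providers are plain callables that take a hex digest and return bytes. The `fetch_verified` helper and the three providers are hypothetical stand-ins, and a bare sha2-256 digest stands in for a full CID.

```python
import hashlib
from typing import Callable, Iterable, Optional

Provider = Callable[[str], Optional[bytes]]  # hypothetical: digest -> bytes or None

def fetch_verified(digest: str, providers: Iterable[Provider]) -> bytes:
    """Return the first response whose sha2-256 digest matches the request."""
    for provider in providers:
        try:
            data = provider(digest)
        except Exception:
            continue  # this provider failed; try the next one
        if data is not None and hashlib.sha256(data).hexdigest() == digest:
            return data
    raise LookupError("no provider returned matching content")

def offline(_digest: str) -> Optional[bytes]:
    raise ConnectionError("provider is gone")

def liar(_digest: str) -> Optional[bytes]:
    return b"goodbye world"

def honest(_digest: str) -> Optional[bytes]:
    return b"hello world"

wanted = hashlib.sha256(b"hello world").hexdigest()
fetched = fetch_verified(wanted, [offline, liar, honest])
print(fetched)
```

The caller never has to decide whom to trust. A dead provider and a lying provider are handled the same way: skip and move on.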


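Restated with a real example: the CID from the inspector link above describes some `dag-cbor`-encoded content that has been hashed with `sha2-256`. 

In [48]:
cid_str = "zdpuAx7GYAybGShxy9wvkK5eJt6a5G47tz5z5yeFcDqChfYE3"  # same CID as in the inspector link
cid = CID.decode(cid_str)
cid.human_readable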

'base58btc - cidv1 - dag-cbor - (sha2-256 : 256 : ADA48B7C8394D2855F97E9E47EC0EA63D57778E1AF14283EA1A0C2D6A86DC1A0)'

In [49]:
# Re-encode the multihash digest in another base via multibase
from multiformats.multibase import get

# get("base58btc").encode(cid.digest)
get("base32").encode(cid.digest)

'bciqk3jelpsbzjuufl6l6tzd6ydvghvlxpdq26fbih2q2bqwwvbw4dia'

In [50]:
cid_str.encode("utf-8")  # Back to bytes

b'zdpuAx7GYAybGShxy9wvkK5eJt6a5G47tz5z5yeFcDqChfYE3'

## CBOR

### Theory

[CBOR](https://cbor.io/) stands for Concise Binary Object Representation. 

For our purposes, it's JSON in binary form. 

So think "JSON" but say "CBOR" and you're most of the way there.

### Applied

In [124]:
import cbor2  # pip install cbor2

In [125]:
doc = {
    "actor": "@generativist.xyz",
    "burritos_per_week": 2
}

as_cbor = cbor2.dumps(doc)
as_cbor

b'\xa2eactorq@generativist.xyzqburritos_per_week\x02'

In [126]:
from_cbor = cbor2.loads(as_cbor)
from_cbor

{'actor': '@generativist.xyz', 'burritos_per_week': 2}
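
To see why those bytes look the way they do, here is a deliberately tiny encoder sketch. It only handles non-negative ints below 24, text strings shorter than 24 bytes, and maps with fewer than 24 entries, where the whole CBOR header fits in one byte. `tiny_cbor` is a hypothetical helper, not part of `cbor2`.

```python
def tiny_cbor(value) -> bytes:
    """Encode small ints, short strings, and small maps (all sizes below 24)."""
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value < 24:
        return bytes([0x00 | value])               # major type 0: unsigned int
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) < 24:
            return bytes([0x60 | len(raw)]) + raw  # major type 3: text string
    if isinstance(value, dict) and len(value) < 24:
        body = b"".join(tiny_cbor(k) + tiny_cbor(v) for k, v in value.items())
        return bytes([0xA0 | len(value)]) + body   # major type 5: map
    raise ValueError(f"unsupported value: {value!r}")

sample_doc = {"actor": "@generativist.xyz", "burritos_per_week": 2}
tiny_encoded = tiny_cbor(sample_doc)
print(tiny_encoded)
```

The leading `\xa2` is "map with 2 pairs", `e` (`0x65`) is "text of length 5", and each `q` (`0x71`) is "text of length 17".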

## DAG-CBOR

Here is a CID for *something* produced on ATP,

In [130]:
mystery_cid = CID.decode("zdpuAx7GYAybGShxy9wvkK5eJt6a5G47tz5z5yeFcDqChfYE3")
mystery_cid.human_readable

'base58btc - cidv1 - dag-cbor - (sha2-256 : 256 : ADA48B7C8394D2855F97E9E47EC0EA63D57778E1AF14283EA1A0C2D6A86DC1A0)'

We know what everything in that human-readable format means 
except the `dag` part of the `dag-cbor` encoding. To motivate why we need `dag-cbor`, consider the following,

In [133]:
from collections import OrderedDict

order_1 = OrderedDict([('a', 1), ('b', 2)])
order_2 = OrderedDict([('b', 2), ('a', 1)])

cbor2.dumps(order_1) != cbor2.dumps(order_2)

True

CBOR maps have no canonical key order, so two encoders can serialize the same map differently. (I forced the two orders with `OrderedDict` just to show this.) If you hashed the CBOR-encoded map, you would introduce a new source of fragility: the same logical content could end up with different CIDs. `dag-cbor` removes this by [enforcing some strict rules](https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-cbor.md#strictness) on what is allowed in the map, along with a canonical encoding.


In [136]:
import dag_cbor  # pip install dag-cbor

dag_cbor.encode(order_1) == dag_cbor.encode(order_2)

True
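
The core of that canonical encoding can be sketched in a few lines, assuming the key ordering the DAG-CBOR spec describes: shorter keys first, then bytewise. `canonical_map` is a hypothetical helper that only handles tiny maps of short string keys to small ints; it is not how the `dag_cbor` package is implemented.

```python
def canonical_map(mapping: dict) -> bytes:
    """Encode {short str: small int} with keys sorted length-first, then bytewise."""
    items = sorted(
        ((key.encode("utf-8"), value) for key, value in mapping.items()),
        key=lambda item: (len(item[0]), item[0]),
    )
    if len(items) >= 24 or any(len(k) >= 24 or not 0 <= v < 24 for k, v in items):
        raise ValueError("this sketch only handles tiny maps")
    out = bytes([0xA0 | len(items)])           # map header
    for key, value in items:
        out += bytes([0x60 | len(key)]) + key  # text-string header + key bytes
        out += bytes([value])                  # unsigned int below 24
    return out

first = canonical_map({"a": 1, "b": 2})
second = canonical_map({"b": 2, "a": 1})
print(first == second)
```

Because insertion order no longer leaks into the bytes, the hash of the encoded map depends only on its content.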

### Protocol

In [137]:
mystery_cid

CID('base58btc', 1, 'dag-cbor', '1220ada48b7c8394d2855f97e9e47ec0ea63d57778e1af14283ea1a0c2d6a86dc1a0')

In [66]:
CID(
    "base58btc", 
    1,
    "dag-cbor", 
    "1220ada48b7c8394d2855f97e9e47ec0ea63d57778e1af14283ea1a0c2d6a86dc1a0"
).human_readable # human readable CID

'base58btc - cidv1 - dag-cbor - (sha2-256 : 256 : ADA48B7C8394D2855F97E9E47EC0EA63D57778E1AF14283EA1A0C2D6A86DC1A0)'

In [52]:
import cbor2

example = {"handle": "@generativist", "name": "breaker"}
cbor2.dumps(example)

b'\xa2fhandlem@generativistdnamegbreaker'

Some notes on the IPLD data model:

- Kinds: maps, etc.
- A node is our "any" in IPLD.
- A tree of nodes is isomorphic to a JSON document.
- If a node's kind is a link, then you have a graph. Links are immutable, so it's a DAG.
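
A rough sketch of why immutable links give you a DAG. The assumptions here are mine: JSON with sorted keys stands in for `dag-cbor`, a bare sha2-256 hex digest stands in for a CID, and `{"/": ...}` is just an illustrative link marker. `STORE` and `put_node` are hypothetical names.

```python
import hashlib
import json

STORE = {}  # hypothetical block store: link (hex digest) -> serialized node

def put_node(node) -> str:
    """Serialize a node deterministically, store it, and return its link."""
    block = json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")
    link = hashlib.sha256(block).hexdigest()
    STORE[link] = block
    return link

leaf = put_node({"text": "hello world"})
post = put_node({"kind": "post", "body": {"/": leaf}})

# "Editing" the leaf creates a new node with a new link.
# The old post still points at the old leaf.
leaf_v2 = put_node({"text": "hello again"})
post_v2 = put_node({"kind": "post", "body": {"/": leaf_v2}})

print(post != post_v2)
```

A parent's link is a hash over bytes that include its children's links, so a child must exist before its parent. A cycle would require a node to contain its own hash.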

## IPLD

- nodes are immutable
- copy-on-write algorithms?

The interface is important. Why? You can do lazy stuff over links!
- Kind()
- TraverseField(key) 
- TraverseIndex(idx) # list
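
Here is one way the "lazy stuff over links" idea could look in Python. This is a sketch under my own assumptions, not the actual IPLD interface: the `Node` class, its snake_case methods, the `BLOCKS` store, and the `{"/": ...}` link marker are all hypothetical.

```python
import hashlib
import json

BLOCKS = {}  # hypothetical block store: link -> serialized node
LOADS = []   # records which links were actually fetched

def store(value) -> dict:
    """Store a value as a block and return a link to it."""
    block = json.dumps(value, sort_keys=True).encode("utf-8")
    link = hashlib.sha256(block).hexdigest()
    BLOCKS[link] = block
    return {"/": link}  # illustrative link marker

class Node:
    def __init__(self, value):
        self._value = value

    def _resolved(self):
        # Follow a link only when someone looks inside it.
        if isinstance(self._value, dict) and set(self._value) == {"/"}:
            link = self._value["/"]
            LOADS.append(link)
            self._value = json.loads(BLOCKS[link])
        return self._value

    def kind(self) -> str:
        value = self._value
        if isinstance(value, dict):
            return "link" if set(value) == {"/"} else "map"
        if isinstance(value, list):
            return "list"
        return type(value).__name__

    def value(self):
        return self._resolved()

    def traverse_field(self, key: str) -> "Node":
        return Node(self._resolved()[key])

    def traverse_index(self, idx: int) -> "Node":
        return Node(self._resolved()[idx])

profile_link = store({"handle": "@generativist"})
doc_root = Node({"author": profile_link, "tags": ["atp", "ipld"]})

author = doc_root.traverse_field("author")
loads_before = len(LOADS)  # stepping onto the link fetched nothing
handle = author.traverse_field("handle")
print(loads_before, len(LOADS), handle.value())
```

The caller walks the document the same way whether a field is inline or behind a link. The block behind `author` is only fetched once something asks for a field inside it.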

In [None]:
# Selectors (GraphQL-like)
# https://en.wikipedia.org/wiki/Hash_array_mapped_trie
# HAMT (Hash Array Mapped Trie)
# Advanced layouts