In [None]:
### RUN THIS CELL FIRST

%load_ext autoreload
%autoreload 2

# Prologue

The next few lessons are a meditation on [`bytes`](https://docs.python.org/3/library/stdtypes.html#bytes). 

`bytes` are just a sequence of numbers 0 <= x < 256 with [ASCII](http://www.asciitable.com/) [literals](https://stackoverflow.com/a/34189196/2542016) and [representations (`.__repr__`)](https://stackoverflow.com/questions/1436703/difference-between-str-and-repr/2626364#2626364). That's it. Nothing more. Don't forget this and don't take your eyes off this fact.

The byte literal `b"I'm a string"` may look like a string to you but it is not. 

In [None]:
list(b"I'm a string")

It's a sequence of numbers under the hood.

Python doesn't show these numbers themselves because there is a strong convention to give certain of these numbers special meaning according to the [ASCII scheme](http://www.asciitable.com/). For example, the byte `73` almost universally means `I`. So Python takes pity on the programmer and shows them `I` by default. But `73` is always only a `list(...)` away.
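Here is a small sketch of that idea, using only built-ins. The variable names are just for illustration:

```python
data = b"I'm a string"

# Indexing a bytes object yields an int, not a one-character string.
first = data[0]
print(first)               # 73

# The same bytes can be rebuilt from plain integers.
rebuilt = bytes([73, 39, 109])
print(rebuilt)             # b"I'm"

# chr/ord convert between an ASCII code and its character.
print(chr(73), ord("I"))   # I 73
```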

`73`, the ASCII code for `I`, is written in hexadecimal as `0x49`, but the hexadecimal number `0xfe` doesn't correspond to any ASCII character. Notice how `\x49` gets replaced by `I` but `\xfe` doesn't get swapped out, because hexadecimal `0xfe` equals decimal `254`, which is outside the ASCII range (0 to 127):

In [None]:
b'\x49\xfe'

Try characters not included in the ASCII scheme and see what happens:

In [None]:
b"ñot ASCII"

The "character essentials" section of [this book chapter](https://www.oreilly.com/library/view/fluent-python/9781491946237/ch04.html) has an excellent description of what's going on:

![image](../images/bytes.png)

Finally, take a peek at [this StackOverflow answer](https://stackoverflow.com/questions/54358833/how-does-bytes-repr-representation-work/54358937#54358937) to see precisely how the CPython interpreter produces these representations.
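To make the rule concrete, here is a rough Python imitation of what the representation does. It is a simplified sketch, not CPython's actual implementation: `simple_bytes_repr` is a hypothetical helper, and it ignores the rules for choosing between single and double quotes.

```python
def simple_bytes_repr(data):
    """Simplified imitation of bytes.__repr__ (ignores quote selection)."""
    special = {9: "\\t", 10: "\\n", 13: "\\r", 92: "\\\\"}
    parts = []
    for number in data:
        if number in special:
            parts.append(special[number])       # tab, newline, return, backslash
        elif 32 <= number < 127:
            parts.append(chr(number))           # printable ASCII: show the character
        else:
            parts.append("\\x%02x" % number)    # everything else: show a hex escape
    return "b'" + "".join(parts) + "'"

print(simple_bytes_repr(b"\x49\xfe"))   # b'I\xfe'
```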

The miracle of protocol engineering is that we are able to create complex, powerful, almost biological systems like Bitcoin from the primitive ability to share sequences of small numbers with each other. It blows my mind and it should blow yours, too!

So say it with me: **`bytes` are just a sequence of numbers 0 <= x < 256**

# Review

[ping_pong.py from the digital cash videos](https://github.com/justinmoon/digital-cash/blob/master/experiments/ping_pong.py) is running on port 10000 on a computer with IP address `68.183.109.101`.

Open the terminal and type `telnet 68.183.109.101 10000`. Type `ping` followed by an enter and you should receive a `pong` back before the connection closes. Send anything besides a `ping` and the connection will just close.

##### Exercise: Connect to ping pong server using Python

In [None]:
import socket

def ping(ip, port):
    raise NotImplementedError()

In [None]:
ping('68.183.109.101', 10000)

# Connect to a Bitcoin Peer

Since the Bitcoin Network is peer-to-peer, we must find some specific peer to connect to.

[Bitnodes](https://bitnodes.earn.com/nodes/) has a nice listing of visible nodes in the Bitcoin network. Choose one. Look at the "address" column. You should see something like `"35.187.200.6:8333"`. This is the "address" of the node you've selected. This address is composed of two values separated by a colon: an Internet Protocol (IP) address (e.g. 35.187.200.6) and a port (e.g. 8333). The IP identifies the computer we are connecting to, and the port identifies a specific process running on that computer.
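If you'd rather split the address in code than by eye, something like the following would do. `split_address` is a hypothetical helper, and this sketch only handles IPv4-style `ip:port` strings:

```python
def split_address(address):
    """Split an 'ip:port' string into (ip, port). IPv4 only in this sketch."""
    ip, port = address.rsplit(":", 1)
    return ip, int(port)

print(split_address("35.187.200.6:8333"))   # ('35.187.200.6', 8333)
```

Note that the port comes back as an `int`, which is what `sock.connect` expects.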

Paste in the IP and port of the node you selected in the cell below.

In [None]:
# FILL THESE IN!
PEER_IP = ""
PEER_PORT = 0

Let's connect to this peer using a [socket](https://docs.python.org/3/library/socket.html#socket-objects), which is like a tunnel across the internet. We can interact with our peer by writing data into the socket using `sock.send(message_bytes)` and reading from the socket using `sock.recv(number_of_bytes_to_read)`. (Kinda like a file, right?)

In [None]:
ping(PEER_IP, PEER_PORT)

What do you think is happening? Take a moment and make a guess.

Our peer is unresponsive. We're stuck on the line `response = sock.recv(1024)` attempting to receive bytes over TCP. This will wait forever until the Bitcoin node at the other end of our socket connection sends us a response or closes the connection.

You may have noticed that to the left of every Jupyter cell is the text `In [ ]:` if the cell hasn't been executed yet, and something like `In [7]:` if the cell was the 7th cell executed. But the cell above says `In [*]:`. This means that it's still executing. _The code is stuck_. Hit the ■ button in the menu at the top of the screen (or type `escape ii`) to kill the process in the cell above.
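As an aside, a socket can also be told to give up on its own. The sketch below assumes nothing about Bitcoin: it uses `socket.socketpair()` to create two connected local sockets, one of which plays a peer that never answers, and `settimeout` to bound how long `recv` waits.

```python
import socket

# A connected pair of local sockets stands in for "us" and a silent peer.
ours, silent_peer = socket.socketpair()
ours.settimeout(0.5)  # give up after half a second instead of waiting forever

try:
    response = ours.recv(1024)
except socket.timeout:
    response = None
    print("no response within 0.5 seconds")
finally:
    ours.close()
    silent_peer.close()
```

A timeout doesn't explain _why_ the peer is silent, but it keeps the notebook from hanging.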

Why do you think it got stuck? Ponder this for a moment. I don't expect you to know the answer but try to think of some plausible ones ...

# Version Handshake

The code got stuck because we didn't properly introduce ourselves to our peer. With Bitcoin, we must perform a [Version Handshake](https://en.bitcoin.it/wiki/Version_Handshake) in order to begin exchanging messages.

So let's try again. I'm going to give you a magic `VERSION` bytestring without telling you how I came up with it. Before calling `sock.recv(1024)` we will first call `sock.send(VERSION)` because the Bitcoin Version Handshake demands that the node which initiates the connection send the first `version` message.

In [None]:
# Bitcoin network equivalent of "hello"
VERSION = b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'

sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))

# initiate the "version handshake"
sock.send(VERSION)

# receive their "version" response
response = sock.recv(1024)

print(response)

The code no longer gets stuck at `response = sock.recv(1024)` because the peer answered our `version` message with their own `version` message. We're learning to say "hello" in the language of the Bitcoin network protocol!

# Reading Network Messages

This table in the [protocol documentation](https://en.bitcoin.it/wiki/Protocol_documentation#version) tells us what information the `version` message contains and how to decipher it. But before we can read the version message specifically (as opposed to the other 26 types), we need to learn to read a Bitcoin protocol message generally. This ["message structure"](https://en.bitcoin.it/wiki/Protocol_documentation#Message_structure) table tells us how.

![image](../images/message-structure.png)

`command` and `payload` tell us what kind of message we're dealing with, and the contents of that message. These two attributes contain all the useful information.

`magic`, `length`, and `checksum` help us read and verify integrity of messages and don't contain any useful information on their own.

Regarding the table, the "description" and "comments" columns tell us what each row in the table means. The "field size" column tells us the number of bytes each field takes up, and the "data type" column tells us how we should interpret these bytes -- e.g. whether they are a number, a string, a list etc.

At a high level we're faced with the problem of reading an arbitrary length input N bytes at a time. Our algorithm for reading a network message would be:

* `magic`: first 4 bytes
* `command`: next 12 bytes
* `length`: next 4 bytes, interpreted as an integer
* `checksum`: next 4 bytes
* `payload`: next `length` bytes

Notice how we are able to stop exactly at the end of the message without reading one single byte too few or too many.

How would you do this in Python? Think about that for a second ...

The most obvious tool in the Python toolbox would be the [_slice_](https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation). Let's see how that looks:

In [None]:
print('4 "magic" bytes: ', VERSION[:4])

In [None]:
print('12 "command" bytes: ', VERSION[4:4+12])

In [None]:
length_bytes = VERSION[4+12:4+12+4]
print('4 "length" bytes:', length_bytes)

##### Exercise: fill in `start` and `stop` indices for the `checksum` field

In [None]:
start = 0
stop = 0
print('4 "checksum" bytes:', VERSION[start:stop])

To read the payload we must interpret the `length_bytes` as a number. This is how to do it (I'll explain later):

In [None]:
length = int.from_bytes(length_bytes, 'little')
length

In [None]:
print(length, '"payload" bytes:', VERSION[4+12+4+4:4+12+4+4+length])

It works but eek! That's some ugly code ...

If you think about it, the challenge of reading files is somewhat similar to what we're doing. When dealing with files we read them in chunks (often "lines" separated by `\n`) and frequently don't know how long they are. The programming interfaces for reading files are very mature and powerful. 

For example:

In [None]:
with open("safu.txt") as f:
    print(f.read(5))
    print(f.read(3))
    print(f.read(4))

Promising, huh?

Python has an [`io.BytesIO`](https://docs.python.org/3/library/io.html#io.BytesIO) utility for turning `bytes` into ["file objects"](https://docs.python.org/3/glossary.html#term-file-object) we can `.read(n)` from. It's just a sequence of `bytes` which behaves like a file does. Colloquially we use the terms "file object" and "stream" interchangeably -- `stream` will make a convenient variable name going forward.

Let's try wrapping the `VERSION` bytes we've been using in an `io.BytesIO` and see if we can decompose the message more readably:

In [None]:
from io import BytesIO

def read_message(stream):
    print('4 "magic" bytes: ', stream.read(4))
    print('12 "command" bytes: ', stream.read(12))
    length_bytes = stream.read(4)
    print('4 "length" bytes', length_bytes)
    print('4 "checksum" bytes', stream.read(4))
    length = int.from_bytes(length_bytes, 'little')
    print(length, ' "payload" bytes', stream.read(length))

stream = BytesIO(VERSION)
read_message(stream)

### Readability Win

`stream.read(length)` is a lot easier to read than `VERSION[4+12+4+4:4+12+4+4+length]`, wouldn't you say?

### Respecting Message Boundaries

We previously read up to 1024 bytes from the socket whenever it had bytes to give us: `response = sock.recv(1024)`.

1024 bytes will rarely be enough to read a `block` message, but it's often too much for a `version` message, since the average size of these messages differs greatly. Consequently, `response = sock.recv(1024)` introduces 2 bugs:

1. We can't read messages larger than 1024 bytes because they get cut off at 1024 bytes.
2. We may treat 2 small messages as if they are one single message, causing us to completely ignore the second message.

For example, if you run the socket example which sends a `version` message enough times (or with a strategically placed `time.sleep(n)`), it will eventually output something like this:

```
b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00f\x00\x00\x00f\x9a\xe5\x06\x7f\x11\x01\x00\r\x04\x00\x00\x00\x00\x00\x00\xb9kV[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\r\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x92\xed.\xd6\xba\x90\xa8\t\x10/Satoshi:0.16.0/d#\x08\x00\x01\xf9\xbe\xb4\xd9verack\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00]\xf6\xe0\xe2'
```

If you look closely you may notice that the magic network string `\xf9\xbe\xb4\xd9` appears twice. `version` follows the first occurrence, and `verack` follows the second. This is not one message, it's two!

If we could figure out how to `.read(n)` directly from a socket we could completely eliminate this class of bugs by always reading the exact number of bytes the protocol tells us to.
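To see what "reading the exact number of bytes" buys us, here is a sketch that walks a buffer holding two back-to-back messages. To keep it short it uses two `verack` messages (copied from the output above) rather than a `version` followed by a `verack`; `split_messages` is a hypothetical helper, not part of `lib.py`.

```python
from io import BytesIO

MAGIC = b"\xf9\xbe\xb4\xd9"
# A complete `verack` message: 24 header bytes and an empty payload.
VERACK = MAGIC + b"verack\x00\x00\x00\x00\x00\x00" + b"\x00\x00\x00\x00" + b"]\xf6\xe0\xe2"

def split_messages(buffer):
    """Return the command of every complete message found in `buffer`."""
    stream = BytesIO(buffer)
    commands = []
    while True:
        header = stream.read(24)            # magic + command + length + checksum
        if len(header) < 24:
            break                           # nothing (or only a fragment) left
        command = header[4:16].strip(b"\x00")
        length = int.from_bytes(header[16:20], "little")
        stream.read(length)                 # skip exactly this message's payload
        commands.append(command)
    return commands

print(split_messages(VERACK + VERACK))      # [b'verack', b'verack']
```

Because each step consumes exactly as many bytes as the header announces, the second message starts right where the first one ends.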

### Reading From Sockets

This technique works directly on a socket with one small modification: calling [`socket.socket.makefile`](https://docs.python.org/3/library/socket.html#socket.socket.makefile) to give us a socket-backed "file object" we can `.read(n)` from. Where before the socket resembled a file, now it basically is one!
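You can try this without a network connection. The sketch below uses `socket.socketpair()` to create two connected local sockets; one plays the peer and writes 8 bytes at once, and we read them back 4 at a time through the file object. The `ping`/`pong` payload is just an arbitrary stand-in.

```python
import socket

# Two connected local sockets stand in for our node and a peer.
ours, peer = socket.socketpair()
stream = ours.makefile("rb")

peer.sendall(b"pingpong")   # the peer writes 8 bytes in one go

first = stream.read(4)      # we take exactly 4 ...
second = stream.read(4)     # ... and then exactly 4 more
print(first, second)        # b'ping' b'pong'

stream.close()
ours.close()
peer.close()
```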

In [None]:
sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))

# get a "file object" / "stream"
# "r" for "read", "b" for "bytes"
stream = sock.makefile('rb')

sock.send(VERSION)
read_message(stream)

print('Anything left over?: ', bytes(stream.peek()[:1]), '\n')

Pretty cool, huh?

Did you notice that the last line of the output says `Anything left over?:  b'\xf9'`? Why did this change? What's the significance of `b'\xf9'`?

Initially we were dealing with a file object with exactly one `version` message in its buffer. Now that we're connected to a Bitcoin peer, our socket-backed file object contains 2 messages, so one is left over after we read the first. As expected it begins with `b'\xf9'`, the first byte of the network magic `b'\xf9\xbe\xb4\xd9'`!

The rest of the stream is a [`verack`](https://en.bitcoin.it/wiki/Protocol_documentation#verack), the second step in the [Version Handshake](https://en.bitcoin.it/wiki/Version_Handshake).

In [None]:
read_message(stream)

# Interpreting Network Messages

Now that we know how to read bytes associated with each part of a Bitcoin network message, let's learn to interpret them.

First, the "network magic" ...

### Interpreting `magic`

Every time we receive a Bitcoin network message we want to start by reading the network magic and checking that it's equal to the bytes `b"\xf9\xbe\xb4\xd9"` (this value has been [hard-coded in Bitcoin Core](https://github.com/bitcoin/bitcoin/blob/ace87ea2b00a84b7a76e75f1ec93d1a4dce83f6f/src/chainparams.cpp#L104) since the beginning).

Before we learn to do this, let's zoom out and ask "What is this 'network magic', anyway?"

[This StackExchange post](https://bitcoin.stackexchange.com/questions/43189/what-is-the-magic-number-used-in-the-block-structure/43191#43191) contains the best answer I've seen. Make sure you browse the [Wikipedia article](https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_protocols) it links to. In brief, magic values are a common method in protocol design for tagging data structures with information indicating their "type".
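File formats use the same trick. As a sketch of the general idea (unrelated to Bitcoin), the hypothetical `sniff` helper below guesses a format from the leading bytes of some data, using a few well-known file signatures:

```python
# Leading bytes of a few well-known file formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip archive",
}

def sniff(data):
    """Guess a format from the first bytes of `data`."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"

print(sniff(b"%PDF-1.7 ..."))   # PDF document
print(sniff(b"hello"))          # unknown
```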

Why did Satoshi choose `\xf9\xbe\xb4\xd9` for his prefix?
1. [A note in the source code we inherited from him](https://github.com/bitcoin/bitcoin/blob/ace87ea2b00a84b7a76e75f1ec93d1a4dce83f6f/src/chainparams.cpp#L100).
2. [It has some improbable mathematical properties](https://bitcoin.stackexchange.com/a/52456/85335) Satoshi probably liked for no practical reasons.

Now how should we deal with these magic bytes in the process of reading and interpreting real Bitcoin network messages? This one is pretty simple -- just check whether they're equal to the values in the wiki / reference implementation!

![image](../images/magic-values.png)

##### Exercise: Modify `read_message` to read the magic bytes

- Take a look at [lib.py](lib.py). Inside you will see a `read_message` function similar to the prototype we made above. In the rest of the lesson we will work to complete this function.
- It currently returns a dictionary of all the parts of a message, but the helper functions it calls each return `None`:

In [None]:
from lib import read_message

read_message(BytesIO(VERSION))

- So let's define the helper functions, starting with `read_magic`
- The cell below will run a `test_read_magic` unit test within [lib.py](./lib.py). It tests the `read_magic` function within the same file. Fill out the definition of `read_magic` to get the test passing!

In [None]:
import pytest

pytest.main(['-q', 'lib.py::test_read_magic'])

##### Exercise: Which is a testnet message, which is a mainnet message?

In [None]:
m1 = b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'
m2 = b'\x0b\x11\x09\x07version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'

print(read_message(BytesIO(m1)))
print(read_message(BytesIO(m2)))

##### Exercise: Modify `read_message` to raise a `RuntimeError` if the magic bytes are wrong

In [None]:
# As a reminder, this is how you raise a RuntimeError in Python:

raise RuntimeError('description of the error goes here')

In [None]:
# get this test to pass

pytest.main(['-q', 'lib.py::test_bad_magic'])

### Interpret `command`

In practice, when you receive a Bitcoin network message you want to check the `command` attribute and call a handler function that knows what to do with each of the 27 different kinds of Bitcoin peer-to-peer messages.
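One common way to structure this is a dictionary mapping commands to handler functions. The sketch below is only an illustration: the handler names, their return values, and the `dispatch` helper are all made up, and a real node would register many more commands.

```python
def handle_version(payload):
    return "got version (%d payload bytes)" % len(payload)

def handle_verack(payload):
    return "got verack"

# Map each command we understand to the function that handles it.
HANDLERS = {b"version": handle_version, b"verack": handle_verack}

def dispatch(command, payload):
    handler = HANDLERS.get(command)
    if handler is None:
        return "ignoring unknown command %r" % command
    return handler(payload)

print(dispatch(b"verack", b""))   # got verack
```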

By default commands are right-padded with empty bytes (`b"\x00"`) until 12 total bytes are reached. For example, a `version` command would look like `b"version\x00\x00\x00\x00\x00"`. For readability's sake, let's strip the empty byte padding so we can deal with the cleaner `b"version"` instead.

This is how to strip zero bytes:

In [None]:
command = b"version\x00\x00\x00\x00\x00"
command.strip(b"\x00")
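Going the other way, from a clean command name to the padded 12-byte field, can be sketched with `bytes.ljust`. `pad_command` is a hypothetical helper, not something defined in `lib.py`:

```python
def pad_command(name):
    """Right-pad a command name with zero bytes to the 12-byte field width."""
    if len(name) > 12:
        raise ValueError("command names are at most 12 bytes")
    return name.ljust(12, b"\x00")

padded = pad_command(b"verack")
print(padded)                    # b'verack\x00\x00\x00\x00\x00\x00'
print(padded.rstrip(b"\x00"))    # b'verack'
```

Since the padding only ever sits on the right, `rstrip` would work just as well as `strip` for removing it.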

##### Exercise: Modify `read_message` to read the `command` bytes and strip off the zero bytes as above


In [None]:
pytest.main(['-q', 'lib.py::test_command'])

# Read `payload`

Lastly, let's parse the 3 payload-related portions of the message: "length", "checksum", and "payload".

![image](../images/message-structure.png)

The goal of these three attributes is to read and verify the integrity of the `payload` -- which could be a newly mined block, or a transaction, or a list of peer IP addresses.

Payloads vary in length. A [`verack` message](https://en.bitcoin.it/wiki/Protocol_documentation#verack) has an empty payload. A [`block` message](https://en.bitcoin.it/wiki/Protocol_documentation#block) payload may contain a thousand transactions.

To deal with the varying payload sizes, messages always include a `length` parameter which tells us exactly how large the payload is. This helps us avoid reading part of the payload and stopping in the middle, or overshooting and reading into the next message (like the example above where I read two messages by accident).

Once we read the payload, how can we be sure that what we receive is the same as what our peer node sent us? That nobody modified the message while it was being routed to us over the internet?

For this we use a "checksum":

Checksums are a simple idea: when you send data you also include a small fingerprint of that data. Your recipient can check the fingerprint against the data they receive and verify with some probability the message wasn't tampered with en route.

The Bitcoin protocol creates such a fingerprint by running a [hashing algorithm](https://blog.jscrambler.com/hashing-algorithms/) called [SHA256](https://en.wikipedia.org/wiki/SHA-2) on the data twice and then grabbing the first 4 bytes of the result. How can we be sure this fingerprint is any good? 

1. Hashing functions are "deterministic": given an input `x`, a hashing algorithm `h` will _always_ produce the same output `h(x)`. Since the output is always the same, the first, say, four digits of the output will always be the same. 
2. Calculating the inverse of a hashing function `h(x) -> x` requires brute force. You'd need to try about 256^4 ≈ 4.3 billion payload modifications (a byte has 256 possible values and there are 4 bytes) on average to produce a modified payload with the same checksum.
3. TCP applies its own, separate checksum to every segment it delivers! \[sidenote: [Speculating why Satoshi chose to add a second checksum on top of TCP](https://bitcoin.stackexchange.com/a/22887/85335).\]

If the checksums match it is very unlikely that the message was accidentally mangled in transit.
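The fingerprint idea can be tried out in a few lines. Note the assumption: this sketch uses the first 4 bytes of a _single_ SHA256 purely for illustration, which is not the checksum Bitcoin uses, and `fingerprint` is a hypothetical helper.

```python
from hashlib import sha256

def fingerprint(data):
    """First 4 bytes of a single SHA256 (illustration only, not Bitcoin's checksum)."""
    return sha256(data).digest()[:4]

original = b"don't trust, verify"
tampered = b"don't trust, verifz"   # one character changed

# Deterministic: the same input always gives the same fingerprint.
print(fingerprint(original) == fingerprint(original))   # True

# A changed input almost certainly gives a different fingerprint.
print(fingerprint(original) == fingerprint(tampered))

# Number of distinct 4-byte fingerprints.
print(256 ** 4)                                         # 4294967296
```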

### Reading Integers

The first field of the 3 payload-related fields is the 4-byte `length`. The `length` bytes of the `version` messages we handled at the beginning were `b'j\x00\x00\x00'`

Here's the tricky part: What number do those bytes represent? More generally, how do we turn numbers into bytes and bytes into numbers?

This is a question of "type conversions," "serialization," or "encoding". TCP only carries bytes, i.e. numbers between 0 and 255. But our messages almost always need to be more expressive than just a number between 0 and 255.

Therefore we must define rules for conversion of every type of Python data to and from the universal TCP-compatible format `bytes`.

Such rules were at work in this magical, unexplained line of code in the `read_message` function we defined earlier:

```
length = int.from_bytes(length_bytes, 'little')
```

I'm not going to explain how this works right now. The [homework](./Homework.ipynb) for this lesson dives into this magic _extensively_!

##### Exercise: Have `read_message`  read `length` and interpret it as an integer 

Implement `read_length` in [lib.py](./lib.py) in order to get the `test_length` test to pass.

Read the correct number of bytes according to the protocol docs and interpret them as an int using `length = int.from_bytes(some_bytes, 'little')`
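Without getting ahead of the homework, you can at least poke at the call to see what it returns for the `length` bytes from our `version` message. This is only a quick experiment; the comparison with `'big'` is included to show that the second argument matters.

```python
length_bytes = b"j\x00\x00\x00"

# b"j" is the ASCII spelling of the number 106, so the four bytes are [106, 0, 0, 0].
print(list(length_bytes))                      # [106, 0, 0, 0]

print(int.from_bytes(length_bytes, "little"))  # 106
print(int.from_bytes(length_bytes, "big"))     # 1778384896

# to_bytes is the inverse operation.
print((106).to_bytes(4, "little"))             # b'j\x00\x00\x00'
```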

In [None]:
pytest.main(['-q', 'lib.py::test_length'])

##### Exercise: Have `read_message` read the `checksum`

In [None]:
pytest.main(['-q', 'lib.py::test_checksum'])

##### Exercise: Have `read_message` read the `payload`

In [None]:
pytest.main(['-q', 'lib.py::test_read_payload'])

Bitcoin uses the SHA256 hashing function to produce and verify checksums. Here is how to run SHA256 on the bytes `b"don't trust, verify"` and get `bytes` as a result.

In [None]:
from hashlib import sha256

sha256(b"don't trust, verify").digest()

Where Bitcoin uses SHA256, it usually uses it twice. [Here's a discussion](https://bitcoin.stackexchange.com/questions/6037/why-are-hashes-in-the-bitcoin-protocol-typically-computed-twice-double-computed) of why Satoshi might have made this decision.

##### Exercise: Implement `double_sha256` in [lib.py](./lib.py) which runs `sha256` twice on input and returns `bytes` as output

In [None]:
pytest.main(['-q', 'lib.py::test_double_sha256'])

Our peer takes the first 4 bytes of `double_sha256(payload)` and provides it as the `checksum` field. Here's how it works:

In [None]:
from lib import read_message, double_sha256

msg = read_message(BytesIO(VERSION))
print("checksum provided by our peer:", msg["checksum"])

checksum = double_sha256(msg['payload'])[:4]
print("checksum we calculate", checksum)

##### Exercise: Have `read_message` calculate a checksum and raise a `RuntimeError` if it doesn't match the checksum on the message

In [None]:
pytest.main(['-q', 'lib.py::test_bad_checksum'])

If you have all the tests passing so far, you can now parse and validate the integrity of a Bitcoin message payload.

# Finished Product

In [None]:
import socket

sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))
stream = sock.makefile('rb')

# initiate the "version handshake"
sock.send(VERSION)

# receive and print their "version" response
msg = read_message(stream)

print(msg)

In the next lesson we will learn to interpret the payload of the `version` message.

Until then, head over to the [homework](./Homework.ipynb). It's a doozy!