In [None]:
%load_ext autoreload
%autoreload 2

# Sockets

Before we connect to a Bitcoin node, I want to connect to a simple ping/pong TCP server at "68.183.109.101:10000".

In [None]:
import socket

def ping(address):
    sock = socket.socket()
    sock.connect(address)
    sock.send(b"ping")
    response = sock.recv(1024)
    print("Response: ", response)

In [None]:
ping_pong_server_address = ("68.183.109.101", 10000)

ping(ping_pong_server_address)

# Let's Try It On Bitcoin

[Bitcoin has a `ping` message](https://en.bitcoin.it/wiki/Protocol_documentation#ping)

Let's see if this same code will work on Bitcoin.

Grab an IPv4 address off [Bitnodes](https://bitnodes.earn.com/nodes/) and try it:

In [None]:
PEER_IP = ?
PEER_PORT = ?
ping((PEER_IP, PEER_PORT))

# Version Handshake

The code got stuck because we didn't properly introduce ourselves to our peer. With Bitcoin, we must perform a [Version Handshake](https://en.bitcoin.it/wiki/Version_Handshake) in order to begin exchanging messages.

So let's try again. I'm going to give you a magic `VERSION` bytestring without telling you how I came up with it. Before calling `sock.recv(1024)` we will first call `sock.send(VERSION)` because the Bitcoin Version Handshake demands that the node which initiates the connection send the first `version` message.

In [None]:
# Bitcoin network equivalent of "hello"
VERSION = b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'

sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))

# initiate the "version handshake"
sock.send(VERSION)

# receive their "version" response
response = sock.recv(1024)

print(response)

# Interpreting Our Peer's Response

OK, we can initiate some kind of connection with 3 different kinds of address on the Bitcoin network: IPv4, IPv6, or Tor.

But what does this mysterious response mean?

In [None]:
print(response)

Let's talk about what `bytes` are: just a sequence of integers between 0 and 255.

In [None]:
list(response)

* They are printed according to [ASCII](http://asciitable.com). If a byte doesn't correspond to a printable ASCII character, it's shown in hexadecimal and escaped with `\x`.
* ASCII is a 7-bit encoding, which is why it only goes up to 127
* [Fun video on ASCII & Unicode](https://www.youtube.com/watch?v=MijmeoH9LT4)
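
To make the points above concrete, here is a minimal sketch using a made-up `data` value (it is not taken from a real peer response). It shows that a `bytes` object converts to and from a list of integers, and how Python chooses between ASCII characters and `\x` escapes when printing.

```python
# A bytes object is a sequence of integers in the range 0..255.
data = b"ping\x00\xf9"

# Iterating (or calling list) yields integers, not characters.
as_ints = list(data)

# bytes() turns a list of integers back into the same bytes object.
round_trip = bytes(as_ints)

# repr() shows printable ASCII as characters and everything else as \x escapes.
print(as_ints)
print(repr(round_trip))
```

The first four integers are the ASCII codes for `p`, `i`, `n`, `g`; the last two (0 and 249) have no printable ASCII character, so they show up as `\x00` and `\xf9`.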

# Reading Network Messages

This table in the [protocol documentation](https://en.bitcoin.it/wiki/Protocol_documentation#version) tells us what information the `version` message contains and how to decipher it. But before we can read the version message specifically (as opposed to the other 26 types), we need to learn to read a Bitcoin protocol message generally. This ["message structure"](https://en.bitcoin.it/wiki/Protocol_documentation#Message_structure) table tells us how.

![image](images/message-structure.png)

`command` and `payload` tell us what kind of message we're dealing with, and the contents of that message. These two attributes contain all the useful information.

`magic`, `length`, and `checksum` help us read and verify integrity of messages and don't contain any useful information on their own.

Regarding the table, the "description" and "comments" columns tell us what each row in the table means. The "field size" column tells us the number of bytes each field takes up, and the "data type" column tells us how we should interpret these bytes -- e.g. whether they are a number, a string, a list, etc.

At a high level we're faced with the problem of reading an arbitrary-length input N bytes at a time. Our algorithm for reading a network message would be:

* `magic`: first 4 bytes
* `command`: next 12 bytes
* `length`: next 4 bytes, interpreted as an integer
* `checksum`: next 4 bytes
* `payload`: next `length` bytes

Notice how we are able to stop exactly at the end of the message without reading one single byte too few or too many.

How would you do this in Python?

If you think about it, the challenge of reading files is somewhat similar to what we're doing. When dealing with files we read them in chunks (often "lines" separated by `\n`) and frequently don't know how long they are. The programming interfaces for reading files are very mature and powerful. 

For example:

In [None]:
with open("safu.txt") as f:
    print(f.read(5))
    print(f.read(3))
    print(f.read(4))

Promising, huh?

Python has a wonderful [`io.BytesIO`](https://docs.python.org/3/library/io.html#io.BytesIO) utility for turning `bytes` into ["file objects"](https://docs.python.org/3/glossary.html#term-file-object) we can `.read(n)` from. It's just a sequence of `bytes` which behaves like a file does. Colloquially we use the terms "file object" and "stream" interchangeably -- `stream` will make a convenient variable name going forward.

Let's try wrapping the `VERSION` bytes we've been using in an `io.BytesIO` and see if we can decompose the message more readably:

In [None]:
from io import BytesIO

def read_msg(stream):
    print('4 "magic" bytes: ', stream.read(4), '\n')
    print('12 "command" bytes: ', stream.read(12), '\n')
    payload_length_bytes = stream.read(4)
    print('4 "length" bytes', payload_length_bytes, '\n')
    print('4 "checksum" bytes', stream.read(4), '\n')
    payload_length = int.from_bytes(payload_length_bytes, 'little')
    print(payload_length, ' "payload" bytes', stream.read(payload_length), '\n')
    
stream = BytesIO(VERSION)
read_msg(stream)

print('Anything left over?: ', stream.read(1), '\n')

### Reading From Sockets

This technique works directly on a socket with one small modification: calling [`socket.socket().makefile`](https://docs.python.org/3/library/socket.html#socket.socket.makefile) to give us a socket-backed "file object" we can `.read(n)` from. Where before the socket resembled a file, now it basically is one!
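
If you'd like to see `makefile` at work without a Bitcoin peer, here is a small offline sketch. It assumes a platform where `socket.socketpair()` is available (it is on Linux and macOS), and the payload `b"abcdefgh"` is just a stand-in for real network data.

```python
import socket

# socketpair() gives two already-connected sockets, so no server is needed.
left, right = socket.socketpair()

# Send 8 bytes in one go, then close the sending side.
left.sendall(b"abcdefgh")
left.close()

# Wrap the receiving socket in a buffered binary file object.
stream = right.makefile("rb")

first = stream.read(3)   # exactly 3 bytes
second = stream.read(5)  # exactly the next 5 bytes
rest = stream.read(1)    # sender closed, so this returns b"" instead of blocking

print(first, second, rest)

stream.close()
right.close()
```

Unlike `sock.recv(n)`, which may return fewer than `n` bytes, `stream.read(n)` keeps reading until it has `n` bytes or the connection closes. That is the property our message parser relies on.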

In [None]:
sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))

# get a "file object" / "stream"
# "r" for "read", "b" for "bytes"
stream = sock.makefile('rb')

sock.send(VERSION)

# no modifications required!
read_msg(stream)

print('Anything left over?: ', bytes(stream.peek()[:1]), '\n')

Pretty cool, huh?

Did you notice that the last line of the output says `Anything left over?:  b'\xf9'`? Why did this change? What's the significance of `b'\xf9'`?

Initially we were dealing with a file object with exactly one version message in its buffer. When connecting to a Bitcoin peer, our socket-backed file object receives 2 messages -- so one is left over after we read the first one. As expected it begins with `b'\xf9'` -- the first byte of the network magic `b'\xf9\xbe\xb4\xd9'`!

The rest of the stream is a [`verack`](https://en.bitcoin.it/wiki/Protocol_documentation#verack), the second step in the [Version Handshake](https://en.bitcoin.it/wiki/Version_Handshake).

In [None]:
read_msg(stream)

# Interpreting Network Messages

Now that we know how to read bytes associated with each part of a Bitcoin network message, let's learn to interpret them.

First, the "network magic" ...

###  Interpreting `magic`

Every time we receive a Bitcoin network message we want to start by reading the network magic and checking that it's equal to the bytes `b"\xf9\xbe\xb4\xd9"` (this value has been [hard-coded in Bitcoin Core](https://github.com/bitcoin/bitcoin/blob/ace87ea2b00a84b7a76e75f1ec93d1a4dce83f6f/src/chainparams.cpp#L104) since the beginning).

Before we learn to do this, let's zoom out and ask "What is this 'network magic', anyway?"

[This StackExchange post](https://bitcoin.stackexchange.com/questions/43189/what-is-the-magic-number-used-in-the-block-structure/43191#43191) contains the best answer I've seen. Make sure you browse the [Wikipedia article](https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_protocols) it links to. In brief, magic values are a common method in protocol design for tagging data structures with information indicating their "type".

Why did Satoshi choose `\xf9\xbe\xb4\xd9` for his prefix?
1. [A note in the source code we inherited from him](https://github.com/bitcoin/bitcoin/blob/ace87ea2b00a84b7a76e75f1ec93d1a4dce83f6f/src/chainparams.cpp#L100).
2. [It has some improbable mathematical properties](https://bitcoin.stackexchange.com/a/52456/85335) Satoshi probably liked for no practical reasons.

Now how should we deal with these magic bytes in the process of reading and interpreting real Bitcoin network messages? This one is pretty simple -- just check whether they're equal to the values in the wiki / reference implementation!

![image](images/magic-values.png)
##### Exercise: Check Network Magic

In [None]:
NETWORK_MAGIC = b'\xf9\xbe\xb4\xd9'

def read_magic(stream):
    raise NotImplementedError()

def is_mainnet_msg(stream):
    magic = read_magic(stream)
    return magic == NETWORK_MAGIC

def is_testnet_msg(stream):
    magic = read_magic(stream)
    return magic == b"\x0b\x11\x09\x07"

In [None]:
def test_magic():
    mainnet_msg = b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'
    testnet_msg = b'\x0b\x11\x09\x07version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'
    
    assert is_mainnet_msg(BytesIO(mainnet_msg)) is True
    assert is_mainnet_msg(BytesIO(testnet_msg)) is False
    
    assert is_testnet_msg(BytesIO(testnet_msg)) is True
    assert is_testnet_msg(BytesIO(mainnet_msg)) is False
    print("Test passed!")

test_magic()

### Interpret `command`

In practice, when you receive a Bitcoin network message you want to check the `command` attribute and call a handler function that knows what to do with each of the 27 different kinds of Bitcoin peer-to-peer messages.

By default commands are right-padded with empty bytes (`b"\x00"`) until 12 total bytes are reached. For example, a `version` command would look like `b"version\x00\x00\x00\x00\x00"`. For readability's sake, let's strip the empty byte padding so we can deal with the cleaner `b"version"` instead.
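
Here is a minimal sketch of the padding rule in both directions. `pad_command` and `unpad_command` are hypothetical helper names used only for this illustration; they are not part of the notebook's `lib` module.

```python
COMMAND_SIZE = 12  # width of the command field in the message header

def pad_command(command):
    """Right-pad a command name with zero bytes up to 12 bytes."""
    if len(command) > COMMAND_SIZE:
        raise ValueError("command is longer than 12 bytes")
    return command.ljust(COMMAND_SIZE, b"\x00")

def unpad_command(raw):
    """Drop the trailing zero bytes from a 12-byte command field."""
    return raw.rstrip(b"\x00")

padded = pad_command(b"verack")
print(padded)
print(unpad_command(padded))
```

This sketch uses `rstrip` rather than `strip` because the padding only ever appears on the right.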

##### Exercise: Interpret `command`

In [None]:
def read_command(stream):
    raw = stream.read(12)
    command = raw.strip(b"\x00")  # remove empty byte padding
    return command
    
def is_version_msg(stream):
    command = read_command(stream)
    return "FIXME"
    
def is_verack_msg(stream):
    command = read_command(stream)
    return "FIXME"

In [None]:
VERACK = b"\xf9\xbe\xb4\xd9verack\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00]\xf6\xe0\xe2"

def read_command_test():
    stream = BytesIO(VERSION)
    stream.seek(4)
    assert is_version_msg(stream) is True
    stream.seek(-12, 1)
    assert is_verack_msg(stream) is False

    stream = BytesIO(VERACK)
    stream.seek(4)
    assert is_verack_msg(stream) is True
    stream.seek(-12, 1)
    assert is_version_msg(stream) is False
    print("Tests passed")

read_command_test()

# Read `payload`

Lastly, let's parse the 3 payload-related portions of the message: "length", "checksum", and "payload".

![image](images/message-structure.png)

The goal of these three attributes is to read and verify the integrity of the `payload` -- which could be a newly mined block, or a transaction, or a list of peer IP addresses.

Payloads vary in length. A [`verack` message](https://en.bitcoin.it/wiki/Protocol_documentation#verack) has an empty payload. A [`block` message](https://en.bitcoin.it/wiki/Protocol_documentation#block) payload may contain a thousand transactions.

To deal with the varying payload sizes, messages always include a `length` parameter which tells us exactly how large the payload is. This helps us avoid reading part of the payload and stopping in the middle, or overshooting and reading into the next message (like the example above where I read two messages by accident).

Once we read the payload, how can we be sure that what we receive is the same as what our peer node sent us? That nobody modified the message while it was being routed to us over the internet?

For this we use a "checksum":

Checksums are a simple idea: when you send data you also include a small fingerprint of that data. Your recipient can check the fingerprint against the data they receive and verify with some probability the message wasn't tampered with en route.

The Bitcoin protocol creates such a fingerprint by running a [hashing algorithm](https://blog.jscrambler.com/hashing-algorithms/) called [SHA256](https://en.wikipedia.org/wiki/SHA-2) on the data twice and then grabbing the first 4 bytes of the result. How can we be sure this fingerprint is any good? 

1. Hashing functions are "deterministic": given an input `x`, a hashing algorithm `h` will _always_ produce the same output `h(x)`. Since the output is always the same, the first, say, four digits of the output will always be the same.
2. Calculating the inverse of a hashing function `h(x) -> x` requires brute force. To produce a modified payload with the same checksum you'd need to try about 256^4 ≈ 4.3 billion payload modifications on average (a byte contains 256 possible values and there are 4 bytes).
3. TCP applies its own, separate checksum to every segment it delivers! \[sidenote: [Speculating why Satoshi chose to add a second checksum on top of TCP](https://bitcoin.stackexchange.com/a/22887/85335).\]

If the checksums match it is very unlikely that the message was accidentally mangled in transit.
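
The first two properties can be checked with a few lines of Python. This is only a sketch: `fingerprint` is a hypothetical helper that runs a single SHA256 pass, so it is not Bitcoin's actual checksum (that one is left for the exercises below).

```python
from hashlib import sha256

def fingerprint(data):
    """First 4 bytes of one SHA256 pass (illustration only)."""
    return sha256(data).digest()[:4]

original = b"don't trust, verify"
tampered = b"don't trust, verifx"  # last byte changed

# Deterministic: hashing the same input twice gives the same fingerprint.
same = fingerprint(original) == fingerprint(original)

# Changing one byte changes the fingerprint (barring a 1-in-2**32 fluke).
changed = fingerprint(original) != fingerprint(tampered)

# Number of distinct 4-byte fingerprints.
possible = 256 ** 4

print(same, changed, possible)
```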

### Reading Integers

The first of the 3 payload-related fields is the 4-byte `length`. The `length` bytes of the `version` message we handled at the beginning were `b'j\x00\x00\x00'`.

Here's the tricky part: What number do those bytes represent? More generally, how do we turn numbers into bytes and bytes into numbers?

This is a question of "type conversions," "serialization," or "encoding". TCP only lets you send bytes, and each byte is just a number between 0 and 255. But our messages almost always need to be more expressive than just a number between 0 and 255.

Therefore we must define rules for conversion of every type of Python data to and from the universal TCP-compatible format `bytes`.

Such rules were at work in this magical, unexplained line of code in the `read_msg` function we defined earlier:

```
payload_length = int.from_bytes(stream.read(4), 'little')
```
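
To see what the `'little'` argument does, here is a short sketch that interprets the same 4 `length` bytes both ways. The variable names are only for illustration.

```python
length_bytes = b"j\x00\x00\x00"

# Little-endian: the first byte is the least significant one.
little = int.from_bytes(length_bytes, "little")

# Big-endian reads the same bytes with the first byte as most significant.
big = int.from_bytes(length_bytes, "big")

# to_bytes is the inverse of from_bytes.
back = little.to_bytes(4, "little")

print(little, big, back)
```

The byte `b"j"` has the value 106, so the little-endian reading gives 106, while the big-endian reading gives 106 * 256**3. Bitcoin encodes the `length` field in little-endian order, so the first reading is the one we want.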

##### Exercise: Write a `read_length` Function

Read the correct number of bytes according to the protocol docs and interpret them as an int using `length = int.from_bytes(some_bytes, 'little')`.

In [None]:
def read_length(stream):
    raise NotImplementedError()

In [None]:
def test_read_length():
    stream = BytesIO(VERSION)
    stream.read(4 + 12)  # throw away magic and command
    assert read_length(stream) == 106
    print("Test passed")

test_read_length()

##### Exercise 5: Write a `read_checksum` Function

Just read the correct number of bytes and return them. This one's easy ...

In [None]:
def read_checksum(stream):
    raise NotImplementedError()

In [None]:
def test_read_checksum():
    stream = BytesIO(VERSION)
    stream.read(4 + 12 + 4)  # throw away magic, command and length
    assert read_checksum(stream) == b'\x9b"\x8b\x9e'
    print("Test passed")

test_read_checksum()

##### Exercise 6: Write a `read_payload` Function

This function has 2 parameters: `stream` and `length`, which represents the number of bytes we should read. In practice, `length` would come from running `read_length`.

In [None]:
def read_payload(stream, length):
    raise NotImplementedError()

In [None]:
def test_read_payload():
    stream = BytesIO(VERSION + b"x")
    stream.read(4 + 12)  # throw away magic and command
    length = read_length(stream)
    stream.read(4)
    assert len(read_payload(stream, length)) == 106
    assert stream.read(1) == b"x"
    print("Test passed")
    
test_read_payload()

Bitcoin uses the SHA256 hashing function to produce and verify checksums. Here is how to run SHA256 on the bytes `b"don't trust, verify"` and get `bytes` as a result.

In [None]:
from hashlib import sha256

sha256(b"don't trust, verify").digest()

Where Bitcoin uses SHA256, it usually uses it twice. [Here's a discussion](https://bitcoin.stackexchange.com/questions/6037/why-are-hashes-in-the-bitcoin-protocol-typically-computed-twice-double-computed) of why Satoshi might have made this decision.

##### Exercise 7: Write a `double_sha256` function which runs `sha256` twice on input and returns `bytes` as output

In [None]:
def double_sha256(b):
    raise NotImplementedError()

In [None]:
def test_double_sha256():
    assert double_sha256(b"don't trust, verify") == b'\xdf\xdbf\x95\x14\x98|45\xda6\x1em\x06y\xc9\xee@\x85\xa5\xca\x1d\xaa\xa1.\xf9\t\x91\x9c\xc1\xa7\xf0'
    print("Test passed")

test_double_sha256()

##### Exercise 8: Write a `compute_checksum` function which returns the first four bytes of "double-sha256"

In [None]:
def compute_checksum(b):
    raise NotImplementedError()

In [None]:
def test_compute_checksum():
    assert compute_checksum(b"don't trust, verify") == b'\xdf\xdbf\x95'
    print("Test passed")

test_compute_checksum()

If you have all the tests passing so far, you can now parse and validate the integrity of a Bitcoin message payload.


In [None]:
stream = BytesIO(VERSION + b"x")
stream.read(4 + 12) # throw away magic and command

length = read_length(stream)
checksum = read_checksum(stream)
payload = read_payload(stream, length)

print("Length: ", length)

print("Checksum: ", checksum)

print("Payload: ", payload)

print("checksum == compute_checksum(payload)?: ", 
      checksum == compute_checksum(payload))

In [None]:
def read_msg(stream):
    magic = read_magic(stream)
    if magic != NETWORK_MAGIC:
        raise Exception(f'Magic is wrong: {magic}')
    command = read_command(stream)
    payload_length = read_length(stream)
    checksum = read_checksum(stream)
    payload = read_payload(stream, payload_length)
    if checksum != compute_checksum(payload):
        raise Exception('Checksum does not match')
    return {
        "command": command,
        "payload": payload,
    }

In [None]:
sock = socket.socket()
sock.connect((PEER_IP, PEER_PORT))
stream = sock.makefile('rb')
sock.send(VERSION)
msg = read_msg(stream)

print(msg)

print('Anything left over?: ', bytes(stream.peek()[:1]), '\n')

In [None]:
# It will fail now if the prefix is wrong
bad_version = b"oops" + VERSION[4:]

read_msg(BytesIO(bad_version))

In [None]:
# It will fail now if a byte is manipulated
bad_version = VERSION[:-1] + b"x"

read_msg(BytesIO(bad_version))

# Reading The Version Payload

In [None]:
from lib import read_version_payload

payload_stream = BytesIO(msg["payload"])
read_version_payload(payload_stream)

# The Other Direction: Python -> Bytes

When we want to send a message and we already have the command chosen and the payload prepared, we just do the opposite.

Here's how to serialize a message given `bytes` payload:

In [None]:
def serialize_msg(command, payload):
    result = NETWORK_MAGIC
    result += command + b'\x00' * (12 - len(command))
    result += len(payload).to_bytes(4, 'little')
    result += compute_checksum(payload)
    result += payload
    return result

In [None]:
from lib import serialize_version_payload

# has default values
print(serialize_version_payload())

print()

# which can be overridden
print(serialize_version_payload(user_agent=b"/pretzels/"))

##### Putting it all together

In [None]:
from db import *
from lib import *

def handshake(address):
    sock = connect(address)
    stream = sock.makefile("rb")

    # Step 1: our version message
    payload = serialize_version_payload()
    msg = serialize_msg(b"version", payload)
    sock.sendall(msg)
    print("Sent version")

    # Step 2: their version message
    msg = read_msg(stream)
    version_payload = read_version_payload(BytesIO(msg["payload"]))
    print("Version: ", msg)

    # Step 3: their verack message
    msg = read_msg(stream)
    print("Verack: ", msg)

    # Step 4: our verack
    msg = serialize_msg(b"verack", b"")
    sock.sendall(msg)
    print("Sent verack")

    return sock, version_payload

In [None]:
create_table()
print("Observations before handshake:")
print(list_observations())
print()
handshake(("35.198.151.21", 8333))
print()
print("Observations after handshake:")
print(list_observations())
print()

# A Listener

Connect to a peer and print out all messages received from them

In [None]:
def listener(address):
    sock, version_payload = handshake(address)
    stream = sock.makefile("rb")
    while True:
        print(read_msg(stream))
        
listener(('2a00:ee2:1200:1900:20c:29ff:fe45:9554', 8333))

# Fails with IPv6

We want to write a crawler -- so we need to handle every kind of IP address we could encounter on the Bitcoin network.

In [None]:
# Bitcoin network equivalent of "hello"
VERSION = b'\xf9\xbe\xb4\xd9version\x00\x00\x00\x00\x00j\x00\x00\x00\x9b"\x8b\x9e\x7f\x11\x01\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x93AU[\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00rV\xc5C\x9b:\xea\x89\x14/some-cool-software/\x01\x00\x00\x00\x01'

sock = socket.socket()
sock.connect(('2a00:ee2:1200:1900:20c:29ff:fe45:9554', PEER_PORT))

# initiate the "version handshake"
sock.send(VERSION)

# receive their "version" response
response = sock.recv(1024)

print(response)

# More info about sockets

[socket.socket](https://docs.python.org/3.7/library/socket.html#socket.socket) takes 4 optional arguments. The first one is especially important: it sets the address family, which determines whether you will use IPv4 or IPv6.

We can use the `getaddrinfo` function to automate the selection of these variables:

`getaddrinfo` returns a list of `(family, type, proto, canonname, sockaddr)` tuples ([docs](https://docs.python.org/3.7/library/socket.html#socket.getaddrinfo)).

The first 3 items should be passed to the `socket.socket()` constructor.

The last item should be passed to `socket.socket().connect()`.
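
As a rough sketch of that split, the snippet below wraps the idea in a function. `socket_args` is a hypothetical name used only here, and the example uses the loopback address so it runs without any network access. The `AI_NUMERICHOST` flag is my own addition: it tells `getaddrinfo` to accept only IP literals and skip DNS.

```python
import socket

def socket_args(host, port):
    """Return ((family, type, proto), sockaddr) for a numeric IP address."""
    infos = socket.getaddrinfo(
        host, port,
        type=socket.SOCK_STREAM,      # TCP only
        flags=socket.AI_NUMERICHOST,  # host must be an IP literal, no DNS lookup
    )
    family, socktype, proto, _canonname, sockaddr = infos[0]
    return (family, socktype, proto), sockaddr

constructor_args, connect_arg = socket_args("127.0.0.1", 8333)
print(constructor_args)
print(connect_arg)
```

With these two values, `socket.socket(*constructor_args)` followed by `.connect(connect_arg)` would open the connection. Passing an IPv6 literal instead should yield `AF_INET6` as the family and a four-item `sockaddr`.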

In [None]:
import socket

ai = socket.getaddrinfo('2a00:ee2:1200:1900:20c:29ff:fe45:9554', 8333)

for item in ai:
    print(item, "\n")

In [None]:
tcp_listing = ai[0]
tcp_listing

In [None]:
socket_info, connect_info = tcp_listing[:-2], tcp_listing[-1]
sock = socket.socket(*socket_info)
sock.connect(connect_info)
sock.send(VERSION)
print("Now it works with IPV6:\n")
print(sock.recv(1024))

# Tor

Tor is running on this server.

Watch how our IP changes as soon as we patch `socket.socket` with one that proxies through Tor:

In [None]:
import socks
import requests

print("Old IP", requests.get("http://icanhazip.com").text)
socks.setdefaultproxy(
    proxy_type=socks.PROXY_TYPE_SOCKS5, 
    addr="127.0.0.1", 
    port=9050,
)
old_socket = socket.socket
socket.socket = socks.socksocket
print("New IP", requests.get("http://icanhazip.com").text)
socket.socket = old_socket  # change it back ...

In [None]:
import socks

address = ("aihen7kfbtscyknf.onion", 8333)

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socksocket = socks.socksocket()
socksocket.connect(address)
socksocket.send(VERSION)

print("Also works with Tor:\n")
print(socksocket.recv(1024))

# A Simple Crawler

In [None]:
import time

from lib import *
from db import *

addresses = [
    ("35.198.151.21", 8333),
    ("91.221.70.137", 8333),
    ("92.255.176.109", 8333),
    ("94.199.178.17", 8333),
]

def simple_crawler(addresses):
    while addresses:
        start = time.time()
        address = addresses.pop()
        print('Connecting to ', address)

        # If we can't connect, proceed to next address
        try:
            sock, version_payload = handshake(address)
        except Exception as e:
            print(f"Encountered error: {e}")
            continue
        
        # Save the address & version payload
        observe_node(address, version_payload)
            
        stream = sock.makefile("rb")
    
        # Request their peer list
        sock.send(serialize_msg(b"getaddr", b""))

        print("Waiting for addr message")
        while True:
            # Only wait 5 seconds for addr message
            if time.time() - start > 5:
                break  
            
            # If connection breaks, proceed to next address
            try:
                msg = read_msg(stream)
            except Exception:
                break
            
            # Only handle "addr" messages
            if msg["command"] == b"addr":
                addr_payload = read_addr_payload(BytesIO(msg["payload"]))
                if len(addr_payload["addresses"]) > 1:
                    addresses.extend([(a["ip"], a["port"]) for a in addr_payload["addresses"]])
                    print(f'Received {len(addr_payload["addresses"])} addrs')
                    break
            else:
                print("ignoring ", msg["command"])
    print("Ran out of addresses. Exiting.")

In [None]:
create_table()
simple_crawler(addresses)

In [None]:
count_observations()

# A Better Crawler


In [None]:
import logging

logging.basicConfig(level="INFO", format='%(threadName)-6s | %(message)s')
logger = logging.getLogger(__name__)

In [None]:
logger.info("Hello, world!")

In [None]:
def nap():
    time.sleep(1)
    logger.info("Awake")

In [None]:
nap()

In [None]:
def synchronous_naps():
    for i in range(5):
        nap()
    logger.info("Done")

In [None]:
synchronous_naps()

In [None]:
from threading import Thread

def threaded_naps():
    threads = []
    
    for i in range(5):
        thread = Thread(target=nap)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
        
    logger.info("Done")

In [None]:
threaded_naps()

In [None]:
from queue import Queue
from threading import Thread

def worker(worker_id, address_queue):
    print("starting worker", worker_id)
    while True:
        start = time.time()

        address = address_queue.get()
        print(f'connecting to {address}')

        # If we can't connect, proceed to next address
        try:
            sock, version_payload = handshake(address)
        except Exception as e:
            print(f"Encountered error: {e}")
            continue

        stream = sock.makefile("rb")

        # Save the address & version payload
        observe_node(address, version_payload)

        # Request their peer list
        sock.send(serialize_msg(b"getaddr", b""))
        print(f'sent "getaddr"')

        print("Waiting for addr message")
        while True:
            # Only wait 5 seconds for addr message
            if time.time() - start > 5:
                break

            # If connection breaks, proceed to next address
            try:
                msg = read_msg(stream)
            except Exception:
                print("Error reading message")
                break

            # Only handle "addr" messages
            if msg["command"] == b"addr":
                addr_payload = read_addr_payload(BytesIO(msg["payload"]))
                # Queue every advertised peer, then report once
                for peer in addr_payload["addresses"]:
                    address_queue.put((peer["ip"], peer["port"]))
                print(f'Received {len(addr_payload["addresses"])} addrs from {address[0]}')
                break
            else:
                print("ignoring", msg['command'])
                          
    print("exiting")
                  
def threaded_crawler(addresses, workers=5):
    address_queue = Queue()
    for address in addresses:
        address_queue.put(address)
        
    threads = []
    
    for worker_id in range(workers):
        thread = Thread(target=worker, args=(worker_id, address_queue))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

In [None]:
create_table()
threaded_crawler(addresses)

In [None]:
count_observations()

# DNS Seeds

We could use some better starting addresses.

Open up the terminal and run:

```
$ nslookup seed.bitcoin.sprovoost.nl
```

* `getaddrinfo`: domain name -> IP address
* `getnameinfo`: IP address -> domain name

([wiki](https://en.wikipedia.org/wiki/Getaddrinfo))

You can do this from Python:

In [None]:
dns_seed = "seed.bitcoin.sprovoost.nl"

socket.getaddrinfo(dns_seed,0,0,0,0)

In [None]:
dns_seeds = [
    'dnsseed.bitcoin.dashjr.org', 
    'dnsseed.bluematt.me', 
    'seed.bitcoin.sipa.be', 
    'seed.bitcoinstats.com', 
    'seed.bitcoin.sprovoost.nl', 
    'seed.bitnodes.io',
]

def fetch_ips(dns_seed):
    ip_list = []
    ais = socket.getaddrinfo(dns_seed,0,0,0,0)
    for result in ais:
        ip_list.append(result[-1][0])
    return list(set(ip_list))

def fetch_addresses(dns_seeds):
    result = []
    for dns_seed in dns_seeds:
        try:
            ips = fetch_ips(dns_seed)
            addresses = [(ip, 8333) for ip in ips]
            result.extend(addresses)
        except Exception:
            print(f"Error fetching addresses from {dns_seed}")
            continue
    return result
            
dns_seed_addresses = fetch_addresses(dns_seeds)
dns_seed_addresses

In [None]:
threaded_crawler(dns_seed_addresses, workers=50)

In [None]:
count_observations()

In [None]:
# I ran for 5 minutes with 50 workers:
from db import *
count_observations("five-minutes.db")

Homework:
* Do some data science
* Record timestamps and graph over time
* Get GeoIP data and stick that in the observations
* Try to figure out whether nodes are running in AWS or Google Cloud. Which is more popular?
* Keep track of errors. Why are they happening? Am I doing something stupid?