<hr />
<h1 align="center">Loading Bitcoin blockchain data into ArcticDB, using AWS</h1>
<center><img src="https://raw.githubusercontent.com/man-group/ArcticDB/master/static/ArcticDBCropped.png" alt="ArcticDB Logo" width="400"></center>
<hr />

### In this demo, we illustrate how to use AWS with ArcticDB. We are going to:
* Set up AWS access
* Initialise ArcticDB with AWS as storage
* Read a section of the Bitcoin blockchain from an AWS public dataset
* Store the data in ArcticDB
* Read the data back
* Perform a simple analysis on the data

<hr />

### Install ArcticDB and S3 libraries

In [None]:
# s3fs is used by pandas.read_parquet('s3://...')
%pip install arcticdb boto3 tqdm s3fs

### Imports

In [None]:
import os
from uuid import uuid4
from datetime import timedelta, datetime
from tqdm import tqdm
import boto3
import numpy as np
import pandas as pd
from botocore import UNSIGNED
from botocore.client import Config
from arcticdb import Arctic, QueryBuilder, LibraryOptions
from google.colab import drive, userdata

### Read or Create AWS config

In [None]:
# mount Google Drive for the config file to live on
drive.mount('/content/drive')
path = '/content/drive/MyDrive/config/awscli.ini'
os.environ['AWS_SHARED_CREDENTIALS_FILE'] = path

In [None]:
check = boto3.session.Session()
no_config = check.get_credentials() is None or check.region_name is None

if no_config:
    print('*'*40)
    print('Set up your AWS S3 credentials and region before continuing.')
    print('https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html')
    print('*'*40)

#### Create a config file
* You should only need to run this section once
* Enter your AWS details below and change `write_aws_config_file` to True
* Future runs can pick up the config file you have saved in your Drive

In [None]:
aws_access_key = "my_access_key"
aws_secret_access_key = "my_secret_access_key"
region = "my_region"

config_text = f"""
[default]
aws_access_key_id = {aws_access_key}
aws_secret_access_key = {aws_secret_access_key}
region = {region}
"""

write_aws_config_file = False
if write_aws_config_file:
    with open(path, 'w') as f:
        f.write(config_text)
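
Before relying on the config file, it can help to confirm that it parses as an INI file and contains the three keys written above. The following is a minimal sketch using only the standard library. `check_aws_config` is a hypothetical helper name (it is not part of boto3), and the demo writes placeholder values to a temporary file rather than to Drive.

```python
import configparser
import os
import tempfile

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "region")

def check_aws_config(config_path, profile="default"):
    """Return the required keys that are missing from the given profile."""
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is ignored, leaving the parser empty
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]

# Demo on a temporary file with placeholder values
demo_text = """
[default]
aws_access_key_id = my_access_key
aws_secret_access_key = my_secret_access_key
region = my_region
"""
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "awscli.ini")
    with open(demo_path, "w") as f:
        f.write(demo_text)
    missing = check_aws_config(demo_path)
    missing_when_absent = check_aws_config(os.path.join(tmp, "does_not_exist.ini"))

print("Missing keys:", missing)
```

In the notebook this could be called as `check_aws_config(path)` after the file has been written. An empty list means all three keys are present, although it does not validate the credentials themselves.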

### Find an AWS bucket or initialise a new one

In [None]:
s3 = boto3.resource('s3')
region = boto3.session.Session().region_name

bucket = [b for b in s3.buckets.all() if b.name.startswith('arcticdb-data-')]

if bucket:
    bucket_name = bucket[0].name
    print('Bucket found:', bucket_name)
else:
    bucket_name = f'arcticdb-data-{uuid4()}'
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint':region})
    print('Bucket created:', bucket_name)
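
One caveat with the bucket creation call: S3 rejects a `LocationConstraint` of `us-east-1`, so in that region the `CreateBucketConfiguration` argument has to be omitted. A minimal sketch of handling both cases follows. `create_bucket_kwargs` is a hypothetical helper introduced here for illustration, and the bucket name is a placeholder.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the boto3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

regional = create_bucket_kwargs("arcticdb-data-example", "eu-west-2")
default_region = create_bucket_kwargs("arcticdb-data-example", "us-east-1")
print(regional)
print(default_region)
```

With this helper the creation step could be written as `s3.create_bucket(**create_bucket_kwargs(bucket_name, region))`.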

### Initialise ArcticDB

In [None]:
# create an arcticdb instance in that bucket
arctic = Arctic(f's3://s3.{region}.amazonaws.com:{bucket_name}?aws_auth=true')

if 'btc' not in arctic.list_libraries():
    # library does not already exist
    arctic.create_library('btc', library_options=LibraryOptions(dynamic_schema=True))
library = arctic.get_library('btc')
library
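
The connection string passed to `Arctic` combines the regional S3 endpoint, the bucket name and a query string of options. The sketch below assembles it from its parts so that the structure is easier to see. `arctic_s3_uri` is a hypothetical helper, and the optional `path_prefix` option (which keeps ArcticDB data under a sub-path of the bucket) is included on the assumption that your ArcticDB version supports it.

```python
def arctic_s3_uri(region, bucket_name, path_prefix=None):
    """Assemble an ArcticDB S3 connection string that uses AWS credential lookup."""
    endpoint = f"s3.{region}.amazonaws.com"
    options = []
    if path_prefix:
        # optional: store ArcticDB data under this prefix inside the bucket
        options.append(f"path_prefix={path_prefix}")
    options.append("aws_auth=true")
    return f"s3://{endpoint}:{bucket_name}?" + "&".join(options)

uri = arctic_s3_uri("eu-west-2", "arcticdb-data-example")
uri_with_prefix = arctic_s3_uri("eu-west-2", "arcticdb-data-example", path_prefix="demo")
print(uri)
print(uri_with_prefix)
```

The first form matches the string used in the cell above; the result can be passed directly to `Arctic(...)`.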

### Mark the BTC blockchain data from June 2023 for processing

In [None]:
# create the list of all btc blockchain files
bucket = s3.Bucket('aws-public-blockchain')
objects = bucket.objects.filter(Prefix='v1.0/btc/transactions/')
files = pd.DataFrame({'path': [obj.key for obj in objects]})

In [None]:
# filter only 2023-06 files to keep run time manageable
files_mask = files['path'].str.contains('2023-06')
print(f"Identified {np.count_nonzero(files_mask)} / {len(files)} files for processing")
to_load = files[files_mask]['path']
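
Listing every transaction file and filtering afterwards is simple but slow. If the keys follow a `date=YYYY-MM-DD` partition layout (an assumption here; check a few entries of `files['path']` to confirm), the month can be selected with a key prefix instead. The sketch below demonstrates the idea on hypothetical sample keys. `month_prefix` is an illustrative helper, not part of boto3.

```python
BASE_PREFIX = "v1.0/btc/transactions/"

def month_prefix(year, month, base=BASE_PREFIX):
    """Key prefix covering one calendar month, assuming date=YYYY-MM-DD partitions."""
    return f"{base}date={year:04d}-{month:02d}"

# Hypothetical keys following the assumed layout
sample_keys = [
    "v1.0/btc/transactions/date=2023-05-31/part-0.snappy.parquet",
    "v1.0/btc/transactions/date=2023-06-01/part-0.snappy.parquet",
    "v1.0/btc/transactions/date=2023-06-30/part-0.snappy.parquet",
    "v1.0/btc/transactions/date=2023-07-01/part-0.snappy.parquet",
]

prefix = month_prefix(2023, 6)
june_keys = [key for key in sample_keys if key.startswith(prefix)]
print(prefix, len(june_keys))
```

Under that assumption, `bucket.objects.filter(Prefix=month_prefix(2023, 6))` would return only the June 2023 keys, which avoids listing the full history. A prefix match is also stricter than `str.contains('2023-06')`, which matches that text anywhere in the key.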

### Import the data into ArcticDB
This can take a few minutes to run

In [None]:
%%time
for path in tqdm(to_load):
    df = pd.read_parquet('s3://aws-public-blockchain/'+path, storage_options={"anon": True})
    # fixup types from source data
    df['hash'] = df['hash'].astype(str)
    df['block_hash'] = df['block_hash'].astype(str)
    df['outputs'] = df['outputs'].astype(str)
    df['date'] = pd.to_datetime(df['date'], unit='ns')
    if 'inputs' in df.columns:
        df['inputs'] = df['inputs'].astype(str)
    # index on timestamp
    df.set_index('block_timestamp', inplace=True)
    df.sort_index(inplace=True)
    # write the data; upsert=True creates the symbol if it does not exist yet
    library.update('transactions', df, upsert=True)
    # compaction step (optional) to make future reads faster
    if library.is_symbol_fragmented('transactions'):
        library.defragment_symbol_data('transactions')


### Read the data from ArcticDB

In [None]:
%%time
plot_start = datetime(2023, 6, 1, 0, 0)
plot_end = plot_start + timedelta(days=14)
df = library.read('transactions', date_range=(plot_start, plot_end)).data
print(len(df))
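
A note on the window bounds: assuming both ends of `date_range` are inclusive, a window that ends exactly at midnight also picks up any rows stamped at that instant on the following day. The sketch below builds a window covering whole days only. `day_window` is a hypothetical helper, and the one-microsecond step is an illustrative choice.

```python
from datetime import datetime, timedelta

def day_window(start, days):
    """Return (start, end) covering `days` whole days, ending just before the next midnight."""
    end = start + timedelta(days=days) - timedelta(microseconds=1)
    return start, end

window_start, window_end = day_window(datetime(2023, 6, 1), 14)
print(window_start, window_end)
```

The resulting tuple can be passed as `date_range=(window_start, window_end)` in the `library.read` call.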

### Chart the transaction fees per day

In [None]:
fees_per_day = df.groupby(pd.Grouper(freq='1D')).sum(numeric_only=True)
t = f"BTC Blockchain: Total fees per day from {plot_start} to {plot_end}"
ax = fees_per_day.plot(kind='bar', y='fee', color='red', figsize=(16, 8), title=t)
ax.figure.autofmt_xdate(rotation=60)

### Conclusions
* We have given a simple recipe for using ArcticDB with AWS
* Comparing the `%%time` output of the import and read cells suggests that reading from ArcticDB is significantly faster than loading the source Parquet files in this example
* Feel free to tweak the notebook to read a larger set of the files