# Databento data catalog

**Info:**

<div style="border:1px solid #ffcc00; padding:10px; margin-top:10px; margin-bottom:10px; background-color:#333333; color: #7F99FF;">
This tutorial is currently a work in progress (WIP).
</div>

This tutorial walks through how to set up a Nautilus Parquet data catalog with various Databento schemas.

Prerequisites:
- The `databento` Python client library, installed to make data requests: `pip install -U databento`
- A Databento account (there is a free tier)

## Requesting data

We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown).

In [None]:
import databento as db

client = db.Historical()  # This will use the DATABENTO_API_KEY environment variable (recommended best practice)
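When relying on the environment variable, it can help to check that it is actually set before constructing the client. The following is a minimal stdlib-only sketch; `require_api_key` is a hypothetical helper introduced here for illustration and is not part of the `databento` library.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Databento API key from the environment, or raise a clear error.

    Hypothetical helper for this tutorial, not part of the databento library.
    """
    env = os.environ if env is None else env
    key = env.get("DATABENTO_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DATABENTO_API_KEY is not set; export it before creating the client")
    return key
```

The returned key could then be passed explicitly, for example `db.Historical(key=require_api_key())`, which is equivalent to the implicit form above when the variable is set.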

**Every historical streaming request from `timeseries.get_range` incurs a cost (even for the same data), so we need to:**
- Know and understand the cost prior to making a request
- Avoid requesting the same data more than once (it is inefficient and incurs the cost again)
- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again)

We can use the metadata [get_cost endpoint](https://databento.com/docs/api-reference-historical/metadata/metadata-get-cost?historical=python&live=python) from the Databento API to get a quote on the cost prior to each request.
Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.

Note the response returned is in USD, displayed as fractional cents.
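The quote-then-request sequence described above can be sketched as a small helper. This is only an illustration under stated assumptions: `fetch_if_missing`, `quote_cost`, `download`, and `max_cost_usd` are hypothetical names, and the two callables stand in for the real client calls so the logic can be read (and tested) without network access.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(
    path: Path,
    quote_cost: Callable[[], float],
    download: Callable[[Path], None],
    max_cost_usd: float,
) -> bool:
    """Download to `path` only if it is absent and the quote is within budget.

    Returns True if a download was made, False if the file already existed.
    Hypothetical helper for this tutorial, not part of any library.
    """
    if path.exists():
        return False  # Already on disk: no request is made, so no cost is incurred

    cost = quote_cost()  # Ask for the quote (USD) before committing to the request
    if cost > max_cost_usd:
        raise ValueError(f"Quoted cost ${cost:.4f} exceeds budget ${max_cost_usd:.4f}")

    download(path)  # Expected to write the response to `path`
    return True
```

With the historical client, `quote_cost` could wrap `client.metadata.get_cost(...)` and `download` could wrap `client.timeseries.get_range(..., path=p)`, both called with the same request parameters.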

The following request is only for a small amount of data (as used in this Medium article [Building high-frequency trading signals in Python with Databento and sklearn](https://databento.com/blog/hft-sklearn-python)), just to demonstrate the basic workflow. 

In [None]:
from pathlib import Path
from databento import DBNStore

We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial.

In [None]:
DATABENTO_DATA_DIR = Path("databento")
DATABENTO_DATA_DIR.mkdir(exist_ok=True)

In [None]:
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)

Use the historical API to request the data used in the Medium article.

In [None]:
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="GLBX.MDP3",
        symbols=["ES.n.0"],
        stype_in="continuous",
        schema="mbp-10",
        start="2023-12-06T14:30:00",
        end="2023-12-06T20:30:00",
        path=path,  # <--- Passing a `path` parameter will ensure the data is written to disk
    )

In [None]:
# Inspect the data by reading from disk and converting to a pandas.DataFrame
data = DBNStore.from_file(path)

df = data.to_df()
df

## Write to data catalog

In [None]:
import shutil
from pathlib import Path

from nautilus_trader.adapters.databento.loaders import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog

In [None]:
CATALOG_PATH = Path.cwd() / "catalog"

# Clear if it already exists
if CATALOG_PATH.exists():
    shutil.rmtree(CATALOG_PATH)
CATALOG_PATH.mkdir()

# Create a catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)

Now that we've prepared the data catalog, we need a `DatabentoDataLoader` which we'll use to decode and load the data into Nautilus objects.

In [None]:
loader = DatabentoDataLoader()

In [None]:
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
instrument_id = InstrumentId.from_str("ES.n.0")  # This should be the raw symbol (update)

depth10 = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,  # Not required but makes data loading faster (symbology mapping not required)
    as_legacy_cython=False,  # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)
)

In [None]:
# Write data to catalog (this currently takes ~20 seconds, or ~250,000 records/second, for MBP-10)
catalog.write_data(depth10)

In [None]:
# Test reading from catalog
depths = catalog.order_book_depth10()
len(depths)

## Preparing a month of AAPL trades

Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento `trades` schema, which will translate to Nautilus `TradeTick` objects.

In [None]:
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="XNAS.ITCH",
    symbols=["AAPL"],
    schema="trades",
    start="2024-01",
)

In [None]:
path = DATABENTO_DATA_DIR / "aapl-xnas-202401.trades.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="XNAS.ITCH",
        symbols=["AAPL"],
        schema="trades",
        start="2024-01",
        path=path,  # <--- Passing a `path` parameter will ensure the data is written to disk
    )

In [None]:
# Inspect the data by reading from disk and converting to a pandas.DataFrame
data = DBNStore.from_file(path)

df = data.to_df()
df

In [None]:
instrument_id = InstrumentId.from_str("AAPL.XNAS")  # Using the Nasdaq ISO 10383 MIC (Market Identifier Code) as the venue

trades = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,  # Not required but makes data loading faster (symbology mapping not required)
    as_legacy_cython=False,  # This will load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient)
)

Here we'll organize our data in a file per month. This is a rather arbitrary choice, and a file per day could be equally valid.

It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data.
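One possible shape for such a function is sketched below, assuming each chunk can be identified by a UNIX nanosecond timestamp (for example the event timestamp of its first record) and that monthly files are wanted. The name `monthly_basename` is hypothetical.

```python
from datetime import datetime, timezone


def monthly_basename(ts_ns: int) -> str:
    """Return a 'YYYY-MM' basename for a UNIX nanosecond timestamp, interpreted as UTC.

    Hypothetical helper for this tutorial, not part of any library.
    """
    seconds = ts_ns // 1_000_000_000  # Nanoseconds to whole seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m")
```

The result could be passed as the `basename_template` value when writing a chunk. Switching to a file per day would only require changing the format string to `"%Y-%m-%d"`.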

In [None]:
# Write data to catalog
catalog.write_data(trades, basename_template="2024-01")

In [None]:
trades = catalog.trade_ticks([instrument_id])

In [None]:
len(trades)