# ETL Tutorial For Historical Data

In this tutorial we are going to:
1. Extract a sample of historical trades data from `.tar` (source: https://drive.google.com/file/d/1up5otVlfw-RX1S6K8o4d2nNRPP-lKran/view)
2. Perform exploratory analysis 
3. Transform to a format supported by other components in the library.
4. Store on S3 in a parquet tiled format tailor-suited to be used for various use-cases.
5. Show an example of loading the data back from S3

# Prerequisites

In order to go through this tutorial successfully, the following set-up/infrastructure available:
1. A virtual environment set-up running, make sure you can run `> i docker_jupyter` successfully.
2. An S3 bucket to store historical data
3. AWS API credentials set-up with permissions to access access the S3 bucket

# Decompress the input file

_Note: Assuming the tar archive is located in the root of the repository_

In [1]:
! mkdir data && tar xf /app/msfttaqcsv202308.tar -C ./data

In [2]:
!ls ./data

 metadata
'uT1dPod8mR2s_MSFT US Equity_quotes_1_1.csv.gz'
'uT1dPod8mR2s_MSFT US Equity_trades_1_1.csv.gz'


# Imports

In [6]:
import datetime
import logging

import pandas as pd

import helpers.hdbg as hdbg
import helpers.henv as henv
import helpers.hpandas as hpandas
import helpers.hprint as hprint
import helpers.hparquet as hparque

The following cell sets up logging such that it is possible to capture log messages within jupyter cells

In [10]:
hdbg.init_logger(verbosity=logging.INFO)
log_level = logging.INFO

_LOG = logging.getLogger(__name__)

_LOG.info("%s", henv.get_system_signature()[0])

hprint.config_notebook()

[0m[36mINFO[0m: > cmd='/venv/lib/python3.8/site-packages/ipykernel_launcher.py -f /home/.local/share/jupyter/runtime/kernel-be984456-886b-47f8-8f9b-c1f481b43d31.json'
[31m-----------------------------------------------------------------------------
This code is not in sync with the container:
code_version='1.8.0' != container_version='1.7.0'
-----------------------------------------------------------------------------
You need to:
- merge origin/master into your branch with `invoke git_merge_master`
- pull the latest container with `invoke docker_pull`[0m
INFO  # Git
  branch_name='CmampTask5539_Create_a_tutorial_to_load_and_resample_data'
  hash='af0228790'
  # Last commits:
    * af0228790 jsmerix  Checkpoint                                                        (61 minutes ago) Wed Sep 27 17:46:56 2023  (HEAD -> CmampTask5539_Create_a_tutorial_to_load_and_resample_data)
    * e9afa86b9 Vlad     CmampTask5466_Remove_mxnet,_gluonts_and_disable_related_tests (#5470) (   2 hours a

# Load the data

Note: `head=10000` ensures we only use a snippet of the data to run a quick example.

In [5]:
data = pd.read_csv("data/uT1dPod8mR2s_MSFT US Equity_trades_1_1.csv.gz", head=10000)

In [6]:
data.head()

Unnamed: 0,SECURITY,TICK_SEQUENCE_NUMBER,TICK_TYPE,EVT_TRADE_TIME,TRADE_REPORTED_TIME,EVT_TRADE_EXECUTION_TIME,EVT_TRADE_IDENTIFIER,EVENT_ORIGINAL_TRADE_ID,EVENT_ORIGINAL_TRADE_TIME,EVT_TRADE_PRICE,EVT_TRADE_SIZE,EVT_TRADE_LOCAL_EXCH_SOURCE,EVT_TRADE_CONDITION_CODE,EVT_TRADE_BUY_BROKER,EVT_TRADE_SELL_BROKER,TRACE_RPT_PARTY_SIDE_LAST_TRADE,EVT_TRADE_RPT_PARTY_TYP,EVT_TRADE_BIC,EVT_TRADE_MIC,EVT_TRADE_ESMA_TRADE_FLAGS,EVT_TRADE_AGGRESSOR,EVT_TRADE_RPT_CONTRA_TYP,EVT_TRADE_REMUNERATION,EVT_TRADE_ATS_INDICATOR
0,MSFT US Equity,4417360,NEW,2023-08-01T00:00:00.050Z,2023-08-01T00:00:00.050Z,,,,,336.0,0.0,UF,OC,,,,,,,,,,,
1,MSFT US Equity,4417361,NEW,2023-08-01T00:00:00.050Z,2023-08-01T00:00:00.050Z,,,,,335.95,0.0,VY,OC,,,,,,,,,,,
2,MSFT US Equity,4417362,NEW,2023-08-01T00:00:00.050Z,2023-08-01T00:00:00.050Z,,,,,335.95,0.0,UX,OC,,,,,,,,,,,
3,MSFT US Equity,4417363,NEW,2023-08-01T00:00:00.050Z,2023-08-01T00:00:00.050Z,,,,,335.94,0.0,VF,OC,,,,,,,,,,,
4,MSFT US Equity,4417364,NEW,2023-08-01T00:00:00.050Z,2023-08-01T00:00:00.050Z,,,,,336.0,0.0,VG,OC,,,,,,,,,,,


Drop columns where only NaN values are present.

In [7]:
data = data.dropna(axis=1, how='all')

In [9]:
data["SECURITY"].value_counts()

MSFT US Equity    7629374
Name: SECURITY, dtype: int64

Set datetime index

In [14]:
data["timestamp"] = pd.to_datetime(data["TRADE_REPORTED_TIME"])
data = data.set_index("timestamp", drop=True)

In [15]:
data = data[["EVT_TRADE_PRICE", "EVT_TRADE_SIZE"]]

## Compute OHLCV


A simple resampling operation is applied to the data


Time interval labelling convention used across that time interval [a, b) is labelled as b.

E.g. for interval [06:40:00, 06:41:00) the timestamp is
06:41:00

Reference: [Sorrentum whitepaper](https://drive.google.com/drive/u/0/folders/1oFRoJIpqsbCJGP54vx774eVOBCc0z6Wk)

In [16]:
data_ohlcv = data["EVT_TRADE_PRICE"].resample("1T", closed="left", label="right").ohlc()

In [17]:
data_volume = data["EVT_TRADE_SIZE"].resample("1T", closed="left", label="right").sum()
data_volume.name = "volume"

In [18]:
data = pd.concat([data_ohlcv, data_volume], axis=1)

In [19]:
data.head()

Unnamed: 0_level_0,open,high,low,close,volume
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2023-08-01 00:01:00+00:00,336.0,336.0,335.6,335.92,0.0
2023-08-01 00:02:00+00:00,,,,,0.0
2023-08-01 00:03:00+00:00,,,,,0.0
2023-08-01 00:04:00+00:00,,,,,0.0
2023-08-01 00:05:00+00:00,,,,,0.0


A standardized name for asset identification column is currently `currency_pair`

In [20]:
data["currency_pair"] = "MSFT"
data["knowledge_timestamp"] = pd.Timestamp.utcnow()

In [21]:
data.head()

Unnamed: 0_level_0,open,high,low,close,volume,currency_pair,knowledge_timestamp
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2023-08-01 00:01:00+00:00,336.0,336.0,335.6,335.92,0.0,MSFT,2023-09-26 16:16:04.623076+00:00
2023-08-01 00:02:00+00:00,,,,,0.0,MSFT,2023-09-26 16:16:04.623076+00:00
2023-08-01 00:03:00+00:00,,,,,0.0,MSFT,2023-09-26 16:16:04.623076+00:00
2023-08-01 00:04:00+00:00,,,,,0.0,MSFT,2023-09-26 16:16:04.623076+00:00
2023-08-01 00:05:00+00:00,,,,,0.0,MSFT,2023-09-26 16:16:04.623076+00:00


## Save as parquet to S3

- The S3 path is formed based on a dataset schema we use. It allows use to have a predictable, unified structure. See Sorrentum whitepaper and `dataset_schema/` directory

In [22]:
# See docstring of hparque.add_date_partition_columns and hparque.to_partitioned_parquet.
partition_mode = "by_year_month"
s3_path = "s3://cryptokaizen-data-test/v3/bulk/manual/resampled_1min/parquet/ohlcv/spot/v1/bloomberg/us_market/v1_0_0/"
# The value of aws_profile depends on your organization set-up.
aws_profile = "ck"

In [23]:
data, partition_cols = hparque.add_date_partition_columns(
        data, partition_mode
    )
hparque.to_partitioned_parquet(
    data,
    ["currency_pair"] + partition_cols,
    s3_path,
    partition_filename=None,
    aws_profile=aws_profile,
)

# Demonstration of example usage of stored resampled data

Once the data is stored to S3 it can be used for any vairous use-cases. An example is using `HistoricalPqByCurrencyPairTileClient` to load data back from S3 and create a `MarketData` object which can be used for example in simulations or backtests.

## Add additional import for the use case

In [4]:
import im_v2.common.data.client.historical_pq_clients as imvcdchpcl
import market_data as mdata
import core.config as cconfig

  from tqdm.autonotebook import tqdm


## Build Config

Config is the standardized way of setting parameters. See more in the documentation

In [7]:
config = {
    "start_ts": None,
    "end_ts": None,
    "wall_clock_time": pd.Timestamp("2100-01-01T00:00:00+00:00"),
    "columns": None,
    "columns_remap": None,
    "ts_col_name": "end_ts",
    "im_client": {
        "vendor": "bloomberg",
        "universe_version": "v1",
        "root_dir": "s3://cryptokaizen-data-test/v3/bulk",
        "partition_mode": "by_year_month",
        "dataset": "ohlcv",
        "contract_type": "spot",
        "data_snapshot": "",
        "download_mode": "manual",
        "downloading_entity": "",
        "aws_profile": "ck",
        "resample_1min": False,
        "version": "v1_0_0",
        "tag": "resampled_1min",
    },
}
config = cconfig.Config.from_dict(config)
print(config)

start_ts: None
end_ts: None
wall_clock_time: 2100-01-01 00:00:00+00:00
columns: None
columns_remap: None
ts_col_name: end_ts
im_client: 
  vendor: bloomberg
  universe_version: v1
  root_dir: s3://cryptokaizen-data-test/v3/bulk
  partition_mode: by_year_month
  dataset: ohlcv
  contract_type: spot
  data_snapshot: 
  download_mode: manual
  downloading_entity: 
  aws_profile: ck
  resample_1min: False
  version: v1_0_0
  tag: resampled_1min


## Load data

Client provides an interface to load data from storage medium. More on clients in the dedicated documentation

In [8]:
im_client = imvcdchpcl.HistoricalPqByCurrencyPairTileClient(**config["im_client"])

To represent a set of assets which are used for a specific use case (for example set of assets traded with a given model) we use a universe. To find out more, search for "universe" documentation.

In [11]:
full_symbols = im_client.get_universe()
filter_data_mode = "assert"
actual_df = im_client.read_data(
    full_symbols,
    config["start_ts"],
    config["end_ts"],
    config["columns"],
    filter_data_mode,
)
hpandas.df_to_str(actual_df, num_rows=5, log_level=logging.INFO)

Unnamed: 0,full_symbol,open,high,low,close,volume,knowledge_timestamp
2023-08-01 00:01:00+00:00,us_market::MSFT,336.0,336.0,335.6,335.92,0.0,2023-09-26 16:16:04.623076+00:00
2023-08-01 00:02:00+00:00,us_market::MSFT,,,,,0.0,2023-09-26 16:16:04.623076+00:00
,...,...,...,...,...,...,...
2023-08-31 23:29:00+00:00,us_market::MSFT,328.1,328.11,328.05,328.11,107.0,2023-09-26 16:16:04.623076+00:00
2023-08-31 23:30:00+00:00,us_market::MSFT,328.1,328.1,328.1,328.1,8.0,2023-09-26 16:16:04.623076+00:00


### Initialize MarketData

#TODO(Juraj): The problem I see here is that we have these cryptically sounding powerful functions such as `get_HistoricalImClientMarketData_example1` but it's difficult for a newcomer/client to understand what to do in case the use-case or input arguments are a little bit different

In [12]:
asset_ids = im_client.get_asset_ids_from_full_symbols(full_symbols)
market_data = mdata.get_HistoricalImClientMarketData_example1(
    im_client,
    asset_ids,
    config["columns"],
    config["columns_remap"],
    wall_clock_time=config["wall_clock_time"],
)

In [13]:
asset_ids = None
market_data_df = market_data.get_data_for_interval(
    config["start_ts"], config["end_ts"], config["ts_col_name"], asset_ids
)
hpandas.df_to_str(market_data_df, num_rows=5, log_level=logging.INFO)

Unnamed: 0,asset_id,full_symbol,open,high,low,close,volume,knowledge_timestamp,start_ts
2023-07-31 20:01:00-04:00,1343146433,us_market::MSFT,336.0,336.0,335.6,335.92,0.0,2023-09-26 16:16:04.623076+00:00,2023-07-31 20:00:00-04:00
2023-07-31 20:02:00-04:00,1343146433,us_market::MSFT,,,,,0.0,2023-09-26 16:16:04.623076+00:00,2023-07-31 20:01:00-04:00
,...,...,...,...,...,...,...,...,...
2023-08-31 19:29:00-04:00,1343146433,us_market::MSFT,328.1,328.11,328.05,328.11,107.0,2023-09-26 16:16:04.623076+00:00,2023-08-31 19:28:00-04:00
2023-08-31 19:30:00-04:00,1343146433,us_market::MSFT,328.1,328.1,328.1,328.1,8.0,2023-09-26 16:16:04.623076+00:00,2023-08-31 19:29:00-04:00
