In [2]:
import polars as pl
import numpy as np
from datetime import datetime

import torch
import torch.nn as nn

from adausdt_qml import research

In [3]:
sym = "ADAUSDT"
time_interval = "16h"
max_lags = 4
forecast_horizon = 1

start_date = datetime(2023, 2, 8, 0, 0)
end_date   = datetime(2026, 2, 8, 0, 0)

In [4]:
ts = research.load_ohlc_timeseries_range(sym, time_interval, start_date, end_date)

Loading ADAUSDT: 100%|██████████| 1097/1097 [00:21<00:00, 52.02day/s]


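We fix the seeds of Python's `random`, NumPy, and PyTorch, and switch PyTorch to deterministic algorithms, so that results are reproducible across runs.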

In [5]:
import random 

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

In [6]:
set_seed(99)
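
One caveat, assuming the notebook runs on a CUDA GPU: with `torch.use_deterministic_algorithms(True)`, PyTorch's reproducibility notes state that some cuBLAS operations raise a `RuntimeError` on CUDA 10.2 or later unless the `CUBLAS_WORKSPACE_CONFIG` environment variable is set. A minimal sketch of setting it follows; it should run before any CUDA work starts (ideally at the very top of the notebook) and has no effect on CPU-only runs.

```python
import os

# Assumption: a CUDA GPU is used. PyTorch's reproducibility notes list
# ":4096:8" and ":16:8" as the accepted values for this variable.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```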

## Feature Engineering

In this section, we transform raw OHLC price data into a structured feature set suitable for supervised machine learning.

The objective is to represent price dynamics in **return space**, which is standard practice in quantitative finance due to better statistical properties such as stationarity and scale invariance.




### Target Variable: Log Returns

The prediction target is defined as the **log return of the closing price** over a fixed forecast horizon.
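
Written out, with $P_t$ the closing price at step $t$ and $h$ the forecast horizon (`forecast_horizon`, here 1), the target is the quantity below. The symbols $P_t$, $r_t$, and $h$ are notation introduced here for illustration; they mirror the `close / close.shift(forecast_horizon)` expression used in the code.

$$
r_t = \ln\!\left(\frac{P_t}{P_{t-h}}\right) = \ln P_t - \ln P_{t-h}
$$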

In plain terms, the log return is the logarithm of the ratio between two consecutive closing prices; for small moves it is close to the percentage change in price. This representation has two important advantages:

- Log returns are **additive over time**, which aligns naturally with portfolio compounding
- They reduce the impact of price scale, making the series more stable for modeling
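
The additivity property can be checked on a toy example. The sketch below uses hypothetical prices (not taken from the ADAUSDT data) and only the standard library: the one-step log returns sum to the log return over the whole window, whereas simple percentage returns do not.

```python
import math

# Hypothetical closing prices, chosen only to illustrate the identity.
closes = [100.0, 110.0, 99.0, 104.5]

# One-step log returns: ln(P_t / P_{t-1})
step_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]

# Log return over the whole window: ln(P_last / P_first)
total_return = math.log(closes[-1] / closes[0])

# Additivity: the one-step log returns sum to the whole-window log return.
assert math.isclose(sum(step_returns), total_return)

# Simple (percentage) returns compound multiplicatively, so their sum
# differs from the whole-window simple return.
simple_returns = [b / a - 1 for a, b in zip(closes, closes[1:])]
simple_total = closes[-1] / closes[0] - 1
print(sum(simple_returns), simple_total)
```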

In [7]:
# 1) target (log return)
ts = ts.with_columns(
    (pl.col("close") / pl.col("close").shift(forecast_horizon)).log().alias("close_log_return")
)

In [9]:
ts = ts.drop_nulls()

## Directional Balance of Returns

Before modeling, we analyze the **directional distribution** of returns.

Since the direction label is the sign of the log return, an imbalanced dataset (e.g., mostly upward moves) would let a model reach high accuracy by always predicting the majority class, without any real predictive power.

In [10]:
# Create direction column
ts = ts.with_columns(
    (pl.col("close_log_return") > 0).cast(pl.Int8).alias("return_dir")
)

# value_counts -> then unnest the struct into columns
dir_counts = (
    ts.select(pl.col("return_dir").value_counts())
      .unnest("return_dir")   # <-- creates columns: return_dir, count
      .with_columns(
          (pl.col("count") / pl.col("count").sum()).alias("frequency")
      )
      .sort("return_dir")
)

dir_counts

return_dir,count,frequency
i8,u32,f64
0,849,0.516109
1,796,0.483891


### Directional Balance of the Dataset

The dataset is close to directionally balanced.

- Down moves (`return_dir = 0`): **51.61%**
- Up moves (`return_dir = 1`): **48.39%**

This near-symmetry suggests there is **no strong directional bias** in the raw data.  
As a result, any predictive performance is unlikely to come from trivial class imbalance and must be driven by **temporal structure in returns**, which aligns with the autoregressive modeling approach used in this study.
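
To put a number on "close to balanced", the sketch below takes the two counts from the table above and computes (a) the accuracy of a trivial classifier that always predicts the majority class, which is the floor any directional model should beat, and (b) a z-score for the down count under a 50/50 null, using the normal approximation to the binomial. The choice of test is an illustration added here, not part of the original analysis; the z-score comes out at roughly 1.3, below the usual 1.96 threshold for a 5% two-sided test.

```python
import math

# Counts copied from the value_counts output above.
n_down, n_up = 849, 796
n = n_down + n_up

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = max(n_down, n_up) / n

# z-score of the down count under a fair-coin null (p = 0.5),
# using the normal approximation to the binomial.
z = (n_down - 0.5 * n) / math.sqrt(n * 0.25)

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"z-score vs. 50/50 split: {z:.2f}")
```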

## Model 

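The model is an autoregressive (AR) model: it predicts the next log return from its own past values. We therefore create `max_lags = 4` lagged copies of the target as features.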
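
For reference, a linear AR model of order $p$ on the log returns $r_t$ has the textbook form below; with four lags and `forecast_horizon = 1`, the features built next correspond to $p = 4$. This section does not fix whether the model fitted later is linear or a nonlinear function of the same lags, so take this as the linear reference case.

$$
r_t = c + \sum_{i=1}^{p} \phi_i \, r_{t-i} + \varepsilon_t
$$

Here $c$ is an intercept, $\phi_i$ are the lag coefficients, and $\varepsilon_t$ is a zero-mean noise term; all three symbols are notation introduced for illustration.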

In [11]:
# 2) lag features
target = "close_log_return"
lr = pl.col(target)

# One column per lag k = 1..max_lags, each shifted by k * forecast_horizon rows
lag_cols = [f"{target}_lag_{k}" for k in range(1, max_lags + 1)]
ts = ts.with_columns(
    [lr.shift(forecast_horizon * k).alias(name) for k, name in enumerate(lag_cols, start=1)]
)

# 3) clean dataset (drop the leading rows whose lags are undefined)
ts = ts.drop_nulls()

ts.select(["datetime", "close", target, *lag_cols]).head(10)

datetime,close,close_log_return,close_log_return_lag_1,close_log_return_lag_2,close_log_return_lag_3,close_log_return_lag_4
datetime[μs],f64,f64,f64,f64,f64,f64
2023-02-11 08:00:00,0.3683,0.029203,-0.006965,-0.004985,-0.085721,0.004829
2023-02-12 00:00:00,0.3697,0.003794,0.029203,-0.006965,-0.004985,-0.085721
2023-02-12 16:00:00,0.3641,-0.015263,0.003794,0.029203,-0.006965,-0.004985
2023-02-13 08:00:00,0.3589,-0.014385,-0.015263,0.003794,0.029203,-0.006965
2023-02-14 00:00:00,0.3789,0.054229,-0.014385,-0.015263,0.003794,0.029203
2023-02-14 16:00:00,0.3867,0.020377,0.054229,-0.014385,-0.015263,0.003794
2023-02-15 08:00:00,0.4193,0.080937,0.020377,0.054229,-0.014385,-0.015263
2023-02-16 00:00:00,0.4126,-0.016108,0.080937,0.020377,0.054229,-0.014385
2023-02-16 16:00:00,0.3875,-0.062763,-0.016108,0.080937,0.020377,0.054229
2023-02-17 08:00:00,0.4033,0.039965,-0.062763,-0.016108,0.080937,0.020377
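
A quick way to confirm that the lags are aligned without look-ahead is to check that each `lag_k` value equals the target from `k` rows earlier. The sketch below re-implements the shift in plain Python on the first five returns from the table above; the helper `lag` is hypothetical and only mimics the null-padding behaviour of a forward shift.

```python
# First five close_log_return values copied from the table above.
returns = [0.029203, 0.003794, -0.015263, -0.014385, 0.054229]

def lag(values, k):
    """Shift a list forward by k steps, padding the start with None."""
    return ([None] * k + values)[: len(values)]

lag_1 = lag(returns, 1)
lag_2 = lag(returns, 2)

# Each row pairs the current return with its first and second lag.
rows = list(zip(returns, lag_1, lag_2))

# Third row: the lags are the two preceding returns, as in the table.
print(rows[2])
```

Because every feature in a row comes strictly from earlier rows, no information from the value being predicted leaks into its own features.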


As the following plots show, the distribution of the log returns appears approximately **normally distributed** and centered near zero, while the distribution of the closing price is much more irregular. Together with time additivity, this is another reason to model log returns rather than raw closing prices.

In [12]:
research.plot_distribution(ts, target, no_bins = 100)

In [13]:
research.plot_distribution(ts, 'close', no_bins = 100)
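
The visual impression of normality can be backed by two summary statistics: skewness and excess kurtosis, both of which are zero for a Gaussian. Financial return series often show heavier tails than a Gaussian, which would appear as positive excess kurtosis. The sketch below is a standard-library illustration run on synthetic Gaussian draws, not on the ADAUSDT data; the helper name and the 3% volatility are assumptions made for the example.

```python
import random

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Synthetic stand-in for the real series (assumption: Gaussian draws with a
# made-up 3% volatility). In the notebook, use ts[target].to_list() instead.
rng = random.Random(0)
sample = [rng.gauss(0.0, 0.03) for _ in range(20_000)]

skew, excess_kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness={skew:.3f}, excess kurtosis={excess_kurt:.3f}")
```

Values far from zero on the real series would indicate that the normal approximation is only a rough description.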