# Labeling

## Fixed-Time Horizon Method
* Fixed threshold
* Time bar
* Does not account for changes in scale (e.g., time-varying volatility)
* To improve:
    * Label against a varying threshold that depends on the estimated volatility $\sigma_t$
    * Use dollar or volume based bars
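
A minimal sketch of fixed-horizon labeling with a volatility-scaled threshold, as suggested in the improvements above. The function name, the EWMA estimator for $\sigma_t$, and the parameters `k` and `vol_span` are assumptions made for illustration; the notes do not specify how $\sigma_t$ is estimated.

```python
import math

def fixed_horizon_labels(prices, horizon, k=1.0, vol_span=20):
    """Label bar t with 1, -1 or 0 by comparing the return over `horizon`
    bars against a threshold k * sigma_t (hypothetical helper)."""
    # One-bar simple returns; the first bar has no prior price, so use 0.
    returns = [0.0] + [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]

    # Assumption: sigma_t is an EWMA of squared returns (zero-mean), using
    # only information available up to bar t.
    alpha = 2.0 / (vol_span + 1.0)
    var = 0.0
    sigma = []
    for r in returns:
        var = (1.0 - alpha) * var + alpha * r * r
        sigma.append(math.sqrt(var))

    labels = []
    for t in range(len(prices) - horizon):
        fwd = prices[t + horizon] / prices[t] - 1.0  # forward return over the horizon
        thr = k * sigma[t]                           # threshold varies with volatility
        if fwd > thr:
            labels.append(1)
        elif fwd < -thr:
            labels.append(-1)
        else:
            labels.append(0)
    return labels
```

The last `horizon` bars get no label because their forward return is not yet known. For horizons longer than one bar, the one-bar $\sigma_t$ would normally be rescaled; that step is left out here.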

## The Triple-Barrier Method
* Three thresholds
    * Touching Upper: label 1
    * Touching Lower: label -1
    * Touching Vertical: 0 or sign of return
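
The three barriers can be sketched as a path-dependent check on the price series. This is an illustrative sketch only: the function name and parameters (`upper`, `lower` as return thresholds, `max_bars` as the vertical barrier) are assumptions, and in practice the horizontal barriers would typically be scaled by an estimated volatility.

```python
def triple_barrier_label(prices, start, upper, lower, max_bars, sign_on_timeout=False):
    """Return 1 / -1 if the upper / lower barrier is touched first,
    otherwise the vertical-barrier label (hypothetical helper)."""
    p0 = prices[start]
    end = min(start + max_bars, len(prices) - 1)  # position of the vertical barrier
    for t in range(start + 1, end + 1):
        ret = prices[t] / p0 - 1.0
        if ret >= upper:    # upper horizontal barrier touched first
            return 1
        if ret <= -lower:   # lower horizontal barrier touched first
            return -1
    # Vertical barrier reached without touching either horizontal barrier.
    if sign_on_timeout:
        ret = prices[end] / p0 - 1.0
        return (ret > 0) - (ret < 0)
    return 0
```

For example, `triple_barrier_label([100, 101, 103, 99], 0, 0.02, 0.02, 3)` touches the upper barrier at the third price and returns `1`.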

# Financial Data Structures
## Types of Data
* Fundamental Data
* Market Data
* Analytics
* Alternative Data

## Bars
### Standard Bars
* Time Bars: Sampled at a fixed time interval
* Tick Bars: Sampled every fixed number of ticks
* Volume Bars: Sampled every time a fixed volume has been traded
* Dollar Bars: Sampled every time a fixed dollar value has been traded
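
Tick, volume and dollar bars differ only in what is accumulated before a bar is closed. The sketch below illustrates this with a hypothetical `make_bars` helper; the tick format `(price, volume)` and the dictionary layout of a bar are assumptions.

```python
# What each bar type accumulates per tick (price p, volume v).
MEASURES = {
    "tick": lambda p, v: 1,
    "volume": lambda p, v: v,
    "dollar": lambda p, v: p * v,
}

def make_bars(ticks, threshold, kind="dollar"):
    """Group (price, volume) ticks into bars, closing a bar once the
    accumulated measure reaches `threshold` (hypothetical helper)."""
    measure = MEASURES[kind]
    bars, cur, acc = [], None, 0.0
    for price, volume in ticks:
        if cur is None:  # first tick of a new bar
            cur = {"open": price, "high": price, "low": price,
                   "close": price, "volume": 0.0}
        cur["high"] = max(cur["high"], price)
        cur["low"] = min(cur["low"], price)
        cur["close"] = price
        cur["volume"] += volume
        acc += measure(price, volume)
        if acc >= threshold:  # enough activity: close the bar
            bars.append(cur)
            cur, acc = None, 0.0
    return bars  # a trailing incomplete bar is dropped
```

Time bars are not covered here, since they are closed by the clock rather than by accumulated activity.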

### Information-Driven Bars
* Use the following to estimate the amount of information
    * $b_t = \begin{cases}
        b_{t-1} & \text{if } \Delta p_t = 0\\
        \frac{\Delta p_t}{|\Delta p_t|} & \text{if } \Delta p_t \neq 0
      \end{cases}$
    * $T^* = \underset{T}{\arg\min} \{|\theta_T| \geq E[\theta_T]\}$, where $\theta_T$ is defined per bar type below
* Tick Imbalance Bars (TIB)
    * Take into consideration how many times the price changes, and in which direction
    * $\theta_T = \sum_{t=1}^T b_t$
    * Look at flow imbalance. If the imbalance exceeds its expected value, start a new bar
* Volume/Dollar Imbalance Bars (VIB and DIB)
    * $\theta_T = \sum_{t=1}^T b_t v_t$
* Tick Runs Bars
    * Monitor the sequence of buys
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t, - \sum_{t|b_t=-1}^T b_t\}$
* Volume/Dollar Runs Bars
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t v_t, - \sum_{t|b_t=-1}^T b_t v_t\}$
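
The tick rule $b_t$ and the imbalance $\theta_T$ can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the initial sign `b0` is an assumption, and the expected imbalance is passed in as a fixed number (`expected_imbalance`) rather than being re-estimated after each bar.

```python
def tick_rule(prices, b0=1):
    """b_t = sign of the price change, carrying the previous sign forward
    when the price is unchanged. b0 is an assumed sign for the first tick."""
    signs, prev_sign, prev_price = [], b0, None
    for p in prices:
        if prev_price is not None and p != prev_price:
            prev_sign = 1 if p > prev_price else -1
        signs.append(prev_sign)
        prev_price = p
    return signs

def imbalance_bar_ends(prices, volumes=None, expected_imbalance=3.0):
    """Return the tick indices at which an imbalance bar closes.
    With volumes=None this is a tick imbalance bar (theta = sum of b_t);
    otherwise a volume imbalance bar (theta = sum of b_t * v_t)."""
    theta, ends = 0.0, []
    for t, b in enumerate(tick_rule(prices)):
        weight = 1.0 if volumes is None else volumes[t]
        theta += b * weight
        if abs(theta) >= expected_imbalance:  # imbalance larger than expected
            ends.append(t)
            theta = 0.0                       # start accumulating a new bar
    return ends
```

Passing dollar values (price times volume) as `volumes` would give dollar imbalance bars under the same assumptions.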
    
## Dealing with Multi-Product Series
* Example cases:
    * Model spreads with changing weights
    * Basket of securities where dividends/coupons must be reinvested
    * Basket that must be rebalanced
    * Index whose constituents have changed
    * Replace an expired/matured contract/security
* Goal is to transform any complex multi-product dataset into a single dataset that resembles a total-return ETF

### ETF Trick
* Problems when trading a spread of futures
    * The spread is characterized by a vector of weights that changes over time, so the spread itself may converge even if prices do not change
    * Spreads can take negative values
    * Trading times will not align exactly for all constituents
* The goal is to model a basket of futures as if it were a single non-expiring cash product
    * Changes in the series reflect PnL
    * Strictly positive
    * Shortfall is taken into consideration
    
#### Method
For instrument $i = 1, \dots, I$ at bar $t = 1, \dots, T$
* $o_{i, t}$: Raw open price
* $p_{i, t}$: Raw close price
* $\phi_{i, t}$: Exchange rate to USD
* $v_{i, t}$: Volume
* $d_{i, t}$: Dividend or coupon
    
For allocation vector $\omega_t$ rebalanced on bars $B \subseteq \{1, \dots, T\}$,
* $h_{i, t} = \begin{cases}
        \frac{\omega_{i, t} K_t}{o_{i, t + 1} \phi_{i, t} \sum_i |\omega_{i, t}|} & \text{if } t \in B\\
        h_{i, t - 1} & \text{otherwise}
      \end{cases}$
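
A rough sketch of how the holdings $h_{i,t}$ feed into the basket value $K_t$. Two pieces are not spelled out in the notes above and are assumptions taken from the usual statement of the ETF trick: the price change $\delta_{i,t}$ is $p_{i,t} - o_{i,t}$ on the bar right after a rebalance and $p_{i,t} - p_{i,t-1}$ otherwise, and $K_t = K_{t-1} + \sum_i h_{i,t-1} \phi_{i,t} (\delta_{i,t} + d_{i,t})$ with $K_0 = 1$. Holdings are assumed to be carried forward unchanged between rebalance bars. The function name and the list-of-lists inputs are hypothetical.

```python
def etf_trick(opens, closes, fx, dividends, weights, rebalance_bars, k0=1.0):
    """Return the basket value series K. Inputs are indexed as x[t][i]
    (bar t, instrument i); bars are 0-indexed here (hypothetical helper)."""
    n_bars, n_inst = len(closes), len(closes[0])
    K = [k0]
    h = [0.0] * n_inst  # holdings carried into the next bar
    for t in range(n_bars):
        if t > 0:
            pnl = 0.0
            for i in range(n_inst):
                if (t - 1) in rebalance_bars:
                    delta = closes[t][i] - opens[t][i]       # new position entered at the open
                else:
                    delta = closes[t][i] - closes[t - 1][i]  # position held from the prior close
                pnl += h[i] * fx[t][i] * (delta + dividends[t][i])
            K.append(K[-1] + pnl)
        if t in rebalance_bars and t + 1 < n_bars:
            # Reallocate the current value K_t at the next bar's open price.
            gross = sum(abs(w) for w in weights[t])
            h = [weights[t][i] * K[-1] / (opens[t + 1][i] * fx[t][i] * gross)
                 for i in range(n_inst)]
    return K
```

With a single instrument that gains 10% on each of two bars and one rebalance on the first bar, the series grows from 1 to roughly 1.1 and then 1.21, so changes in `K` track the PnL of one unit of capital.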
      

## Sampling Features
* Not all ML algorithms scale well with sample size, e.g., SVMs
* ML works well when trained on relevant features
* Event-Based Sampling: Sample features when certain events occur, e.g., a spike in volatility
    * CUSUM Filter: Sample when the cumulative deviation of the target value exceeds a defined threshold
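
A minimal sketch of a symmetric CUSUM filter for event-based sampling. It assumes the expected change of the series is zero, so deviations are measured as first differences; the function name and the reset-after-event behavior are illustrative choices, not taken from the notes.

```python
def cusum_filter(values, threshold):
    """Return the indices at which the cumulative upward or downward
    move since the last event exceeds `threshold` (hypothetical helper)."""
    events = []
    s_pos = s_neg = 0.0
    for t in range(1, len(values)):
        diff = values[t] - values[t - 1]
        s_pos = max(0.0, s_pos + diff)  # cumulative upward deviation, floored at 0
        s_neg = min(0.0, s_neg + diff)  # cumulative downward deviation, capped at 0
        if s_pos > threshold:
            events.append(t)
            s_pos = 0.0                 # reset after sampling an upward event
        elif s_neg < -threshold:
            events.append(t)
            s_neg = 0.0                 # reset after sampling a downward event
    return events
```

For example, `cusum_filter([100, 101, 102, 103, 101, 99, 98], 2.5)` returns `[3, 5]`: one event after the cumulative rise of 3 and one after the cumulative fall of 4.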