# Multiprocessors
## Parallelism
* Multithreading: Run more than one thread on the same core
* Multiprocessing: Run on more than one core
* GIL (Global Interpreter Lock) grants write access to one thread per core at a time => Python parallelizes through multiprocessing
* Atom: Single task
* Molecule: Subset of tasks
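
The atom/molecule split can be sketched as a simple partitioning step. This is a minimal illustration, assuming atoms are held in a list and molecules are contiguous, roughly equal-sized chunks; the function name `partition_atoms` is hypothetical.

```python
import numpy as np

def partition_atoms(atoms, num_molecules):
    """Split a list of atoms (single tasks) into contiguous molecules (subsets of tasks)."""
    # Never create more molecules than there are atoms.
    num_molecules = min(num_molecules, len(atoms))
    # Evenly spaced boundaries, rounded up to integer indices.
    bounds = np.ceil(np.linspace(0, len(atoms), num_molecules + 1)).astype(int)
    return [atoms[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

# Ten atoms split into three molecules, e.g. one molecule per worker process.
molecules = partition_atoms(list(range(10)), 3)
print(molecules)
```

Each molecule could then be handed to a separate worker process, for example with `multiprocessing.Pool.map`, so that the work is spread over several cores.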


# Sample Weights
## Overlapping Outcomes
* The time spans of samples can overlap with each other => Not IID
* Consider the following properties:
    * Number of concurrent labels: how many labels use data at a given time $t$, i.e., $c_t = \sum_{i=1}^I 1_{t, i}$
    * Uniqueness: $u_{t, i} = 1_{t, i} / c_t$
    * Average uniqueness of a label: uniqueness averaged over $t = 1, \dots, T$, i.e., $\bar{u}_i = (\sum_t u_{t, i}) / (\sum_t 1_{t, i})$
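
The three quantities above can be computed directly from an indicator matrix. The sketch below assumes a $T \times I$ array whose entry $(t, i)$ is 1 when label $i$ uses data at time $t$, and that every label is active for at least one time step; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def average_uniqueness(indicator):
    """indicator[t, i] = 1 if label i uses data at time t, else 0 (shape T x I)."""
    indicator = np.asarray(indicator, dtype=float)
    concurrency = indicator.sum(axis=1)                  # c_t
    # Avoid division by zero at times where no label is active.
    safe_c = np.where(concurrency > 0, concurrency, 1.0)
    uniqueness = indicator / safe_c[:, None]             # u_{t,i} = 1_{t,i} / c_t
    # Average u_{t,i} over the times at which label i is active.
    return uniqueness.sum(axis=0) / indicator.sum(axis=0)

# Toy example: label 0 spans t=0..2, label 1 spans t=2..3, label 2 spans t=4..5.
indicator = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])
avg_u = average_uniqueness(indicator)
print(avg_u)
```

Labels 0 and 1 overlap at $t = 2$, so both have average uniqueness below 1, while label 2 does not overlap with anything and keeps a value of 1.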
 
## Bagging Classifiers and Uniqueness
* Assuming IID leads to oversampling
* Let $\bar{u}_i$ be the average uniqueness of label $i$
* If $I^{-1} \sum_{i=1}^I \bar{u}_i \ll 1$,
    * In-bag samples are redundant to each other
    * In-bag samples are very similar to out-of-bag samples
* Solutions:
    * Drop overlapping outcomes => extreme loss of information
    * Lower the maximum number of samples drawn for each bagged estimator
* [Sequential Bootstrap](https://ac.els-cdn.com/S0378375897000414/1-s2.0-S0378375897000414-main.pdf?_tid=c9bde38a-9f30-43ca-abcf-98033c135788&acdnat=1529805932_882bdfdde2acf41470f32510b3a5a03c)
* Variation of Sequential Bootstrap
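    * Given the set $\phi^{(i)}$ of labels drawn so far, the uniqueness of label $j$ at time $t$ is $u_{t, j}^{(i+1)} = 1_{t, j} (1 + \sum_{k \in \phi^{(i)}} 1_{t, k})^{-1}$
    * Average over the span of the label: $\bar{u}_j^{(i+1)} = (\sum_t u_{t, j}^{(i+1)}) / (\sum_t 1_{t, j})$
    * Resample with adapted weights: $\delta_j^{(i)} = \bar{u}_j^{(i)} (\sum_{k=1}^I \bar{u}_k^{(i)})^{-1}$
    * After every sample, update the weights
    * Repeat the process until $I$ draws have been made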
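
The resampling loop above can be sketched as follows. This is a minimal, unoptimized illustration that assumes the same $T \times I$ indicator matrix as in the uniqueness definitions; the names `draw_probabilities` and `sequential_bootstrap` are hypothetical.

```python
import numpy as np

def draw_probabilities(indicator, drawn):
    """Probability of drawing each label next, given the labels drawn so far."""
    indicator = np.asarray(indicator, dtype=float)
    drawn = np.asarray(drawn, dtype=int)
    # sum_{k in phi} 1_{t,k}: how often each time step is already covered.
    drawn_concurrency = indicator[:, drawn].sum(axis=1)
    # u_{t,j} = 1_{t,j} / (1 + sum_{k in phi} 1_{t,k})
    uniqueness = indicator / (1.0 + drawn_concurrency)[:, None]
    avg_uniqueness = uniqueness.sum(axis=0) / indicator.sum(axis=0)
    # delta_j: average uniqueness normalized to a probability distribution.
    return avg_uniqueness / avg_uniqueness.sum()

def sequential_bootstrap(indicator, num_draws=None, seed=None):
    """Draw labels one at a time, updating the probabilities after each draw."""
    num_labels = np.asarray(indicator).shape[1]
    num_draws = num_labels if num_draws is None else num_draws
    rng = np.random.default_rng(seed)
    drawn = []
    for _ in range(num_draws):
        prob = draw_probabilities(indicator, drawn)
        drawn.append(int(rng.choice(num_labels, p=prob)))
    return drawn

indicator = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])
print(draw_probabilities(indicator, [1]))   # probabilities after label 1 was drawn
print(sequential_bootstrap(indicator, seed=0))
```

Once label 1 has been drawn, its own probability falls the most, the overlapping label 0 is penalized less, and the non-overlapping label 2 becomes the most likely next draw.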
    
 
## Prioritized Sampling
* Return Attribution: Assign training weights according to the absolute value of returns
* Time Decay: Weights decay linearly as observations become older
* Class Weights: Give different weights to different classes
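
A simplified sketch of the first two weighting schemes is shown below. It assumes one return per observation and applies the decay over observation order; both function names and the `oldest_weight` parameter are illustrative choices, not part of the notes.

```python
import numpy as np

def return_attribution_weights(returns):
    """Weights proportional to absolute returns, scaled to sum to the sample count."""
    weights = np.abs(np.asarray(returns, dtype=float))
    return weights * len(weights) / weights.sum()

def linear_time_decay(num_obs, oldest_weight=0.5):
    """Linear decay: the newest observation gets 1.0, the oldest gets oldest_weight."""
    return np.linspace(oldest_weight, 1.0, num_obs)

returns = [0.01, -0.03, 0.02, 0.04]           # ordered from oldest to newest
attribution = return_attribution_weights(returns)
decay = linear_time_decay(len(returns), oldest_weight=0.5)
sample_weight = attribution * decay            # combined per-observation weight
print(sample_weight)
```

The combined array can be passed as `sample_weight` to the `fit` method of scikit-learn estimators that support it, and class weights can be set on many scikit-learn classifiers through their `class_weight` parameter.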

# Labeling

## Fixed-Time Horizon Method
* Fixed threshold
* Time bar
* Does not take changes of scale (e.g., volatility) into consideration
* To improve:
    * Label with a varying threshold that depends on the estimated $\sigma_t$
    * Use dollar- or volume-based bars
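
A minimal sketch of fixed-horizon labeling follows. It assumes a plain price array and accepts either a constant threshold or one threshold per bar (for example a multiple of an estimated $\sigma_t$); the function name and the numbers are illustrative.

```python
import numpy as np

def fixed_horizon_labels(prices, horizon, threshold):
    """Label each bar by the sign of its forward return relative to a threshold.

    horizon must be at least 1; threshold may be a scalar or one value per bar.
    """
    prices = np.asarray(prices, dtype=float)
    threshold = np.broadcast_to(np.asarray(threshold, dtype=float), prices.shape)
    forward_return = prices[horizon:] / prices[:-horizon] - 1.0
    tau = threshold[:-horizon]
    # 1 above +tau, -1 below -tau, 0 otherwise.
    return np.where(forward_return > tau, 1, np.where(forward_return < -tau, -1, 0))

prices = [100, 101, 100, 103, 100]
fixed = fixed_horizon_labels(prices, horizon=1, threshold=0.015)
varying = fixed_horizon_labels(prices, horizon=1, threshold=[0.005, 0.005, 0.05, 0.05, 0.05])
print(fixed, varying)
```

With the constant threshold, the first two small moves are labeled 0. With the per-bar threshold, the same moves are labeled when the threshold is low, and the later, larger moves are not when it is high.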

## The Triple-Barrier Method
* Three thresholds
    * Touching Upper: label 1
    * Touching Lower: label -1
    * Touching Vertical: 0 or sign of return
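
The three barriers can be sketched for a single observation as below. This is a simplified illustration, assuming symmetric handling of returns measured from the starting price and a vertical barrier expressed as a maximum number of bars; the function and parameter names are hypothetical.

```python
import numpy as np

def triple_barrier_label(prices, start, upper, lower, max_holding, vertical_sign=False):
    """Return 1, -1 or 0 depending on which barrier is touched first."""
    prices = np.asarray(prices, dtype=float)
    # Vertical barrier: last bar we are allowed to hold until.
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / prices[start] - 1.0
        if ret >= upper:       # upper (profit-taking) barrier touched first
            return 1
        if ret <= -lower:      # lower (stop-loss) barrier touched first
            return -1
    if vertical_sign:
        # Vertical barrier touched: use the sign of the return at expiry.
        return int(np.sign(prices[end] / prices[start] - 1.0))
    return 0                   # vertical barrier touched: label 0

up = triple_barrier_label([100, 101, 103, 99, 95], 0, 0.02, 0.02, 4)
down = triple_barrier_label([100, 99, 97, 104], 0, 0.02, 0.02, 3)
flat = triple_barrier_label([100, 100.5, 101, 100.2], 0, 0.02, 0.02, 3)
flat_signed = triple_barrier_label([100, 100.5, 101, 100.2], 0, 0.02, 0.02, 3, vertical_sign=True)
print(up, down, flat, flat_signed)
```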
    
## Side and Size Label
* Need to define the side to determine the direction of profit taking and stop loss
* Need algorithms to produce the side of transactions
* We do not want to learn both the side and the size with a single ML model
    * Primary model: Decide the side of your bets
    * Secondary model: Decide the size of your bets (Meta Labeling)
    
## Meta Labeling
* Similar to model stacking
* Helpful to achieve higher F1-scores
    * Primary Model: Determine the side with high recall
    * Secondary Model: Determine if you act or pass, focus on improving precision
* Powerful for four reasons
    1. White box
        * Allows you to build a model on top of a white box, such as a fundamental model
        * Helpful for quantamental firms
    2. Avoids overfitting
    3. More sophisticated model
        * E.g., allows you to build a model focusing on long or short positions 
    4. Able to divide decisions depending on the bet size
        * High accuracy on small bets and low accuracy on large bets ruins you
* You can add a meta-labeling layer to any primary model
* Drop under-populated labels
    * ML algorithms do not perform well on heavily imbalanced classes
    * scikit-learn bug
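
The targets for the secondary model can be sketched as binary meta-labels. The construction below is an assumption for illustration: a meta-label is 1 when the bet suggested by the primary model would have been profitable and 0 otherwise; the function name is hypothetical.

```python
import numpy as np

def meta_labels(primary_side, realized_returns):
    """1 if acting on the primary model's side was profitable, else 0."""
    side = np.asarray(primary_side)                    # +1 (long) or -1 (short)
    returns = np.asarray(realized_returns, dtype=float)
    return (side * returns > 0).astype(int)

primary_side = [1, -1, 1, -1]
realized_returns = [0.02, 0.01, -0.01, -0.03]
labels = meta_labels(primary_side, realized_returns)
print(labels)
```

A secondary classifier trained on these labels only has to decide whether to act on or pass each primary signal, which is where the focus on precision comes from.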

# Financial Data Structures
## Types of Data
* Fundamental Data
* Market Data
* Analytics
* Alternative Data

## Bars
### Standard Bars
* Time Bars: Sampled at a fixed time interval
* Tick Bars: Sampled every fixed number of ticks
* Volume Bars: Sampled every time a fixed volume has been traded
* Dollar Bars: Sampled every time a fixed dollar value has been traded
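
Tick, volume and dollar bars share the same accumulation logic, which the sketch below illustrates. It assumes a bar closes on the tick that brings the running total up to the threshold; the function name `standard_bar_ids` is hypothetical.

```python
def standard_bar_ids(amounts, bar_size):
    """Assign each tick to a bar; a bar closes once the running amount reaches bar_size."""
    bar_ids = []
    bar_id, running = 0, 0.0
    for amount in amounts:
        running += amount
        bar_ids.append(bar_id)
        if running >= bar_size:    # close the current bar and start a new one
            bar_id += 1
            running = 0.0
    return bar_ids

volumes = [5, 5, 10, 3, 3, 4, 20]
volume_bars = standard_bar_ids(volumes, bar_size=10)        # volume bars
tick_bars = standard_bar_ids([1] * 5, bar_size=2)           # tick bars: each tick counts 1
print(volume_bars, tick_bars)
```

For dollar bars, the same function can be fed price times volume per tick instead of volume alone.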

### Information-Driven Bars
* Use the following to estimate the amount of information
    * $b_t = \begin{cases}
        b_{t-1} & \text{if } \Delta p_t = 0\\
        \frac{\Delta p_t}{| \Delta p_t| } & \text{if } \Delta p_t \neq 0
      \end{cases} $
    * $T^* = \underset{T}{\arg\min} \{|\theta_T| \geq E[\theta_T]\}$ for defined $\theta_T$
* Tick Imbalanced Bars (TIB)
    * Take into consideration how many times the price changes
    * $\theta_T = \sum_{t=1}^T b_t$
    * Look at flow imbalance. If the imbalance is larger than expected, start a new bar
* Volume/Dollar Imbalanced Bars (VIB and DIB)
    * $\theta_T = \sum_{t=1}^T b_t v_t$
* Tick Runs Bars
    * Monitor the sequence of buys
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t, - \sum_{t|b_t=-1}^T b_t\}$
* Volume/Dollar Runs Bars
    * $\theta_T = \max\{ \sum_{t|b_t=1}^T b_t v_t, - \sum_{t|b_t=-1}^T b_t v_t\}$
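
The tick rule for $b_t$ and the imbalance $\theta_T$ can be sketched as below. Two simplifications are assumed: the expected imbalance is replaced by a fixed `threshold` rather than being estimated from the data, and the first tick is given the sign `initial_sign`. All names are illustrative.

```python
import numpy as np

def tick_rule(prices, initial_sign=1):
    """b_t: sign of the price change, carrying the previous sign when the price is unchanged."""
    prices = np.asarray(prices, dtype=float)
    signs = np.empty(len(prices), dtype=int)
    previous = initial_sign
    for t in range(len(prices)):
        delta = prices[t] - prices[t - 1] if t > 0 else 0.0
        if delta != 0:
            previous = 1 if delta > 0 else -1
        signs[t] = previous
    return signs

def imbalance_bar_ends(signs, threshold, volumes=None):
    """Indices where |theta_T| reaches the threshold; theta resets after each bar."""
    weights = np.ones(len(signs)) if volumes is None else np.asarray(volumes, dtype=float)
    ends, theta = [], 0.0
    for t, (b, v) in enumerate(zip(signs, weights)):
        theta += b * v             # tick imbalance (v = 1) or volume/dollar imbalance
        if abs(theta) >= threshold:
            ends.append(t)
            theta = 0.0
    return ends

prices = [10, 10, 11, 11, 12, 11, 10, 10, 9]
signs = tick_rule(prices)
ends = imbalance_bar_ends(signs, threshold=2)
print(signs, ends)
```

Passing traded volumes or dollar values as `volumes` turns the same loop into a sketch of volume or dollar imbalance bars.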
    
## Dealing with Multi-Product Series
* Example cases:
    * Model spreads with changing weights
    * Basket of securities where dividends/coupons must be reinvested
    * Basket that must be rebalanced
    * Index whose constituents changed
    * Replace an expired/matured contract/security
* Goal is to transform any complex multi-product dataset into a single dataset that resembles a total-return ETF

### ETF Trick
* Problems when trading a spread of futures
    * The spread is characterized by a vector of weights changing over time and may converge.
    * Spreads can be negative values
    * Trading times will not align exactly for all constituents
* The goal is to model a basket of futures as if it were a single non-expiring cash product
    * Changes in the series reflect PnL
    * Strictly positive
    * Shortfall is taken into consideration
    
#### Method
For instrument $i = 1, \dots, I$ at bar $t = 1, \dots, T$
* $o_{i, t}$: Raw open price
* $p_{i, t}$: Raw close price
* $\phi_{i, t}$: Exchange rate to USD
* $v_{i, t}$: Volume
* $d_{i, t}$: Dividend or coupon
    
For allocation vector $\omega_t$ rebalanced on bars $B \subseteq \{1, \dots, T\}$, the holdings $h_{i, t}$ of instrument $i$ are
* $h_{i, t} = \begin{cases}
        \frac{\omega_{i, t} K_t}{o_{i, t + 1} \phi_{i, t} \sum_i |\omega_{i, t}|} & \text{if } t \in B\\
        h_{i, t - 1} & \text{otherwise}
      \end{cases} $, where $K_t$ is the value of the basket at bar $t$
      

## Sampling Features
* Not all ML algorithms scale well with sample size, e.g., SVM
* ML works well when trained on relevant features
* Event-Based Sampling: Sample features when certain events occur, e.g., a spike in volatility
    * CUSUM (Cumulative Sum) Filter: Sample when the cumulative deviation of the target value exceeds a defined threshold
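
A symmetric CUSUM filter can be sketched as follows. It assumes deviations are measured as first differences of the series and that both cumulative sums reset to zero after an event; the function name and the threshold are illustrative.

```python
def cusum_filter(values, threshold):
    """Return indices where the cumulative up or down move exceeds the threshold."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for t in range(1, len(values)):
        diff = values[t] - values[t - 1]
        # Accumulate moves in each direction, floored and capped at zero.
        s_pos = max(0.0, s_pos + diff)
        s_neg = min(0.0, s_neg + diff)
        if s_neg < -threshold:
            s_neg = 0.0
            events.append(t)
        elif s_pos > threshold:
            s_pos = 0.0
            events.append(t)
    return events

values = [100, 101, 102, 103, 102, 100, 98, 99]
events = cusum_filter(values, threshold=2)
print(events)
```

The filter fires at index 3 after a cumulative rise of 3 and at index 5 after a cumulative fall of 3, and stays silent while the series hovers within the threshold.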