# Data-driven ML method

Instead of pricing them using logic and computations as we've explored, I'll make assumptions about the data and build data-driven models. I'll take some inspiration from [Ghassane Benrhmach, Khalil Namir, Jamal Bouyaghroumni, Abdelwahed Namir](https://www.researchgate.net/publication/353175680_FINANCIAL_TIME_SERIES_PREDICTION_USING_WAVELET_AND_ARTIFICIAL_NEURAL_NETWORK) and expand on some of their ideas. I'll summarize what they did and then the potential changes I'll make. To see how useful the product is, I'll try to do the same comparison with an ARIMA model and see how my version does.

They used a discrete wavelet transform (DWT) to decompose each price series into an approximation component plus detail components. They then trained an ANN on the coefficients and used the inverse DWT to turn the forecasts of the approximation and detail components back into a forecast of the series.
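To make the decompose, forecast, reconstruct flow concrete, here is a minimal sketch using PyWavelets. It is not the paper's code: the synthetic random-walk series, the `db4` wavelet, and the three levels are my own assumptions, and the "forecast" step is a placeholder that passes the coefficients through unchanged where a model would normally sit.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
# Synthetic random walk standing in for a price series (assumption, not real data).
prices = 50 + np.cumsum(rng.normal(size=256))

# Multilevel DWT: returns [cA_3, cD_3, cD_2, cD_1].
coeffs = pywt.wavedec(prices, "db4", level=3)

# Placeholder for the modelling step: a model would be fit on each
# coefficient series here. We simply copy the coefficients.
forecast_coeffs = [c.copy() for c in coeffs]

# Inverse DWT maps the coefficient series back to the price domain.
reconstructed = pywt.waverec(forecast_coeffs, "db4")[: len(prices)]
max_err = np.max(np.abs(reconstructed - prices))

print([len(c) for c in coeffs], max_err)
```

Note that the coefficient series get shorter at each level, which is the decimation that the next section tries to avoid.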

## Decomposition 
One simple improvement is to use a stationary wavelet transform (SWT) instead of the DWT. We can get into the details of the SWT later; for now I'll leave it at the fact that shift invariance is important, and since we may look at more volatile markets we may actually gain a lot from moving to an SWT. Every SWT coefficient series also has the same length as the input, so we get a one-to-one mapping to the time domain, which is very useful for further analysis.
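A quick way to see both properties is to compare the transform of a series with the transform of a circularly shifted copy. The sketch below uses a synthetic series and the same `db4` wavelet and three levels as the rest of the notebook (all assumptions for illustration). It relies on `pywt.swt` using periodic boundary handling, so the shift has to be circular.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=256))  # length must be a multiple of 2**level
level, wavelet = 3, "db4"

def swt_stack(sig):
    # pywt.swt returns [(cA_n, cD_n), ..., (cA_1, cD_1)]; stack to (level, 2, T).
    return np.array(pywt.swt(sig, wavelet, level=level))

coeffs = swt_stack(x)
shift = 5
shifted_coeffs = swt_stack(np.roll(x, shift))

# Shifting the input circularly shifts every coefficient series by the same amount.
swt_equivariant = np.allclose(shifted_coeffs, np.roll(coeffs, shift, axis=-1))

# For contrast, a single-level DWT is decimated, so it has fewer coefficients than samples.
cA, cD = pywt.dwt(x, wavelet)

print(coeffs.shape, swt_equivariant, len(cA))
```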

## Network 
The choice of network is interesting. We can implement a multi-scale network, which is appealing because we get shared learning across all scales. This may work to our advantage, since we might then be able to see how different details or approximations affect each other. To understand why, we have to look at how a normal encoder would work: it would simply mix features from different resolutions without regard for the context of each resolution. Think of a picture: zooming in gives sharper edges, but it is noisier and the context is harder to deduce. Zooming out gives more global context with less noise, but we lose the finer structures of the data.

In [2]:
import yfinance as yf 
import numpy as np
import pandas as pd
import pywt
import torch
from torch import nn
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
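def fetch_series(symbol="CL=F", start="2015-01-02", end="2025-01-02", col="Adj Close"):
    """
    Download a price series with yfinance. Switch `symbol` to any other ticker.
    """
    # auto_adjust=False keeps the "Adj Close" column in the result.
    df = yf.download(symbol, start=start, end=end, progress=False, auto_adjust=False)
    series = df[col]
    # Some yfinance versions return MultiIndex columns, so df[col] is a one-column frame.
    if isinstance(series, pd.DataFrame):
        series = series.iloc[:, 0]
    series = series.dropna()
    series.name = symbol
    return series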


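def swt_decompose(values: np.ndarray, level: int = 3, wavelet: str = "db4") -> np.ndarray:
    """
    Stationary Wavelet Transform. Returns an array shaped
    (channels, T) where channels = level * 2: the approximations cA_i
    followed by the details cD_i, each ordered from i = level down to 1
    (the order in which pywt.swt returns them).
    """
    # pywt.swt requires the input length to be a multiple of 2**level.
    if len(values) % (2 ** level) != 0:
        raise ValueError("len(values) must be a multiple of 2**level")
    coeffs = pywt.swt(values, wavelet, level=level)
    cA = [c[0] for c in coeffs]
    cD = [c[1] for c in coeffs]
    return np.vstack(cA + cD)  # (channels, T)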

### Dataset 

*Consider an $m$-variate time series*, say $\mathbf{x}_t = (x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(m)})\in \mathbb{R}^m$. The point is that each time step has a feature vector with the price, volume, volatility, etc. We split the series into patches of length $P$, so each patch is a matrix (2D tensor). The patch length is essentially how much data we assume to be dependent on each other. One intuitive way to see it is as how far we zoom into the picture, which gives us the different resolutions we're going to look into further. Each patch is $\mathbf{X}_i = [\mathbf{x}_{(i-1)P+1}, \dots, \mathbf{x}_{iP}]\in \mathbb{R}^{P\times m}$, where $m$ is the number of features in our vector and $i = 1, \dots, T/P$.
We then flatten each patch into a vector in $\mathbb{R}^{Pm}$. Here comes the *learning* part: we apply a linear projection whose weights are learned, $\mathbf{z}_i = \mathbf{W}_E \operatorname{vec}(\mathbf{X}_i) + \mathbf{b}_E$.
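The patching and projection step can be sketched in a few lines of PyTorch. The sizes below (`B`, `T`, `m`, `P`, `d_model`) are illustrative assumptions, and `nn.Linear` plays the role of $\mathbf{W}_E$ and $\mathbf{b}_E$. Here $\operatorname{vec}$ is taken as a row-by-row (time-major) flattening, which is one possible convention.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, m = 2, 64, 6       # batch size, time steps, features (illustrative values)
P, d_model = 8, 32       # patch length and embedding width (assumed)

x = torch.randn(B, T, m)

# Group P consecutive time steps and flatten each patch: row i is vec(X_i).
patches = x.reshape(B, T // P, P * m)

# Learned linear projection z_i = W_E vec(X_i) + b_E.
embed = nn.Linear(P * m, d_model)
z = embed(patches)       # (B, T / P, d_model)

print(patches.shape, z.shape)
```

Each row of `z` is one token, so a series of length `T` becomes `T / P` tokens.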

Now we can look at the tokens themselves. Given the set of tokens we've created, we apply three different learned linear projections into three spaces, $Q$, $K$, and $V$. There is a point to doing it this way: the query $Q$ describes which feature pattern token $i$ cares about, the key $K$ describes which feature pattern token $j$ contains, and the value $V$ holds the information stored at token $j$. We can then compute a similarity score between queries and keys, and from it $A_{ij}$, which is simply how much token $j$ contributes to token $i$.

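We define the attention matrix as $A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+ B\right) \in \mathbb{R}^{M\times M}$, where $M$ is the number of tokens in a window, $d_k$ is the dimension of the keys, and $B$ is a bias term added to the scores.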

We can mix these values by letting the new tokens be $Z' = AV$, which means that each token $i$ becomes a weighted average of the value vectors from all tokens in its own window. Finally, we map back to the original channel width with $\tilde{Z} = Z'W_O \in \mathbb{R}^{M\times C}$.
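The flow $Q, K, V \rightarrow A \rightarrow Z' \rightarrow \tilde{Z}$ for a single window can be written out directly. This is a single-head sketch with assumed sizes (`M`, `C`, `d_k`), and the bias $B$ is a zero placeholder rather than a learned term.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
M, C, d_k = 8, 32, 16    # tokens per window, channel width, key width (assumed)

Z = torch.randn(M, C)    # the tokens of one window

# Three learned projections plus the output projection W_O.
W_Q, W_K, W_V = (nn.Linear(C, d_k, bias=False) for _ in range(3))
W_O = nn.Linear(d_k, C, bias=False)
B_bias = torch.zeros(M, M)   # placeholder for the bias term B

with torch.no_grad():
    Q, K, V = W_Q(Z), W_K(Z), W_V(Z)
    # A[i, j]: how much token j contributes to token i; each row sums to 1.
    A = torch.softmax(Q @ K.T / math.sqrt(d_k) + B_bias, dim=-1)
    Z_prime = A @ V            # weighted average of value vectors, (M, d_k)
    Z_tilde = W_O(Z_prime)     # back to the original channel width, (M, C)

print(A.shape, Z_tilde.shape)
```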


*There are a lot more details and proofs behind each decision in the model. Since I'm more excited to implement things and see how well they work than to review my logic, I'll work on implementations instead.*


In [4]:
"""
We want to implement a shifted window attention for a 1-dimensional time series.
We do this by:
- patch embedding
- window self-attention
- creating two Swin-like blocks to show the flow Q, K, V -> A -> Z' (which I did prove)
- then creating a simple forecasting head
"""

def window_partition_1d(x: torch.Tensor, window_size: int):
    """Split sequence into 1D windows (picture analogy works here)
    Args:
        x: Tensor[B, L, C]
        window_size: int
    Returns:
        windows: Tensor[B, num_win, M, C] where M = window_size
    """
    B, L, C = x.shape
    assert L % window_size == 0, "Sequence length must be multiple of window." 
    # reshape (rather than view) also works when x is not contiguous in memory.
    x = x.reshape(B, L // window_size, window_size, C)
    return x  # (B, num_win, M, C)


def window_reverse_1d(windows: torch.Tensor, window_size: int):
    """Inverse of window_partition_1d."""
    B, num_win, M, C = windows.shape
    assert M == window_size, "Window dimension must match window_size."
    x = windows.reshape(B, num_win * window_size, C)
    return x  # (B, L, C)
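As a sanity check, here is a small round trip through the two helpers, plus one way to get the shifted windows: roll the sequence by half a window before partitioning and roll it back afterwards. The helpers are repeated in compact form so the snippet runs on its own, and the toy tensor and window size are assumptions. `torch.roll` wraps the start of the sequence around to the end, so the last shifted window mixes the newest and oldest tokens; Swin handles this with an attention mask, which is left out here.

```python
import torch

def window_partition_1d(x, window_size):
    B, L, C = x.shape
    return x.reshape(B, L // window_size, window_size, C)

def window_reverse_1d(windows, window_size):
    B, num_win, M, C = windows.shape
    return windows.reshape(B, num_win * window_size, C)

# Toy input: batch of 2, 16 tokens, 3 channels.
x = torch.arange(2 * 16 * 3, dtype=torch.float32).reshape(2, 16, 3)
M = 4

windows = window_partition_1d(x, M)        # (2, 4, 4, 3)
restored = window_reverse_1d(windows, M)   # (2, 16, 3)

# Shifted windows: window 0 now covers tokens 2..5 instead of 0..3.
shift = M // 2
shifted = torch.roll(x, shifts=-shift, dims=1)
shifted_windows = window_partition_1d(shifted, M)
unshifted = torch.roll(window_reverse_1d(shifted_windows, M), shifts=shift, dims=1)

print(windows.shape, torch.equal(restored, x), torch.equal(unshifted, x))
```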