# Data-driven ML method

Instead of pricing them with logic and computation, as we've explored so far, I'll make assumptions about the data and build data-driven models. I'll take some inspiration from [Ghassane Benrhmach, Khalil Namir, Jamal Bouyaghroumni, Abdelwahed Namir](https://www.researchgate.net/publication/353175680_FINANCIAL_TIME_SERIES_PREDICTION_USING_WAVELET_AND_ARTIFICIAL_NEURAL_NETWORK) and expand on some of their ideas. I'll summarize what they did and then describe the changes I might make. To see how useful the result is, I'll try to run the same comparison against an ARIMA model and see how my version does.

They used a discrete wavelet transform (DWT) to decompose each price series into an approximation component plus detail components. They then trained an artificial neural network (ANN) on the coefficients, and used the inverse DWT to get the forecast for each approximation and detail component.
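
To make the decompose-then-reconstruct step concrete, here is a minimal sketch using `pywt.wavedec` and `pywt.waverec`. It is not the paper's code: the series is synthetic, the `db4` wavelet and `level=3` are assumptions chosen to match the defaults used later in this notebook, and the ANN forecasting step is left out entirely.

```python
import numpy as np
import pywt

# Synthetic stand-in for a price series (assumption: not real market data)
rng = np.random.default_rng(0)
prices = 50 + np.cumsum(rng.normal(0, 1, 256))

# Multi-level DWT: returns [cA_3, cD_3, cD_2, cD_1]
coeffs = pywt.wavedec(prices, "db4", level=3)

# Rebuild one time-domain component per band by zeroing all other bands
components = []
for i in range(len(coeffs)):
    kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
    components.append(pywt.waverec(kept, "db4")[: len(prices)])

# The inverse DWT is linear, so the per-band components sum back to the series
reconstruction = np.sum(components, axis=0)
print(np.allclose(reconstruction, prices))
```

In a forecasting setup along these lines, each band would be predicted separately and the predictions recombined the same way.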

## Decomposition 
One simple improvement is to use a stationary wavelet transform (SWT) instead of the DWT. We can get into the details of the SWT later; for now I'll just note that shift invariance is important, and since we may look at more volatile markets, we might gain a lot from moving to an SWT. Each SWT coefficient also maps one-to-one to a point in the time domain, which is very useful for further analysis.
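
A small sketch of both properties, under a few assumptions: the signal is synthetic, the shift is circular (`np.roll`), and the DWT uses `mode="periodization"` so that both transforms treat the boundary the same way. It checks that SWT coefficients keep the input length and move along with the signal, while the DWT detail coefficients do not.

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(0, 1, 128))  # pywt.swt needs a length that is a multiple of 2**level
shifted = np.roll(x, 1)               # circular shift by one sample

# One-level SWT: coefficients have the same length as the input
cA, cD = pywt.swt(x, "db4", level=1)[0]
cA_s, cD_s = pywt.swt(shifted, "db4", level=1)[0]
swt_follows_shift = np.allclose(np.roll(cD, 1), cD_s)

# One-level DWT: coefficients are downsampled to half the length
_, dD = pywt.dwt(x, "db4", mode="periodization")
_, dD_s = pywt.dwt(shifted, "db4", mode="periodization")
dwt_follows_any_shift = any(
    np.allclose(np.roll(dD, k), dD_s) for k in range(len(dD))
)

print(len(cD), len(dD))                          # SWT keeps T samples, DWT keeps T / 2
print(swt_follows_shift, dwt_follows_any_shift)
```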

## Network 
The choice of network is interesting. We can implement a multi-scale network, which is appealing because we get shared learning across all scales. This may work to our advantage, since we might then be able to see how different details or approximations affect each other. To understand why, we have to look at how a normal encoder would work: it would simply mix features from different resolutions without regard for the context of each resolution. Think of a picture: zooming in gives sharper edges but is noisier and makes context harder to deduce, while zooming out gives more global context with less noise but loses the finer structure of the data.
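
One possible shape for such a network, as a rough sketch only. The class name `MultiScaleNet`, the layer sizes, and the choice of a grouped convolution followed by a 1x1 fusion layer are all my assumptions, not an architecture taken from the paper. The idea is that each wavelet band first gets its own filters, and only afterwards are the bands mixed.

```python
import torch
from torch import nn


class MultiScaleNet(nn.Module):
    """Hypothetical sketch: per-band encoders followed by a shared fusion head."""

    def __init__(self, n_bands: int, window: int, hidden: int = 16):
        super().__init__()
        # groups=n_bands keeps each band's filters separate (no cross-scale mixing yet)
        self.per_band = nn.Sequential(
            nn.Conv1d(n_bands, n_bands * hidden, kernel_size=3, padding=1, groups=n_bands),
            nn.ReLU(),
        )
        # A 1x1 convolution mixes information across bands at each time step
        self.fuse = nn.Sequential(
            nn.Conv1d(n_bands * hidden, hidden, kernel_size=1),
            nn.ReLU(),
        )
        # Map the fused features over the whole window to a single prediction
        self.head = nn.Linear(hidden * window, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_bands, window)
        h = self.fuse(self.per_band(x))
        return self.head(h.flatten(start_dim=1)).squeeze(-1)


model = MultiScaleNet(n_bands=6, window=32)  # 6 bands = 3 SWT levels * (cA, cD)
out = model(torch.randn(8, 6, 32))           # one prediction per sample in the batch
print(out.shape)
```

Keeping the per-band stage separate is what distinguishes this from a plain encoder that mixes all resolutions in its first layer.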

In [None]:
import yfinance as yf 
import numpy as np
import pandas as pd
import pywt
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
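def fetch_series(symbol="CL=F", start="2015-01-02", end="2025-01-02", col="Adj Close"):
    """
    Download a price series with yfinance. Change `symbol` to any ticker yfinance supports.
    """
    # auto_adjust=False keeps the "Adj Close" column (recent yfinance versions drop it by default)
    df = yf.download(symbol, start=start, end=end, progress=False, auto_adjust=False)
    if isinstance(df.columns, pd.MultiIndex):
        # Newer yfinance versions return (field, ticker) columns even for one symbol
        df.columns = df.columns.get_level_values(0)
    series = df[col].dropna()
    series.name = symbol
    return series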


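def swt_decompose(values: np.ndarray, level: int = 3, wavelet: str = "db4") -> np.ndarray:
    """
    Stationary Wavelet Transform. Returns an array shaped (channels, T),
    where channels = level * 2: rows cA_level..cA_1 followed by cD_level..cD_1.
    """
    values = np.asarray(values, dtype=float)
    if len(values) % (2 ** level) != 0:
        # pywt.swt requires the length to be a multiple of 2**level
        raise ValueError(f"len(values) must be a multiple of {2 ** level}; trim or pad first")
    coeffs = pywt.swt(values, wavelet, level=level)  # [(cA_n, cD_n), ..., (cA_1, cD_1)]
    cA = [c[0] for c in coeffs]
    cD = [c[1] for c in coeffs]
    return np.vstack(cA + cD)  # (channels, T)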

### Dataset 

*Consider an $m$-variate time series*, say $\bar{x}_t = (x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(m)})\in \mathbb{R}^m$. The point here is that at each time step we have a feature vector with the price, volume, volatility, etc.

