<a href="https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/endersgame/mean_reversion_attacker/mean_reversion_attacker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Cover](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/mid-one/cover.jpg)

# En garde, attack!

Welcome to Mid+One! Dive into the world of martingales and market dynamics.

Your challenge: spot tiny shifts in financial time-series, to predict where prices are heading.

It's all about spotting patterns in the elusive mid-price.

# The goal

## Attacking, not forecasting!

We don't want to forecast future prices, as this is extremely difficult and often requires a lot of computation. What we want is to detect a shift in the market dynamics, up or down. That's it!

This is a much simpler task and can be done with a simple model. The decision must also be computed quickly: in under 20 milliseconds.
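To get a feel for the 20 millisecond budget, here is a minimal sketch that times each call of a decision function with `time.perf_counter`. The names `worst_tick_latency` and `constant_decision` are hypothetical helpers made up for this illustration; they are not part of `midone` or `crunch`, and the official runner may measure time differently.

```python
import time

BUDGET_SECONDS = 0.020  # the 20 ms budget mentioned above


def worst_tick_latency(tick_fn, values):
    """Return the slowest single call of tick_fn over values, in seconds."""
    worst = 0.0
    for value in values:
        start = time.perf_counter()
        tick_fn(value)
        worst = max(worst, time.perf_counter() - start)
    return worst


def constant_decision(value):
    # Placeholder decision function (hypothetical): always abstains
    return 0.0


latency = worst_tick_latency(constant_decision, [float(i) for i in range(1000)])
print(f"worst tick: {latency * 1000:.3f} ms, within budget: {latency < BUDGET_SECONDS}")
```

Tracking the worst case rather than the average is a deliberate choice here: a single slow tick is what would break a per-decision time limit.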

To be precise, our attacker will consume a univariate sequence of numerical data points $x_1, x_2, \dots x_t$ and try to exploit deviations from the [martingale property](https://en.wikipedia.org/wiki/Martingale_(probability_theory)), which is to say that we expect the series $x_t$ to satisfy:
$$ E[x_{t+k}] \approx x_t $$
roughly. Of course, there's no such thing in this world as a perfect martingale and it is your job to indicate when
$$ E[x_{t+k}] > x_t + \epsilon $$
by returning a positive value, or conversely. Here $\epsilon$ can be interpreted as a trading cost. The attacker will *typically* return `0`, meaning that it thinks:
$$  x_t - \epsilon   < E[x_{t+k}] < x_t + \epsilon $$
because trading opportunities are probably on the rare side - though obviously this is problem dependent. The $\epsilon$ and $k$ (`horizon`) parameters are set [here](https://github.com/microprediction/midone/blob/main/midone/gameconfig.py).
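The decision rule above can be sketched as a small function. It assumes you already have some estimate of $E[x_{t+k}]$ (producing that estimate is the actual challenge); the function name `decide` and its arguments are illustrative only.

```python
def decide(expected_future: float, current: float, epsilon: float) -> float:
    """Map an estimate of E[x_{t+k}] to an attack decision."""
    if expected_future > current + epsilon:
        return 1.0   # expected rise is larger than the trading cost
    if expected_future < current - epsilon:
        return -1.0  # expected fall is larger than the trading cost
    return 0.0       # inside the no-trade band: do nothing


print(decide(100.5, 100.0, 0.2))  # 1.0
print(decide(100.1, 100.0, 0.2))  # 0.0
print(decide(99.5, 100.0, 0.2))   # -1.0
```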

## Scoring

The scoring is straightforward: it is the profit of a simple trading strategy. If the model predicts a move up over some short horizon (a few seconds at most), the strategy buys now and sells at the end of the horizon, and pays a transaction cost.

If we decided to go up at $t$, the profit is:
$$x_{t+k} - x_t - \epsilon$$
and if we decided to go down at $t$, it is:
$$x_t - x_{t+k} - \epsilon$$

In most cases, we don't expect the model to predict a move up or down, so it will just do nothing.
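As a worked example of the two profit formulas, here is a minimal sketch on a made-up price list. The function `attack_profit` is a hypothetical helper for illustration; the official scoring code may differ in details such as how overlapping attacks are handled.

```python
def attack_profit(x, t, k, epsilon, decision):
    """Profit of one attack made at index t and closed at index t + k."""
    if decision == 0:
        return 0.0  # no attack, no profit and no cost
    move = x[t + k] - x[t]
    # decision is +1 for up and -1 for down, so a correct call earns |move| - epsilon
    return decision * move - epsilon


prices = [100.0, 100.0, 101.0, 103.0, 102.0]
print(attack_profit(prices, t=1, k=2, epsilon=0.5, decision=1))   # 2.5
print(attack_profit(prices, t=3, k=1, epsilon=0.5, decision=1))   # -1.5
print(attack_profit(prices, t=3, k=1, epsilon=0.5, decision=-1))  # 0.5
```

Note that even a correct call loses money when the move is smaller than $\epsilon$.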

This is an example of detecting three moves up (the green bands; detected moves down would show as red bands), two resulting in a profit and one in a loss:

![Profit](docs/profit.png) 

# The data

This competition is meant to focus on single streaming series and should be quite agnostic to particular financial instruments. There is a training phase where parameters can be learned on a group of similar instruments, but overall we rely on online learning.

Let's look right away at some data.


# Setup

- Install the crunch CLI
- Use the token to get started with data and submission
- Set up the notebook

In [2]:
%pip install --upgrade crunch-cli -q
!crunch setup --notebook mid-one cobra --token bff0RrUG5HBNHqQYGVmVCrXvWU4V4Ywgk0qHSYPmV6UvewfDPrmN7Im2FF6JDMtR

import crunch
crunch = crunch.load_notebook()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
midplot 0.1.0 requires crunch-cli<5.0.0,>=4.0.3, but you have crunch-cli 5.1.3 which is incompatible.

[notice] A new release of pip is available: 23.2.1 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

---
Your token seems to have expired or is invalid.

Please follow this link to copy and paste your new setup command:
https://hub.crunchdao.com/competitions/mid-one/submit

If you think that is an error, please contact an administrator.
loaded inline runner with module: <module '__main__'>


## Load the data

Timeseries come as streams and you can get many streams, split into `train` and `test` datasets. 

A stream is a sequence of data points, each represented by a dictionary. The value of the time series is `pt["x"]`, where `pt` is a point in the series.
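To make the shape of a stream concrete, here is a minimal sketch that builds one by hand and reads the values back. The helper `to_stream` is hypothetical, and real points may carry more keys than `"x"`; only `"x"` is described here.

```python
def to_stream(values):
    """Wrap plain numbers in one-key dictionaries (hypothetical helper)."""
    return [{"x": float(v)} for v in values]


stream = to_stream([1.0, 1.5, 1.25])
# Read the series value of every point, as described above
series = [pt["x"] for pt in stream]
print(series)  # [1.0, 1.5, 1.25]
```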

In [3]:
# Load the data
x_train, x_test = crunch.load_streams()

print(f"Loaded {len(x_train)} training streams and {len(x_test)} testing streams")

data/X_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/78/X_train.parquet (7049425 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/78/X_test_reduced.parquet (405611 bytes)
data/X_test.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/78/y_train.parquet (5804278 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/78/y_test_reduced.parquet (411693 bytes)
data/y_test.parquet: already exists, file length match
data/example_prediction.parquet: download from https:crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/78/example_prediction_reduc

# Having a peek into the data

`midplot` provides a lot of cool features; one of them is the ability to visualize the data.

Running over the data sequentially like a live algorithm would do is called `replay`.

Let's have a look at the data by replaying the first stream.

In [4]:
from midplot import replay
replay(x_train[:1], with_visualization=True)

Please define the 'infer' function in the main module: for debugging, showing no attacks.


<midplot.replay.ReplayResults at 0x10ffe09e0>

Note the message about the `infer` function not being defined. Crunch will automatically pick up your attack function once `infer` is defined.

We only ran the first stream to save space. Alternatively, you can specify certain streams like this:
```python
replay(x_train, only_stream_ids=[0], with_visualization=True)
```

To run only a subset of the data, you can also specify a `start_index` and a `stop_index`:
```python
replay(x_train, only_stream_ids=[0], start_index=0, stop_index=1000, with_visualization=True)
```

### Important flexibility

`midplot` also lets you pass a regular iterable of floats, so you don't have to recreate these small dictionaries. This makes it easy to run your algorithm on your own data, which is very useful for debugging.

This is how you would attack the `log` function.

In [5]:
import numpy as np
replay(np.log(range(1, 100)), with_visualization=True)

Please define the 'infer' function in the main module: for debugging, showing no attacks.


<midplot.replay.ReplayResults at 0x138f5cb90>

# Let's attack!

Underneath, `crunch` requires an `infer` function. The syntax is not completely trivial so `midplot` provides a helper function to do this.

We only need to define an `Attacker` class, as in the other notebooks.

For demo purposes, we keep a buffer of points and detect a move when the price change in the second half of the buffer exceeds the change in the first half by some factor (a threshold that will be tuned later).

This is a measure of momentum in some way.
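The rule can be tried in isolation on a plain list before wiring it into an attacker. This is a minimal sketch: `momentum_signal` is a hypothetical standalone function, and the six-point buffers are made-up numbers chosen so the arithmetic is easy to follow.

```python
def momentum_signal(buffer, factor=2.5):
    """Return 1.0 (up), -1.0 (down) or 0.0 (nothing) for a full buffer."""
    mid = len(buffer) // 2
    first_half_change = buffer[mid - 1] - buffer[0]
    second_half_change = buffer[-1] - buffer[mid]
    # No signal when the halves disagree in direction or either half is flat
    if first_half_change * second_half_change <= 0:
        return 0.0
    # Signal only when the recent move is `factor` times larger than the earlier one
    if abs(second_half_change) > factor * abs(first_half_change):
        return 1.0 if second_half_change > 0 else -1.0
    return 0.0


print(momentum_signal([0, 1, 2, 3, 4, 14]))     # 1.0  (change of 11 vs 2)
print(momentum_signal([0, 1, 2, 3, 4, 5]))      # 0.0  (change of 2 vs 2)
print(momentum_signal([0, -1, -2, -3, -4, -14]))  # -1.0
print(momentum_signal([0, 1, 2, 3, 2, -10]))    # 0.0  (halves disagree)
```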

In [6]:
from midone import HORIZON, EPSILON, Attacker
UP, DOWN, NOTHING = 1., -1., 0.


class MomentumAttacker(Attacker):
    # We will turn this factor into a hyper-parameter!
    factor: float = 2.5

    def __init__(self):
        super().__init__()
        self.buffer = []
        self.num_points = 100

    def tick_and_predict(self, new_value: float, horizon: int = HORIZON) -> float:
        # Add new value and maintain fixed buffer size
        self.buffer.append(new_value)
        if len(self.buffer) > self.num_points:
            self.buffer.pop(0)  # Remove oldest value

        # Wait until we have enough data
        if len(self.buffer) < self.num_points:
            return NOTHING

        # Split buffer into two halves and calculate change in each half
        mid = self.num_points // 2
        first_half_change = self.buffer[mid - 1] - self.buffer[0]
        second_half_change = self.buffer[-1] - self.buffer[mid]
        # Do nothing when the two halves move in different directions
        if np.sign(first_half_change) != np.sign(second_half_change):
            return NOTHING
        # Attack only when the recent move is `factor` times larger than the earlier one
        if np.abs(second_half_change) > self.factor * np.abs(first_half_change):
            return float(np.sign(second_half_change))
        return NOTHING

        
from midplot.helpers import wrap
infer = wrap(MomentumAttacker)

This is equivalent to
```python
import typing
from typing import Any

def infer(
        stream: typing.Iterator[dict],
        hyper_parameters: Any = None,
        with_hyper_parameters_load: bool = False,
        model_directory_path: str = ".",
):
    m = MomentumAttacker()
    yield  # We are ready
    for message in stream:
        # Feed the new point and get the decision (+1, -1 or 0)
        pred = m.tick_and_predict(message["x"], horizon=HORIZON)
        yield pred

```
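The bare `yield` followed by one `yield` per message is a plain Python generator protocol. The sketch below shows how a caller could drive such a generator; `toy_infer` and `drive` are hypothetical names, and the real `crunch` runner is assumed (not confirmed) to work along similar lines.

```python
def toy_infer(stream):
    """Toy generator with the same yield protocol (not a real attacker)."""
    previous = None
    yield  # ready signal, sent before any data is consumed
    for message in stream:
        x = message["x"]
        # Naive rule for illustration: follow the direction of the last change
        decision = 0.0 if previous is None else float(x > previous) - float(x < previous)
        previous = x
        yield decision


def drive(infer_fn, points):
    """Consume the ready signal, then collect one decision per point."""
    gen = infer_fn(iter(points))
    next(gen)  # runs the setup code up to the first bare yield
    return [next(gen) for _ in points]


points = [{"x": 1.0}, {"x": 2.0}, {"x": 1.5}, {"x": 1.5}]
print(drive(toy_infer, points))  # [0.0, 1.0, -1.0, 0.0]
```

The first `next` call is why the bare `yield` must stay in place: it lets any expensive setup finish before the first data point is timed.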

## Note: what are the `hyper_parameters`, `with_hyper_parameters_load` and `model_directory_path` for?

These parameters matter in the training phase, where we optimize the hyper-parameters and save the best ones, so that they can be loaded in the inference phase for the actual submission.

It would look like this:
```python
def infer(
        stream: typing.Iterator[dict],
        hyper_parameters: Any = None,
        with_hyper_parameters_load: bool = False,
        model_directory_path: str = ".",
):
    m = MomentumAttacker()
    if hyper_parameters is not None:
        # Defined on your model
        m.update_from_hyper_parameters(hyper_parameters)
    if with_hyper_parameters_load:
        # Load from the params for final inference
        m.load_params(model_directory_path)
    yield  # Leave this here.
    for message in stream:
        pred = m.tick_and_predict(message["x"], horizon=HORIZON)
        yield pred
```

The `train` function would look something like this:
```python
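
def train(
    streams: typing.List[typing.Iterable[dict]],
    model_directory_path: str
):
    hyper_params = {}

    def loss(hyper_params):
        # Replay the training streams with the candidate parameters
        res = replay(streams, hyper_parameters=hyper_params)
        # Negate the score so that lower is better
        return -res.total_score

    # Your optimization procedure: search for the hyper_params that minimize loss(...)
    best_loss = loss(hyper_params)
    # Save the best parameters under model_directory_path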
```
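As one possible way to fill in the optimization and saving steps, here is a minimal sketch of a grid search that keeps the best candidate and writes it to a JSON file. Everything in it is an assumption for illustration: `grid_search`, `save_params_json`, the file name `params.json` and `toy_score` are made up, and `toy_score` stands in for the `total_score` of a replay.

```python
import json
import os
import tempfile


def grid_search(score_fn, candidates):
    """Return the candidate with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in candidates:
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def save_params_json(params, model_directory_path, file_name="params.json"):
    """Write the parameters as JSON inside the model directory."""
    os.makedirs(model_directory_path, exist_ok=True)
    path = os.path.join(model_directory_path, file_name)
    with open(path, "w") as f:
        json.dump(params, f)
    return path


def toy_score(params):
    # Stand-in for a replay score: peaks at factor == 2.0
    return -(params["factor"] - 2.0) ** 2


candidates = [{"factor": f} for f in (1.0, 1.5, 2.0, 2.5, 3.0)]
best, score = grid_search(toy_score, candidates)
with tempfile.TemporaryDirectory() as directory:
    path = save_params_json(best, directory)
    with open(path) as f:
        reloaded = json.load(f)
print(best, reloaded == best)
```

A grid search is the simplest choice and scales poorly with the number of hyper-parameters; any optimizer that maps candidates to scores could replace it.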



## Let's run the same replay on the first time series

The library will automatically call the `infer` function on the data points.

Let's also display the scoring for this simple algorithm.

In [7]:
replay(x_train[:1], stop_index=500, with_accounting_visualizer=True, with_visualization=True)

<midplot.replay.ReplayResults at 0x13e4d40b0>

## Scenarios

On the last replay, you can click on the graph to select some points of the time series. This lets you mark simple cases where you would like your algorithm to pick up a move up, a move down, or do nothing.

This can be very useful to constrain your training to some desired behavior. It amounts to a form of semi-supervised learning.

```python
get_replay_result().save_selected(UP)
```

In [None]:
from midplot import get_replay_result
# Run this to save the scenario
# get_replay_result().save_selected(UP)

In [8]:
from midplot import load_scenarios
movie = load_scenarios()
print(f"Loaded {len(movie.scenarios)} scenarios")

Loaded 1 scenarios




You can run all the scenarios like this:

In [9]:

from midplot.replay import load_scenarios

movie = load_scenarios()

replay(movie.streams(), horizon=HORIZON, with_accounting_visualizer=True, with_visualization=True)

<midplot.replay.ReplayResults at 0x13e459cd0>

or even check that the scenarios are successful:

In [11]:
def check_scenarios(threshold=None):
    # Replay every saved scenario and check the results
    movie = load_scenarios()
    r = replay(movie.streams())
    return r.check_scenarios(movie.scenarios, threshold=threshold)

# The helper loads the scenarios itself, so only the threshold is passed
ok, checks = check_scenarios(threshold=0.5)
print(f"Failed {len(checks.failed)} Success {len(checks.success)}")

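# Training with the `algos` helpers

The cells below rely on helper modules from an `algos` package that make training easier. Whether these helpers ship with this notebook is still to be decided.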


In [None]:
import typing
import os
from algos.diff import Diff

from algos.split import Split
from algos.momentum import Momentum
from algos.optimize import load_params
from typing import Any
from midplot import replay, get_replay_result
from algos.multi import Detector

def get_parameter_file_path(model_directory_path: str):
    return os.path.join(model_directory_path, 'resources/params.json')



def base_model():
    return Detector(Momentum)#, Diff, Split)

def infer(
        stream: typing.Iterator[dict],
        hyper_parameters: Any = None,
        with_hyper_parameters_load: bool = False,
        model_directory_path: str = ".",
):
    m = base_model()

    if hyper_parameters is not None:
        m.update_models_from_parametrization(hyper_parameters)

    if with_hyper_parameters_load:
        try:
            hyper_parameters = load_params(get_parameter_file_path(model_directory_path))
            m.update_models_from_parametrization(hyper_parameters)
        except FileNotFoundError:
            pass
    

    # Signals to the system that your attacker is initialized and ready.
    yield  # Leave this here.
    for message in stream:
        pred = m.tick(message["x"], horizon=HORIZON)
        yield pred
        

In [None]:
res = replay(x_train, only_stream_ids=[0], stop_index=5000, horizon=HORIZON, epsilon=EPSILON, with_visualization=True, with_accounting_visualizer=True)
print("Initial", res.total_score)


In [None]:
import random
from algos.optimize import SimpleOptimizer, save_params
from midplot.replay import load_scenarios
import numpy as np
from tqdm import tqdm

def check_scenarios(mov, hyper_parameters, threshold=None):
    r = replay(mov.streams(), horizon=HORIZON, epsilon=EPSILON, hyper_parameters=hyper_parameters)
    return  r.check_scenarios(mov.scenarios, threshold=threshold)


def get_stream_ids():
    return random.sample(range(0, len(x_train)), 25)

MAX_ITERATIONS = 100

def optimize_multi(model_directory_path: str = ".", budget: int = 5):
    m = base_model()
    p = m.create_parametrization()
    param_file = get_parameter_file_path(model_directory_path)
    
    optimizer = SimpleOptimizer(p)
    
    movie = load_scenarios()
    print(f"Loaded {len(movie.scenarios)} scenarios")
    
    stream_ids = get_stream_ids()
    periods = 0
    iteration = 0
    num_success_scenarios = None
    best_score = - np.inf
    
    with tqdm(total=budget, desc="Optimizing") as pbar:
        while periods < budget:
            iteration += 1
            if iteration > MAX_ITERATIONS:
                break
            x = optimizer.ask()
            _, p = x
            if periods % 10 == 0:
                # Check scenarios first
                threshold = 0.25 if num_success_scenarios is None else num_success_scenarios / len(movie.scenarios)
                ok, scen = check_scenarios(movie, p, threshold=threshold)
                if not ok:
                    print(f"Failed at scenarios at {threshold:.0%}: {scen}")
                    # Report the worst possible loss (losses are minimized: loss = -score)
                    optimizer.tell(x, np.inf)
                    if iteration > MAX_ITERATIONS:
                        raise Exception("Could not find a good solution")
                    continue
                print(f"Success at scenarios at {threshold:.0%}: {scen}")
                num_success_scenarios = len(scen.success)
            iteration = 0
            periods += 1
            r = replay(x_train, horizon=HORIZON, epsilon=EPSILON, only_stream_ids=stream_ids, hyper_parameters=p)
            # Use the negated score as the loss
            loss = - r.total_score
            if -loss > best_score:
                best_score = -loss
            # Update the progress bar
            print(f"Period {periods} Score {-loss} Best {best_score}")
            pbar.set_postfix(best_score=best_score)
            pbar.update(1)
            
            optimizer.tell(x, loss)

            recommendation = optimizer.recommend()
            _, p = recommendation
            save_params(p, param_file)

    # return recommendation.value

print("Optimizing")
optimize_multi(".", 1)
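The loop above follows an ask/tell/recommend pattern: ask for a candidate, evaluate it, tell the optimizer the loss, and finally ask for its recommendation. The sketch below shows that pattern with a tiny random search. `RandomSearch` is a hypothetical stand-in written for this illustration; it is not the `SimpleOptimizer` from `algos.optimize`, whose internals are not shown here.

```python
import random


class RandomSearch:
    """Minimal ask/tell optimizer over one float (hypothetical stand-in)."""

    def __init__(self, low, high, seed=0):
        self.rng = random.Random(seed)
        self.low, self.high = low, high
        self.best_x, self.best_loss = None, float("inf")

    def ask(self):
        # Propose a new candidate uniformly at random
        return self.rng.uniform(self.low, self.high)

    def tell(self, x, loss):
        # Remember the candidate with the lowest loss seen so far
        if loss < self.best_loss:
            self.best_x, self.best_loss = x, loss

    def recommend(self):
        return self.best_x


def loss_fn(factor):
    # Toy loss standing in for a negated replay score; minimum at 2.0
    return (factor - 2.0) ** 2


opt = RandomSearch(0.0, 5.0)
for _ in range(200):
    x = opt.ask()
    opt.tell(x, loss_fn(x))
print(opt.recommend())
```

Separating `ask` from `tell` is what allows the loop to skip or penalize a candidate, for example when it fails the scenario checks, without changing the optimizer itself.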

In [None]:
predictions = crunch.test()

### The `train` function
The canonical way to write a training procedure is to use the `streams` argument and iterate over all data points in all training streams.

In [None]:

def train(
    streams: typing.List[typing.Iterable[dict]],
    model_directory_path: str
):
    optimize_multi(model_directory_path, budget=1)


In [None]:

# Here is how you would use it on the training data
train(
    streams=x_train[:1],
    model_directory_path="resources"
)

In [None]:
predictions = crunch.test()
