[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/random-submission/random-submission.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [6]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break 5nGdJcFAORwQMvuZcvvCDPcN

you appear to have never submitted code before
data/X_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
                                
---
Success! Your environment has been correctly setup.
Next recommended actions:
1. Load the Crunch Toolings: `crunch = crunch.load_notebook()`
2. Execute the cells with your code
3. Run a test: `crunch.test()`
4. Download and submit your code to the platform!


# Your model

## Setup

In [7]:
import os
import random
import typing

# Import your dependencies
import joblib
import pandas as pd
import sklearn.metrics
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

In [8]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 6.4.4
available ram: 12.67 gb
available cpu: 2 core
----


## Data

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [9]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


### `X_train`

Index:
- `id`: the ID of the dataset
- `time`: the time step within the dataset, in arbitrary units sampled at regular intervals

Columns:
- `value`: the timeseries data
- `period`: whether the row belongs to the **initial segment** (0) or the **extension segment** (1)
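
As a minimal sketch of this layout, the snippet below builds a small synthetic frame with the same `(id, time)` index and `value`/`period` columns, then splits one series into its two segments. The frame `toy` and all of its values are made up for illustration; only the index and column names come from the description above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train: two series of 6 and 5 points (values are made up).
index = pd.MultiIndex.from_tuples(
    [(0, t) for t in range(6)] + [(1, t) for t in range(5)],
    names=["id", "time"],
)
toy = pd.DataFrame(
    {
        "value": np.linspace(-0.01, 0.01, num=11),
        "period": [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
    },
    index=index,
)

# Select one series by its id, then split it on the `period` flag.
series = toy.loc[0]
initial = series.loc[series["period"] == 0, "value"]
extension = series.loc[series["period"] == 1, "value"]

print(len(initial), len(extension))  # 4 2
```

Selecting with `toy.loc[0]` drops the `id` level, so `initial` and `extension` are indexed by `time` only.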

In [10]:
X_train

| id | time | value | period |
|---|---|---|---|
| 0 | 0 | -0.005564 | 0 |
| 0 | 1 | 0.003705 | 0 |
| 0 | 2 | 0.013164 | 0 |
| 0 | 3 | 0.007151 | 0 |
| 0 | 4 | -0.009979 | 0 |
| ... | ... | ... | ... |
| 10000 | 2134 | 0.001137 | 1 |
| 10000 | 2135 | 0.003526 | 1 |
| 10000 | 2136 | 0.000687 | 1 |
| 10000 | 2137 | 0.001640 | 1 |


### `y_train`

This is a `pandas.Series` that indicates whether each dataset `id` contains a structural breakpoint.

Index:
- `id`: the ID of the dataset

Value:
- `structural_breakpoint`: the value you need to predict
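
A short sketch of how such a series can be inspected, using a hypothetical five-element `toy_y` (the labels are made up and say nothing about the real class balance):

```python
import pandas as pd

# Hypothetical labels for five series (values are made up).
toy_y = pd.Series(
    [False, False, True, False, True],
    index=pd.Index([0, 1, 2, 3, 4], name="id"),
    name="structural_breakpoint",
)

# Share of series that contain a break (True counts as 1).
break_rate = toy_y.mean()

# Labels for a chosen subset of ids, returned in the requested order.
subset = toy_y.loc[[2, 0]]

print(break_rate)       # 0.4
print(subset.tolist())  # [True, False]
```

Indexing with `.loc` and a list of ids is a convenient way to keep labels aligned with whatever subset of series you end up building features for.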

In [6]:
y_train

| id | structural_breakpoint |
|---|---|
| 0 | False |
| 1 | False |
| 2 | True |
| 3 | False |
| 4 | False |
| ... | ... |
| 9996 | False |
| 9997 | False |
| 9998 | False |
| 9999 | False |


### `X_test`

This is a **`list` of `pandas.DataFrame`** that have the same format as [`X_train`](#X_train).

It is provided as a list to encourage you to read the records **one by one**, __which is mandatory in the [`infer()`](#infer) function__.

In [7]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [8]:
X_test[0]

| id | time | value | period |
|---|---|---|---|
| 10001 | 0 | 0.010753 | 0 |
| 10001 | 1 | -0.031915 | 0 |
| 10001 | 2 | -0.010989 | 0 |
| 10001 | 3 | -0.011111 | 0 |
| 10001 | 4 | 0.011236 | 0 |
| 10001 | ... | ... | ... |
| 10001 | 2774 | -0.013937 | 1 |
| 10001 | 2775 | -0.015649 | 1 |
| 10001 | 2776 | -0.009744 | 1 |
| 10001 | 2777 | 0.025375 | 1 |


## Implementation

### `train()`

In the training function, you build and train the model that will later make inferences on the test data. <br />
Your model must be stored in the `model_directory_path`.

In [1]:
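def fft_decompose(signal, t_ref, sample_rate=1.0):
    """Evaluate each FFT component of `signal` at the reference time `t_ref`.

    `n=512` zero-pads shorter signals and crops longer ones to their first 512 samples.
    """
    fft_result = np.fft.rfft(signal, n=512)
    A = np.abs(fft_result)      # amplitude of each frequency bin
    phi = np.angle(fft_result)  # phase of each frequency bin
    F = np.fft.rfftfreq(512, d=1.0 / sample_rate)
    return A * np.cos(2 * np.pi * F * t_ref + phi)

def extract_features(df):
    """Return the concatenated pre/post FFT features of one series, or None."""
    group = df.reset_index()  # works for both MultiIndex or regular index

    pre = group[group['period'] == 0]
    post = group[group['period'] == 1]

    # Both segments are required to build the feature vector
    if len(pre) == 0 or len(post) == 0:
        return None

    signal_pre = pre['value'].values
    signal_post = post['value'].values

    t_pre_break = pre['time'].iloc[-1]
    t_post_start = post['time'].iloc[0]

    vec_pre = fft_decompose(signal_pre, t_ref=t_pre_break)
    vec_post = fft_decompose(signal_post, t_ref=t_post_start)

    return np.concatenate([vec_pre, vec_post])

def create_input(X):
    """Build one feature row per series from a frame indexed by (id, time)."""
    features = []
    ids = []
    # Iterating over a DataFrame directly yields its column names,
    # so group the rows by the `id` index level instead.
    for series_id, df in X.groupby(level='id'):
        feat = extract_features(df)
        if feat is not None:
            features.append(feat)
            ids.append(series_id)
    return pd.DataFrame(features, index=ids)  # shape: (n_series, 2 * (512//2 + 1))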
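
To make the feature dimension concrete, here is a small sketch of what `np.fft.rfft(..., n=512)` does to inputs of different lengths. The two constant signals are made up; the FFT length matches the one used in `fft_decompose`.

```python
import numpy as np

n_fft = 512  # same FFT length as in fft_decompose

short_signal = np.ones(100)   # shorter than n_fft: zero-padded to 512 samples
long_signal = np.ones(2000)   # longer than n_fft: cropped to its first 512 samples

spec_short = np.fft.rfft(short_signal, n=n_fft)
spec_long = np.fft.rfft(long_signal, n=n_fft)

bins = n_fft // 2 + 1  # number of non-negative frequency bins
print(spec_short.shape, spec_long.shape)  # (257,) (257,)
print(2 * bins)                           # 514 features per series (pre + post)

# Cropping means samples beyond the first 512 do not influence the spectrum.
cropped_equal = np.allclose(spec_long, np.fft.rfft(long_signal[:n_fft]))
print(cropped_equal)  # True
```

One consequence worth keeping in mind: for an initial segment longer than 512 points, the spectrum reflects the start of the segment rather than the samples closest to the boundary between the two periods.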


In [4]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    X = create_input(X_train)
    # Align the labels with the feature rows that were actually built
    y = y_train.loc[X.index].astype(int).values
    # Hold out 20% of the series to drive early stopping
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = xgb.XGBClassifier(
        n_estimators=1000,
        max_depth=6,
        learning_rate=0.03,
        subsample=0.8,
        colsample_bytree=0.8,
        gamma=1.0,
        reg_alpha=0.1,
        reg_lambda=1.0,
        tree_method='hist',
        n_jobs=-1,
        random_state=42,
        eval_metric='logloss',
        # Passed to the constructor: recent XGBoost releases no longer accept it in fit()
        early_stopping_rounds=50,
        device='cuda',  # requires XGBoost >= 2.0; use 'cpu' when no GPU is available
    )
    clf.fit(
        X_fit, y_fit,
        eval_set=[(X_val, y_val)],
        verbose=10
    )

    joblib.dump(clf, os.path.join(model_directory_path, 'model.joblib'))

NameError: name 'pd' is not defined

### `infer()`

In the inference function, the trained model is loaded and used to make inferences on a sample of data that matches the characteristics of the training set.

#### Setup

Once your model is loaded, you must do a `yield` to signal it to the runner. <br />
After that you can start reading data from `X_test`.

#### Iteration

The datasets must be read **one by one** and each value must be returned with a `yield <value>`. <br />
If you try to skip this, you will get an error. <br />
All values are then concatenated into a prediction file.

**Warning: The datasets can only be iterated once!**

#### Cleanup

Code can be executed after the `for` loop if you need to persist state or do some cleanup.
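
The three phases above can be sketched with plain Python generators. Both `toy_infer` and `toy_runner` are hypothetical names used only for illustration; the real runner is internal to `crunch` and may behave differently in its details.

```python
import typing


def toy_infer(datasets: typing.Iterable[typing.List[float]]):
    # Setup: a real infer() would load its model here.
    yield  # signal readiness to the runner

    for dataset in datasets:
        # Iteration: exactly one prediction per dataset (here, simply its mean).
        yield float(sum(dataset) / len(dataset))

    # Cleanup: runs once the last dataset has been consumed.
    print("done")


def toy_runner(infer_fn, datasets):
    """Hypothetical stand-in for the runner, not the crunch implementation."""
    generator = infer_fn(iter(datasets))  # an iterator can only be consumed once
    ready = next(generator)               # the first yield is the ready signal
    assert ready is None
    return list(generator)                # the remaining yields are the predictions


predictions = toy_runner(toy_infer, [[1.0, 3.0], [2.0, 2.0, 5.0]])
print(predictions)  # [2.0, 3.0]
```

Because the datasets are handed over as a one-shot iterator, a second `for` loop over them inside `toy_infer` would see nothing, which mirrors the warning above.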

In [3]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    yield  # mark as ready

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        features = extract_features(dataset)
        if features is None:
            # One of the two segments is missing: fall back to an uninformative score
            prediction = 0.5
        else:
            # Probability of the positive class (structural break) for this series
            prediction = model.predict_proba(pd.DataFrame([features]))[0, 1]

        yield float(prediction)  # send the prediction for the current dataset

NameError: name 'typing' is not defined

## Local testing

To make sure your `train()` and `infer()` functions work properly, you can call `crunch.test()`, which reproduces the cloud environment locally. <br />
It is not a perfect replica, but it should quickly show whether your model works.

In [18]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

08:07:45 no forbidden library found
08:07:45 
08:07:45 started
08:07:45 running local test
08:07:45 internet access isn't restricted, no check will be done
08:07:45 
08:07:46 starting unstructured loop...
08:07:46 executing - command=train


data/X_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


08:07:49 duration - time=00:00:03
08:07:49 memory - before="1.31 GB" after="2.39 GB" consumed="1.08 GB"


KeyError: 'period'

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"].astype(float)

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)
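
To build intuition for what this score measures, the sketch below computes ROC AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. `pairwise_auc` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the four-sample input is made up.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    # Matrix of score differences for every (positive, negative) pair
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / diff.size)


target_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(target_toy, scores_toy))  # 0.75
```

Only the ordering of the scores matters here, so any strictly increasing rescaling of the predictions leaves the result unchanged, and a constant prediction for every dataset scores 0.5.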

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)