**Project Title:** *Predicting Short-Term Cryptocurrency Returns Using Machine Learning (G-Research Crypto Forecasting)*
 

---

## **1. Domain Background**

Cryptocurrency markets are volatile, nonlinear, and affected by strong cross-asset correlations.  Machine learning—especially gradient-boosting models like LightGBM—has become a popular approach in quantitative finance for modeling noisy, high-frequency financial data.

This project uses a dataset modeled on the **G-Research Crypto Forecasting Kaggle competition** (https://www.kaggle.com/competitions/g-research-crypto-forecasting), which provides minute-level OHLCV data and a future-return “Target” for multiple crypto assets. The objective is to predict short-term price movements, a task that can meaningfully contribute to algorithmic trading strategies even when predictive correlations are small.

---




## **2. Problem Statement**

The task is:

> **To predict short-term cryptocurrency returns (the provided “Target”) using historical price data and engineered features.**

Challenges include:

* Highly noisy target values
* Missing timestamps and asset-specific trading periods
* The need to avoid data leakage
* Non-stationary market behavior

A successful predictive model must capture meaningful patterns while remaining robust to market noise.

---

## **3. Datasets and Inputs**

The dataset follows the format of the **G-Research Crypto Forecasting** Kaggle competition but is built from more recent market data. In this project, I use a **preprocessed version** (`df`) that contains:

| Column      | Description                                       |
| ----------- | ------------------------------------------------- |
| `open_time` | Unix time (minute-level)                          |
| `symbol`    | Identifier for each asset                         |
| `close`     | Close price                                       | 
| `target`    | 15-minute future return (competition's method)    |


The `target` is calculated following the method described in the competition.
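As a rough illustration of how such a target can be derived, the sketch below follows the residualized 15-minute log-return definition from the competition's tutorial, as I understand it: the asset return `R(t) = log(P(t+16) / P(t+1))`, a weighted market return `M(t)`, a rolling beta `<M*R> / <M^2>`, and `Target = R - beta * M`. The function name `residualized_target`, the wide price layout (one column per asset), and the equal weights in the usage lines are assumptions for illustration only; the preprocessing that actually produced `df` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def residualized_target(close, weights, window=3750):
    """Sketch of a residualized 15-minute log return (assumed definition).

    close   : DataFrame of close prices, one row per minute, one column per asset
    weights : Series of asset weights indexed by the same column names
    window  : rolling window (in minutes) used to estimate beta
    """
    log_p = np.log(close)
    # R(t) = log(P(t+16) / P(t+1)): return over the next 15 minutes
    r = log_p.shift(-16) - log_p.shift(-1)
    # M(t): weighted average return across assets (NaN if any asset is missing)
    m = r.mul(weights, axis=1).sum(axis=1, skipna=False) / weights.sum()
    # beta = <M * R> / <M^2>, with <.> a rolling mean over `window` minutes
    beta = r.mul(m, axis=0).rolling(window).mean().div(
        (m ** 2).rolling(window).mean(), axis=0
    )
    # Target = R - beta * M
    return r - beta.mul(m, axis=0)

# Toy usage: two synthetic assets, equal weights (an assumption), short window
rng = np.random.default_rng(0)
toy_close = pd.DataFrame({
    "A": np.exp(np.cumsum(rng.normal(scale=0.01, size=60))),
    "B": np.exp(np.cumsum(rng.normal(scale=0.01, size=60))),
})
toy_target = residualized_target(toy_close, pd.Series({"A": 1.0, "B": 1.0}), window=5)
print(toy_target.dropna().head())
```

With a single asset the market return equals the asset return, so beta is 1 and the residual is zero, which is a quick sanity check on the formula.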
 


In [20]:
import pandas as pd
import numpy as np
df=pd.read_csv('mergee_df.csv')
df.head()

Unnamed: 0,open_time,close,symbol,target
0,1761955200,1.076,0GUSDT,-0.006433
1,1761955200,0.06807,AWEUSDT,-0.001617
2,1761955200,1.001,SNXUSDT,-0.003002
3,1761955200,0.001069,SLPUSDT,-0.001873
4,1761955200,0.1653,AXLUSDT,-0.004244


### Data Preparation

1. Align timestamps across assets
2. Forward-fill missing Close data (limit = 60 minutes)
3. Sort by timestamp
4. Generate lag-based and market-relative features
5. Remove rows with missing values
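To make steps 1, 2, and 5 concrete, here is a minimal sketch on a toy single-asset series. The toy prices and the small `limit=2` are made up for illustration (the notebook uses a 60-minute limit); the point is only to show that a short gap is filled while a long gap is left empty and then dropped.

```python
import numpy as np
import pandas as pd

# Toy single-asset series with one short gap and one long gap (timestamps in seconds)
ts = np.array([0, 60, 180, 600], dtype=np.int64)
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]}, index=ts)

# Reindex onto a complete one-minute grid, exposing the missing minutes as NaN
grid = np.arange(ts.min(), ts.max() + 60, 60)
filled = toy.reindex(grid)

# Carry the last price forward for at most `limit` consecutive missing minutes
filled["Close"] = filled["Close"].ffill(limit=2)

# Minutes that are still missing after the capped fill are removed
filled = filled.dropna(subset=["Close"])
print(filled)
```

Capping the forward fill keeps stale prices from being propagated across long outages, at the cost of leaving holes in the minute grid.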



In [21]:
import numpy as np
import pandas as pd
import lightgbm as lgb
import mlflow
import os

 

df.rename(columns={'open_time':'timestamp','symbol':'Asset_ID','close':'Close','target':'Target'},inplace=True)

 
#mlflow.set_tracking_uri('http://k8s-mlflow-mlflowin-4da4410a5a-161181029.ap-northeast-1.elb.amazonaws.com')
#mlflow.set_experiment("crypto_lgbm_v1_mydata")
# ============================
# Load Data
# ============================
use_cols = ['timestamp', 'Asset_ID', 'Close', 'Target']
dtype_map = {'timestamp': 'int32', 'Asset_ID': 'string', 'Close': 'float32', 'Target': 'float32'}

data = df[use_cols].astype(dtype_map)

print("Raw data shape:", data.shape)

data.dropna(subset=['Close', 'Target'], inplace=True)

# ============================
# Align timestamps across assets
# ============================
asset_start_times = data.groupby('Asset_ID')['timestamp'].min()
#global_start = asset_start_times.max()

#if global_start > data['timestamp'].min():
#    data = data[data['timestamp'] >= global_start].copy()

print("Aligned data shape:", data.shape)
data.sort_values(['Asset_ID', 'timestamp'], inplace=True)

# Forward fill limit = 60 minutes
ffill_limit = 60
processed_dfs = []

for asset_id, df_asset in data.groupby('Asset_ID'):
    df_asset = df_asset.sort_values('timestamp').set_index('timestamp')

    full_index = np.arange(df_asset.index.min(),
                           df_asset.index.max() + 60,
                           60, dtype=np.int32)

    df_asset = df_asset.reindex(full_index)

    df_asset['Close'] = df_asset['Close'].ffill(limit=ffill_limit)
    #df_asset['Target'] = df_asset['Target'].ffill(limit=ffill_limit)

    df_asset.dropna(subset=['Close'], inplace=True)
    df_asset['Asset_ID'] = asset_id
    processed_dfs.append(df_asset.reset_index(names='timestamp'))

data_proc = pd.concat(processed_dfs, ignore_index=True)
data_proc.sort_values('timestamp', inplace=True)
print("Processed data shape:", data_proc.shape)

Raw data shape: (16329600, 4)
Aligned data shape: (16323930, 4)
Processed data shape: (16323930, 4)


### Features Engineered

* Lag returns: `return_1m`, `return_5m`, `return_15m`, `return_30m`, `return_60m`
* Rolling trend indicators: `trend_15m`, `trend_60m`
* Cross-sectional deviations: `diff_ret_1m`, … `diff_ret_60m`
* Market price deviation: `diff_price` (optional)
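A small worked example may help show what the lag-return and cross-sectional features measure. The two-asset toy frame below is invented for illustration; the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy long-format data: two assets observed at three consecutive minutes
toy = pd.DataFrame({
    "timestamp": [0, 0, 60, 60, 120, 120],
    "Asset_ID": ["A", "B", "A", "B", "A", "B"],
    "Close": [100.0, 50.0, 110.0, 50.0, 121.0, 55.0],
})
toy["log_close"] = np.log(toy["Close"])

# Lag return: 1-minute log return computed within each asset
toy["return_1m"] = toy.groupby("Asset_ID")["log_close"].transform(
    lambda x: x - x.shift(1)
)

# Cross-sectional deviation: asset return minus the mean return at that timestamp
toy["mkt_ret_1m"] = toy.groupby("timestamp")["return_1m"].transform("mean")
toy["diff_ret_1m"] = toy["return_1m"] - toy["mkt_ret_1m"]

print(toy[["timestamp", "Asset_ID", "return_1m", "diff_ret_1m"]])
```

By construction the `diff_ret_1m` values sum to zero at each timestamp, so the feature describes how an asset moved relative to the rest of the market rather than the market move itself.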

Final dataset size after processing:
**~16.3 million rows** (16,301,250 rows × 18 columns in the run shown below; the exact count depends on filtering).

In [22]:
# ============================
# Feature Engineering
# ============================
lag_list = [1, 5, 15, 30, 60]
ma_window_list = [15, 60]

data_proc['log_close'] = np.log(data_proc['Close'])


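# Forward-looking 15-minute simple return, kept for reference only:
# it uses future prices and must not be added to feature_cols.
data_proc['ret_15m'] = data_proc.groupby('Asset_ID')['Close'].transform(
    lambda x: (x.shift(-15) - x) / x
)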

# Log returns
for L in lag_list:
    data_proc[f'return_{L}m'] = data_proc.groupby('Asset_ID')['log_close'].transform(
        lambda x: x - x.shift(L)
    )

# Rolling averages + trend
for w in ma_window_list:
    data_proc[f'avg_{w}m'] = data_proc.groupby('Asset_ID')['Close'].transform(
        lambda x: x.shift(1).rolling(window=w).mean()
    )
    data_proc[f'trend_{w}m'] = np.log(data_proc['Close'] / data_proc[f'avg_{w}m'])

# Market features
for L in lag_list:
    data_proc[f'mkt_ret_{L}m'] = data_proc.groupby('timestamp')[f'return_{L}m'].transform('mean')
    data_proc[f'diff_ret_{L}m'] = data_proc[f'return_{L}m'] - data_proc[f'mkt_ret_{L}m']

data_proc['mkt_close'] = data_proc.groupby('timestamp')['Close'].transform('mean')
data_proc['diff_price'] = data_proc['Close'] - data_proc['mkt_close']

# Drop intermediate columns
drop_cols = ['log_close'] + \
            [f'avg_{w}m' for w in ma_window_list] + \
            [f'mkt_ret_{L}m' for L in lag_list] + \
            ['mkt_close']

data_proc.drop(columns=drop_cols, inplace=True)
#['diff_price'] +
# [f'diff_ret_{L}m' for L in lag_list] +
# [f'trend_{w}m' for w in ma_window_list]
feature_cols = (
    [f'trend_{w}m' for w in ma_window_list] +
    [f'diff_ret_{L}m' for L in lag_list] +
    [f'return_{L}m' for L in lag_list]
)

data_proc.dropna(subset=feature_cols, inplace=True)
print("Final training data shape:", data_proc.shape)

Final training data shape: (16301250, 18)


## **4. Proposed Solution**

The proposed solution uses **LightGBM regression** to model the short-term price return (`Target`).
Reasons for choosing LightGBM:

* Handles nonlinear tabular patterns well
* Efficient for large datasets
* Supports early stopping and fast training
* Popular in financial competitions

### Model Pipeline

1. **Preprocess data** (timestamp alignment, forward-fill, feature generation)
2. **Use forward-chaining time-series cross-validation** (7 folds)
3. **Train LightGBM models** with parameters:

```
learning_rate = 0.05  
num_leaves = 256  
n_estimators = 10000  
early_stopping_rounds = 50  
```

4. **Evaluate models using correlation metric**
5. **Select the best fold model**
6. **Log everything to MLflow** (parameters, metrics, artifacts)

Because every fold trains only on rows that precede its validation window, the process mimics real-world forecasting and limits look-ahead leakage.
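The forward-chaining split in step 2 can be sketched in isolation as follows. The helper name `forward_chaining_folds` is hypothetical and the timestamps are toy values; the boundary logic is intended to mirror the expanding-window scheme used in this notebook (time is cut into `n_folds + 1` segments, and fold k validates on segment k + 1 while training on everything earlier).

```python
import numpy as np

def forward_chaining_folds(timestamps, n_folds):
    """Yield (train_mask, val_mask) pairs with an expanding training window."""
    unique_times = np.sort(np.unique(timestamps))
    segment_len = len(unique_times) // (n_folds + 1)
    # Validation windows start at segment boundaries 1..n_folds
    starts = [unique_times[i * segment_len] for i in range(1, n_folds + 1)]
    # Each window ends where the next begins; the last one runs to the end of the data
    ends = starts[1:] + [unique_times[-1] + 1]
    for val_start, val_end in zip(starts, ends):
        train_mask = timestamps < val_start
        val_mask = (timestamps >= val_start) & (timestamps < val_end)
        yield train_mask, val_mask

# Toy data: 80 one-minute timestamps with 3 assets per timestamp
timestamps = np.repeat(np.arange(0, 80 * 60, 60), 3)
for k, (train_mask, val_mask) in enumerate(forward_chaining_folds(timestamps, 7), 1):
    print(f"Fold {k}: train={train_mask.sum()}, val={val_mask.sum()}")
```

Splitting on unique timestamps rather than on row indices keeps all assets observed at the same minute on the same side of each boundary.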

---



 

The primary metric is:

## **Pearson Correlation Coefficient**

$$
\mathrm{corr} = \frac{\operatorname{cov}(y, \hat{y})}{\sigma_y \, \sigma_{\hat{y}}}
$$

Reasons for using correlation:

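* Getting the direction and ranking of returns right matters more than matching their exact magnitude
* RMSE is hard to interpret when the target is dominated by noise, since a near-constant prediction can reach a low RMSE without carrying any signal
* The official competition metric is a weighted version of the Pearson correlation
* Correlation is invariant to the scale and offset of the predictions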

Secondary metrics:

* Fold correlation values
* Average correlation across all CV folds

---


In [23]:
# ============================
# Prepare CV Splitting by Time
# ============================
 

X = data_proc[feature_cols]
y = data_proc['Target'].values
timestamps = data_proc['timestamp']
unique_times = np.sort(data_proc['timestamp'].unique())



n_folds = 7
n_unique = len(unique_times)
segment_len = n_unique // (n_folds + 1)

# Time boundaries
boundaries = [unique_times[0]]
for i in range(1, n_folds + 1):
    idx = min(i * segment_len, n_unique - 1)
    boundaries.append(unique_times[idx])
boundaries.append(unique_times[-1] + 1)

# ============================
# Training with MLflow Tracking
# ============================
#mlflow.end_run()
#with mlflow.start_run():
if 1:

    #mlflow.log_param("model", "lightgbm")
    #mlflow.log_param("learning_rate", 0.05)
    #mlflow.log_param("num_leaves", 256)
    #mlflow.log_param("lags", lag_list)
    #mlflow.log_param("rolling_windows", ma_window_list)
    #mlflow.log_param("n_folds", n_folds)
    #mlflow.log_param("features", feature_cols)
    fold_scores = []
    fold_models = []

    for fold in range(n_folds):
        train_start = boundaries[fold ]
        val_start = boundaries[fold + 1]
        val_end = boundaries[fold + 2]
        print('validation time from', pd.to_datetime(val_start,unit='s'), 'to', pd.to_datetime(val_end,unit='s'))
        val_mask = (timestamps >= val_start) & (timestamps < val_end)
        train_mask =  ( timestamps < val_start)# & (timestamps >= train_start)  

        X_train, y_train = X[train_mask], y[train_mask]
        X_val, y_val = X[val_mask], y[val_mask]

        print(f"Fold {fold+1}: train={len(y_train)}, val={len(y_val)}")

        model = lgb.LGBMRegressor(
            objective='regression',
            learning_rate=0.05,
            num_leaves=256,
            n_estimators=10000,
            verbose=-1
        )

        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            eval_metric='l2',
            callbacks=[lgb.early_stopping(50, verbose=False)]
        )

        y_pred = model.predict(X_val)
        corr = np.corrcoef(y_val, y_pred)[0, 1]
        fold_scores.append(corr)
        
        fold_models.append(model)  # keep each fold's fitted model for best-fold selection

        print(f"Fold {fold+1} corr = {corr:.5f}")
        #mlflow.log_metric(f"fold_{fold+1}_corr", float(corr))

        # Log the model of this fold
        #mlflow.lightgbm.log_model(model, artifact_path=f"model_fold_{fold+1}")

    best_idx = int(np.argmax(fold_scores))
    best_model = fold_models[best_idx]

    # Log the best model under the artifact path "model" so it can be registered in the MLflow registry
    #mlflow.lightgbm.log_model(best_model, artifact_path="model")



    avg_corr = float(sum(fold_scores) / len(fold_scores))
    print("Average corr:", avg_corr)
    #mlflow.log_metric("avg_corr", avg_corr)



validation time from 2025-11-04 18:50:00 to 2025-11-08 12:40:00
Fold 1: train=2037420, val=2037420
Fold 1 corr = 0.02918
validation time from 2025-11-08 12:40:00 to 2025-11-12 06:30:00
Fold 2: train=4074840, val=2037420
Fold 2 corr = 0.04608
validation time from 2025-11-12 06:30:00 to 2025-11-16 00:20:00
Fold 3: train=6112260, val=2037420
Fold 3 corr = 0.06579
validation time from 2025-11-16 00:20:00 to 2025-11-19 18:10:00
Fold 4: train=8149680, val=2037420
Fold 4 corr = 0.05149
validation time from 2025-11-19 18:10:00 to 2025-11-23 12:00:00
Fold 5: train=10187100, val=2037420
Fold 5 corr = 0.03359
validation time from 2025-11-23 12:00:00 to 2025-11-27 05:50:00
Fold 6: train=12224520, val=2037420
Fold 6 corr = 0.07105
validation time from 2025-11-27 05:50:00 to 2025-11-30 23:44:01
Fold 7: train=14261940, val=2039310
Fold 7 corr = 0.09794
Average corr: 0.05644527123118832


## **5. Benchmark Model**

To determine whether the ML model adds value, we compare it to two simple baselines.

### **Benchmark 1 — Zero Prediction**

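Predict future return = 0.

Expected correlation: **≈ 0** (strictly speaking, Pearson correlation is undefined for a constant prediction because its variance is zero, so 0 serves as the reference value)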

### **Benchmark 2 — Copy Last Return**

Predict:

```
ŷ_t = return_1m(t)
```

Expected correlation: **≈ 0.01 – 0.015**

Any model achieving **> 0.02 correlation** is considered meaningful in this domain.
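A minimal sketch of how both baselines can be scored is shown below. The helper `safe_corr` and the synthetic arrays are assumptions for illustration; in the notebook the second baseline uses the `return_1m` feature of each validation fold as the prediction.

```python
import numpy as np

def safe_corr(y_true, y_pred):
    """Pearson correlation, or NaN when either input is constant."""
    if np.std(y_true) == 0 or np.std(y_pred) == 0:
        return float("nan")
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Synthetic stand-ins for one validation fold
rng = np.random.default_rng(42)
target = rng.normal(scale=0.005, size=500)
last_return = rng.normal(scale=0.001, size=500)

# Benchmark 1: constant zero prediction (correlation is undefined, reported as NaN)
zero_score = safe_corr(target, np.zeros_like(target))

# Benchmark 2: the latest 1-minute return is used directly as the forecast
copy_score = safe_corr(target, last_return)

print("zero prediction:", zero_score)
print("copy last return:", copy_score)
```

Guarding the constant case explicitly avoids the runtime warning that `np.corrcoef` raises when one input has zero variance.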

---


 

In [None]:
 
# Benchmark 2: copy the last 1-minute return as the prediction
fold_scores = []  # reset so the LightGBM fold scores are not averaged in

for fold in range(n_folds):
    val_start = boundaries[fold + 1]
    val_end = boundaries[fold + 2]
    print('validation time from', pd.to_datetime(val_start,unit='s'), 'to', pd.to_datetime(val_end,unit='s'))
    val_mask = (timestamps >= val_start) & (timestamps < val_end)
    train_mask = (timestamps < val_start)

    X_train, y_train = X[train_mask], y[train_mask]
    X_val, y_val = X[val_mask], y[val_mask]

    print(f"Fold {fold+1}: train={len(y_train)}, val={len(y_val)}")

    # No model is fitted: the latest observed return is the forecast
    y_pred = X_val['return_1m']
    corr = np.corrcoef(y_val, y_pred)[0, 1]
    fold_scores.append(corr)

    print(f"Fold {fold+1} corr = {corr:.5f}")

avg_corr = float(sum(fold_scores) / len(fold_scores))
print("Average corr:", avg_corr)

validation time from 2025-11-04 18:50:00 to 2025-11-08 12:40:00
Fold 1: train=2037420, val=2037420
Fold 1 corr = -0.02023
validation time from 2025-11-08 12:40:00 to 2025-11-12 06:30:00
Fold 2: train=4074840, val=2037420
Fold 2 corr = -0.03279
validation time from 2025-11-12 06:30:00 to 2025-11-16 00:20:00
Fold 3: train=6112260, val=2037420
Fold 3 corr = -0.03735
validation time from 2025-11-16 00:20:00 to 2025-11-19 18:10:00
Fold 4: train=8149680, val=2037420
Fold 4 corr = -0.04934
validation time from 2025-11-19 18:10:00 to 2025-11-23 12:00:00
Fold 5: train=10187100, val=2037420
Fold 5 corr = -0.03055
validation time from 2025-11-23 12:00:00 to 2025-11-27 05:50:00
Fold 6: train=12224520, val=2037420
Fold 6 corr = -0.04817
validation time from 2025-11-27 05:50:00 to 2025-11-30 23:44:01
Fold 7: train=14261940, val=2039310
Fold 7 corr = -0.04992
Average corr: -0.0383 (mean of the seven fold correlations above)


## **Conclusion**

This project applies machine learning techniques to a real-world financial forecasting problem. By combining:

* Feature engineering
* Time-series validation
* Gradient-boosted models
* MLflow experiment tracking

the project aims to build a reproducible, production-grade forecasting pipeline. Even a small improvement in predictive correlation can be valuable in algorithmic trading.