# 1. Importing Libraries & Loading the Data

In this section, we import the Python libraries used for data analysis, visualization, and modeling, then load the training and test datasets from the provided `.parquet` files.

These datasets contain minute-level trading data for the crypto market, and our objective is to **predict future market price movements**, represented by the `label` column in `train.parquet`.

We’ll begin by loading and inspecting the shape and structure of both datasets.


In [None]:
# Core data handling libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning tools
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
from lightgbm import early_stopping, log_evaluation

# Set default aesthetics for plots
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
train = pd.read_parquet('/kaggle/input/drw-crypto-market-prediction/train.parquet')
test = pd.read_parquet('/kaggle/input/drw-crypto-market-prediction/test.parquet')

# Inspect dimensions and structure
print(f"✅ Training set shape: {train.shape}")
print(f"✅ Test set shape: {test.shape}")
display(train.head())

# Output Analysis

- `train.shape` and `test.shape` show us how many rows and features each dataset has.
- Use `train.head()` to preview the first few rows — this includes:
  - Timestamps (minute-level index),
  - Market signals (`bid_qty`, `ask_qty`, etc.),
  - 780 anonymized features (`X1` to `X780`),
  - Target: `label` (price movement to predict).
- This gives us a foundational understanding of the structure before we explore deeper.

Observed output:

- The training set has 525,886 rows and 786 columns.
- The test set has 538,150 rows and the same 786 columns.
- The `label` column is at the end.
- Market features such as `bid_qty`, `ask_qty`, and `buy_qty` are present.
- Anonymized feature columns run from `X1` to `X780`.
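The column layout described above can also be checked programmatically instead of by eye. The sketch below is a minimal illustration on a tiny stand-in frame; `split_columns` is a hypothetical helper (not part of the original notebook), and it assumes anonymized columns are named `X` followed by digits.

```python
import re
import pandas as pd

# Tiny stand-in frame with the same kinds of columns as the training data
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 7.0, 0.1, -0.2, 0.3, 0.05]],
    columns=['bid_qty', 'ask_qty', 'buy_qty', 'sell_qty', 'volume',
             'X1', 'X2', 'X3', 'label'],
)

def split_columns(df, target='label'):
    """Group columns into market, anonymized (X<digits>), and target presence."""
    anonymized = [c for c in df.columns if re.fullmatch(r'X\d+', c)]
    market = [c for c in df.columns if c not in anonymized and c != target]
    return market, anonymized, target in df.columns

market_cols, anon_cols, has_label = split_columns(toy)
print(market_cols, len(anon_cols), has_label)
```

Running the same helper on `train` and `test` would show whether both frames really share the same column groups.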

# 2. Dataset Summary & Missing Values Check

In this section, we examine:
- Summary statistics for key market and anonymized features
- Any missing values in the training dataset

This helps us decide if we need data cleaning or imputation steps before modeling.


In [None]:
# Select key market columns and sample anonymized columns
market_columns = ['bid_qty', 'ask_qty', 'buy_qty', 'sell_qty', 'volume']
anonymized_sample = [f'X{i}' for i in range(1, 6)]  # Sample only first 5 anonymized features

# Summary statistics
summary_stats = train[market_columns + anonymized_sample + ['label']].describe().T

# Check for missing values
missing_values = train.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

# Display outputs
display(summary_stats)
display(missing_values if not missing_values.empty else "No missing values found.")

## Summary Statistics – Observations

- **Market Features**:
  - `bid_qty`, `ask_qty`, `buy_qty`, `sell_qty`, `volume` show significant variance, especially `volume` (std ~588 vs mean ~264), indicating volatility typical of crypto markets.
  - All quantities start from a minimum of 0 or near 0, which makes sense for sparse minutes.
  - Some extreme max values (e.g., `volume` > 28,000 and `ask_qty` > 1,300) may represent outliers or high-activity minutes.

- **Anonymized Features (X1–X5)**:
  - All means are close to zero with standard deviations close to 1, suggesting they are likely **standardized features**.
  - Values range widely — for example, `X3` ranges from ~-7.4 to ~8.5.
  - None of the sampled features has zero variance, so none can be dropped as constant.

- **Target Variable `label`**:
  - Mean is close to 0 — consistent with predicting change in market price.
  - The range is wide, from -24.41 to +20.74. Skew and kurtosis are not computed in this cell, so heavy tails are suggested here rather than measured.
  - If the distribution is indeed **heavy-tailed**, robust losses such as MAE or Huber may be worth trying.

- No missing values were found, so we can proceed without imputation.
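`describe()` does not report skewness or kurtosis, so the heavy-tail reading above can be checked directly. The following is a minimal sketch on synthetic data (a normal sample and a Student-t sample standing in for `label`); `tail_summary` is a hypothetical helper name. On the real data the same call would be `tail_summary(train['label'])`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a normal sample and a heavy-tailed Student-t sample
normal_sample = pd.Series(rng.standard_normal(100_000))
heavy_sample = pd.Series(rng.standard_t(df=5, size=100_000))

def tail_summary(series):
    """Return skewness and excess kurtosis (about 0 for a normal distribution)."""
    return {'skew': series.skew(), 'excess_kurtosis': series.kurt()}

normal_stats = tail_summary(normal_sample)
heavy_stats = tail_summary(heavy_sample)
print(normal_stats)
print(heavy_stats)
```

An excess kurtosis well above zero indicates heavier tails than a normal distribution, which is the property that would motivate a robust loss.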

---

Next, we’ll explore the **distribution of the label** and its correlation with selected features.


# 3. Target Variable Exploration

In this section, we:

- Visualize the distribution of the `label`, which is the value we’re predicting.
- Analyze how the `label` correlates with market activity features (`bid_qty`, `ask_qty`, `volume`, etc.) and a few anonymized features.

This helps us understand:
- Whether the target is skewed.
- Which features may have predictive power based on linear correlation.

In [None]:
# Plot the distribution of the target label
plt.figure(figsize=(10, 5))
sns.histplot(train['label'], bins=100, kde=True, color='orange')
plt.title("Distribution of Target Variable: label", fontsize=14)
plt.xlabel("Label Value")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

# Correlation with label (only a few select features to keep it readable)
selected_features = ['bid_qty', 'ask_qty', 'buy_qty', 'sell_qty', 'volume'] + [f'X{i}' for i in range(1, 6)]
correlation_with_label = (
    train[selected_features + ['label']].corr()['label']
    .drop('label')  # exclude the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)

# Display all 10 correlations, from most positive to most negative
display(correlation_with_label)

## Target Variable Insights

### 🔹 Label Distribution:
- The `label` (target) is **tightly centered around zero**, forming a sharp peak.
- This distribution suggests:
  - Most price movements are small in magnitude (micro-changes per minute).
  - Only a few data points represent extreme market movements.
- It has **heavy tails**, indicating the presence of outliers — something to consider for model robustness.

### 🔹 Feature Correlations:
- All correlations with the `label` are **very weak**, with values close to 0 (both positive and negative).
- Highest absolute correlations:
  - `X2` (0.0157), `X3` (0.0122), and `sell_qty` (0.0112).
- This suggests:
  - **Linear correlations are low**, but that doesn't rule out **non-linear relationships**.
  - We'll rely on models that can detect **complex patterns** (e.g., tree-based methods).
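To illustrate why a near-zero Pearson correlation does not rule out a usable relationship, here is a small synthetic sketch (the data is invented for illustration and is unrelated to the competition features): `y` depends on `x` only through `x ** 2`, so the linear correlation is close to zero while the dependence is strong.

```python
import numpy as np

rng = np.random.default_rng(0)

# y is a noisy function of x squared: strongly dependent, but not linearly
x = rng.standard_normal(50_000)
y = x ** 2 + 0.1 * rng.standard_normal(50_000)

# Pearson correlation between x and y (the linear view)
linear_corr = np.corrcoef(x, y)[0, 1]

# Pearson correlation after the right non-linear transform of x
squared_corr = np.corrcoef(x ** 2, y)[0, 1]

print(f"corr(x, y)    = {linear_corr:.4f}")
print(f"corr(x^2, y)  = {squared_corr:.4f}")
```

Tree-based models can split on both tails of `x` and so can pick up this kind of pattern without the transform being supplied by hand.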


Next, we’ll engineer a few interpretable features like bid-ask spread, order imbalance, and volume ratios to improve signal for modeling.


# 4. Feature Engineering – Creating Derived Market Features

Since the anonymized features (`X1–X780`) offer no direct interpretability, we enhance the dataset with interpretable features derived from:
- Market liquidity
- Supply-demand imbalance
- Execution volume ratios

These features can help models detect trading pressure, imbalance, or volatility spikes, all of which may relate to price movement (`label`).

In [None]:
# Copy the dataframe to preserve original
train_fe = train.copy()

# 1. Bid-ask quantity difference (named "spread" here, but built from quantities, not prices)
train_fe['bid_ask_spread'] = train_fe['ask_qty'] - train_fe['bid_qty']

# 2. Buy-Sell Volume Imbalance
train_fe['volume_imbalance'] = (train_fe['buy_qty'] - train_fe['sell_qty']) / (train_fe['buy_qty'] + train_fe['sell_qty'] + 1e-6)

# 3. Log-transformed volume (log1p is defined at zero volume)
train_fe['log_volume'] = np.log1p(train_fe['volume'])

# 4. Ratio-based features
train_fe['buy_to_volume_ratio'] = train_fe['buy_qty'] / (train_fe['volume'] + 1e-6)
train_fe['sell_to_volume_ratio'] = train_fe['sell_qty'] / (train_fe['volume'] + 1e-6)

# Display new feature distributions
new_features = ['bid_ask_spread', 'volume_imbalance', 'log_volume', 'buy_to_volume_ratio', 'sell_to_volume_ratio']
display(train_fe[new_features].describe().T)

## Engineered Feature Summary

We've derived 5 new interpretable features from raw market signals:

### 🔹 `bid_ask_spread`
- Measures the difference between seller ask and buyer bid volumes.
- High standard deviation and wide range (from -1096 to +1143) — some extreme mismatches suggest unusual trading conditions or outliers.

### 🔹 `volume_imbalance`
- Reflects supply-demand pressure: values closer to ±1 indicate one-sided trading (either strong buying or strong selling).
- Median close to 0 → market is usually balanced.

### 🔹 `log_volume`
- Log-transform of volume to reduce skew and outlier impact.
- Appears well-behaved and scaled for modeling.

### 🔹 `buy_to_volume_ratio` & `sell_to_volume_ratio`
- Normalized measures of directional trading pressure.
- Median ~0.5 for both — consistent with a mostly balanced market.

These engineered features add intuitive signals beyond the anonymized `X` variables and will be part of our final model.
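Because the same five transformations must later be applied to the test set, one option is to wrap them in a single function so the two frames cannot drift apart. This is a sketch, not the notebook's own code: `add_market_features` and `EPS` are hypothetical names, and the formulas mirror the cell above.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # small constant guarding against division by zero

def add_market_features(df):
    """Return a copy of df with the five engineered market features added."""
    out = df.copy()
    out['bid_ask_spread'] = out['ask_qty'] - out['bid_qty']
    traded = out['buy_qty'] + out['sell_qty']
    out['volume_imbalance'] = (out['buy_qty'] - out['sell_qty']) / (traded + EPS)
    out['log_volume'] = np.log1p(out['volume'])
    out['buy_to_volume_ratio'] = out['buy_qty'] / (out['volume'] + EPS)
    out['sell_to_volume_ratio'] = out['sell_qty'] / (out['volume'] + EPS)
    return out

# Toy rows: one active minute and one minute with no activity at all
toy = pd.DataFrame({
    'bid_qty': [2.0, 0.0], 'ask_qty': [5.0, 0.0],
    'buy_qty': [3.0, 0.0], 'sell_qty': [1.0, 0.0],
    'volume': [4.0, 0.0],
})
toy_fe = add_market_features(toy)
print(toy_fe.T)
```

With such a helper, `train_fe = add_market_features(train)` and `test_fe = add_market_features(test)` would be guaranteed to use identical definitions.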

# 5. Baseline LightGBM Model – Time-Aware Validation

We’ll now build a baseline regression model using **LightGBM**.

Key considerations:
- We use a **time-based split** to avoid data leakage.
- Only a **subset of features** is used: the engineered ones and a few anonymized features.
- The target variable is `label`, which is continuous.

This baseline gives us a starting point for model performance and later hyperparameter tuning.
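A single chronological cutoff gives one validation estimate. As a hedged alternative, scikit-learn's `TimeSeriesSplit` (imported earlier) yields several folds in which the training rows always precede the validation rows. The sketch below uses a toy array of 100 ordered rows; the fold count of 4 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 rows assumed to be sorted by time, oldest first
n_rows = 100
X_toy = np.arange(n_rows).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_toy))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every training index is earlier than every validation index
    print(f"fold {fold}: train 0..{train_idx.max()}, "
          f"val {val_idx.min()}..{val_idx.max()}")
```

Averaging the validation metric over such folds tends to be less sensitive to one unusual market period than a single 80/20 cut.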


In [None]:
# Step 1: Feature selection
feature_cols = [
    'bid_ask_spread', 'volume_imbalance', 'log_volume',
    'buy_to_volume_ratio', 'sell_to_volume_ratio',
    'X1', 'X2', 'X3', 'X4', 'X5'
]

# Step 2: Time-based split
train_cutoff = int(len(train_fe) * 0.8)
X_train, X_val = train_fe[feature_cols].iloc[:train_cutoff], train_fe[feature_cols].iloc[train_cutoff:]
y_train, y_val = train_fe['label'].iloc[:train_cutoff], train_fe['label'].iloc[train_cutoff:]

# Step 3: Train model with callbacks
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05, random_state=42)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='rmse',
    callbacks=[
        early_stopping(stopping_rounds=20),
        log_evaluation(period=50)  # Logs every 50 iterations
    ]
)

# Step 4: Evaluate
y_pred = model.predict(X_val)
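# The `squared=False` argument is deprecated in recent scikit-learn releases, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_val, y_pred))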
print(f"📉 Validation RMSE: {rmse:.5f}")

## Baseline LightGBM Results

- Model trained with early stopping after just **5 boosting rounds**.
- **Validation RMSE = 1.04047**: the root mean squared error on the held-out final 20% of the training rows. It penalizes large errors more heavily than a plain average of absolute errors would.
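An RMSE value is easier to judge against a naive reference. The sketch below defines RMSE explicitly and scores a constant predictor that always outputs the training mean; the data is synthetic, and `rmse` is a small helper written here for illustration. On the notebook's data, the analogous comparison would use `y_train.mean()` as the prediction for every row of `y_val`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)
y_train_toy = rng.standard_normal(800)  # stand-in for the training labels
y_val_toy = rng.standard_normal(200)    # stand-in for the validation labels

# Naive reference: predict the training mean for every validation row
naive_pred = np.full_like(y_val_toy, y_train_toy.mean())
naive_rmse = rmse(y_val_toy, naive_pred)
print(f"Naive constant-prediction RMSE: {naive_rmse:.5f}")
```

If a model's validation RMSE is barely below this naive figure, the model has learned little beyond the mean of the target.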

### Observations:
- The model converged quickly, which may imply:
  - The feature signal is limited in this subset.
  - With a learning rate of 0.05 and a patience of 20 rounds, validation RMSE stops improving before the model has learned much.
- Only **10 features** were used — we’re keeping it intentionally simple to benchmark future improvements.
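The interaction between patience and the reported best round can be sketched in plain Python. This is a simplified illustration of the early-stopping idea for a lower-is-better metric, not LightGBM's actual implementation; `simulate_early_stopping` and the toy scores are hypothetical.

```python
def simulate_early_stopping(val_scores, patience):
    """Return (best_round, best_score, stopped_at) for a lower-is-better metric.

    Rounds are 1-based. Training stops once `patience` consecutive rounds
    pass without a new best score.
    """
    best_round, best_score, stopped_at = 0, float('inf'), 0
    for round_idx, score in enumerate(val_scores, start=1):
        stopped_at = round_idx
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            break
    return best_round, best_score, stopped_at

# Toy validation curve: improves once, stalls, then would improve again later
toy_scores = [1.00, 0.90, 0.95, 0.96, 0.97, 0.50]
result = simulate_early_stopping(toy_scores, patience=3)
print(result)
```

Here the best round is 2 and training stops at round 5, so the later improvement at round 6 is never seen. A noisy validation curve can trigger the same effect, which is one reason to try a larger patience before concluding that the features carry no signal.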

This result establishes a baseline. We’ll now apply this model to the test set and generate a submission file.


# 6. Generate Test Predictions & Submission File

In this step:
- We apply the trained LightGBM model to the test set using the same selected features.
- We then construct a valid `submission.csv` in the format required by the competition:
  - Column 1: `ID` (from test set)
  - Column 2: `prediction` (our model's output)

This file can now be submitted to Kaggle for evaluation on the private test set.
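Before uploading, it can help to confirm that the frame matches the expected layout. The check below is a minimal sketch: `check_submission` is a hypothetical helper, and the only format facts it relies on are the two column names given above (`ID`, `prediction`) and the expected row count.

```python
import numpy as np
import pandas as pd

def check_submission(submission, n_expected):
    """Return a list of problems found; an empty list means the frame looks valid."""
    problems = []
    if list(submission.columns) != ['ID', 'prediction']:
        problems.append(f"unexpected columns: {list(submission.columns)}")
    if len(submission) != n_expected:
        problems.append(f"expected {n_expected} rows, got {len(submission)}")
    if 'prediction' in submission.columns:
        values = submission['prediction'].to_numpy(dtype=float)
        if not np.isfinite(values).all():
            problems.append("prediction contains NaN or infinite values")
    return problems

good = pd.DataFrame({'ID': [1, 2, 3], 'prediction': [0.1, -0.2, 0.0]})
bad = pd.DataFrame({'ID': [1, 2, 3], 'prediction': [0.1, np.nan, 0.0]})

good_problems = check_submission(good, n_expected=3)
bad_problems = check_submission(bad, n_expected=3)
print(good_problems)
print(bad_problems)
```

In the notebook, the call would be `check_submission(sample_submission, n_expected=len(test))` just before writing the CSV.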


In [None]:
# Step 1: Load sample submission
sample_submission = pd.read_csv('/kaggle/input/drw-crypto-market-prediction/sample_submission.csv')

# Create engineered features in the test set
test_fe = test.copy()

test_fe['bid_ask_spread'] = test_fe['ask_qty'] - test_fe['bid_qty']
test_fe['volume_imbalance'] = (test_fe['buy_qty'] - test_fe['sell_qty']) / (test_fe['buy_qty'] + test_fe['sell_qty'] + 1e-6)
test_fe['log_volume'] = np.log1p(test_fe['volume'])
test_fe['buy_to_volume_ratio'] = test_fe['buy_qty'] / (test_fe['volume'] + 1e-6)
test_fe['sell_to_volume_ratio'] = test_fe['sell_qty'] / (test_fe['volume'] + 1e-6)

# Step 2: Apply same features to test set
X_test = test_fe[feature_cols]

# Step 3: Predict
test_preds = model.predict(X_test)

# Step 4: Fill submission dataframe
sample_submission['prediction'] = test_preds

# Step 5: Save to CSV
sample_submission.to_csv('submission.csv', index=False)
print("submission.csv file created successfully.")

In [None]:
sample_submission.head()

# Final Thoughts

We’ve successfully built a complete pipeline to predict crypto price movements using minute-level market data:

### 🔹 Summary of Steps:
1. **Data Loading** from `.parquet`
2. **EDA**: Distribution analysis & correlation checks
3. **Feature Engineering**: Bid-ask spread, volume imbalance, etc.
4. **Modeling**: Baseline LightGBM with time-based validation
5. **Test Prediction** & Submission File Creation



# Model Improvement Step 1: Feature Expansion

We now expand our feature set by including the first 100 anonymized features (`X1` to `X100`) in addition to the engineered market features.

This gives the model more candidate signal, although extra features only help if they carry information about `label`.


In [None]:
# Define new feature set
engineered_features = [
    'bid_ask_spread', 'volume_imbalance', 'log_volume',
    'buy_to_volume_ratio', 'sell_to_volume_ratio'
]

anonymized_features = [f'X{i}' for i in range(1, 101)]

feature_cols = engineered_features + anonymized_features

# Split again
X_train, X_val = train_fe[feature_cols].iloc[:train_cutoff], train_fe[feature_cols].iloc[train_cutoff:]
y_train, y_val = train_fe['label'].iloc[:train_cutoff], train_fe['label'].iloc[train_cutoff:]

# Retrain model
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='rmse',
    callbacks=[
        early_stopping(stopping_rounds=20),
        log_evaluation(period=50)
    ]
)

# Evaluate
y_pred = model.predict(X_val)
# Take the root explicitly; `squared=False` is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"📉 Expanded Feature Set - Validation RMSE: {rmse:.5f}")

## Result After Expanding Features to X1–X100

- 📉 Validation RMSE: **1.04086**
- 🔁 Early stopping selected just 1 iteration, even fewer than the baseline's 5 rounds.
- 🔬 Despite adding 100 anonymized features + engineered ones (105 total), validation RMSE is marginally worse than the baseline (1.04086 vs 1.04047), so there is **no performance gain**.

### Possible Reasons:
- Many of the `X` features may be uninformative or noisy.
- The model may be underfitting because early stopping halts training after very few rounds.
- Better results likely require:
  - **Feature selection or dimensionality reduction** (e.g. PCA)
  - **Hyperparameter tuning** to fully utilize signal
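As one simple form of feature selection, features can be ranked by the absolute value of their correlation with the target and only the top few kept. The sketch below runs on synthetic data in which two of six features are informative by construction; `top_k_by_abs_corr` is a hypothetical helper. This filter only sees linear relationships, and on time-ordered data the ranking should be computed on the training portion only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Six synthetic features; only X2 and X5 influence the label
features = pd.DataFrame(
    rng.standard_normal((n, 6)),
    columns=[f'X{i}' for i in range(1, 7)],
)
label = 0.5 * features['X2'] - 0.3 * features['X5'] + rng.standard_normal(n)

def top_k_by_abs_corr(X, y, k):
    """Return the k column names with the largest |Pearson correlation| to y."""
    scores = X.corrwith(y).abs().sort_values(ascending=False)
    return scores.head(k).index.tolist()

selected = top_k_by_abs_corr(features, label, k=2)
print(selected)
```

A model-based ranking, such as LightGBM feature importances from a fitted model, would be a natural next step when non-linear effects are suspected.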


# Model Improvement Step 2: LightGBM Hyperparameter Tuning with Optuna

Now we tune the LightGBM model using **Optuna**, an automatic hyperparameter optimization library.

### Why Optuna?
- Efficient exploration of parameter space
- Learns from past trials to suggest better future configurations
- Finds combinations that manual tuning often misses

We’ll tune over 30 trials using time-based validation RMSE as the optimization target.
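Several parameters in the objective that follows (`learning_rate`, `reg_alpha`, `reg_lambda`) are suggested with `log=True`, meaning values are drawn on a logarithmic scale. The NumPy sketch below illustrates only that log-scale idea with plain random sampling; it is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw values whose logarithm is uniform between log(low) and log(high)."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

rng = np.random.default_rng(0)
rates = sample_log_uniform(0.005, 0.1, size=10_000, rng=rng)

# On a log scale, half of the draws fall below the geometric mean of the bounds
geo_mean = np.sqrt(0.005 * 0.1)
share_below_geo_mean = float(np.mean(rates < geo_mean))
print(f"geometric mean of bounds: {geo_mean:.5f}")
print(f"share of draws below it:  {share_below_geo_mean:.3f}")
```

Sampling this way spends as many trials between 0.005 and about 0.022 as between about 0.022 and 0.1, which suits parameters whose effect depends on order of magnitude.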


In [None]:
import optuna
from lightgbm import LGBMRegressor

def objective(trial):
    params = {
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        # Note: in LightGBM, subsample only takes effect when subsample_freq > 0 (default is 0)
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'random_state': 42,
        'n_jobs': -1
    }

    model = LGBMRegressor(**params)

    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='rmse',
        callbacks=[
            early_stopping(stopping_rounds=20),
            log_evaluation(period=0)
        ]
    )

    preds = model.predict(X_val)
    # Take the root explicitly; `squared=False` is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    return rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)

print(f"🔍 Best RMSE: {study.best_value:.5f}")
print("✅ Best Params:")
print(study.best_params)

## Hyperparameter Tuning Summary

- **Best RMSE achieved**: 1.03933
- Tuned using 30 Optuna trials
- **Best parameters** found:
  - `learning_rate`: 0.0676
  - `num_leaves`: 97
  - `max_depth`: 4
  - `min_child_samples`: 98
  - `subsample`: 0.5186
  - `colsample_bytree`: 0.6457
  - `reg_alpha`: 0.1215
  - `reg_lambda`: 0.1979

These hyperparameters gave a modest improvement over the baseline (validation RMSE 1.03933 vs 1.04047). We’ll now use them to train the final model and make test predictions.


In [None]:
# Use best parameters from Optuna
best_params = {
    'n_estimators': 1000,
    'learning_rate': 0.06762745224761675,
    'num_leaves': 97,
    'max_depth': 4,
    'min_child_samples': 98,
    'subsample': 0.5185830564077009,
    'colsample_bytree': 0.6456671540143484,
    'reg_alpha': 0.12148076904003635,
    'reg_lambda': 0.19786770408218737,
    'random_state': 42,
    'n_jobs': -1
}

# Train on the full dataset (no validation set here, so no early stopping: all 1000 trees are fit)
X_full = train_fe[feature_cols]
y_full = train_fe['label']

final_model = lgb.LGBMRegressor(**best_params)
final_model.fit(X_full, y_full)

# Predict on test
X_test = test_fe[feature_cols]
final_preds = final_model.predict(X_test)

# Load submission template
sample_submission = pd.read_csv('/kaggle/input/drw-crypto-market-prediction/sample_submission.csv')
sample_submission['prediction'] = final_preds

# Save submission file
sample_submission.to_csv('/kaggle/working/submission.csv', index=False)
print("Final submission file saved to /kaggle/working/submission.csv")