## Multi-Bar Classification

Description:
Here, we classify whether price will move up, down, or remain neutral over a fixed horizon (e.g., 5 bars). We label each bar with −1,0,+1 based on future returns exceeding positive/negative thresholds. A classifier (e.g., RandomForestClassifier) then learns to predict these discrete classes, which we translate into trading signals.

#### 📌 Important Note:
This notebook contains *interactive charts generated using Vectorbt.  
GitHub does not display interactive Plotly charts, so the graphs will not be visible here.  

✅ To view the charts, please download this notebook and run it on your local machine.  
Make sure you have Vectorbt and its dependencies installed to regenerate the visualizations.


## Part 1: Data & Feature Engineering

**Objective:**  
Load raw price data (MetaTrader 5 or CSV) and transform it into a feature-rich dataset.

**Tasks:**
- Fetch historical bars  
- Apply `ta.add_all_ta_features` or custom features  
- (Optionally) create specific labels (multi-bar, double-barrier, regime, etc.)  
- Clean/prepare the final feature matrix **X** and target **y**  

In [1]:
import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Adjust if your notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5

# If using vectorbt
import vectorbt as vbt

# Our modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
from models.model_training import (
    select_features_rf_reg,
    walk_forward_splits
)
from backtests.simple_backtest import simulate_trading, calculate_sharpe_ratio

# Sklearn / Models
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Suppose you have a multi-bar labeling function
from features.labeling_schemes import create_labels_multi_bar  # or define inline
from sklearn.naive_bayes import GaussianNB, BernoulliNB

###########################################################
# 1) Data Loading & Basic Feature Engineering
###########################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    # Fetch 2000 bars from an earlier period for training
    data = get_data_mt5(symbol="US30", timeframe=mt5.TIMEFRAME_H4, n_bars=5000, start_pos=5000)
    mt5.shutdown()

df = add_all_ta_features(data)

###########################################################
# 2) Multi-Bar Labeling Function
###########################################################
def create_labels_multi_bar(df, horizon=5, threshold=0.005):
    """
    Creates classification labels for multi-bar horizon:
      +1 if future return >= threshold
      -1 if future return <= -threshold
       0 otherwise
    df must have a 'close' column.
    Returns a new DataFrame with:
      'future_return_h' and 'multi_bar_label'
    """
    df_copy = df.copy()
    
    # 1) Horizon-based future returns
    df_copy["future_return_h"] = df_copy["close"].pct_change(periods=horizon).shift(-horizon)
    
    # 2) Classification labels
    df_copy["multi_bar_label"] = 0
    df_copy.loc[df_copy["future_return_h"] >= threshold, "multi_bar_label"] = 1
    df_copy.loc[df_copy["future_return_h"] <= -threshold, "multi_bar_label"] = -1
    
    # 3) Drop rows where future_return_h is NaN
    df_copy.dropna(subset=["future_return_h"], inplace=True)
    
    return df_copy

df_lbl = create_labels_multi_bar(df, horizon=5, threshold=0.005)

# Prepare X, y
X = df_lbl.drop(columns=["multi_bar_label", "future_return_h"])
y = df_lbl["multi_bar_label"]

###########################################################
# 3) Walk-Forward Splits
###########################################################
folds = walk_forward_splits(X, y, n_splits=3)
print(f"Number of folds created: {len(folds)}")

###########################################################
# 4) Define Classification Models
###########################################################
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42),
    "SVC": SVC(C=1.0, kernel='rbf', probability=True),
    "XGBClassifier": XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "LGBMClassifier": LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "GaussianNB": GaussianNB(),  # <-- Added Bayesian Classification Model
    "BernoulliNB": BernoulliNB() # <-- Another Bayesian Model (good for binary data)
}

###########################################################
# 5) Loop Over Folds + Simple Backtest
###########################################################
fold_results = {}

for fold_i, (X_train_fold, y_train_fold, X_test_fold, y_test_fold) in enumerate(folds, start=1):
    print(f"\n===== Fold {fold_i} =====")
    
    # We must shift labels from [-1, 0, 1] to [0, 1, 2] for XGBoost & co.
    # SHIFT: -1 -> 0, 0 -> 1, +1 -> 2
    y_train_fold_shifted = y_train_fold + 1
    y_test_fold_shifted = y_test_fold + 1

    # Feature selection with a classifier
    rf_for_fs = RandomForestClassifier(n_estimators=100, random_state=42)
    # Use the SHIFTED y_train for feature selection
    X_train_sel, selected_idx = select_features_rf_reg(
        X_train_fold, y_train_fold_shifted, estimator=rf_for_fs, max_features=20
    )
    feats = X_train_fold.columns[selected_idx]
    print(f"Selected features for Fold {fold_i}: {feats.tolist()}")

    X_test_sel = X_test_fold[feats]

    # Scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_sel)
    X_test_scaled = scaler.transform(X_test_sel)

    fold_results[fold_i] = {}

    for model_name, model in models.items():
        # 1) Fit on SHIFTED y
        model.fit(X_train_scaled, y_train_fold_shifted)
        
        # 2) Predict SHIFTED labels
        preds_shifted = model.predict(X_test_scaled)
        
        # 3) Shift back: 0->-1, 1->0, 2->+1
        preds = preds_shifted - 1

        # Evaluate Accuracy on the unshifted test labels
        acc = accuracy_score(y_test_fold, preds)

        # Convert classification => signals
        signals = preds  # signals in {-1, 0, +1}

        # Align with the test portion
        df_test_fold = df_lbl.loc[X_test_fold.index].copy()

        # Simple backtest with cost
        daily_returns, total_return = simulate_trading(signals, df_test_fold, cost=0.0002)
        sr = calculate_sharpe_ratio(np.array(daily_returns))

        fold_results[fold_i][model_name] = {
            "Accuracy": acc,
            "TotalReturn": total_return,
            "Sharpe": sr
        }

###########################################################
# 6) Print Results
###########################################################
for fold_i, model_dict in fold_results.items():
    print(f"\n=== Fold {fold_i} Results ===")
    for model_name, stats in model_dict.items():
        acc = stats["Accuracy"]
        ret = stats["TotalReturn"]
        sr = stats["Sharpe"]
        print(f"{model_name}: ACC={acc:.2f}, Return={ret:.2f}%, Sharpe={sr:.2f}")


Number of folds created: 3

===== Fold 1 =====
Selected features for Fold 1: ['trend_adx_neg', 'trend_adx', 'volatility_atr', 'volume_adi', 'momentum_pvo_signal', 'volume_nvi', 'volatility_bbw', 'trend_ema_slow', 'volume_vpt', 'volatility_bbm', 'trend_mass_index', 'momentum_ppo_signal', 'volatility_dcw', 'trend_sma_slow', 'volume_cmf', 'trend_dpo', 'trend_kst_sig', 'trend_cci', 'volatility_ui', 'close']


  File "c:\Users\moham\miniconda3\envs\ml\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "c:\Users\moham\miniconda3\envs\ml\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\moham\miniconda3\envs\ml\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "c:\Users\moham\miniconda3\envs\ml\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000578 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 1248, number of used features: 20
[LightGBM] [Info] Start training from score -1.523495
[LightGBM] [Info] Start training from score -0.655407
[LightGBM] [Info] Start training from score -1.336284

===== Fold 2 =====
Selected features for Fold 2: ['volatility_atr', 'volume_vpt', 'volatility_kcw', 'volatility_ui', 'volatility_dcw', 'volume_nvi', 'volatility_bbw', 'volume_adi', 'momentum_pvo_signal', 'volatility_bbl', 'momentum_ppo_signal', 'trend_mass_index', 'momentum_uo', 'trend_adx', 'trend_adx_neg', 'volume_cmf', 'trend_stc', 'momentum_ppo', 'trend_adx_pos', 'trend_kst_diff']
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000348 se

## Part 2: Model Training & Hyperparameter Tuning

**Objective:**  
Train an ML model (e.g., RandomForest, XGBoost) on the engineered features to predict the chosen labels.

**Tasks:**
- Perform time-based or walk-forward splits  
- Select top features if desired (e.g., using RandomForest feature importance)  
- Use `RandomizedSearchCV` or `GridSearchCV` to find optimal hyperparameters  
- Save the best model pipeline (e.g., `best_rf_pipeline.pkl`) 

Fine Tuning with RandomizedSearchCV

In [1]:
# Code 2: Hyperparameter Tuning for Chosen Classification Model

import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Adjust if your notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5
import joblib

# Sklearn / Models
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.pipeline import Pipeline

# Your modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
# Suppose you have a multi-bar labeling function
from features.labeling_schemes import create_labels_multi_bar  # or define inline

###########################################################
# 1) DATA LOADING & FEATURE ENGINEERING
###########################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    # Fetch 2000 bars from an earlier period for training
    data = get_data_mt5(symbol="US30", timeframe=mt5.TIMEFRAME_H4, n_bars=5000, start_pos=5000)
    mt5.shutdown()

df = add_all_ta_features(data)

# Create classification labels
# e.g., horizon=5, threshold=0.005 => ±0.5% over 5 bars
df_lbl = create_labels_multi_bar(df, horizon=5, threshold=0.005)

# Now we have columns: 'future_return_h' and 'multi_bar_label' in df_lbl
X_full = df_lbl.drop(columns=["multi_bar_label", "future_return_h"])
y_full = df_lbl["multi_bar_label"]

# SHIFT LABELS from [-1,0,+1] => [0,1,2]
# so the classifier won't complain about negative labels
y_full_shifted = y_full + 1  # -1->0, 0->1, +1->2

print("Unique classes in y_full:", y_full.unique())
print("Unique classes in y_full_shifted:", y_full_shifted.unique())

# Ensure chronological order if needed
# X_full = X_full.sort_index()
# y_full_shifted = y_full_shifted.loc[X_full.index]

###########################################################
# 2) DEFINE YOUR TRAIN PORTION
###########################################################
# e.g., first 80% for tuning
split_idx = int(len(X_full)*0.8)
X_tune = X_full.iloc[:split_idx].copy()
y_tune_shifted = y_full_shifted.iloc[:split_idx].copy()

print(f"Tuning portion size: {len(X_tune)}")

###########################################################
# 3) TIME-BASED CV (TimeSeriesSplit)
###########################################################
tscv = TimeSeriesSplit(n_splits=3)

# We'll define a scoring for classification
# e.g. "accuracy"
scorer = make_scorer(accuracy_score)

###########################################################
# 4) BUILD A PIPELINE
###########################################################
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42))
])

###########################################################
# 5) DEFINE PARAM DISTRIBUTIONS FOR RandomForestClassifier
###########################################################
param_distributions = {
    "clf__n_estimators": [100, 200, 300],
    "clf__max_depth": [None, 5, 10, 15],
    "clf__min_samples_split": [2, 5, 10],
    "clf__max_features": ["auto", "sqrt", 0.5]
}

###########################################################
# 6) SET UP RandomizedSearchCV
###########################################################
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=10,               # how many random combos
    scoring=scorer,          # 'accuracy' metric
    cv=tscv,                 # time-based folds
    random_state=42,
    n_jobs=-1,
    verbose=2
)

###########################################################
# 7) FIT ON TUNING PORTION
###########################################################
random_search.fit(X_tune, y_tune_shifted)

print("Best params:", random_search.best_params_)
print("Best score (accuracy):", random_search.best_score_)

best_estimator = random_search.best_estimator_

###########################################################
# 8) SAVE THE BEST ESTIMATOR
###########################################################
joblib.dump(best_estimator, "models/saved_models/best_rf_mb_pipeline.pkl")
print("Saved best estimator to 'best_rf_mb_pipeline.pkl'")


Unique classes in y_full: [ 0  1 -1]
Unique classes in y_full_shifted: [1 2 0]
Tuning portion size: 3996
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best params: {'clf__n_estimators': 100, 'clf__min_samples_split': 2, 'clf__max_features': 0.5, 'clf__max_depth': 5}
Best score (accuracy): 0.4627961294627962
Saved best estimator to 'best_rf_mb_pipeline.pkl'


## Part 3: Backtesting & Performance Evaluation

**Objective:**  
Evaluate how well the trained model performs on unseen data, simulating real trades.

**Tasks:**
- Use walk-forward or expanding splits to mimic “live” conditions  
- Convert model predictions to signals ([-1, 0, +1] or buy/sell/hold)  
- Run a simple backtest script or VectorBT for performance metrics  
- Calculate returns, Sharpe ratio, drawdowns, confusion matrix, etc.  
- Visualize results (equity curve, trades, etc.) to judge strategy viability  

In [None]:
import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5
import vectorbt as vbt
import joblib

# Our modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
from features.labeling_schemes import create_labels_multi_bar  # multi-bar labeling

# Sklearn
from sklearn.metrics import accuracy_score

###########################################################
# 1) DATA LOADING & FEATURE ENGINEERING
###########################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    # Fetch 5000 most recent bars for backtesting
    data = get_data_mt5(symbol="US30", timeframe=mt5.TIMEFRAME_H4, n_bars=5000, start_pos=0)
    mt5.shutdown()

# Add technical features
df = add_all_ta_features(data)

# Create multi-bar classification labels (e.g., horizon=5, threshold=0.005)
df_lbl = create_labels_multi_bar(df, horizon=5, threshold=0.005)

# Separate features and labels
X = df_lbl.drop(columns=["multi_bar_label", "future_return_h"])
y = df_lbl["multi_bar_label"]

# Shift labels from [-1,0,+1] → [0,1,2] for classifier
y_shifted = y + 1

###########################################################
# 2) LOAD PRE-TRAINED CLASSIFICATION MODEL (NO RETRAINING)
###########################################################
best_pipeline = joblib.load("models/saved_models/best_rf_mb_pipeline.pkl")
print("Loaded best classification model from 'best_rf_mb_pipeline.pkl'")

# 1) Identify columns used during training
trained_columns = best_pipeline["scaler"].feature_names_in_  # adjust if your pipeline step name differs

# 2) Subset X to match these columns
X_test = X[trained_columns]

# 3) Predict on new data
preds_shifted = best_pipeline.predict(X_test)

# Convert predictions back to [-1, 0, +1]
preds = preds_shifted - 1

# Compute accuracy against true labels (also shifted back)
y_true_unshifted = y_shifted - 1
accuracy = accuracy_score(y_true_unshifted, preds)
print(f"\nOut-of-Sample Accuracy: {accuracy:.4f}")

###########################################################
# 3) CONVERT PREDICTIONS TO SIGNALS & BACKTEST
###########################################################
# +1 => buy, -1 => sell, 0 => no position
signals = preds

print("\nRunning Full Backtest on the Last 2000 Bars...")

close_prices = df_lbl["close"]
if len(signals) < len(close_prices):
    signals = np.append(signals, [0] * (len(close_prices) - len(signals)))

signals_s = pd.Series(signals, index=close_prices.index)

fees = 0.0002  # 0.02% transaction cost per trade

pf = vbt.Portfolio.from_signals(
    close_prices,
    entries=signals_s > 0,
    exits=signals_s < 0,
    init_cash=10000,
    freq='4H',
    fees=fees
)

total_return = pf.total_return()
sharpe_ratio = pf.sharpe_ratio()

print("\nFull Backtest Results:")
print(f"Accuracy={accuracy:.2f}, Return={total_return:.2f}%, Sharpe={sharpe_ratio:.2f}")
print(pf.stats())

# Optional: Plot the backtest
fig = pf.plot()
fig.show()


Loaded best classification model from 'best_rf_mb_pipeline.pkl'

Out-of-Sample Accuracy: 0.5439

Running Full Backtest on the Last 2000 Bars...

Full Backtest Results:
Accuracy=0.54, Return=0.30%, Sharpe=1.21
Start                               2021-12-01 16:00:00
End                                 2025-02-28 00:00:00
Period                                832 days 12:00:00
Start Value                                     10000.0
End Value                                  12971.402323
Total Return [%]                              29.714023
Benchmark Return [%]                          24.515943
Max Gross Exposure [%]                            100.0
Total Fees Paid                              236.291366
Max Drawdown [%]                              13.645737
Max Drawdown Duration                 295 days 04:00:00
Total Trades                                         56
Total Closed Trades                                  56
Total Open Trades                                     0
Open Tr