## Double-Barrier Labeling

Description:
For each bar, we set an upper and a lower price barrier relative to that bar's close (e.g., ±0.5%) and check which barrier is touched first within a fixed horizon (e.g., the next 20 bars). If the upper barrier is touched first, label = +1; if the lower barrier is touched first, label = -1; if neither is touched within the horizon, label = 0. This method (inspired by López de Prado's triple-barrier labeling) targets directional moves of a minimum size rather than raw returns.
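
The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the project's `create_labels_double_barrier`: the function name `label_double_barrier` is hypothetical, and it assumes barriers are checked against closing prices only (a real implementation might use highs and lows instead). Bars near the end of the series have fewer than `horizon` future bars and are labeled from whatever data remains.

```python
import numpy as np
import pandas as pd


def label_double_barrier(close: pd.Series, up: float = 0.005,
                         down: float = 0.005, horizon: int = 20) -> pd.Series:
    """Label each bar +1, -1, or 0 by which barrier a later close touches first.

    Hypothetical helper for illustration; uses closes only.
    """
    prices = close.to_numpy(dtype=float)
    labels = np.zeros(len(prices), dtype=int)
    for i in range(len(prices)):
        upper = prices[i] * (1.0 + up)
        lower = prices[i] * (1.0 - down)
        # Scan forward at most `horizon` bars and stop at the first touch.
        for p in prices[i + 1 : i + 1 + horizon]:
            if p >= upper:
                labels[i] = 1
                break
            if p <= lower:
                labels[i] = -1
                break
    return pd.Series(labels, index=close.index, name="barrier_label")


close = pd.Series([100.0, 100.2, 100.6, 99.0, 98.9, 99.0])
labels = label_double_barrier(close, up=0.005, down=0.005, horizon=3)
print(labels.tolist())  # [1, -1, -1, 0, 0, 0]
```

The nested loop is O(n × horizon), which is fine for a few thousand bars; a vectorized version would be worth considering for larger datasets.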

#### 📌 Important Note:
This notebook contains interactive charts generated using Vectorbt.  
GitHub does not display interactive Plotly charts, so the graphs will not be visible here.  

✅ To view the charts, please download this notebook and run it on your local machine.  
Make sure you have Vectorbt and its dependencies installed to regenerate the visualizations.


## Part 1: Data & Feature Engineering

**Objective:**  
Load raw price data (MetaTrader 5 or CSV) and transform it into a feature-rich dataset.

**Tasks:**
- Fetch historical bars  
- Apply `ta.add_all_ta_features` or custom features  
- (Optionally) create specific labels (multi-bar, double-barrier, regime, etc.)  
- Clean/prepare the final feature matrix **X** and target **y**  
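
As a rough sketch of the final cleaning step, the snippet below drops rows with missing or infinite indicator values (common during indicator warm-up periods) and splits the result into **X** and **y**. The helper name `prepare_xy` is hypothetical and the toy data is made up; only the `barrier_label` column name comes from this notebook.

```python
import numpy as np
import pandas as pd


def prepare_xy(df: pd.DataFrame, label_col: str = "barrier_label"):
    """Return (X, y) with infinite/missing rows removed and indices aligned.

    Hypothetical helper for illustration.
    """
    # Treat +/-inf like missing values, then drop any incomplete row.
    clean = df.replace([np.inf, -np.inf], np.nan).dropna()
    X = clean.drop(columns=[label_col])
    y = clean[label_col].astype(int)
    return X, y


demo = pd.DataFrame({
    "close": [100.0, 101.0, 102.0, 103.0],
    "momentum_roc": [np.nan, 1.0, np.inf, 0.97],
    "barrier_label": [1, -1, 0, 1],
})
X_demo, y_demo = prepare_xy(demo)
print(X_demo.index.tolist(), y_demo.tolist())  # [1, 3] [-1, 1]
```

Dropping rows before splitting keeps **X** and **y** on the same index, which matters later when predictions are aligned back to prices.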

In [None]:
# Double-Barrier Labeling Classification Example

import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Adjust if your notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5

# If using vectorbt
import vectorbt as vbt

# Our modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
from models.model_training import (
    select_features_rf_reg,
    walk_forward_splits
)
from backtests.simple_backtest import simulate_trading, calculate_sharpe_ratio
from features.labeling_schemes import create_labels_double_barrier  # or define inline



# Sklearn / Models
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.naive_bayes import GaussianNB, BernoulliNB


###########################################################
# 1) Data Loading & Basic Feature Engineering
###########################################################
if not mt5.initialize():
    # Stop here: without a connection, `data` would be undefined below
    raise RuntimeError("Failed to initialize MT5")
data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=2000, start_pos=2000)
mt5.shutdown()

df = add_all_ta_features(data)

###########################################################
# 2) Double-Barrier Labeling Function
###########################################################


df_lbl = create_labels_double_barrier(df, up=0.005, down=0.005, horizon=20)

# Prepare X, y
X = df_lbl.drop(columns=["barrier_label"])
y = df_lbl["barrier_label"]

###########################################################
# 3) Walk-Forward Splits
###########################################################
folds = walk_forward_splits(X, y, n_splits=3)
print(f"Number of folds created: {len(folds)}")

###########################################################
# 4) Define Classification Models
###########################################################
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42),
    "SVC": SVC(C=1.0, kernel='rbf', probability=True),
    "XGBClassifier": XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "LGBMClassifier": LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "GaussianNB": GaussianNB(),  # Gaussian naive Bayes on the scaled features
    "BernoulliNB": BernoulliNB() # Binarizes features at 0 by default, so it only sees the sign of each scaled feature
}
###########################################################
# 5) Loop Over Folds + Simple Backtest
###########################################################
fold_results = {}

for fold_i, (X_train_fold, y_train_fold, X_test_fold, y_test_fold) in enumerate(folds, start=1):
    print(f"\n===== Fold {fold_i} =====")

    # SHIFT labels from [-1,0,+1] => [0,1,2]
    y_train_fold_shifted = y_train_fold + 1
    y_test_fold_shifted = y_test_fold + 1

    # Feature selection with a classifier
    rf_for_fs = RandomForestClassifier(n_estimators=100, random_state=42)
    X_train_sel, selected_idx = select_features_rf_reg(
        X_train_fold, y_train_fold_shifted, estimator=rf_for_fs, max_features=20
    )
    feats = X_train_fold.columns[selected_idx]
    print(f"Selected features for Fold {fold_i}: {feats.tolist()}")

    X_test_sel = X_test_fold[feats]

    # Scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_sel)
    X_test_scaled = scaler.transform(X_test_sel)

    fold_results[fold_i] = {}

    for model_name, model in models.items():
        # Train on SHIFTED labels
        model.fit(X_train_scaled, y_train_fold_shifted)
        
        # Predict SHIFTED
        preds_shifted = model.predict(X_test_scaled)
        
        # SHIFT back: 0->-1, 1->0, 2->+1
        preds = preds_shifted - 1

        # Evaluate Accuracy vs. unshifted test labels
        acc = accuracy_score(y_test_fold, preds)

        # Convert classification => signals in {-1,0,+1}
        signals = preds

        # Align with test portion
        df_test_fold = df_lbl.loc[X_test_fold.index].copy()

        # Simple backtest with cost
        daily_returns, total_return = simulate_trading(signals, df_test_fold, cost=0.0002)
        sr = calculate_sharpe_ratio(np.array(daily_returns))

        fold_results[fold_i][model_name] = {
            "Accuracy": acc,
            "TotalReturn": total_return,
            "Sharpe": sr
        }

###########################################################
# 6) Print Results
###########################################################
for fold_i, model_dict in fold_results.items():
    print(f"\n=== Fold {fold_i} Results ===")
    for model_name, stats in model_dict.items():
        acc = stats["Accuracy"]
        ret = stats["TotalReturn"]
        sr = stats["Sharpe"]
        print(f"{model_name}: ACC={acc:.2f}, Return={ret:.2f}%, Sharpe={sr:.2f}")


Number of folds created: 3

===== Fold 1 =====
Selected features for Fold 1: ['trend_ema_slow', 'trend_sma_slow', 'volatility_bbl', 'trend_visual_ichimoku_b', 'volume_vpt', 'trend_ichimoku_base', 'volatility_bbm', 'trend_sma_fast', 'momentum_pvo_signal', 'volatility_kcl', 'volume_adi', 'trend_visual_ichimoku_a', 'momentum_roc', 'volatility_dcm', 'volume_vwap', 'momentum_kama', 'trend_ichimoku_a', 'trend_ema_fast', 'volatility_dcl', 'volatility_dcw']




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000188 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1513
[LightGBM] [Info] Number of data points in the train set: 250, number of used features: 20
[LightGBM] [Info] Start training from score -1.671313
[LightGBM] [Info] Start training from score -0.946750
[LightGBM] [Info] Start training from score -0.858022

===== Fold 2 =====
Selected features for Fold 2: ['volume_nvi', 'trend_ema_slow', 'trend_visual_ichimoku_b', 'volatility_bbl', 'momentum_pvo_signal', 'close', 'trend_sma_slow', 'volume_obv', 'volume_vpt', 'volume_adi', 'trend_ichimoku_base', 'volatility_dcm', 'volatility_kcp', 'trend_visual_ichimoku_a', 'volatility_dch', 'volatility_kch', 'momentum_rsi', 'volatility_kcl', 'volatility_dcl', 'trend_ichimoku_a']
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000197 seconds.
You can set `force_col_wise=true`

## Part 2: Model Training & Hyperparameter Tuning

**Objective:**  
Train an ML model (e.g., RandomForest, XGBoost) on the engineered features to predict the chosen labels.

**Tasks:**
- Perform time-based or walk-forward splits  
- Select top features if desired (e.g., using RandomForest feature importance)  
- Use `RandomizedSearchCV` or `GridSearchCV` to find optimal hyperparameters  
- Save the best model pipeline (e.g., `best_rf_pipeline.pkl`) 
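
To make the walk-forward idea concrete, here is a minimal sketch of an expanding-window splitter. It is not the project's `walk_forward_splits`; the name `expanding_walk_forward` and the equal-sized fold layout are assumptions. It returns tuples in the same `(X_train, y_train, X_test, y_test)` order that the fold loop in Part 1 unpacks.

```python
import numpy as np
import pandas as pd


def expanding_walk_forward(X: pd.DataFrame, y: pd.Series, n_splits: int = 3):
    """Build chronological folds where each train set ends before its test set.

    Hypothetical helper: the train window grows by one block per fold, and
    the last test block absorbs any leftover rows.
    """
    n = len(X)
    fold_size = n // (n_splits + 1)
    folds = []
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        test_end = n if k == n_splits else train_end + fold_size
        folds.append((
            X.iloc[:train_end], y.iloc[:train_end],
            X.iloc[train_end:test_end], y.iloc[train_end:test_end],
        ))
    return folds


X_demo = pd.DataFrame({"f": np.arange(10)})
y_demo = pd.Series(np.arange(10) % 3 - 1)
demo_folds = expanding_walk_forward(X_demo, y_demo, n_splits=3)
print([(len(tr), len(te)) for tr, _, te, _ in demo_folds])  # [(2, 2), (4, 2), (6, 4)]
```

Because every test block lies strictly after its train block, no future bar leaks into training, which is the property that a shuffled K-fold split would break.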

In [None]:
# Code 2: Hyperparameter Tuning for Chosen Classification Model using Double-Barrier Labeling

import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Adjust if your notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5
import joblib

# Sklearn / Models
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.pipeline import Pipeline

# Your modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
# Import the double-barrier labeling function
from features.labeling_schemes import create_labels_double_barrier  # or define inline

###########################################################
# 1) DATA LOADING & FEATURE ENGINEERING
###########################################################
if not mt5.initialize():
    # Stop here: without a connection, `data` would be undefined below
    raise RuntimeError("Failed to initialize MT5")
data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=2000, start_pos=2000)
mt5.shutdown()

df = add_all_ta_features(data)

# Create classification labels using double-barrier approach
# e.g., up=0.005, down=0.005 => ±0.5% barriers, horizon=20 bars
df_lbl = create_labels_double_barrier(df, up=0.005, down=0.005, horizon=20)

# Now we have a 'barrier_label' in {-1, 0, +1}
X_full = df_lbl.drop(columns=["barrier_label"])
y_full = df_lbl["barrier_label"]

# SHIFT LABELS from [-1,0,+1] => [0,1,2]
# RandomForest itself accepts negative labels; the shift keeps the encoding
# consistent with Part 1, where recent XGBoost versions expect classes 0..K-1
y_full_shifted = y_full + 1  # -1->0, 0->1, +1->2

print("Unique classes in y_full:", y_full.unique())
print("Unique classes in y_full_shifted:", y_full_shifted.unique())

# Ensure chronological order if needed
# X_full = X_full.sort_index()
# y_full_shifted = y_full_shifted.loc[X_full.index]

###########################################################
# 2) DEFINE YOUR TRAIN PORTION
###########################################################
# e.g., first 80% for tuning
split_idx = int(len(X_full)*0.8)
X_tune = X_full.iloc[:split_idx].copy()
y_tune_shifted = y_full_shifted.iloc[:split_idx].copy()

print(f"Tuning portion size: {len(X_tune)}")

###########################################################
# 3) TIME-BASED CV (TimeSeriesSplit)
###########################################################
tscv = TimeSeriesSplit(n_splits=3)

# We'll define a scoring for classification
scorer = make_scorer(accuracy_score)

###########################################################
# 4) BUILD A PIPELINE
###########################################################
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42))
])

###########################################################
# 5) DEFINE PARAM DISTRIBUTIONS FOR RandomForestClassifier
###########################################################
param_distributions = {
    "clf__n_estimators": [100, 200, 300],
    "clf__max_depth": [None, 5, 10, 15],
    "clf__min_samples_split": [2, 5, 10],
    # "auto" was removed in scikit-learn 1.3; for classifiers it meant "sqrt"
    "clf__max_features": ["sqrt", "log2", 0.5]
}

###########################################################
# 6) SET UP RandomizedSearchCV
###########################################################
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=10,               # how many random combos
    scoring=scorer,          # 'accuracy' metric
    cv=tscv,                 # time-based folds
    random_state=42,
    n_jobs=-1,
    verbose=2
)

###########################################################
# 7) FIT ON TUNING PORTION
###########################################################
random_search.fit(X_tune, y_tune_shifted)

print("Best params:", random_search.best_params_)
print("Best score (accuracy):", random_search.best_score_)

best_estimator = random_search.best_estimator_

###########################################################
# 8) SAVE THE BEST ESTIMATOR
###########################################################
joblib.dump(best_estimator, "models/saved_models/best_rf_db_pipeline.pkl")
print("Saved best estimator to 'best_rf_db_pipeline.pkl'")


Unique classes in y_full: [ 1. -1.  0.]
Unique classes in y_full_shifted: [2. 0. 1.]
Tuning portion size: 1600
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best params: {'clf__n_estimators': 200, 'clf__min_samples_split': 5, 'clf__max_features': 'sqrt', 'clf__max_depth': 10}
Best score (accuracy): 0.5325000000000001
Saved best estimator to 'best_rf_db_pipeline.pkl'


## Part 3: Backtesting & Performance Evaluation

**Objective:**  
Evaluate how well the trained model performs on unseen data, simulating real trades.

**Tasks:**
- Use walk-forward or expanding splits to mimic “live” conditions  
- Convert model predictions to signals ([-1, 0, +1] or buy/sell/hold)  
- Run a simple backtest script or VectorBT for performance metrics  
- Calculate returns, Sharpe ratio, drawdowns, confusion matrix, etc.  
- Visualize results (equity curve, trades, etc.) to judge strategy viability  
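
The two headline metrics can be computed directly from a series of per-bar strategy returns. The sketch below is an illustration under stated assumptions, not the project's `calculate_sharpe_ratio`: the function names are hypothetical, the risk-free rate is taken as zero, and the annualization factor assumes 4-hour bars on a market that trades every day (6 bars × 365 days).

```python
import numpy as np


def sharpe_ratio(returns, periods_per_year: int = 6 * 365) -> float:
    """Annualized Sharpe ratio of per-bar returns (risk-free rate assumed 0)."""
    r = np.asarray(returns, dtype=float)
    if r.size < 2:
        return 0.0
    sd = r.std(ddof=1)
    if sd == 0:
        return 0.0
    return float(r.mean() / sd * np.sqrt(periods_per_year))


def max_drawdown(returns) -> float:
    """Largest peak-to-trough decline of the compounded equity curve (fraction)."""
    r = np.asarray(returns, dtype=float)
    # Start the curve at 1.0 so a loss on the very first bar counts as drawdown.
    equity = np.concatenate(([1.0], np.cumprod(1.0 + r)))
    peaks = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / peaks))


demo_returns = [0.10, -0.50, 0.20]
print(f"Sharpe={sharpe_ratio(demo_returns):.2f}, MaxDD={max_drawdown(demo_returns):.2%}")
```

Both functions work on fractional returns (0.01 means +1%), so any result reported in percent needs an explicit multiplication by 100.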

In [None]:
import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import MetaTrader5 as mt5
import vectorbt as vbt
import joblib

# Our modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
from features.labeling_schemes import create_labels_double_barrier  # Double-barrier classification

# Sklearn
from sklearn.metrics import accuracy_score

###########################################################
# 1) DATA LOADING & FEATURE ENGINEERING
###########################################################
if not mt5.initialize():
    # Stop here: without a connection, `data` would be undefined below
    raise RuntimeError("Failed to initialize MT5")
# Fetch 2000 most recent bars for backtesting
data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=2000, start_pos=0)
mt5.shutdown()

# Add technical features
df = add_all_ta_features(data)

# Create double-barrier classification labels
# Example: up=0.005, down=0.005 => ±0.5% barrier, horizon=20 bars
df_lbl = create_labels_double_barrier(df, up=0.005, down=0.005, horizon=20)

# Separate features and labels
X = df_lbl.drop(columns=["barrier_label"])
y = df_lbl["barrier_label"]           # in {-1, 0, +1}
y_shifted = y + 1                     # in {0, 1, 2} for the classifier if needed

###########################################################
# 2) LOAD PRE-TRAINED CLASSIFICATION MODEL (NO RETRAINING)
###########################################################
best_pipeline = joblib.load("models/saved_models/best_rf_db_pipeline.pkl")
print("Loaded best classification model from 'best_rf_db_pipeline.pkl'")

# Identify the columns the pipeline was trained on
trained_columns = best_pipeline["scaler"].feature_names_in_  # adjust if your pipeline step name differs

# Subset X to match these columns
X_test = X[trained_columns]

# Predict on the new dataset (model outputs shifted labels in {0,1,2})
preds_shifted = best_pipeline.predict(X_test)

# Convert predictions back to {-1, 0, +1}
preds = preds_shifted - 1

# Accuracy on the exact prediction rows
accuracy = accuracy_score(y.loc[X_test.index], preds)
print(f"\nOut-of-Sample Accuracy: {accuracy:.4f}")

###########################################################
# 3) BACKTEST via target exposure (-1, 0, +1)
###########################################################
# predictions already in {-1, 0, +1}
exposure = pd.Series(preds.astype(float), index=X_test.index)

# Align prices exactly to prediction rows (no padding)
close = df_lbl.loc[X_test.index, "close"]

# Optional: trade on next bar to avoid look-ahead (set to 0 for same-bar)
execution_lag = 1
if execution_lag > 0:
    exposure = exposure.shift(execution_lag).fillna(0.0)

fees = 0.0002  # 0.02% transaction cost per trade

pf = vbt.Portfolio.from_orders(
    close=close,
    size=exposure,              # -1 short, 0 flat, +1 long
    size_type='targetpercent',
    init_cash=10000,
    freq='4H',
    fees=fees
)

total_return = pf.total_return()  # returned as a fraction (0.25 means +25%)
sharpe_ratio = pf.sharpe_ratio()

print("\nFull Backtest Results:")
print(f"Accuracy={accuracy:.4f}, Return={total_return * 100:.2f}%, Sharpe={sharpe_ratio:.2f}")
print(pf.stats())

# Optional: Plot the backtest results
fig = pf.plot()
fig.show()


Loaded best classification model from 'best_rf_db_pipeline.pkl'

Out-of-Sample Accuracy: 0.8745

Running Full Backtest on the Last 2000 Bars...

Full Backtest Results:
Accuracy=0.87, Return=145.80%, Sharpe=15.64
Start                               2024-04-03 12:00:00
End                                 2025-03-02 16:00:00
Period                                333 days 08:00:00
Start Value                                     10000.0
End Value                                1467982.002469
Total Return [%]                           14579.820025
Benchmark Return [%]                          29.209055
Max Gross Exposure [%]                            100.0
Total Fees Paid                            59441.265675
Max Drawdown [%]                              16.812138
Max Drawdown Duration                  57 days 08:00:00
Total Trades                                        287
Total Closed Trades                                 286
Total Open Trades                                     1
Open