
# 📈 Time-Series Forecasting for Trading (with Feature Engineering)

This notebook implements various **time-series forecasting models** to predict price movements using **live data from MetaTrader 5 (MT5)**.

## **🔹 Models Included:**
✅ **ARIMA** - Univariate Forecasting  
✅ **SARIMA** - Seasonal Adjusted Forecasting  
✅ **VAR** - Multivariate Forecasting (e.g., Price + Volume)  
✅ **LSTM/Transformer** - Deep Learning-Based Forecasting  

### **🔹 Workflow:**
1️⃣ **Fetch Live Data from MT5**  
2️⃣ **Apply Feature Engineering (Technical Indicators, Autocorrelation, Stationarity Checks)**  
3️⃣ **Select Best Features Dynamically**  
4️⃣ **Train & Evaluate Forecasting Models**  
5️⃣ **Compare Predictions Using Metrics (RMSE, MAE, MAPE, Sharpe Ratio)**  
    

In [1]:

import sys
import os
import warnings
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import MetaTrader5 as mt5

# Time-Series & ML Libraries
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.var_model import VAR

# Deep Learning (LSTM)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Fix import issue (ensure project root is recognized)
from pathlib import Path
project_root = Path.cwd().parent.parent  # Move two levels up (from notebooks/time_series)
sys.path.append(str(project_root))

# Import Feature Engineering Module
from features.feature_engineering import add_all_ta_features
from models.model_training import walk_forward_splits
warnings.filterwarnings("ignore")
    

In [2]:

# Initialize MetaTrader 5 and Fetch Data
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    SYMBOL = "BTCUSD"  # Change symbol here
    TIMEFRAME = mt5.TIMEFRAME_H1  # 1-hour candles
    N_BARS = 1000

    df = mt5.copy_rates_from_pos(SYMBOL, TIMEFRAME, 0, N_BARS)
    mt5.shutdown()

# Convert to DataFrame
df = pd.DataFrame(df)
df['time'] = pd.to_datetime(df['time'], unit='s')  # Convert timestamp
df.set_index('time', inplace=True)

# Apply Feature Engineering (Technical Indicators, Autocorrelation, etc.)
df = add_all_ta_features(df)

# Drop NaN values created during feature calculations
df.dropna(inplace=True)

# Display Data Overview
print(df.head())

# Plot Closing Price
fig = px.line(df, x=df.index, y="close", title=f"{SYMBOL} Price Data with Features")
fig.show()
    

                          open       high        low      close  tick_volume  \
time                                                                           
2025-01-17 06:00:00  101016.27  101477.00  100886.39  101449.28         9548   
2025-01-17 07:00:00  101449.28  101776.52  101305.00  101335.62        16329   
2025-01-17 08:00:00  101339.91  101594.76  101292.85  101403.80        15285   
2025-01-17 09:00:00  101403.80  101884.59  101391.16  101444.24        10256   
2025-01-17 10:00:00  101445.52  102213.81  101440.76  102124.33        17320   

                     spread  real_volume    volume_adi            volume_obv  \
time                                                                           
2025-01-17 06:00:00     244            0   8651.738304                  9548   
2025-01-17 07:00:00       0            0  -5556.486416  18446744073709544835   
2025-01-17 08:00:00       9            0  -9607.206332                  8504   
2025-01-17 09:00:00    1150            

In [3]:

# Select Top Correlated Features with Closing Price
corr_matrix = df.corr()
top_features = corr_matrix["close"].abs().sort_values(ascending=False).index[:10]

# Filter dataset to only include top features
df_selected = df[top_features]

# Display Selected Features
print("Top Selected Features for Forecasting:", top_features.tolist())

# Plot Feature Correlations
fig = px.imshow(corr_matrix.loc[top_features, top_features], text_auto=True, title="Feature Correlation Heatmap")
fig.show()
    

Top Selected Features for Forecasting: ['others_cr', 'close', 'high', 'low', 'open', 'trend_ichimoku_conv', 'trend_ema_fast', 'volume_vpt', 'momentum_kama', 'trend_ichimoku_a']


In [4]:

def test_stationarity(series):
    result = adfuller(series.dropna())
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    if result[1] < 0.05:
        print("✅ The series is likely stationary.")
    else:
        print("❌ The series is likely non-stationary. Differencing may be needed.")

# Apply Stationarity Test to Selected Features
for feature in df_selected.columns:
    print(f"Testing Stationarity for {feature}:")
    test_stationarity(df_selected[feature])
    print("")

# Apply Differencing to Non-Stationary Features
for feature in df_selected.columns:
    df_selected[f"{feature}_diff"] = df_selected[feature].diff().dropna()

# Remove Original Columns and Keep Differenced Features
df_selected.drop(columns=top_features, inplace=True)
df_selected.dropna(inplace=True)
    

Testing Stationarity for others_cr:
ADF Statistic: 0.7350
p-value: 0.9905
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for close:
ADF Statistic: 0.7350
p-value: 0.9905
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for high:
ADF Statistic: 0.0071
p-value: 0.9591
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for low:
ADF Statistic: -0.2206
p-value: 0.9360
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for open:
ADF Statistic: 0.5491
p-value: 0.9863
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for trend_ichimoku_conv:
ADF Statistic: 0.3733
p-value: 0.9805
❌ The series is likely non-stationary. Differencing may be needed.

Testing Stationarity for trend_ema_fast:
ADF Statistic: 0.5945
p-value: 0.9875
❌ The series is likely non-stationary. Differencing may be needed.

Testing St

In [11]:
# Ensure 'df_test' Exists for Storing Forecasts
if 'df_test' not in locals():
    train_size = int(len(df_selected) * 0.8)
    df_train, df_test = df_selected.iloc[:train_size], df_selected.iloc[train_size:].copy()

# Train ARIMA Model
best_feature = df_selected.columns[0]
arima_model = ARIMA(df_train[best_feature].dropna(), order=(1,1,1))
arima_result = arima_model.fit()

# Forecast Next Steps
arima_forecast = arima_result.forecast(steps=len(df_test))
df_test["arima_forecast"] = arima_forecast.values

# Plot ARIMA Forecast
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_test.index, y=df_test[best_feature], name="Actual Price"))
fig.add_trace(go.Scatter(x=df_test.index, y=df_test["arima_forecast"], name="ARIMA Forecast", line=dict(color="red")))
fig.update_layout(title="ARIMA Forecast", xaxis_title="Time", yaxis_title="Price")
fig.show()


In [12]:

# ---- SARIMA ----
sarima_model = SARIMAX(df_train[best_feature].dropna(), order=(1,1,1), seasonal_order=(1,1,1,12))
sarima_result = sarima_model.fit()

# Forecast Next Steps
sarima_forecast = sarima_result.forecast(steps=len(df_test))
df_test["sarima_forecast"] = sarima_forecast.values

# Plot SARIMA Forecast
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_test.index, y=df_test[best_feature], name="Actual Price"))
fig.add_trace(go.Scatter(x=df_test.index, y=df_test["sarima_forecast"], name="SARIMA Forecast", line=dict(color="green")))
fig.update_layout(title="SARIMA Forecast", xaxis_title="Time", yaxis_title="Price")
fig.show()


In [13]:

# ---- VAR ----
df_var = df_selected.dropna()
var_model = VAR(df_var)
var_result = var_model.fit(maxlags=5)

# Forecast Next Steps
var_forecast = var_result.forecast(df_var.values[-5:], steps=len(df_test))
var_forecast_df = pd.DataFrame(var_forecast, columns=df_var.columns, index=df_test.index)

# Store VAR Forecast in `df_test`
df_test["var_forecast"] = var_forecast_df[best_feature]

# Plot VAR Forecast
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_test.index, y=df_test[best_feature], name="Actual Price"))
fig.add_trace(go.Scatter(x=df_test.index, y=df_test["var_forecast"], name="VAR Forecast", line=dict(color="blue")))
fig.update_layout(title="VAR Forecast", xaxis_title="Time", yaxis_title="Price")
fig.show()


In [9]:
# Ensure 'close' is retained
if 'close' not in df_selected.columns:
    df_selected['close'] = df['close']

# Scale Data
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_selected[['close']])


# Create Sequences
lookback = 10
X, y = [], []
for i in range(len(df_scaled) - lookback - 1):
    X.append(df_scaled[i:i+lookback])
    y.append(df_scaled[i+lookback])

X, y = np.array(X), np.array(y)

# Split Data
X_train, X_test, y_train, y_test = X[:-50], X[-50:], y[:-50], y[-50:]

# Define LSTM Model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(lookback, 1)),
    Dropout(0.2),
    LSTM(50),
    Dense(1)
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, batch_size=16, verbose=1)

# Predict
lstm_preds = model.predict(X_test)
lstm_preds = scaler.inverse_transform(lstm_preds)

# Plot LSTM Forecast
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_selected.index[-100:], y=df_selected['close'][-100:], name="Actual Price"))
fig.add_trace(go.Scatter(x=df_selected.index[-50:], y=lstm_preds.flatten(), name="LSTM Forecast", line=dict(color="orange")))
fig.update_layout(title="LSTM Forecast", xaxis_title="Time", yaxis_title="Price")
fig.show()


Epoch 1/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 0.1473
Epoch 2/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - loss: 0.0037
Epoch 3/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.0028
Epoch 4/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.0025
Epoch 5/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 9ms/step - loss: 0.0032
Epoch 6/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 0.0031
Epoch 7/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.0028
Epoch 8/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.0031
Epoch 9/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.0027
Epoch 10/20
[1m59/59[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 9ms/step - loss: 0.0027
Epoch 11/

In [14]:
# Define Performance Metrics Function
def evaluate_forecasts(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"\n📊 Performance of {model_name}:")
    print(f"✅ Mean Absolute Error (MAE): {mae:.4f}")
    print(f"✅ Mean Squared Error (MSE): {mse:.4f}")
    print(f"✅ Root Mean Squared Error (RMSE): {rmse:.4f}")

# Evaluate ARIMA
evaluate_forecasts(df_test[best_feature], df_test["arima_forecast"], "ARIMA")

# Evaluate SARIMA
evaluate_forecasts(df_test[best_feature], df_test["sarima_forecast"], "SARIMA")

# Evaluate VAR
evaluate_forecasts(df_test[best_feature], df_test["var_forecast"], "VAR")

# Evaluate LSTM
evaluate_forecasts(y_test.flatten(), lstm_preds.flatten(), "LSTM")



📊 Performance of ARIMA:
✅ Mean Absolute Error (MAE): 0.3014
✅ Mean Squared Error (MSE): 0.2016
✅ Root Mean Squared Error (RMSE): 0.4490

📊 Performance of SARIMA:
✅ Mean Absolute Error (MAE): 0.3216
✅ Mean Squared Error (MSE): 0.2135
✅ Root Mean Squared Error (RMSE): 0.4620

📊 Performance of VAR:
✅ Mean Absolute Error (MAE): 0.3060
✅ Mean Squared Error (MSE): 0.2015
✅ Root Mean Squared Error (RMSE): 0.4489

📊 Performance of LSTM:
✅ Mean Absolute Error (MAE): 86364.8782
✅ Mean Squared Error (MSE): 7461025620.6772
✅ Root Mean Squared Error (RMSE): 86377.2286


In [17]:
# Plot Comparison of All Forecasts
fig = go.Figure()

# Actual Prices
fig.add_trace(go.Scatter(x=df_test.index, y=df_test[best_feature], name="Actual Price", line=dict(color="black")))

# ARIMA
fig.add_trace(go.Scatter(x=df_test.index, y=df_test["arima_forecast"], name="ARIMA", line=dict(color="red")))

# SARIMA
fig.add_trace(go.Scatter(x=pd.date_range(df_test.index[0], periods=len(sarima_forecast), freq="H"),
                         y=sarima_forecast, name="SARIMA", line=dict(color="green")))

# VAR
fig.add_trace(go.Scatter(x=pd.date_range(df_test.index[0], periods=len(var_forecast_df[best_feature]), freq="H"),
                         y=var_forecast_df[best_feature], name="VAR", line=dict(color="blue")))

# LSTM
fig.add_trace(go.Scatter(x=df_test.index[-50:], y=lstm_preds.flatten(), name="LSTM", line=dict(color="orange")))

fig.update_layout(title="Comparison of Time-Series Forecasting Models", xaxis_title="Time", yaxis_title="Price")
fig.show()


## Part 1: Data & Feature Engineering

**Objective:**  
Load raw price data (MetaTrader 5 or CSV) and transform it into a feature-rich dataset.

**Tasks:**
- Fetch historical bars  
- Apply `ta.add_all_ta_features` or custom features  
- (Optionally) create specific labels (multi-bar, double-barrier, regime, etc.)  
- Clean/prepare the final feature matrix **X** and target **y**  

In [1]:
import sys
import os
import warnings
from pathlib import Path

# -----------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# -----------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Assuming notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

# -----------------------------------------------------------------------------
# 2) IMPORT LIBRARIES & MODULES
# -----------------------------------------------------------------------------
import pandas as pd
import numpy as np
import MetaTrader5 as mt5
import vectorbt as vbt
import plotly.graph_objects as go

# Time-Series Libraries
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR

# Feature Engineering & Data Loader
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features
from models.model_training import walk_forward_splits
from sklearn.metrics import mean_absolute_error, mean_squared_error

# -----------------------------------------------------------------------------
# 3) DATA LOADING & FEATURE ENGINEERING
# -----------------------------------------------------------------------------
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H1, n_bars=1000)
    mt5.shutdown()

df = add_all_ta_features(data).dropna()  # Apply technical indicators

# Select features and target (closing price prediction)
X = df.drop(columns=["close"])
y = df["close"]

print(f"✅ Features (X): {X.shape}, Target (y): {y.shape}")

# -----------------------------------------------------------------------------
# 4) WALK-FORWARD SPLITS
# -----------------------------------------------------------------------------
folds = walk_forward_splits(X, y, n_splits=3)  # Apply walk-forward splits
print(f"✅ {len(folds)} folds created for walk-forward validation!")

# Dictionary to store each fold's results and the best feature used
fold_results = {}

# -----------------------------------------------------------------------------
# 5) TIME-SERIES MODELS TRAINING, FORECASTING & EVALUATION FOR EACH FOLD
# -----------------------------------------------------------------------------
for fold_i, (X_train_fold, y_train_fold, X_test_fold, y_test_fold) in enumerate(folds, start=1):
    print(f"\n===== Fold {fold_i} =====")
    
    # Convert train/test folds into DataFrame format
    df_train = pd.DataFrame(X_train_fold, columns=X.columns)
    df_train["target"] = y_train_fold.values

    df_test = pd.DataFrame(X_test_fold, columns=X.columns)
    df_test["target"] = y_test_fold.values

    # Select Best Feature (based on correlation with target), excluding the 'target' column
    feature_corr = df_train.drop(columns=["target"]).corrwith(df_train["target"]).abs().sort_values(ascending=False)
    best_feature = feature_corr.index[0]
    print(f"🔹 Best Feature Selected: {best_feature}")

    # ----- Model 1: ARIMA -----
    arima_model = ARIMA(df_train[best_feature].dropna(), order=(1, 1, 1))
    arima_result = arima_model.fit()
    arima_forecast = arima_result.forecast(steps=len(df_test))
    df_test["arima_forecast"] = arima_forecast.values

    # ----- Model 2: SARIMA -----
    sarima_model = SARIMAX(df_train[best_feature].dropna(), order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
    sarima_result = sarima_model.fit()
    sarima_forecast = sarima_result.forecast(steps=len(df_test))
    df_test["sarima_forecast"] = sarima_forecast.values

    # ----- Model 3: VAR (Multi-Feature Model) -----
    df_var = df_train.drop(columns=["target"]).dropna()
    if df_var.empty:
        print(f"⚠️ Skipping VAR model for Fold {fold_i} due to insufficient data!")
        df_test["var_forecast"] = np.nan
    else:
        var_model = VAR(df_var)
        var_result = var_model.fit(maxlags=5)

        # Ensure df_test has the necessary features for VAR
        df_var_test = df_test.drop(columns=["target"]).dropna()
        if df_var_test.empty:
            print(f"⚠️ Skipping VAR forecast for Fold {fold_i} due to missing features!")
            df_test["var_forecast"] = np.nan
        else:
            # Forecast using the last 5 observations from df_var
            var_forecast = var_result.forecast(df_var.values[-5:], steps=len(df_var_test))
            var_forecast_df = pd.DataFrame(var_forecast, columns=df_var.columns)

            # If the best feature exists in the VAR forecast, use it
            if best_feature in var_forecast_df.columns:
                df_test["var_forecast"] = var_forecast_df[best_feature].values
            else:
                print(f"⚠️ Best feature '{best_feature}' missing from VAR forecast in Fold {fold_i}. Filling with NaNs.")
                df_test["var_forecast"] = np.nan

    # -----------------------------------------------------------------------------
    # PLOTTING FORECASTS FOR THE CURRENT FOLD
    # -----------------------------------------------------------------------------
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_train.index, y=df_train[best_feature], name="Train Data"))
    fig.add_trace(go.Scatter(x=df_test.index, y=df_test[best_feature], name="Test Data", line=dict(color="gray")))
    fig.add_trace(go.Scatter(x=df_test.index, y=df_test["arima_forecast"], name="ARIMA Forecast", line=dict(color="red")))
    fig.add_trace(go.Scatter(x=df_test.index, y=df_test["sarima_forecast"], name="SARIMA Forecast", line=dict(color="green")))
    fig.add_trace(go.Scatter(x=df_test.index, y=df_test["var_forecast"], name="VAR Forecast", line=dict(color="blue")))
    fig.update_layout(title=f"Forecasting - Fold {fold_i}", xaxis_title="Time", yaxis_title=best_feature)
    fig.show()

    # -----------------------------------------------------------------------------
    # Evaluate Forecasts (for each model) within the fold
    # -----------------------------------------------------------------------------
    def evaluate_forecasts(actual, predicted, model_name):
        mae = mean_absolute_error(actual, predicted)
        rmse = np.sqrt(mean_squared_error(actual, predicted))
        print(f"🔹 {model_name} - Fold {fold_i}: MAE = {mae:.4f}, RMSE = {rmse:.4f}")

    # Fill any missing forecast values (backward fill)
    df_test["arima_forecast"].fillna(method="bfill", inplace=True)
    df_test["sarima_forecast"].fillna(method="bfill", inplace=True)
    df_test["var_forecast"].fillna(method="bfill", inplace=True)

    # Store the fold's results along with the best feature used
    fold_results[fold_i] = {"df": df_test.copy(), "best_feature": best_feature}

    # Optional: Evaluate immediately for the current fold (dropping NaNs)
    df_test_cleaned = df_test.dropna(subset=[best_feature, "arima_forecast", "sarima_forecast", "var_forecast"])
    if df_test_cleaned.empty:
        print(f"⚠️ Warning: No valid forecast samples for evaluation in Fold {fold_i}.")
    else:
        evaluate_forecasts(df_test_cleaned[best_feature], df_test_cleaned["arima_forecast"], "ARIMA")
        evaluate_forecasts(df_test_cleaned[best_feature], df_test_cleaned["sarima_forecast"], "SARIMA")
        evaluate_forecasts(df_test_cleaned[best_feature], df_test_cleaned["var_forecast"], "VAR")

# -----------------------------------------------------------------------------
# 6) PRINT RESULTS SUMMARY ACROSS FOLDS
# -----------------------------------------------------------------------------
for fold_i, fold_info in fold_results.items():
    df_test = fold_info["df"]
    best_feature = fold_info["best_feature"]
    print(f"\n=== Fold {fold_i} Results Summary ===")

    # Drop rows with missing values in any key column
    df_test_cleaned = df_test.dropna(subset=[best_feature, "arima_forecast", "sarima_forecast", "var_forecast"])
    if df_test_cleaned.empty:
        print(f"⚠️ No valid forecast data available for evaluation in Fold {fold_i}.")
        continue

    mae_arima = mean_absolute_error(df_test_cleaned[best_feature], df_test_cleaned["arima_forecast"])
    rmse_arima = np.sqrt(mean_squared_error(df_test_cleaned[best_feature], df_test_cleaned["arima_forecast"]))

    mae_sarima = mean_absolute_error(df_test_cleaned[best_feature], df_test_cleaned["sarima_forecast"])
    rmse_sarima = np.sqrt(mean_squared_error(df_test_cleaned[best_feature], df_test_cleaned["sarima_forecast"]))

    mae_var = mean_absolute_error(df_test_cleaned[best_feature], df_test_cleaned["var_forecast"])
    rmse_var = np.sqrt(mean_squared_error(df_test_cleaned[best_feature], df_test_cleaned["var_forecast"]))

    print(f"📌 ARIMA - MAE: {mae_arima:.4f}, RMSE: {rmse_arima:.4f}")
    print(f"📌 SARIMA - MAE: {mae_sarima:.4f}, RMSE: {rmse_sarima:.4f}")
    print(f"📌 VAR    - MAE: {mae_var:.4f}, RMSE: {rmse_var:.4f}")


✅ Features (X): (1000, 92), Target (y): (1000,)
✅ 3 folds created for walk-forward validation!

===== Fold 1 =====
🔹 Best Feature Selected: others_cr


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  return get_prediction_index(
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  return get_prediction_index(
  self._init_dates(dates, freq)


🔹 ARIMA - Fold 1: MAE = 2.6297, RMSE = 3.0618
🔹 SARIMA - Fold 1: MAE = 2.4853, RMSE = 3.0333
🔹 VAR - Fold 1: MAE = 2.3171, RMSE = 2.8153

===== Fold 2 =====
🔹 Best Feature Selected: others_cr



A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


No supported index is available. Prediction results will be given with an integer index beginning at `start`.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


No supported index is available. Prediction results will be given with an integer index beginning at `start`.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.




A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.



🔹 ARIMA - Fold 2: MAE = 0.8383, RMSE = 1.0495
🔹 SARIMA - Fold 2: MAE = 1.0987, RMSE = 1.2946
🔹 VAR - Fold 2: MAE = 15266.2514, RMSE = 41890.0691

===== Fold 3 =====
🔹 Best Feature Selected: others_cr



No supported index is available. Prediction results will be given with an integer index beginning at `start`.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


No supported index is available. Prediction results will be given with an integer index beginning at `start`.


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.



🔹 ARIMA - Fold 3: MAE = 3.2151, RMSE = 5.1056
🔹 SARIMA - Fold 3: MAE = 2.8176, RMSE = 4.0693
🔹 VAR - Fold 3: MAE = 123873828408479325356032.0000, RMSE = 670590313183068318334976.0000

=== Fold 1 Results Summary ===
📌 ARIMA - MAE: 2.6297, RMSE: 3.0618
📌 SARIMA - MAE: 2.4853, RMSE: 3.0333
📌 VAR    - MAE: 2.3171, RMSE: 2.8153

=== Fold 2 Results Summary ===
📌 ARIMA - MAE: 0.8383, RMSE: 1.0495
📌 SARIMA - MAE: 1.0987, RMSE: 1.2946
📌 VAR    - MAE: 15266.2514, RMSE: 41890.0691

=== Fold 3 Results Summary ===
📌 ARIMA - MAE: 3.2151, RMSE: 5.1056
📌 SARIMA - MAE: 2.8176, RMSE: 4.0693
📌 VAR    - MAE: 123873828408479325356032.0000, RMSE: 670590313183068318334976.0000


In [1]:

from pathlib import Path
import sys
import os
import warnings


# -----------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# -----------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Assuming notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")


import numpy as np
import pandas as pd
import MetaTrader5 as mt5

# Plotting
import plotly.graph_objects as go

# TensorFlow / Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Custom Modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features


# -------------------------------------------------------------------------
# 3) DATA LOADING & FEATURE ENGINEERING
# -------------------------------------------------------------------------
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=10000)
    mt5.shutdown()

# data is assumed to have columns like ['time', 'open', 'high', 'low', 'close', 'tick_volume', ...]
df = add_all_ta_features(data).dropna()  # This adds many technical indicators
print("Columns after TA features:", df.columns.tolist())

# For the LSTM, we need at least 'close' in the DataFrame
# Optionally, if you want only the 'close' column:
# df_selected = df[['close']]
# Or if you want multiple features, e.g. 'close', 'volume', 'rsi', 'macd', etc.:
# Make sure your create_sequences() function handles multi-feature input if you do so.
df_selected = df[['close']].copy()

print("df_selected shape:", df_selected.shape)
print("df_selected head:\n", df_selected.head())


###############################################################################
# 1) LSTM HELPER FUNCTIONS
###############################################################################
def create_sequences(data, lookback=10):
    """
    Convert a 1D or 2D numpy array (data) into LSTM sequences.
    data shape: (num_samples, 1) or (num_samples, num_features)
    Returns:
        X: shape (num_sequences, lookback, num_features)
        y: shape (num_sequences, num_features) or (num_sequences, 1)
    """
    X, y = [], []
    for i in range(len(data) - lookback):
        X.append(data[i : i + lookback])
        y.append(data[i + lookback])
    return np.array(X), np.array(y)

def lstm_walk_forward(df, n_splits=3, lookback=10, horizon=50, epochs=20, batch_size=16):
    """
    Perform walk-forward validation with an LSTM model.

    Parameters:
    -----------
    df        : DataFrame with a 'close' column (and optionally other columns).
    n_splits  : Number of walk-forward folds.
    lookback  : Number of timesteps in each LSTM input sequence.
    horizon   : Number of test points in each fold.
    epochs    : Training epochs for each fold's LSTM.
    batch_size: Batch size for LSTM training.

    Returns:
    --------
    fold_results : list of dicts, each containing:
        - 'fold'   : Fold index (1-based).
        - 'mae'    : MAE on the fold test set.
        - 'rmse'   : RMSE on the fold test set.
        - 'preds'  : Inverse-scaled predictions.
        - 'y_test' : Inverse-scaled actuals.
        - 'df_train': Training DataFrame slice for the fold.
        - 'df_test' : Test DataFrame slice for the fold.
    """
    if 'close' not in df.columns:
        raise ValueError("DataFrame must have a 'close' column.")

    total_length = len(df)
    # The maximum training length per fold is everything minus the final horizon
    fold_size = (total_length - horizon) // n_splits

    fold_results = []
    start_idx = 0

    for fold_idx in range(n_splits):
        # End of training region for this fold
        train_end = start_idx + fold_size
        if fold_idx == n_splits - 1:
            # Last fold takes everything up to the final horizon
            train_end = total_length - horizon
        
        test_end = train_end + horizon

        # Slice the DataFrame for training and testing
        df_train = df.iloc[:train_end]
        df_test = df.iloc[train_end:test_end]

        # Scale each fold separately based on the training set
        scaler = MinMaxScaler()
        train_scaled = scaler.fit_transform(df_train[['close']])
        test_scaled = scaler.transform(df_test[['close']])

        # Create sequences for training
        X_train, y_train = create_sequences(train_scaled, lookback)

        # Combine last 'lookback' points of training + test set for test sequences
        last_train_part = train_scaled[-lookback:]
        test_input = np.concatenate([last_train_part, test_scaled], axis=0)
        X_test, y_test = create_sequences(test_input, lookback)

        # Define LSTM model
        model = Sequential([
            LSTM(50, return_sequences=True, input_shape=(lookback, 1)),
            Dropout(0.2),
            LSTM(50),
            Dense(1)
        ])
        model.compile(loss="mse", optimizer="adam")

        # Train
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)

        # Predict
        preds = model.predict(X_test)
        preds_inverted = scaler.inverse_transform(preds)
        y_test_inverted = scaler.inverse_transform(y_test.reshape(-1, 1))

        # Evaluate
        mae = mean_absolute_error(y_test_inverted, preds_inverted)
        rmse = np.sqrt(mean_squared_error(y_test_inverted, preds_inverted))

        fold_results.append({
            'fold': fold_idx + 1,
            'mae': mae,
            'rmse': rmse,
            'preds': preds_inverted.flatten(),
            'y_test': y_test_inverted.flatten(),
            'df_train': df_train,
            'df_test': df_test
        })

        # Move start index for next fold
        start_idx = train_end

    return fold_results

###############################################################################
# 2) DATA LOADING & PREPARATION
###############################################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H1, n_bars=1000)
    mt5.shutdown()

df = add_all_ta_features(data).dropna()  # Apply technical indicators

print("Available columns:", df.columns)

# Here, we define df_selected so that we have a DataFrame with at least 'close'.
# If you want multiple features, e.g. RSI, MACD, etc., add them here.
df_selected = df[['close']].copy()

###############################################################################
# 3) RUN THE WALK-FORWARD LSTM
###############################################################################
n_splits = 3
lookback = 10
horizon = 50
epochs = 10
batch_size = 16

fold_results = lstm_walk_forward(
    df_selected, 
    n_splits=n_splits, 
    lookback=lookback, 
    horizon=horizon, 
    epochs=epochs, 
    batch_size=batch_size
)

###############################################################################
# 4) EVALUATE & PLOT RESULTS
###############################################################################
for result in fold_results:
    fold_num = result['fold']
    mae = result['mae']
    rmse = result['rmse']
    preds = result['preds']
    actual = result['y_test']
    df_train = result['df_train']
    df_test = result['df_test']

    print(f"=== Fold {fold_num} ===")
    print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")

    # For plotting, note that the test sequences start after 'lookback'
    test_index = df_test.index[lookback:]

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df_train.index,
        y=df_train['close'],
        name="Train Data",
        line=dict(color="blue")
    ))
    fig.add_trace(go.Scatter(
        x=df_test.index,
        y=df_test['close'],
        name="Test Data",
        line=dict(color="gray")
    ))
    fig.add_trace(go.Scatter(
        x=test_index,
        y=preds,
        name="LSTM Forecast",
        line=dict(color="orange")
    ))
    fig.update_layout(
        title=f"LSTM Forecast - Fold {fold_num}",
        xaxis_title="Time",
        yaxis_title="Price"
    )
    fig.show()


Columns after TA features: ['open', 'high', 'low', 'close', 'tick_volume', 'spread', 'real_volume', 'volume_adi', 'volume_obv', 'volume_cmf', 'volume_fi', 'volume_em', 'volume_sma_em', 'volume_vpt', 'volume_vwap', 'volume_mfi', 'volume_nvi', 'volatility_bbm', 'volatility_bbh', 'volatility_bbl', 'volatility_bbw', 'volatility_bbp', 'volatility_bbhi', 'volatility_bbli', 'volatility_kcc', 'volatility_kch', 'volatility_kcl', 'volatility_kcw', 'volatility_kcp', 'volatility_kchi', 'volatility_kcli', 'volatility_dcl', 'volatility_dch', 'volatility_dcm', 'volatility_dcw', 'volatility_dcp', 'volatility_atr', 'volatility_ui', 'trend_macd', 'trend_macd_signal', 'trend_macd_diff', 'trend_sma_fast', 'trend_sma_slow', 'trend_ema_fast', 'trend_ema_slow', 'trend_vortex_ind_pos', 'trend_vortex_ind_neg', 'trend_vortex_ind_diff', 'trend_trix', 'trend_mass_index', 'trend_dpo', 'trend_kst', 'trend_kst_sig', 'trend_kst_diff', 'trend_ichimoku_conv', 'trend_ichimoku_base', 'trend_ichimoku_a', 'trend_ichimoku_b

=== Fold 2 ===
MAE = 522.6795, RMSE = 656.3507


=== Fold 3 ===
MAE = 1006.4930, RMSE = 1284.6250


## Part 2: Model Training & Hyperparameter Tuning

**Objective:**  
Train an ML model (e.g., RandomForest, XGBoost) on the engineered features to predict the chosen labels.

**Tasks:**
- Perform time-based or walk-forward splits  
- Select top features if desired (e.g., using RandomForest feature importance)  
- Use `RandomizedSearchCV` or `GridSearchCV` to find optimal hyperparameters  
- Save the best model pipeline (e.g., `best_rf_pipeline.pkl`) 

1) Imports & Basic Setup

In [None]:
import sys
import os
import warnings
from pathlib import Path

# ---------------------------------------------------------------------------
# 1) SET PROJECT ROOT AND UPDATE PATH/WORKING DIRECTORY
# ---------------------------------------------------------------------------
project_root = Path.cwd().parent.parent  # Adjust if your notebook is in notebooks/time_series
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import MetaTrader5 as mt5
import optuna
import joblib

# Plotly for charting (optional)
import plotly.graph_objects as go

# TensorFlow / Keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Custom Modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features

###############################################################################
# 2) DATA LOADING & FEATURE ENGINEERING
###############################################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=2000, start_pos=2000)
    mt5.shutdown()

df = add_all_ta_features(data).dropna()
print("Columns after TA features:", df.columns.tolist())

# Keep just 'close' for demonstration (add more columns if desired)
df_selected = df[['close']].copy()
print("df_selected shape:", df_selected.shape)

###############################################################################
# 3) HELPER FUNCTIONS
###############################################################################
def create_sequences(data, lookback=10):
    """
    Convert 1D or 2D numpy array into LSTM sequences of shape:
      (num_sequences, lookback, num_features)
    """
    X, y = [], []
    for i in range(len(data) - lookback):
        X.append(data[i : i + lookback])
        y.append(data[i + lookback])
    return np.array(X), np.array(y)

def lstm_walk_forward(
    df, 
    n_splits=3, 
    lookback=10, 
    horizon=50, 
    epochs=20, 
    batch_size=16,
    n_units=50,
    dropout_rate=0.2,
    learning_rate=1e-3
):
    """
    Perform walk-forward validation with an LSTM model.
    Returns a list of dicts, each containing fold results (MAE, RMSE, etc.).
    """
    total_length = len(df)
    fold_size = (total_length - horizon) // n_splits
    fold_results = []
    start_idx = 0

    for fold_idx in range(n_splits):
        # Slice data for this fold
        train_end = start_idx + fold_size
        if fold_idx == n_splits - 1:
            train_end = total_length - horizon
        test_end = train_end + horizon

        df_train = df.iloc[:train_end]
        df_test = df.iloc[train_end:test_end]

        # Scale train/test separately
        scaler = MinMaxScaler()
        train_scaled = scaler.fit_transform(df_train[['close']])
        test_scaled = scaler.transform(df_test[['close']])

        # Create sequences for training
        X_train, y_train = create_sequences(train_scaled, lookback)

        # Create sequences for testing
        last_train_part = train_scaled[-lookback:]
        test_input = np.concatenate([last_train_part, test_scaled], axis=0)
        X_test, y_test = create_sequences(test_input, lookback)

        # Build the LSTM model
        model = Sequential([
            LSTM(n_units, return_sequences=True, input_shape=(lookback, 1)),
            Dropout(dropout_rate),
            LSTM(n_units),
            Dense(1)
        ])
        optimizer = Adam(learning_rate=learning_rate)
        model.compile(loss="mse", optimizer=optimizer)

        # Train
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)

        # Predict
        preds = model.predict(X_test, verbose=0)
        preds_inverted = scaler.inverse_transform(preds)
        y_test_inverted = scaler.inverse_transform(y_test.reshape(-1, 1))

        # Evaluate
        mae = mean_absolute_error(y_test_inverted, preds_inverted)
        rmse = np.sqrt(mean_squared_error(y_test_inverted, preds_inverted))

        fold_results.append({
            'fold': fold_idx + 1,
            'mae': mae,
            'rmse': rmse,
            'preds': preds_inverted.flatten(),
            'y_test': y_test_inverted.flatten(),
            'df_train': df_train,
            'df_test': df_test
        })

        start_idx = train_end

    return fold_results

def build_lstm_model(n_units, dropout_rate, learning_rate, lookback):
    """
    Build a final LSTM model for training on the entire dataset
    after finding best hyperparams.
    """
    model = Sequential([
        LSTM(n_units, return_sequences=True, input_shape=(lookback, 1)),
        Dropout(dropout_rate),
        LSTM(n_units),
        Dense(1)
    ])
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

###############################################################################
# 4) OPTUNA OBJECTIVE FUNCTION
###############################################################################
def objective(trial):
    """
    Optuna objective: sample hyperparams, run walk-forward, return average MSE.
    """
    n_units = trial.suggest_int("n_units", 32, 128, step=32)
    dropout_rate = trial.suggest_float("dropout_rate", 0.0, 0.5, step=0.1)
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    epochs = trial.suggest_int("epochs", 10, 50, step=10)
    batch_size = trial.suggest_int("batch_size", 16, 64, step=16)
    lookback = trial.suggest_int("lookback", 5, 30, step=5)

    fold_results = lstm_walk_forward(
        df_selected,
        n_splits=3,
        lookback=lookback,
        horizon=50,
        epochs=epochs,
        batch_size=batch_size,
        n_units=n_units,
        dropout_rate=dropout_rate,
        learning_rate=learning_rate
    )

    # Average MSE across folds
    mse_list = []
    for fr in fold_results:
        rmse = fr['rmse']
        mse_list.append(rmse**2)  # RMSE^2 = MSE
    avg_mse = np.mean(mse_list)

    return avg_mse  # Optuna will minimize this

###############################################################################
# 5) RUN OPTUNA STUDY
###############################################################################
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10, timeout=1800)  # e.g., 10 trials or 30 min

print("\nBest Trial:")
print(" Value (MSE):", study.best_trial.value)
print(" Params:", study.best_trial.params)

best_params = study.best_trial.params

###############################################################################
# 6) EVALUATE BEST PARAMS WITH WALK-FORWARD & PRINT/SHOW PLOTS
###############################################################################
best_fold_results = lstm_walk_forward(
    df_selected,
    n_splits=3,
    lookback=best_params['lookback'],
    horizon=50,
    epochs=best_params['epochs'],
    batch_size=best_params['batch_size'],
    n_units=best_params['n_units'],
    dropout_rate=best_params['dropout_rate'],
    learning_rate=best_params['learning_rate']
)

# Print final performance across folds
rmse_list = [fr['rmse'] for fr in best_fold_results]
avg_rmse = np.mean(rmse_list)
print(f"\nWalk-Forward Performance with Best Hyperparams: Avg RMSE = {avg_rmse:.4f}")

# (Optional) Plot each fold's predictions
for result in best_fold_results:
    fold_num = result['fold']
    mae = result['mae']
    rmse = result['rmse']
    preds = result['preds']
    actual = result['y_test']
    df_train = result['df_train']
    df_test = result['df_test']

    print(f"\n=== Fold {fold_num} ===")
    print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")

    # Indices for plotting
    test_index = df_test.index[best_params['lookback']:]  # shift by lookback

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df_train.index,
        y=df_train['close'],
        name="Train Data",
        line=dict(color="blue")
    ))
    fig.add_trace(go.Scatter(
        x=df_test.index,
        y=df_test['close'],
        name="Test Data",
        line=dict(color="gray")
    ))
    fig.add_trace(go.Scatter(
        x=test_index,
        y=preds,
        name="LSTM Forecast",
        line=dict(color="orange")
    ))
    fig.update_layout(
        title=f"LSTM Forecast - Fold {fold_num}",
        xaxis_title="Time",
        yaxis_title="Price"
    )
    fig.show()

###############################################################################
# 7) RETRAIN FINAL MODEL ON ENTIRE DATA & SAVE
###############################################################################
# 1) Scale entire dataset with a fresh scaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_selected[['close']])

# 2) Create sequences for the entire dataset
X_full, y_full = create_sequences(df_scaled, lookback=best_params['lookback'])

# 3) Build the final LSTM with best hyperparams
final_model = build_lstm_model(
    n_units=best_params['n_units'],
    dropout_rate=best_params['dropout_rate'],
    learning_rate=best_params['learning_rate'],
    lookback=best_params['lookback']
)

# 4) Train on entire dataset
final_model.fit(X_full, y_full, epochs=best_params['epochs'],
                batch_size=best_params['batch_size'], verbose=1)

# 5) Save model and scaler
model_path = "models/saved_models/best_timeseries_lstm_model.h5"
final_model.save(model_path)
print(f"Saved best LSTM model to '{model_path}'")

scaler_path = "models/saved_models/minmax_scaler.pkl"
joblib.dump(scaler, scaler_path)
print(f"Saved MinMaxScaler to '{scaler_path}'")


[I 2025-02-28 11:47:37,989] A new study created in memory with name: no-name-bbe792a4-92d5-475a-b71d-82eec3e015d2


Columns after TA features: ['open', 'high', 'low', 'close', 'tick_volume', 'spread', 'real_volume', 'volume_adi', 'volume_obv', 'volume_cmf', 'volume_fi', 'volume_em', 'volume_sma_em', 'volume_vpt', 'volume_vwap', 'volume_mfi', 'volume_nvi', 'volatility_bbm', 'volatility_bbh', 'volatility_bbl', 'volatility_bbw', 'volatility_bbp', 'volatility_bbhi', 'volatility_bbli', 'volatility_kcc', 'volatility_kch', 'volatility_kcl', 'volatility_kcw', 'volatility_kcp', 'volatility_kchi', 'volatility_kcli', 'volatility_dcl', 'volatility_dch', 'volatility_dcm', 'volatility_dcw', 'volatility_dcp', 'volatility_atr', 'volatility_ui', 'trend_macd', 'trend_macd_signal', 'trend_macd_diff', 'trend_sma_fast', 'trend_sma_slow', 'trend_ema_fast', 'trend_ema_slow', 'trend_vortex_ind_pos', 'trend_vortex_ind_neg', 'trend_vortex_ind_diff', 'trend_trix', 'trend_mass_index', 'trend_dpo', 'trend_kst', 'trend_kst_sig', 'trend_kst_diff', 'trend_ichimoku_conv', 'trend_ichimoku_base', 'trend_ichimoku_a', 'trend_ichimoku_b

[I 2025-02-28 11:50:00,691] Trial 0 finished with value: 3732335.848029452 and parameters: {'n_units': 128, 'dropout_rate': 0.4, 'learning_rate': 0.00048188398102667944, 'epochs': 30, 'batch_size': 64, 'lookback': 10}. Best is trial 0 with value: 3732335.848029452.
[I 2025-02-28 11:52:45,758] Trial 1 finished with value: 1182089.3237289798 and parameters: {'n_units': 128, 'dropout_rate': 0.2, 'learning_rate': 0.0011847001103633932, 'epochs': 40, 'batch_size': 48, 'lookback': 5}. Best is trial 1 with value: 1182089.3237289798.
[I 2025-02-28 12:00:06,491] Trial 2 finished with value: 2140429.6150380843 and parameters: {'n_units': 96, 'dropout_rate': 0.5, 'learning_rate': 0.0015224751735208528, 'epochs': 30, 'batch_size': 16, 'lookback': 15}. Best is trial 1 with value: 1182089.3237289798.
[I 2025-02-28 12:04:01,453] Trial 3 finished with value: 2478006.355463989 and parameters: {'n_units': 32, 'dropout_rate': 0.4, 'learning_rate': 0.005093204429337122, 'epochs': 40, 'batch_size': 48, 'lo


Best Trial:
 Value (MSE): 932777.854093264
 Params: {'n_units': 128, 'dropout_rate': 0.1, 'learning_rate': 0.0013997406651546764, 'epochs': 20, 'batch_size': 16, 'lookback': 10}

Walk-Forward Performance with Best Hyperparams: Avg RMSE = 1161.4837

=== Fold 1 ===
MAE = 165.3904, RMSE = 201.7328



=== Fold 2 ===
MAE = 1902.0396, RMSE = 2001.2004



=== Fold 3 ===
MAE = 1080.5761, RMSE = 1281.5178


Epoch 1/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 13ms/step - loss: 0.0069
Epoch 2/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 15ms/step - loss: 3.0551e-04
Epoch 3/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 17ms/step - loss: 2.7914e-04
Epoch 4/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 15ms/step - loss: 3.1685e-04
Epoch 5/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 16ms/step - loss: 2.3898e-04
Epoch 6/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 14ms/step - loss: 1.9749e-04
Epoch 7/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 14ms/step - loss: 1.9419e-04
Epoch 8/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 13ms/step - loss: 2.2309e-04
Epoch 9/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 13ms/step - loss: 2.3012e-04
Epoch 10/20
[1m312/312[0m [32m━━━━━━━━━━━━━━━━━



Saved best LSTM model to 'models/saved_models/best_timeseries_lstm_model.h5'
Saved MinMaxScaler to 'models/saved_models/minmax_scaler.pkl'


## Part 3: Backtesting & Performance Evaluation

**Objective:**  
Evaluate how well the trained model performs on unseen data, simulating real trades.

**Tasks:**
- Use walk-forward or expanding splits to mimic “live” conditions  
- Convert model predictions to signals ([-1, 0, +1] or buy/sell/hold)  
- Run a simple backtest script or VectorBT for performance metrics  
- Calculate returns, Sharpe ratio, drawdowns, confusion matrix, etc.  
- Visualize results (equity curve, trades, etc.) to judge strategy viability  

In [None]:
import warnings
warnings.filterwarnings("ignore")

import sys
import os
from pathlib import Path

import numpy as np
import pandas as pd
import MetaTrader5 as mt5
import vectorbt as vbt
import joblib

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_squared_error

# If your saved model used "mse" as a loss, map it to MeanSquaredError:
custom_objs = {"mse": tf.keras.losses.MeanSquaredError()}

###############################################################################
# 1) PROJECT SETUP
###############################################################################
project_root = Path.cwd().parent.parent  # Adjust if needed
sys.path.append(str(project_root))
os.chdir(str(project_root))
warnings.filterwarnings("ignore")

# Custom modules
from data.data_loader import get_data_mt5
from features.feature_engineering import add_all_ta_features

###############################################################################
# 2) DATA LOADING & PREPARATION
###############################################################################
if not mt5.initialize():
    print("Failed to initialize MT5")
else:
    data = get_data_mt5(symbol="BTCUSD", timeframe=mt5.TIMEFRAME_H4, n_bars=2000, start_pos=0)
    mt5.shutdown()

df = add_all_ta_features(data).dropna()
df.sort_index(inplace=True)  # Ensure chronological order

# We'll forecast price, so we keep just 'close' as the target
df_selected = df[['close']].copy()

print("DataFrame shape:", df_selected.shape)
print(df_selected.head())

###############################################################################
# 3) LOAD MODEL & SCALER (NO RE-FIT)
###############################################################################
model_path = "models/saved_models/best_timeseries_lstm_model.h5"
scaler_path = "models/saved_models/minmax_scaler.pkl"

# Load the trained LSTM model
best_model = load_model(model_path, custom_objects=custom_objs)
print(f"Loaded model from '{model_path}'")

# (Re-)compile if needed
best_model.compile(optimizer=Adam(learning_rate=0.001), loss="mean_squared_error")

# Load the MinMaxScaler that was previously fitted on training data
scaler = joblib.load(scaler_path)
print(f"Loaded scaler from '{scaler_path}'")

###############################################################################
# 4) BACKTEST PARAMETERS
###############################################################################
lookback = 10       # Must match what was used during training
threshold = 0.002   # e.g. 0.2% implied return threshold
fees = 0.0002       # e.g. 0.02% transaction cost per trade
execution_lag = 1   # trade on the NEXT bar to avoid look-ahead (set 0 for same-bar)

###############################################################################
# 5) HELPER: CREATE SEQUENCES FOR INFERENCE
###############################################################################
def create_sequences(arr, lookback):
    """Convert data into overlapping lookback sequences for LSTM."""
    X_seq = []
    for i in range(lookback, len(arr)):
        X_seq.append(arr[i - lookback : i])
    return np.array(X_seq)

###############################################################################
# 6) PREPARE FULL DATASET FOR BACKTESTING (TRANSFORM ONLY)
###############################################################################
# IMPORTANT: Use scaler.transform() instead of scaler.fit_transform()
df_scaled = scaler.transform(df_selected[['close']])

# Create sequences from the full dataset
X_full_seq = create_sequences(df_scaled, lookback)

# Predict using the entire dataset
preds_scaled = best_model.predict(X_full_seq, verbose=0)

# Convert scaled predictions back to original price domain
preds_inverted = scaler.inverse_transform(preds_scaled)

# Index that corresponds to the predictions
test_index = df_selected.index[lookback:]

# Ensure alignment (trim if any slight length mismatch)
n = min(len(test_index), len(preds_inverted))
test_index = test_index[:n]
preds_inverted = preds_inverted[:n].reshape(-1)

# Actual prices aligned to prediction rows
actual_prices = df_selected['close'].reindex(test_index).values

# Compute implied returns from predicted price vs actual price
implied_returns = (preds_inverted - actual_prices) / actual_prices

# Compute RMSE of price forecasts
rmse = np.sqrt(mean_squared_error(actual_prices, preds_inverted))

###############################################################################
# 7) BACKTEST WITH VECTORBT (TARGET EXPOSURE: -1, 0, +1)
###############################################################################
print("\n=== Running Full Backtest on Entire Dataset ===")

# Convert returns to exposure: Long (+1), Short (-1), Flat (0)
exposure = np.where(implied_returns > threshold, 1.0,
            np.where(implied_returns < -threshold, -1.0, 0.0)).astype(float)

# Align exposure with price series; optional next-bar execution to avoid look-ahead
close = df_selected['close'].reindex(test_index)
exposure_s = pd.Series(exposure, index=test_index)
if execution_lag > 0:
    exposure_s = exposure_s.shift(execution_lag).fillna(0.0)

pf = vbt.Portfolio.from_orders(
    close=close,
    size=exposure_s,             # -1 short, 0 flat, +1 long
    size_type='targetpercent',
    init_cash=10000,
    freq='4H',
    fees=fees
)

###############################################################################
# 8) PRINT PERFORMANCE METRICS
###############################################################################
total_return = pf.total_return()
sharpe_ratio = pf.sharpe_ratio()

print(f"\nFull Backtest RMSE={rmse:.4f}, Return={total_return:.2%}, Sharpe={sharpe_ratio:.2f}")
print("\nVectorbt Full Dataset Stats:")
print(pf.stats())

# Plot
fig = pf.plot()
fig.update_layout(title="Full Dataset Backtest (LSTM Price Forecasting)")
fig.show()




DataFrame shape: (2000, 1)
                        close
time                         
2024-04-03 12:00:00  65929.59
2024-04-03 16:00:00  66195.94
2024-04-03 20:00:00  65715.27
2024-04-04 00:00:00  66304.94
2024-04-04 04:00:00  65476.41
Loaded model from 'models/saved_models/best_timeseries_lstm_model.h5'
Loaded scaler from 'models/saved_models/minmax_scaler.pkl'
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step

=== Running Full Backtest on Entire Dataset ===

Full Backtest RMSE=1438.10, Return=8.85%, Sharpe=0.44

Vectorbt Full Dataset Stats:
Start                               2024-04-05 04:00:00
End                                 2025-03-02 16:00:00
Period                                331 days 16:00:00
Start Value                                     10000.0
End Value                                  10885.263431
Total Return [%]                               8.852634
Benchmark Return [%]                          25.862688
Max Gross Exposure [%]             