This notebook is the first modeling stage of the project. Its purpose is to establish robust,
well-documented baseline models against which more sophisticated architectures (e.g. LSTM or Temporal Fusion
Transformer (TFT) networks) can later be compared.

The main steps are:

1. Load the pre-processed train, validation and test datasets.
2. Specify the explanatory variables (features) and the response variable (binary target).
3. Train a regularized Logistic Regression model using standardized features.
4. Train an XGBoost classifier capable of capturing non-linear relationships.
5. Evaluate both models on the train, validation and test sets using a consistent set of metrics.
6. Persist the trained models and their evaluation metrics to disk (`models/` and `reports/` folders).

No time-series windowing or sequence models are considered in this notebook; those will be introduced in a
subsequent modeling stage dedicated specifically to recurrent neural networks.

# 0. Imports and global configuration


In [1]:
import sys
from pathlib import Path
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    RocCurveDisplay,
)

import xgboost as xgb

from src.utils.config import load_config

cfg = load_config()

MODELS_DIR = PROJECT_ROOT / "models"
REPORTS_DIR = PROJECT_ROOT / "reports"
PROC_DATA_DIR = PROJECT_ROOT / "data" / "processed"

REPORTS_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

TARGET_TICKER = "AMD"
RANDOM_SEED = 42


XGBoostError: 
XGBoost Library (libxgboost.dylib) could not be loaded.
Likely causes:
  * OpenMP runtime is not installed
    - vcomp140.dll or libgomp-1.dll for Windows
    - libomp.dylib for Mac OSX
    - libgomp.so for Linux and other UNIX-like OSes
    Mac OSX users: Run `brew install libomp` to install OpenMP runtime.

  * You are running 32-bit Python on a 64-bit OS

Error message(s): ["dlopen(/Users/jenriquezafra/Proyectos/Dev/python/Equity-Signals/.venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib\n  Referenced from: <636BF463-1886-392D-B8B3-6011C44DCEE9> /Users/jenriquezafra/Proyectos/Dev/python/Equity-Signals/.venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.dylib\n  Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/Users/jenriquezafra/.pyenv/versions/3.12.1/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jenriquezafra/.pyenv/versions/3.12.1/lib/libomp.dylib' (no such file), '/opt/homebrew/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/libomp.dylib' (no such file), '/Users/jenriquezafra/.pyenv/versions/3.12.1/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jenriquezafra/.pyenv/versions/3.12.1/lib/libomp.dylib' (no such file), '/opt/homebrew/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/libomp.dylib' (no such file)"]


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    RocCurveDisplay,
)

import matplotlib.pyplot as plt

import xgboost as xgb  # fails if xgboost or its OpenMP runtime (e.g. libomp on macOS) is missing

# project folders (very simple, nothing fancy)
PROJECT_ROOT = Path.cwd().resolve()
DATA_PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
MODELS_DIR = PROJECT_ROOT / "models"
REPORTS_DIR = PROJECT_ROOT / "reports"

MODELS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

TARGET_COL = "BinaryTarget"
RANDOM_STATE = 42

To use windowing, do something like the following:

```
import pandas as pd

from src.processing.windowing import build_windows, build_windows_from_df

train = pd.read_parquet("data/processed/train.parquet")
val   = pd.read_parquet("data/processed/val.parquet")
test  = pd.read_parquet("data/processed/test.parquet")

WINDOW_SIZE = 60
TARGET_COL = "BinaryTarget"

# Build the training windows; this call also returns the feature columns used
X_train, y_train, feature_cols = build_windows_from_df(
    train, target_col=TARGET_COL, window_size=WINDOW_SIZE
)

# For val/test, reuse the SAME feature_cols obtained from the training set
X_val, y_val = build_windows(val, feature_cols, TARGET_COL, WINDOW_SIZE)
X_test, y_test = build_windows(test, feature_cols, TARGET_COL, WINDOW_SIZE)
```

---
# 1. Load processed datasets
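
A minimal sketch of this step, assuming the three splits are stored in `data/processed/` as `train.parquet`, `val.parquet` and `test.parquet` (the file names used in the windowing note above). The helper name `load_splits` and its `reader` parameter are hypothetical; they are introduced only so that the reading function can be swapped out.

```python
from pathlib import Path

import pandas as pd


def load_splits(data_dir, reader=pd.read_parquet):
    """Return a dict mapping split name to DataFrame.

    `load_splits` and `reader` are hypothetical names; the files are
    assumed to follow the <split>.parquet naming convention.
    """
    data_dir = Path(data_dir)
    return {
        split: reader(data_dir / f"{split}.parquet")
        for split in ("train", "val", "test")
    }
```

In the notebook this could be called as `splits = load_splits(DATA_PROCESSED_DIR)`, after which `splits["train"]`, `splits["val"]` and `splits["test"]` hold the three DataFrames.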

---
# 2. Feature and target definition
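
One possible way to separate the explanatory variables from the response is sketched below. It assumes that every numeric column other than `BinaryTarget` is a feature, which may not hold for the actual processed datasets; an explicit `feature_cols` list can be passed instead. The function name `split_features_target` is hypothetical.

```python
import pandas as pd

TARGET_COL = "BinaryTarget"


def split_features_target(df, target_col=TARGET_COL, feature_cols=None):
    """Split a DataFrame into a feature matrix X and a binary target y."""
    if feature_cols is None:
        # Assumption: every numeric column except the target is a feature.
        numeric_cols = df.select_dtypes(include="number").columns
        feature_cols = [c for c in numeric_cols if c != target_col]
    X = df[list(feature_cols)]
    y = df[target_col].astype(int)
    return X, y, list(feature_cols)
```

The column list returned for the training set should be reused for the validation and test sets, so that all three matrices share the same columns in the same order.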

---
# 3. Target balance inspection
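
A short sketch of how the class balance could be inspected per split. It assumes the splits are held in a dict of DataFrames keyed by split name; the function name `target_balance` is hypothetical.

```python
import pandas as pd


def target_balance(splits, target_col="BinaryTarget"):
    """Return the share of each class per split (rows = splits, columns = classes)."""
    shares = {
        name: df[target_col].value_counts(normalize=True).sort_index()
        for name, df in splits.items()
    }
    # A class that is absent from a split shows up as NaN; report it as 0.
    return pd.DataFrame(shares).T.fillna(0.0)
```

A strongly imbalanced target would make accuracy hard to interpret, which is one reason to report ROC AUC, precision, recall and F1 alongside it.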

---
# 4. Unified Evaluation Utility
(definition of the evaluation function)
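
A minimal sketch of such a utility, built on the scikit-learn metrics imported in section 0. The function name `evaluate_binary` and the default decision threshold of 0.5 are assumptions, not choices confirmed by the project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_binary(y_true, y_proba, threshold=0.5):
    """Compute a fixed metric set from true labels and positive-class probabilities."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    # Hard predictions are derived from the probabilities (assumed threshold).
    y_pred = (y_proba >= threshold).astype(int)
    return {
        "roc_auc": float(roc_auc_score(y_true, y_proba)),
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall": float(recall_score(y_true, y_pred, zero_division=0)),
        "f1": float(f1_score(y_true, y_pred, zero_division=0)),
    }
```

Because the function only needs labels and probabilities, the same call works for any fitted classifier that exposes `predict_proba`, for example `evaluate_binary(y_val, model.predict_proba(X_val)[:, 1])`.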

---
# 5. Baseline Model I: Logistic Regression

## 5.1. Model construction
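
The regularized Logistic Regression on standardized features could be assembled as a scikit-learn `Pipeline`, as sketched below. The helper name `build_logreg_pipeline` and the values `C=1.0` and `max_iter=1000` are illustrative assumptions; the sketch relies on L2 being scikit-learn's default penalty.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42


def build_logreg_pipeline(C=1.0, random_state=RANDOM_STATE):
    """Standardize the features, then fit an L2-regularized logistic regression."""
    return Pipeline(
        steps=[
            # The scaler is fitted on the training data only, inside the pipeline.
            ("scaler", StandardScaler()),
            # Smaller C means stronger regularization (illustrative default).
            ("clf", LogisticRegression(C=C, max_iter=1000, random_state=random_state)),
        ]
    )
```

Keeping the scaler inside the pipeline means that `fit` learns the scaling statistics from the training set alone, and that the same transformation is applied automatically when predicting on the validation and test sets.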

## 5.2. Training

## 5.3. Evaluation

---
# 6. Baseline Model II: XGBoost Classifier

## 6.1. Model construction
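
A possible construction of the XGBoost baseline is sketched below. The helper name `build_xgb_classifier` and all hyperparameter values are illustrative assumptions rather than tuned settings. The import is deferred into the function so that a failure to load the native library (such as the missing OpenMP runtime reported in section 0) surfaces only when the model is actually built.

```python
RANDOM_STATE = 42


def build_xgb_classifier(random_state=RANDOM_STATE, **overrides):
    """Return an unfitted XGBClassifier with conservative, illustrative settings."""
    # Deferred import: a library loading error is raised here, on first use.
    import xgboost as xgb

    params = {
        "n_estimators": 300,      # number of boosting rounds (assumed)
        "max_depth": 3,           # shallow trees to limit overfitting (assumed)
        "learning_rate": 0.05,
        "subsample": 0.8,         # row subsampling per tree
        "colsample_bytree": 0.8,  # column subsampling per tree
        "eval_metric": "logloss",
        "random_state": random_state,
    }
    params.update(overrides)
    return xgb.XGBClassifier(**params)
```

Tree-based models do not require standardized inputs, so in this sketch the classifier is used directly rather than behind a `StandardScaler`.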

## 6.2. Training

## 6.3. Evaluation

---
# 7. Metrics and Export
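
One way to persist the evaluation metrics to the `reports/` folder is sketched below. The function name `export_metrics`, the file name `baseline_metrics.json` and the nested model/split/metric layout are assumptions made for illustration.

```python
import json
from pathlib import Path


def export_metrics(metrics, reports_dir, filename="baseline_metrics.json"):
    """Write a nested {model: {split: {metric: value}}} dict to a JSON file."""
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)
    out_path = reports_dir / filename
    with out_path.open("w", encoding="utf-8") as fh:
        # Sorted keys keep the file stable across runs, which eases diffing.
        json.dump(metrics, fh, indent=2, sort_keys=True)
    return out_path
```

JSON keeps the report readable and easy to compare against the metrics of later sequence models; the values must be plain Python numbers, so NumPy scalars should be converted with `float()` first.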

---
# 8. Model serialization
- save logistic regression model
- save XGBoost model
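
A minimal serialization sketch using `joblib`, which is installed alongside scikit-learn. The helper names `save_model` and `load_model` and the `.joblib` file extension are assumptions.

```python
from pathlib import Path

import joblib


def save_model(model, models_dir, name):
    """Serialize a fitted model to <models_dir>/<name>.joblib and return the path."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    out_path = models_dir / f"{name}.joblib"
    joblib.dump(model, out_path)
    return out_path


def load_model(path):
    """Load a model previously written by save_model."""
    return joblib.load(path)
```

The same helper can store the whole Logistic Regression pipeline, scaler included. For the XGBoost model, `XGBClassifier.save_model` with a `.json` path is an alternative that uses XGBoost's own format and is less tied to the Python environment than a pickle-based file.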

---
# 9. Placeholder for Sequence Models (LSTM or even TFT)
