## Import libraries

In [None]:
import src.config as config

## ETL — Extract, Transform, Load of Raw Dataset

This section is responsible for the Extraction phase of the ETL process, pulling historical financial data from multiple sources as defined in the configuration file [dataset.yaml](../configs/dataset.yaml).

- Data Storage: Each dataset is stored locally in the data/raw/ directory, organized and saved individually as .csv files for traceability and versioning.
- Data Cleaning:
  - Missing values are treated using a standard cleaning strategy.
  - Features with low variance (below a threshold of 0.01) are removed to reduce noise and improve modeling efficiency. In this run, no low-variance columns were found.
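
The low-variance filter described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `drop_low_variance`, the toy columns, and the use of population variance are hypothetical, and the project's own cleaning code in `src.dataset` may differ.

```python
from statistics import pvariance


def drop_low_variance(columns, threshold=0.01):
    """Split columns into those kept and the names of those dropped.

    A column is dropped when its population variance is below `threshold`.
    """
    kept, dropped = {}, []
    for name, values in columns.items():
        if pvariance(values) < threshold:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Hypothetical toy data: "flag" barely changes, "close" does.
data = {
    "close": [100.0, 101.5, 99.8, 102.3],
    "flag": [1.0, 1.0, 1.0, 1.01],
}
kept, dropped = drop_low_variance(data)
print(dropped)  # ['flag']
```

Variance depends on the scale of each column, so a fixed threshold such as 0.01 is usually only meaningful when features share comparable units or have been scaled first.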

In [None]:
from src import dataset

dataset.main(
    # ---- REPLACE DEFAULT AS APPROPRIATE ----
    asset = '^BVSP',
    asset_focus = 'Close',
    years = 9
    # -----------------------------------------
    )

## Feature Engineering — Time Series Preparation

This part of the pipeline transforms the cleaned dataset into structured features suitable for training time series models. It includes the following key steps:
- Feature Generation: Constructs relevant features based on historical market data.
- Dataset Splitting: The dataset is split into training and testing sets using a consistent strategy to preserve temporal structure.
- Time Series Windowing: Converts the sequential data into overlapping windows, enabling the model to learn temporal dependencies.
- Saving Artifacts: Both training and testing sets are stored for reproducibility, along with the transformation pipelines applied during preprocessing.
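
The splitting and windowing steps listed above can be sketched as follows. The function names, the 80/20 split, and the choice of the next row's value as the target are assumptions made for illustration; the log only confirms that splitting happens before the time series windows are generated.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split rows in time order: earliest rows train, latest rows test."""
    n_test = round(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def make_windows(rows, target_index, sequence_length):
    """Build overlapping windows of `sequence_length` consecutive rows.

    Each window is paired with the target column of the row that
    immediately follows it (an assumed one-step-ahead target).
    """
    X, y = [], []
    for start in range(len(rows) - sequence_length):
        end = start + sequence_length
        X.append(rows[start:end])
        y.append(rows[end][target_index])
    return X, y


# Hypothetical series: 10 time steps with two features each.
rows = [[float(t), float(t) * 2.0] for t in range(10)]
train, test = chronological_split(rows, test_fraction=0.2)
X_train, y_train = make_windows(train, target_index=0, sequence_length=3)

print(len(train), len(test))          # 8 2
print(len(X_train), len(X_train[0]))  # 5 3
print(y_train)                        # [3.0, 4.0, 5.0, 6.0, 7.0]
```

Splitting before windowing, without shuffling, keeps every test row later in time than every training row, so no window straddles the boundary between the two sets.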

In [1]:
from src import config
from src import features

features.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    input_path = config.PROCESSED_DATA_DIR / "dataset.csv",
    train_dir = config.PROCESSED_DATA_DIR,
    test_dir = config.PROCESSED_DATA_DIR,
    sequence_length = 20
    # -----------------------------------------
)

2025-06-23 09:30:16.661 | INFO     | src.config:<module>:11 - PROJ_ROOT path is: C:\Repositories\ds-lstm-ibov
2025-06-23 09:30:23.256 | INFO     | src.features:main:32 - Generating features from dataset...
2025-06-23 09:30:23.330 | INFO     | src.utils.features.splitter_strategy:split:15 - Splitting dataset into training and testing sets...
2025-06-23 09:30:23.356 | INFO     | src.utils.features.generator_strategy:generate:18 - Generating Timeseries from dataset...
2025-06-23 09:30:23.472 | SUCCESS  | src.features:main:47 - Saving train features in C:\Repositories\ds-lstm-ibov\data\processed...
2025-06-23 09:30:23.625 | SUCCESS  | src.features:main:51 - Saving test features in C:\Repositories\ds-lstm-ibov\data\processed...
2025-06-23 09:30:23.629 | SUCCESS  | src.features:main:

## Modeling

The modeling process begins by loading the training dataset. Next, the base model is built, and the compile and training strategies are selected. With the pipeline in place, the model is trained iteratively, adjusting its weights each epoch, for up to 200 epochs (the limit shown in the training log below).

During training, key metrics are monitored: the loss, mean absolute error (MAE), and mean squared error (MSE), their validation counterparts (computed on the 10% of the training data held out by `validation_split`), and the learning rate.
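
The training log reports `mae` and `mse` alongside the loss. As a reference for reading those columns, here is a minimal sketch of the two metrics in plain Python; the sample values are hypothetical, and which loss function the model is actually compiled with is not shown in this notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 5.0]

print(mean_absolute_error(y_true, y_pred))  # 0.625
print(mean_squared_error(y_true, y_pred))   # 0.5625
```

Both metrics are expressed in the units of the (possibly scaled) target, so they are comparable across epochs but not directly across differently scaled datasets.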

In [2]:
from src import config
from src.modeling import train

train.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    X_path = config.PROCESSED_DATA_DIR / "X_train.npy",
    y_path = config.PROCESSED_DATA_DIR / "y_train.npy",
    batch_size = 16,
    validation_split = 0.1,
    # -----------------------------------------
)

2025-06-23 09:30:26.821 | INFO     | src.modeling.train:main:32 - Loading training dataset...
2025-06-23 09:30:26.855 | INFO     | src.modeling.train:main:42 - Building model...
2025-06-23 09:30:26.855 | INFO     | src.modeling.train:main:48 - Selecting compile strategy...
2025-06-23 09:30:26.855 | INFO     | src.modeling.train:main:51 - Selecting training strategy...
2025-06-23 09:30:26.855 | INFO     | src.modeling.train:main:54 - Building model training pipeline template...
2025-06-23 09:30:26.855 | INFO     | src.modeling.train:main:61 - Training model...
Epoch 1/200
184/184 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - loss: 0.2690 - mae: 0.5757 - mse

Traceback (most recent call last):
  File "c:\Repositories\ds-lstm-ibov\.venv\Lib\site-packages\mlflow\store\tracking\file_store.py", line 329, in search_experiments
    exp = self._get_experiment(exp_id, view_type)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Repositories\ds-lstm-ibov\.venv\Lib\site-packages\mlflow\store\tracking\file_store.py", line 427, in _get_experiment
    meta = FileStore._read_yaml(experiment_dir, FileStore.META_DATA_FILE_NAME)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Repositories\ds-lstm-ibov\.venv\Lib\site-packages\mlflow\store\tracking\file_store.py", line 1373, in _read_yaml
    return _read_helper(root, file_name, attempts_remaining=retries)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Repositories\ds-lstm-ibov\.venv\Lib\site-packages\mlflow\store\tracking\file_store.py", line 1366, in _read_helper
    result = read_yaml(root, file_name)
             ^^^^^^^

2025-06-23 09:36:15.948 | INFO     | src.modeling.train:main:70 - Saving 'Sequential_epoch102_loss0.1278.keras' in 'C:\Repositories\ds-lstm-ibov\models'...


Registered model 'RegressionSimpleModelBuilder' already exists. Creating a new version of this model...
Created version '2' of model 'RegressionSimpleModelBuilder'.


2025-06-23 09:36:38.289 | SUCCESS  | src.modeling.train:main:88 - Modeling training complete.
2025-06-23 09:36:38.289 | INFO     | src.modeling.train:main:91 - Elapsed time: 371.47 seconds


## Predict

In the prediction stage, the trained model is loaded along with the input data. The model then performs inference, generating predictions based on the provided data.


In [3]:
from src import config
from src.modeling import predict

predict.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    input_path = config.PROCESSED_DATA_DIR / "dataset.csv",
    preprocessor_path = config.PROCESSED_DATA_DIR / "preprocessor.pkl",
    model_path = config.MODELS_DIR / "Sequential_epoch36_loss0.1808.keras",  # Select the best model  
    postprocessor_path = config.PROCESSED_DATA_DIR / "postprocessor.pkl",
    output_path = config.PROCESSED_DATA_DIR / "dataset_report.csv",
    # -----------------------------------------
)

2025-06-23 09:36:38.345 | INFO     | src.modeling.predict:main:28 - Performing inference for model...
2025-06-23 09:36:38.404 | INFO     | src.modeling.predict:main:32 - Input data shape: (101, 50)
2025-06-23 09:36:38.467 | INFO     | src.utils.features.generator_strategy:generate:18 - Generating Timeseries from dataset...


  self._warn_if_super_not_called()


81/81 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step


ValueError: Length mismatch: Expected axis has 81 elements, new values have 1 elements
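
The numbers in this output fit together if windowing yields one window per row after the first `sequence_length` rows: 101 input rows and a window of 20 (the value used in the feature engineering cell, assumed to apply here as well) leave 81 windows, which matches the 81 elements named in the error. The cause of the mismatch inside `predict.main` is not visible in this notebook, so the following is only a sketch of how predictions might be aligned with the rows they forecast; both helper names are hypothetical.

```python
def window_count(n_rows, sequence_length):
    """Number of windows when each window needs one following row as target."""
    return max(n_rows - sequence_length, 0)


def align_predictions(row_labels, predictions, sequence_length):
    """Pair each prediction with the label of the row it forecasts.

    The first `sequence_length` rows have no prediction because they only
    serve as history for the first window.
    """
    targets = row_labels[sequence_length:]
    if len(targets) != len(predictions):
        raise ValueError(
            f"expected {len(targets)} predictions, got {len(predictions)}"
        )
    return list(zip(targets, predictions))


n = window_count(101, 20)
print(n)  # 81

# Hypothetical row labels (0..100) and placeholder predictions.
labels = list(range(101))
preds = [0.0] * n
aligned = align_predictions(labels, preds, sequence_length=20)
print(len(aligned), aligned[0][0], aligned[-1][0])  # 81 20 100
```

Checking the lengths explicitly before building the report makes a mismatch fail with a message that names both sizes, rather than surfacing later as an axis error.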

## Plot

Open the [Power BI Report](../reports/pbi/amp-fynance.pbip) and refresh the data.