## Import libraries

In [3]:
import src.config as config

## ETL — Extract, Transform, Load of Raw Dataset

This section runs the extraction phase of the ETL process, pulling historical financial data from the sources defined in the configuration file [dataset.yaml](../configs/dataset.yaml).

- Data Storage: Each dataset is saved locally as its own .csv file in the `data/raw/` directory, for traceability and versioning.
- Data Cleaning:
  - Missing values are handled by the pipeline's missing-value cleaning step (`CleanMissingValues` in the log below).
  - Features whose variance falls below the 0.01 threshold are removed to reduce noise and improve modeling efficiency. In this run, no low-variance columns were found.
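
As a rough illustration of the two cleaning steps above, the sketch below uses only the standard library. It is not the project's `clean_strategy` code: the function names, the forward-then-backward fill, and the use of population variance are all assumptions made for the example.

```python
from statistics import pvariance


def fill_missing(values):
    """Forward-fill None entries, then back-fill any gap left at the start.

    Assumption: the real pipeline may use a different fill strategy.
    """
    filled = list(values)
    last_seen = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last_seen
        else:
            last_seen = value
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]
    return filled


def drop_low_variance(columns, threshold=0.01):
    """Split a {name: values} mapping into kept columns and removed names.

    Assumption: population variance is compared against the threshold.
    """
    kept, removed = {}, []
    for name, values in columns.items():
        if pvariance(values) < threshold:
            removed.append(name)
        else:
            kept[name] = values
    return kept, removed


# Hypothetical toy data, not taken from the real dataset.
columns = {
    "close": fill_missing([None, 101.5, None, 102.3]),
    "flat": [1.0, 1.0, 1.0, 1.0],
}
kept, removed = drop_low_variance(columns)
```

With this toy input, the constant `flat` column is the only one removed, which mirrors how the log reports the list of dropped columns (empty in the run shown here).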

In [4]:
from src import dataset

dataset.main(
    # ---- REPLACE DEFAULT AS APPROPRIATE ----
    asset = '^BVSP',
    asset_focus = 'Close',
    years = 9
    # -----------------------------------------
    )

2025-06-21 13:22:01.090 | INFO     | src.dataset:main:32 - Starting raw data loading...
2025-06-21 13:22:01.090 | INFO     | src.dataset:main:38 - Requesting information between 2015-06-24 and 2025-06-21
2025-06-21 13:22:01.109 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading BOVESPA (^BVSP) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:03.852 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading S&P500 (^GSPC) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:06.725 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading BITCOIN (BTC-USD) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:09.752 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading OURO (GC=F) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:12.677 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading PETROLEO (CL=F) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:15.596 | INFO     | src.utils.dataset.dataset_loading_strategy:_load_from_yfinance:56 - Downloading ACUCAR (SB=F) from yfinance...


[*********************100%***********************]  1 of 1 completed


2025-06-21 13:22:18.451 | INFO     | src.utils.dataset.dataset_loading_strategy:load:81 - Downloading SELIC (11) from the Central Bank of Brazil API...
2025-06-21 13:22:40.480 | INFO     | src.utils.dataset.dataset_loading_strategy:load:81 - Downloading CDI (12) from the Central Bank of Brazil API...
2025-06-21 13:23:03.250 | INFO     | src.utils.dataset.dataset_loading_strategy:load:81 - Downloading SELIC_Anual (1178) from the Central Bank of Brazil API...
2025-06-21 13:23:25.115 | INFO     | src.utils.dataset.dataset_loading_strategy:load:81 - Downloading SELIC_Meta_Anual (432) from the Central Bank of Brazil API...
2025-06-21 13:23:57.300 | INFO     | src.utils.dataset.dataset_loading_strategy:load:81 - Downloading IPCA_Mensal (433

INFO:src.utils.dataset.clean_strategy:Executing CleanMissingValues...
INFO:src.utils.dataset.clean_strategy:Executing CleanLowVariance with threshold=0.01...
INFO:src.utils.dataset.clean_strategy:Columns removed due to low variance: Index([], dtype='object')


2025-06-21 13:24:33.258 | SUCCESS  | src.dataset:main:64 - Raw data successfully loaded...


  updated_df.bfill(inplace=True)


            ('Close', '^BVSP')  ('High', '^BVSP')  ('Low', '^BVSP')  \
2025-06-12            137800.0           137931.0          136175.0   
2025-06-13            137213.0           137800.0          136586.0   
2025-06-16            139256.0           139988.0          137212.0   
2025-06-17            138840.0           139497.0          138293.0   
2025-06-18            138717.0           139161.0          138443.0   
2025-06-19            138717.0           139161.0          138443.0   
2025-06-20            137116.0           138719.0          136815.0   

            ('Open', '^BVSP')  ('Volume', '^BVSP')  ('Close', '^GSPC')  \
2025-06-12           137127.0            7124600.0         6045.259766   
2025-06-13           137800.0            8628300.0         5976.970215   
2025-06-16           137212.0            7620500.0         6033.109863   
2025-06-17           139256.0            8377000.0         5982.720215   
2025-06-18           138844.0            8323400.0         59

## Feature Engineering — Time Series Preparation

This part of the pipeline transforms the cleaned dataset into structured features suitable for training time series models. It includes the following key steps:
- Feature Generation: Constructs relevant features based on historical market data.
- Dataset Splitting: The dataset is split into training and testing sets using a consistent strategy to preserve temporal structure.
- Time Series Windowing: Converts the sequential data into overlapping windows, enabling the model to learn temporal dependencies.
- Saving Artifacts: Both training and testing sets are stored for reproducibility, along with the transformation pipelines applied during preprocessing.
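
The splitting and windowing steps can be sketched as follows. This is a minimal standard-library illustration rather than the code in `src.utils.features`: the function names, the 80/20 split fraction, the window size, and the one-step-ahead target are assumptions chosen for the example.

```python
def temporal_split(rows, train_fraction=0.8):
    """Split chronologically ordered rows without shuffling.

    The earliest rows go to training and the most recent to testing,
    so no future information leaks into the training set.
    """
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def make_windows(series, window_size, horizon=1):
    """Build overlapping input windows and their look-ahead targets.

    Each window holds `window_size` consecutive values; its target is
    the value `horizon` steps after the window's last element.
    """
    inputs, targets = [], []
    for start in range(len(series) - window_size - horizon + 1):
        end = start + window_size
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets


# Hypothetical toy series standing in for a price column.
prices = [float(p) for p in range(10)]
train, test = temporal_split(prices, train_fraction=0.8)
X_train, y_train = make_windows(train, window_size=3)
```

Splitting before windowing, as done here, keeps every training window entirely inside the training period.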

In [5]:
from src import features

features.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    input_path = config.PROCESSED_DATA_DIR / "dataset.csv",
    train_dir = config.PROCESSED_DATA_DIR,
    test_dir = config.PROCESSED_DATA_DIR,
    # -----------------------------------------
)

2025-06-21 13:24:45.349 | INFO     | src.features:main:31 - Generating features from dataset...
2025-06-21 13:24:45.417 | INFO     | src.utils.features.splitter_strategy:split:15 - Splitting dataset into training and testing sets...
2025-06-21 13:24:45.444 | INFO     | src.utils.features.generator_strategy:generate:18 - Generating Timeseries from dataset...
2025-06-21 13:24:46.003 | SUCCESS  | src.features:main:46 - Saving train features in C:\Repositories\ds-lstm-ibov\data\processed...
2025-06-21 13:24:46.420 | SUCCESS  | src.features:main:50 - Saving test features in C:\Repositories\ds-lstm-ibov\data\processed...
2025-06-21 13:24:46.429 | SUCCESS  | src.features:main:

## Modeling

Modeling begins by loading the training dataset. The base model is then built, and the compilation and training strategies are selected. With the pipeline assembled, the model is trained for up to 1000 epochs (the limit shown in the training log), iteratively adjusting its weights.

During training, the monitored metrics include training loss and accuracy, validation loss and accuracy (computed on held-out data), and the learning rate.

In [6]:
from src.modeling import train

train.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    X_path = config.PROCESSED_DATA_DIR / "X_train.npy",
    y_path = config.PROCESSED_DATA_DIR / "y_train.npy"
    # -----------------------------------------
)

2025-06-21 13:24:46.657 | INFO     | src.modeling.train:main:29 - Loading training dataset...
2025-06-21 13:24:46.778 | INFO     | src.modeling.train:main:37 - Building model...
2025-06-21 13:24:46.778 | INFO     | src.modeling.train:main:43 - Selecting compile strategy...
2025-06-21 13:24:46.778 | INFO     | src.modeling.train:main:46 - Selecting training strategy...
2025-06-21 13:24:46.778 | INFO     | src.modeling.train:main:49 - Building model training pipeline template...
2025-06-21 13:24:46.778 | INFO     | src.modeling.train:main:56 - Training model...
Epoch 1/1000
2025-06-21 13:25:02.888 | ERROR    | src.utils.train.train_strategy:train

AttributeError: 'Sequential' object has no attribute 'epoch'

## Predict

In the prediction stage, the trained model is loaded together with the input data and the saved pre- and post-processors. The model then runs inference, and the resulting predictions are written to the output report file.

In [None]:
from src.modeling import predict
predict.main(
    # ---- REPLACE DEFAULT PATHS AS APPROPRIATE ----
    input_path = config.PROCESSED_DATA_DIR / "dataset.csv",
    preprocessor_path = config.PROCESSED_DATA_DIR / "preprocessor.pkl",
    model_path = config.MODELS_DIR / "Sequential_epoch137_loss0.0081.keras",  # Select the best model  
    postprocessor_path = config.PROCESSED_DATA_DIR / "postprocessor.pkl",
    output_path = config.PROCESSED_DATA_DIR / "dataset_report.csv",
    # -----------------------------------------
)
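
The `model_path` in the cell above is chosen by hand. If checkpoints follow the `..._epoch<N>_loss<value>.keras` naming seen in that file name, the choice could be automated with a small helper like the one below. This is a hypothetical sketch: `best_checkpoint` is not part of the project, and every file name except `Sequential_epoch137_loss0.0081.keras` is invented for illustration.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from the single file name in the cell above.
CHECKPOINT_PATTERN = re.compile(
    r"epoch(?P<epoch>\d+)_loss(?P<loss>\d+(?:\.\d+)?)\.keras$"
)


def best_checkpoint(paths):
    """Return the path whose file name encodes the lowest loss."""
    scored = []
    for path in paths:
        path = Path(path)
        match = CHECKPOINT_PATTERN.search(path.name)
        if match:  # Skip files that do not look like checkpoints.
            scored.append((float(match.group("loss")), path))
    if not scored:
        raise FileNotFoundError("no file matches the checkpoint naming pattern")
    return min(scored, key=lambda item: item[0])[1]


# Hypothetical directory listing.
candidates = [
    "Sequential_epoch012_loss0.0240.keras",
    "Sequential_epoch137_loss0.0081.keras",
    "notes.txt",
]
best = best_checkpoint(candidates)
```

In the notebook, the helper could be fed `config.MODELS_DIR.glob("*.keras")`, assuming `config.MODELS_DIR` is a `pathlib.Path`, as its use with the `/` operator suggests.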

## Plot

Open the [Power BI Report](../reports/pbi/amp-fynance.pbip) and refresh the data.