# Second Feature Engineering

In this step we download the Ibovespa index data and load the stock data for 2019 through 2023 from our dataset. The download has already been run, and the result is stored in our data folder.

In [None]:
## Retrieve Ibovespa data
## No need to run this cell: the data has already been downloaded

import yfinance as yf
import pandas as pd
import os
from pathlib import Path

ibov = "^BVSP"
df_ibov = yf.download(ibov, start="2019-01-01", end="2023-11-20")
df_ibov = df_ibov[['Close']].reset_index()
df_ibov.rename(columns={'Date': 'date', 'Close': 'ibovespa_close'}, inplace=True)
Path("../data").mkdir(exist_ok=True)
#df_ibov.to_csv("../data/ibovespa_2019-2023.csv", index=False)
#df_ibov

In [113]:
# load ibov
ibov_df = pd.read_csv("../data/ibovespa_2019-2023.csv")

In [114]:
## load stocks data

tickers = [
    'PETR4', 'VALE3', 'ITUB4', 'BBDC4', 'ABEV3', 'BBAS3', 'GGBR4', 'BRAP4', 'LREN3', 'MGLU3',
    'B3SA3', 'WEGE3', 'JBSS3', 'SUZB3', 'RADL3', 'ELET3', 'ELET6', 'SANB11', 'RENT3', 'RAIL3',
    'VIVT4', 'KLBN11', 'HYPE3', 'CSAN3', 'UGPA3', 'BRFS3', 'BRKM5', 'CIEL3', 'TOTS3', 'ENBR3'
]

base_dir = "../data/base"
all_dfs = []

# NOTE: the end of range() is exclusive, so this loop loads 2019 through 2022
for year in range(2019, 2023):
    file_path = os.path.join(base_dir, f"{year}_brazil_stocks.csv")
    if os.path.exists(file_path):
        print(f"Loading {file_path}...")
        df_year = pd.read_csv(file_path, low_memory=False)
        df_year['date'] = pd.to_datetime(df_year['date'], format='%Y%m%d', errors='coerce')
        all_dfs.append(df_year)
    else:
        print(f"Warning: {file_path} not found — skipping.")

if not all_dfs:
    raise FileNotFoundError("No yearly stock data files were found (2019–2022).")

stock_df = pd.concat(all_dfs, ignore_index=True)
stock_df = stock_df[stock_df['ticker'].isin(tickers)]

cols_to_drop = [
    'currency', 'name', 'marketType', 'bdiCode', 'prazoT', 'paperSpecification',
    'optionPrice', 'priceCorrection', 'paperDueDate', 'quoteFactor'
]
stock_df = stock_df.drop(columns=cols_to_drop, errors='ignore')
stock_df['date'] = pd.to_datetime(stock_df['date'])

Loading ../data/base/2019_brazil_stocks.csv...
Loading ../data/base/2020_brazil_stocks.csv...
Loading ../data/base/2021_brazil_stocks.csv...
Loading ../data/base/2022_brazil_stocks.csv...


In [115]:
## merge with Ibovespa
ibov_df = pd.read_csv("../data/ibovespa_2019-2023.csv")
ibov_df.rename(columns={'Date': 'date'}, inplace=True)
ibov_df['date'] = pd.to_datetime(ibov_df['date'])

merged = stock_df.merge(ibov_df[['date', 'ibovespa_close']], on='date', how='left')
df = merged.copy()

## Feature Selection

In [116]:
## feature Engineering

df['day_of_week'] = df['date'].dt.day_name()
day_map = {'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5}
df['day_of_week'] = df['day_of_week'].map(day_map)

df['daily_return'] = (df['close'] - df['open']) / df['open']
df['price_range'] = df['max'] - df['min']
df['volume_per_quantity'] = df['volume'] / df['quantity']

df['tomorrow'] = df.groupby('ticker')['close'].shift(-1)
df['target'] = (df['tomorrow'] > df['close']).astype(int)

# Rolling metrics
df['rolling_close_5']   = df.groupby('ticker')['close'].transform(lambda x: x.shift(1).rolling(5).mean())
df['rolling_std_5']     = df.groupby('ticker')['close'].transform(lambda x: x.shift(1).rolling(5).std())
df['rolling_return_5']  = df.groupby('ticker')['daily_return'].transform(lambda x: x.shift(1).rolling(5).mean())
df['rolling_volume_5']  = df.groupby('ticker')['volume'].transform(lambda x: x.shift(1).rolling(5).mean())
df['momentum_5']        = df['close'] / df['rolling_close_5'] - 1


In [117]:
horizons = [2, 5, 55, 220]
for h in horizons:
    df[f"Close_Ratio_{h}"] = df.groupby('ticker')['close'].transform(lambda x: x / x.rolling(h).mean())
    df[f"Trend_{h}"] = df.groupby('ticker')['target'].transform(lambda x: x.shift(1).rolling(h).sum())

df = df.dropna(subset=df.columns[df.columns != "tomorrow"])



Drop the features that are not needed:

In [118]:
drop_features = ['open', 'close', 'min', 'max', 'avg', 'daily_return', 'rolling_close_5', 'Trend_220', 'Close_Ratio_2']

df.drop(drop_features, axis=1, inplace=True)



Unnamed: 0,date,ticker,quantity,volume,ibovespa_close,day_of_week,price_range,volume_per_quantity,tomorrow,target,rolling_std_5,rolling_return_5,rolling_volume_5,momentum_5,Trend_2,Close_Ratio_5,Trend_5,Close_Ratio_55,Trend_55,Close_Ratio_220
6600,2019-11-18,ABEV3,9445200,165711710.0,106347.0,1,0.26,17.544542,17.67,1,0.091815,0.001866,246297201.6,0.008848,2.0,1.005843,3.0,0.9469,27.0,0.971916
6601,2019-11-18,B3SA3,10101700,501185409.0,106347.0,1,1.54,49.613967,47.94,0,0.542015,0.007892,394530554.4,-0.008612,1.0,0.990027,3.0,1.066652,34.0,1.284061
6602,2019-11-18,BBAS3,15585600,724962222.0,106347.0,1,1.15,46.514874,45.79,0,0.494803,-0.001119,641216796.2,-0.014652,1.0,0.989588,2.0,0.98751,26.0,0.933396
6603,2019-11-18,BBDC4,18808900,625495829.0,106347.0,1,0.91,33.255312,33.07,0,0.394373,-0.00511,597208475.0,-0.020286,0.0,0.986188,0.0,0.970354,27.0,0.886461
6604,2019-11-18,BRAP4,1248000,40441499.0,106347.0,1,0.78,32.405047,32.69,1,0.455379,-0.001457,46990409.6,-0.002218,2.0,1.002414,3.0,1.032162,32.0,1.047033


# Evaluating Models with backtesting

In [119]:
# Train/test model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
features = [
    'quantity', 'volume', 'ibovespa_close', 'day_of_week',
    'price_range', 'volume_per_quantity',
    'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5',
    'Close_Ratio_5', 'Close_Ratio_55', 'Close_Ratio_220',
    'Trend_5', 'Trend_55'
]

train = df.iloc[:-500]
test = df.iloc[-500:]

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], train['target'])
preds = model.predict(test[features])
print("Precision:", precision_score(test['target'], preds))


Precision: 0.6330645161290323


The Random Forest model achieved a precision of about 0.63, meaning that its “up” predictions were correct about 63% of the time on this test set.
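
To make the metric concrete, here is a minimal sketch of how precision is computed from labels and predictions. The helper `precision` and the toy lists are illustrative assumptions; the notebook itself relies on `sklearn.metrics.precision_score`.

```python
def precision(y_true, y_pred):
    """Share of predicted 'up' days (1) that really were up days."""
    # True labels of the rows where the model predicted 'up'
    predicted_up = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not predicted_up:
        return 0.0  # no 'up' predictions at all, so nothing to be right about
    return sum(predicted_up) / len(predicted_up)

# Toy data (invented for illustration): 1 = price rose the next day
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

print(precision(y_true, y_pred))
```

Note that rows where the model predicts "down" do not enter the calculation, so precision says nothing about how many up days were missed.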

However, since financial data is sequential and non-stationary, a single train-test split may not represent the model’s true performance over time. Such a precision could also be the result of overfitting, which turned out to be one of the major problems we faced during this project.

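To address this, we adopted a walk-forward (backtesting) strategy: we train on an initial window, test on the block of rows that follows, and then progressively expand the training set to include that block. In the `backtest` function below, the windows are defined by row counts (`start=50`, `step=1000`) rather than by calendar years.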
This approach ensures that:

- The model only uses past data to predict the future (avoiding data leakage).
- We can evaluate how well it generalizes to new market conditions over time.
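
Both guarantees depend on the rows being in chronological order, because the splits are taken by position. A minimal sketch of that ordering step on a made-up frame follows; the column names mirror this notebook, while the values and the `cutoff` position are invented for illustration.

```python
import pandas as pd

# Toy frame, deliberately out of chronological order (values are invented)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-11-09", "2022-10-27", "2022-11-30", "2022-10-27"]),
    "ticker": ["CSAN3", "SUZB3", "CSAN3", "RAIL3"],
    "target": [0, 1, 0, 1],
})

# Sort by date first so that any positional split separates past from future
toy_sorted = toy.sort_values(["date", "ticker"]).reset_index(drop=True)

cutoff = 2  # hypothetical split position
train, test = toy_sorted.iloc[:cutoff], toy_sorted.iloc[cutoff:]

# Every training row is dated no later than every test row
assert train["date"].max() <= test["date"].min()
print(toy_sorted)
```

If the frame is not sorted this way, a positional split can place later dates in the training set, which is exactly the leakage the walk-forward setup is meant to avoid.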

In [120]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["target"], preds], axis=1)
    return combined

In [121]:
import time
from tqdm import tqdm  # progress bar

def backtest(data, model, predictors, start=50, step=1000):
    all_predictions = []

    # tqdm adds a progress bar in notebooks or terminal
    for i in tqdm(range(start, data.shape[0], step), desc="Backtesting Progress"):
        t0 = time.time()

        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i+step)].copy()
        predictions = predict(train, test, predictors, model)
        all_predictions.append(predictions)

        elapsed = time.time() - t0
        print(f"Iteration up to index {i:5d} | train size: {len(train):5d} | took {elapsed:.2f} sec")

    return pd.concat(all_predictions)


In [122]:
predictions = backtest(df, model, features)

Iteration up to index    50 | train size:    50 | took 0.06 sec
Iteration up to index  1050 | train size:  1050 | took 0.22 sec
Iteration up to index  2050 | train size:  2050 | took 0.41 sec
Iteration up to index  3050 | train size:  3050 | took 0.64 sec
Iteration up to index  4050 | train size:  4050 | took 0.84 sec
Iteration up to index  5050 | train size:  5050 | took 1.20 sec
Iteration up to index  6050 | train size:  6050 | took 1.37 sec
Iteration up to index  7050 | train size:  7050 | took 1.71 sec
Iteration up to index  8050 | train size:  8050 | took 2.02 sec
Iteration up to index  9050 | train size:  9050 | took 2.17 sec
Iteration up to index 10050 | train size: 10050 | took 2.37 sec
Iteration up to index 11050 | train size: 11050 | took 2.61 sec
Iteration up to index 12050 | train size: 12050 | took 2.81 sec
Iteration up to index 13050 | train size: 13050 | took 3.13 sec
Iteration up to index 14050 | train size: 14050 | took 3.36 sec
Iteration up to index 15050 | train size: 15050 | took 3.66 sec
Iteration up to index 16050 | train size: 16050 | took 3.93 sec
Iteration up to index 17050 | train size: 17050 | took 4.19 sec
Iteration up to index 18050 | train size: 18050 | took 4.32 sec
Iteration up to index 19050 | train size: 19050 | took 4.59 sec
Iteration up to index 20050 | train size: 20050 | took 5.15 sec
Iteration up to index 21050 | train size: 21050 | took 5.07 sec
Iteration up to index 22050 | train size: 22050 | took 5.42 sec

Backtesting Progress: 100%|██████████| 23/23 [01:01<00:00,  2.66s/it]

In [123]:
predictions["Predictions"].value_counts()

Predictions
0    11501
1    11116
Name: count, dtype: int64

In [124]:
precision_score(predictions["target"], predictions["Predictions"])

0.5929291111910759

In [125]:
predictions["target"].value_counts() / predictions.shape[0]

target
0    0.507096
1    0.492904
Name: count, dtype: float64

## Backtesting (Walk-Forward Validation)

After introducing **backtesting (walk-forward validation)**, the model evaluation became more realistic and representative of how it would perform in real-world trading scenarios.

### Why Backtesting Matters

* **Chronological Training:**
  Each training period uses only past data to predict the next horizon, avoiding future data leakage.

* **Realistic Simulation:**
  This setup mirrors how the model would actually be deployed in live trading, being retrained periodically as new data becomes available.

### Comparing Results

Even though the overall precision dropped from approximately **0.63** to **0.59**, the newer result is more trustworthy because:

* The earlier score was based on a single, fixed test set of 500 rows, which could reflect a lucky period rather than consistent predictive power.
* The backtesting result pools predictions across many successive test blocks, making it a more robust estimate that is less sensitive to overfitting on one period.
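
One way to check consistency across periods, rather than relying on a single pooled number, is to compute precision per block of walk-forward predictions. The sketch below is pure Python on invented labels; `precision_by_block` and the block size are hypothetical and not part of the notebook.

```python
def precision_by_block(y_true, y_pred, block_size):
    """Precision of the 'up' class (1) for each consecutive block of rows."""
    scores = []
    for start in range(0, len(y_true), block_size):
        pairs = zip(y_true[start:start + block_size], y_pred[start:start + block_size])
        hits = [t for t, p in pairs if p == 1]  # true labels where 'up' was predicted
        scores.append(sum(hits) / len(hits) if hits else None)  # None: no 'up' calls
    return scores

# Invented labels and predictions, assumed to be in chronological order
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]

print(precision_by_block(y_true, y_pred, block_size=4))
```

A wide spread between blocks would suggest that the pooled precision hides periods in which the model performs no better than chance.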


In [104]:
horizons = [2, 5, 55, 220]  # trading sessions: 2 days, about a week, roughly a quarter, roughly a year
new_predictors = []

for horizon in horizons:
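    # Create rolling ratio (close vs rolling mean of close)
    # NOTE: this rolls over the whole frame, so windows can span different tickers;
    # the horizons cell earlier in this notebook computes the same ratio per ticker via groupby('ticker').
    ratio_column = f"Close_Ratio_{horizon}"
    df[ratio_column] = df["close"] / df["close"].rolling(horizon).mean()

    # Create rolling trend (sum of past 'target' values); the same cross-ticker caveat applies
    trend_column = f"Trend_{horizon}"
    df[trend_column] = df["target"].shift(1).rolling(horizon).sum()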

    new_predictors += [ratio_column, trend_column]

# Combine your base features with the new ones
features = [
    'open', 'close', 'min', 'max', 'avg', 'quantity', 'volume',
    'ibovespa_close', 'day_of_week', 'daily_return', 'price_range', 'volume_per_quantity',
    'rolling_close_5', 'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5'
] + new_predictors

print("Final feature list:")
print(features)


Final feature list:
['open', 'close', 'min', 'max', 'avg', 'quantity', 'volume', 'ibovespa_close', 'day_of_week', 'daily_return', 'price_range', 'volume_per_quantity', 'rolling_close_5', 'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5', 'Close_Ratio_2', 'Trend_2', 'Close_Ratio_5', 'Trend_5', 'Close_Ratio_55', 'Trend_55', 'Close_Ratio_220', 'Trend_220']


### Added Features

Here we added two new feature families, each computed once per horizon *h*, to improve the model's ability to capture both short-term and long-term market patterns:

- **Close_Ratio_h:**
  Measures the relative price position — how far above or below the recent average the price is.
  This helps identify potential overbought or oversold conditions.

- **Trend_h:**
  Captures momentum by counting how many times the price increased in the past *h* sessions.
  This helps the model detect sustained uptrends or downtrends.
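
As a small illustration of both definitions, the sketch below computes them for one horizon on invented prices, grouping by ticker so that a window never spans two stocks. The column names mirror the notebook; the tickers `AAA` and `BBB`, the prices, and the horizon of 2 are assumptions made for the example.

```python
import pandas as pd

# Invented closing prices for two tickers, already in date order within each ticker
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "close":  [10.0, 11.0, 12.0, 11.0, 50.0, 49.0, 48.0, 50.0],
})

# target = 1 when the next close of the same ticker is higher
toy["tomorrow"] = toy.groupby("ticker")["close"].shift(-1)
toy["target"] = (toy["tomorrow"] > toy["close"]).astype(int)

h = 2  # hypothetical horizon

# Close relative to its own h-session rolling mean (the window includes today)
toy[f"Close_Ratio_{h}"] = toy.groupby("ticker")["close"].transform(
    lambda x: x / x.rolling(h).mean()
)

# Number of 'up' targets in the h rows before today (shifted to exclude today's target)
toy[f"Trend_{h}"] = toy.groupby("ticker")["target"].transform(
    lambda x: x.shift(1).rolling(h).sum()
)

print(toy)
```

The first rows of each ticker come out as `NaN` because the window is not yet full, which is why the notebook drops missing values after building these columns.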

In [105]:
df = df.dropna(subset=df.columns[df.columns != "tomorrow"])

In [106]:
df

Unnamed: 0,date,ticker,open,close,min,max,avg,quantity,volume,ibovespa_close,...,rolling_volume_5,momentum_5,Close_Ratio_2,Trend_2,Close_Ratio_5,Trend_5,Close_Ratio_55,Trend_55,Close_Ratio_220,Trend_220
6820,2019-11-28,ELET6,35.05,35.78,34.91,35.82,35.54,1103400,3.921704e+07,108290.0,...,7.508088e+07,-0.001507,1.016911,1.0,1.064121,4.0,0.912023,34.0,0.941900,122.0
6821,2019-11-28,ENBR3,19.29,19.56,19.24,19.60,19.48,3769000,7.345123e+07,108290.0,...,3.575244e+07,0.022692,0.706903,2.0,0.611326,4.0,0.498122,34.0,0.516735,122.0
6822,2019-11-28,GGBR4,17.39,17.17,17.02,17.58,17.31,8195700,1.418854e+08,108290.0,...,4.319285e+08,0.014536,0.934931,2.0,0.506759,4.0,0.444218,34.0,0.455172,123.0
6823,2019-11-28,HYPE3,33.52,33.58,33.40,33.82,33.55,888900,2.982502e+07,108290.0,...,6.401009e+07,-0.020249,1.323350,1.0,1.193489,3.0,0.872657,33.0,0.890148,123.0
6824,2019-11-28,ITUB4,34.45,34.60,34.02,34.70,34.42,15850200,5.455893e+08,108290.0,...,7.232712e+08,-0.015031,1.014960,0.0,1.229654,3.0,0.896872,32.0,0.916943,123.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29292,2022-11-09,CSAN3,18.14,17.94,17.91,18.56,18.19,7979900,1.452018e+08,113580.0,...,1.771459e+08,0.044359,1.199198,0.0,0.678312,0.0,0.578123,13.0,0.659771,99.0
29293,2022-10-27,SUZB3,53.30,53.12,51.88,53.48,52.93,7576800,4.010589e+08,114641.0,...,3.157137e+08,-0.025142,1.495075,1.0,1.563089,1.0,1.690066,14.0,1.943038,100.0
29294,2022-10-27,RAIL3,20.52,20.96,20.33,21.21,20.93,12484600,2.614176e+08,114641.0,...,2.694105e+08,0.059656,0.565875,1.0,0.565112,1.0,0.669369,14.0,0.767965,99.0
29295,2022-11-30,CSAN3,17.29,18.03,17.21,18.03,17.68,18650700,3.298320e+08,112486.0,...,1.402352e+08,0.051435,0.924853,0.0,0.738753,1.0,0.578697,14.0,0.661460,98.0


# Test model

In [107]:
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

In [108]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["target"])
    preds = model.predict_proba(test[predictors])[:,1]
    preds[preds >=.6] = 1
    preds[preds <.6] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["target"], preds], axis=1)
    return combined

In [109]:
predictions = backtest(df, model, new_predictors)

Iteration up to index    50 | train size:    50 | took 0.10 sec
Iteration up to index  1050 | train size:  1050 | took 0.24 sec
Iteration up to index  2050 | train size:  2050 | took 0.39 sec
Iteration up to index  3050 | train size:  3050 | took 0.60 sec
Iteration up to index  4050 | train size:  4050 | took 0.83 sec
Iteration up to index  5050 | train size:  5050 | took 1.02 sec
Iteration up to index  6050 | train size:  6050 | took 1.17 sec
Iteration up to index  7050 | train size:  7050 | took 1.38 sec
Iteration up to index  8050 | train size:  8050 | took 1.64 sec
Iteration up to index  9050 | train size:  9050 | took 1.91 sec
Iteration up to index 10050 | train size: 10050 | took 2.25 sec
Iteration up to index 11050 | train size: 11050 | took 2.50 sec
Iteration up to index 12050 | train size: 12050 | took 2.75 sec
Iteration up to index 13050 | train size: 13050 | took 2.90 sec
Iteration up to index 14050 | train size: 14050 | took 3.16 sec
Iteration up to index 15050 | train size: 15050 | took 3.40 sec
Iteration up to index 16050 | train size: 16050 | took 3.72 sec
Iteration up to index 17050 | train size: 17050 | took 4.04 sec
Iteration up to index 18050 | train size: 18050 | took 4.40 sec
Iteration up to index 19050 | train size: 19050 | took 4.67 sec
Iteration up to index 20050 | train size: 20050 | took 4.95 sec
Iteration up to index 21050 | train size: 21050 | took 5.26 sec
Iteration up to index 22050 | train size: 22050 | took 5.66 sec

Backtesting Progress: 100%|██████████| 23/23 [00:58<00:00,  2.56s/it]

In [110]:
predictions["Predictions"].value_counts()

Predictions
0.0    20449
1.0     1948
Name: count, dtype: int64

In [111]:
precision_score(predictions["target"], predictions["Predictions"])

0.5621149897330595

In [112]:
predictions["target"].value_counts() / predictions.shape[0]

target
0    0.507702
1    0.492298
Name: count, dtype: float64

### Backtesting with New Features

After introducing the new features (`Close_Ratio_h` and `Trend_h`) and applying the walk-forward (backtesting) approach with a 0.6 probability threshold, the run recorded above reached a **precision of approximately 0.56**.

- **Precision:** 0.56
  When the model predicts that the price will rise, it is correct about 56% of the time. It makes that prediction sparingly: 1,948 of 22,397 rows.
  This is slightly below the earlier backtesting run (~0.59 precision), so the recorded output does not show an improvement from the new features. The two runs are also not directly comparable, since this one uses only the new predictors, a differently configured forest, and a stricter threshold.
  No further evaluation tests were performed here; they are presented in EvaluationSecondModels.ipynb. Until then, we should assume that the model may be overfitting.
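
The `predict` function above only labels a row as "up" when the predicted probability is at least 0.6. A minimal sketch of that trade-off on invented probabilities follows; the helper `evaluate_threshold` and all values are hypothetical. A stricter cutoff yields fewer "up" calls, and precision is computed only over those calls.

```python
def evaluate_threshold(y_true, proba_up, threshold):
    """Return (number of 'up' calls, precision over those calls)."""
    # True labels of the rows whose predicted probability clears the cutoff
    calls = [t for t, p in zip(y_true, proba_up) if p >= threshold]
    precision = sum(calls) / len(calls) if calls else None  # None: no 'up' calls
    return len(calls), precision

# Invented labels and predicted probabilities of an 'up' day
y_true   = [1,    0,    1,    0,    1,    1,    0,    0]
proba_up = [0.72, 0.55, 0.61, 0.48, 0.52, 0.65, 0.58, 0.30]

for threshold in (0.5, 0.6):
    n_calls, prec = evaluate_threshold(y_true, proba_up, threshold)
    print(f"threshold={threshold}: {n_calls} 'up' calls, precision={prec}")
```

In this toy data the stricter cutoff happens to raise precision, but that is not guaranteed on real data; it only guarantees fewer positive predictions, which is why the count of "up" calls is worth reporting next to the precision.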

#### What the New Features Are Meant to Capture

The new features (`Close_Ratio_h` and `Trend_h`) capture both **relative price levels** and **momentum** across multiple time horizons.
In principle, this allows the Random Forest model to:

- Recognize **short-term corrections** and **long-term trends**.
- Respond to **multi-scale market dynamics**, which may improve its ability to generalize.
- Produce more **stable and realistic predictions** when evaluated through walk-forward validation.

Overall, combining richer temporal features with proper backtesting is intended to give **more reliable and actionable model performance** in a time-series trading context; whether it does so should be judged from the recorded precision rather than assumed.
