# Second Feature Engineering

In [37]:
## retrieving ibovespa data

import yfinance as yf
import pandas as pd
import os
from pathlib import Path

ibov = "^BVSP"
df_ibov = yf.download(ibov, start="2019-01-01", end="2023-11-20")
df_ibov = df_ibov[['Close']].reset_index()
df_ibov.rename(columns={'Date': 'date', 'Close': 'ibovespa_close'}, inplace=True)
Path("../data").mkdir(exist_ok=True)
df_ibov.to_csv("../data/ibovespa_2019-2023.csv", index=False)
df_ibov

  df_ibov = yf.download(ibov, start="2019-01-01", end="2023-11-20")
[*********************100%***********************]  1 of 1 completed


Price,date,ibovespa_close
Ticker,Unnamed: 1_level_1,^BVSP
0,2019-01-02,91012.0
1,2019-01-03,91564.0
2,2019-01-04,91841.0
3,2019-01-07,91699.0
4,2019-01-08,92032.0
...,...,...
1208,2023-11-10,120636.0
1209,2023-11-13,120376.0
1210,2023-11-14,123328.0
1211,2023-11-16,124576.0


In [38]:
## load stocks data

tickers = [
    'PETR4', 'VALE3', 'ITUB4', 'BBDC4', 'ABEV3', 'BBAS3', 'GGBR4', 'BRAP4', 'LREN3', 'MGLU3',
    'B3SA3', 'WEGE3', 'JBSS3', 'SUZB3', 'RADL3', 'ELET3', 'ELET6', 'SANB11', 'RENT3', 'RAIL3',
    'VIVT4', 'KLBN11', 'HYPE3', 'CSAN3', 'UGPA3', 'BRFS3', 'BRKM5', 'CIEL3', 'TOTS3', 'ENBR3'
]

base_dir = "../data/base"
all_dfs = []

for year in range(2019, 2024):
    file_path = os.path.join(base_dir, f"{year}_brazil_stocks.csv")
    if os.path.exists(file_path):
        print(f"Loading {file_path}...")
        df_year = pd.read_csv(file_path, low_memory=False)
        df_year['date'] = pd.to_datetime(df_year['date'], format='%Y%m%d', errors='coerce')
        all_dfs.append(df_year)
    else:
        print(f"Warning: {file_path} not found — skipping.")

if not all_dfs:
    raise FileNotFoundError("No yearly stock data files were found (2019–2023).")

stock_df = pd.concat(all_dfs, ignore_index=True)
stock_df = stock_df[stock_df['ticker'].isin(tickers)]

cols_to_drop = [
    'currency', 'name', 'marketType', 'bdiCode', 'prazoT', 'paperSpecification',
    'optionPrice', 'priceCorrection', 'paperDueDate', 'quoteFactor'
]
stock_df = stock_df.drop(columns=cols_to_drop, errors='ignore')
stock_df['date'] = pd.to_datetime(stock_df['date'])

Loading ../data/base\2019_brazil_stocks.csv...
Loading ../data/base\2020_brazil_stocks.csv...
Loading ../data/base\2021_brazil_stocks.csv...
Loading ../data/base\2022_brazil_stocks.csv...
Loading ../data/base\2023_brazil_stocks.csv...


In [39]:
## merge with Ibovespa
ibov_df = pd.read_csv("../data/ibovespa_2019-2023.csv")
ibov_df.rename(columns={'Date': 'date'}, inplace=True)
ibov_df['date'] = pd.to_datetime(ibov_df['date'])

merged = stock_df.merge(ibov_df[['date', 'ibovespa_close']], on='date', how='left')
df = merged.copy()

In [40]:
## feature Engineering

df['day_of_week'] = df['date'].dt.day_name()
day_map = {'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5}
df['day_of_week'] = df['day_of_week'].map(day_map)

df['daily_return'] = (df['close'] - df['open']) / df['open']
df['price_range'] = df['max'] - df['min']
df['volume_per_quantity'] = df['volume'] / df['quantity']

df['tomorrow'] = df.groupby('ticker')['close'].shift(-1)
df['target'] = (df['tomorrow'] > df['close']).astype(int)

# Rolling metrics
df['rolling_close_5']   = df.groupby('ticker')['close'].transform(lambda x: x.shift(1).rolling(5).mean())
df['rolling_std_5']     = df.groupby('ticker')['close'].transform(lambda x: x.shift(1).rolling(5).std())
df['rolling_return_5']  = df.groupby('ticker')['daily_return'].transform(lambda x: x.shift(1).rolling(5).mean())
df['rolling_volume_5']  = df.groupby('ticker')['volume'].transform(lambda x: x.shift(1).rolling(5).mean())
df['momentum_5']        = df['close'] / df['rolling_close_5'] - 1


In [41]:
horizons = [2, 5, 55, 220]
for h in horizons:
    df[f"Close_Ratio_{h}"] = df.groupby('ticker')['close'].transform(lambda x: x / x.rolling(h).mean())
    df[f"Trend_{h}"] = df.groupby('ticker')['target'].transform(lambda x: x.shift(1).rolling(h).sum())

df = df.dropna(subset=df.columns[df.columns != "tomorrow"])


In [42]:
# Train/test model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

drop_features = ['open', 'close', 'min', 'max', 'avg', 'daily_return', 'rolling_close_5', 'Trend_220', 'Close_Ratio_2']
features = [
    'quantity', 'volume', 'ibovespa_close', 'day_of_week',
    'price_range', 'volume_per_quantity',
    'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5',
    'Close_Ratio_5', 'Close_Ratio_55', 'Close_Ratio_220',
    'Trend_5', 'Trend_55'
]


train = df.iloc[:-500]
test = df.iloc[-500:]

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], train['target'])
preds = model.predict(test[features])
print("Precision:", precision_score(test['target'], preds))


Precision: 0.5932203389830508


The Random Forest model achieved a precision of 0.59, indicating that its “up” predictions were correct about 59% of the time.

However, since financial data is sequential and non-stationary, a single train-test split may not represent the model’s true performance over time. And such precision could be result of overfitting, which you'll soon understand as one of the major problems we've had during this project.

To address this, we adopted a walk-forward (backtesting) strategy, where we train on the first year and test on the next, then progressively expand the training set.
This approach ensures that:

- The model only uses past data to predict the future (avoiding data leakage).
- We can evaluate how well it generalizes to new market conditions over time.

In [43]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["target"], preds], axis=1)
    return combined

In [44]:
def backtest(data, model, predictors, start=50, step=220):
    all_predictions = []

    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i+step)].copy()
        predictions = predict(train, test, predictors, model)
        all_predictions.append(predictions)

    return pd.concat(all_predictions)

In [None]:
predictions = backtest(df, model, features)

In [21]:
predictions["Predictions"].value_counts()

Predictions
1    545
0    445
Name: count, dtype: int64

In [22]:
precision_score(predictions["target"], predictions["Predictions"])

0.7045871559633028

In [23]:
predictions["target"].value_counts() / predictions.shape[0]

target
1    0.514141
0    0.485859
Name: count, dtype: float64

## Backtesting (Walk-Forward Validation)

After introducing **backtesting (walk-forward validation)**, the model evaluation became more realistic and representative of how it would perform in real-world trading scenarios.

### Why Backtesting Matters

* **Chronological Training:**
  Each training period uses only past data to predict the next horizon, avoiding future data leakage.

* **Realistic Simulation:**
  This setup mirrors how the model would actually be deployed in live trading, being retrained periodically as new data becomes available.

### Comparing Results

Even though the overall precision dropped from approximately **0.85** to **0.68**, the newer result is more trustworthy because:

* The earlier score was based on a single, fixed test set, which could reflect a lucky period rather than consistent predictive power.
* The backtesting result reflects average real-world performance across multiple time periods, making it more robust and less overfitted.


In [24]:
horizons = [2, 5, 55, 220] ## two days, a week, a month, a year worth of trade
new_predictors = []

for horizon in horizons:
    # Create rolling ratio (close vs rolling mean of close)
    ratio_column = f"Close_Ratio_{horizon}"
    df[ratio_column] = df["close"] / df["close"].rolling(horizon).mean()

    # Create rolling trend (sum of past 'target' values)
    trend_column = f"Trend_{horizon}"
    df[trend_column] = df["target"].shift(1).rolling(horizon).sum()

    new_predictors += [ratio_column, trend_column]

# Combine your base features with the new ones
features = [
    'open', 'close', 'min', 'max', 'avg', 'quantity', 'volume',
    'ibovespa_close', 'day_of_week', 'daily_return', 'price_range', 'volume_per_quantity',
    'rolling_close_5', 'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5'
] + new_predictors

print("Final feature list:")
print(features)


Final feature list:
['open', 'close', 'min', 'max', 'avg', 'quantity', 'volume', 'ibovespa_close', 'day_of_week', 'daily_return', 'price_range', 'volume_per_quantity', 'rolling_close_5', 'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5', 'Close_Ratio_2', 'Trend_2', 'Close_Ratio_5', 'Trend_5', 'Close_Ratio_55', 'Trend_55', 'Close_Ratio_220', 'Trend_220']


### Added Features

Here we added two new features to improve the model's ability to capture both short-term and long-term market patterns:

- **Close_Ratio_h:**
  Measures the relative price position — how far above or below the recent average the price is.
  This helps identify potential overbought or oversold conditions.

- **Trend_h:**
  Captures momentum by counting how many times the price increased in the past *h* sessions.
  This helps the model detect sustained uptrends or downtrends.

In [25]:
df = df.dropna(subset=df.columns[df.columns != "tomorrow"])

In [26]:
df

Unnamed: 0,date,ticker,name,open,close,min,max,avg,quantity,volume,...,momentum_5,tomorrow,Close_Ratio_2,Trend_2,Close_Ratio_5,Trend_5,Close_Ratio_55,Trend_55,Close_Ratio_220,Trend_220
220,2023-03-21,VALE3,VALE,83.22,82.71,81.54,83.44,82.34,15692200,1.292122e+09,...,0.000556,13.95,1.565884,1.0,1.912193,2.0,2.487056,29.0,2.456513,112.0
221,2023-03-21,ABEV3,AMBEVS/A,14.14,13.95,13.90,14.18,13.99,16500900,2.309042e+08,...,-0.011620,13.14,0.288641,1.0,0.322767,2.0,0.421965,28.0,0.414884,111.0
222,2023-03-21,BBDC4,BRADESCO,13.19,13.14,13.09,13.37,13.22,38954800,5.152254e+08,...,-0.022467,23.88,0.970100,0.0,0.303970,2.0,0.397542,28.0,0.395019,110.0
223,2023-03-21,ITUB4,ITAUUNIBANCO,23.50,23.88,23.50,24.15,23.94,38574400,9.237763e+08,...,0.010922,23.40,1.290113,1.0,0.762403,2.0,0.723302,28.0,0.718054,111.0
224,2023-03-21,PETR4,PETROBRAS,23.20,23.40,23.08,23.60,23.40,43391200,1.015784e+09,...,0.001712,81.68,0.989848,1.0,0.744843,2.0,0.734132,27.0,0.702641,111.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035,2023-11-14,VALE3,VALE,72.02,73.61,71.90,74.30,73.50,52233600,3.839284e+09,...,0.040777,29.83,1.429320,2.0,1.637015,3.0,2.233818,29.0,2.322615,110.0
1036,2023-11-14,ITUB4,ITAUUNIBANCO,29.49,29.83,29.43,30.07,29.78,41826500,1.245616e+09,...,0.019411,15.05,0.576759,1.0,0.813871,2.0,0.924492,28.0,0.939181,110.0
1037,2023-11-14,BBDC4,BRADESCO,14.80,15.05,14.71,15.09,14.94,42559800,6.361136e+08,...,-0.002254,13.70,0.670677,0.0,0.462735,2.0,0.465706,28.0,0.473761,110.0
1038,2023-11-14,ABEV3,AMBEVS/A,13.50,13.70,13.45,13.75,13.66,24478100,3.345021e+08,...,0.017831,36.18,0.953043,0.0,0.423939,2.0,0.424087,27.0,0.432114,109.0


# Test model

In [29]:
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

In [30]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["target"])
    preds = model.predict_proba(test[predictors])[:,1]
    preds[preds >=.6] = 1
    preds[preds <.6] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["target"], preds], axis=1)
    return combined

In [31]:
predictions = backtest(df, model, new_predictors)

KeyError: "['Close_Ratio_2', 'Trend_220'] not in index"

In [77]:
predictions["Predictions"].value_counts()

Predictions
0.0    416
1.0    139
Name: count, dtype: int64

In [78]:
precision_score(predictions["target"], predictions["Predictions"])

0.8345323741007195

In [79]:
predictions["target"].value_counts() / predictions.shape[0]

target
1    0.522523
0    0.477477
Name: count, dtype: float64

### Backtesting with New Features

After introducing the new features (`Close_Ratio_h` and `Trend_h`) and applying the walk-forward (backtesting) approach, the model achieved a **precision of approximately 0.83**.

- **Precision:** 0.83
  When the model predicts that the price will rise, it is correct about 83% of the time.
  This is a significant improvement over earlier backtesting runs (~0.68 precision), showing that the new features helped the model better capture meaningful market patterns.
  It's important to mention that no evaluation tests were performed here, they'll be displayed in the EvaluationSecondModels.ipynb. Hence we should assume here that the model might (probably) be overfitting due to its high precision.

#### Why Performance Improved

The new features (`Close_Ratio_h` and `Trend_h`) capture both **relative price levels** and **momentum** across multiple time horizons.
This allows the Random Forest model to:

- Recognize **short-term corrections** and **long-term trends**.
- Respond to **multi-scale market dynamics**, improving its ability to generalize.
- Produce more **stable and realistic predictions** when evaluated through walk-forward validation.

Overall, combining richer temporal features with proper backtesting leads to **more reliable and actionable model performance** in a time-series trading context.
