---
### The overfitting problem
---
#### I. Load the data

In [32]:
import pandas as pd

df = pd.read_excel('data/Microsoft_LinkedIn_Processed.xlsx', parse_dates=['Date'], index_col=0)
df.head(n=5)

Date,Close,High,Low,Open,Volume,change_tomorrow,change_tomorrow_direction
2016-12-08,55.181126,55.696671,55.027369,55.44342,21220800,1.549151,UP
2016-12-09,56.049416,56.067505,55.289669,55.334891,27349400,0.321666,UP
2016-12-12,56.230289,56.34787,55.823285,55.91373,20198100,1.286169,UP
2016-12-13,56.962929,57.36089,56.29363,56.528788,35718900,-0.478644,DOWN
2016-12-14,56.691578,57.388013,56.555907,56.981005,30352700,-0.159789,DOWN


---
#### II. Machine Learning Model

Separate the data:

1. Target: which variable do you want to predict?
2. Explanatory: which variables will you use to calculate the prediction?

In [33]:
# Select target variable (next day's price change) 
target = df.change_tomorrow

# Select explanatory variables (features for the model)
explanatory = df[['Open', 'High', 'Low', 'Close', 'Volume']]
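
The `change_tomorrow` column arrives precomputed in the spreadsheet, and the notebook does not show how it was built. As a sanity check, the sketch below recomputes it from the closing prices printed by `df.head()`. The formula (the change to tomorrow's close, divided by tomorrow's close, in percent) is an assumption inferred from those five rows, and `change_tomorrow()` is a hypothetical helper, not part of the original pipeline.

```python
# Closing prices and change_tomorrow values copied from the df.head() output
closes = [55.181126, 56.049416, 56.230289, 56.962929, 56.691578]
reported = [1.549151, 0.321666, 1.286169, -0.478644]

def change_tomorrow(close_today, close_tomorrow):
    """Percentage change to tomorrow's close, relative to tomorrow's close (assumed formula)."""
    return (close_tomorrow - close_today) / close_tomorrow * 100

# Pair each close with the next day's close
computed = [change_tomorrow(today, tomorrow) for today, tomorrow in zip(closes, closes[1:])]
directions = ["UP" if change > 0 else "DOWN" for change in computed]

for value, expected, direction in zip(computed, reported, directions):
    print(f"{value:.6f} (reported {expected:.6f}) -> {direction}")
```

Only four values can be checked this way, because the fifth row needs the sixth day's close, which the preview does not show. Since the target is expressed in percent, the error metrics later in the notebook are in squared percentage points.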

---
#### III. Train-test split

**🧠 Why do we use a train-test split in Machine Learning?**

When building a machine learning model, the goal is to create a system that generalizes well to unseen data. To evaluate this:

- **Training set** (typically 70–80%) is used to fit the model — it learns patterns from this data.
- **Testing set** (typically 20–30%) is kept aside to simulate future data — it allows us to evaluate how well the model performs on data it hasn't seen before.

Without this split, overfitting would go unnoticed: a model can learn the training data too well, noise included, and still fail on new data. Evaluating on the held-out test set gives a **realistic, less biased estimate of performance**.
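
For time-ordered data such as daily prices, the split is usually chronological rather than shuffled, so that the model never trains on days that come after the ones it is tested on. The sketch below illustrates that property with the standard library only. The `dates` list is a hypothetical stand-in for the DataFrame index (consecutive calendar days rather than real trading days), and `chronological_split()` is an illustrative helper.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the DataFrame index: 2091 consecutive calendar days
dates = [date(2016, 12, 8) + timedelta(days=i) for i in range(2091)]

def chronological_split(rows, train_fraction=0.70):
    """Split an ordered sequence: the first part trains, the rest tests."""
    split_at = int(len(rows) * train_fraction)
    return rows[:split_at], rows[split_at:]

train_dates, test_dates = chronological_split(dates)

print(len(train_dates), len(test_dates))
# No look-ahead: the last training date is earlier than the first test date
print(max(train_dates) < min(test_dates))
```

A shuffled split would mix future days into the training set, which tends to make test results look better than they would be in live use.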

In [34]:
# Display the total number of days in the dataset
n_days = len(df.index)
print(f'Total number of days: {n_days}.')

# Calculate the index where the train/test split should occur (70% train, 30% test)
n_days_split = int(n_days * 0.70)
print(f'Day index for 70/30 split: {n_days_split}.')

# Split the data into training and testing sets based on the calculated index
X_train, y_train = explanatory.iloc[:n_days_split], target.iloc[:n_days_split]
X_test, y_test = explanatory.iloc[n_days_split:], target.iloc[n_days_split:]

Total number of days: 2091.
Day index for 70/30 split: 1463.


---
#### IV. Fit the model on the training set

In [35]:
from sklearn.tree import DecisionTreeRegressor

In [36]:
# Create a Decision Tree Regressor model with:
# - max_depth=15: caps the depth of the tree; 15 levels is still deep enough
#   to memorize much of the training set, so overfitting remains possible
# - random_state=42: ensures reproducibility of results
model_dt_split = DecisionTreeRegressor(max_depth=15, random_state=42)

# Fit (train) the model using the training data
# X_train: input features for training
# y_train: target variable (next day price change) for training
model_dt_split.fit(X=X_train, y=y_train)

---
#### V. Evaluate model

In [37]:
from sklearn.metrics import mean_squared_error

On the test set.

In [38]:
y_pred_test = model_dt_split.predict(X=X_test)
mse_test = mean_squared_error(y_true=y_test, y_pred=y_pred_test)
print(f"Mean Squared Error on the test set: {mse_test:.4f}.")

Mean Squared Error on the test set: 4.5701.


On the training set.

In [39]:
y_pred_train = model_dt_split.predict(X=X_train)
mse_train = mean_squared_error(y_true=y_train, y_pred=y_pred_train)
print(f"Mean Squared Error on the training set: {mse_train:.4f}.")

Mean Squared Error on the training set: 1.1358.
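
MSE is expressed in squared units of the target, which makes the two numbers hard to read directly. Taking the square root (RMSE) brings them back to the units of `change_tomorrow`, which appears to be in percent. The sketch below uses the two MSE values printed above; the variable names are illustrative.

```python
import math

# MSE values copied from the notebook output
mse_test_reported = 4.5701
mse_train_reported = 1.1358

# RMSE is in the same units as the target (assumed to be percentage points)
rmse_test = math.sqrt(mse_test_reported)
rmse_train = math.sqrt(mse_train_reported)
gap_ratio = mse_test_reported / mse_train_reported

print(f"RMSE test: {rmse_test:.2f}, RMSE train: {rmse_train:.2f}")
print(f"Test MSE / train MSE: {gap_ratio:.2f}")
```

Read this way, the typical forecast error is roughly one percentage point on the training set and roughly two on the test set.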


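The error is about four times higher on unseen data (4.5701 vs. 1.1358). Some increase is expected, but a gap this large suggests that the tree has overfit the training set.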
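
The gap between training and test error can be reproduced in miniature. The sketch below is not the notebook's decision tree: it swaps in a one-nearest-neighbour lookup as the simplest possible "memorizer" and applies it to synthetic data whose target is pure noise. All names and data are made up for illustration.

```python
import random

random.seed(42)

# Synthetic data: the target is pure noise, so there is nothing real to learn
x_all = [random.uniform(0, 100) for _ in range(200)]
y_all = [random.gauss(0, 1) for _ in range(200)]
x_fit, y_fit = x_all[:140], y_all[:140]
x_hold, y_hold = x_all[140:], y_all[140:]

def predict_memorizer(x):
    """Return the target of the closest fitted point (a 1-nearest-neighbour lookup)."""
    nearest = min(range(len(x_fit)), key=lambda i: abs(x_fit[i] - x))
    return y_fit[nearest]

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse_fit = mse(y_fit, [predict_memorizer(x) for x in x_fit])
mse_hold = mse(y_hold, [predict_memorizer(x) for x in x_hold])
print(f"fit MSE: {mse_fit:.4f}, hold-out MSE: {mse_hold:.4f}")
```

The memorizer reproduces every fitted point exactly, so its error on the fitted data is zero, yet it has learned nothing that transfers to the held-out points. A deep decision tree can behave in a similar way on noisy targets such as daily price changes.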

---
#### VI. Backtesting

In [40]:
from backtesting import Backtest, Strategy

Create the `Strategy`.

In [41]:
class Regression(Strategy):
    # Forecast thresholds: buy above limit_buy, sell below limit_sell
    limit_buy = 1
    limit_sell = -5

    def init(self):
        # Fit the same tree configuration on the training data only
        self.model = DecisionTreeRegressor(max_depth=15, random_state=42)
        self.model.fit(X=X_train, y=y_train)
        self.already_bought = False

    def next(self):
        # Forecast tomorrow's change from the most recent bar
        explanatory_today = self.data.df.iloc[[-1], :]
        forecast_tomorrow = self.model.predict(explanatory_today)[0]

        if forecast_tomorrow > self.limit_buy and not self.already_bought:
            self.buy()
            self.already_bought = True
        elif forecast_tomorrow < self.limit_sell and self.already_bought:
            self.sell()
            self.already_bought = False
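
The trading rule inside `next()` is a small two-state machine, and it can be easier to reason about in isolation. The sketch below extracts it into a plain function, with no dependency on `backtesting`. The `decide()` helper and the forecast values are hypothetical and exist only for illustration.

```python
def decide(forecast, holding, limit_buy=1, limit_sell=-5):
    """Return the action and the new holding flag for one bar (hypothetical helper)."""
    if forecast > limit_buy and not holding:
        return "buy", True
    if forecast < limit_sell and holding:
        return "sell", False
    return "hold", holding

holding = False
for forecast in [0.4, 1.8, 2.5, -1.0, -6.2, 0.9]:  # made-up forecasts
    action, holding = decide(forecast, holding)
    print(f"forecast {forecast:+.1f} -> {action}")
```

The thresholds are asymmetric: a forecast only slightly above 1 triggers a buy, while a sell needs a forecast below -5. In backtesting.py, `sell()` places a short order rather than merely closing the long position, and with `exclusive_orders=True` each new order closes the previous one first.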

Run the backtest on `train` data.

In [42]:
bt = Backtest(X_train, Regression,
              cash=10000, commission=.002, exclusive_orders=True)

In [43]:
results = bt.run(limit_buy=1, limit_sell=-5)

df_results_train = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'In Sample (Train)'}, axis=1)
df_results_train

Metric,In Sample (Train)
Start,2016-12-08 00:00:00
End,2022-09-30 00:00:00
Duration,2122 days 00:00:00
Exposure Time [%],98.906357
Equity Final [$],55918.949184
Equity Peak [$],72622.560138
Commissions [$],2822.152344
Return [%],459.189492


Run the backtest on `test` data.

In [44]:
bt = Backtest(X_test, Regression,
              cash=10000, commission=.002, exclusive_orders=True)

In [45]:
results = bt.run(limit_buy=1, limit_sell=-5)

df_results_test = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'Out of Sample (Test)'}, axis=1)
df_results_test

Metric,Out of Sample (Test)
Start,2022-10-03 00:00:00
End,2025-04-03 00:00:00
Duration,913 days 00:00:00
Exposure Time [%],6.210191
Equity Final [$],12456.612297
Equity Peak [$],15484.175987
Commissions [$],78.104518
Return [%],24.566123


---
#### VII. Compare both backtests

In [46]:
df_results = pd.concat([df_results_train, df_results_test], axis=1)
df_results

Metric,In Sample (Train),Out of Sample (Test)
Start,2016-12-08 00:00:00,2022-10-03 00:00:00
End,2022-09-30 00:00:00,2025-04-03 00:00:00
Duration,2122 days 00:00:00,913 days 00:00:00
Exposure Time [%],98.906357,6.210191
Equity Final [$],55918.949184,12456.612297
Equity Peak [$],72622.560138,15484.175987
Commissions [$],2822.152344,78.104518
Return [%],459.189492,24.566123
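
The two backtests cover periods of different length (2122 vs. 913 days), so the total returns are not directly comparable. One way to put them on the same footing is to convert each to a compound annual rate. The sketch below does this with the figures from the table, assuming a 365-day year and calendar-day durations. `annualized_return()` is a hypothetical helper, and its result may differ slightly from the annualized figure that backtesting.py reports itself.

```python
def annualized_return(total_return_pct, duration_days, days_per_year=365):
    """Convert a total return in percent to a compound annual rate in percent."""
    growth = 1 + total_return_pct / 100
    return (growth ** (days_per_year / duration_days) - 1) * 100

# Figures copied from the comparison table
train_annual = annualized_return(459.189492, 2122)
test_annual = annualized_return(24.566123, 913)

print(f"In sample:     {train_annual:.1f}% per year")
print(f"Out of sample: {test_annual:.1f}% per year")
```

Even after annualizing, the in-sample figure (roughly 34% per year) is several times the out-of-sample one (roughly 9% per year). The exposure time tells a similar story: the strategy is in the market about 99% of the time on the training data and only about 6% of the time on the test data.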
