In [36]:
# --- SETUP (run only in Colab) ---
# Uncomment the line below to clone the repository into /content,
# which is where the data path used later in this notebook points
#!git clone https://github.com/lautrevor/data-science-portfolio


# Notebook 3: Feature Engineering & Linear Regression Modeling

This notebook builds on the cleaned dataset and exploratory analysis developed in the previous notebooks. Using historical VFV data, we construct predictive features, apply a time-based train/test split, and evaluate a baseline model before fitting a linear regression model to predict next-day returns.

The goal of this notebook is to assess whether simple, interpretable features derived from past price behavior contain predictive signal, while following best practices for time-series modeling and avoiding data leakage.

## Data Loading and Preparation

This notebook uses the cleaned VFV dataset produced in Notebook 1. The file is reloaded from disk so that this notebook can be run on its own, without relying on variables left in memory by earlier notebooks.

All feature engineering, target construction, and data splitting are performed explicitly within this notebook.

In [37]:
import pandas as pd

vfv_df = pd.read_csv("/content/data-science-portfolio/project-2-vfv-signal-vs-noise/data/vfv_clean.csv")
vfv_df = vfv_df.sort_values("Date").reset_index(drop=True)

vfv_df.head()

| | Date | Adj Close | return | target |
|---|---|---|---|---|
| 0 | 2013-01-03 | 21.237472534179688 | 0.005516 | 0.002352 |
| 1 | 2013-01-04 | 21.287412643432617 | 0.002352 | -0.001564 |
| 2 | 2013-01-07 | 21.254125595092773 | -0.001564 | -0.003524 |
| 3 | 2013-01-08 | 21.17923164367676 | -0.003524 | 0.004322 |
| 4 | 2013-01-09 | 21.27076911926269 | 0.004322 | 0.004695 |


## Feature Engineering and Target Definition

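Predictive features are constructed only from return information available up to and including day t. The target is the return on day t+1, so no feature uses information from the day being predicted, which avoids look-ahead bias. The loaded file already contains a `target` column that appears to hold the same quantity; it is rebuilt here as `target_return` so that the construction is explicit within this notebook.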

In [38]:
import numpy as np

# Target: next-day return
vfv_df["target_return"] = vfv_df["return"].shift(-1)

# Lagged returns
vfv_df["return_lag_1"] = vfv_df["return"].shift(1)
vfv_df["return_lag_5"] = vfv_df["return"].shift(5)
vfv_df["return_lag_10"] = vfv_df["return"].shift(10)

# Rolling statistics
vfv_df["rolling_mean_5"] = vfv_df["return"].rolling(5).mean()
vfv_df["rolling_mean_20"] = vfv_df["return"].rolling(20).mean()

vfv_df["rolling_std_5"] = vfv_df["return"].rolling(5).std()
vfv_df["rolling_std_20"] = vfv_df["return"].rolling(20).std()
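The alignment of `shift` and `rolling` is easy to get backwards, so a quick check on a toy series can help. The sketch below uses made-up return values (not VFV data) and a hypothetical 3-day window named `rolling_mean_3` to confirm that the target on row t is the return on row t+1, while the lagged and rolling features look only at row t and earlier.

```python
import pandas as pd

# Toy return series (made-up values, not VFV data)
toy = pd.DataFrame({"return": [0.01, -0.02, 0.03, 0.00, 0.015, -0.005]})

# Same constructions as in the cell above, applied to the toy series
toy["target_return"] = toy["return"].shift(-1)            # return of day t+1
toy["return_lag_1"] = toy["return"].shift(1)              # return of day t-1
toy["rolling_mean_3"] = toy["return"].rolling(3).mean()   # days t-2 through t

# The target on row t is the return on row t+1
assert toy.loc[2, "target_return"] == toy.loc[3, "return"]

# The lag on row t is the return on row t-1
assert toy.loc[2, "return_lag_1"] == toy.loc[1, "return"]

# The rolling mean on row t averages rows t-2..t and nothing later
expected_mean = toy.loc[0:2, "return"].mean()
assert abs(toy.loc[2, "rolling_mean_3"] - expected_mean) < 1e-12
```

Note that the rolling windows include the current day's return. That is not leakage here, because the current day's return is already known when the next day's return is being predicted.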

## Final Modeling Dataset

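Rows containing missing values are removed to form the final dataset used for modeling. In practice this drops the warm-up rows at the start of the series, where the lagged and rolling features are not yet defined, and the final row, whose next-day return is unknown.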

In [39]:
model_df = vfv_df.dropna().copy()

feature_cols = [
    "return_lag_1",
    "return_lag_5",
    "return_lag_10",
    "rolling_mean_5",
    "rolling_mean_20",
    "rolling_std_5",
    "rolling_std_20"
]

vfv_X = model_df[feature_cols]
vfv_y = model_df["target_return"]
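To see which rows `dropna()` removes, the following sketch rebuilds the longest lag, the longest rolling window, and the target on a toy series of 30 made-up values. The row count and the values are assumptions for illustration only; the column names mirror the ones used above.

```python
import pandas as pd

n_rows = 30
# Deterministic made-up returns, not VFV data
toy = pd.DataFrame({"return": [0.001 * (i % 7 - 3) for i in range(n_rows)]})

toy["target_return"] = toy["return"].shift(-1)            # NaN on the last row
toy["return_lag_10"] = toy["return"].shift(10)            # NaN on the first 10 rows
toy["rolling_std_20"] = toy["return"].rolling(20).std()   # NaN on the first 19 rows

toy_model = toy.dropna()
rows_lost = n_rows - len(toy_model)

print("rows lost:", rows_lost)
print("first kept index:", toy_model.index.min())
print("last kept index:", toy_model.index.max())
```

The 20-day window dominates the warm-up period, so 19 leading rows are dropped along with the single trailing row that has no next-day return.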

## Time-Based Train/Test Split

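To preserve the temporal structure of the data, the dataset is split chronologically without shuffling: the first 80% of observations form the training set and the most recent 20% form the test set. This simulates a real-world forecasting scenario in which the model is evaluated only on data that comes after its training period.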

In [40]:
from sklearn.model_selection import train_test_split

vfv_X_train, vfv_X_test, vfv_y_train, vfv_y_test = train_test_split(
    vfv_X,
    vfv_y,
    test_size=0.20,
    shuffle=False
)

vfv_X_train.shape, vfv_X_test.shape

((2613, 7), (654, 7))
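With `shuffle=False`, the split above amounts to cutting the sorted data at a single point in time. The sketch below reproduces that idea with plain slicing on a toy frame (made-up values, hypothetical column `x`) and checks that every training date precedes every test date. It assumes, as I understand scikit-learn's behavior, that a fractional `test_size` is rounded up to a whole number of test rows.

```python
import math
import pandas as pd

# Toy frame with made-up values, already sorted by date
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=11, freq="D"),
    "x": range(11),
})

test_size = 0.20
n_test = math.ceil(len(toy) * test_size)   # 11 * 0.2 = 2.2 -> 3 test rows

train = toy.iloc[:-n_test]   # earliest observations
test = toy.iloc[-n_test:]    # most recent observations

# Every training date precedes every test date
assert train["Date"].max() < test["Date"].min()
print(len(train), len(test))
```

The same rounding applied to the 3,267 rows in this notebook gives 654 test rows and 2,613 training rows, which agrees with the shapes printed above.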

## Baseline Model

Before fitting a linear regression model, a baseline predictor is established. The baseline model predicts the mean of the training target for all observations in the test set. This provides a simple benchmark to assess whether engineered features add predictive value beyond a naive approach.

In [41]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

baseline_pred = pd.Series(
    vfv_y_train.mean(),
    index=vfv_y_test.index
)

baseline_rmse = mean_squared_error(vfv_y_test, baseline_pred)**0.5
baseline_mae = mean_absolute_error(vfv_y_test, baseline_pred)

baseline_rmse, baseline_mae

(0.009075595012234596, 0.006070436977504857)

### Baseline Model Results

The baseline model, which predicts the mean of the training returns for every test observation, has an RMSE of about 0.0091 and an MAE of about 0.0061 on the test set. Because returns are expressed as decimal fractions, these correspond to typical errors of roughly 0.9 and 0.6 percentage points per day. They provide a reference level of error for next-day return prediction using a naive approach.

Given the inherent noise and volatility in daily financial returns, this baseline performance establishes a reasonable benchmark against which the linear regression model can be evaluated.
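For readers who want to see what the two metrics measure, here is a minimal sketch that computes them by hand for a mean baseline. The train and test values are made up for illustration; only the procedure mirrors the cell above.

```python
import math

# Made-up next-day returns, not VFV data
y_train = [0.010, -0.020, 0.005, 0.001]
y_test = [0.004, -0.012, 0.020]

# The baseline predicts the training mean for every test observation
baseline = sum(y_train) / len(y_train)
errors = [y - baseline for y in y_test]

# RMSE squares the errors before averaging, so large misses weigh more
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# MAE averages absolute errors, so every miss counts in proportion to its size
mae = sum(abs(e) for e in errors) / len(errors)

print(f"baseline={baseline:.4f} rmse={rmse:.4f} mae={mae:.4f}")
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are much larger than the rest. That difference in sensitivity is worth keeping in mind when two models rank differently on the two metrics.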

## Linear Regression Model

After establishing a baseline benchmark, a linear regression model is fit to evaluate whether the engineered features improve predictive performance for next-day returns. Linear regression is chosen for its simplicity and interpretability, allowing direct examination of how historical return-based features relate to future price movement.

Model performance is evaluated on the test set using the same error metrics as the baseline model. This comparison helps determine whether the linear regression model provides meaningful improvement over a naive mean-based prediction.

In [42]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(vfv_X_train, vfv_y_train)
lr_pred = lr_model.predict(vfv_X_test)
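Since interpretability is the stated reason for choosing linear regression, it can be useful to look at the fitted coefficients. The sketch below shows one way to tabulate them using the `coef_` and `intercept_` attributes of scikit-learn's `LinearRegression`. It runs on synthetic data with hypothetical features `feat_a` and `feat_b`, so the numbers say nothing about VFV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on feat_a only (illustrative assumption)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feat_a", "feat_b"])
y = 0.5 * X["feat_a"] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# One coefficient per feature, ordered by absolute size
coef_table = pd.Series(model.coef_, index=X.columns).sort_values(
    key=abs, ascending=False
)
print(coef_table)
print("intercept:", model.intercept_)
```

In this notebook the equivalent would be `pd.Series(lr_model.coef_, index=feature_cols)`. Raw coefficients are only comparable across features that share a scale, which is not quite the case for lagged returns versus rolling standard deviations, so standardizing the features first may make the comparison fairer.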

In [43]:
lr_rmse = mean_squared_error(vfv_y_test, lr_pred)**0.5
lr_mae = mean_absolute_error(vfv_y_test, lr_pred)

lr_rmse, lr_mae

(0.009067716175836196, 0.006213782713509596)

In [44]:
pd.DataFrame(
    {
        "RMSE": [baseline_rmse, lr_rmse],
        "MAE": [baseline_mae, lr_mae],
    },
    index=["Baseline", "Linear Regression"]
)

| | RMSE | MAE |
|---|---|---|
| Baseline | 0.009076 | 0.00607 |
| Linear Regression | 0.009068 | 0.006214 |


### Model Comparison and Interpretation

The linear regression model lowers RMSE only marginally relative to the baseline (0.009068 vs. 0.009076, a reduction of about 0.09%), while its MAE is about 2.4% higher (0.006214 vs. 0.006070). On this test set, the engineered features therefore show little, if any, predictive signal for next-day returns, and a difference this small could plausibly reflect noise rather than a real gain.

Given the high level of noise in daily financial return data, this result is expected. The analysis demonstrates that while simple historical features may capture weak patterns, they do not lead to strong predictive performance in this setting.
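The size of the difference is easier to judge in relative terms. The sketch below takes the four metric values printed earlier in this notebook and computes the percentage change in each, plus an out-of-sample R² measured against the mean baseline. The R² figure is an addition for illustration and is not part of the original analysis.

```python
# Metric values copied from the outputs above
baseline_rmse, baseline_mae = 0.009075595012234596, 0.006070436977504857
lr_rmse, lr_mae = 0.009067716175836196, 0.006213782713509596

# Negative means the linear model has lower error than the baseline
rmse_change_pct = 100 * (lr_rmse - baseline_rmse) / baseline_rmse
mae_change_pct = 100 * (lr_mae - baseline_mae) / baseline_mae

# Out-of-sample R^2 relative to the mean baseline: 1 - MSE_model / MSE_baseline
oos_r2 = 1 - (lr_rmse / baseline_rmse) ** 2

print(f"RMSE change: {rmse_change_pct:+.2f}%")
print(f"MAE change:  {mae_change_pct:+.2f}%")
print(f"Out-of-sample R^2 vs baseline: {oos_r2:.4f}")
```

An R² this close to zero indicates that the model explains only a tiny fraction of the test-set variance beyond what the training mean already captures.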

### Actual vs Predicted Returns

To visually assess model performance, we compare the actual next-day returns in the test set with the returns predicted by the linear regression model. This visualization helps illustrate how closely the model tracks observed market behavior over time.

In [45]:
import altair as alt

plot_df = pd.DataFrame({
    "Date": model_df.loc[vfv_y_test.index, "Date"],
    "Actual Return": vfv_y_test.values,
    "Predicted Return": lr_pred
})
plot_df.head()

| | Date | Actual Return | Predicted Return |
|---|---|---|---|
| 2632 | 2023-06-29 | 0.012107 | -0.000672 |
| 2633 | 2023-06-30 | -0.00134 | -0.001227 |
| 2634 | 2023-07-04 | 0.003929 | 0.000249 |
| 2635 | 2023-07-05 | -0.001527 | -0.002712 |
| 2636 | 2023-07-06 | -0.009464 | -0.00112 |


In [46]:
vfv_lr_plot = alt.Chart(plot_df).transform_fold(
    ["Actual Return", "Predicted Return"],
    as_=["Series", "Return"]
).mark_line().encode(
    x=alt.X("Date:T", title="Date"),
    y=alt.Y("Return:Q", title="Daily Return"),
    color=alt.Color("Series:N", title=""),
    tooltip=["Date:T", "Series:N", "Return:Q"]
).properties(
    title="Actual vs Predicted Daily Returns (Test Set)",
    width=700,
    height=300
)
vfv_lr_plot

The visualization highlights the contrast between the highly volatile nature of actual daily returns and the smoother predictions produced by the linear regression model. While the model captures small fluctuations around the mean, it fails to track extreme positive or negative return spikes.

This pattern is consistent with the quantitative results, which showed a marginal RMSE improvement over the baseline and a slightly higher MAE. Overall, the plot reinforces the conclusion that simple historical return-based features contain limited predictive power for next-day returns in a noisy financial time series.
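The visual impression of "smoother" predictions can be put into numbers by comparing the spread of the two series. As a small illustration, the sketch below uses only the five rows shown in the `plot_df.head()` output above. Five rows are far too few to draw conclusions from; applying the same calculation to the full `plot_df` would be the meaningful version.

```python
import statistics

# First five test rows as displayed in plot_df.head() above
actual = [0.012107, -0.00134, 0.003929, -0.001527, -0.009464]
predicted = [-0.000672, -0.001227, 0.000249, -0.002712, -0.00112]

# Sample standard deviation of each series
actual_std = statistics.stdev(actual)
pred_std = statistics.stdev(predicted)
std_ratio = pred_std / actual_std

# Share of days where the prediction has the same sign as the actual return
same_sign = sum(
    (a > 0) == (p > 0) for a, p in zip(actual, predicted)
) / len(actual)

print(f"std ratio (predicted / actual): {std_ratio:.2f}")
print(f"sign agreement: {same_sign:.0%}")
```

A ratio well below 1 means the predictions vary much less than the realized returns, which is what the chart shows as a nearly flat predicted line against spiky actual returns.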

## Limitations and Next Steps

This analysis focuses on simple, interpretable features derived from historical returns and uses a linear regression model to predict next-day returns. Given the high noise and low signal-to-noise ratio in daily financial data, strong predictive performance is not expected.

Future work could explore regularized regression, more expressive models such as tree-based methods, alternative targets such as return direction, or longer prediction horizons. Additionally, incorporating external information such as macroeconomic indicators or volatility indices may help capture dynamics not present in historical price data alone.
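As a starting point for the return-direction idea, the sketch below shows one possible way to turn a return series into a binary up/down target, together with the majority-class accuracy that any classifier would need to beat. The values are made up, and treating a zero return as "not up" is an arbitrary choice made for this illustration.

```python
import pandas as pd

# Made-up next-day returns, not VFV data
toy_target = pd.Series([0.004, -0.012, 0.020, 0.000, -0.003, 0.007, 0.011])

# 1 if the next-day return is strictly positive, else 0 (zero counts as "not up")
direction = (toy_target > 0).astype(int)

# Naive benchmark: always predict the more frequent class
up_share = direction.mean()
majority_accuracy = max(up_share, 1 - up_share)

print(direction.tolist())
print(f"share of up days: {up_share:.3f}")
print(f"majority-class accuracy: {majority_accuracy:.3f}")
```

The majority-class accuracy plays the same role for a direction target that the mean-return baseline plays for the regression target in this notebook.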