# Overview
This notebook runs a walk-forward backtest. The data is cleaned and the factors are normalized with a Min-Max scaler so that all features share a common [0, 1] range. The model is a Random Forest Regressor, chosen because averaging over many trees reduces (though does not eliminate) overfitting. Accuracy is measured with RMSE, the square root of the mean squared difference between predicted and actual values. RFECV is used to select the most impactful features, and the backtest advances in six-month steps with the start of the training window anchored, so each training period has a substantial amount of data. The aim is a dependable and interpretable forecasting model.

## 1. Reading and exploring sample_data.xlsx
Read the xlsx file and convert the DATE column to pandas datetime objects, which makes the dates easier to handle.

In [1]:
import pandas as pd
import numpy as np

# Reading data
data = pd.read_excel('sample_data.xlsx')

# Convert the DATE column to pandas datetime.
data['DATE'] = pd.to_datetime(data['DATE'], format='%Y%m%d')

# Exploring data
data.head()

Unnamed: 0,DATE,INDUSTRY,RET1M,FACTOR_A,FACTOR_B,FACTOR_C,FACTOR_D,FACTOR_E,FACTOR_F,FACTOR_G,...,FACTOR_K,FACTOR_L,FACTOR_M,FACTOR_N,FACTOR_O,FACTOR_P,FACTOR_Q,FACTOR_R,FACTOR_S,FACTOR_T
0,2009-12-31,Consumer Services,-6.772226,-10.997537,1.243996,3.206262,2.490379,6.002323,0.0,165.03511,...,20.725244,48.789809,-0.506947,-0.865906,101.283124,259.9141,8.99773,2.312226,-0.10355,12.653356
1,2009-12-31,Health Care,-1.203936,-11.335562,6.677766,6.64728,9.272592,25.475441,2.96351,5.02156,...,26.212564,16.894423,-0.168812,-0.34262,-32.618308,-82.703674,6.959125,0.375545,-0.003959,4.983358
2,2009-12-31,Utilities,-5.108941,-9.092325,7.201676,-1.721651,13.804003,51.570755,0.0,54.61165,...,13.964652,-0.572733,-0.394146,-1.555485,-344.163212,-398.10703,9.377763,-2.510556,-0.019275,2.674826
3,2009-12-31,Consumer Services,-9.497839,-8.027924,1.511073,6.260253,13.414311,56.804149,2.008608,76.116165,...,2.660146,10.793099,-0.430002,-1.210389,19.709256,93.98267,3.80794,-1.89034,-0.128025,7.674407
4,2009-12-31,Industrials,,-7.15456,-22.110048,9.091127,5.220209,48.536897,0.0,81.27489,...,-45.553073,18.730342,-0.488732,-1.620578,24.603775,259.78653,14.77162,4.311962,-0.288577,46.45668


## 2. Cleaning data and normalizing factors.
Rows with missing values were removed, which leaves over 12,000 data points; data quality was prioritized over quantity.
<br>Min-Max scaling was then applied to bring every factor into the range [0, 1], which makes the features directly comparable.

In [2]:
from sklearn.preprocessing import MinMaxScaler

# Dropping NaN values.
clean_data = data.dropna().copy()

# Array of all the factors for normalizing.
factor_columns = [col for col in clean_data.columns if col.startswith('FACTOR')]

# Using min-max normalization
scaler = MinMaxScaler()

scaler.fit(clean_data[factor_columns])
clean_data[factor_columns] = scaler.transform(clean_data[factor_columns])

# Collect all unique dates as 'YYYY-MM-DD' strings, in chronological order.
# They serve as window boundaries for training, validation, and testing.
unique_dates = sorted(clean_data['DATE'].dt.strftime('%Y-%m-%d').unique())
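In a walk-forward setting, fitting the scaler on the full sample lets the minimum and maximum of later periods influence the scaled training values. A minimal sketch of the alternative, fitting on the training rows only, is shown below. The helper `scale_window` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_window(train_df, test_df, factor_columns):
    """Fit MinMaxScaler on the training rows only, then apply it to both sets."""
    scaler = MinMaxScaler()
    train_scaled = train_df.copy()
    test_scaled = test_df.copy()
    train_scaled[factor_columns] = scaler.fit_transform(train_df[factor_columns])
    test_scaled[factor_columns] = scaler.transform(test_df[factor_columns])
    return train_scaled, test_scaled

# Toy data (made up for illustration): three training dates, one test date.
toy = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30']),
    'FACTOR_A': [0.0, 10.0, 5.0, 20.0],
})
train, test = toy.iloc[:3], toy.iloc[3:]
train_s, test_s = scale_window(train, test, ['FACTOR_A'])

print(train_s['FACTOR_A'].tolist())
print(test_s['FACTOR_A'].tolist())
```

With this approach, test values can fall outside [0, 1] when they exceed the range seen in training. That is expected, and it does not affect tree-based models such as Random Forest, which depend only on the ordering of feature values.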


## 3. Implementing functions for ML model and feature selection
ML model: a Random Forest Regressor was used because of its inherent mechanism for reducing overfitting through ensemble learning.
<br>Performance metric: RMSE (Root Mean Squared Error) was used to measure the model's performance. It is the square root of the mean squared difference between predicted and observed values, so lower values mean higher accuracy.
<br>Feature selection: RFECV was chosen because it is simple to implement and directly identifies the most important features.
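
For reference, the standard definition of RMSE over $n$ observations is given below, where $y_i$ is the realized RET1M and $\hat{y}_i$ the model's prediction (the symbols are notation introduced here, not taken from the notebook):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because errors are squared before averaging, large misses weigh more heavily than small ones, and the result is expressed in the same units as RET1M.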

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

def backtest_model(backtest_data, train_start, train_end, val_end, test_end):
    # Use the factor columns of the data passed in (not the global clean_data)
    backtest_features = [col for col in backtest_data.columns if col.startswith('FACTOR')]
    # Split the data
    train_data = backtest_data[(backtest_data['DATE'] >= train_start) & (backtest_data['DATE'] < train_end)]
    val_data = backtest_data[(backtest_data['DATE'] >= train_end) & (backtest_data['DATE'] < val_end)]
    test_data = backtest_data[(backtest_data['DATE'] >= val_end) & (backtest_data['DATE'] < test_end)]

    # Features and target
    X_train, y_train = train_data[backtest_features], train_data['RET1M']
    X_val, y_val = val_data[backtest_features], val_data['RET1M']
    X_test, y_test = test_data[backtest_features], test_data['RET1M']

    # Initialize the RandomForestRegressor
    model = RandomForestRegressor(n_estimators=100, random_state=42)

    # Train the model
    model.fit(X_train, y_train)

    # Validate the model
    val_predictions = model.predict(X_val)
    val_mse = mean_squared_error(y_val, val_predictions)
    val_rmse = np.sqrt(val_mse)

    # Test the model
    test_predictions = model.predict(X_test)
    test_mse = mean_squared_error(y_test, test_predictions)
    test_rmse = np.sqrt(test_mse)

    # Return the performance
    return {
        'val_rmse': val_rmse,
        'test_rmse': test_rmse
    }

def feature_selection(selection_data):
    selection_features = [col for col in selection_data.columns if col.startswith('FACTOR')]
    X_train, y_train = selection_data[selection_features], selection_data['RET1M']
    
    # Initialize RFECV with a time series split for cross-validation
    rfecv = RFECV(
        estimator=RandomForestRegressor(n_estimators=100, random_state=42),
        step=1,
        cv=TimeSeriesSplit(n_splits=5),
        scoring='neg_mean_squared_error',
    )
    rfecv.fit(X_train, y_train)

    # Names of the factors RFECV kept
    selected_feature_names = [col for col, keep in zip(selection_features, rfecv.support_) if keep]

    # Factor columns to be dropped
    columns_to_drop = [col for col in selection_features if col not in selected_feature_names]

    return columns_to_drop
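
To illustrate how RFECV with `TimeSeriesSplit` behaves, here is a small self-contained sketch on synthetic data. The data, the column names, and the reduced `n_estimators` are assumptions chosen to keep the example fast; only FACTOR_A carries signal here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data (illustration only): the target depends on FACTOR_A alone.
rng = np.random.default_rng(0)
n_rows = 120
X = pd.DataFrame(
    rng.normal(size=(n_rows, 4)),
    columns=['FACTOR_A', 'FACTOR_B', 'FACTOR_C', 'FACTOR_D'],
)
y = 3.0 * X['FACTOR_A'] + rng.normal(scale=0.1, size=n_rows)

rfecv = RFECV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    step=1,                          # remove one feature per elimination round
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on earlier rows, scores on later rows
    scoring='neg_mean_squared_error',
)
rfecv.fit(X, y)

# support_ is a boolean mask aligned with the columns of X.
selected = X.columns[rfecv.support_].tolist()
dropped = X.columns[~rfecv.support_].tolist()
print('selected:', selected)
print('dropped:', dropped)
```

`TimeSeriesSplit` splits by row position, so it only respects time order if the rows are sorted chronologically before fitting.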


## 3.1 Deciding most important features
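I ran feature selection on an expanding window anchored at 2009-12-31 and extended until the end of 2011 in 3-month steps. The loop below runs for i = 0, 3, ..., 24, which is 9 iterations. Each iteration drops the features that RFECV rejects, so the feature set shrinks cumulatively.
After that, 6 of the original 21 features were left.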

In [7]:
train_anchor = '2009-12-31'
i = 0
while i < 25:
    feature_data = clean_data[(clean_data['DATE'] >= train_anchor) & (clean_data['DATE'] < unique_dates[i + 3])]
    not_features = feature_selection(feature_data)
    
    # Drops features that are not important.
    clean_data = clean_data.drop(columns=not_features)
    i += 3
clean_data.head()

Unnamed: 0,DATE,INDUSTRY,RET1M,FACTOR_A,FACTOR_D,FACTOR_G,FACTOR_H,FACTOR_J,FACTOR_S
0,2009-12-31,Consumer Services,-6.772226,0.429835,0.017751,0.513261,0.352503,0.700106,0.999617
1,2009-12-31,Health Care,-1.203936,0.392419,0.074026,0.185097,0.247183,0.70001,0.999985
2,2009-12-31,Utilities,-5.108941,0.640721,0.111625,0.286799,0.331705,0.696723,0.999929
3,2009-12-31,Consumer Services,-9.497839,0.758539,0.108391,0.330902,0.415522,0.691109,0.999526
5,2009-12-31,Technology,-6.501144,0.313488,0.095293,0.282849,0.319724,0.718014,0.999965
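
The iteration count of the selection loop can be checked without running any model. The sketch below mirrors the `while i < 25` loop with `i += 3`; the helper name `selection_window_ends` is hypothetical.

```python
def selection_window_ends(start=0, stop=25, step=3, lookahead=3):
    """Return the unique_dates index used as the exclusive end of each selection window."""
    return [i + lookahead for i in range(start, stop, step)]

ends = selection_window_ends()
print(len(ends))  # number of loop iterations
print(ends)       # exclusive end index of each expanding window
```

Each window starts at the anchor date and ends just before `unique_dates[i + 3]`, so successive windows grow by three dates.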


## 4. Running the backtest model.
The backtest was run from 2009-12-31 until 2022-07-29. The start of the training set was anchored to 2009-12-31, and validation and testing started in 2012. The step size was 6 months: each step validates on the 3 dates after the training window and tests on the 3 dates after that. The performance scores (RMSE) were collected in the `performance` list.

In [8]:
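performance = []
end = len(unique_dates) - 7
j = 25

# The data slice is the same for every window, so build it once.
backtest_data = clean_data[(clean_data['DATE'] >= train_anchor) & (clean_data['DATE'] < unique_dates[end + 6])]

while j < end:
    # Train on [train_anchor, j), validate on [j, j+3), test on [j+3, j+6)
    result = backtest_model(backtest_data, train_anchor,
                            unique_dates[j], unique_dates[j + 3], unique_dates[j + 6])
    performance.append(result)
    j += 6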

performance

[{'val_rmse': 7.874969295516336, 'test_rmse': 10.608215523121155},
 {'val_rmse': 7.298464236908811, 'test_rmse': 7.424073511749794},
 {'val_rmse': 7.0714387914427075, 'test_rmse': 7.766526688566164},
 {'val_rmse': 7.178929469763768, 'test_rmse': 6.706772243725366},
 {'val_rmse': 6.791496312714874, 'test_rmse': 6.306730173675444},
 {'val_rmse': 7.443429417045315, 'test_rmse': 7.32828104307904},
 {'val_rmse': 7.875390690945065, 'test_rmse': 8.807820900811885},
 {'val_rmse': 10.672654753540693, 'test_rmse': 10.56610290220458},
 {'val_rmse': 10.523234212674152, 'test_rmse': 7.26135748030953},
 {'val_rmse': 7.374649786922917, 'test_rmse': 7.37129828504117},
 {'val_rmse': 6.561039359812239, 'test_rmse': 7.705027062518132},
 {'val_rmse': 7.2299376514577895, 'test_rmse': 7.985070330561301},
 {'val_rmse': 7.634322660782542, 'test_rmse': 6.867066321367985},
 {'val_rmse': 9.997507845961932, 'test_rmse': 11.67419720136889},
 {'val_rmse': 7.633872579795264, 'test_rmse': 10.341469094970355},
 ... (remaining output truncated)
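
The window boundaries used by the backtest loop can be sketched on their own, as indices into `unique_dates`. The helper `walk_forward_windows` and the value `n_dates=60` are hypothetical; the defaults mirror the constants in the loop (`j = 25`, step 6, `end = len(unique_dates) - 7`).

```python
def walk_forward_windows(n_dates, first_train_end=25, val_len=3, test_len=3, step=6):
    """Return (train_end, val_end, test_end) index triples for an anchored walk-forward run.

    Training covers [anchor, train_end), validation [train_end, val_end),
    and testing [val_end, test_end), all as indices into the date list.
    """
    end = n_dates - 7
    windows = []
    j = first_train_end
    while j < end:
        windows.append((j, j + val_len, j + val_len + test_len))
        j += step
    return windows

# n_dates=60 is an arbitrary illustrative value, not the notebook's actual date count.
windows = walk_forward_windows(n_dates=60)
for train_end, val_end, test_end in windows:
    print(train_end, val_end, test_end)
```

Because the validation and test lengths add up to the step size, the validation period of each window begins where the test period of the previous window ended, so the windows tile the timeline without gaps.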

## 5. Calculating performance
Performance was summarized by taking the mean of the validation and test RMSE across all windows, which gives a single measure of how the model performed over the whole backtest.

In [9]:
# Calculate mean of val_rmse
mean_val_rmse = sum(d['val_rmse'] for d in performance) / len(performance)

# Calculate mean of test_rmse
mean_test_rmse = sum(d['test_rmse'] for d in performance) / len(performance)

print(f"Mean of val_rmse: {mean_val_rmse}")
print(f"Mean of test_rmse: {mean_test_rmse}")

Mean of val_rmse: 9.0060815456057
Mean of test_rmse: 8.734702416783316
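
A mean RMSE is easier to interpret next to a naive baseline and a measure of spread across windows. The sketch below uses made-up toy numbers, not the notebook's results. The `rmse` helper, the mean-of-training-target baseline, and `performance_toy` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Naive baseline: always predict the mean of the training target (toy values).
y_train = [1.0, -2.0, 4.0]
y_test = [0.0, 2.0]
baseline_pred = np.full(len(y_test), np.mean(y_train))
baseline_rmse = rmse(y_test, baseline_pred)
print('baseline RMSE:', baseline_rmse)

# Summary of per-window results in the same shape as the performance list (toy values).
performance_toy = [
    {'val_rmse': 7.0, 'test_rmse': 8.0},
    {'val_rmse': 9.0, 'test_rmse': 6.0},
]
summary = pd.DataFrame(performance_toy).agg(['mean', 'std', 'min', 'max'])
print(summary)
```

If the model's RMSE is not clearly below the baseline's in most windows, the selected factors add little predictive value beyond the average return.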


## 6. Analyzing performance
The mean validation RMSE is approximately 9.01 and the mean test RMSE approximately 8.73. The two are close, which suggests that the model behaves consistently on validation and test data, a positive sign for its generalization. RMSE is expressed in the units of RET1M, so these figures are best judged against the spread of RET1M or a naive baseline; taken alone, they indicate moderate predictive performance. There is room for improvement, for example by tuning the model, further feature engineering, or addressing any remaining overfitting.