# First Assignment - FINTECH 540 - Machine Learning for FinTech

In this assignment, you will gain hands-on experience applying linear models to financial market data. Specifically, you will work with time series prices of the 30 constituents of the *Dow Jones Industrial Average (DJIA)* Index. The dataset covers the period from June $2^{nd}$, 2017, through June $2^{nd}$, 2023. The price series of the ETF that tracks the DJIA index, whose symbol is *DIA*, is also provided. The dataset is uploaded on Sakai in the same place where you found this notebook.

You will work through three consecutive tasks; in general, each task can only be completed once the previous one has been solved. You can obtain at most 100 points for this home assignment. The tasks are briefly summarized below, and you can find the relevant prompt in each subsection of this notebook:
- Build descriptive linear models (CAPM) for all the index constituents (*20 points*).
- Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*).
- Repeat the linear modeling exercise using bootstrapped returns (*40 points*).

## About this notebook

You only need to write the final code between the `### START CODE HERE ###` and `### END CODE HERE ###` comments. You can create more cells to experiment with and prepare your final code at your convenience. Remember to put the final version of the code where it is asked. Before submitting, remember to run your notebook fully from start to end to ensure that there are no runtime errors. Failing to follow these guidelines will result in a point deduction.

## Task 1 - Build descriptive linear models (CAPM) for all the index constituents (*20 points*)

The Capital Asset Pricing Model (CAPM) is represented as:

$$R_i - R_f =   \beta_i (R_m - R_f) + e_i$$

Where:
- $R_i$ is the return of the asset or security $i$.
- $R_f$ is the risk-free rate, representing the return on a risk-free investment.
- $\beta_i$ is the beta of the asset $i$, which measures its sensitivity to market movements.
- $R_m$ is the market portfolio's return (the index).
- $e_i$ is the error term or residual representing unexplained variation in the asset's return.

The CAPM equation helps estimate the return of an asset based on its risk relative to the market and the risk-free rate. You can calculate the daily risk-free rate by using the following formula.

$$ r_{\text{daily}} = \left(1 + r_{\text{annual}}\right)^{\frac{1}{365}} - 1 $$

Where:
- $r_{\text{daily}}$ is the daily yield. It represents the expected daily return on investment.
- $ r_{\text{annual}} $ is the annual yield. It represents the expected annual return on investment.
- The formula assumes daily compounding, meaning the investment's return is compounded daily over a year (365 days). Converting the annual yield to a daily one allows the modeling to be based on daily returns.

For this task, you can use an annual yield of *5.482%* per the annualized U.S. 3-month Treasury Bill yield.
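The conversion can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative. Note the units: if returns are expressed in percent, the daily yield must be scaled by 100 as well before computing excess returns.

```python
# Convert an annual yield into the equivalent daily yield (365-day compounding).
annual_rf = 0.05482  # 5.482% annual yield, as stated in the prompt

daily_rf = (1 + annual_rf) ** (1 / 365) - 1

# Sanity check: compounding the daily yield over 365 days recovers the annual yield.
recovered_annual = (1 + daily_rf) ** 365 - 1

print(f"daily yield (decimal): {daily_rf:.8f}")
print(f"daily yield (percent): {daily_rf * 100:.6f}")
print(f"recovered annual yield: {recovered_annual:.5f}")
```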

To solve this part of the homework, you have to:
- Compute the daily yield from the annualized yield provided in the prompt.
- Prepare the data to fit the CAPM for each company in the DJIA index described above.
- Fit the CAPM for each company and check the estimated sensitivity to market movements.
- Select the subset of stocks whose estimated sensitivity to market movements ($\beta_i$) lies between 0.85 and 1.15. Before including a symbol, ensure the estimated sensitivity is statistically significant. Store the symbols in a Python list before moving to the next task.
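The fitting and selection steps in the list above can be sketched on synthetic data. Everything here is hypothetical: the tickers `AAA`, `BBB`, `CCC`, their true betas, and the helper `capm_beta` are made up for illustration. The significance check uses the large-sample rule $|t| > 1.96$ as an approximation to a 5% two-sided test; a library such as `statsmodels` reports exact p-values instead.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for this illustration

n_days = 1000
daily_rf = (1 + 0.05482) ** (1 / 365) - 1  # decimal units, like the returns below

# Synthetic daily returns in decimal units (not real market data).
market = rng.normal(0.0004, 0.01, n_days)
true_betas = {"AAA": 0.6, "BBB": 1.0, "CCC": 1.5}  # made-up tickers and betas
stocks = {
    name: daily_rf + b * (market - daily_rf) + rng.normal(0, 0.01, n_days)
    for name, b in true_betas.items()
}


def capm_beta(stock_ret, market_ret, rf):
    """No-intercept OLS of excess stock return on excess market return."""
    y = stock_ret - rf
    x = market_ret - rf
    beta = (x @ y) / (x @ x)
    resid = y - beta * x
    dof = len(y) - 1  # one estimated parameter
    se = np.sqrt((resid @ resid) / dof / (x @ x))
    return beta, beta / se


selected = []
for name, ret in stocks.items():
    beta, t_stat = capm_beta(ret, market, daily_rf)
    significant = abs(t_stat) > 1.96  # approximate 5% two-sided test
    if significant and 0.85 < beta < 1.15:
        selected.append(name)
    print(f"{name}: beta={beta:.3f}, t={t_stat:.1f}")

print(selected)
```

Only the stock whose beta is both significant and inside the band ends up in `selected`, which mirrors the list you are asked to store.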

Before performing the CAPM modeling, remember to split the dataset into a training set and a test set and use only the training set to perform Task 1. Use *2022-01-01* as a cutoff date. Ensure the cutoff date is included in the test set and not in the train set.
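A minimal sketch of the split, using a made-up five-row price table (the dates and prices are hypothetical). Parsing the `Date` column into timestamps makes the comparison robust; comparing ISO-formatted date strings directly also works because they sort chronologically.

```python
import pandas as pd

# Hypothetical mini-dataset; the real file has one price column per symbol.
prices = pd.DataFrame(
    {
        "Date": ["2021-12-30", "2021-12-31", "2022-01-01", "2022-01-03", "2022-01-04"],
        "DIA": [363.0, 362.5, 362.5, 365.7, 367.8],
    }
)
prices["Date"] = pd.to_datetime(prices["Date"])

cutoff = pd.Timestamp("2022-01-01")
train = prices[prices["Date"] < cutoff]   # strictly before the cutoff
test = prices[prices["Date"] >= cutoff]   # the cutoff date itself goes to the test set

print(train["Date"].max(), test["Date"].min())
```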

**Motivation behind the task**

Fitting individual CAPM models allows for a detailed assessment of each stock's risk profile. CAPM provides a systematic way to quantify the sensitivity of each stock's returns to market movements, as measured by the beta coefficient. This individual assessment is valuable because different stocks may exhibit varying levels of market sensitivity.

Selecting stocks based on their beta values is usually a risk-based approach to portfolio construction. By choosing stocks with higher (lower) beta values, you are essentially selecting those that tend to exhibit greater (lower) price volatility in response to market fluctuations. This can be seen as a deliberate strategy to include riskier (safer) assets in the portfolio.

This task will set the basis for selecting a subset of index constituents to be used for a predictive model. 

**Grading Criteria**

- **Data Preparation (10 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **CAPM Model Fitting (10 points)**: Points will be awarded based on the correctness and completeness of the CAPM models, including accurate significance evaluation and the subset of stock selection based on the beta estimations.

In [1]:
### START CODE HERE ###

import numpy as np
import pandas as pd
import statsmodels.api as sm


#0.read and get to know my data
data = pd.read_csv("dows_daily.csv")
"""
pd.set_option('display.max_columns', None)

print("----------shape----------------")
print(data.shape)
data.head(5)
data.tail(5)

#1.Data Preparation
#1.0 check

print("----------type check ----------------")
print(data.dtypes) 

print("----------null value check-----------")
print("null value detected:\n",format(data.isnull().sum())) 

print("----------duplicated value check ------------")

print("duplicated value detected:{}".format(data.duplicated().sum())) 
"""
#1.1 prepare return data
rtn_data = pd.DataFrame()
for col in data.columns:
    if col == "Date":
        rtn_data["Date"] = data[col]
    else:
        rtn_data["{}_rtn".format(col)] = data[col].pct_change(1) * 100        
rtn_data.dropna(inplace = True) # drop the first row as we can't find rtn for the first date.


#2. Data set split
cut_off_date="2022-01-01" #Use 2022-01-01 as a cutoff date.
train_data = rtn_data[rtn_data["Date"]<cut_off_date]
test_data = rtn_data[rtn_data["Date"]>=cut_off_date]
#print("train data shape: {}, test data shape: {} ".format(train_data.shape,test_data.shape))


#3. fit the model
assets = []
for col in rtn_data.columns:
    if col !="Date" and col != "DIA_rtn":
        assets.append(col)
        
daily_rf = ((1+0.05482)**(1/365) - 1) * 100  #annual_rf = 5.482% per the annualized U.S. 3-month Treasury Bill yield; scaled by 100 because the returns above are in percent
safe_assets =[]
for asset in assets:
    Y = train_data[asset] - daily_rf
    X = train_data["DIA_rtn"] - daily_rf
    model = sm.OLS(Y,X)
    result = model.fit() 
    p_value = result.pvalues.iloc[0]  # positional access: pvalues and params are labelled Series
    beta = result.params.iloc[0]
    if p_value < 0.05 and 0.85 < beta < 1.15:
        safe_assets.append(asset[0:-4])

print(safe_assets)


### END CODE HERE ###

['NKE.N', 'CSCO.OQ', 'DIS.N', 'INTC.OQ', 'HD.N', 'UNH.N', 'MSFT.OQ', 'HON.OQ', 'CRM.N', 'IBM.N', 'MMM.N', 'AAPL.OQ', 'CAT.N', 'V.N', 'TRV.N']


In [2]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | BA.N_rtn | R-squared (uncentered): | 0.512 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.511 |
| Method: | Least Squares | F-statistic: | 1208.0 |
| Date: | Tue, 03 Oct 2023 | Prob (F-statistic): | 1.10e-181 |
| Time: | 19:11:50 | Log-Likelihood: | -2523.7 |
| No. Observations: | 1154 | AIC: | 5049.0 |
| Df Residuals: | 1153 | BIC: | 5055.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| DIA_rtn | 1.6800 | 0.048 | 34.761 | 0.000 | 1.585 | 1.775 |

| | | | |
|---|---|---|---|
| Omnibus: | 394.036 | Durbin-Watson: | 1.896 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9055.425 |
| Skew: | 1.023 | Prob(JB): | 0.0 |
| Kurtosis: | 16.57 | Cond. No. | 1.0 |


## Task 2 - Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*)

In this task, you will apply linear predictive modeling techniques to forecast the value of the DIA ETF on the DJIA index using the subset of its constituents you selected in the previous task. The goal is to build a predictive linear model that accurately estimates the future index return based on the historical data of selected constituent stocks. Note that to perform this predictive task, you have to prepare the data accordingly. Don't use the excess returns with respect to a daily risk-free rate for this task, but use the plain returns instead.

The predictive linear regression equation to estimate the dependent variable $Y$ at time $t+1$ is represented as:

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

In this equation:

- $Y_{t+1}$ represents the dependent variable at time $t+1$ that we want to predict. Note that the dependent variable is real-valued.
- $\beta_0$ is the intercept or constant term.
- $\beta_1, \beta_2, \ldots, \beta_k$ are the $k$ coefficients for the independent variables $X_{1,t}, X_{2,t}, \ldots, X_{k,t}$ at time $t$. You can assume $k$ to be the number of stocks selected in the previous task. Note that the regressors are real-valued.
- $\varepsilon_{t}$ represents the error term at time $t$, capturing unexplained variation or noise in the dependent variable at that specific time.
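The time alignment implied by the equation (regressors at $t$, target at $t+1$) can be sketched with `shift(-1)` in pandas. The return values and the column names `AAA` and `BBB` below are made up for illustration.

```python
import pandas as pd

# Made-up daily returns (in percent) for the index ETF and two regressors.
returns = pd.DataFrame(
    {
        "DIA": [0.10, -0.20, 0.30, 0.05, -0.15],
        "AAA": [0.20, -0.10, 0.25, 0.00, -0.30],
        "BBB": [0.05, -0.30, 0.40, 0.10, -0.05],
    }
)

X = returns[["AAA", "BBB"]]        # regressors observed at time t
y_next = returns["DIA"].shift(-1)  # target: index return at time t+1

# The last row has no next-day target, so it is dropped from both sides.
aligned = pd.concat([X, y_next.rename("DIA_next")], axis=1).dropna()
X_aligned = aligned[["AAA", "BBB"]]
y_aligned = aligned["DIA_next"]

print(aligned)
```

Building both sides from one concatenated frame keeps the row labels of the regressors and the target identical, which avoids silent misalignment when fitting.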

Before performing the linear regression modeling, remember to split the dataset into a training set and a test set. Use *2022-01-01* as a cutoff date, the same way you did in the previous task. Make sure the cutoff date will be included in the test set and not in the train set.

Assess the performance of your predictive model using an appropriate evaluation metric for a regression problem like this one. Evaluate the model on the test set to ensure its predictive accuracy out-of-sample.
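A few common regression metrics can be computed directly with NumPy. The numbers below are made up, and the out-of-sample $R^2$ shown here, which compares the model against a constant forecast equal to the training-set mean, is one possible benchmark choice rather than a requirement of the assignment.

```python
import numpy as np

y_true = np.array([0.6, -1.0, -0.5, 0.0, -0.4])   # made-up realised returns (%)
y_pred = np.array([0.1, -0.2, 0.1, 0.05, -0.1])   # made-up model forecasts (%)
train_mean = 0.05                                  # made-up benchmark forecast (%)

mse = np.mean((y_true - y_pred) ** 2)       # mean squared error
rmse = np.sqrt(mse)                         # same units as the returns
mae = np.mean(np.abs(y_true - y_pred))      # mean absolute error

# Out-of-sample R^2: improvement over always predicting the training mean.
mse_bench = np.mean((y_true - train_mean) ** 2)
r2_oos = 1 - mse / mse_bench

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  OOS R^2={r2_oos:.4f}")
```

A positive out-of-sample $R^2$ means the model beats the constant benchmark on the test set; a negative one means it does worse.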

**Grading Criteria**

- **Data Preparation (15 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (20 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' returns and the index return.

- **Model Evaluation (5 points)**: Points will be awarded based on the proper choice of evaluation metric.

In [12]:
### START CODE HERE ###
import numpy as np
import pandas as pd
import statsmodels.api as sm


#0.data preparation
data = pd.read_csv("dows_daily.csv")

#0.1 prepare return data
rtn_data = pd.DataFrame()
for col in data.columns:
    if col == "Date":
        rtn_data["Date"] = data[col]
    else:
        rtn_data["{}_rtn".format(col)] = data[col].pct_change(1) * 100        
rtn_data.dropna(inplace = True) # drop the first row as we can't find rtn for the first date.

#0.2 Dataset split
cut_off_date="2022-01-01" #Use 2022-01-01 as a cutoff date.
train_data = rtn_data[rtn_data["Date"]<cut_off_date]
test_data = rtn_data[rtn_data["Date"]>=cut_off_date]

#0.3 derive X and Y
Y = train_data["DIA_rtn"].shift(-1).dropna()
X = train_data.loc[:, [asset + "_rtn" for asset in safe_assets]].iloc[0:-1,:]
X = sm.add_constant(X)


#1.fit the model
model = sm.OLS(Y,X)
result = model.fit()
result.summary()



#2.test 
test_X = test_data.loc[:, [asset + "_rtn" for asset in safe_assets]].iloc[0:-1,:]
test_X = sm.add_constant(test_X)

test_Y_hat = result.predict(test_X)
test_Y = test_data["DIA_rtn"].shift(-1).dropna()
test_mse = ((test_Y - test_Y_hat)**2).mean()

print("MSE for the test set:{}".format(test_mse))
### END CODE HERE ###


MSE for the test set:1.4553804904975767


In [4]:
Y

1      -0.231285
2       0.165586
3       0.066125
4       0.382328
5      -0.126957
          ...   
1149    0.990650
1150    0.294831
1151    0.233523
1152   -0.211051
1153   -0.206004
Name: DIA_rtn, Length: 1153, dtype: float64

In [5]:
X

,const,NKE.N_rtn,CSCO.OQ_rtn,DIS.N_rtn,INTC.OQ_rtn,HD.N_rtn,UNH.N_rtn,MSFT.OQ_rtn,HON.OQ_rtn,CRM.N_rtn,IBM.N_rtn,MMM.N_rtn,AAPL.OQ_rtn,CAT.N_rtn,V.N_rtn,TRV.N_rtn
1,1.0,0.056625,-0.687930,-0.615787,0.055066,-0.360476,-0.011149,0.724638,-0.402895,0.572309,0.236764,-0.232221,-0.977806,-0.707881,0.416017,0.183779
2,1.0,-0.999811,-0.629723,-0.957567,-0.577876,-0.381162,0.808385,0.332042,-0.561840,-0.306413,-0.026245,-0.392784,0.337816,-0.617871,-0.787157,-1.076727
3,1.0,1.429116,0.158428,0.398104,0.359812,0.492866,1.321756,-0.179261,-0.015067,0.318332,-0.912253,-0.194732,0.595662,-0.994739,0.313185,-0.427316
4,1.0,-0.056359,0.000000,-1.510574,0.606729,-0.942179,-0.829649,-0.607819,0.625377,-0.153190,0.741820,0.453636,-0.244577,1.449135,0.000000,0.226721
5,1.0,0.488722,-0.759253,1.246166,-2.110746,-0.351792,-0.126589,-2.265462,0.179708,-4.361644,1.314924,0.480723,-3.877670,0.866584,-1.592257,1.171433
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1149,1.0,0.139023,1.218918,1.152225,0.667059,0.361440,0.252970,0.447179,1.674594,0.134494,0.678227,1.349629,0.364382,2.003463,-0.614792,-0.025858
1150,1.0,1.152894,1.830443,-0.540259,1.227831,1.767950,0.831685,2.318563,0.589611,2.038398,0.757866,0.988741,2.297481,0.087294,0.466254,0.898804
1151,1.0,-0.692207,0.173447,1.570681,-0.346554,0.527110,0.688689,-0.350416,0.300344,-1.103368,0.767361,0.531975,-0.576720,0.116290,0.192988,0.692130
1152,1.0,1.418099,0.676846,-0.212629,0.135240,1.137315,0.524914,0.205128,0.231828,-0.356234,0.542864,0.433461,0.050198,0.343626,0.055033,0.241853


In [6]:
test_Y

1155    0.598884
1156   -1.027537
1157   -0.466918
1158   -0.022076
1159   -0.419530
          ...   
1505    0.942792
1506   -0.096723
1507   -0.302554
1508    0.430930
1509    2.154469
Name: DIA_rtn, Length: 355, dtype: float64

In [7]:
test_X

,const,NKE.N_rtn,CSCO.OQ_rtn,DIS.N_rtn,INTC.OQ_rtn,HD.N_rtn,UNH.N_rtn,MSFT.OQ_rtn,HON.OQ_rtn,CRM.N_rtn,IBM.N_rtn,MMM.N_rtn,AAPL.OQ_rtn,CAT.N_rtn,V.N_rtn,TRV.N_rtn
1155,1.0,-1.199976,-0.331387,1.207308,3.320388,-1.534903,0.027881,-0.466817,-0.820105,0.523354,1.780637,0.061926,2.500422,0.125762,2.178026,-0.434699
1156,1.0,1.044513,-3.024066,-0.657055,-0.131554,1.027800,-2.265669,-1.714712,1.063830,-2.830189,1.455454,1.400923,-1.269161,5.352657,0.465158,2.086677
1157,1.0,-2.488130,-1.583673,-0.346754,1.373730,-1.356458,-0.246486,-3.838789,0.985646,-8.282641,0.144907,-0.410586,-2.659989,0.765774,-1.105817,0.484277
1158,1.0,-0.745763,1.061712,1.101875,0.259885,-0.363422,-4.092385,-0.790189,-0.113712,0.650064,-2.083635,-0.830130,-1.669335,1.019340,-0.113636,1.602303
1159,1.0,-2.527322,0.344714,0.592734,-1.055360,-2.994381,-2.352816,0.050975,2.338488,-0.366572,-0.376829,1.095506,0.098837,0.991036,-1.269625,2.408674
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1505,1.0,-0.747992,1.280727,-1.044123,-5.517241,-1.494332,-0.650958,3.845786,0.213697,0.406582,0.859325,0.061862,0.669227,0.133524,0.508436,-1.318257
1506,1.0,0.027912,1.692841,0.170184,5.839416,2.120314,0.799665,2.138562,0.769751,2.634462,1.680341,-0.113344,1.410486,0.866749,0.729698,-0.794610
1507,1.0,-0.920845,0.621741,-0.532337,3.413793,-0.146843,-0.346818,-0.504671,0.412903,1.592091,0.457755,-0.907778,1.065952,-0.897073,-1.497711,0.580417
1508,1.0,-1.182876,-0.996612,0.159417,4.834945,-3.060876,1.540065,-0.851424,-1.516320,2.060584,-0.687365,-2.862794,-0.028201,-1.977132,-0.275221,-2.337123


In [8]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | DIA_rtn | R-squared: | 0.073 |
| Model: | OLS | Adj. R-squared: | 0.061 |
| Method: | Least Squares | F-statistic: | 5.964 |
| Date: | Tue, 03 Oct 2023 | Prob (F-statistic): | 4.24e-12 |
| Time: | 19:11:50 | Log-Likelihood: | -1906.2 |
| No. Observations: | 1153 | AIC: | 3844.0 |
| Df Residuals: | 1137 | BIC: | 3925.0 |
| Df Model: | 15 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0717 | 0.038 | 1.897 | 0.058 | -0.002 | 0.146 |
| NKE.N_rtn | 0.0185 | 0.026 | 0.708 | 0.479 | -0.033 | 0.070 |
| CSCO.OQ_rtn | -0.0150 | 0.034 | -0.442 | 0.659 | -0.082 | 0.052 |
| DIS.N_rtn | -0.0175 | 0.027 | -0.645 | 0.519 | -0.071 | 0.036 |
| INTC.OQ_rtn | -0.0246 | 0.023 | -1.075 | 0.283 | -0.070 | 0.020 |
| HD.N_rtn | 0.0878 | 0.034 | 2.610 | 0.009 | 0.022 | 0.154 |
| UNH.N_rtn | -0.0428 | 0.028 | -1.551 | 0.121 | -0.097 | 0.011 |
| MSFT.OQ_rtn | -0.1007 | 0.042 | -2.376 | 0.018 | -0.184 | -0.018 |
| HON.OQ_rtn | 0.0134 | 0.043 | 0.314 | 0.753 | -0.070 | 0.097 |

| | | | |
|---|---|---|---|
| Omnibus: | 390.064 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 13100.177 |
| Skew: | -0.892 | Prob(JB): | 0.0 |
| Kurtosis: | 19.416 | Cond. No. | 6.83 |


## Task 3 - Augment the Dataset with Bootstrapped Alphas and Fit the Linear Predictive Models Again (*40 points*)

In this task, we explore the concept of bootstrapped alphas and their role in predictive modeling. Bootstrapped alphas are used as proxy trading signals for real alphas that can be practically obtained. These signals are correlated with future returns and can play the role of good predictors in the predictive modeling process. Don't use the excess returns with respect to a daily risk-free rate for this task; use the plain returns instead when you calculate the bootstrapped alphas.

We define bootstrapped alphas $\alpha_{i,t}$ as per the formula below:

$$\alpha_{i,t} := \rho_{\text{boot}} r_{i,t+1} + \sqrt{1 - \rho_{\text{boot}}^{2}} z_{i,t}$$

where:
- $r_{i,t+1}$ represents the next period return of the traded security $i$, which is given to you.
- $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ is a randomly drawn scalar associated with each company $i$, which is not given and which you have to sample. When sampling, ensure that the sampled vectors are independent of one another, since you have to draw samples for each company you will use as a regressor. The set of companies is the same one you used in the previous task, selected by fitting the CAPM in Task 1.
- $\sigma^{2}_{i}$ is an estimate of the true conditional variance of the security $i$, which you have to calculate based on the given returns. Note that you have to calculate those variances on the train set only. Use the same cutoff applied in the previous task to define what the training set is.
- $\rho_{\text{boot}} \in [-1,1]$ is a correlation coefficient, which you have to set equal to 0.25.

In this setting, the parameter $\rho_{\text{boot}}$ artificially regulates the strength of the trading signal you create. Note that regressing the bootstrapped alpha $\alpha_{i,t}$ on the future return $r_{i,t+1}$ results in an $R^2$ equal to $\rho_{\text{boot}}^2$.
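The $R^2$ statement can be verified in a few lines, under two assumptions that are not spelled out in the prompt: the noise $z_{i,t}$ is independent of $r_{i,t+1}$, and the variance of $r_{i,t+1}$ equals the $\sigma^{2}_{i}$ used to scale the noise. Under these assumptions:

$$\operatorname{Cov}(\alpha_{i,t}, r_{i,t+1}) = \rho_{\text{boot}} \operatorname{Var}(r_{i,t+1}) = \rho_{\text{boot}}\, \sigma^{2}_{i}$$

$$\operatorname{Var}(\alpha_{i,t}) = \rho_{\text{boot}}^{2}\, \sigma^{2}_{i} + \left(1 - \rho_{\text{boot}}^{2}\right) \sigma^{2}_{i} = \sigma^{2}_{i}$$

$$\operatorname{Corr}(\alpha_{i,t}, r_{i,t+1}) = \frac{\rho_{\text{boot}}\, \sigma^{2}_{i}}{\sigma_{i} \cdot \sigma_{i}} = \rho_{\text{boot}}$$

In a simple linear regression with one regressor, $R^2$ equals the squared correlation between the two variables, so $R^2 = \rho_{\text{boot}}^{2}$. With $\rho_{\text{boot}} = 0.25$ this gives $R^2 = 0.0625$. In a finite sample, where $\sigma^{2}_{i}$ is estimated on the train set only, the equality holds only approximately.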

The equation above formalizes the calculation of the bootstrapped alpha for a single security, whereas you will have more than one. Try to make your calculations as efficient as possible by computing them simultaneously, which is possible using operations between pandas DataFrames. Remember that $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ can be calculated as $z_{i,t} = \sqrt{\sigma^{2}_{i}}u_{i,t}$ where $u_{i,t} \sim \mathcal{N}(0,1)$.
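A minimal sketch of the vectorized calculation on synthetic returns follows. The tickers, volatilities, sample sizes, and train/test boundary are made up; only $\rho_{\text{boot}} = 0.25$ and the seed of 42 come from the prompt. Multiplying a DataFrame by a Series aligns the Series labels with the DataFrame columns, so every symbol is scaled by its own standard deviation in one step.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed required by the assignment
rho_boot = 0.25

# Synthetic daily returns for three made-up symbols with different volatilities.
n_days, symbols = 20_000, ["AAA", "BBB", "CCC"]
vols = np.array([1.0, 1.5, 2.0])
returns = pd.DataFrame(np.random.normal(0, 1, (n_days, 3)) * vols, columns=symbols)

n_train = 15_000  # hypothetical train/test boundary
sigma = returns.iloc[:n_train].std()  # per-symbol std dev, train set only

next_returns = returns.shift(-1)  # r_{i,t+1}, aligned with row t
u = pd.DataFrame(
    np.random.normal(0, 1, returns.shape),
    index=returns.index,
    columns=returns.columns,
)  # independent u_{i,t} ~ N(0, 1) for every symbol and day
z = u * sigma  # z_{i,t} = sigma_i * u_{i,t}, broadcast by column label

# alpha_{i,t} = rho * r_{i,t+1} + sqrt(1 - rho^2) * z_{i,t}; the last row has no r_{t+1}.
alphas = (rho_boot * next_returns + np.sqrt(1 - rho_boot**2) * z).dropna()

# Each alpha column should correlate with its own next-period return at about rho_boot.
corr = alphas.corrwith(next_returns.loc[alphas.index])
print(corr.round(3))
```

The sample correlations will not equal 0.25 exactly because both the noise and the variance estimate are subject to sampling error.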

Once you have calculated the bootstrapped alphas, repeat the linear predictive forecasting exercise from the previous task. This time you will use the bootstrapped alphas as predictors, while the target stays the same as before: the future returns of DIA, as in the equation below. Only the predictors change, from the current returns of the constituents to the bootstrapped alphas you have calculated.

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

To ensure reproducibility, please set the random seed to 42. Don't use another seed, and remember to set it. Failing to follow these guidelines will result in point deductions.

**Motivation behind the task**

In the dynamic and complex world of financial markets, predictive modeling is a potent tool to decipher underlying patterns and trends that govern security prices. Coming up with good predictors for a certain set of assets is a complicated task and is not the purpose of this assignment. The concept of bootstrapped alphas, as delineated in this exercise, is a method to engineer artificial trading signals that can enhance the predictive power of financial models. It is equivalent to assuming that we have a way to predict the future returns of the index constituents. Look at the bootstrapped alpha equation, and in particular at the $t+1$ subscript on the return, to understand why we are talking about future returns.

The utilization of bootstrapped alphas is grounded in the mathematical formulation provided, where the alpha ($\alpha_{i,t}$) for a security $i$ at time $t$ is constructed using a combination of the next period return of the security ($r_{i,t+1}$) and a stochastic component ($z_{i,t}$) drawn from a normal distribution. This formulation allows for the incorporation of both deterministic and random elements, thereby mimicking the inherent uncertainty and volatility observed in financial markets.

By setting the correlation coefficient ($\rho_{\text{boot}}$) to 0.25, we are essentially moderating the influence of the artificial trading signal, ensuring that it does not overwhelmingly dictate the behavior of the bootstrapped alphas. This parameter, therefore, serves as a tuning knob, allowing us to control the strength of the trading signal and, consequently, its predictive power. However, you have to keep this parameter fixed for this exercise, as indicated by the prompt.

The subsequent step of employing these bootstrapped alphas as predictors in a linear predictive forecasting model is an exercise to highlight how well one can expect to forecast index returns, given a good way to predict future returns for the constituents. By replacing the current returns of the constituents with the calculated bootstrapped alphas, we are essentially enhancing the model with artificially generated yet statistically grounded signals that can potentially unveil deeper insights into the market dynamics.

**Grading Criteria**

- **Data Preparation (30 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (5 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' bootstrapped alpha and the index return.
- **Model Evaluation (5 points)**: Points will be awarded based on the proper choice of evaluation metric.

In [9]:
### START CODE HERE ###
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(42)

#0.data preparation
data = pd.read_csv("dows_daily.csv")
data = data[["Date","DIA"]+safe_assets] #safe_assets is from task1
#0.1 prepare return data
rtn_data = pd.DataFrame()
for col in data.columns:
    if col == "Date":
        rtn_data["Date"] = data[col]
    else:
        rtn_data[col] = data[col].pct_change(1) * 100    

rtn_data.dropna(inplace = True) # drop the first row as we can't find rtn for the first date.
rtn_data.reset_index(inplace = True,drop=True)


#0.2 bootstrapped alpha
rho_boot = 0.25
cut_off_date="2022-01-01" #Use 2022-01-01 as a cutoff date.

#0.2.1 random component 
sigma = rtn_data[rtn_data["Date"]<cut_off_date].iloc[:,2:].std() # per-security standard deviation (square root of the variance), estimated on the train set only
norm_dist_var = pd.DataFrame(np.random.normal(0,1,[rtn_data.shape[0],rtn_data.shape[1]-2])) # generate u~N(0,1),ignore the first two columns Date and DIA
z = sigma.values * norm_dist_var
random_component = ((1-rho_boot**2)**(1/2)*z)
new_columns = rtn_data.columns[2:]
random_component.columns = new_columns

#0.2.2 fixed component
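# Same-row returns stand in for r_{i,t+1}: the target Y below is the same-row DIA return,
# so each alpha row is paired with the return it is meant to anticipate.
fixed_component = rho_boot * rtn_data.iloc[:,2:]
fixed_component.reset_index(inplace = True,drop=True)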

#0.2.3 calculate bootstrapped alpha
bsa_data = fixed_component + random_component # "bsa" stands for bootstrapped alpha

#1. split again on processed bootstrapped alpha data set
train_data = bsa_data[rtn_data["Date"]<cut_off_date] # (1154, 15): one alpha column per selected symbol
test_data = bsa_data[rtn_data["Date"]>=cut_off_date] # (356, 15)

#2.fit the model
X = sm.add_constant(train_data)
Y = rtn_data[rtn_data["Date"]<cut_off_date]["DIA"]
Y.reset_index(inplace = True,drop=True)
model = sm.OLS(Y,X)
result = model.fit()

#3. predict the test set
test_X = sm.add_constant(test_data)
test_Y_hat = result.predict(test_X)
# test_Y below keeps rtn_data's row labels, which match those of test_Y_hat, so no index reset is needed here
test_Y = rtn_data[rtn_data["Date"]>=cut_off_date]["DIA"]

#4. Evaluation metrics
test_mse = ((test_Y - test_Y_hat)**2).mean()
print("Model's Adjusted R square: {}".format(round(result.rsquared_adj,4)))
print("MSE for the test set: {}".format(round(test_mse,4)))

### END CODE HERE ###

Model's Adjusted R square: 0.3415
MSE for the test set: 0.8585


In [10]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | DIA | R-squared: | 0.35 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 40.86 |
| Date: | Tue, 03 Oct 2023 | Prob (F-statistic): | 1.68e-95 |
| Time: | 19:11:50 | Log-Likelihood: | -1702.5 |
| No. Observations: | 1154 | AIC: | 3437.0 |
| Df Residuals: | 1138 | BIC: | 3518.0 |
| Df Model: | 15 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0209 | 0.032 | 0.660 | 0.510 | -0.041 | 0.083 |
| NKE.N | 0.0805 | 0.016 | 4.988 | 0.000 | 0.049 | 0.112 |
| CSCO.OQ | 0.0999 | 0.018 | 5.606 | 0.000 | 0.065 | 0.135 |
| DIS.N | 0.0650 | 0.017 | 3.773 | 0.000 | 0.031 | 0.099 |
| INTC.OQ | 0.0705 | 0.014 | 5.141 | 0.000 | 0.044 | 0.097 |
| HD.N | 0.0929 | 0.019 | 4.905 | 0.000 | 0.056 | 0.130 |
| UNH.N | 0.0993 | 0.017 | 5.894 | 0.000 | 0.066 | 0.132 |
| MSFT.OQ | 0.1205 | 0.018 | 6.681 | 0.000 | 0.085 | 0.156 |
| HON.OQ | 0.1424 | 0.019 | 7.611 | 0.000 | 0.106 | 0.179 |

| | | | |
|---|---|---|---|
| Omnibus: | 186.038 | Durbin-Watson: | 2.236 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2759.881 |
| Skew: | -0.178 | Prob(JB): | 0.0 |
| Kurtosis: | 10.568 | Cond. No. | 2.45 |
