# Section 1: Linear Regression Application 1: CAPM
Capital Asset Pricing Model (CAPM) is one of the most important pricing model in finance, identified as:
$$
E(R_i)=r_f+\beta \times E(R_m - r_f)
$$
where:
- **\( $E(R_i)$ \)**: Expected return of asset i.
- **\( $r_f$ \)**: Risk-free rate.
- **\( $R_m$\)**: Return of market portfolio.
- **\( $\beta$ \)**: Measure the sensitivity of the change in $E(R_i)$ when $E(R_m)$ changes.
Based on the linear regression theory, the intercept in the regression is uncertain, which is violating the CAPM. To meet the model, we can reshape CAPM:
$$
E(R_i) - r_f = \alpha +\beta \times E(R_m - r_f) + \epsilon_i
$$
where: 
- **\( $\alpha)$ \)**: Jensen's Alpha (or Alpha). Asset i is underpriced (overpriced) if Alpha is significant and > 0 (< 0). 
- **\( $\epsilon_i$ \)**: Disturbance term in the linear regression. 


In this section, we will regress our first financial model based on CAPM.

In [1]:
import pandas as pd
import numpy as np 
import yfinance as yf 
import statsmodels.api as sm 


In [2]:
#step 1: Select a security and download the data via yfinance package

#I selected walmart
tickers = ["WMT", "SPY"]

#I consider 10-year historical data
start_date = "2014-01-01"
end_date = "2025-01-01"

#I consider monthly data

freq = "1mo"

#I finally download the data from yfinance, including closing price only

p_close = yf.download(tickers, start_date, end_date, interval= freq)["Close"]

p_close


[*********************100%***********************]  2 of 2 completed


  quotes.loc[idx2, "Low"] = _np.nanmin([quotes["Low"][n - 1], quotes["Low"][n - 2]])
  quotes.loc[idx2, "High"] = _np.nanmax([quotes["High"][n - 1], quotes["High"][n - 2]])
  quotes.loc[idx2, "Close"] = quotes["Close"][n - 1]
  quotes.loc[idx2, "Adj Close"] = quotes["Adj Close"][n - 1]
  quotes.loc[idx2, "Volume"] += quotes["Volume"][n - 1]
  quotes.loc[idx2, "Low"] = _np.nanmin([quotes["Low"][n - 1], quotes["Low"][n - 2]])
  quotes.loc[idx2, "Close"] = quotes["Close"][n - 1]
  quotes.loc[idx2, "Adj Close"] = quotes["Adj Close"][n - 1]
  quotes.loc[idx2, "Volume"] += quotes["Volume"][n - 1]


Unnamed: 0_level_0,SPY,WMT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2014-01-01,178.179993,24.893333
2014-02-01,186.289993,24.900000
2014-03-01,187.009995,25.476667
2014-04-01,188.309998,26.570000
2014-05-01,192.679993,25.590000
...,...,...
2024-08-01,563.679993,77.230003
2024-09-01,573.760010,80.750000
2024-10-01,568.640015,81.949997
2024-11-01,602.549988,92.500000


In [3]:
#step 2: Calculate the monthly return of walmart and market portfolio.

r = np.log(p_close) - np.log(p_close.shift(1))

returns = r.dropna()

returns

Unnamed: 0_level_0,SPY,WMT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2014-02-01,0.044510,0.000268
2014-03-01,0.003857,0.022895
2014-04-01,0.006927,0.042020
2014-05-01,0.022941,-0.037581
2014-06-01,0.015654,-0.022393
...,...,...
2024-08-01,0.023097,0.117913
2024-09-01,0.017725,0.044570
2024-10-01,-0.008964,0.014751
2024-11-01,0.057923,0.121099


In [4]:
#step 3: determine the risk-free rate 

# to save your time, I assume the annual risk-free rate as 3% for the entire class and semester. 

#But we need to convert these risk-free rate to monthly perspective. 

r_f = (1+0.03)**(1/12) - 1

r_f


0.0024662697723036864

In [5]:
#step 4: determine the excess return of walmart (Y) and the market risk premium

Y = returns["WMT"] - r_f

X = returns["SPY"] - r_f




In [6]:
#step 5: present linear regression

X = sm.add_constant(X)

capm = sm.OLS(Y, X).fit()

print(capm.summary())

                            OLS Regression Results                            
Dep. Variable:                    WMT   R-squared:                       0.150
Model:                            OLS   Adj. R-squared:                  0.144
Method:                 Least Squares   F-statistic:                     22.82
Date:                Tue, 31 Dec 2024   Prob (F-statistic):           4.77e-06
Time:                        19:50:36   Log-Likelihood:                 209.66
No. Observations:                 131   AIC:                            -415.3
Df Residuals:                     129   BIC:                            -409.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0042      0.004      0.969      0.3

In [7]:
#Step 6: Interpret the regression result.
#Based on the regression, I observe that the beta (slope of regression) of walmart is 0.4755, and I round it up to 0.48. The alpha (the constant term)
#of walmart is 0.0043. Based on this information, what's your comment on Walmart's stock?







In [8]:
#In-Class Exersize 1 (10 minutes): Select a stock and replicate the entire regression process. Based on your estimation results, introduce this stock and tell me about tyour thoughts.













# Section 2: Linear Regression Application 2: Multi-factor Model and Arbitrage Pricing Theory (APT)
CAPM assume that the return of stock is influenced by market fluctuation only. However, the return of stock can not only be influenced by market, and it has more information that can add into our regression model. For example, some macro indicators such as interest rate, GDP growth rate, or firm's specific information such as size (total asset) and revenue, etc. Multifactor model is widely used in hedging, stock pitch and statistical arbitrage. 
The multifactor model is shown below:
$$
E(R_i)=r_f+\beta_1 \times \lambda_1 + \beta_2 \times \lambda_2+...+\beta_n \times \lambda_n
$$
where:
- **\( $E(R_i)$ \)**: Expected return of asset i.
- **\( $r_f$ \)**: Risk-free rate.
- **\( $\lambda_n$\)**: Risk premium of factor n (or factor premium), calculated by ($R_n-r_f$) .
- **\( $\beta_n$ \)**: Measure the sensitivity of the change in $E(R_i)$ when factor n changes.

Again, based on the linear regression theory, the intercept in the regression is still uncertain. To meet the model requirment, we can reshape multi-factor model to:
$$
E(R_i) - r_f = \alpha +\beta_1 \times \lambda_1 + \beta_2 \times \lambda_2+...+\beta_n \times \lambda_n+\epsilon_i
$$
where: 
- **\( $\alpha)$ \)**: Jensen's Alpha (or Alpha). Asset i is underpriced (overpriced) if Alpha is significant and > 0 (< 0). 
- **\( $\epsilon_i$ \)**: Disturbance term in the multiple linear regression. 


In this section, we will regress our first financial model based on multifactor model.

## Section 2.1: Fama-French 3-factor and 5-factor models
Eugene Fama and Kenneth French developed their 3-factor model in 1992 that expands the CAPM by adding size risk and value risk factors to the market risk factor in CAPM. Fama shared the Nobel Prize in Economic Science because of his research for "Efficient Market Hypothesis". The 3-factor model is:

$$
R_{i,t}-r_{f,t}=\alpha_{i,t}+\beta_1(R_{m,t}-r_{f,t})+\beta_2SMB_t+\beta_3HML_t+\epsilon_{i,t}
$$
where:
- **\( $R_{i,t}$ \)**: Expected return of asset i at time t.
- **\( $r_{f,t}$ \)**: Risk-free rate at time t.
- **\( $R_{m,t}$\)**: Return of market risk premium at time t.
- **\( $SMB_t$\)**: Size premium (small minus big) at time t.
- **\( $HML_t$ \)**: Value premium (high minus low) at time t.
- **\( $\beta_{1,2,3}$ \)**: factor coefficients.

The 5-factor model is:

$$
R_{i,t}-r_{f,t}=\alpha_{i,t}+\beta_1(R_{m,t}-r_{f,t})+\beta_2SMB_t+\beta_3HML_t+\beta_4RMW_t+\beta_5CMA_t+\epsilon_{i,t}
$$
where:
- **\( $RMW_t$\)**: Difference between the returns with robust and weak profitability (robust minus weak) at time t.
- **\( $CMA_t$ \)**: Difference between the returns on conservative and aggressive investment strategy (conservative minus aggressive) at time t.

Let's apply the these two model to re-regress walmart's return again.


In [9]:

#step 1: first we need to retrieve the dataset that estimated by Fama and French for all three factors.


factor_3 =pd.read_csv("3_factor.csv")

factor_3


Unnamed: 0,date,Mkt-RF,SMB,HML,RF
0,192607,2.96,-2.56,-2.43,0.22
1,192608,2.64,-1.17,3.82,0.25
2,192609,0.36,-1.40,0.13,0.23
3,192610,-3.24,-0.09,0.70,0.32
4,192611,2.53,-0.10,-0.51,0.31
...,...,...,...,...,...
1175,202406,2.77,-3.06,-3.31,0.41
1176,202407,1.24,6.80,5.74,0.45
1177,202408,1.61,-3.55,-1.13,0.48
1178,202409,1.74,-0.17,-2.59,0.40


In [10]:
factor_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1180 entries, 0 to 1179
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    1180 non-null   int64  
 1   Mkt-RF  1180 non-null   float64
 2   SMB     1180 non-null   float64
 3   HML     1180 non-null   float64
 4   RF      1180 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 46.2 KB


In [11]:
#step 2: Now we need to create a dataset that including several stocks that we're interested in for further applications. 

#Based on my preference, I will select 5 different stocks with different market cap

# stock 1: Walmart (WMT) for mega-cap, stock 2: Coca-Cola (KO) for large-cap, stock 3: the GAP for mid-cap (GAP), and stock 4: Cros for small-cap (CROX).


tickers = ["WMT", "KO", "GAP", "CROX"]

#I already identified time inverval and frequency before, then I will not identify it again.


p_close = yf.download(tickers, start_date, end_date, interval= freq)["Close"]

p_close



[*********************100%***********************]  4 of 4 completed


  quotes.loc[idx2, "High"] = _np.nanmax([quotes["High"][n - 1], quotes["High"][n - 2]])
  quotes.loc[idx2, "Low"] = _np.nanmin([quotes["Low"][n - 1], quotes["Low"][n - 2]])
  quotes.loc[idx2, "Close"] = quotes["Close"][n - 1]
  quotes.loc[idx2, "Adj Close"] = quotes["Adj Close"][n - 1]
  quotes.loc[idx2, "Volume"] += quotes["Volume"][n - 1]
  quotes.loc[idx2, "High"] = _np.nanmax([quotes["High"][n - 1], quotes["High"][n - 2]])
  quotes.loc[idx2, "Low"] = _np.nanmin([quotes["Low"][n - 1], quotes["Low"][n - 2]])
  quotes.loc[idx2, "Close"] = quotes["Close"][n - 1]
  quotes.loc[idx2, "Adj Close"] = quotes["Adj Close"][n - 1]
  quotes.loc[idx2, "Volume"] += quotes["Volume"][n - 1]
  quotes.loc[idx2, "High"] = _np.nanmax([quotes["High"][n - 1], quotes["High"][n - 2]])
  quotes.loc[idx2, "Low"] = _np.nanmin([quotes["Low"][n - 1], quotes["Low"][n - 2]])
  quotes.loc[idx2, "Close"] = quotes["Close"][n - 1]
  quotes.loc[idx2, "Adj Close"] = quotes["Adj Close"][n - 1]
  quotes.loc[idx2, "Volume"

Unnamed: 0_level_0,CROX,GAP,KO,WMT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014-01-01,15.350000,38.080002,37.820000,24.893333
2014-02-01,15.230000,43.750000,38.200001,24.900000
2014-03-01,15.600000,40.060001,38.660000,25.476667
2014-04-01,15.130000,39.299999,40.790001,26.570000
2014-05-01,14.930000,41.230000,40.910000,25.590000
...,...,...,...,...
2024-08-01,146.169998,22.430000,72.470001,77.230003
2024-09-01,144.809998,22.049999,71.860001,80.750000
2024-10-01,107.820000,20.770000,65.309998,81.949997
2024-11-01,105.599998,24.250000,64.080002,92.500000


In [12]:
# next we can finish similar process that calculating the return of these four stocks


r = np.log(p_close) - np.log(p_close.shift (1))

returns = r.dropna()

returns

Unnamed: 0_level_0,CROX,GAP,KO,WMT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014-02-01,-0.007848,0.138802,0.009997,0.000268
2014-03-01,0.024004,-0.088113,0.011970,0.022895
2014-04-01,-0.030591,-0.019154,0.053632,0.042020
2014-05-01,-0.013307,0.047942,0.002938,-0.037581
2014-06-01,0.006676,0.008213,0.034830,-0.022393
...,...,...,...,...
2024-08-01,0.084173,-0.045750,0.082368,0.117913
2024-09-01,-0.009348,-0.017087,-0.008453,0.044570
2024-10-01,-0.294959,-0.059803,-0.095575,0.014751
2024-11-01,-0.020805,0.154907,-0.019013,0.121099


In [13]:
returns.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 131 entries, 2014-02-01 to 2024-12-01
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   CROX    131 non-null    float64
 1   GAP     131 non-null    float64
 2   KO      131 non-null    float64
 3   WMT     131 non-null    float64
dtypes: float64(4)
memory usage: 5.1 KB


In [14]:
# Next thing we need to do is combining two dataset together before runing linear regression. 

# We notice that the 3-factor document starts from July,1926 to Oct 2024, where our return's dataset starts from Feb 2014 to Dec 2024. We need to match the observation (row of dataset) before merging.


factor_3 = factor_3[factor_3["date"] >=201402]

factor_3 

Unnamed: 0,date,Mkt-RF,SMB,HML,RF
1051,201402,4.65,0.34,-0.31,0.00
1052,201403,0.43,-1.81,4.93,0.00
1053,201404,-0.19,-4.18,1.17,0.00
1054,201405,2.06,-1.88,-0.13,0.00
1055,201406,2.61,3.09,-0.70,0.00
...,...,...,...,...,...
1175,202406,2.77,-3.06,-3.31,0.41
1176,202407,1.24,6.80,5.74,0.45
1177,202408,1.61,-3.55,-1.13,0.48
1178,202409,1.74,-0.17,-2.59,0.40


In [15]:
# we also notice that the return's dataset are indexed by the date, and we don't need the last two observations, which is the return in Nov and Dec. 

# Next step we need to delete the last two observation in our returns dataset.

returns = returns.iloc[:-2, ]

returns

Unnamed: 0_level_0,CROX,GAP,KO,WMT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014-02-01,-0.007848,0.138802,0.009997,0.000268
2014-03-01,0.024004,-0.088113,0.011970,0.022895
2014-04-01,-0.030591,-0.019154,0.053632,0.042020
2014-05-01,-0.013307,0.047942,0.002938,-0.037581
2014-06-01,0.006676,0.008213,0.034830,-0.022393
...,...,...,...,...
2024-06-01,-0.064350,-0.192456,0.011376,0.029222
2024-07-01,-0.082598,-0.017311,0.047405,0.013642
2024-08-01,0.084173,-0.045750,0.082368,0.117913
2024-09-01,-0.009348,-0.017087,-0.008453,0.044570


In [19]:
# Next is combining these two dataset together. Notice that these two dataset have different index type, 
#e.g., index of 3-factor is from 1051 to 1179, whereas the index of returns is from 2014-02-01 to 2024-10-01. 
# unmatech index will report an error when you are trying to combine two dataset 
# so you need to "re-index" these two dataset
# below are methods

factor_3 = factor_3.reset_index(drop=True)

factor_3


Unnamed: 0,date,Mkt-RF,SMB,HML,RF
0,201402,4.65,0.34,-0.31,0.00
1,201403,0.43,-1.81,4.93,0.00
2,201404,-0.19,-4.18,1.17,0.00
3,201405,2.06,-1.88,-0.13,0.00
4,201406,2.61,3.09,-0.70,0.00
...,...,...,...,...,...
124,202406,2.77,-3.06,-3.31,0.41
125,202407,1.24,6.80,5.74,0.45
126,202408,1.61,-3.55,-1.13,0.48
127,202409,1.74,-0.17,-2.59,0.40


In [20]:
returns = returns.reset_index(drop = True)

returns

Unnamed: 0,CROX,GAP,KO,WMT
0,-0.007848,0.138802,0.009997,0.000268
1,0.024004,-0.088113,0.011970,0.022895
2,-0.030591,-0.019154,0.053632,0.042020
3,-0.013307,0.047942,0.002938,-0.037581
4,0.006676,0.008213,0.034830,-0.022393
...,...,...,...,...
124,-0.064350,-0.192456,0.011376,0.029222
125,-0.082598,-0.017311,0.047405,0.013642
126,0.084173,-0.045750,0.082368,0.117913
127,-0.009348,-0.017087,-0.008453,0.044570


In [21]:
# now we can combine these two dataset after adjusting the index.

reg_factor_3 = pd.concat([factor_3, returns], axis = 1) 

reg_factor_3

Unnamed: 0,date,Mkt-RF,SMB,HML,RF,CROX,GAP,KO,WMT
0,201402,4.65,0.34,-0.31,0.00,-0.007848,0.138802,0.009997,0.000268
1,201403,0.43,-1.81,4.93,0.00,0.024004,-0.088113,0.011970,0.022895
2,201404,-0.19,-4.18,1.17,0.00,-0.030591,-0.019154,0.053632,0.042020
3,201405,2.06,-1.88,-0.13,0.00,-0.013307,0.047942,0.002938,-0.037581
4,201406,2.61,3.09,-0.70,0.00,0.006676,0.008213,0.034830,-0.022393
...,...,...,...,...,...,...,...,...,...
124,202406,2.77,-3.06,-3.31,0.41,-0.064350,-0.192456,0.011376,0.029222
125,202407,1.24,6.80,5.74,0.45,-0.082598,-0.017311,0.047405,0.013642
126,202408,1.61,-3.55,-1.13,0.48,0.084173,-0.045750,0.082368,0.117913
127,202409,1.74,-0.17,-2.59,0.40,-0.009348,-0.017087,-0.008453,0.044570


In [32]:
#

reg_factor_3["CROX"] = reg_factor_3["CROX"]*100
reg_factor_3["GAP"] = reg_factor_3["GAP"]*100
reg_factor_3["WMT"] = reg_factor_3["WMT"]*100

reg_factor_3

Unnamed: 0,date,Mkt-RF,SMB,HML,RF,CROX,GAP,KO,WMT
0,201402,4.65,0.34,-0.31,0.00,-0.784836,13.880235,0.009997,0.026775
1,201403,0.43,-1.81,4.93,0.00,2.400380,-8.811325,0.011970,2.289524
2,201404,-0.19,-4.18,1.17,0.00,-3.059140,-1.915387,0.053632,4.201972
3,201405,2.06,-1.88,-0.13,0.00,-1.330690,4.794164,0.002938,-3.758109
4,201406,2.61,3.09,-0.70,0.00,0.667555,0.821261,0.034830,-2.239296
...,...,...,...,...,...,...,...,...,...
124,202406,2.77,-3.06,-3.31,0.41,-6.435005,-19.245560,0.011376,2.922208
125,202407,1.24,6.80,5.74,0.45,-8.259844,-1.731096,0.047405,1.364158
126,202408,1.61,-3.55,-1.13,0.48,8.417316,-4.574961,0.082368,11.791258
127,202409,1.74,-0.17,-2.59,0.40,-0.934779,-1.708679,-0.008453,4.456994


In [34]:
# Next, we can start to identify different dependent variables. 

r_crox = reg_factor_3["CROX"] - reg_factor_3["RF"]
r_gap = reg_factor_3["GAP"] -reg_factor_3["RF"]
r_ko = reg_factor_3["KO"] -reg_factor_3["RF"]
r_wmt = reg_factor_3["WMT"] -reg_factor_3["RF"]



In [35]:
# All dependent variables are included in 3-factor dataset. Now we need to identify dependent variables

X = reg_factor_3.iloc[:, 1:4]
X = sm.add_constant(X)


In [36]:
#now we run 3-factor model separately for different stocks.

reg_wmt_3f = sm.OLS(r_wmt, X).fit()

print(reg_wmt_3f.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.220
Model:                            OLS   Adj. R-squared:                  0.202
Method:                 Least Squares   F-statistic:                     11.78
Date:                Tue, 31 Dec 2024   Prob (F-statistic):           7.53e-07
Time:                        20:09:47   Log-Likelihood:                -380.06
No. Observations:                 129   AIC:                             768.1
Df Residuals:                     125   BIC:                             779.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1575      0.426      0.369      0.7

In [37]:
reg_crox_3f = sm.OLS(r_crox, X).fit()

print(reg_crox_3f.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.268
Model:                            OLS   Adj. R-squared:                  0.250
Method:                 Least Squares   F-statistic:                     15.24
Date:                Tue, 31 Dec 2024   Prob (F-statistic):           1.65e-08
Time:                        20:09:48   Log-Likelihood:                -505.29
No. Observations:                 129   AIC:                             1019.
Df Residuals:                     125   BIC:                             1030.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2342      1.126     -0.208      0.8

In [38]:
reg_gap_3f = sm.OLS(r_gap, X).fit()

print(reg_gap_3f.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.374
Model:                            OLS   Adj. R-squared:                  0.359
Method:                 Least Squares   F-statistic:                     24.91
Date:                Tue, 31 Dec 2024   Prob (F-statistic):           1.06e-12
Time:                        20:11:24   Log-Likelihood:                -506.25
No. Observations:                 129   AIC:                             1020.
Df Residuals:                     125   BIC:                             1032.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.2746      1.134     -2.006      0.0

In [39]:
reg_ko_3f = sm.OLS(r_ko, X).fit()

print(reg_ko_3f.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.733
Date:                Tue, 31 Dec 2024   Prob (F-statistic):              0.164
Time:                        20:12:22   Log-Likelihood:                 55.975
No. Observations:                 129   AIC:                            -103.9
Df Residuals:                     125   BIC:                            -92.51
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1252      0.015     -8.624      0.0