# Analysis of Cross-Sectional Data

Linear regression is an essential tool of any econometrician and is widely used throughout finance and economics. Linear regression’s success is owed to two key features: the availability of simple, closed-form estimators, and the ease and directness of interpretation.

## Model Description

$$y = X \beta + \varepsilon$$

With the following assumptions:

* $E(\varepsilon) = 0$
* $V(\varepsilon) = \sigma^2 I$ (homoskedastic and uncorrelated errors)
* $X$ is nonstochastic (fixed) with full column rank $k$

**OLS Estimator**: $\hat{\beta} = (X'X)^{-1} X'y$

**OLS Variance Estimator**: 

$$ \hat{\sigma}^2 = \frac{ \hat{\varepsilon}' \hat{\varepsilon} } {n-k}$$
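
The two estimators above can be computed directly with NumPy. The following is a minimal sketch on simulated data; the sample size, coefficients, noise level, and variable names are all illustrative assumptions, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n, k = 200, 3                           # illustrative sample size and number of regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # first column is the intercept
beta_true = np.array([1.0, 0.5, -2.0])  # made-up coefficients for the simulation
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat                # fitted residuals
sigma2_hat = resid @ resid / (n - k)    # variance estimator with the n - k correction

# estimated covariance matrix of beta_hat under homoskedasticity
cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly for the coefficients; the inverse is only needed for the covariance matrix.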

The main assumptions are:

* Linearity
* Conditional mean of the errors is zero
* Conditional homoskedasticity (constant variance $\sigma^2$)
* Conditional normality
* $X$ has full rank

What does heteroskedasticity mean?

The disturbances in matrix A are homoskedastic and uncorrelated; this is the simple case in which OLS is the best linear unbiased estimator (BLUE). The disturbances in matrices B and C are heteroskedastic. In matrix B, the variance is time-varying, increasing steadily over time; in matrix C, the variance depends on the value of $x$. The disturbances in matrix D are homoskedastic because the diagonal variances are constant, but the off-diagonal covariances are non-zero, so ordinary least squares is inefficient for a different reason: serial correlation.

$$A = \sigma^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \ \ \ B = \sigma^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix} $$ 
$$C = \sigma^2 \begin{bmatrix} x_1 & 0 & 0 \\ 0 & x_2 & 0 \\ 0 & 0 & x_3 \end{bmatrix} \ \ \ D = \sigma^2 \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}$$
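
To make the distinction concrete, the four covariance matrices can be built numerically and checked for constant variances and zero covariances. This is only an illustrative sketch: the values of `sigma2`, `rho`, and `x`, and the two helper functions, are assumptions introduced here.

```python
import numpy as np

sigma2, rho = 1.5, 0.6                  # illustrative values, not from the text
x = np.array([0.5, 1.0, 2.0])           # illustrative regressor values for matrix C

A = sigma2 * np.eye(3)
B = sigma2 * np.diag([1.0, 2.0, 3.0])
C = sigma2 * np.diag(x)
idx = np.arange(3)
D = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])   # entries sigma^2 * rho^|i-j|

def is_homoskedastic(V):
    """True when all diagonal entries (the variances) are equal."""
    return bool(np.allclose(np.diag(V), np.diag(V)[0]))

def is_uncorrelated(V):
    """True when all off-diagonal entries (the covariances) are zero."""
    return bool(np.allclose(V, np.diag(np.diag(V))))

for name, V in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, is_homoskedastic(V), is_uncorrelated(V))
```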

An alternative to modeling heteroskedastic data is to transform the data so that it is homoskedastic using generalized least squares (GLS). GLS extends OLS to allow for arbitrary weighting matrices. The GLS estimator of $\beta$ is defined as:

$$\hat{\beta}^{GLS} = (X'W^{-1}X)^{-1} X' W^{-1} y$$

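for some positive definite matrix $W$. The full value of GLS is realized only when $W$ is chosen well: the estimator is efficient when $W$ is proportional to the covariance matrix of the disturbances, $V(\varepsilon)$.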
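
As a rough sketch of the GLS formula, the example below simulates disturbances whose variance grows with the observation index (similar in spirit to matrix B) and uses the true variances as the diagonal of $W$. The data, coefficients, and variable names are assumptions made for illustration; in practice $W$ is usually unknown and must be estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.0])       # made-up coefficients

# Assumed disturbance variances growing with the observation index
w = np.arange(1, n + 1, dtype=float)    # diagonal of W
y = X @ beta_true + rng.normal(size=n) * np.sqrt(w)

# GLS with diagonal W: (X' W^{-1} X)^{-1} X' W^{-1} y
Xw = X / w[:, None]                     # W^{-1} X without forming the n x n matrix
beta_gls = np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Equivalent view: OLS on data rescaled by W^{-1/2}
s = 1.0 / np.sqrt(w)
beta_wls = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)[0]

# OLS for comparison: still unbiased here, but less efficient
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The second computation shows the "transform the data" interpretation: rescaling each observation by the inverse of its disturbance standard deviation and running OLS gives the same coefficients as the GLS formula.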

We could also use maximum likelihood to estimate the coefficients. It is important to note that the derivation of the OLS estimator does not require an assumption of normality. Moreover, the unbiasedness, variance, and BLUE properties do not rely on the conditional normality of the errors.
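
As a sketch of the maximum likelihood route, suppose in addition that the errors are conditionally normal, $\varepsilon \mid X \sim N(0, \sigma^2 I)$. The log-likelihood is then

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}$$

Setting the derivatives with respect to $\beta$ and $\sigma^2$ to zero gives

$$\hat{\beta}^{ML} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2_{ML} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}$$

so the ML estimate of $\beta$ coincides with OLS, while the ML variance estimate divides by $n$ rather than $n-k$.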


A t-test can be used to test a single hypothesis involving one or more coefficients.
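
A single linear hypothesis can be written as $R\beta = r$ for a row vector $R$, which covers both a restriction on one coefficient and a restriction linking several. The sketch below computes the corresponding t-statistic under homoskedasticity; the function `t_stat`, the simulated data, and the use of a normal approximation for the p-value are assumptions made for illustration.

```python
import math
import numpy as np

def t_stat(X, y, R, r=0.0):
    """t-statistic for the single linear hypothesis R @ beta = r under homoskedasticity."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)
    se = math.sqrt(sigma2_hat * (R @ XtX_inv @ R))   # standard error of R @ beta_hat
    return (R @ beta_hat - r) / se

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)   # made-up coefficients

t_one = t_stat(X, y, R=np.array([0.0, 1.0, 0.0]))    # H0: beta_1 = 0 (false in this simulation)
t_diff = t_stat(X, y, R=np.array([0.0, 1.0, -1.0]))  # H0: beta_1 - beta_2 = 0 (true here)

# Large-sample two-sided p-value from the normal approximation to the t distribution
p_one = math.erfc(abs(t_one) / math.sqrt(2))
```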

In linear factor models, Fama and French (1992) use returns on specially constructed portfolios as factors to capture specific types of risk. We will first study the Fama-French three-factor model and then move on to models with more factors.

The traditional asset pricing model, known formally as the capital asset pricing model (CAPM), uses only one variable, the return of the market as a whole, to describe the returns of a portfolio or stock. In contrast, the Fama–French model uses three variables. Fama and French started with the observation that two classes of stocks have tended to do better than the market as a whole: (i) small caps and (ii) stocks with a high book-to-market ratio (B/M; customarily called value stocks, in contrast to growth stocks).

They then added two factors to CAPM to reflect a portfolio's exposure to these two classes:

$$r = R_f + \alpha + \beta(R_m - R_f) + b_s \cdot SMB + b_v \cdot HML + \epsilon$$

Or we could write it as:

$$r - R_f = \alpha +  \beta(R_m - R_f) + b_s \cdot SMB + b_v \cdot HML + \epsilon $$
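
The excess-return form is an ordinary linear regression, so the loadings can be estimated by OLS. The sketch below uses simulated factor returns rather than the Fama-French data; the helper `fit_three_factor`, the hypothetical portfolio, and all numerical values are assumptions made for illustration. Only the column names `Mkt-RF`, `SMB`, and `HML` follow the dataset used later.

```python
import numpy as np
import pandas as pd

def fit_three_factor(excess_ret, factors):
    """OLS of r - R_f on a constant, Mkt-RF, SMB and HML; returns a labelled Series."""
    cols = ['Mkt-RF', 'SMB', 'HML']
    X = np.column_stack([np.ones(len(factors)), factors[cols].to_numpy()])
    coef, *_ = np.linalg.lstsq(X, np.asarray(excess_ret, dtype=float), rcond=None)
    return pd.Series(coef, index=['alpha', 'beta', 'b_s', 'b_v'])

# Simulated monthly factor returns in percent; NOT the Fama-French data
rng = np.random.default_rng(3)
T = 600
factors = pd.DataFrame(rng.normal(scale=[5.0, 3.0, 3.0], size=(T, 3)),
                       columns=['Mkt-RF', 'SMB', 'HML'])

# Hypothetical portfolio: alpha = 0.1, beta = 1.2, b_s = 0.5, b_v = -0.3
excess_ret = (0.1 + 1.2 * factors['Mkt-RF'] + 0.5 * factors['SMB']
              - 0.3 * factors['HML'] + rng.normal(scale=1.0, size=T))

loadings = fit_three_factor(excess_ret, factors)
print(loadings)
```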

We will use the dataset from Fama and French, which you can download [here](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html).

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [107]:
ff_monthly = pd.read_csv('../data/ThreeFactorsMonthly.CSV', skiprows=2)  # skip the first two rows
ff_monthly.head()

Unnamed: 0.1,Unnamed: 0,Mkt-RF,SMB,HML,RF
0,192607,2.96,-2.3,-2.87,0.22
1,192608,2.64,-1.4,4.19,0.25
2,192609,0.36,-1.32,0.01,0.23
3,192610,-3.24,0.04,0.51,0.32
4,192611,2.53,-0.2,-0.35,0.31


In [108]:
ff_monthly.tail()

Unnamed: 0.1,Unnamed: 0,Mkt-RF,SMB,HML,RF
1228,2017,21.51,-4.96,-13.84,0.8
1229,2018,-6.93,-3.15,-9.34,1.81
1230,2019,28.28,-6.26,-10.68,2.14
1231,2020,23.67,13.07,-47.2,0.44
1232,Copyright 2021 Kenneth R. French,,,,


In [109]:
ff_monthly[ff_monthly['Unnamed: 0'] == '  1927'].index  # first annual row is at 1138; keep only the rows before it

Int64Index([1138], dtype='int64')

In [110]:
ff_monthly = ff_monthly.iloc[0:1138, :]
ff_monthly.tail()

Unnamed: 0.1,Unnamed: 0,Mkt-RF,SMB,HML,RF
1133,202012,4.63,4.81,-1.36,0.01
1134,202101,-0.04,7.19,2.85,0.00
1135,202102,2.79,2.11,7.07,0.00
1136,Annual Factors: January-December,,,,
1137,,Mkt-RF,SMB,HML,RF


In [111]:
ff_monthly = ff_monthly.iloc[0:1136, :]
ff_monthly.tail()

Unnamed: 0.1,Unnamed: 0,Mkt-RF,SMB,HML,RF
1131,202010,-2.1,4.44,4.03,0.01
1132,202011,12.47,5.48,2.11,0.01
1133,202012,4.63,4.81,-1.36,0.01
1134,202101,-0.04,7.19,2.85,0.0
1135,202102,2.79,2.11,7.07,0.0


In [112]:
ff_monthly_clean = ff_monthly.set_index('Unnamed: 0')
ff_monthly_clean.head()

Unnamed: 0,Mkt-RF,SMB,HML,RF
192607,2.96,-2.3,-2.87,0.22
192608,2.64,-1.4,4.19,0.25
192609,0.36,-1.32,0.01,0.23
192610,-3.24,0.04,0.51,0.32
192611,2.53,-0.2,-0.35,0.31


In [113]:
ff_monthly_clean.tail()

Unnamed: 0,Mkt-RF,SMB,HML,RF
202010,-2.1,4.44,4.03,0.01
202011,12.47,5.48,2.11,0.01
202012,4.63,4.81,-1.36,0.01
202101,-0.04,7.19,2.85,0.0
202102,2.79,2.11,7.07,0.0


In [49]:
ff_monthly.columns = ['Date', 'mkt-rf', 'smb', 'hml', 'rf']

In [66]:
ff_monthly['Date']

0                                 192607
1                                 192608
2                                 192609
3                                 192610
4                                 192611
                      ...               
1228                                2017
1229                                2018
1230                                2019
1231                                2020
1232    Copyright 2021 Kenneth R. French
Name: Date, Length: 1233, dtype: object

In [62]:
ff_monthly.tail()

Unnamed: 0,Date,mkt-rf,smb,hml,rf
1228,2017,21.51,-4.96,-13.84,0.8
1229,2018,-6.93,-3.15,-9.34,1.81
1230,2019,28.28,-6.26,-10.68,2.14
1231,2020,23.67,13.07,-47.2,0.44
1232,Copyright 2021 Kenneth R. French,,,,


In [35]:
ff_monthly.tail()  # the end of the file holds annual factors and a copyright line, not monthly data

Unnamed: 0,Mkt-RF,SMB,HML,RF
2017,21.51,-4.96,-13.84,0.8
2018,-6.93,-3.15,-9.34,1.81
2019,28.28,-6.26,-10.68,2.14
2020,23.67,13.07,-47.2,0.44
Copyright 2021 Kenneth R. French,,,,


In [47]:
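# Index has no .apply; use the vectorized .str accessor instead.
# Monthly rows are exactly six digits (YYYYMM) once whitespace is stripped;
# a plain length check would also keep '  1927' and the copyright line.
ff_monthly.index.astype(str).str.strip().str.match(r'^\d{6}$')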

In [37]:
# convert index to datetime format; rows that are not YYYYMM become NaT
# (no .dropna() here: it would shorten the result so it no longer matches the frame's length)
ff_monthly['YM'] = pd.to_datetime(ff_monthly.index, format="%Y%m", errors='coerce')

In [29]:
ff_monthly.tail()

Unnamed: 0,Mkt-RF,SMB,HML,RF
NaT,21.51,-4.96,-13.84,0.8
NaT,-6.93,-3.15,-9.34,1.81
NaT,28.28,-6.26,-10.68,2.14
NaT,23.67,13.07,-47.2,0.44
NaT,,,,


In [30]:
ff_monthly.dropna()  # drops rows with missing column values; rows whose index is NaT remain

Unnamed: 0,Mkt-RF,SMB,HML,RF
1926-07-01,2.96,-2.30,-2.87,0.22
1926-08-01,2.64,-1.40,4.19,0.25
1926-09-01,0.36,-1.32,0.01,0.23
1926-10-01,-3.24,0.04,0.51,0.32
1926-11-01,2.53,-0.20,-0.35,0.31
...,...,...,...,...
NaT,13.30,6.53,22.86,0.20
NaT,21.51,-4.96,-13.84,0.80
NaT,-6.93,-3.15,-9.34,1.81
NaT,28.28,-6.26,-10.68,2.14
