# Chapter 7: Multifactor Models and Performance Measures

In Chapter 6, Capital Asset Pricing Model, we discussed the simplest one-factor linear
model: CAPM. As mentioned, this one-factor linear model serves as a benchmark for
more advanced and complex models. In this chapter, we will focus on the famous
Fama-French three-factor model, the Fama-French-Carhart four-factor model, and
the Fama-French five-factor model. After understanding those models, readers should
be able to develop their own multifactor linear models, for example by adding Gross
Domestic Product (GDP), the Consumer Price Index (CPI), a business cycle indicator,
or other variables as extra factors. In addition, we will discuss performance
measures, such as the Sharpe ratio, Treynor ratio, and Jensen's alpha. In particular,
the following topics will be covered in this chapter:

• Introduction to the Fama-French three-factor model

• Fama-French-Carhart four-factor model

• Fama-French five-factor model

• Other multifactor models

• Sharpe ratio and Treynor ratio

• Lower partial standard deviation and Sortino ratio

• Jensen's alpha

• How to merge different datasets

## Introduction to the Fama-French three-factor model

First, let's consider the basic three-factor linear model. See my notes on Google Docs for the full explanation.
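
For reference, a generic three-factor linear model can be written as follows. The symbols $y_t$, $x_{i,t}$, $\alpha$, $\beta_i$, and $\epsilon_t$ are notation assumed here for illustration, not taken from the notes:

$$
y_t = \alpha + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 x_{3,t} + \epsilon_t
$$

Here $\alpha$ is the intercept, each $\beta_i$ is the loading on factor $x_i$, and $\epsilon_t$ is the error term. An OLS regression estimates $\alpha$ and the three $\beta_i$.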

Below, we will write some basic Python code to illustrate this.


In [8]:
import statsmodels.api as sm
import pandas as pd

# Toy data with no economic meaning: five observations, three regressors
y = [0.065, 0.0265, -0.0593, -0.001, 0.0346]
x1 = [0.055, -0.09, -0.041, 0.045, 0.022]
x2 = [0.025, 0.10, 0.021, 0.145, 0.012]
x3 = [0.015, -0.08, 0.341, 0.245, -0.022]
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

y = df["y"]
x = df[["x1", "x2", "x3"]]
x = sm.add_constant(x)  # add the intercept column

result = sm.OLS(y, x).fit()
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.960
Model:                            OLS   Adj. R-squared:                  0.841
Method:                 Least Squares   F-statistic:                     8.073
Date:                Fri, 09 Feb 2024   Prob (F-statistic):              0.252
Time:                        12:06:52   Log-Likelihood:                 16.837
No. Observations:                   5   AIC:                            -25.67
Df Residuals:                       1   BIC:                            -27.24
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0336      0.013      2.518      0.2

  warn("omni_normtest is not valid with less than 8 observations; %i "


Note that the textbook still assumes pandas has an OLS module, though this has evidently been deprecated in favor of statsmodels. We will be using statsmodels.api instead.

In [12]:
import scipy.stats as stats
alpha=0.05
dfNumerator=3
dfDenominator=1
f=stats.f.ppf(q=1-alpha, dfn=dfNumerator, dfd=dfDenominator)
print(f)

215.70734536960884


The confidence level is equal to 1 minus alpha, that is, 95% in this case. A higher
confidence level, such as 99% instead of 95%, demands stronger evidence before the
null hypothesis is rejected. The most-used confidence levels are 90%, 95%, and 99%.
dfNumerator (dfDenominator) is the degrees of freedom for the numerator
(denominator), which depends on the number of regressors and the sample size.
From the preceding result of OLS regression (Df Model and Df Residuals), we know
that those two values are 3 and 1.
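
As a minimal sketch of where the two degrees of freedom come from, assume a regression with n observations and k regressors plus an intercept. The variable names below are illustrative and do not appear in the textbook:

```python
import scipy.stats as stats

n_obs = 5          # number of observations in the toy regression
k_regressors = 3   # x1, x2, x3 (the intercept is not counted)

df_numerator = k_regressors                # "Df Model" in the summary
df_denominator = n_obs - k_regressors - 1  # "Df Residuals" in the summary

# Critical F-values at the three most-used confidence levels
critical_values = {}
for confidence in (0.90, 0.95, 0.99):
    critical_values[confidence] = stats.f.ppf(
        q=confidence, dfn=df_numerator, dfd=df_denominator
    )
    print(confidence, critical_values[confidence])
```

With only one degree of freedom in the denominator, the critical value grows very quickly as the confidence level rises, which is why such a small sample can hardly reject anything.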


From the preceding values, F=8.1 < 215.7 (critical F-value), so we fail to reject the null
hypothesis that all slope coefficients are zero, that is, the quality of the model is not good.
Likewise, the P-value of 0.25 is well above the significance level of 0.05, which
leads to the same conclusion. This makes sense since we
entered those values without any meaning.
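
The P-value in the summary can be reproduced from the F-statistic and the two degrees of freedom. The following sketch takes the rounded F-statistic of 8.073 from the output above as given, so the result matches only up to rounding:

```python
import scipy.stats as stats

f_statistic = 8.073   # rounded value from the OLS summary above
df_numerator = 3
df_denominator = 1
alpha = 0.05

# Survival function: P(F > f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_numerator, df_denominator)
critical_f = stats.f.ppf(1 - alpha, df_numerator, df_denominator)

# Both decision rules must agree: compare F with the critical value,
# or compare the P-value with alpha
reject_null = (f_statistic > critical_f) and (p_value < alpha)
print(round(p_value, 3), reject_null)
```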


Consider a second example that uses IBM historical data.

In [15]:
from datetime import datetime, timedelta
import yfinance as yf

now = datetime.now()
five_years_ago = now - timedelta(days=365 * 5)
# Truncate both timestamps to midnight so that only the date part is kept
formatted_now = datetime(now.year, now.month, now.day)
formatted_before = datetime(five_years_ago.year, five_years_ago.month, five_years_ago.day)

# Download five years of daily IBM prices
stock = yf.download('IBM', start=formatted_before, end=formatted_now)
stock.tail()

[*********************100%***********************]  1 of 1 completed


                  Open        High         Low       Close   Adj Close   Volume
Date
2024-02-02  187.100006  187.389999  185.619995  185.789993  184.111465  4054200
2024-02-05  185.509995  185.779999  183.259995  183.419998  181.762894  4379600
2024-02-06  183.550003  184.679993  183.039993  183.410004  181.752991  3337600
2024-02-07  183.339996  184.020004  182.630005  183.740005  182.080002  4841200
2024-02-08  182.630005  184.550003  181.490005  184.360001  184.360001  5161200


Now let's consider a three-factor linear model, with Adj Close as the dependent variable and Open, High, and Volume as the independent ones.

In [16]:
X = stock[['Open','High','Volume']]
Y = stock['Adj Close']
X = sm.add_constant(X)

result=sm.OLS(Y,X).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              Adj Close   R-squared:                       0.810
Model:                            OLS   Adj. R-squared:                  0.809
Method:                 Least Squares   F-statistic:                     1776.
Date:                Fri, 09 Feb 2024   Prob (F-statistic):               0.00
Time:                        15:16:24   Log-Likelihood:                -4320.3
No. Observations:                1258   AIC:                             8649.
Df Residuals:                    1254   BIC:                             8669.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -48.5670      2.347    -20.689      0.0

The line X = sm.add_constant(X) adds a column of 1s to the regressors. If the line is
missing, we would force a zero intercept. The advantage of using the
statsmodels.api.OLS() function is that its summary provides more information about
our results, such as the Akaike Information Criterion (AIC), the Bayesian Information
Criterion (BIC), skew, and kurtosis. The discussion of their definitions will be
postponed to the next chapter (Chapter 8, Time-Series Analysis).

## Fama-French three-factor model

TBD

## How to merge different datasets

Consider the following code as a means to merge datasets:

In [1]:
import pandas as pd
import scipy as s
x= pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})

y = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K6'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

In [3]:
print(x.shape)  # DataFrame.shape; the scipy.shape alias is deprecated
print(x)

(4, 3)
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3




In [4]:
print(y.shape)  # DataFrame.shape; the scipy.shape alias is deprecated
print(y)

(4, 3)
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K6  C3  D3




Now assume we want to merge these two datasets using the common column "key". Since the common values of this variable are K0, K1,
and K2, the final result should have three rows and five columns; K3 and K6
are dropped because they do not appear in both datasets. See the result shown here:

In [5]:
result = pd.merge(x,y, on='key')
print(result)

  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2


As we can see, key was not duplicated when merged. Note that the merge function can be further expanded upon to include various joins: inner, left, right, and outer (full). An inner join, the default, keeps only the rows whose key appears in both datasets. An analogy
is students from a family with both parents. The left join is based on the left dataset.
In other words, our benchmark is the first dataset (left). An analogy is choosing
students from families with a mum. The right join is the opposite of the left, that is, the
benchmark is the second dataset (right). The outer join is the full dataset, which contains
the rows of both datasets, the same as students from all families: with both parents, with mum
only, and with dad only. We illustrate an example below:

In [6]:
import pandas as pd
x= pd.DataFrame({'YEAR': [2010,2011, 2012, 2013],
    'IBM': [0.2, -0.3, 0.13, -0.2],
    'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'YEAR': [2011,2013,2014, 2015],
    'C': [0.12, 0.23, 0.11, -0.1],
    'SP500': [0.1,0.17, -0.05, 0.13]})

print(pd.merge(x,y, on='YEAR'))
print(pd.merge(x,y, on='YEAR',how='outer'))
print(pd.merge(x,y, on='YEAR',how='left'))
print(pd.merge(x,y, on='YEAR',how='right'))

   YEAR  IBM   WMT     C  SP500
0  2011 -0.3  0.00  0.12   0.10
1  2013 -0.2  0.23  0.23   0.17
   YEAR   IBM   WMT     C  SP500
0  2010  0.20  0.10   NaN    NaN
1  2011 -0.30  0.00  0.12   0.10
2  2012  0.13  0.05   NaN    NaN
3  2013 -0.20  0.23  0.23   0.17
4  2014   NaN   NaN  0.11  -0.05
5  2015   NaN   NaN -0.10   0.13
   YEAR   IBM   WMT     C  SP500
0  2010  0.20  0.10   NaN    NaN
1  2011 -0.30  0.00  0.12   0.10
2  2012  0.13  0.05   NaN    NaN
3  2013 -0.20  0.23  0.23   0.17
   YEAR  IBM   WMT     C  SP500
0  2011 -0.3  0.00  0.12   0.10
1  2013 -0.2  0.23  0.23   0.17
2  2014  NaN   NaN  0.11  -0.05
3  2015  NaN   NaN -0.10   0.13
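
To check which side each row of an outer join came from, pd.merge accepts indicator=True, which adds a _merge column with the values left_only, right_only, or both. The sketch below reuses the two datasets from the preceding cell; the names outer and origin_counts are illustrative:

```python
import pandas as pd

x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'IBM': [0.2, -0.3, 0.13, -0.2],
                  'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'YEAR': [2011, 2013, 2014, 2015],
                  'C': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})

# indicator=True appends a "_merge" column describing each row's origin
outer = pd.merge(x, y, on='YEAR', how='outer', indicator=True)
print(outer[['YEAR', '_merge']])

# Count the rows per origin: both, left_only, right_only
origin_counts = outer['_merge'].value_counts()
print(origin_counts)
```

In terms of the family analogy, both corresponds to students with both parents, left_only to mum only, and right_only to dad only.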


When the common variable has different names in those two datasets, we should
specify their names by using left_on='left_name' and right_on='another_name';
see the following code:

In [7]:
import pandas as pd
x= pd.DataFrame({'YEAR': [2010,2011, 2012, 2013],
    'IBM': [0.2, -0.3, 0.13, -0.2],
    'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'date': [2011,2013,2014, 2015],
    'C': [0.12, 0.23, 0.11, -0.1],
    'SP500': [0.1,0.17, -0.05, 0.13]})
print(pd.merge(x,y, left_on='YEAR',right_on='date'))

   YEAR  IBM   WMT  date     C  SP500
0  2011 -0.3  0.00  2011  0.12   0.10
1  2013 -0.2  0.23  2013  0.23   0.17
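
Note that the result above keeps both YEAR and date, which carry the same values. One possible way to avoid the redundant column, shown here as a sketch rather than the textbook's method, is to rename the key in one dataset before merging. The names y_renamed and merged are illustrative:

```python
import pandas as pd

x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'IBM': [0.2, -0.3, 0.13, -0.2],
                  'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'date': [2011, 2013, 2014, 2015],
                  'C': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})

# Give the key the same name in both datasets, then merge on it
y_renamed = y.rename(columns={'date': 'YEAR'})
merged = pd.merge(x, y_renamed, on='YEAR')
print(merged)
```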


If we intend to merge based on the index (row numbers), we specify
left_index=True and right_index=True; see the following code. In a sense, since
both datasets have four rows, we simply put them together, row by row. The true
reason is that neither dataset has a specific index, so both use the default row
numbers 0 to 3. For a comparison, the ffMonthly.pkl data has the date as its index:

In [8]:
import pandas as pd
x= pd.DataFrame({'YEAR': [2010,2011, 2012, 2013],
    'IBM': [0.2, -0.3, 0.13, -0.2],
    'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'date': [2011,2013,2014, 2015],
    'C': [0.12, 0.23, 0.11, -0.1],
    'SP500': [0.1,0.17, -0.05, 0.13]})
print(pd.merge(x,y, right_index=True,left_index=True))

   YEAR   IBM   WMT  date     C  SP500
0  2010  0.20  0.10  2011  0.12   0.10
1  2011 -0.30  0.00  2013  0.23   0.17
2  2012  0.13  0.05  2014  0.11  -0.05
3  2013 -0.20  0.23  2015 -0.10   0.13
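
By contrast, when the index carries meaning, merging on the index aligns rows by index value rather than by position. The following sketch first sets the year column of each dataset as its index; the names x_by_year, y_by_year, and aligned are illustrative:

```python
import pandas as pd

x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'IBM': [0.2, -0.3, 0.13, -0.2],
                  'WMT': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'date': [2011, 2013, 2014, 2015],
                  'C': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})

# Use the year as the index of each dataset
x_by_year = x.set_index('YEAR')
y_by_year = y.set_index('date')

# Rows are now matched by year, not by row number
aligned = pd.merge(x_by_year, y_by_year, left_index=True, right_index=True)
print(aligned)
```

Only the years 2011 and 2013 appear in both indexes, so the default inner join keeps two rows instead of four.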
