# **Runtime Dependencies: Must Run First!**



In [None]:
import pandas as pd
from matplotlib import pyplot as plt

# Statsmodels API - Standard
import statsmodels.api as sm

# Statsmodels API - Formulaic
import statsmodels.formula.api as smf

# Bonus: Multiple Outputs Per Cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# **Module 9 - Topic 1: Linear Regression with Statsmodels**



This notebook serves as an introduction to the Statsmodels API package for running a linear regression in Python.

It focuses on the Python syntax rather than the statistical background of linear regression with Ordinary Least Squares (OLS).

You should also know how to import datasets into Python with Pandas for this tutorial.

Modules 9.1.1 and 9.1.2 review preprocessing financial data and a little bit of CAPM. If you're just interested in the linear regression syntax, jump to Module 9.1.3!

## **Module 9.1.1: CAPM Regression (Single Variable Linear Regression)**

In finance, the Capital Asset Pricing Model (CAPM) describes the relationship between returns on assets (usually stocks) and the underlying market risk.

The equation relating the returns is as follows:

$$E[R_i] = R_f + \beta_i(E[R_m] - R_f)$$

Where...

- $E[R_i]$ is the expected return on the asset
- $E[R_m]$ is the expected return on the market
- $R_f$ is the risk-free rate
- $\beta_i$ is the linear coefficient relating the asset's excess return to the market's excess return


*And as a note, when this relationship is estimated as a regression, the intercept is called $\alpha$. According to the CAPM theory, $\alpha$ should be zero! However, we'll probably see that this is not true for all assets.*
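To make the role of $\alpha$ concrete, here is one standard way to write CAPM as a regression on realized excess returns. The time subscript $t$ and the error term $\epsilon_{i,t}$ are notation introduced here for illustration; they do not appear in the equation above.

$$R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + \epsilon_{i,t}$$

Taking expectations with $\alpha_i = 0$ and $E[\epsilon_{i,t}] = 0$ gives back $E[R_i] = R_f + \beta_i(E[R_m] - R_f)$.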

In this notebook, we're going to walk through the process of calculating the alpha and beta for some stocks!

Want to know more? Here's an article: https://www.investopedia.com/terms/c/capm.asp

## **Module 9.1.2: Setting Up Regression**

To run this regression, we just need to set up our independent and dependent variables in a DataFrame!

For our CAPM regression, the dependent variable, expected excess return on an asset $E[R_i] - R_f$, is going to be the Y input for our observed data.

For our X, the independent variable, we're going to have the expected excess return on the market, $E[R_m] - R_f$.

When we run this OLS regression, we'll have the $\alpha$ and $\beta$ for our equity!

Let's import our data and work through it! We're going to start with Apple!

(Stock return info from Yahoo finance!)

In [None]:
loc = "https://github.com/mhall-simon/python/blob/main/data/misc/stocks-factors-capm.xlsx?raw=true"

aapl = pd.read_excel(loc, sheet_name="AAPL", index_col=0, parse_dates=True)
aapl.head()

To calculate returns, we're going to assume that someone purchased the stock at the end of the previous period, and sells it at the end of the current period.

This means that the only important column is Close!

In [None]:
aapl['Return'] = (aapl.Close - aapl.Close.shift(1)) / aapl.Close.shift(1)
aapl.head()
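As a quick cross-check of the return formula, the sketch below compares the manual shift-based calculation with pandas' built-in `pct_change()`. The prices are made up for illustration and are not taken from the workbook.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 110.0, 99.0, 108.9], name="Close")

# Manual simple return: (P_t - P_{t-1}) / P_{t-1}
manual = (close - close.shift(1)) / close.shift(1)

# Built-in equivalent
builtin = close.pct_change()

# Both leave NaN in the first row and agree (up to float rounding) elsewhere
same = np.allclose(manual, builtin, equal_nan=True)
print(same)
```

Either form works; `pct_change()` is just shorter to type.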

Nice!

Now, we don't have a return for the first period because there is no earlier close to compare it with!

Let's drop the NaN value!

In [None]:
aapl.dropna(inplace=True)
aapl.head()

Bonus Box: Since we're realizing the return at the end of the month, we should use an offset to make sure everything lines up! This isn't necessary for the regression, but is a quick and cool feature!

In [None]:
from pandas.tseries.offsets import MonthEnd
aapl.index = aapl.index + MonthEnd(1)
aapl.head()
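Here is a small sketch of how the offset behaves, using made-up dates. One thing to watch: `MonthEnd(1)` pushes a date that already sits on a month end into the next month, while `MonthEnd(0)` leaves it alone.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical dates, for illustration only
first_of_month = pd.Timestamp("2020-01-01")
already_month_end = pd.Timestamp("2020-01-31")

# MonthEnd(1) always moves forward to the next month end
a = first_of_month + MonthEnd(1)
b = already_month_end + MonthEnd(1)

# MonthEnd(0) rolls forward only if the date is not already a month end
c = first_of_month + MonthEnd(0)
d = already_month_end + MonthEnd(0)

print(a.date(), b.date(), c.date(), d.date())
```

If your price index is stamped on the first of each month, both variants land on the same month end.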

Now, let's isolate just the Return information for each stock and set them nicely into a DataFrame together!

In [None]:
R = pd.DataFrame(aapl.Return)

# Rename column for Apple
R = R.rename(columns={"Return":"AAPL"})
R.head()

Now let's do this again and add Amazon and Tesla!

In [None]:
# Import Sheets
amzn = pd.read_excel(loc, sheet_name="AMZN", index_col=0, parse_dates=True)
tsla = pd.read_excel(loc, sheet_name="TSLA", index_col=0, parse_dates=True)

# Calculate Returns
amzn['Return'] = (amzn.Close - amzn.Close.shift(1)) / amzn.Close.shift(1)
tsla['Return'] = (tsla.Close - tsla.Close.shift(1)) / tsla.Close.shift(1)

# Drop Nulls
amzn.dropna(inplace=True)
tsla.dropna(inplace=True)

# Month-End Offset
amzn.index = amzn.index + MonthEnd(1)
tsla.index = tsla.index + MonthEnd(1)

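# Merge Everything Together!
# The first merge adds a "Return" column (TSLA). The second merge collides on
# that name, so pandas suffixes them: Return_x (TSLA) and Return_y (AMZN).
R = pd.merge(R, tsla.Return, left_index=True, right_index=True)
R = pd.merge(R, amzn.Return, left_index=True, right_index=True)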

# Rename Columns
R = R.rename(columns={"Return_x":"TSLA","Return_y":"AMZN"})

R.head()
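The merge-then-rename steps rely on pandas' automatic `_x` / `_y` suffixes. As an alternative sketch (not the approach used in this notebook), `pd.concat` with a dict names the columns up front. The return values below are made up for illustration.

```python
import pandas as pd

# Hypothetical month-end returns, for illustration only
idx = pd.to_datetime(["2020-01-31", "2020-02-29"])
aapl_ret = pd.Series([0.01, 0.02], index=idx, name="Return")
tsla_ret = pd.Series([0.03, 0.04], index=idx, name="Return")
amzn_ret = pd.Series([0.05, 0.06], index=idx, name="Return")

# Dict keys become the column names; join="inner" keeps only shared dates
R_demo = pd.concat(
    {"AAPL": aapl_ret, "TSLA": tsla_ret, "AMZN": amzn_ret},
    axis=1,
    join="inner",
)
print(R_demo)
```

This avoids having to remember which suffix belongs to which ticker.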

Now, we just need to make these right for the CAPM regression by making them excess returns! To do this we need the risk free rate information.

Below is how we can import the data from the Kenneth French library:


In [None]:
from datetime import datetime
dp = lambda x: datetime.strptime(x, "%Y%m")

ff = pd.read_excel(loc, sheet_name="MktRf", index_col=0, parse_dates=True, date_parser=dp, header=3)
ff.head()
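In case `date_parser` is unavailable in your pandas version (newer releases deprecate it), a sketch of an alternative is to read the index as-is and convert it afterwards with `pd.to_datetime`. The index values below are hypothetical stand-ins for the `YYYYMM` layout, not the actual sheet contents.

```python
import pandas as pd

# Hypothetical raw index values in YYYYMM layout, for illustration only
raw = pd.Index([202001, 202002, 202003])

# Convert to strings first so the format string applies cleanly
parsed = pd.to_datetime(raw.astype(str), format="%Y%m")
print(parsed)
```

Each value parses to the first day of its month, which the month-end offset can then shift.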

Now we just need to run our offset again so it all matches! (This doesn't change our calculation, just makes merging work perfectly, every time!)

In [None]:
ff.index = ff.index + MonthEnd(1)
ff.head()

Let's merge this all together into a single DataFrame!

By default, `pd.merge` does an inner join, so we only keep dates that appear in both DataFrames!

In [None]:
R = pd.merge(R, ff.RF, left_index=True, right_index=True)
R.head()
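To see what the inner default does, here is a minimal sketch with made-up data in which the second input is missing one date.

```python
import pandas as pd

# Hypothetical data, for illustration only: `right` lacks the January row
left = pd.DataFrame(
    {"AAPL": [0.01, 0.02, 0.03]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
)
right = pd.Series(
    [0.10, 0.11],
    name="RF",
    index=pd.to_datetime(["2020-02-29", "2020-03-31"]),
)

# Default how="inner": only dates present on both sides survive
inner = pd.merge(left, right, left_index=True, right_index=True)

# how="outer": every date is kept and gaps are filled with NaN
outer = pd.merge(left, right, left_index=True, right_index=True, how="outer")

print(len(inner), len(outer))
```

For a regression, the inner join is usually what you want, since rows with missing values can't be used anyway.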

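Now, we can easily subtract out the RF rate! pandas lines the Series up on the date index and subtracts element by element.

In [None]:
# RF comes from the same sheet as Mkt-RF, which is divided by 100 further down
# because it is quoted in percent, so convert RF to a decimal first.
# Assumption: skip this line if your RF column is already in decimals.
R['RF'] = R['RF'] / 100

R.AAPL = R.AAPL - R.RF
R.TSLA = R.TSLA - R.RF
R.AMZN = R.AMZN - R.RF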

R.head()
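The same subtraction can be done for all tickers at once with `DataFrame.sub(..., axis=0)`, which broadcasts one Series across several columns. This is a sketch on made-up numbers, and it assumes the risk-free rate is quoted in percent while the stock returns are decimals.

```python
import pandas as pd

# Hypothetical data, for illustration only
idx = pd.to_datetime(["2020-01-31", "2020-02-29"])
R_demo = pd.DataFrame(
    {"AAPL": [0.05, -0.02], "TSLA": [0.10, 0.03], "RF": [0.12, 0.10]},
    index=idx,
)

# Assumption: RF is in percent, so convert to decimal before subtracting
rf_decimal = R_demo["RF"] / 100

# axis=0 matches the Series to the row index and subtracts it from each column
excess = R_demo[["AAPL", "TSLA"]].sub(rf_decimal, axis=0)
print(excess)
```

Mixing percent and decimal units shifts every excess return by roughly the level of RF, which mostly shows up in the estimated intercept.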

Now we can just isolate our stock data!

In [None]:
Re = R.iloc[:,0:3]
Re.head()

To finish setting up, we just need our excess return on the market, which is provided to us in the FF dataset!

The FF data quotes this number in percent, so we need to divide it by 100 to keep everything consistent with our decimal returns!

In [None]:
Re = Re.merge(ff['Mkt-RF'], left_index=True, right_index=True)
Re['Mkt-RF'] = Re['Mkt-RF'] / 100
Re.head()

Now, we have all of the data needed to run our regression setup and ready to go!

## **Module 9.1.3: Linear Regression Default Method**

Using statsmodels, it's really easy to get our linear regression running!

Let's do it for Tesla first!

Why do we need to add a constant to our independent variable? `sm.OLS` does not include an intercept on its own. Without the constant, the fitted line is forced through the origin and you'll only get a slope coefficient returned! So we add a column of ones to estimate both the y-intercept ($\alpha$) and the slope ($\beta$)!

In [None]:
Y = Re.TSLA

X = Re['Mkt-RF']
X = sm.add_constant(X)

model = sm.OLS(Y,X)
res = model.fit()

print(res.summary())

In the summary from the original run, Tesla has a beta of 2.38 and an alpha of -0.0583! Alpha here is the intercept (`const`) of the fitted line, not a residual.

Note: Keep in mind that the benchmarks you choose for the risk-free rate and the market return will change your estimates!

In [None]:
alpha = res.params['const']
beta = res.params['Mkt-RF']

alpha, beta

Now, let's visualize this really quickly!

In [None]:
plt.scatter(Re['Mkt-RF'],Re['TSLA'])
plt.plot(Re['Mkt-RF'], Re['Mkt-RF']*beta + alpha)

plt.title("CAPM Regression Results")
plt.ylabel("Excess Returns on Asset")
plt.xlabel("Excess Returns on Market")

plt.show()

## **Module 9.1.4: Linear Regression Formulaic Method**

In statsmodels, you can also use the formulaic method for running a regression!

This yields the exact same results. We just need to rename a column, because the formula parser would read the `-` in `Mkt-RF` as a minus sign.

I feel that this method is easier when we get into multivariable linear regression!

We're going to use this to get the alpha and beta for Apple!

In [None]:
data = Re.rename(columns={'Mkt-RF':'MktEx'})

resf = smf.ols(formula='AAPL ~ MktEx', data=data).fit()
print(resf.summary())
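If you'd rather not rename the column, one option is to quote it inside the formula with `Q("...")`. This sketch assumes statsmodels is using its patsy formula backend, and the data is synthetic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic excess returns, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({"Mkt-RF": rng.normal(0.01, 0.04, 60)})
df["AAPL"] = 0.003 + 1.1 * df["Mkt-RF"] + rng.normal(0, 0.02, 60)

# Q("...") quotes a column name that is not a valid Python identifier
res_q = smf.ols('AAPL ~ Q("Mkt-RF")', data=df).fit()

# Same regression using the rename approach, for comparison
renamed = df.rename(columns={"Mkt-RF": "MktEx"})
res_renamed = smf.ols("AAPL ~ MktEx", data=renamed).fit()

same_fit = np.allclose(res_q.params.values, res_renamed.params.values)
print(same_fit)
```

Unlike `sm.OLS`, the formula interface adds the intercept for you, so there is no `add_constant` step.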