# Homework 1 - Using OLS
## Data Analysis
### FINM August Review 

Mark Hendricks

hendricks@uchicago.edu

$$\newcommand{\spy}{\text{spy}}$$
$$\newcommand{\hyg}{\text{hyg}}$$

# Data
* This homework uses the file `/data/multi_asset_etf_data.xlsx`.
* Find the data in the GitHub repo associated with the module (link on Canvas).

The data file contains...
* Return rates, $r_t^i$, for various asset classes (via ETFs).
* Most notable among these securities is SPY, the return on the S&P 500. Denote this as $r^{\spy}_t$.
* A separate tab gives return rates for a particular portfolio, $r_t^p$.

In [8]:
import pandas as pd
info = pd.read_excel("../data/multi_asset_etf_data.xlsx", sheet_name="info")
security_returns = pd.read_excel("../data/multi_asset_etf_data.xlsx", sheet_name="security returns", index_col="Date")
portfolio_returns = pd.read_excel("../data/multi_asset_etf_data.xlsx", sheet_name="portfolio returns", index_col="Date")


# 1. Regression
## 1. 
Estimate the regression of the portfolio return on SPY:

$$r^p_t = \alpha + \beta r^{\spy}_t + \epsilon_t^{p,\spy}$$

Specifically, report your estimates of alpha, beta, and the r-squared.

In [9]:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

In [33]:
x = security_returns["SPY"]
y = portfolio_returns["portfolio"]

# Regress the portfolio return on SPY, with an intercept.
model = sm.OLS(y, sm.add_constant(x)).fit()

# Access the estimates by label; positional indexing on a labeled Series is deprecated.
alpha = model.params["const"]
beta = model.params["SPY"]
r2 = model.rsquared

Vals = [alpha, beta, r2]
Indices = ['alpha', 'beta', 'r2']
Res = pd.DataFrame(Vals, index=Indices, columns=["Value"])
Res




| | Value |
|---|---|
| alpha | -0.001336 |
| beta | 0.637509 |
| r2 | 0.759422 |


## 2. 
Estimate the regression of the portfolio return on SPY and on HYG, the return on high-yield
corporate bonds, denoted as $r^{\hyg}_t$:

$$r^p_t = {\alpha} + {\beta}^{\spy}r^{\spy}_t + {\beta}^{\hyg}r^{\hyg}_t + {\epsilon}_t$$

Specifically, report your estimates of alpha, the betas, and the r-squared.

*Note that the parameters (such as $\beta^{\spy}$) in this multivariate model are not the same as those used in the univariate model of part 1.*

In [50]:
X = security_returns[['SPY','HYG']]
y = portfolio_returns["portfolio"]

# Regress the portfolio return on SPY and HYG, with an intercept.
model_multi = sm.OLS(y, sm.add_constant(X)).fit()

# Access the estimates by label; positional indexing on a labeled Series is deprecated.
alpha = model_multi.params["const"]
beta_spy = model_multi.params["SPY"]
beta_hyg = model_multi.params["HYG"]
r2 = model_multi.rsquared

Vals = [alpha, beta_spy, beta_hyg, r2]
Indices = ['alpha', 'beta_spy', 'beta_hyg', 'r2']
Res = pd.DataFrame(Vals, index=Indices, columns=["Value"])
Res

| | Value |
|---|---|
| alpha | -0.001396 |
| beta_spy | 0.392238 |
| beta_hyg | 0.525285 |
| r2 | 0.836633 |


How do you access the individual r-squareds of SPY and HYG?
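A multivariate regression reports a single r-squared for the whole model. One way to get an r-squared for each regressor on its own is to run a separate univariate regression for each; with an intercept, that univariate r-squared equals the squared correlation between the two series. The sketch below illustrates this on simulated data. The arrays `spy`, `hyg`, and `port` and the helper `r_squared` are hypothetical stand-ins, not the homework data.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Simulated (hypothetical) returns; hyg is positively correlated with spy.
spy = rng.normal(0.01, 0.04, n)
hyg = 0.4 * spy + rng.normal(0.0, 0.02, n)
port = 0.4 * spy + 0.5 * hyg + rng.normal(0.0, 0.01, n)

def r_squared(y, *regressors):
    """R-squared of an OLS regression of y on the regressors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

r2_spy = r_squared(port, spy)        # SPY alone
r2_hyg = r_squared(port, hyg)        # HYG alone
r2_both = r_squared(port, spy, hyg)  # joint model

print(f"SPY only: {r2_spy:.4f}, HYG only: {r2_hyg:.4f}, both: {r2_both:.4f}")
```

When the regressors are correlated, the univariate r-squareds generally do not add up to the joint r-squared.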

## 3. 
Calculate the series of fitted regression values, sometimes referred to as $\hat{y}$ in standard textbooks:

$$\hat{r}^p_t = \hat{\alpha} + \hat{\beta}^{\spy}r^{\spy}_t + \hat{\beta}^{\hyg}r^{\hyg}_t$$

Your statistical package will output these fitted values for you, or you can construct them using the estimated parameters.

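What is the correlation between the fitted values, $\hat{r}^p_t$, and the actual portfolio returns, $r^p_t$? How does this compare to the r-squared of the regression in problem 2?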

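In an OLS regression with an intercept, the r-squared equals the squared correlation between the actual values and the fitted values, so this correlation is the square root of the r-squared.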

In [54]:
# Correlation between the actual portfolio returns and each model's fitted values.
corr_portfolio = portfolio_returns["portfolio"].corr(model.fittedvalues)
corr_portfolio_multi = portfolio_returns["portfolio"].corr(model_multi.fittedvalues)
# :.2% formats the number as a percentage with two decimal places
print(f'Correlation between portfolio and fitted values: {corr_portfolio_multi:.2%}.')
# ** is the exponentiation operator, so **2 squares the value
print(f'Square of this correlation is {corr_portfolio_multi**2:.2%} which equals the R-squared.')

Correlation between portfolio and fitted values: 91.47%.
Square of this correlation is 83.66% which equals the R-squared.


This high correlation indicates a strong linear relationship between the actual portfolio returns and the fitted values from the regression: the fitted values closely track the actual returns.
An R-squared of 83.66% means that 83.66% of the in-sample variation in the portfolio returns is explained by the returns on SPY and HYG. The model fits well, and SPY and HYG together account for most of the contemporaneous variation in the portfolio return.

## 4. 
How do the SPY betas differ across the univariate and multivariate models? How does this relate to the
correlation between $r^{\spy}$ and $r^{\hyg}$?

In [61]:
# SPY beta from the univariate and the multivariate regressions, accessed by label.
beta_spy_Uni = model.params["SPY"]
beta_spy_Multi = model_multi.params["SPY"]
corr_spy_hyg = security_returns['SPY'].corr(security_returns['HYG'])
print(f'Correlation between SPY and HYG is {corr_spy_hyg:.1%}')

Correlation between SPY and HYG is 77.0%
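One way to see the link between the two SPY betas is the in-sample omitted-variable identity: the univariate SPY beta equals the multivariate SPY beta plus the HYG beta times the slope from regressing HYG on SPY. The sketch below checks this on simulated data. The arrays `spy`, `hyg`, and `port` and the helper `ols` are hypothetical stand-ins, not the homework data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (hypothetical) returns; hyg is positively correlated with spy.
spy = rng.normal(0.01, 0.04, n)
hyg = 0.4 * spy + rng.normal(0.0, 0.02, n)
port = 0.4 * spy + 0.5 * hyg + rng.normal(0.0, 0.01, n)

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

beta_uni = ols(port, spy)[1]                 # univariate SPY beta
_, beta_spy, beta_hyg = ols(port, spy, hyg)  # multivariate betas
delta = ols(hyg, spy)[1]                     # slope of HYG on SPY

# Identity (exact in-sample): beta_uni = beta_spy + beta_hyg * delta
print(beta_uni, beta_spy + beta_hyg * delta)
```

In this homework the univariate SPY beta (0.637509) exceeds the multivariate one (0.392238). That is consistent with the identity: the HYG beta is positive (0.525285) and SPY and HYG are positively correlated (77.0%), so the univariate SPY beta absorbs part of HYG's effect.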


## 5. 
Without doing any calculation, would you expect the sample residual of the univariate regression or multivariate regression to have higher correlation to $r^{\hyg}$?

The multivariate regression includes HYG as a regressor. Since OLS sample residuals are always uncorrelated (in-sample) with the regressors, the multivariate residual has a correlation of zero with $r^{\hyg}$. In the univariate regression, HYG is not a regressor, so a portion of the variation in the residuals can be attributed to HYG. We would therefore expect the univariate residual to have the higher correlation with $r^{\hyg}$.
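The orthogonality argument can be checked numerically. The following is a minimal sketch on simulated data, not the homework data; the arrays `spy`, `hyg`, and `port` and the helper `residuals` are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated (hypothetical) returns; hyg is positively correlated with spy.
spy = rng.normal(0.01, 0.04, n)
hyg = 0.4 * spy + rng.normal(0.0, 0.02, n)
port = 0.4 * spy + 0.5 * hyg + rng.normal(0.0, 0.01, n)

def residuals(y, *regressors):
    """OLS residuals from regressing y on the regressors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

resid_uni = residuals(port, spy)         # HYG omitted
resid_multi = residuals(port, spy, hyg)  # HYG included

corr_uni = np.corrcoef(resid_uni, hyg)[0, 1]
corr_multi = np.corrcoef(resid_multi, hyg)[0, 1]
print(f"univariate residual vs HYG:   {corr_uni:.4f}")
print(f"multivariate residual vs HYG: {corr_multi:.4f}")
```

The multivariate correlation is zero up to floating-point error, while the univariate one is not.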


***

# 2. Decomposing and Replicating

## 1.
The portfolio return, $r_t^p$, is a combination of the base assets that are provided here. Use linear regression to uncover which weights were used in constructing the portfolio.

$$r_t^p = \alpha +\left(\boldsymbol{\beta}\right)' \boldsymbol{r}_t + \epsilon_t$$

where $\boldsymbol{r}$ denotes the vector of returns for the individual securities.
* What does the regression find were the original weights?
* How precise is the estimation? Consider the R-squared and t-stats.

*Feel free to include an $\alpha$ in this model, even though you know the portfolio is an exact function of the individual securities. The estimation should find $\alpha$ of (nearly) zero.*
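As a rough illustration of the idea, the sketch below builds a portfolio from known weights and recovers them by regression. The tickers `A`, `B`, `C`, the weights, and the simulated returns are all hypothetical and are not the homework data. It uses `np.linalg.lstsq` to stay self-contained; `sm.OLS` as in the earlier cells should give the same coefficients.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical asset returns and portfolio weights.
tickers = ["A", "B", "C"]
returns = pd.DataFrame(rng.normal(0.005, 0.03, (120, 3)), columns=tickers)
true_weights = pd.Series([0.5, 0.3, 0.2], index=tickers)
portfolio = returns @ true_weights  # exact linear combination, no noise

# Regress the portfolio on all assets, including an intercept.
X = np.column_stack([np.ones(len(returns)), returns.to_numpy()])
coef, *_ = np.linalg.lstsq(X, portfolio.to_numpy(), rcond=None)

alpha_hat = coef[0]
weights_hat = pd.Series(coef[1:], index=tickers)
print(f"alpha: {alpha_hat:.2e}")
print(weights_hat.round(6))
```

Because the portfolio is an exact function of the regressors, the residuals are numerically zero: the r-squared is 1, the standard errors are essentially zero, and the t-stats on the weights are extremely large.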

## 2.

$$\newcommand{\targ}{EEM}$$

Suppose that we want to mimic a return, **EEM**, using the other returns. Run the following regression, but
do so **only using data through the end of 2020.**

$$r_t^{\targ} = \alpha +\left(\boldsymbol{\beta}^{\boldsymbol{r}}\right)' \boldsymbol{r}_t + \epsilon_t$$

where $\boldsymbol{r}$ denotes the vector of returns for the other securities, excluding the target, **EEM**.

#### (a) 
Report the r-squared and the estimate of the vector, $\boldsymbol{\beta}^{\boldsymbol{r}}$.

#### (b) 
Report the t-stats of the explanatory returns. Which have absolute value greater than 2?

#### (c) 
Plot the returns of **EEM** along with the replication values.
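The steps in this problem can be sketched as follows, assuming monthly data with a date index. The series `target`, the columns `X1` and `X2`, and the simulated numbers are hypothetical stand-ins for EEM and the other securities. The t-stats use the classical OLS formula, $\hat{\beta}_j / \text{se}(\hat{\beta}_j)$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical monthly returns, 2015 through 2023.
dates = pd.date_range("2015-01-01", periods=108, freq="MS")
others = pd.DataFrame(rng.normal(0.005, 0.04, (108, 2)), index=dates, columns=["X1", "X2"])
noise = pd.Series(rng.normal(0.0, 0.01, 108), index=dates)
target = 0.8 * others["X1"] + 0.3 * others["X2"] + noise

# Keep only observations through the end of 2020.
X_is = others.loc[:"2020"]
y_is = target.loc[:"2020"]

# OLS with an intercept.
X = np.column_stack([np.ones(len(X_is)), X_is.to_numpy()])
y = y_is.to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = pd.Series(X @ coef, index=y_is.index)
resid = y - fitted.to_numpy()
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Classical standard errors and t-stats.
dof = len(y) - X.shape[1]
sigma2 = (resid @ resid) / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
tstats = pd.Series(coef / se, index=["const", *X_is.columns])

print(f"r-squared: {r2:.4f}")
print(tstats[tstats.abs() > 2])
```

For part (c), the actual and fitted series can be placed side by side, for example with `pd.concat([y_is, fitted], axis=1)`, and plotted with the DataFrame's `.plot()` method.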

## 3.
Perhaps the replication results in the previous problem are overstated given that they estimated the parameters within a sample and then evaluated how well the result fit in the same sample. This is known as in-sample fit.

Using the estimates through **2020** (the $\hat{\alpha}$ and $\widehat{\boldsymbol{\beta}}^{\boldsymbol{r}}$ from the previous problem), calculate the out-of-sample (OOS) values of the replication, using the **2021-2023** returns, denoted $\boldsymbol{r}_t^{\text{oos}}$:

$$\hat{r}_t^{\targ} = \left(\widehat{\boldsymbol{\beta}}^{\boldsymbol{r}}\right)' \boldsymbol{r}_t^{\text{oos}}$$

#### (a) 
What is the correlation between $\hat{r}_t^{\targ}$ and the realized out-of-sample target return, $r_t^{\targ}$?

#### (b) 
How does this compare to the r-squared from the regression above based on in-sample data (through 2020)?
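A minimal sketch of the out-of-sample evaluation, again on simulated data: the parameters are estimated through 2020, held fixed, and applied to the 2021-2023 regressors. The names `target`, `X1`, `X2`, and `design` are hypothetical. This sketch keeps the intercept in the fitted values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical monthly returns, 2015 through 2023.
dates = pd.date_range("2015-01-01", periods=108, freq="MS")
others = pd.DataFrame(rng.normal(0.005, 0.04, (108, 2)), index=dates, columns=["X1", "X2"])
noise = pd.Series(rng.normal(0.0, 0.01, 108), index=dates)
target = 0.8 * others["X1"] + 0.3 * others["X2"] + noise

def design(df):
    """Regressor matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(df)), df.to_numpy()])

# Estimate on the in-sample window (through 2020).
X_is, y_is = others.loc[:"2020"], target.loc[:"2020"]
coef, *_ = np.linalg.lstsq(design(X_is), y_is.to_numpy(), rcond=None)

# Apply the fixed estimates to the out-of-sample window (2021-2023).
X_oos, y_oos = others.loc["2021":"2023"], target.loc["2021":"2023"]
fitted_oos = pd.Series(design(X_oos) @ coef, index=X_oos.index)

corr_oos = fitted_oos.corr(y_oos)
corr_is = np.corrcoef(design(X_is) @ coef, y_is.to_numpy())[0, 1]

print(f"in-sample r-squared:            {corr_is**2:.4f}")
print(f"out-of-sample correlation:      {corr_oos:.4f}")
print(f"out-of-sample correlation^2:    {corr_oos**2:.4f}")
```

To compare on the same scale, square the out-of-sample correlation before setting it against the in-sample r-squared. Including or dropping the intercept only shifts the fitted values by a constant, so it does not change the correlation.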

***