# Finance Research Practicum Python Diagnostic

15.453 Finance Research Practicum 
- Term: IAP 2020
- TA: Huili Song (huilis@mit.edu)

This is a basic diagnostic designed to assess your familiarity with Python. You will likely come across similar problems during FRP. Feel free to use online resources or any other materials, but you are strongly advised to work on it by yourself.

# Brain Warm-up

You know that 2 + 2 equals 2 x 2. Now find a set of three different whole numbers whose sum equals their product.

# Programming resources

- Datacamp: Intermediate Python https://www.datacamp.com/courses/intermediate-python-for-data-science
- Datacamp: Importing Data (Part 1) https://www.datacamp.com/courses/importing-data-in-python-part-1
- Datacamp: Importing Data (Part 2) https://www.datacamp.com/courses/importing-data-in-python-part-2
- Datacamp: Introduction to Financial Concepts in Python https://www.datacamp.com/courses/intro-to-financial-concepts-using-python
- Datacamp: Introduction to Linear Models in Python https://www.datacamp.com/courses/introduction-to-linear-modeling-in-python
- Datacamp: Merging dataframes with pandas https://www.datacamp.com/courses/merging-dataframes-with-pandas

## Part 0: Set up

Import the necessary libraries.

In [1]:
# If it is the first time, you may need to install the necessary packages
!pip install pandas



In [2]:
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
# ...

## Part 1: Working with Data

Much of your time will be spent cleansing and manipulating data. This part assesses your familiarity with manipulating data types and creating new ones.
1. Please download daily data for the S&P 500 and the 10-Year Constant Maturity Rate, and quarterly data for Seasonally Adjusted GDP Growth, starting from 2009-01-01. Load all three datasets into Python. Links to each are below:


- https://fred.stlouisfed.org/series/DGS10
- https://fred.stlouisfed.org/series/SP500
- https://fred.stlouisfed.org/series/A191RP1Q027SBEA


In [5]:
# Load the datasets here
data_folder = '/Users/sophia/Desktop/FRP TA/FRP Coding/'
df_DGS10 = pd.read_csv(data_folder+ 'DGS10.csv')
df_SP500 = pd.read_csv(data_folder+ 'SP500.csv')
df_GDP = pd.read_csv(data_folder+ 'A191RP1Q027SBEA.csv')
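As a possible shortcut, `pd.read_csv` can handle the `.` placeholders and the date column at load time. The sketch below uses a small in-memory sample (the name `sample_csv` and its three rows are illustrative, copied from the SP500 preview) rather than the real files, so treat it as an assumption about the file layout and not as the intended solution.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the FRED CSV layout (DATE plus one value column).
sample_csv = io.StringIO(
    "DATE,SP500\n"
    "2010-01-15,1136.03\n"
    "2010-01-18,.\n"
    "2010-01-19,1150.23\n"
)

# na_values="." turns the placeholder into NaN, so the column is parsed as float.
# parse_dates converts the DATE column to datetime64 while reading.
df_sample = pd.read_csv(sample_csv, na_values=".", parse_dates=["DATE"])

print(df_sample.dtypes)
print(df_sample["SP500"].isna().sum())  # number of placeholder rows
```

With this approach the placeholder rows remain in the frame as NaN and can be dropped later with `dropna`.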

In [6]:
df_DGS10.head()

,DATE,DGS10
0,2009-01-02,2.46
1,2009-01-05,2.49
2,2009-01-06,2.51
3,2009-01-07,2.52
4,2009-01-08,2.47


In [7]:
df_DGS10.tail()

,DATE,DGS10
2870,2020-01-03,1.8
2871,2020-01-06,1.81
2872,2020-01-07,1.83
2873,2020-01-08,1.87
2874,2020-01-09,1.85


In [8]:
df_SP500

,DATE,SP500
0,2010-01-11,1146.98
1,2010-01-12,1136.22
2,2010-01-13,1145.68
3,2010-01-14,1148.46
4,2010-01-15,1136.03
5,2010-01-18,.
6,2010-01-19,1150.23
7,2010-01-20,1138.04
8,2010-01-21,1116.48
9,2010-01-22,1091.76


In [9]:
df_GDP.head()

,DATE,A191RP1Q027SBEA
0,2009-01-01,-4.5
1,2009-04-01,-1.2
2,2009-07-01,1.9
3,2009-10-01,5.9
4,2010-01-01,2.6


2.	This data needs to be cleansed. Some days have a price level “.” which causes Python to read this column as an object type instead of numeric. Please:


- Remove all rows with price “.” in the DGS10 & SP500 datasets

- Reformat all price columns to type numeric

- Rescale the GDP dataset’s growth rates from percent to decimal (divide everything by 100)

- Rename the second column of the GDP dataset to “GDPReturn”


In [10]:
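# Remove all rows with price “.” in the DGS10 & SP500 datasets
# .copy() makes each result an independent frame, so later column assignments do not warn
df_SP500 = df_SP500[df_SP500.SP500 != '.'].copy()
df_DGS10 = df_DGS10[df_DGS10.DGS10 != '.'].copy()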

In [11]:
# Reformat all price columns to type numeric
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning
df_SP500 = df_SP500.assign(SP500=pd.to_numeric(df_SP500.SP500))
df_DGS10 = df_DGS10.assign(DGS10=pd.to_numeric(df_DGS10.DGS10))


In [12]:
# Rescale the GDP dataset’s growth rates from percent to decimal (divide everything by 100)
df_GDP.A191RP1Q027SBEA = df_GDP.A191RP1Q027SBEA / 100

In [13]:
# Rename the second column of the GDP dataset to “GDPReturn”
df_GDP.rename(columns={'A191RP1Q027SBEA':'GDPReturn'}, inplace=True)
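The cleansing steps can also be written without in-place mutation. The following is a minimal sketch on hypothetical frames (`raw` and `gdp` are made up here, with values taken from the previews above); it uses `pd.to_numeric(..., errors="coerce")`, which is one alternative to filtering on the `.` string, not necessarily the approach the diagnostic expects.

```python
import pandas as pd

# Hypothetical raw frame: prices arrive as strings because of the "." rows.
raw = pd.DataFrame({
    "DATE": ["2010-01-15", "2010-01-18", "2010-01-19"],
    "SP500": ["1136.03", ".", "1150.23"],
})

# errors="coerce" maps anything non-numeric (here ".") to NaN.
clean = raw.assign(SP500=pd.to_numeric(raw["SP500"], errors="coerce"))
# Drop the rows that had no price and renumber the index.
clean = clean.dropna(subset=["SP500"]).reset_index(drop=True)

# Hypothetical GDP frame in percent; convert to decimal and rename the column.
gdp = pd.DataFrame({"DATE": ["2009-01-01", "2009-04-01"],
                    "A191RP1Q027SBEA": [-4.5, -1.2]})
gdp = gdp.assign(A191RP1Q027SBEA=gdp["A191RP1Q027SBEA"] / 100)
gdp = gdp.rename(columns={"A191RP1Q027SBEA": "GDPReturn"})

print(clean)
print(gdp)
```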

3.	It’s not a good idea to work with level data, so let’s transform the data. Please compute the daily returns of both the S&P and the 10-Year CMT and store them in new columns called “SP_Return” and “CMT_Return” respectively. The first row’s return should be NA.
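For reference, a simple daily return is the price divided by the previous price, minus one. The sketch below checks that `pct_change` matches that formula on the first three S&P 500 levels from the preview above; the variable names are illustrative.

```python
import pandas as pd

# First three S&P 500 levels shown in the preview above.
prices = pd.Series(
    [1146.98, 1136.22, 1145.68],
    index=pd.to_datetime(["2010-01-11", "2010-01-12", "2010-01-13"]),
)

# pct_change(1) computes p_t / p_{t-1} - 1; the first row has no predecessor.
by_method = prices.pct_change(1)
by_hand = prices / prices.shift(1) - 1

print(by_method)
print(by_hand)
```

The first entry is NaN in both series, which is why the first row is dropped in the next step.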

In [14]:
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning
df_SP500 = df_SP500.assign(SP_Return=df_SP500.SP500.pct_change(1))
df_DGS10 = df_DGS10.assign(CMT_Return=df_DGS10.DGS10.pct_change(1))


4.	Merge the two dataframes into a master dataframe. Please keep only the rows for which both dataframes have price data. Also remove the first row, since it has no return data.


In [15]:
df_master = df_SP500.set_index('DATE').join(df_DGS10.set_index('DATE')).dropna().filter(['SP_Return','CMT_Return'])

In [16]:
df_master

DATE,SP_Return,CMT_Return
2010-01-12,-0.009381,-0.028571
2010-01-13,0.008326,0.016043
2010-01-14,0.002427,-0.010526
2010-01-15,-0.010823,-0.015957
2010-01-19,0.012500,0.008108
2010-01-20,-0.010598,-0.013405
2010-01-21,-0.018945,-0.016304
2010-01-22,-0.022141,0.000000
2010-01-25,0.004598,0.011050
2010-01-26,-0.004203,-0.002732
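Keeping only the dates present in both frames is an inner join. As a sketch of the same idea with `pd.merge`, the example below uses two small hypothetical frames (`sp` and `cmt`, with return values borrowed from the table above and the dates deliberately mismatched for illustration).

```python
import pandas as pd

# Hypothetical frames: the dates only partly overlap.
sp = pd.DataFrame({
    "DATE": ["2010-01-12", "2010-01-13", "2010-01-14"],
    "SP_Return": [-0.009381, 0.008326, 0.002427],
})
cmt = pd.DataFrame({
    "DATE": ["2010-01-12", "2010-01-14", "2010-01-15"],
    "CMT_Return": [-0.028571, -0.010526, -0.015957],
})

# how="inner" keeps only the dates that appear in both frames.
merged = pd.merge(sp, cmt, on="DATE", how="inner").set_index("DATE")

print(merged)
```

The `join(...).dropna()` chain in the cell above reaches a similar result by left-joining and then discarding rows with missing values.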


5.	You’ll notice we have a frequency mismatch: quarterly returns for GDP but daily returns for the S&P and CMT. Please create a final table containing quarterly GDP and quarterly S&P & CMT returns. Use dates according to the quarterly GDP dataset. Also remove the first row, since it has no return data.



In [17]:
# Index each frame by date so it can be resampled and joined
df_SP500.index = pd.to_datetime(df_SP500.DATE)
df_DGS10.index = pd.to_datetime(df_DGS10.DATE)
df_GDP.index = pd.to_datetime(df_GDP.DATE)

# Take the first available observation in each quarter ('QS' = quarter start)
df_SP500_R_Q = df_SP500.filter(['SP500']).resample('QS').first()
df_DGS10_R_Q = df_DGS10.filter(['DGS10']).resample('QS').first()
# Quarter-over-quarter change in those first observations
df_SP500_R_Q['SP_Return'] = df_SP500_R_Q.SP500.pct_change(1)
df_DGS10_R_Q['CMT_Return'] = df_DGS10_R_Q.DGS10.pct_change(1)

# Join on the quarterly dates and drop quarters missing any of the three series
df_master_Q = df_SP500_R_Q.join(df_DGS10_R_Q).dropna().join(df_GDP).dropna().filter(['SP_Return','CMT_Return','GDPReturn'])

In [18]:
df_master_Q 

DATE,SP_Return,CMT_Return,GDPReturn
2010-04-01,0.027132,0.01039,0.057
2010-07-01,-0.127943,-0.239075,0.042
2010-10-01,0.115703,-0.141892,0.043
2011-01-01,0.109602,0.322835,0.012
2011-04-01,0.047599,0.029762,0.056
2011-07-01,0.005449,-0.069364,0.025
2011-10-01,-0.179477,-0.440994,0.054
2012-01-01,0.161777,0.094444,0.058
2012-04-01,0.111177,0.126904,0.033
2012-07-01,-0.037723,-0.274775,0.026
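The cell above derives quarterly returns from the first observation in each quarter. A different convention, shown here only as a sketch and not as the diagnostic's method, compounds the daily returns inside each quarter. The series `daily` is hypothetical data.

```python
import pandas as pd

# Hypothetical daily returns that span two calendar quarters.
daily = pd.Series(
    [0.01, -0.02, 0.03, 0.005],
    index=pd.to_datetime(["2010-02-01", "2010-03-01", "2010-04-01", "2010-05-03"]),
)

# Compound within each quarter: prod(1 + r) - 1, labelled by quarter start ('QS').
quarterly = (1 + daily).resample("QS").prod() - 1

print(quarterly)
```

Under this convention each return is labelled with the quarter in which it was earned, so the labels may differ by one quarter from a table built with `pct_change` on quarter-start levels.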


## Part 2: Understanding your data

1.	Provide the following information:


- Min, Max, 1st & 3rd quartile, Mean, Median of both return columns
- Which days did the Max/Min returns occur for both columns?
- Bonus points: What happened on these days to justify the returns?
- Correlations between both
- Standard deviation of both columns


In [19]:
# Min, Max, 1st & 3rd quartile, Mean, Median of both return columns
df_master.describe()

,SP_Return,CMT_Return
count,2497.0,2497.0
mean,0.000459,-2.2e-05
std,0.009289,0.021561
min,-0.066634,-0.097701
25%,-0.003277,-0.013089
50%,0.000603,0.0
75%,0.005088,0.012539
max,0.049594,0.101064


In [20]:
df_master_Q.describe()

,SP_Return,CMT_Return,GDPReturn
count,38.0,38.0,38.0
mean,0.027799,-0.000424,0.041053
std,0.070628,0.182278,0.017983
min,-0.179477,-0.440994,0.001
25%,0.009559,-0.122456,0.029
50%,0.037547,0.0,0.042
75%,0.057772,0.094004,0.05375
max,0.161777,0.503067,0.079


In [21]:
# Which days did the Max/Min returns occur for both columns?

print(df_master.idxmax())
print(df_master.max())
print(df_master.idxmin())
print(df_master.min())
# Bonus points: What happened on these days to justify the returns?

SP_Return     2018-12-26
CMT_Return    2016-11-09
dtype: object
SP_Return     0.049594
CMT_Return    0.101064
dtype: float64
SP_Return     2011-08-08
CMT_Return    2016-06-24
dtype: object
SP_Return    -0.066634
CMT_Return   -0.097701
dtype: float64
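If a single compact table is preferred, the same statistics can be gathered with `agg`, `quantile`, `idxmax` and `idxmin`. This is a sketch on a hypothetical four-row frame (`returns`, using the first rows of the daily table above); the names `summary`, `quartiles` and `extremes` are illustrative.

```python
import pandas as pd

# Hypothetical subset: the first four rows of the daily return table.
returns = pd.DataFrame(
    {"SP_Return": [-0.009381, 0.008326, 0.002427, -0.010823],
     "CMT_Return": [-0.028571, 0.016043, -0.010526, -0.015957]},
    index=pd.to_datetime(["2010-01-12", "2010-01-13", "2010-01-14", "2010-01-15"]),
)

# Location statistics and quartiles for each column.
summary = returns.agg(["min", "max", "mean", "median"])
quartiles = returns.quantile([0.25, 0.75])

# One row per column: the extreme values and the dates on which they occurred.
extremes = pd.DataFrame({
    "max_date": returns.idxmax(),
    "max": returns.max(),
    "min_date": returns.idxmin(),
    "min": returns.min(),
})

print(summary)
print(quartiles)
print(extremes)
```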


In [24]:
# Correlations between both
df_master_Q.corr()

,SP_Return,CMT_Return,GDPReturn
SP_Return,1.0,0.485578,-0.030807
CMT_Return,0.485578,1.0,-0.065821
GDPReturn,-0.030807,-0.065821,1.0
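Each entry of this matrix is a Pearson correlation: the sample covariance divided by the product of the sample standard deviations. The sketch below reproduces that calculation by hand on two short hypothetical series (`x` and `y`, taken from the first daily rows) and compares it with `Series.corr`.

```python
import pandas as pd

# Hypothetical short series: the first four daily returns of each column.
x = pd.Series([-0.028571, 0.016043, -0.010526, -0.015957])  # CMT_Return
y = pd.Series([-0.009381, 0.008326, 0.002427, -0.010823])   # SP_Return

# Sample covariance with the n - 1 denominator, matching pandas' default std().
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
corr_by_hand = cov_xy / (x.std() * y.std())

# pandas' built-in Pearson correlation for comparison.
corr_pandas = x.corr(y)

print(corr_by_hand, corr_pandas)
```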


In [25]:
# Standard deviation of both columns
df_master.std()

SP_Return     0.009289
CMT_Return    0.021561
dtype: float64

## Part 3: Modelling & Analytics

1.	Let’s see if there is any predictability between the two columns. Please run a regression in which the explanatory variable (X) is the CMT return and the response variable (Y) is the S&P return. Include an intercept term, and assign your regression results to a variable.


- Provide the coefficients
- Run a t-test and provide t-statistics on both coefficients (and p-values)
- What about R squared, Adjusted R Squared?
- Run an F-test and provide the F-statistic along with P-values


In [26]:
X = df_master["CMT_Return"] 
y = df_master["SP_Return"] 
X = sm.add_constant(X) 
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)

# Print out the statistics
model.summary()



Dep. Variable:,SP_Return,R-squared:,0.198
Model:,OLS,Adj. R-squared:,0.198
Method:,Least Squares,F-statistic:,615.7
Date:,"Sun, 12 Jan 2020",Prob (F-statistic):,1.13e-121
Time:,23:43:52,Log-Likelihood:,8416.1
No. Observations:,2497,AIC:,-16830.0
Df Residuals:,2495,BIC:,-16820.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0005,0.000,2.785,0.005,0.000,0.001
CMT_Return,0.1917,0.008,24.814,0.000,0.177,0.207

Omnibus:,270.728,Durbin-Watson:,2.091
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1988.158
Skew:,-0.215,Prob(JB):,0.0
Kurtosis:,7.35,Cond. No.,46.4
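The individual figures in this summary are also available on the fitted results object through attributes such as `params`, `tvalues`, `pvalues`, `rsquared`, `rsquared_adj`, `fvalue` and `f_pvalue`. To show where those numbers come from, the sketch below recomputes the main ones with NumPy on simulated data; the data-generating values (0.0005, 0.19 and the noise scales) are assumptions chosen only to resemble the magnitudes above.

```python
import numpy as np

# Simulated data (hypothetical): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(0.0, 0.02, n)
y = 0.0005 + 0.19 * x + rng.normal(0.0, 0.008, n)

# Design matrix with an intercept column, as sm.add_constant would build.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
k = X.shape[1] - 1                       # number of regressors excluding the intercept
sigma2 = resid @ resid / (n - k - 1)     # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov_beta))

ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - resid @ resid / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(beta, t_stats, r2, adj_r2, f_stat)
```

With a single regressor the F-statistic equals the square of the slope's t-statistic, which is consistent with the summary above (24.814 squared is roughly 615.7).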


2.	At the 90% and 95% confidence levels (that is, 10% and 5% significance), can bond returns explain the S&P returns?

In [None]:
# Yes at both levels: the CMT_Return coefficient has t = 24.814 with p < 0.001, well below both 0.10 and 0.05.

3.	Is the regression model suitable for modelling this phenomenon compared to just an intercept term? (Hint: F-Test)

In [None]:
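# Yes. The F-statistic is 615.7 with p-value 1.13e-121, so the model with CMT_Return fits significantly better than an intercept alone.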
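The F-statistic against the intercept-only model can be recovered from R squared, the number of observations and the number of regressors. The helper `f_from_r2` below is a hypothetical name introduced for this sketch; the inputs are the rounded values from the daily regression summary.

```python
def f_from_r2(r2, n_obs, n_regressors):
    """F-statistic for the null that all slope coefficients are zero."""
    df_resid = n_obs - n_regressors - 1
    return (r2 / n_regressors) / ((1 - r2) / df_resid)

# Rounded inputs from the daily summary: R-squared 0.198, 2497 observations, 1 regressor.
f_daily = f_from_r2(0.198, 2497, 1)

print(f_daily)
```

Because R squared is rounded to three decimals in the summary, the result lands close to, but not exactly on, the reported 615.7.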

4. Now run a regression where the explanatory variables (X) are the CMT return and GDP growth. The response variable (Y) is the S&P return. Your regression will be on a quarterly basis. Include an intercept term as well!

In [49]:
X = df_master_Q[["CMT_Return",'GDPReturn']]
y = df_master_Q["SP_Return"] 
X = sm.add_constant(X) 
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)

# Print out the statistics
model.summary()

Dep. Variable:,SP_Return,R-squared:,0.254
Model:,OLS,Adj. R-squared:,0.212
Method:,Least Squares,F-statistic:,5.968
Date:,"Thu, 17 Jan 2019",Prob (F-statistic):,0.00589
Time:,17:59:49,Log-Likelihood:,52.297
No. Observations:,38,AIC:,-98.59
Df Residuals:,35,BIC:,-93.68
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0040,0.023,0.174,0.863,-0.043,0.051
CMT_Return,0.1937,0.057,3.420,0.002,0.079,0.309
GDPReturn,0.5912,0.522,1.132,0.265,-0.469,1.651

Omnibus:,2.301,Durbin-Watson:,2.214
Prob(Omnibus):,0.316,Jarque-Bera (JB):,1.8
Skew:,-0.532,Prob(JB):,0.407
Kurtosis:,2.933,Cond. No.,50.7
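With two regressors and only 38 observations, the gap between R squared and adjusted R squared becomes visible. As a rough check, the sketch below applies the standard adjustment formula to the rounded values from this summary; `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared: penalises R-squared for the number of regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Rounded inputs from the quarterly summary: R-squared 0.254, 38 observations, 2 regressors.
adj_quarterly = adjusted_r2(0.254, 38, 2)

print(adj_quarterly)
```

The result is close to the reported 0.212; the small difference comes from using the rounded R squared.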


# Brain Warm-up Solution

The three different whole numbers whose sum equals their product are 1, 2, and 3.
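The answer can be confirmed with a brute-force search. The sketch below checks every set of three distinct whole numbers from 0 to 20; the upper bound of 20 is an arbitrary assumption made to keep the search small.

```python
from itertools import combinations

# Every set of three distinct whole numbers between 0 and 20 (bound chosen arbitrarily).
solutions = [
    triple
    for triple in combinations(range(0, 21), 3)
    if sum(triple) == triple[0] * triple[1] * triple[2]
]

print(solutions)
```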