# EXERCISE 3:

***Neural Networks and Gaussian process:***
Predict the SP500 with the financial indicators assigned to your team in the Google spreadsheet (ep, dp, de, dy, dfy, bm, svar, ntis, infl, tbl, see RLab3 2 GWcausalSP500.R), some lagged series of these indicators and lags of the target using a Neural Network and a GP regression with your desired kernel. Predict return, or price, or direction (up or down). Which target works best? Do some feature selection to disregard some variables, select appropriate lags: causality, (distance) correlation, VAR-test, Lasso ... (The script RLab5 GausProc.R can be of help. The dataset is goyal-welch2022Monthly.csv and work within the period 1927/2021.)

In our case, we have been assigned variables dp, de and ep.

# INDEX:

0. [DATA AND LIBRARY IMPORTS](#0.-DATA-AND-LIBRARY-IMPORTS)

1. [PREPROCESSING AND FEATURE ENGINEERING](#1.-PREPROCESSING-AND-FEATURE-ENGINEERING)

2. [LAG CREATION AND SELECTION](#2.-LAG-CREATION-AND-SELECTION)

# 0. DATA AND LIBRARY IMPORTS

[Back to Index](#INDEX)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import grangercausalitytests, adfuller, acf, pacf
from sklearn.preprocessing import StandardScaler, RobustScaler
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from scipy.stats import chi2
import dcor

Following some steps further down, we decided to keep the data from 1921 onward for now. The observations before 1927 give the differenced variables a first value and let us build lags, so that every feature is already defined at the start of 1927.
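The need for these extra early observations can be illustrated with a small sketch: `diff` and `shift` leave `NaN` in the first rows, so the modelling window should only be cut after the features are built. The toy series, the column names and the dates below are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy monthly series; values and dates are illustrative only
dates = pd.date_range("1926-09-01", periods=8, freq="MS")
toy = pd.DataFrame({"date": dates, "x": [1.0, 1.5, 1.2, 1.8, 2.0, 1.9, 2.3, 2.1]})

toy["dx"] = toy["x"].diff()            # first row becomes NaN
toy["dx_lag_1"] = toy["dx"].shift(1)   # first two rows become NaN
toy["dx_lag_2"] = toy["dx"].shift(2)   # first three rows become NaN

# Trim to the modelling window only after the features are built
window = toy.loc[toy["date"] >= "1927-01-01"].reset_index(drop=True)
print(window)
```

With four warm-up months before the cut-off, every differenced and lagged column is already filled in the first row of `window`.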

In [None]:
snp = pd.read_csv('goyal-welch2022Monthly.csv')

snp['yyyymm'] = snp['yyyymm'].astype(str)
snp['yyyymm'] = pd.to_datetime(snp['yyyymm'], format='%Y%m')
snp = snp.loc[(snp['yyyymm'] >= '1921-01-01') & (snp['yyyymm'] < '2022-01-01')].reset_index(drop=True)
snp['Index'] = snp['Index'].str.replace(',', '').astype(float)

display(snp)

[Back to Index](#INDEX)

# 1. PREPROCESSING AND FEATURE ENGINEERING

[Back to Index](#INDEX)

In [None]:
snp.info()

We can observe that THERE ARE NO MISSING VALUES FOR OUR VARIABLES OF INTEREST (Index, D12 and E12).

We now compute the features we need, each as a log ratio:

- Dividend Price Ratio (DP)
- Dividend Earnings Ratio (DE)
- Earnings Price Ratio (EP)
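Because the three ratios are built as differences of logs, they satisfy the identity DE = DP - EP, so only two of them carry independent linear information (and the same holds after differencing). The sketch below checks this on made-up numbers; the values for price, dividends and earnings are assumptions for illustration only.

```python
import numpy as np

# Made-up values for price, 12-month dividends and 12-month earnings
price = np.array([100.0, 105.0, 98.0])
d12 = np.array([3.0, 3.1, 3.2])
e12 = np.array([6.0, 6.5, 5.5])

dp = np.log(d12) - np.log(price)   # dividend price ratio (log)
de = np.log(d12) - np.log(e12)     # dividend earnings ratio (log)
ep = np.log(e12) - np.log(price)   # earnings price ratio (log)

# log(D/E) = log(D/P) - log(E/P)
identity_holds = np.allclose(de, dp - ep)
print(identity_holds)
```

This is worth keeping in mind during feature selection, since including all three ratios at the same lag adds a perfectly collinear column.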

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
snp = snp[['yyyymm', 'Index', 'D12', 'E12']].copy()

snp['LogReturns'] = np.log(snp['Index']).diff()
snp['PriceDiv'] = snp['Index'] + snp['D12']
snp['LogReturnsDiv'] = np.log(snp['PriceDiv']).diff()

# The first row of each differenced series is NaN. We fill it with 0 because the ADF test
# does not accept NaN values and we may still need to difference other series further.
snp[['LogReturns', 'LogReturnsDiv']] = snp[['LogReturns', 'LogReturnsDiv']].fillna(0)

In [None]:
snp['DP'] = np.log(snp['D12']) - np.log(snp['Index'])
snp['DE'] = np.log(snp['D12']) - np.log(snp['E12'])
snp['EP'] = np.log(snp['E12']) - np.log(snp['Index'])

display(snp.head())

In [None]:
fig, axes = plt.subplots(nrows=9, ncols=1, figsize=(10, 18))

columns_to_plot = ['Index', 'PriceDiv', 'D12', 'E12', 'LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

In [None]:
# We test for stationarity in the data

for col in columns_to_plot:
    adf_result = adfuller(snp[col])
    print(f'ADF Statistic for {col}: {adf_result[0]}')
    print(f'p-value for {col}: {adf_result[1]}\n')

Most of the variables we are interested in are stationary, but a couple of them should be differenced to make them stationary.

That being said, log returns and log returns with dividends are already clearly stationary, so we will not treat them further.

Additionally, the boxplots further below show that DE and EP are quite skewed, so even though they are already reasonably stationary, we difference them as well to center them a bit more.

In [None]:
snp['DP'] = snp['DP'].diff()
snp['DE'] = snp['DE'].diff()
snp['EP'] = snp['EP'].diff()

display(snp.head())

# We can safely remove the first row now instead of filling it with 0
snp.dropna(axis = 0, inplace = True)

display(snp.head())

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 6))

columns_to_plot = ['DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

# We test again for stationarity in the data

for col in columns_to_plot:
    adf_result = adfuller(snp[col])
    print(f'ADF Statistic for {col}: {adf_result[0]}')
    print(f'p-value for {col}: {adf_result[1]}\n')

We can see that the differenced series are now stationary, so we may proceed. The outliers need careful attention, though, as they are very specific to the 2008 crisis. We will first scale the data, considering robust scaling (i.e. centering on the median and dividing by the interquartile range, instead of using the mean and standard deviation), and afterwards, if there are still outliers, we will treat them.

In [None]:
columns_to_plot = ['LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    sns.boxplot(y=col, data=snp, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

In [None]:
std_scaler = StandardScaler()
robust_scaler = RobustScaler()

snp_std = snp.copy()

# Scale each column on its own (median / IQR); the scaler is refitted per column
for col in columns_to_plot:
    snp_std[col] = robust_scaler.fit_transform(snp_std[[col]])

# Plot the scaled copy (snp_std), not the original data
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    sns.boxplot(y=col, data=snp_std, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col} (robust-scaled)', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

As we can observe, the scaled version, even with the robust scaler, has ranges that are wider than those of the unscaled data. Because a logarithmic transformation had already been applied, the outliers were not excessively far from the center to begin with, with the exception of the DE and EP ratios.
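One way to see why scaling widens the range here: both scalers divide by a measure of spread, and for monthly log changes that spread is far below 1. The sketch below reproduces the median/IQR scaling (the defaults of `RobustScaler`) and the mean/standard-deviation scaling by hand with NumPy. The simulated series, its scale of 0.04 and the injected outliers are assumptions chosen only to resemble the situation described.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=0.04, size=1000)   # toy series on a "monthly log change" scale
x[:5] = [-0.35, 0.4, -0.3, 0.32, 0.45]           # a few injected outliers

q25, q50, q75 = np.percentile(x, [25, 50, 75])
robust = (x - q50) / (q75 - q25)        # median / interquartile range
standard = (x - x.mean()) / x.std()     # mean / standard deviation

ranges = {}
for name, values in [("raw", x), ("standard", standard), ("robust", robust)]:
    ranges[name] = values.max() - values.min()
    print(f"{name:>8}: range = {ranges[name]:.2f}")
```

Since the interquartile range ignores the tails, it is even smaller than the standard deviation, which is why the robust-scaled outliers end up furthest from the center.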

What we will do instead is "manually" cap the values whose absolute value exceeds 0.3 in the five columns we are considering. This is a simple form of winsorization: any value above 0.3 is set to 0.3 and any value below -0.3 is set to -0.3.
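Before capping, it can help to check how many observations the threshold actually touches. A minimal sketch, using a made-up series in place of the real columns (the values and the variable names are hypothetical):

```python
import pandas as pd

# Made-up values standing in for one of the differenced series
s = pd.Series([0.02, -0.45, 0.10, 0.31, -0.05, 0.30, -0.30, 0.62])
limit = 0.3

# Count the values strictly outside [-limit, limit], then cap them
n_clipped = int((s.abs() > limit).sum())
clipped = s.clip(lower=-limit, upper=limit)

print(f"{n_clipped} of {len(s)} values exceed ±{limit}")
print(clipped.tolist())
```

If a large share of the sample were affected, a fixed threshold would be distorting the bulk of the distribution rather than just the extremes.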

In [None]:
for col in columns_to_plot:
    snp[col] = snp[col].clip(upper=0.3, lower=-0.3)

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    sns.boxplot(y=col, data=snp, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=1, figsize=(10, 10))

columns_to_plot = ['LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

[Back to Index](#INDEX)

# 2. LAG CREATION AND SELECTION

[Back to Index](#INDEX)

In [None]:
columns_to_plot = ['LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']

fig, axes = plt.subplots(10, 1, figsize=(10, 20))

for i, col in enumerate(columns_to_plot):
    acf_index = 2 * i
    plot_acf(snp[col], lags = 100, alpha = 0.05, ax = axes[acf_index], title = f'ACF for {col}')
    axes[acf_index].grid(True)
    
    pacf_index = 2 * i + 1
    plot_pacf(snp[col], lags = 100, alpha = 0.05, ax = axes[pacf_index], title = f'PACF for {col}')
    axes[pacf_index].grid(True)

plt.tight_layout()
plt.show()

We can observe that there is no clear pattern in the ACF and PACF plots for most of the variables, so we will start with an arbitrary number of lags and try different options from there. That being said, DE and EP do show some correlation at short lags, so those lags are probably going to be the most relevant, and 5-10 lags should already cover these correlations.
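As a rough numerical counterpart to reading the plots, the sketch below computes sample autocorrelations by hand and flags the lags that fall outside an approximate 95% band of ±1.96/√n. This is the simple white-noise band, so it may differ somewhat from the shaded region drawn by `plot_acf`. The AR(1) series and the helper `sample_acf` are assumptions for illustration.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Toy AR(1) series with coefficient 0.6 (illustrative, not the real data)
rng = np.random.default_rng(2)
n = 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + rng.normal()

acf_values = sample_acf(x, max_lag=10)
band = 1.96 / np.sqrt(n)   # approximate 95% band under the white-noise null
significant_lags = [k + 1 for k, r in enumerate(acf_values) if abs(r) > band]

print(f"band = ±{band:.3f}")
print("lags outside the band:", significant_lags)
```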

Let's perform some Granger causality tests to see how many lags we should consider for each variable. As the ACF and PACF suggest that the number of relevant lags is below 10, we will test up to 5 lags for each variable.

In [None]:
snp_lags = snp.copy()
snp_lags.drop(['Index', 'D12', 'E12', 'PriceDiv'], axis = 1, inplace = True)
display(snp_lags)

In [None]:
from statsmodels.stats.diagnostic import acorr_ljungbox

intervals = [   ["1927-01-01", "1932-12-01"],
                ["1933-01-01", "1970-12-01"],
                ["1971-01-01", "1997-12-01"],
                ["1998-01-01", "2005-12-01"],
                ["2006-01-01", "2021-11-01"]]

causality_tests = []

for interval in intervals:
    snp_temp = snp_lags[(snp_lags['yyyymm'] >= interval[0]) & (snp_lags['yyyymm'] <= interval[1])].dropna().reset_index(drop=True)
    snp_temp = snp_temp.drop('yyyymm', axis = 1)
    print('\n\n',interval[0],' , ',interval[1],'\n p = 5 \n')

    # A series cannot be Granger-tested against itself (both regressor blocks would be
    # identical), so dependence on its own lags is checked with a Ljung-Box test instead.
    print("\n\nFor lagged LogReturns (Ljung-Box)\n", '#'*20)
    print(acorr_ljungbox(snp_temp["LogReturns"], lags=5))

    # Tests whether the second column Granger-causes the first (LogReturns)
    for predictor in ["DP", "DE", "EP"]:
        print(f"\n\nFor lagged {predictor}\n", '#'*20)
        result = grangercausalitytests(snp_temp[["LogReturns", predictor]], 5)
        causality_tests.append(result)

The EP ratio appears to be the strongest Granger-cause of LogReturns, in some periods more than in others. Some lags of the DE ratio also pass the test (although only at the 10% significance level). We will therefore keep evaluating which lags to use, but it is probably safe to try more than 5 lags for the DE and EP ratios.

That being said, the results are not too surprising. Dividends are just one component of the return, and many investors place more value on a stock that holds its value and grows in price than on one that pays dividends but grows less, because the stock can always be resold if needed.
As a result, a company with consistent earnings is likely to be valued more highly than one with lower earnings, and its price will probably be higher as well, hence this result. If a company does well one month, more people will probably be at least a little more interested in buying its stock. The same can be said about an index like the S&P500: knowing that its companies have consistent and significant earnings is more informative than knowing whether they paid dividends or not.

In [None]:
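# Correlation between all columns over the full sample (rowvar=False: columns are the variables)
np.corrcoef(snp_lags.drop('yyyymm', axis = 1).to_numpy(), rowvar = False)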

In [None]:
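for interval in intervals:
    snp_temp = snp_lags[(snp_lags['yyyymm'] >= interval[0]) & (snp_lags['yyyymm'] <= interval[1])].dropna()
    snp_temp = snp_temp.drop('yyyymm', axis = 1)
    # rowvar=False treats columns as variables; wrap the result to keep the column labels
    corr_values = np.corrcoef(snp_temp.to_numpy(), rowvar = False)
    display(pd.DataFrame(corr_values, index = snp_temp.columns, columns = snp_temp.columns))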

In [None]:
for interval in intervals:
    snp_temp = snp_lags[(snp_lags['yyyymm'] >= interval[0]) & (snp_lags['yyyymm'] <= interval[1])].dropna()
    snp_temp = snp_temp.drop(['yyyymm', 'LogReturns', 'LogReturnsDiv'], axis = 1)
    corr = snp_temp.corr()
    display(corr.style.background_gradient(cmap='coolwarm', axis=None).format("{:.2f}"))

In [None]:
cols_corr_matrix = columns_to_plot.copy()

for i in range(1, 11):
    snp[f'Return_lag_{i}'] = snp['LogReturns'].shift(i)
    snp[f'ReturnsDiv_lag_{i}'] = snp['LogReturnsDiv'].shift(i)
    snp[f'DP_lag_{i}'] = snp['DP'].shift(i)
    snp[f'DE_lag_{i}'] = snp['DE'].shift(i)
    snp[f'EP_lag_{i}'] = snp['EP'].shift(i)
    
    cols_corr_matrix.append(f'Return_lag_{i}')
    cols_corr_matrix.append(f'ReturnsDiv_lag_{i}')
    cols_corr_matrix.append(f'DP_lag_{i}')
    cols_corr_matrix.append(f'DE_lag_{i}')
    cols_corr_matrix.append(f'EP_lag_{i}')

display(snp.head())

In [None]:
snp = snp.loc[(snp['yyyymm'] >= '1927-01-01') & (snp['yyyymm'] < '2022-01-01')].reset_index(drop=True)

In [None]:
corr = snp[cols_corr_matrix].corr()
corr.style.background_gradient(cmap='coolwarm', axis=None).format("{:.2f}")

We can see that the Log Returns are not strongly correlated with anything: the largest correlations (below 0.10 in absolute value) are with some of their own lags and with some lags of the DP and EP ratios. Note that we cannot use the variables most correlated with the returns, the lag-0 ratios, because they are computed directly from the current price and so would not be available for the period we are trying to forecast.
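Since the Pearson correlations are this weak, and Pearson correlation only measures linear dependence, the distance correlation mentioned in the exercise statement may be a useful complement for ranking lags. The `dcor` package imported at the top provides it; to stay dependency-free, the sketch below implements the (biased) sample estimator with plain NumPy. The function name and the simulated data are assumptions for illustration.

```python
import numpy as np

def distance_correlation(x, y):
    """Biased sample distance correlation between two equally long samples."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centered_distances(z):
        # Pairwise Euclidean distances, then double centering (rows, columns, grand mean)
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    a, b = centered_distances(x), centered_distances(y)
    dcov2 = max((a * b).mean(), 0.0)                    # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())    # product of distance std. deviations
    return 0.0 if denom == 0 else float(np.sqrt(dcov2 / denom))

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y = x ** 2   # depends on x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]
dist_corr = distance_correlation(x, y)
print(f"Pearson:  {pearson:.3f}")
print(f"Distance: {dist_corr:.3f}")
```

In this toy case the Pearson correlation is close to zero while the distance correlation is clearly positive, which is the kind of nonlinear dependence a correlation matrix alone would miss.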

[Back to Index](#INDEX)

# 3. NEURAL NETWORK

[Back to Index](#INDEX)

[Back to Index](#INDEX)

# 4. GAUSSIAN PROCESS

[Back to Index](#INDEX)

[Back to Index](#INDEX)