<a></a>
<div style="border-radius: 10px; border: 1px solid #0F9CF5; background-color: #232323; white-space: nowrap;">
    <p style="margin-top: -10px; margin-bottom: 0px; margin-left: 10px; font-size: 1.15em; padding: 10px; overflow: hidden;">
        <span style="color: orange; font-size: 2em;">&#9432;  </span>
        Click the <span style="color: orange;">Run All</span> <img style="max-height: 1.5em; border: 1px solid orange;" src="../img/RunAll.png" /> button in the toolbar above to run the code in this notebook 
    </p>
</div>

<a id="document-top"></a>
# BQuant Machine Learning Series Part 2

<a href='https://bloombergslides.com/view/mail?iID=WcXFGQVnhTVHh6Kp4gJc'>Link to Episode 2 - ML Series Video - Regression </a>

In [None]:
#Import Libraries
from scipy import stats
import numpy as np
import pandas as pd
import bqviz as bqv
import plotly.graph_objects as go
import plotly.express as px
from sklearn.linear_model import LinearRegression
import math
from src.shared import * ## Shared library for retrieving data via BQL for Machine Learning Series

<h3>First Example - Regression of returns of IBM against the S&P 500 Index </h3>

In [None]:
#fetch daily return data for IBM and SPX Index
df_ibm_spx = fetch_daily_return_data(['IBM US Equity','SPX Index'])
df_ibm_spx.tail()

Plot daily return of IBM against the daily return of S&P 500

In [None]:
bqv.ScatterPlot(df=df_ibm_spx,x='SPX Index',y='IBM US Equity', tick_format='.2%').set_style().show()

<h4>Scipy library</h4>
<h5>Ordinary Least Squares Regression, using stats.linregress</h5>

SciPy Linear Regression:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

In [None]:
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x=df_ibm_spx['SPX Index'],y=df_ibm_spx['IBM US Equity'])

OLS regression coefficients

In [None]:
slope, intercept, r_value, p_value, std_err

R Squared

In [None]:
r_value*r_value
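
For a one-variable OLS regression, the squared correlation coefficient should equal the coefficient of determination, 1 - SS_res / SS_tot. The sketch below checks this on synthetic data (made-up numbers, not market data); the variable names are illustrative and it uses NumPy only, not `stats.linregress`.

```python
import numpy as np

# Synthetic, illustrative returns (not market data)
rng = np.random.default_rng(0)
x_demo = rng.normal(0.0, 0.01, 250)
y_demo = 0.9 * x_demo + rng.normal(0.0, 0.008, 250)

# Squared Pearson correlation
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]
r_squared_from_r = r_demo ** 2

# Coefficient of determination from a degree-1 least-squares fit
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
ss_res = np.sum((y_demo - (slope_demo * x_demo + intercept_demo)) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared_from_fit = 1 - ss_res / ss_tot

print(r_squared_from_r, r_squared_from_fit)
```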

<h4>Compare to HRA &lt;GO&gt; on the terminal</h4>  
For example: <font color='aqua'>IBM US Equity SPX Index HRA &lt;GO&gt;</font>
<img src='img/hra_ibm_spx.jpg'>

<h4>Use Plotly visualization library to show daily returns scatter alongside regression line</h4>

In [None]:
#import library
import plotly.graph_objects as go
import plotly.express as px

#initialize figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ibm_spx['SPX Index'], y=df_ibm_spx['IBM US Equity'],
                    mode='markers', name='Daily return',))

#construct regression line from its coordinates from the scipy regression coefficients
line_x = [df_ibm_spx['SPX Index'].min(),  df_ibm_spx['SPX Index'].max()]
line_y = [line_x[0]*slope+intercept,  line_x[1]*slope+intercept]

#add regression line to this plot
fig.add_trace(go.Scatter(x=line_x,y=line_y,name='OLS regression line', mode='lines'))

#show the figure
fig.update_layout(template='plotly_dark', xaxis_title="S&P", yaxis_title="IBM", yaxis_tickformat = '.2%',xaxis_tickformat = '.2%')
fig.show()

<h5>Use Plotly Library to plot our data and do the OLS regression all at once</h5>
Example: <a href='https://plotly.com/python/ml-regression/'>Plotly - ML Regression in Python</a>

In [None]:
fig = px.scatter(
    df_ibm_spx, x='SPX Index', y='IBM US Equity',
    trendline='ols', trendline_color_override='red'
)
fig.update_layout(template='plotly_dark', xaxis_title="S&P Daily Return", yaxis_title="IBM Daily Return",yaxis_tickformat = '.2%',xaxis_tickformat = '.2%')
fig.show()

<h3>Example 2: Bitcoin return vs S&P 500</h3>

In [None]:
df_weekly_3yr=fetch_weekly_return_data(ticker_universe=['BTC Index','SPX Index'], lookback='-3y')
df_weekly_3yr.head()

In [None]:
fig = px.scatter(
    df_weekly_3yr, x='SPX Index', y='BTC Index', opacity=0.65, 
    trendline='ols', trendline_color_override='red', title='3 year weekly BTC vs SPX' 
)
fig.update_layout(template='plotly_dark', xaxis_title="S&P 500", yaxis_title="BTC", yaxis_tickformat = '.1%',xaxis_tickformat = '.1%')
fig.show()

In [None]:
df_weekly_1yr=fetch_weekly_return_data(ticker_universe=['BTC Index','SPX Index'], lookback='-13m')
fig = px.scatter(
    df_weekly_1yr, x='SPX Index', y='BTC Index', opacity=0.65, 
    trendline='ols', trendline_color_override='red', title='12 month weekly BTC vs SPX' 
)
fig.update_layout(template='plotly_dark', xaxis_title="S&P 500", yaxis_title="BTC", yaxis_tickformat = '.1%',xaxis_tickformat = '.1%')
fig.show()

<h4>Overfitting Example - Polynomial Linear Regression applied to BTC versus S&P 500 returns</h4>

Scikit-learn Linear Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Scikit-learn Polynomial Features:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression


#Fetch weekly return data of BTC and SPX  (This is our training data set)
df_weekly_spx_btc = fetch_weekly_return_data(ticker_universe=['BTC Index','SPX Index'], lookback='-14m')
df_X = df_weekly_spx_btc[['SPX Index']]
df_y = df_weekly_spx_btc[['BTC Index']]

#Fit training data to a polynomial regression model using sklearn.preprocessing and sklearn.pipeline
degree=9
polyreg=make_pipeline(PolynomialFeatures(degree),LinearRegression())
polyreg.fit(df_X,df_y)

#Using the fitted model, get the predicted values over the range of X
X_seq = np.linspace(df_X['SPX Index'].min(), df_X['SPX Index'].max(), 300).reshape(-1,1)
pred_y = polyreg.predict(X_seq)
df_scat = pd.DataFrame(data={'x': X_seq.ravel(), 'y': pred_y.ravel()})

#Plot the original observations (BTC weekly returns) alongside the predicted values from the (massively overfitted) model
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_scat['x'], y=df_scat['y'],   #fitted model
                    mode='lines',
                    name='Predicted BTC return'))

fig.add_trace(go.Scatter(x=df_X['SPX Index'].values,y=df_y['BTC Index'].values, #observed values
                         mode='markers',
                         name='Observed BTC weekly return'))

fig.update_layout(template='plotly_dark', xaxis_title="S&P 500", yaxis_title="BTC", yaxis_tickformat = '%',xaxis_tickformat = '%')
fig.show()

In [None]:
polyreg.score(X=df_X['SPX Index'].values.reshape(-1,1), y=df_y['BTC Index'])
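
A high R squared on the training data does not by itself show that the model generalizes. The sketch below is one way to see this, under stated assumptions: the data are synthetic (a straight line plus noise), and `np.polyfit` stands in for the scikit-learn pipeline used above. It compares a degree-1 and a degree-9 fit on the training sample and on a fresh sample. `r_squared`, `make_sample` and `scores` are illustrative names.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(42)

def make_sample(n_points):
    """Synthetic data: a linear relationship plus noise."""
    x = rng.uniform(-1, 1, n_points)
    y = 2.0 * x + rng.normal(0, 0.5, n_points)
    return x, y

x_train, y_train = make_sample(30)
x_test, y_test = make_sample(30)

# scores[degree] = (R squared on training data, R squared on unseen data)
scores = {}
for degree_demo in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree_demo)
    scores[degree_demo] = (
        r_squared(y_train, np.polyval(coefs, x_train)),
        r_squared(y_test, np.polyval(coefs, x_test)),
    )
    print(degree_demo, scores[degree_demo])
```

The degree-9 fit cannot score lower than the straight line on the training data, because the line is a special case of the polynomial. On unseen data it will typically do worse.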

<h3>Next Example: Bond Spread vs Duration</h3>

In [None]:
df_bond_spreads = retrieve_bond_spread_data()
df_bond_spreads.head()

In [None]:
bqv.ScatterPlot(df=df_bond_spreads,x='duration',y='spread').show()

<h5>Non-linear regression using a Loess Regression ("moving regression")</h5>
Weighted regressions over localized subsets of data
<br><b>Advantage</b>: fits this data better than a line does
<br><b>Disadvantage</b>: more complicated to represent

In [None]:
fig = px.scatter(x=df_bond_spreads['duration'],y=df_bond_spreads['spread'],trendline='lowess', trendline_color_override='cyan')
fig.update_layout(template='plotly_dark', xaxis_title="Duration", yaxis_title="Spread", yaxis_tickformat = '',xaxis_tickformat = '')
fig
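
To make "weighted regressions over localized subsets of data" concrete, here is a minimal sketch of the idea: for each point, fit a weighted straight line to its nearest neighbours and keep the fitted value at that point. This is not the implementation behind Plotly's `trendline='lowess'`; it omits the robustness iterations of full LOWESS, and `lowess_sketch`, `frac` and the synthetic curve are assumptions for illustration.

```python
import numpy as np

def lowess_sketch(x, y, frac=0.3):
    """Local linear regression with tricube weights (no robustness passes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))          # points used in each local fit
    design = np.column_stack([np.ones(n), x])   # columns: intercept, x
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        bandwidth = max(np.sort(dist)[k - 1], 1e-12)  # distance to k-th nearest point
        w = (1 - np.clip(dist / bandwidth, 0, 1) ** 3) ** 3  # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares: scale rows by sqrt(weight), then ordinary lstsq
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Synthetic concave curve plus noise (illustrative, not bond data)
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 1, 50)
y_demo = 2 * x_demo - x_demo ** 2 + rng.normal(0, 0.02, 50)

# Compare squared errors of the local fit and of a single straight line
sse_lowess = np.sum((y_demo - lowess_sketch(x_demo, y_demo)) ** 2)
line_coefs = np.polyfit(x_demo, y_demo, 1)
sse_line = np.sum((y_demo - np.polyval(line_coefs, x_demo)) ** 2)
print(sse_lowess, sse_line)
```

The `frac` parameter controls the trade-off: a larger value gives a smoother curve that behaves more like a single line, a smaller value follows the data more closely.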

<h5>Ordinary Least Squares Regression</h5>

In [None]:
fig = px.scatter(x=df_bond_spreads['duration'],y=df_bond_spreads['spread'],trendline='ols', trendline_color_override='cyan')
fig.update_layout(template='plotly_dark', xaxis_title="Duration", yaxis_title="Spread", yaxis_tickformat = '',xaxis_tickformat = '')
fig


Fit the OLS model, and see the coefficients Slope and Intercept

In [None]:
#Fit the OLS model using sklearn.linear_model.LinearRegression
X = df_bond_spreads.duration.values.reshape(-1, 1)
y = df_bond_spreads.spread.values
model = LinearRegression()
model.fit(X,y)

#Model Coefficients
print('Slope: {:.4f}'.format(model.coef_[0]))
print('Intercept: {:.4f}'.format(model.intercept_))

Use the model to predict the spread of a bond with hypothetical duration of 7

In [None]:
model.predict([[7]])

<h5>Residuals: the difference between observed data and the data predicted by our model</h5>
Note any patterns in the residuals. For a good regression fit, the residuals should be randomly distributed, like random noise. If there is a predictable bias in the residuals (for example, if all the points to the far right sit consistently below the predicted levels and so all have negative residuals), the model is biased in that region.

In [None]:
#Compute residuals as observed bond spreads minus the bond spreads predicted by the model
residuals = y - model.predict(X)

#Construct regression line from the model
x_vals_line = [df_bond_spreads.duration.min(),df_bond_spreads.duration.max()]
y_vals_line = model.predict(np.array(x_vals_line).reshape(-1,1)).tolist()

#Plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_bond_spreads['duration'].tolist(),y=df_bond_spreads['spread'].tolist(),mode='markers',name='Bond Spread'))
fig.add_trace(go.Scatter(x=x_vals_line, y=y_vals_line, name='Regression line', mode='lines'))
fig.add_trace(go.Bar(x=df_bond_spreads['duration'].tolist(),y=residuals.tolist(),width=0.09, name='Residuals'))
fig.update_layout(template='plotly_dark', xaxis_title="Duration", yaxis_title="Spread", yaxis_tickformat = '',xaxis_tickformat = '')
fig.show()
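
One simple way to look for a pattern is to average the residuals within sections of the x range. The sketch below fits a straight line to a synthetic concave curve (made-up data, not the bond data above). The overall mean residual of an OLS fit with an intercept is zero by construction, so the bias only shows up in the per-section averages. All names are illustrative.

```python
import numpy as np

# Synthetic concave relationship plus noise (illustrative only)
rng = np.random.default_rng(7)
x_demo = np.linspace(0, 10, 90)
y_demo = 20 * x_demo - x_demo ** 2 + rng.normal(0, 2, x_demo.size)

# Fit a straight line and compute residuals (observed minus predicted)
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
resid_demo = y_demo - (slope_demo * x_demo + intercept_demo)

# Overall mean is ~0 for any OLS fit with an intercept
overall_mean = resid_demo.mean()

# x_demo is sorted, so splitting the residuals in three gives
# the left, middle and right thirds of the x range
third_means = [chunk.mean() for chunk in np.array_split(resid_demo, 3)]
print(overall_mean, third_means)
```

Here the line overshoots at both ends and undershoots in the middle, which is the kind of systematic pattern that suggests a straight line is the wrong shape for the data.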

<h5>In addition to Bond Spread and Duration, also retrieve the issuer's EBITDA Margin and Debt to Common Equity Ratio</h5>

In [None]:
df_extra_bond = retrieve_additional_bond_spread_data()
df_extra_bond.head()

Some of these columns have null data

In [None]:
df_extra_bond.count()

<h5><b>Missing data</b></h5>

- We can remove any bonds that have missing data
- We can fill in missing data with an educated guess

If we remove any bond that has any null data, we have fewer data points to train the model

In [None]:
df_extra_bond.dropna().count()
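
The two options can be compared side by side on a toy table. This is a sketch only: the column names and values are hypothetical, and filling with the column median is just one possible "educated guess".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
df_demo = pd.DataFrame({
    'duration': [2.0, 5.0, 7.0, 10.0],
    'ebitda_margin': [12.0, np.nan, 18.0, 15.0],
    'debt_to_equity': [80.0, 120.0, np.nan, 100.0],
})

# Option 1: remove every row that has any missing value
df_dropped = df_demo.dropna()

# Option 2: fill each missing value with the median of its column
df_filled = df_demo.fillna(df_demo.median(numeric_only=True))

print(len(df_demo), len(df_dropped), len(df_filled))
```

Dropping keeps only fully observed rows, while filling keeps every row at the cost of introducing values that were never observed.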

Separate the data into two partitions -- training data and test data

In [None]:
i_max_of_train_data = math.ceil(len(df_extra_bond.dropna())/2)
df_train_data = df_extra_bond.dropna().iloc[0:i_max_of_train_data]
df_train_data.count()

In [None]:
df_test_data = df_extra_bond.dropna().iloc[i_max_of_train_data:]
df_test_data.dropna().count()
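
The split above takes the first half of the rows as training data and the rest as test data. If the rows happen to be sorted (for example by issuer or maturity), a shuffled split may give more representative partitions. The helper below is a hypothetical sketch, not part of the shared library; `shuffled_split`, `train_fraction` and `seed` are illustrative names.

```python
import numpy as np

def shuffled_split(n_rows, train_fraction=0.5, seed=0):
    """Return (train_idx, test_idx): disjoint arrays of shuffled row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(np.ceil(n_rows * train_fraction))
    return order[:n_train], order[n_train:]

train_idx, test_idx = shuffled_split(11)
print(len(train_idx), len(test_idx))
```

The returned positions could then be passed to `.iloc`, for example `df_extra_bond.dropna().iloc[train_idx]`.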

<h4>Multivariate Regression on Z_Spread against three features: Duration, Amount Outstanding, and Debt to Common Equity </h4>

In [None]:
df_train_data.head()

In [None]:
#Multivariate Regression on Z_Spread against duration, Amount Outstanding, and Debt to Common Equity Ratio
y = df_train_data['Z_spread']
x = df_train_data[['duration','AmtOut','Debt_to_Com_Eqy']]

#Define multiple linear regression model (using sklearn library)
linear_regression = LinearRegression()

#Fit the multivariate linear regression model
linear_regression.fit(x,y)

Estimated coefficients of our model

In [None]:
linear_regression.coef_

R squared of the fit on training data

In [None]:
linear_regression.score(x,y)

Use our model to predict the z-spread for a hypothetical bond with duration 10, Amount Outstanding of 500M, and Debt to Equity of 1000

In [None]:
linear_regression.predict([[10,500,1000]])
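
Under the hood, a multiple linear regression solves a least-squares problem: find the intercept and one coefficient per feature that minimize the squared prediction errors. The sketch below shows this with `np.linalg.lstsq` on synthetic data with known coefficients. It is an illustration of the technique, not of scikit-learn's internals, and every name and number in it is made up.

```python
import numpy as np

# Synthetic data with three made-up features and known coefficients
rng = np.random.default_rng(3)
n_obs = 200
features = rng.normal(size=(n_obs, 3))
true_coefs = np.array([1.5, -2.0, 0.5])
target = 4.0 + features @ true_coefs + rng.normal(0, 0.1, n_obs)

# Add a column of ones so the first fitted parameter is the intercept
design = np.column_stack([np.ones(n_obs), features])
params, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept_hat, coefs_hat = params[0], params[1:]

# A prediction is the intercept plus the dot product of coefficients and features
new_point = np.array([1.0, 0.0, -1.0])
prediction = intercept_hat + coefs_hat @ new_point
print(intercept_hat, coefs_hat, prediction)
```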

<h5>Compute multivariate regression on the three features (Duration, Amount Outstanding, and Debt to Common Equity), to get a predicted Z spread for each bond in test data set</h5>
Note that we fitted the model on the <b>training data set</b> and we then apply the model to the separate <b>test data set</b>


In [None]:
df_sub_test = df_test_data[['duration','AmtOut','Debt_to_Com_Eqy','Z_spread']].copy()

#select the feature columns in the same order used to fit the model
l_features=['duration','AmtOut','Debt_to_Com_Eqy']

#predict values on test data set
predicted_vals = linear_regression.predict(df_sub_test[l_features])
df_sub_test = df_sub_test.assign(predicted_Z_spread=predicted_vals)

df_sub_test.tail()
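
Once predictions exist for the test data, the fit can be scored out of sample in the same way it was scored on the training data. The helper below is a hypothetical sketch (`out_of_sample_metrics` is not part of any library used here), shown on toy numbers rather than the bond data.

```python
import numpy as np

def out_of_sample_metrics(observed, predicted):
    """Return (R squared, RMSE) of predictions against observed values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = observed - predicted
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1 - np.sum(errors ** 2) / ss_tot
    return r2, rmse

# Toy numbers for illustration only
r2_demo, rmse_demo = out_of_sample_metrics([100, 150, 200, 250], [110, 140, 210, 240])
print(r2_demo, rmse_demo)
```

It could be applied to the test predictions as `out_of_sample_metrics(df_sub_test['Z_spread'], df_sub_test['predicted_Z_spread'])`. Unlike the training R squared, the out-of-sample value can be negative when the model predicts worse than the mean of the test data.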

Plot the Predicted Z Spread and Observed Z Spread versus Duration


In [None]:
fig = go.Figure()

#duration_plot
df = df_sub_test.sort_values('duration')
fig.add_trace(go.Scatter(x=df['duration'],y=df['Z_spread'],mode='markers',name='Observed Z Spread'))
fig.add_trace(go.Scatter(x=df['duration'],y=df['predicted_Z_spread'],mode='markers',name='Predicted Z Spread'))
fig.update_layout(template='plotly_dark', xaxis_title="Duration", yaxis_title="Spread", yaxis_tickformat = '',xaxis_tickformat = '')
fig.show()

In [None]:
#Reshape data to produce a faceted plot
df_melted = df_sub_test.melt(id_vars=['Z_spread','predicted_Z_spread'])

# Show faceted figure, with a plot for each of our features
fig = px.scatter(data_frame=df_melted,x='value',y=['Z_spread','predicted_Z_spread'], facet_col='variable', template='plotly_dark')
fig.update_xaxes(matches=None)
fig.show()

<h3>Additional Resources</h3>

<h4>Python Libraries</h4>

SciPy Linear Regression:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

Scikit-learn Linear Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Scikit-learn ML Packages:
https://scikit-learn.org/stable/supervised_learning.html

Ridge Regression in Scikit-learn (Regularization):
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

ML Regression in Plotly:
https://plotly.com/python/ml-regression/

<h4>Helpful Blog Posts</h4>

A Complete Machine Learning Project Walk-Through in Python:
https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420

Supervised Learning - Basics of Linear Regression: 
https://towardsdatascience.com/supervised-learning-basics-of-linear-regression-1cbab48d0eba

<h4>Regression from scratch for IBM versus SPX</h4>

Website references:<br>
https://towardsdatascience.com/linear-regression-from-scratch-cd0dee067f72<br>
https://machinelearningmastery.com/implement-simple-linear-regression-scratch-python/

In [None]:
##Regression from Scratch for IBM (y) versus SPX (X)
df_ibm_spx = fetch_daily_return_data(['IBM US Equity','SPX Index'])
df_ibm_spx.tail()

In [None]:
X = df_ibm_spx['SPX Index'].values
y = df_ibm_spx['IBM US Equity'].values

#Calculate mean of independent variable (SPX) returns and dependent variable (IBM) returns
x_mean = np.mean(X)
y_mean = np.mean(y)

#total N
n = len(X)

#calculate the slope as the sum of cross-deviations of X and y divided by the sum of squared deviations of X
numerator = 0
denominator = 0

for i in range(n):
    numerator += (X[i] - x_mean) * (y[i] - y_mean)
    denominator += (X[i] - x_mean) ** 2
    
slope = numerator / denominator
slope

In [None]:
intercept = y_mean - (slope * x_mean)
intercept

Calculate the R Squared

In [None]:
sumofsquares = 0 
sumofresiduals = 0

#accumulate the total sum of squares and the sum of squared residuals
for i in range(n):
    predicted_y = intercept + slope * X[i]
    sumofsquares += (y[i] - y_mean) ** 2
    sumofresiduals += (y[i] - predicted_y) ** 2
    
r2_score = 1 - (sumofresiduals/sumofsquares)
r2_score
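
The loops above can also be written as vectorized NumPy expressions. The sketch below uses synthetic returns (made-up numbers, not the IBM and SPX data) and checks the closed-form slope and intercept against `np.polyfit`. Variable names are illustrative.

```python
import numpy as np

# Synthetic, illustrative returns (not market data)
rng = np.random.default_rng(11)
x_demo = rng.normal(0, 0.01, 500)
y_demo = 0.0002 + 1.1 * x_demo + rng.normal(0, 0.005, 500)

# Deviations from the means
x_dev = x_demo - x_demo.mean()
y_dev = y_demo - y_demo.mean()

# Same formulas as the loops: slope = sum(dx * dy) / sum(dx^2)
slope_vec = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept_vec = y_demo.mean() - slope_vec * x_demo.mean()

# Cross-check against NumPy's degree-1 least-squares fit
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, 1)
print(slope_vec, slope_ref)
print(intercept_vec, intercept_ref)
```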