# Linear Regression

Linear regression of ESG features against three performance metrics for 500 US companies: the yearly average of the adjusted closing stock price, the yearly average of log returns, and an estimate of alpha.

In [95]:
import numpy as np
import pandas as pd
import quandl
import matplotlib.pyplot as plt
import datetime
import requests
import json
import sklearn
pd.set_option('display.max_columns', None)

To generate the features for this regression we combined data from Quandl and Refinitiv. More specifically, we took the 'EOD US Stock Exchange' dataset from Quandl and the 'ESG' dataset from Refinitiv.

In [97]:
aggregate_data = pd.read_csv('data/aggr_data_scores_500_alpha.csv')
aggregate_data.head(5)
#there are many NaNs below because for some years data that was available in Quandl was missing from Refinitiv

Unnamed: 0.1,Unnamed: 0,Year,Adj_Close,Adj_High,Adj_Low,Adj_Open,Adj_Volume,CSR Strategy Score,Close,Community Score,Date,Dividend,ESG Combined Score,ESG Controversies Score,ESG Period Last Update Date,ESG Score,Emissions Score,Environment Pillar Score,Governance Pillar Score,High,Human Rights Score,Innovation Score,Instrument,Low,Management Score,Open,Period End Date,Product Responsibility Score,Resource Use Score,Shareholders Score,Social Pillar Score,Split,Volume,Workforce Score,log_ret,Alpha
0,0,2010,30.845093,31.163094,30.525322,30.852593,60912.962963,,30.845093,,,0.0,,,,,,,,31.163094,,,,30.525322,,30.852593,,,,,,1.0,60912.962963,,-8.9e-05,-1.055688
1,9,2010,38.814604,39.057985,38.256056,38.665767,5574.074074,,40.578704,,,0.0,,,,,,,,40.833146,,,,39.99477,,40.423102,,,,,,1.0,5574.074074,,0.000628,-0.605885
2,18,2010,4.269206,4.313814,4.220202,4.265311,409411.094444,,12.583148,,,0.0,,,,,,,,12.714626,,,,12.438711,,12.571667,,,,,,1.0,160553.37037,,0.001218,-0.236265
3,37,2010,4.935766,5.030044,4.830713,4.927749,396727.777778,,6.612037,,,0.0,,,,,,,,6.738333,,,,6.471306,,6.601296,,,,,,1.0,396727.777778,,0.003435,1.154216
4,46,2010,11.315,11.478054,11.107165,11.292222,94481.481481,,11.315,,,0.0,,,,,,,,11.478054,,,,11.107165,,11.292222,,,,,,1.0,94481.481481,,0.003414,1.141391


We need a list of all features.

In [98]:
col = np.array(aggregate_data.columns)
print(col)

['Unnamed: 0' 'Year' 'Adj_Close' 'Adj_High' 'Adj_Low' 'Adj_Open'
 'Adj_Volume' 'CSR Strategy Score' 'Close' 'Community Score' 'Date'
 'Dividend' 'ESG Combined Score' 'ESG Controversies Score'
 'ESG Period Last Update Date' 'ESG Score' 'Emissions Score'
 'Environment Pillar Score' 'Governance Pillar Score' 'High'
 'Human Rights Score' 'Innovation Score' 'Instrument' 'Low'
 'Management Score' 'Open' 'Period End Date'
 'Product Responsibility Score' 'Resource Use Score' 'Shareholders Score'
 'Social Pillar Score' 'Split' 'Volume' 'Workforce Score' 'log_ret'
 'Alpha']


We need to drop rows with too many NaN values and fill the NaNs that remain in the rows we keep.

In [99]:
#we have to drop nans
aggregate_data = aggregate_data.dropna(axis='rows', thresh=33) #keep only rows that have at least 33 non-NaN values
aggregate_data = aggregate_data.fillna(0)
aggregate_data.shape

(533, 36)
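
The `thresh` argument of `dropna` counts non-missing values, not missing ones: a row is kept only if it has at least `thresh` non-NaN entries. A minimal sketch on a toy frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with 4 columns; values are illustrative only
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [1.0, np.nan, np.nan],
    'c': [1.0, 2.0, np.nan],
    'd': [1.0, 2.0, 3.0],
})

# Keep rows that have at least 3 non-NaN values
kept = toy.dropna(axis='rows', thresh=3)

# Fill whatever NaNs survive with 0, as in the cell above
filled = kept.fillna(0)
print(filled)
```

Note that filling with 0 makes a missing score indistinguishable from a genuine score of zero, which may influence the fitted coefficients.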

In [100]:
#create the X vector by dropping unwanted features
X = aggregate_data.drop(axis = 1,labels = ['Year','Instrument','Period End Date','Open','High','Low',
                                 'Close','Volume','Dividend','Split','Adj_Open','Adj_High',
                                 'Adj_Low','Adj_Volume','Date',
                                 'Adj_Close',
                                 'log_ret',
                                 'Alpha',
                                 'ESG Period Last Update Date', #irrelevant and causes trouble
                                 'Unnamed: 0', #introduced by the alpha calculation process - means nothing
                                ])
#replace Trues with 1 and Falses with 0
#this works only on strings. Fields that are defined as booleans are fine.
X = X.replace('True',1)
X = X.replace('False',0)

y = aggregate_data['Adj_Close']
y2 = aggregate_data['log_ret']
y3 = aggregate_data['Alpha']

X.shape

(533, 16)

In [101]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)

model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

By sorting the regression coefficients we can see which ESG features are important for each performance metric. More specifically, we can see which features affect the metric negatively, which affect it positively, and which leave it largely unchanged.

In [102]:
#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Social Pillar Score             -8.033173
Emissions Score                 -6.950419
Resource Use Score              -6.340002
Innovation Score                -6.121963
ESG Score                       -1.230015
ESG Combined Score              -0.504230
CSR Strategy Score               0.062091
Governance Pillar Score          0.132524
Shareholders Score               0.133699
Management Score                 0.148773
ESG Controversies Score          0.484048
Human Rights Score               1.244849
Community Score                  1.678616
Product Responsibility Score     2.034902
Workforce Score                  3.760503
Environment Pillar Score        20.232934
dtype: float64
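
As a minimal sketch of the same idea on synthetic data (the feature names, true coefficients and noise level are invented for illustration and are not taken from the ESG dataset): fit a no-intercept least-squares model, then sort the coefficients to read off negative, near-zero and positive effects. `np.linalg.lstsq` is used here only to keep the example light; it solves the same least-squares problem as `LinearRegression(fit_intercept=False)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features on a comparable scale (hypothetical names)
names = ['feature_neg', 'feature_zero', 'feature_pos']
X_toy = rng.normal(size=(500, 3))

# Known ground truth: one negative, one irrelevant, one positive effect
true_coef = np.array([-2.0, 0.0, 3.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=500)

# Least-squares fit without an intercept
coef_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

params_toy = pd.Series(coef_toy, index=names)
print(params_toy.sort_values())
```

Comparing magnitudes this way is only meaningful when the features share a scale. Strongly correlated features (a pillar score and the category scores that feed into it are likely candidates) can also produce large coefficients of opposite sign, so the ranking should be read with some caution.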


### Regularisation

More sophisticated regression techniques add a regularisation term to the loss. Here we use _Lasso_ and _Ridge_ regression.
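
For reference, these are the objectives that scikit-learn documents for the two estimators, with $w$ the coefficient vector, $n$ the number of samples and $\alpha$ the regularisation strength:

```latex
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
```

The L1 penalty of Lasso can set coefficients exactly to zero, while the L2 penalty of Ridge only shrinks them. With a very small $\alpha$ the Ridge penalty is tiny relative to the squared error, so the Ridge coefficients would be expected to stay close to the unregularised ones.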

In [103]:
from sklearn.linear_model import Lasso
model2 = Lasso(fit_intercept=False,alpha=0.0001)

model2.fit(X,y)



Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [104]:
params = pd.Series(model2.coef_, index=X.columns)
print(params.sort_values())

Social Pillar Score            -0.651844
ESG Combined Score             -0.519297
Emissions Score                -0.235667
Resource Use Score             -0.192200
Community Score                -0.168183
Management Score               -0.128128
Governance Pillar Score        -0.115641
CSR Strategy Score             -0.008223
Innovation Score                0.022023
Shareholders Score              0.032169
Workforce Score                 0.073621
Human Rights Score              0.204219
Product Responsibility Score    0.423359
Environment Pillar Score        0.448773
ESG Controversies Score         0.490710
ESG Score                       1.057260
dtype: float64


In [105]:
from sklearn.linear_model import Ridge

model3 = Ridge(fit_intercept=False,alpha=0.001)
model3.fit(X,y)

Ridge(alpha=0.001, copy_X=True, fit_intercept=False, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [106]:
params = pd.Series(model3.coef_, index=X.columns)
print(params.sort_values())

Social Pillar Score             -8.022560
Emissions Score                 -6.947295
Resource Use Score              -6.337143
Innovation Score                -6.119106
ESG Score                       -1.230163
ESG Combined Score              -0.504245
CSR Strategy Score               0.062094
Governance Pillar Score          0.132544
Shareholders Score               0.133705
Management Score                 0.148791
ESG Controversies Score          0.484053
Human Rights Score               1.243509
Community Score                  1.676236
Product Responsibility Score     2.032822
Workforce Score                  3.755748
Environment Pillar Score        20.224146
dtype: float64


We started our analysis with the yearly average closing stock price of each company. However, the raw stock price is not a good measure of profitability. Below, we repeat the above process for log returns and for a metric representative of alpha.
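
As a reminder of what the log-return target measures, here is a minimal sketch that computes daily log returns from a price series and averages them. The prices are invented, and the exact procedure used to build the `log_ret` column in the CSV is not shown in this notebook, so treat this as an assumption about how such a column is typically derived.

```python
import numpy as np

# Hypothetical adjusted closing prices for consecutive trading days
prices = np.array([100.0, 102.0, 101.0, 105.0])

# Daily log return: log(P_t / P_{t-1})
log_ret = np.diff(np.log(prices))

# Average over the period (a full year in the notebook's data)
mean_log_ret = log_ret.mean()
print(mean_log_ret)

# Log returns are additive: their sum equals log(P_end / P_start)
total = np.log(prices[-1] / prices[0])
```

Unlike the raw price, this quantity does not depend on the price level of the share, which makes it easier to compare across companies.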

## Log Returns (Yearly Average)

In [107]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)

model.fit(X,y2)

#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Workforce Score                -0.000776
Environment Pillar Score       -0.000476
Community Score                -0.000386
Product Responsibility Score   -0.000336
Human Rights Score             -0.000211
ESG Score                      -0.000128
ESG Controversies Score        -0.000003
CSR Strategy Score              0.000008
Shareholders Score              0.000009
Governance Pillar Score         0.000012
Management Score                0.000014
ESG Combined Score              0.000021
Resource Use Score              0.000155
Innovation Score                0.000165
Emissions Score                 0.000181
Social Pillar Score             0.001761
dtype: float64


In [108]:
from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=False,alpha = 0.00001)

model.fit(X,y2)

#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Resource Use Score             -1.053608e-05
ESG Score                      -7.156741e-06
Environment Pillar Score       -3.430390e-06
ESG Controversies Score        -3.102196e-06
Management Score               -2.182483e-06
Workforce Score                -1.975079e-06
Emissions Score                -8.473259e-07
Innovation Score               -7.290217e-07
Governance Pillar Score        -0.000000e+00
Social Pillar Score             0.000000e+00
Community Score                 8.691892e-07
Product Responsibility Score    2.169457e-06
Shareholders Score              3.638889e-06
CSR Strategy Score              3.741760e-06
Human Rights Score              7.379974e-06
ESG Combined Score              2.269089e-05
dtype: float64




In [109]:
from sklearn.linear_model import Ridge

model = Ridge(fit_intercept=False,alpha = 0.01)

model.fit(X,y2)

#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Workforce Score                -0.000767
Environment Pillar Score       -0.000473
Community Score                -0.000382
Product Responsibility Score   -0.000332
Human Rights Score             -0.000208
ESG Score                      -0.000126
ESG Controversies Score        -0.000003
CSR Strategy Score              0.000007
Shareholders Score              0.000009
Governance Pillar Score         0.000012
Management Score                0.000013
ESG Combined Score              0.000021
Resource Use Score              0.000154
Innovation Score                0.000164
Emissions Score                 0.000180
Social Pillar Score             0.001739
dtype: float64


## Alpha

In [110]:
model = LinearRegression(fit_intercept=False)
model.fit(X,y3)

#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Social Pillar Score            -16.359298
Environment Pillar Score        -0.806599
Management Score                -0.242652
Governance Pillar Score         -0.163243
ESG Combined Score              -0.127885
Shareholders Score              -0.064846
ESG Controversies Score         -0.016693
CSR Strategy Score               0.018977
Innovation Score                 0.072037
Emissions Score                  0.083856
Resource Use Score               0.132169
ESG Score                        1.718025
Human Rights Score               1.940346
Product Responsibility Score     3.112921
Community Score                  3.566763
Workforce Score                  7.154550
dtype: float64


In [111]:
model = Lasso(fit_intercept=False,alpha = 0.00001)
model.fit(X,y3)
#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

ESG Combined Score             -0.146579
Human Rights Score             -0.061279
Environment Pillar Score       -0.047743
Governance Pillar Score        -0.012156
ESG Controversies Score        -0.010728
Management Score               -0.008880
Product Responsibility Score    0.003777
ESG Score                       0.005402
Community Score                 0.011316
Innovation Score                0.015754
Shareholders Score              0.020810
Social Pillar Score             0.023600
Emissions Score                 0.024522
Workforce Score                 0.049351
Resource Use Score              0.074193
CSR Strategy Score              0.074885
dtype: float64




In [113]:
model = Ridge(fit_intercept=False,alpha = 0.01)
model.fit(X,y3)
#get the coefficients
params = pd.Series(model.coef_, index=X.columns)
print(params.sort_values())

Social Pillar Score            -16.164147
Environment Pillar Score        -0.809196
Management Score                -0.240156
Governance Pillar Score         -0.161391
ESG Combined Score              -0.128113
Shareholders Score              -0.063931
ESG Controversies Score         -0.016620
CSR Strategy Score               0.019571
Innovation Score                 0.074980
Emissions Score                  0.087097
Resource Use Score               0.135095
ESG Score                        1.698978
Human Rights Score               1.916442
Product Responsibility Score     3.075793
Community Score                  3.524302
Workforce Score                  7.069699
dtype: float64


### Model Evaluation

In [135]:
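from sklearn.metrics import mean_squared_error #import the submodule explicitly; 'import sklearn' alone does not guarantee it is loaded
y_pred = model.predict(X)
score = mean_squared_error(y3, y_pred)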

In [137]:
print("Root Mean Squared Error = %.3f" % (np.sqrt(score)))

Root Mean Squared Error = 18.537


In [138]:
#just to make sure
np.sqrt(np.mean((y3-y_pred)**2))

18.53739106596391

On its own, this metric does not tell us much about how successful our regression was. Read in units of alpha, however, an error of about 18.5 does seem quite large.
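
One way to put an RMSE into context is to compare it with a naive baseline, for example always predicting the mean of the target. The sketch below uses synthetic numbers (not the notebook's data) and a hypothetical helper `rmse`; a model is only informative if its RMSE is clearly below the baseline.

```python
import numpy as np

def rmse(y_true, y_hat):
    '''Root mean squared error between two arrays'''
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_hat)) ** 2))

rng = np.random.default_rng(1)

# Synthetic target and a synthetic prediction that tracks it with some noise
y_true = rng.normal(loc=0.0, scale=10.0, size=200)
y_model = y_true + rng.normal(scale=3.0, size=200)

# Naive baseline: always predict the mean of the target
y_baseline = np.full_like(y_true, y_true.mean())

print("model RMSE    = %.3f" % rmse(y_true, y_model))
print("baseline RMSE = %.3f" % rmse(y_true, y_baseline))
```

The baseline RMSE equals the (population) standard deviation of the target, so in this notebook the 18.537 could be compared with the standard deviation of `y3`. Since the model is scored on the same rows it was fitted on, the figure is also likely to be optimistic.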

## Make a Prediction

We can use the above model to predict the yearly average alpha of a given asset. This can be used to project alpha values into the future once new ESG measures become available.

In [55]:
##### Refinitiv #####

# The following values are populated for you by Data Science Accelerator. 
# They represent your demo-level access to the data.
# Please don't share this with anyone

RESOURCE_ENDPOINT = 'https://dsa-stg-edp-api.fr-nonprod.aws.thomsonreuters.com/data/environmental-social-governance/v1/views/scores-full'

# RESOURCE_ENDPOINT = 'https://dsa-stg-edp-api.fr-nonprod.aws.thomsonreuters.com/data/environmental-social-governance/v1/views/measures-full'

access_token = 'uGR7cxvvqJ4mgWwva5pPN184iGGigBhY8g4ThFu0' # personal key for Data Science Accelerator access to ESG
access_token = 'u5dfd3sUDp1tsQrrvEGvU6VXbl9ooVCc49Ry7Lmb'
def get_data_request(url, requestData):
    '''HTTP GET request'''
    dResp = requests.get(url, headers = {'X-api-key': access_token}, params = requestData)

    if dResp.status_code != 200:
        raise ValueError("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text))
    else:
        print("Data access successful")
        jResp = json.loads(dResp.text)
        return jResp

def get_data(ric):
    '''Gets ESG scores for a specific RIC (company) code'''

    requestData = {
        "universe": ric
    }

    jResp = get_data_request(RESOURCE_ENDPOINT, requestData)

    data = jResp["data"]
    headers = jResp["headers"]

    #column names are the 'title' of each header entry
    names = [header['title'] for header in headers]

    df = pd.DataFrame(data, columns=names)

    return df
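
To make the reshaping step in `get_data` concrete without calling the API, the sketch below applies the same logic to a hand-written payload. The payload only mimics the structure the function relies on (a `headers` list whose items have a `title`, and a `data` list of rows); the field values are invented and the real response may contain additional keys.

```python
import pandas as pd

# Hypothetical response body with the two keys that get_data reads
sample_response = {
    "headers": [
        {"title": "Instrument"},
        {"title": "Period End Date"},
        {"title": "ESG Score"},
    ],
    "data": [
        ["XYZ", "2017-12-31", 55.0],
        ["XYZ", "2016-12-31", 50.0],
    ],
}

# Column names come from the 'title' of each header entry
names = [h["title"] for h in sample_response["headers"]]
df = pd.DataFrame(sample_response["data"], columns=names)

# Row 0 is the first row returned; the notebook treats it as the latest period
print(df.loc[0, "Period End Date"])
```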

Below we select our asset and fetch its data from the Refinitiv dataset.

In [68]:
asset = 'BLK' #this will work for an asset in the ESG refinitiv dataset

refinitiv_data = get_data(asset)

Data access successful


In [146]:
#find latest entry
print(refinitiv_data['Period End Date'][0])

2017-12-31


This means that our prediction will be based on the ESG features measured for the chosen asset on the above date.

Below are the features used in the regression (in the correct order) and our best estimates for the coefficients.

In [141]:
columns = ['CSR Strategy Score', 'Community Score', 'ESG Combined Score',
       'ESG Controversies Score', 'ESG Score', 'Emissions Score',
       'Environment Pillar Score', 'Governance Pillar Score',
       'Human Rights Score', 'Innovation Score', 'Management Score',
       'Product Responsibility Score', 'Resource Use Score',
       'Shareholders Score', 'Social Pillar Score', 'Workforce Score']
coef = np.array([  0.01957125,   3.52430248,  -0.12811288,  -0.01661968,
         1.69897779,   0.08709694,  -0.80919645,  -0.16139058,
         1.91644184,   0.07498013,  -0.2401559 ,   3.07579314,
         0.13509532,  -0.06393098, -16.16414683,   7.06969857])

In [153]:
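predictor = Ridge(fit_intercept=False,alpha = 0.01)
predictor.coef_ = coef
predictor.intercept_ = 0.0 #no intercept was fitted, but predict() expects the attribute to exist
Xpred2 = np.array(refinitiv_data.loc[0][columns], dtype=float) #features of the latest entry, in the order of 'columns'

ypred2 = predictor.predict(Xpred2.reshape(1, -1))
print('This is the estimated yearly average of alpha for the chosen asset: %.3f' %(ypred2[0]))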

This is the estimated yearly average of alpha for the chosen asset: 4.329
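
Because the model was fitted without an intercept, the prediction is just the dot product of the feature vector and the coefficient vector. A minimal sketch with three made-up features and coefficients (not the fitted values above):

```python
import numpy as np

# Hypothetical coefficients and one row of feature values
coef_toy = np.array([0.5, -1.0, 2.0])
x_toy = np.array([10.0, 20.0, 30.0])

# No-intercept linear prediction: sum_i coef_i * x_i
y_hat = float(x_toy @ coef_toy)
print(y_hat)  # 0.5*10 - 1.0*20 + 2.0*30 = 45.0
```

The same holds for the full model, provided the feature values are numeric and arranged in the same order as `columns`.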
