# <center> Predicting the Current Year's GHG Scope with the Previous Year's GHG Scope </center>

In this notebook we try to predict 2019 GHG Scope 1 values from the previous year's values. We use both the reported values and the year-over-year percentage change.

In [100]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_val_predict
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.preprocessing import scale, PolynomialFeatures
from sklearn.feature_selection import RFE
from datetime import datetime, date
import statsmodels.api as sm

stocks = pd.read_csv("/Users/YEET/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/company_data.csv")
sectors = pd.read_csv("/Users/YEET/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/52_tickers_sectors.csv")

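# Flag rows where GHG Scope 1 was not reported, then fill those gaps with 0.
stocks['Missing_GHG'] = np.where(stocks['GHG Scope 1'].isna(), 1, 0)
stocks['GHG Scope 1'] = stocks['GHG Scope 1'].fillna(0)
# Inspect the filled rows through the flag; isna() matches nothing after fillna.
stocks.loc[stocks['Missing_GHG'] == 1, ['GHG Scope 1', 'Missing_GHG']].head()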

stocks = stocks.merge(sectors, on='Ticker')
stocks['GHG Scope 1'] = stocks['GHG Scope 1'].astype(float)
# Growth ratio (1 + percent change) per ticker; assumes rows are ordered by Year within each ticker.
stocks['Percent_Change_GHG'] = stocks.groupby('Ticker')['GHG Scope 1'].pct_change() + 1

## Using Last Year's GHG Scope to Predict Next Year (2018 & 2019)

In [102]:
companies_2018 = list(stocks[stocks['Year'] == 2018]['Ticker'])
companies_2019 = list(stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = stocks[(stocks['Year'] == 2018) & (stocks['Ticker'].isin(intersection))][['GHG Scope 1']]
y = stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(intersection))][['GHG Scope 1']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.963 |
| Model: | OLS | Adj. R-squared: | 0.962 |
| Method: | Least Squares | F-statistic: | 1294.0 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 2.1e-37 |
| Time: | 07:58:47 | Log-Likelihood: | -514.68 |
| No. Observations: | 52 | AIC: | 1033.0 |
| Df Residuals: | 50 | BIC: | 1037.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1319.8486 | 869.727 | 1.518 | 0.135 | -427.050 | 3066.747 |
| GHG Scope 1 | 0.8891 | 0.025 | 35.971 | 0.000 | 0.839 | 0.939 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 48.626 | Durbin-Watson: | 2.009 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 269.437 |
| Skew: | 2.343 | Prob(JB): | 3.11e-59 |
| Kurtosis: | 13.119 | Cond. No. | 45000.0 |


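The regression results show that the coefficient on the 2018 value is statistically significant (coef = 0.8891, p < 0.001), and the model explains about 96% of the variance in 2019 values (R-squared = 0.963). In other words, 2018 GHG Scope values are a statistically significant predictor of 2019 values. The intercept is not significant (p = 0.135).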
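The cell above pairs 2018 and 2019 rows by position (`x.index = y.index`), which only works if both years list the tickers in the same order. A minimal sketch of an alternative that pairs the two years by ticker is shown here. The helper name `paired_years` and the toy tickers are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Toy data with hypothetical tickers; column names mirror the notebook's.
# The 2019 rows are deliberately in a different order than the 2018 rows.
toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "BBB", "AAA"],
    "Year": [2018, 2018, 2018, 2019, 2019],
    "GHG Scope 1": [100.0, 200.0, 300.0, 180.0, 95.0],
})

def paired_years(df, value_col, prev_year, curr_year):
    """Return one row per ticker with the two years' values side by side."""
    wide = df.pivot(index="Ticker", columns="Year", values=value_col)
    # Keep only tickers that have a value in both years.
    return wide[[prev_year, curr_year]].dropna()

pairs = paired_years(toy, "GHG Scope 1", 2018, 2019)
print(pairs)
```

Here `CCC` is dropped because it has no 2019 row, and each remaining row holds the 2018 and 2019 values of the same company regardless of row order in the source. `pivot` raises an error if a ticker appears twice in the same year, which can also serve as a sanity check on the data.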

## Split by Industry

In [104]:
util_df = stocks[stocks['Sector'] == 'Utilities']
nrg_df = stocks[stocks['Sector'] == 'Energy']

In [105]:
companies_2018 = list(util_df[util_df['Year'] == 2018]['Ticker'])
companies_2019 = list(util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = util_df[(util_df['Year'] == 2018) & (util_df['Ticker'].isin(intersection))][['GHG Scope 1']]
y = util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(intersection))][['GHG Scope 1']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()


| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.979 |
| Model: | OLS | Adj. R-squared: | 0.978 |
| Method: | Least Squares | F-statistic: | 1209.0 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 2.49e-23 |
| Time: | 07:58:53 | Log-Likelihood: | -267.69 |
| No. Observations: | 28 | AIC: | 539.4 |
| Df Residuals: | 26 | BIC: | 542.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6.1103 | 999.523 | -0.006 | 0.995 | -2060.658 | 2048.438 |
| GHG Scope 1 | 0.8863 | 0.025 | 34.777 | 0.000 | 0.834 | 0.939 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 12.351 | Durbin-Watson: | 2.113 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 38.141 |
| Skew: | -0.02 | Prob(JB): | 5.22e-09 |
| Kurtosis: | 8.718 | Cond. No. | 58200.0 |


In [108]:
companies_2018 = list(nrg_df[nrg_df['Year'] == 2018]['Ticker'])
companies_2019 = list(nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = nrg_df[(nrg_df['Year'] == 2018) & (nrg_df['Ticker'].isin(intersection))][['GHG Scope 1']]
y = nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(intersection))][['GHG Scope 1']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.952 |
| Model: | OLS | Adj. R-squared: | 0.95 |
| Method: | Least Squares | F-statistic: | 438.3 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 5.11e-16 |
| Time: | 07:58:58 | Log-Likelihood: | -240.78 |
| No. Observations: | 24 | AIC: | 485.6 |
| Df Residuals: | 22 | BIC: | 487.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2427.0889 | 1321.184 | 1.837 | 0.080 | -312.878 | 5167.056 |
| GHG Scope 1 | 0.9282 | 0.044 | 20.937 | 0.000 | 0.836 | 1.020 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 39.051 | Durbin-Watson: | 1.954 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 115.985 |
| Skew: | 3.001 | Prob(JB): | 6.52e-26 |
| Kurtosis: | 11.942 | Cond. No. | 33500.0 |


The two industry-level regressions above tell the same story: the 2018 coefficient is statistically significant for Utilities (coef = 0.8863, p < 0.001, R-squared = 0.979, n = 28) and for Energy (coef = 0.9282, p < 0.001, R-squared = 0.952, n = 24).

## Percent Change

In [109]:
companies_2018 = list(stocks[(stocks['Year'] == 2018) & (np.isfinite(stocks.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = stocks[(stocks['Year'] == 2018) & (stocks['Ticker'].isin(intersection))][['Percent_Change_GHG']]
y = stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(intersection))][['Percent_Change_GHG']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.011 |
| Model: | OLS | Adj. R-squared: | -0.013 |
| Method: | Least Squares | F-statistic: | 0.4482 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 0.507 |
| Time: | 07:58:59 | Log-Likelihood: | -11.627 |
| No. Observations: | 43 | AIC: | 27.25 |
| Df Residuals: | 41 | BIC: | 30.78 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.0820 | 0.170 | 6.380 | 0.000 | 0.740 | 1.424 |
| Percent_Change_GHG | -0.1052 | 0.157 | -0.669 | 0.507 | -0.423 | 0.212 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 19.106 | Durbin-Watson: | 2.241 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 55.438 |
| Skew: | 0.9 | Prob(JB): | 9.16e-13 |
| Kurtosis: | 8.263 | Cond. No. | 6.72 |


When we try to predict the 2019 percentage change from the 2018 percentage change, the 2018 coefficient is not statistically significant (coef = -0.1052, p = 0.507), and the model explains almost none of the variance (R-squared = 0.011).

## Percentage Change for Each Industry

In [110]:
companies_2018 = list(util_df[(util_df['Year'] == 2018) & (np.isfinite(util_df.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = util_df[(util_df['Year'] == 2018) & (util_df['Ticker'].isin(intersection))][['Percent_Change_GHG']]
y = util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(intersection))][['Percent_Change_GHG']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.001 |
| Model: | OLS | Adj. R-squared: | -0.042 |
| Method: | Least Squares | F-statistic: | 0.02595 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 0.873 |
| Time: | 07:59:02 | Log-Likelihood: | 1.9358 |
| No. Observations: | 25 | AIC: | 0.1284 |
| Df Residuals: | 23 | BIC: | 2.566 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.9093 | 0.238 | 3.819 | 0.001 | 0.417 | 1.402 |
| Percent_Change_GHG | -0.0390 | 0.242 | -0.161 | 0.873 | -0.540 | 0.462 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 28.66 | Durbin-Watson: | 2.306 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 65.515 |
| Skew: | -2.11 | Prob(JB): | 5.94e-15 |
| Kurtosis: | 9.715 | Cond. No. | 10.1 |


In [111]:
companies_2018 = list(nrg_df[(nrg_df['Year'] == 2018) & (np.isfinite(nrg_df.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(companies_2018))]['Ticker'])

#Getting companies that are in both years
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

x = nrg_df[(nrg_df['Year'] == 2018) & (nrg_df['Ticker'].isin(intersection))][['Percent_Change_GHG']]
y = nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(intersection))][['Percent_Change_GHG']]
x.index = y.index

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()


| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.091 |
| Model: | OLS | Adj. R-squared: | 0.034 |
| Method: | Least Squares | F-statistic: | 1.592 |
| Date: | Sat, 10 Jul 2021 | Prob (F-statistic): | 0.225 |
| Time: | 08:00:21 | Log-Likelihood: | -6.9137 |
| No. Observations: | 18 | AIC: | 17.83 |
| Df Residuals: | 16 | BIC: | 19.61 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.4213 | 0.259 | 5.493 | 0.000 | 0.873 | 1.970 |
| Percent_Change_GHG | -0.2719 | 0.215 | -1.262 | 0.225 | -0.729 | 0.185 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 8.978 | Durbin-Watson: | 1.68 |
| Prob(Omnibus): | 0.011 | Jarque-Bera (JB): | 6.05 |
| Skew: | 1.095 | Prob(JB): | 0.0486 |
| Kurtosis: | 4.808 | Cond. No. | 5.75 |


The results do not change when we split by industry and run a separate regression for each. The 2018 percentage change is not a statistically significant predictor of the 2019 percentage change for either Utilities (p = 0.873, n = 25) or Energy (p = 0.225, n = 18).

## Conclusion

We have seen that, when we use actual values, 2018 GHG Scope values are a statistically significant predictor of 2019 values. However, the 2017-2018 percentage change is not a statistically significant predictor of the 2018-2019 percentage change.

Note also that the sample is small. The first part uses only 52 stocks in total, and the percentage change analysis has only 43 observations. Some companies did not report their GHG Scope in 2017, so their percentage change from 2017 to 2018 cannot be calculated.
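The drop in observations can be reproduced on toy data. This sketch assumes, as in the notebook, that a missing value has been filled with 0 and that the change is stored as a growth ratio (1 + percent change). The tickers and numbers are made up, and the column name `Ratio` is a stand-in for `Percent_Change_GHG`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "Year": [2017, 2018, 2019] * 2,
    # BBB did not report in 2017; its gap has been filled with 0.
    "GHG Scope 1": [100.0, 110.0, 99.0, 0.0, 50.0, 55.0],
})

# Order rows by year within each ticker before differencing.
toy = toy.sort_values(["Ticker", "Year"])
toy["Ratio"] = toy.groupby("Ticker")["GHG Scope 1"].pct_change() + 1

# The first year of each ticker is NaN, and a change from 0 is infinite,
# so np.isfinite removes both kinds of rows.
usable = toy[np.isfinite(toy["Ratio"])]
print(usable)
```

In this toy panel `BBB` has no finite 2018 ratio, so it would be excluded from a regression of the 2019 ratio on the 2018 ratio, which is the same mechanism that reduces the notebook's sample.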