# <center> Predicting The Current Year's GHG with the Previous Year's GHG Scope </center>

In this notebook we try to predict each company's 2019 GHG Scope 1 emissions from its values in the previous years (2016 to 2018). We use both the actual values and the year-over-year percentage change.
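As a small illustration of the year-over-year percentage change, the sketch below computes it per company on made-up data. The toy frame and its values are assumptions for illustration only; the column names mirror the ones used in this notebook. Note that adding 1 turns the percentage change into a growth ratio, and that a change measured from a value of 0 is infinite, which is why non-finite values have to be filtered out later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two tickers, three years each (values are made up).
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [2016, 2017, 2018, 2016, 2017, 2018],
    "GHG Scope 1": [100.0, 110.0, 99.0, 0.0, 50.0, 75.0],
})

# Sort so that pct_change compares each year with the year before it.
toy = toy.sort_values(["Ticker", "Year"])

# Growth ratio = 1 + fractional change, computed within each ticker.
toy["Percent_Change_GHG"] = toy.groupby("Ticker")["GHG Scope 1"].pct_change() + 1

# The first year of each ticker has no predecessor (NaN), and a change
# measured from 0 is infinite; keep only the finite ratios.
usable = toy[np.isfinite(toy["Percent_Change_GHG"])]
print(usable)
```
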

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_val_predict
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.preprocessing import scale, PolynomialFeatures
from sklearn.feature_selection import RFE
from datetime import datetime, date
import statsmodels.api as sm

stocks = pd.read_csv("/Users/YEET/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/company_data.csv")
sectors = pd.read_csv("/Users/YEET/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/52_tickers_sectors.csv")

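# Flag rows where GHG Scope 1 was not reported, then fill the gaps with 0
stocks['Missing_GHG'] = np.where(stocks['GHG Scope 1'].isna(), 1, 0)
stocks['GHG Scope 1'] = stocks['GHG Scope 1'].fillna(0)
# Inspect the filled rows (after fillna, isna() would no longer match anything)
stocks.loc[stocks['Missing_GHG'] == 1, ['GHG Scope 1', 'Missing_GHG']].head()

stocks = stocks.merge(sectors, on='Ticker')
stocks['GHG Scope 1'] = stocks['GHG Scope 1'].astype(float)
# Year-over-year growth ratio (1 + percentage change) within each company;
# assumes rows are ordered by year within each Ticker
stocks['Percent_Change_GHG'] = stocks.groupby('Ticker')['GHG Scope 1'].pct_change() + 1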



## Using Average of 2016, 2017, and 2018 GHG Scope to Predict 2019

In [23]:
companies_2018 = list(stocks[(stocks['Year'] == 2018) & (stocks['GHG Scope 1'] != 0)]['Ticker'])
companies_2019 = list(stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies that reported a non-zero GHG Scope 1 value in each of 2016, 2017, and 2018 and have a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(stocks[(stocks['Year'] == 2017) & (stocks['GHG Scope 1'] != 0) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection2= list2017_as_set.intersection(intersection)

companies_2016 = list(stocks[(stocks['Year'] == 2016) & (stocks['GHG Scope 1'] != 0) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection2= list2016_as_set.intersection(intersection2)

x = stocks[(stocks['Year'].isin([2016, 2017,2018])) & (stocks['Ticker'].isin(intersection2))][['Ticker', 'GHG Scope 1']]
x = x.groupby('Ticker').mean()[['GHG Scope 1']]

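y = stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(intersection2))].set_index('Ticker')[['GHG Scope 1']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]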

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.954 |
| Model: | OLS | Adj. R-squared: | 0.953 |
| Method: | Least Squares | F-statistic: | 772.2 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 2.18e-26 |
| Time: | 11:16:09 | Log-Likelihood: | -389.51 |
| No. Observations: | 39 | AIC: | 783.0 |
| Df Residuals: | 37 | BIC: | 786.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1027.9536 | 1191.479 | 0.863 | 0.394 | -1386.213 | 3442.120 |
| GHG Scope 1 | 0.8581 | 0.031 | 27.788 | 0.000 | 0.796 | 0.921 |

| | | | |
|---|---|---|---|
| Omnibus: | 9.938 | Durbin-Watson: | 1.738 |
| Prob(Omnibus): | 0.007 | Jarque-Bera (JB): | 14.732 |
| Skew: | -0.588 | Prob(JB): | 0.000633 |
| Kurtosis: | 5.772 | Cond. No. | 53100.0 |


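Looking at the regression results, the coefficient on the 2016-2018 average is statistically significant (coef = 0.858, p < 0.001), and the model explains about 95% of the variance in the 2019 values (R-squared = 0.954, n = 39). In other words, the average of the 2016, 2017, and 2018 values is a statistically significant predictor of 2019 GHG Scope 1. The intercept is not significantly different from zero (p = 0.394).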

## Split by Industry

In [13]:
util_df = stocks[stocks['Sector'] == 'Utilities']
nrg_df = stocks[stocks['Sector'] == 'Energy']

In [25]:
companies_2018 = list(util_df[(util_df['Year'] == 2018) & (util_df['GHG Scope 1'] != 0)]['Ticker'])
companies_2019 = list(util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies that reported a non-zero GHG Scope 1 value in each of 2016, 2017, and 2018 and have a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(util_df[(util_df['Year'] == 2017) & (util_df['GHG Scope 1'] != 0) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection = list2017_as_set.intersection(intersection)

companies_2016 = list(util_df[(util_df['Year'] == 2016) & (util_df['GHG Scope 1'] != 0) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection = list2016_as_set.intersection(intersection)

x = util_df[(util_df['Year'].isin([2016, 2017,2018])) & (util_df['Ticker'].isin(intersection))][['Ticker', 'GHG Scope 1']]
x = x.groupby('Ticker').mean()[['GHG Scope 1']]

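y = util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(intersection))].set_index('Ticker')[['GHG Scope 1']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]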

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.942 |
| Model: | OLS | Adj. R-squared: | 0.939 |
| Method: | Least Squares | F-statistic: | 357.7 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 4.27e-15 |
| Time: | 11:17:26 | Log-Likelihood: | -238.88 |
| No. Observations: | 24 | AIC: | 481.8 |
| Df Residuals: | 22 | BIC: | 484.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1155.0450 | 1660.278 | 0.696 | 0.494 | -2288.160 | 4598.250 |
| GHG Scope 1 | 0.8019 | 0.042 | 18.913 | 0.000 | 0.714 | 0.890 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.346 | Durbin-Watson: | 2.021 |
| Prob(Omnibus): | 0.015 | Jarque-Bera (JB): | 8.645 |
| Skew: | -0.623 | Prob(JB): | 0.0133 |
| Kurtosis: | 5.663 | Cond. No. | 60000.0 |


In [29]:
companies_2018 = list(nrg_df[(nrg_df['Year'] == 2018) & (nrg_df['GHG Scope 1'] != 0)]['Ticker'])
companies_2019 = list(nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies that reported a non-zero GHG Scope 1 value in each of 2016, 2017, and 2018 and have a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(nrg_df[(nrg_df['Year'] == 2017) & (nrg_df['GHG Scope 1'] != 0) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection = list2017_as_set.intersection(intersection)

companies_2016 = list(nrg_df[(nrg_df['Year'] == 2016) & (nrg_df['GHG Scope 1'] != 0) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection = list2016_as_set.intersection(intersection)

x = nrg_df[(nrg_df['Year'].isin([2016, 2017,2018])) & (nrg_df['Ticker'].isin(intersection))][['Ticker','GHG Scope 1']]
x = x.groupby('Ticker').mean()[['GHG Scope 1']]

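y = nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(intersection))].set_index('Ticker')[['GHG Scope 1']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]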

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | GHG Scope 1 | R-squared: | 0.987 |
| Model: | OLS | Adj. R-squared: | 0.986 |
| Method: | Least Squares | F-statistic: | 1004.0 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 1.08e-13 |
| Time: | 11:21:39 | Log-Likelihood: | -142.84 |
| No. Observations: | 15 | AIC: | 289.7 |
| Df Residuals: | 13 | BIC: | 291.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1713.3395 | 1117.744 | 1.533 | 0.149 | -701.400 | 4128.079 |
| GHG Scope 1 | 0.9409 | 0.030 | 31.685 | 0.000 | 0.877 | 1.005 |

| | | | |
|---|---|---|---|
| Omnibus: | 21.312 | Durbin-Watson: | 1.491 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 22.753 |
| Skew: | 2.042 | Prob(JB): | 1.15e-05 |
| Kurtosis: | 7.441 | Cond. No. | 45900.0 |


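Looking at the two regression results above, split by industry, the average of the 2016, 2017, and 2018 values remains a statistically significant predictor of 2019 GHG Scope 1 in both Utilities (coef = 0.802, R-squared = 0.942, n = 24) and Energy (coef = 0.941, R-squared = 0.987, n = 15), with p < 0.001 in each case.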
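The cells above repeat the same filtering steps for each data frame. One way to factor them out is sketched below. `build_xy` is a hypothetical helper that is not part of the original notebook, and the toy data is made up. It assumes one row per Ticker and Year, and it matches predictor and target by Ticker so that the two stay aligned regardless of row order.

```python
import pandas as pd

def build_xy(df, column, is_valid, target_year=2019, prior_years=(2016, 2017, 2018)):
    """Return (x, y) indexed by Ticker: prior-year mean and target-year value.

    Hypothetical helper; assumes one row per (Ticker, Year).
    """
    # Rows from the prior years whose value passes the validity check.
    prior = df[df["Year"].isin(prior_years) & is_valid(df[column])]
    # Keep tickers with a valid value in every one of the prior years.
    years_reported = prior.groupby("Ticker")["Year"].nunique()
    complete = years_reported[years_reported == len(prior_years)].index
    # Target-year rows for those tickers, indexed by Ticker.
    target = df[(df["Year"] == target_year) & df["Ticker"].isin(complete)]
    y = target.set_index("Ticker")[column]
    # Per-ticker mean over the prior years.
    x = prior[prior["Ticker"].isin(y.index)].groupby("Ticker")[column].mean()
    # Match the two series on Ticker so row i of x and y is the same company.
    x, y = x.align(y, join="inner")
    return x, y

# Made-up data, deliberately not sorted by ticker.
toy = pd.DataFrame({
    "Ticker": ["ZZZ"] * 4 + ["AAA"] * 4 + ["MMM"] * 4,
    "Year": [2016, 2017, 2018, 2019] * 3,
    "GHG Scope 1": [10.0, 20.0, 30.0, 40.0,   # ZZZ: complete
                    1.0, 2.0, 3.0, 4.0,       # AAA: complete
                    5.0, 0.0, 7.0, 8.0],      # MMM: 2017 is 0, so it is dropped
})

x, y = build_xy(toy, "GHG Scope 1", is_valid=lambda s: s != 0)
print(pd.DataFrame({"mean_2016_2018": x, "value_2019": y}))
```

For the percentage-change regressions, the same helper could be called with `column="Percent_Change_GHG"` and `is_valid=np.isfinite`, and with `util_df` or `nrg_df` in place of the full frame.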

## Percent Change

In [34]:
companies_2018 = list(stocks[(stocks['Year'] == 2018) & (np.isfinite(stocks.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies with a finite percentage change in each of 2016, 2017, and 2018 and a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(stocks[(stocks['Year'] == 2017) & (np.isfinite(stocks.Percent_Change_GHG)) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection = list2017_as_set.intersection(intersection)

companies_2016 = list(stocks[(stocks['Year'] == 2016) & (np.isfinite(stocks.Percent_Change_GHG)) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection = list2016_as_set.intersection(intersection)

x = stocks[(stocks['Year'].isin([2016, 2017,2018])) & (stocks['Ticker'].isin(intersection))][['Ticker', 'Percent_Change_GHG']]
x = x.groupby('Ticker').mean()[['Percent_Change_GHG']]
                             
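y = stocks[(stocks['Year'] == 2019) & (stocks['Ticker'].isin(intersection))].set_index('Ticker')[['Percent_Change_GHG']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]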

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | -0.03 |
| Method: | Least Squares | F-statistic: | 0.0006303 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 0.98 |
| Time: | 11:22:52 | Log-Likelihood: | -12.137 |
| No. Observations: | 35 | AIC: | 28.27 |
| Df Residuals: | 33 | BIC: | 31.38 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.9746 | 0.455 | 2.140 | 0.040 | 0.048 | 1.901 |
| Percent_Change_GHG | 0.0116 | 0.461 | 0.025 | 0.980 | -0.927 | 0.950 |

| | | | |
|---|---|---|---|
| Omnibus: | 17.21 | Durbin-Watson: | 2.343 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 38.676 |
| Skew: | 0.971 | Prob(JB): | 4e-09 |
| Kurtosis: | 7.77 | Cond. No. | 15.2 |


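When we try to predict the 2019 percentage change from the average percentage change over 2016-2018, the coefficient is not statistically significant (coef = 0.012, p = 0.980), and the R-squared is essentially zero. Past percentage changes therefore do not help predict the 2019 change in this sample (n = 35).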

## Percentage Change for Each Industry

In [36]:
companies_2018 = list(util_df[(util_df['Year'] == 2018) & (np.isfinite(util_df.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies with a finite percentage change in each of 2016, 2017, and 2018 and a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(util_df[(util_df['Year'] == 2017) & (np.isfinite(util_df.Percent_Change_GHG)) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection = list2017_as_set.intersection(intersection)

companies_2016 = list(util_df[(util_df['Year'] == 2016) & (np.isfinite(util_df.Percent_Change_GHG)) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection = list2016_as_set.intersection(intersection)

x = util_df[(util_df['Year'].isin([2016, 2017,2018])) & (util_df['Ticker'].isin(intersection))][['Ticker', 'Percent_Change_GHG']]
x = x.groupby('Ticker').mean()[['Percent_Change_GHG']]
                             
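y = util_df[(util_df['Year'] == 2019) & (util_df['Ticker'].isin(intersection))].set_index('Ticker')[['Percent_Change_GHG']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]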

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.017 |
| Model: | OLS | Adj. R-squared: | -0.033 |
| Method: | Least Squares | F-statistic: | 0.3382 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 0.567 |
| Time: | 11:47:57 | Log-Likelihood: | 1.659 |
| No. Observations: | 22 | AIC: | 0.682 |
| Df Residuals: | 20 | BIC: | 2.864 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.1233 | 0.426 | 2.637 | 0.016 | 0.235 | 2.012 |
| Percent_Change_GHG | -0.2596 | 0.446 | -0.582 | 0.567 | -1.191 | 0.671 |

| | | | |
|---|---|---|---|
| Omnibus: | 28.888 | Durbin-Watson: | 2.072 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 62.831 |
| Skew: | -2.221 | Prob(JB): | 2.27e-14 |
| Kurtosis: | 9.986 | Cond. No. | 16.9 |


In [38]:
companies_2018 = list(nrg_df[(nrg_df['Year'] == 2018) & (np.isfinite(nrg_df.Percent_Change_GHG))]['Ticker'])
companies_2019 = list(nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(companies_2018))]['Ticker'])

# Keep companies with a finite percentage change in each of 2016, 2017, and 2018 and a 2019 row
list2018_as_set = set(companies_2018)
intersection = list2018_as_set.intersection(companies_2019)

companies_2017 = list(nrg_df[(nrg_df['Year'] == 2017) & (np.isfinite(nrg_df.Percent_Change_GHG)) ]['Ticker'])
list2017_as_set = set(companies_2017)
intersection = list2017_as_set.intersection(intersection)

companies_2016 = list(nrg_df[(nrg_df['Year'] == 2016) & (np.isfinite(nrg_df.Percent_Change_GHG)) ]['Ticker'])
list2016_as_set = set(companies_2016)
intersection = list2016_as_set.intersection(intersection)

x = nrg_df[(nrg_df['Year'].isin([2016, 2017,2018])) & (nrg_df['Ticker'].isin(intersection))][['Ticker', 'Percent_Change_GHG']]
x = x.groupby('Ticker').mean()[['Percent_Change_GHG']]
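y = nrg_df[(nrg_df['Year'] == 2019) & (nrg_df['Ticker'].isin(intersection))].set_index('Ticker')[['Percent_Change_GHG']]
# Align the 2019 values to the per-ticker means by Ticker, not by row position
y = y.loc[x.index]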

# x_train, x_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42)
x = sm.add_constant(x)
sm.OLS(y, x).fit().summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | Percent_Change_GHG | R-squared: | 0.026 |
| Model: | OLS | Adj. R-squared: | -0.063 |
| Method: | Least Squares | F-statistic: | 0.2916 |
| Date: | Wed, 14 Jul 2021 | Prob (F-statistic): | 0.6 |
| Time: | 11:48:43 | Log-Likelihood: | -6.943 |
| No. Observations: | 13 | AIC: | 17.89 |
| Df Residuals: | 11 | BIC: | 19.02 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.6708 | 0.936 | 1.785 | 0.102 | -0.389 | 3.731 |
| Percent_Change_GHG | -0.4857 | 0.899 | -0.540 | 0.600 | -2.465 | 1.494 |

| | | | |
|---|---|---|---|
| Omnibus: | 5.256 | Durbin-Watson: | 1.563 |
| Prob(Omnibus): | 0.072 | Jarque-Bera (JB): | 2.234 |
| Skew: | 0.909 | Prob(JB): | 0.327 |
| Kurtosis: | 3.907 | Cond. No. | 15.0 |


The results do not change when we split by industry and run a separate regression for each one. The average 2016-2018 percentage change is not a statistically significant predictor of the 2019 percentage change for either Utilities (p = 0.567, n = 22) or Energy (p = 0.600, n = 13).

## Conclusion

We have seen that the actual GHG Scope 1 values of 2016, 2017, and 2018 are statistically significant predictors of the 2019 values. However, when we try to predict the percentage change of GHG Scope 1 from 2018 to 2019, the percentage changes over 2015-2018 are not statistically significant predictors.