## Environmental Intensity Level Regressions and Dickey–Fuller test

In the notebook 'PredictingTimeSeries_&_PilotStock_CompDescription', we ran several linear regressions combining different features and saw that they yielded a high R² score. In this notebook, we rerun those regressions but focus on their coefficients.

We also run Dickey–Fuller tests on the variables used in those regressions.
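As a reminder of what the test measures, here is a minimal sketch of the basic (non-augmented) Dickey–Fuller regression, written by hand with NumPy. It is not the `adfuller` routine from statsmodels that we call later, which also adds lagged differences and reports a p-value. The function name `df_tstat` and the simulated series are assumptions made for illustration. The test treats its input as an ordered sequence and asks whether the coefficient on the lagged level is significantly negative; a t-statistic below roughly -2.86 (the asymptotic 5% critical value with a constant) rejects a unit root.

```python
import numpy as np

def df_tstat(series):
    """t-statistic of gamma in  dy_t = alpha + gamma * y_{t-1} + e_t  (no lagged differences)."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)                                      # first differences
    X = np.column_stack([np.ones(len(dy)), y[:-1]])      # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)        # OLS estimates
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)                # covariance of the estimates
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)   # simulated stationary white noise
walk = np.cumsum(noise)        # simulated random walk, which has a unit root

print('white noise:', df_tstat(noise))
print('random walk:', df_tstat(walk))
```

The white-noise statistic is strongly negative, while the random walk typically stays above the critical value.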

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import warnings
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup
import os
import statsmodels.api as sm
warnings.filterwarnings('ignore')

In [2]:
df=pd.read_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/Environmental_impact_cleaned.csv')
df.head()

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,Env_intensity,industry_avg,industry_avg_year,Industry_indicator_year,Environmental_Growth
0,DE0005545503,2016,1&1 DRILLISCH AG,Germany,Post and telecommunications (64),-0.07%,-0.82%,-539318,-525027,-169,...,-6,67,67,-22,23%,-0.0007,-0.020506,-0.02074,1,
1,GB00B1YW4409,2010,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance and...",-0.12%,-0.11%,-1055812,-1032103,-277,...,-4,51,51,-43,10%,-0.0012,-0.028537,-0.006402,1,
2,GB00B1YW4409,2011,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance and...",-0.16%,-0.16%,-961875,-940402,-246,...,-3,38,38,-39,9%,-0.0016,-0.028537,-0.009838,1,33.333333
3,GB00B1YW4409,2012,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance and...",-0.15%,,-722999,-706893,-183,...,-2,27,27,-30,8%,-0.0015,-0.028537,-0.024437,1,-6.25
4,US88579Y1010,2010,3M COMPANY,United States,Activities of membership organisation n.e.c. (91),-7.90%,-35.45%,-2105919763,-1924672080,-439506,...,-423,3772,3772,-79722,1%,-0.079,-0.175838,-0.084583,1,


In [4]:
years = [2016, 2017, 2018]
df_industry = df.groupby('Industry(Exiobase)').count()['CompanyName'].reset_index()
industries = df_industry[df_industry['CompanyName'] > 3]['Industry(Exiobase)']
df_industry_count4 = df[df['Industry(Exiobase)'].isin(industries)]
df_c = df_industry_count4.copy()
def predictiveModel(outcomeYear, pastYears, df_c):
    # Work on a sorted copy of the years passed in, rather than the global `years` list
    pastYears = sorted(pastYears)
    merged = None
    for year in pastYears:
        data = df_c[df_c['Year'] == year]
        data = data.loc[:,['CompanyName','Env_intensity','industry_avg_year']]
        data = data.rename(columns={'Env_intensity': f'Env_intensity_{year}','industry_avg_year':f'industry_avg_year_{year}'})
        # Inner join: keep only companies that appear in every past year
        merged = data if merged is None else pd.merge(merged, data, on=["CompanyName"])
    # Outcome-year values, joined to the past-year features
    data3 = df_c[df_c['Year'] == outcomeYear]
    data3 = data3[['CompanyName','Env_intensity','industry_avg_year']]
    data3 = data3.rename(columns={'Env_intensity': f'Env_intensity_{outcomeYear}','industry_avg_year':f'industry_avg_year_{outcomeYear}'})
    data3 = pd.merge(data3, merged, on=["CompanyName"])
    
    filter_col = [col for col in data3 if ((col.startswith('Env_intensity') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('industry_avg_year') and not(col.endswith(f'{outcomeYear}'))))]
    outcome_col = [col for col in data3 if (col.startswith('Env_intensity') and col.endswith(f'{outcomeYear}'))]
    X=data3[filter_col]
    y=data3[outcome_col]
    
    x = sm.add_constant(X)
    print(sm.OLS(y, x).fit().summary())
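The year-by-year merge in `predictiveModel` turns the long table (one row per company and year) into a wide one (one column per year). A minimal sketch of the same reshaping with a pivot is shown here on a toy frame with made-up values; the helper name `to_wide` is an assumption. Unlike the merge, the pivot keeps a single value per company and year if duplicates exist.

```python
import pandas as pd

# Toy long-format frame (made-up values) using the same column names as df_c
toy = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Year': [2016, 2017, 2018, 2016, 2017, 2018, 2016, 2017],
    'Env_intensity': [-0.10, -0.12, -0.11, -0.30, -0.28, -0.25, -0.05, -0.06],
})

def to_wide(frame, value_col, years):
    """One column per year for value_col, keeping only companies present in every year."""
    wide = (frame[frame['Year'].isin(years)]
            .pivot_table(index='CompanyName', columns='Year',
                         values=value_col, aggfunc='first'))
    wide = wide.reindex(columns=years).dropna()   # drop companies missing any year
    wide.columns = [f'{value_col}_{year}' for year in wide.columns]
    return wide.reset_index()

wide = to_wide(toy, 'Env_intensity', [2016, 2017, 2018])
print(wide)
```

Company C has no 2018 row, so it is dropped, which mirrors the inner join used in the function.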

In [5]:
predictiveModel(2019, years, df_c)

                            OLS Regression Results                            
Dep. Variable:     Env_intensity_2019   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.883
Method:                 Least Squares   F-statistic:                     1340.
Date:                Thu, 15 Jul 2021   Prob (F-statistic):               0.00
Time:                        22:30:25   Log-Likelihood:                 1179.1
No. Observations:                1065   AIC:                            -2344.
Df Residuals:                    1058   BIC:                            -2309.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -0

In [6]:
def predictiveModel_v2(outcomeYear, pastYears, df_c):
    # Work on a sorted copy of the years passed in, rather than the global `years` list
    pastYears = sorted(pastYears)
    merged = None
    for year in pastYears:
        data = df_c[df_c['Year'] == year]
        data = data.loc[:,['CompanyName','Env_intensity','industry_avg_year','Industry_indicator_year']]
        data = data.rename(columns={'Env_intensity': f'Env_intensity_{year}','industry_avg_year':f'industry_avg_year_{year}', 'Industry_indicator_year' : f'Industry_indicator_year_{year}'})
        # Inner join: keep only companies that appear in every past year
        merged = data if merged is None else pd.merge(merged, data, on=["CompanyName"])
    # Outcome-year values, joined to the past-year features
    data3 = df_c[df_c['Year'] == outcomeYear]
    data3 = data3[['CompanyName','Env_intensity']]
    data3 = data3.rename(columns={'Env_intensity': f'Env_intensity_{outcomeYear}'})
    data3 = pd.merge(data3, merged, on=["CompanyName"])
    
    filter_col = [col for col in data3 if ((col.startswith('Env_intensity') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('industry_avg_year') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('Industry_indicator_year_') and not(col.endswith(f'{outcomeYear}'))))]            
    outcome_col = [col for col in data3 if (col.startswith('Env_intensity') and col.endswith(f'{outcomeYear}'))]
    X=data3[filter_col]
    y=data3[outcome_col]
    
    x = sm.add_constant(X)
    print(sm.OLS(y, x).fit().summary())

In [7]:
predictiveModel_v2(2019, years, df_c)

                            OLS Regression Results                            
Dep. Variable:     Env_intensity_2019   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.883
Method:                 Least Squares   F-statistic:                     893.7
Date:                Thu, 15 Jul 2021   Prob (F-statistic):               0.00
Time:                        22:31:46   Log-Likelihood:                 1180.6
No. Observations:                1065   AIC:                            -2341.
Df Residuals:                    1055   BIC:                            -2291.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

In [8]:
def predictiveModel_v3(outcomeYear, pastYears, df_c):
    # Work on a sorted copy of the years passed in, rather than the global `years` list
    pastYears = sorted(pastYears)
    merged = None
    for year in pastYears:
        data = df_c[df_c['Year'] == year]
        data = data.loc[:,['CompanyName','Env_intensity','industry_avg_year','Industry_indicator_year','Environmental_Growth']]
        data = data.rename(columns={'Env_intensity': f'Env_intensity_{year}','industry_avg_year':f'industry_avg_year_{year}', 'Industry_indicator_year' : f'Industry_indicator_year_{year}', 'Environmental_Growth': f'Environmental_Growth_{year}'})
        # Inner join: keep only companies that appear in every past year
        merged = data if merged is None else pd.merge(merged, data, on=["CompanyName"])
    # Drop companies with a missing value in any past-year feature
    merged = merged.dropna()
    # Outcome-year values, joined to the past-year features
    data3 = df_c[df_c['Year'] == outcomeYear]
    data3 = data3[['CompanyName','Env_intensity']]
    data3 = data3.rename(columns={'Env_intensity': f'Env_intensity_{outcomeYear}'})
    data3 = pd.merge(data3, merged, on=["CompanyName"])
    
    filter_col = [col for col in data3 if ((col.startswith('Env_intensity') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('industry_avg_year') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('Industry_indicator_year_') and not(col.endswith(f'{outcomeYear}')))) or ((col.startswith('Environmental_Growth_') and not(col.endswith(f'{outcomeYear}'))))]            
    outcome_col = [col for col in data3 if (col.startswith('Env_intensity') and col.endswith(f'{outcomeYear}'))]
    X=data3[filter_col]
    y=data3[outcome_col]
    
    x = sm.add_constant(X)
    print(sm.OLS(y, x).fit().summary())

In [9]:
predictiveModel_v3(2019, years, df_c)

                            OLS Regression Results                            
Dep. Variable:     Env_intensity_2019   R-squared:                       0.878
Model:                            OLS   Adj. R-squared:                  0.876
Method:                 Least Squares   F-statistic:                     596.2
Date:                Thu, 15 Jul 2021   Prob (F-statistic):               0.00
Time:                        22:36:03   Log-Likelihood:                 1109.2
No. Observations:                1010   AIC:                            -2192.
Df Residuals:                     997   BIC:                            -2128.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

## Dickey–Fuller test

In [None]:
from statsmodels.tsa.stattools import adfuller

ind = df.copy()
y2018=list(ind[ind['Year'] == 2018]['Env_intensity'])
y2018=pd.Series(y2018)
X = y2018.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))


In [None]:
y2017=list(ind[ind['Year'] == 2017]['Env_intensity'])
y2017=pd.Series(y2017)
from statsmodels.tsa.stattools import adfuller
X = y2017.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2016=list(ind[ind['Year'] == 2016]['Env_intensity'])
y2016=pd.Series(y2016)
from statsmodels.tsa.stattools import adfuller
X = y2016.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2015=list(ind[ind['Year'] == 2015]['Env_intensity'])
y2015=pd.Series(y2015)
from statsmodels.tsa.stattools import adfuller
X = y2015.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2014=list(ind[ind['Year'] == 2014]['Env_intensity'])
y2014=pd.Series(y2014)
from statsmodels.tsa.stattools import adfuller
X = y2014.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

Summary: For each year from 2014 to 2018, the environmental intensity data is stationary.

Next, let's examine the industry average for each year to see whether it is stationary.

In [None]:
ind.info()

Test whether the industry average for each year we used is stationary:


In [None]:
y2018=list(ind[ind['Year'] == 2018]['industry_avg_year'])
y2018=pd.Series(y2018)
from statsmodels.tsa.stattools import adfuller
X = y2018.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2017=list(ind[ind['Year'] == 2017]['industry_avg_year'])
y2017=pd.Series(y2017)
from statsmodels.tsa.stattools import adfuller
X = y2017.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2016=list(ind[ind['Year'] == 2016]['industry_avg_year'])
y2016=pd.Series(y2016)
from statsmodels.tsa.stattools import adfuller
X = y2016.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2015=list(ind[ind['Year'] == 2015]['industry_avg_year'])
y2015=pd.Series(y2015)
from statsmodels.tsa.stattools import adfuller
X = y2015.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

Summary: For each year from 2015 to 2018, the industry average is stationary, so the model we used, which includes the industry average, can be used for prediction.

Let's take a look at the top five industries.

In [None]:
ind.groupby('Industry(Exiobase)')['Env_intensity'].count().sort_values()

Top five industries by number of observations:

Retail trade, except of motor vehicles and motorcycles; repair of personal and household goods (52)                    
Real estate activities(70)                                                                                    
Construction (45)                                                 
Manufacture of electrical machinery and apparatus n.e.c. (31)                                                 
Financial intermediation, except insurance and pension funding (65) 
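The list above was read off the sorted counts by hand. A minimal sketch of selecting the top industries programmatically is shown here on a toy frame with made-up rows; `top_n` and the toy values are assumptions.

```python
import pandas as pd

# Toy frame (made-up rows) with the same column names as `ind`
toy = pd.DataFrame({
    'Industry(Exiobase)': ['Construction (45)'] * 3
                          + ['Real estate activities(70)'] * 2
                          + ['Education (80)'],
    'Env_intensity': [-0.02, -0.03, -0.01, -0.005, -0.004, -0.001],
})

top_n = 2
# Count observations per industry and keep the names of the top_n largest groups
top_industries = (toy.groupby('Industry(Exiobase)')['Env_intensity']
                  .count()
                  .nlargest(top_n)
                  .index.tolist())

# Keep only rows that belong to those industries
subset = toy[toy['Industry(Exiobase)'].isin(top_industries)]
print(top_industries)
print(len(subset))
```

Taking the names from the data also avoids typing mismatches in the industry labels, which must match the column values exactly.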

In [None]:
listind=['Retail trade, except of motor vehicles and motorcycles; repair of personal and household goods (52)',
'Real estate activities(70)',
'Construction (45)',
'Manufacture of electrical machinery and apparatus n.e.c. (31)',
'Financial intermediation, except insurance and pension funding (65)']
# Keep only the rows whose industry is one of the five listed above
num_order_new = ind[ind['Industry(Exiobase)'].isin(listind)]
num_order_new



In [None]:
y2018=list(num_order_new[num_order_new['Year'] == 2018]['Env_intensity'])
y2018=pd.Series(y2018)
from statsmodels.tsa.stattools import adfuller
X = y2018.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2017=list(num_order_new[num_order_new['Year'] == 2017]['Env_intensity'])
y2017=pd.Series(y2017)
from statsmodels.tsa.stattools import adfuller
X = y2017.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2016=list(num_order_new[num_order_new['Year'] == 2016]['Env_intensity'])
y2016=pd.Series(y2016)
from statsmodels.tsa.stattools import adfuller
X = y2016.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2015=list(num_order_new[num_order_new['Year'] == 2015]['Env_intensity'])
y2015=pd.Series(y2015)
from statsmodels.tsa.stattools import adfuller
X = y2015.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

Let's check the top three industries.

In [None]:
top3 = ['Construction (45)',
'Financial intermediation, except insurance and pension funding (65)',
'Manufacture of electrical machinery and apparatus n.e.c. (31)']
# Keep only the rows whose industry is one of the three listed above
num_order_new = ind[ind['Industry(Exiobase)'].isin(top3)]

In [None]:
y2018=list(num_order_new[num_order_new['Year'] == 2018]['Env_intensity'])
y2018=pd.Series(y2018)
from statsmodels.tsa.stattools import adfuller
X = y2018.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2017=list(num_order_new[num_order_new['Year'] == 2017]['Env_intensity'])
y2017=pd.Series(y2017)
from statsmodels.tsa.stattools import adfuller
X = y2017.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

In [None]:
y2016=list(num_order_new[num_order_new['Year'] == 2016]['Env_intensity'])
y2016=pd.Series(y2016)
from statsmodels.tsa.stattools import adfuller
X = y2016.values
result = adfuller(X)
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

## Conclusion

Summary: All of the data that we used is stationary.



Next, we will continue our analysis in the 'DistilBERT_CompaniesDescription' notebook