## Predicting Environmental Intensity by industry

In this notebook, we test how well a simple time-series regression predicts environmental intensity within each industry, and whether adding the environmental intensity growth rate as a predictor improves regression performance.

First, let's import some libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import warnings
import statsmodels.formula.api as smf
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from math import sqrt
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('Environmental_impact_cleaned.csv')
df.head(3)

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,CropProductionCapacity,MeatProductionCapacity,Biodiversity,AbioticResources,Waterproductioncapacity(Drinkingwater&IrrigationWater),WoodProductionCapacity,SDG1.5,SDG2.1,SDG2.2,SDG2.3,SDG2.4,SDG3.3,SDG3.4,SDG3.9,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,Env_intensity,industry_avg,industry_avg_year,Industry_indicator_year,Environmental_Growth
0,DE0005545503,2016,1&1 DRILLISCH AG,Germany,Post and telecommunications (64),-0.07%,-0.82%,-539318,-525027,-169,-7009,-1630,-27,-878,-4714,135,-234989,-166914,-166795,-1752,-1752,-27366,65960,-142,-4714,-878,-5,-1,-77,-6,67,67,-22,23%,-0.0007,-0.020506,-0.02074,1,
1,GB00B1YW4409,2010,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance and...",-0.12%,-0.11%,-1055812,-1032103,-277,-13751,-3221,-47,-562,-5953,102,-463300,-295103,-294949,-3438,-3438,-47957,59044,-74,-5953,-562,-4,0,-133,-4,51,51,-43,10%,-0.0012,-0.028537,-0.006402,1,
2,GB00B1YW4409,2011,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance and...",-0.16%,-0.16%,-961875,-940402,-246,-12525,-2935,-42,-424,-5378,77,-421928,-264714,-264579,-3131,-3131,-42961,44515,-56,-5378,-424,-3,0,-119,-3,38,38,-39,9%,-0.0016,-0.028537,-0.009838,1,33.333333


In [None]:
df = df.loc[:,['CompanyName','Year','Industry(Exiobase)','Env_intensity','Environmental_Growth']]
df.head()

Unnamed: 0,CompanyName,Year,Industry(Exiobase),Env_intensity,Environmental_Growth
0,1&1 DRILLISCH AG,2016,Post and telecommunications (64),-0.0007,
1,3I GROUP PLC,2010,"Financial intermediation, except insurance and...",-0.0012,
2,3I GROUP PLC,2011,"Financial intermediation, except insurance and...",-0.0016,33.333333
3,3I GROUP PLC,2012,"Financial intermediation, except insurance and...",-0.0015,-6.25
4,3M COMPANY,2010,Activities of membership organisation n.e.c. (91),-0.079,
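
The `Environmental_Growth` values in the preview are consistent with a year-over-year percentage change of `Env_intensity` within each company: for 3I GROUP PLC, going from -0.0012 to -0.0016 is +33.33%, and from -0.0016 to -0.0015 is -6.25%. The sketch below reproduces those numbers under that assumption; how the cleaned file actually computed the column is not shown in this notebook.

```python
import pandas as pd

# Rows copied from the preview above (3I GROUP PLC, 2010-2012).
toy = pd.DataFrame({
    'CompanyName': ['3I GROUP PLC'] * 3,
    'Year': [2010, 2011, 2012],
    'Env_intensity': [-0.0012, -0.0016, -0.0015],
})

# Assumed definition: percentage change versus the same company's previous year.
toy = toy.sort_values(['CompanyName', 'Year'])
prev = toy.groupby('CompanyName')['Env_intensity'].shift(1)
toy['growth_check'] = (toy['Env_intensity'] - prev) / prev * 100
print(toy)
```

Under this definition the first year of every company has no previous value, which would explain the missing growth entries in the preview.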



### Industries Regressions -- Past years Environmental Intensity



First, we keep only industries with more than 3 rows. Note that `count()` counts company-year observations, not distinct companies. The filter removes industries with too little data to support a meaningful regression.

In [None]:
df_industry = df.groupby('Industry(Exiobase)').count()['CompanyName'].reset_index()
industries = df_industry[df_industry['CompanyName'] > 3]['Industry(Exiobase)']
df_industry_count4 = df[df['Industry(Exiobase)'].isin(industries)]
df_industry_count4.head()

Unnamed: 0,CompanyName,Year,Industry(Exiobase),Env_intensity,Environmental_Growth
0,1&1 DRILLISCH AG,2016,Post and telecommunications (64),-0.0007,
1,3I GROUP PLC,2010,"Financial intermediation, except insurance and...",-0.0012,
2,3I GROUP PLC,2011,"Financial intermediation, except insurance and...",-0.0016,33.333333
3,3I GROUP PLC,2012,"Financial intermediation, except insurance and...",-0.0015,-6.25
4,3M COMPANY,2010,Activities of membership organisation n.e.c. (91),-0.079,
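
To require a minimum number of distinct companies rather than a minimum number of rows, `nunique()` can replace `count()`. A small sketch on made-up data (the company and industry labels are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'A', 'B', 'C', 'D', 'E'],
    'Industry(Exiobase)': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'Year': [2016, 2017, 2018, 2019, 2019, 2019, 2019, 2019],
})

grouped = toy.groupby('Industry(Exiobase)')['CompanyName']
rows_per_industry = grouped.count()         # X: 4 rows, Y: 4 rows
companies_per_industry = grouped.nunique()  # X: 1 company, Y: 4 companies

print(rows_per_industry)
print(companies_per_industry)
```

Both industries pass a row-count filter of more than 3, but industry X contains a single company observed in four years.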


The following function fits one linear regression per industry. The features are the Environmental Intensity values of the input years (`pastYears`), and the target is the Environmental Intensity in `outcomeYear`.

In [None]:
df_c = df_industry_count4.copy()
def calculateIndustriesRegressions(outcomeYear, pastYears, df_c):
    industry_regressions = {}
    for i in np.unique(df_c['Industry(Exiobase)']):
        years = sorted(pastYears)  # use the function argument instead of the notebook-level `years`
        for year in years:
            data = df_c[(df_c['Industry(Exiobase)'] == i) & (df_c['Year'] == year)]
            data = data.loc[:,['CompanyName','Env_intensity']]
            data.rename(columns={'Env_intensity': f'Env_intensity_{year}'}, inplace=True) 
            if(year == min(years)):
                data1 = pd.DataFrame(data)
            else:
                data2 = pd.merge(data1, data, on=["CompanyName"])
                data1 = data2.copy()
        if (data2.shape[0])>10:
            data3 = df_c[df_c['Year'] == outcomeYear]
            data3 = data3[['CompanyName','Env_intensity']]
            data3.rename(columns={'Env_intensity': f'Env_intensity_{outcomeYear}'}, inplace=True) 
            data3 = pd.merge(data3, data2, on=["CompanyName"])

            filter_col = [col for col in data3 if ((col.startswith('Env_intensity') and not(col.endswith(f'{outcomeYear}'))))]
            outcome_col = [col for col in data3 if (col.startswith('Env_intensity') and col.endswith(f'{outcomeYear}'))]
            X=data3[filter_col]
            y=data3[outcome_col]


            regr = linear_model.LinearRegression()
            X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
            regr.fit(X_train, y_train)
            y_train_pred = regr.predict(X_train)
            y_pred = regr.predict(X_test)
        else: 
            continue
        industry_regressions[i] = {'OutcomeYear':outcomeYear, 'MSE_test': metrics.mean_squared_error(y_test, y_pred),'RMSE_test':sqrt(metrics.mean_squared_error(y_test, y_pred)),'Intercept': regr.intercept_,'Coefficient': regr.coef_, 
                                   'R2_score':metrics.r2_score(y_test, y_pred)}
    data_items = industry_regressions.items()
    data_list = list(data_items)

    df = pd.DataFrame(data_list)
    df=pd.concat([df.drop(columns=df.columns[1], axis=1), df.iloc[:,1].apply(pd.Series)], axis=1)
    return df
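
The year-by-year merge loop inside the function keeps only companies that have a value for every input year, because each merge is an inner join on `CompanyName`. The same reshaping can be sketched with `pivot` followed by `dropna`. The values below are made up, and the sketch assumes one row per company and year:

```python
import pandas as pd

toy = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Year': [2016, 2017, 2018, 2016, 2018, 2016, 2017, 2018],
    'Env_intensity': [-0.010, -0.012, -0.011, -0.200, -0.210, -0.030, -0.031, -0.029],
})

# One row per company, one column per year (assumes one row per company-year).
wide = toy.pivot(index='CompanyName', columns='Year', values='Env_intensity')
wide.columns = [f'Env_intensity_{year}' for year in wide.columns]

# Dropping incomplete rows mirrors the inner merges: company B has no 2017 value.
complete = wide.dropna()
print(complete)
```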

### Input data from 2016 to 2018 for independent variable

At first, we kept every industry with at least one company reporting in all years from 2016 to 2018. However, R squared came out as NaN for many of those industries. We then raised the threshold step by step and settled on requiring more than 10 companies per industry so that R squared is defined. Finally, we converted the results into a dataframe and reshaped it into a clean table of regression results.
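
The NaN values are most likely a sample-size effect: R squared compares the residuals with the spread of the test targets around their mean, and that spread is zero when the test set holds a single company. As far as we understand `train_test_split`, a fractional `test_size` is rounded up to a whole number of rows, so the arithmetic can be sketched as follows (`split_sizes` is a hypothetical helper, not part of scikit-learn):

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Train/test sizes for a fractional test_size (the test size is rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

for n in (2, 5, 6, 10, 11):
    n_train, n_test = split_sizes(n)
    print(f'{n:>2} companies -> {n_train} train, {n_test} test')
```

With more than 10 companies, the test set has at least 3 rows. The threshold is applied before the merge with the outcome year, so the sample that reaches the regression can be smaller if some companies lack a 2019 value.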

In [None]:
years=[2016,2017,2018]
indf=calculateIndustriesRegressions(2019, years, df_c)
arr1=np.array(indf['Coefficient'].to_list())[:,0][:,0]
arr2=np.array(indf['Coefficient'].to_list())[:,0][:,1]
arr3=np.array(indf['Coefficient'].to_list())[:,0][:,2]
indf['Coefficient2016'] = arr1.tolist()
indf['Coefficient2017'] = arr2.tolist()
indf['Coefficient2018'] = arr3.tolist()
indf['Industries']=indf.iloc[:,0]
first_column = indf.pop('Industries')
indf.insert(0, 'Industries', first_column)
indf['Intercept']=np.array(indf['Intercept'].to_list())[:,0].tolist()
indf=indf.drop(indf.columns[1],axis=1)#remove repeated industries column
indf=indf.drop(columns=['Coefficient'])#remove coefficient column
indf

Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2016,Coefficient2017,Coefficient2018
0,Activities auxiliary to financial intermediati...,2019,0.001666,0.040811,0.036721,-368.570788,17.976546,-36.236407,17.399672
1,Activities of membership organisation n.e.c. (91),2019,0.00021,0.014482,0.028009,0.981829,-0.133093,0.940968,0.532472
2,Air transport (62),2019,0.079894,0.282656,0.000122,0.243137,0.367418,-0.563799,1.192029
3,Chemicals nec,2019,0.005145,0.071727,0.00515,0.793452,0.372398,0.437839,0.173081
4,Computer and related activities (72),2019,3e-06,0.00176,-0.000111,0.807159,0.903596,-0.338364,0.298088
5,Construction (45),2019,0.000307,0.017532,-0.000133,0.937546,0.363022,-0.87922,1.339938
6,Extraction of crude petroleum and services rel...,2019,0.011216,0.105908,-0.018099,0.061144,0.02064,0.179937,0.608204
7,"Financial intermediation, except insurance and...",2019,4.3e-05,0.00652,-0.000111,0.974641,-0.307046,0.922647,0.273847
8,Manufacture of beverages,2019,0.017007,0.130413,0.00843,0.930428,-0.1173,0.298025,1.101787
9,Manufacture of electrical machinery and appara...,2019,0.000402,0.020052,-0.000177,0.891901,0.320314,-0.212506,0.837339
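
Unpacking coefficients by position works only while the feature order is known. One way to make the pairing explicit is to label the coefficients with the feature names. The sketch below uses the values from the 'Air transport (62)' row; the `coef` array is typed in by hand to mimic the `(1, n_features)` shape that `LinearRegression` stores when the target is two-dimensional:

```python
import numpy as np
import pandas as pd

# Values copied from the 'Air transport (62)' row above; with a 2-D target,
# LinearRegression stores coef_ with shape (1, n_features).
feature_names = ['Env_intensity_2016', 'Env_intensity_2017', 'Env_intensity_2018']
coef = np.array([[0.367418, -0.563799, 1.192029]])

# Pair each coefficient with the feature it belongs to.
named = pd.Series(np.ravel(coef), index=feature_names)
print(named)
```

In the real function, `feature_names` would be `list(X.columns)` and `coef` would be `regr.coef_`.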


As a check, we refit the regression for one industry, 'Activities auxiliary to financial intermediation (67)', outside the function and inspect its accuracy.

In [None]:
years = [2016, 2017, 2018,2019]
df_c = df_industry_count4.copy()
for year in years:
  data = df_c[df_c['Year'] == year]
  data = data.loc[:,['CompanyName','Env_intensity','Industry(Exiobase)']]
  data.rename(columns={'Env_intensity': f'Env_intensity_{year}'}, inplace=True) 
  if(year == min(years)):
    data1 = pd.DataFrame(data)
  else:
    data2 = pd.merge(data1, data, on=["CompanyName",'Industry(Exiobase)'])
    data1 = data2.copy()
    
data = data2[data2['Industry(Exiobase)'] == 'Activities auxiliary to financial intermediation (67)']
X=data[['Env_intensity_2016','Env_intensity_2017','Env_intensity_2018']]
y=data['Env_intensity_2019']
regr = linear_model.LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
regr.fit(X_train, y_train)
y_train_pred = regr.predict(X_train)
y_pred = regr.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (metrics.mean_squared_error(y_train, y_train_pred),
                metrics.mean_squared_error(y_test, y_pred)))
print('R2 score:', metrics.r2_score(y_test, y_pred))
print('Model intercept: ', regr.intercept_)
print('Model coefficients: ', regr.coef_)


MSE train: 0.002, test: 0.002
R2 score: -368.570788210546
Model intercept:  0.0367213032370731
Model coefficients:  [ 17.97654644 -36.23640739  17.3996719 ]
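
Large coefficients of opposite sign (about 18, -36, and 17) together with a strongly negative R squared are a pattern often seen when the predictors are nearly collinear and the training sample is small. Whether that is what happens here has not been checked. The sketch below, on made-up data, only shows one way to inspect lagged features for this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: one persistent intensity level per company plus small yearly noise,
# so the three "year" columns are almost copies of each other.
level = rng.normal(-0.02, 0.01, size=8)
X = np.column_stack([level + rng.normal(0, 1e-4, size=8) for _ in range(3)])

corr = np.corrcoef(X, rowvar=False)        # pairwise correlation of the lag columns
cond = np.linalg.cond(X - X.mean(axis=0))  # a large value signals an ill-conditioned fit
print(corr.round(4))
print(f'condition number: {cond:.1f}')
```

Applied to the real data, `X` would be the three `Env_intensity_*` columns of the industry being checked.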



The result matches the function's output for this industry.
Next, we look at the distribution of R squared across the result table.

In [None]:
indf['R2_score'].describe()

count     31.000000
mean     -12.010988
std       66.266901
min     -368.570788
25%        0.318431
50%        0.807159
75%        0.917063
max        0.981829
Name: R2_score, dtype: float64

The distribution of R squared shows a huge difference in performance across industries: the median is 0.81, but a few strongly negative values (minimum -368.57) pull the mean down to -12.01.
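
The gap between mean and median can be reproduced on the ten industries listed in the results table above (the full table has 31 rows, so the numbers differ from the `describe()` output):

```python
from statistics import mean, median

# R2_score values of the ten industries listed in the results table above.
r2_shown = [-368.570788, 0.981829, 0.243137, 0.793452, 0.807159,
            0.937546, 0.061144, 0.974641, 0.930428, 0.891901]

print('mean:  ', round(mean(r2_shown), 3))
print('median:', round(median(r2_shown), 3))
print('negative:', sum(r < 0 for r in r2_shown), 'of', len(r2_shown))
```

A single strongly negative value is enough to make the mean negative, so the median is the more informative summary here.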

In [None]:
indf=indf.set_index('Industries')
maxValuesObj = indf['R2_score'].max()
print('Maximum R2_score:')
print(maxValuesObj)
maxValueIndexObj = indf['R2_score'].idxmax()
print('Industry with the maximum R2_score:')
print(maxValueIndexObj)
minValuesObj = indf['R2_score'].min()
print('Minimum R2_score:')
print(minValuesObj)
minValueIndexObj = indf['R2_score'].idxmin()
print('Industry with the minimum R2_score:')
print(minValueIndexObj)

Maximum R2_score:
0.9818286119890922
Industry with the maximum R2_score:
Activities of membership organisation n.e.c. (91)
Minimum R2_score:
-368.570788210546
Industry with the minimum R2_score:
Activities auxiliary to financial intermediation (67)


Activities of membership organisation n.e.c. (91) has the highest R2_score (0.98): on the test set, the 2016 to 2018 values explain about 98% of the variance in 2019 Environmental Intensity.


R squared is negative when the fit is worse than a horizontal line at the mean of the test targets. Accordingly, the regression does not work for Activities auxiliary to financial intermediation (67).
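
R squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal sketch on made-up numbers shows how a small absolute error still yields a strongly negative score when the targets barely vary. The last lines back out the spread of the test targets implied by the reported `MSE_test` and `R2_score` for industry (67), using R squared = 1 - MSE / variance:

```python
def r2(y_true, y_pred):
    """R squared = 1 - SS_res / SS_tot (undefined when SS_tot is 0)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [0.01, 0.02, 0.03]      # made-up targets with a very small spread
mean_only = [0.02, 0.02, 0.02]   # baseline: always predict the mean
off_target = [0.05, 0.00, 0.08]  # errors that are small, but larger than the spread

print(r2(y_true, mean_only))     # approximately 0
print(r2(y_true, off_target))    # strongly negative

# Variance of the test targets implied by the reported values for industry (67).
implied_variance = 0.001666 / (1 - (-368.570788))
print(implied_variance)
```

The implied variance is on the order of 1e-06, so even a test MSE of about 0.0017 is hundreds of times larger than the spread the model is trying to explain.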


In [None]:
# five industries with the largest R squared
indf5=indf.nlargest(5, ['R2_score'])
indf5

Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2016,Coefficient2017,Coefficient2018
Activities of membership organisation n.e.c. (91),2019,0.00021,0.014482,0.028009,0.981829,-0.133093,0.940968,0.532472
Research and development (73),2019,2.4e-05,0.004931,-0.003852,0.975646,0.805671,-0.298986,0.064527
"Financial intermediation, except insurance and pension funding (65)",2019,4.3e-05,0.00652,-0.000111,0.974641,-0.307046,0.922647,0.273847
"Manufacture of fabricated metal products, except machinery and equipment (28)",2019,2e-05,0.004507,0.001019,0.969378,-0.207075,0.769293,0.456634
Manufacture of machinery and equipment n.e.c. (29),2019,7.5e-05,0.008643,0.002781,0.958841,0.382328,-0.201888,0.978355


The five industries with the highest R squared are Activities of membership organisation n.e.c. (91); Research and development (73); Financial intermediation, except insurance and pension funding (65); Manufacture of fabricated metal products, except machinery and equipment (28); and Manufacture of machinery and equipment n.e.c. (29).

Because only 31 industries remain when 2016 to 2018 are used to forecast 2019 Environmental Intensity, we next try using only 2017 and 2018 as inputs.

### Input data from 2017 and 2018 for independent variable

In [None]:
years=[2017,2018]
years.sort()
indf2=calculateIndustriesRegressions(2019, years, df_c)
arr1=np.array(indf2['Coefficient'].to_list())[:,0][:,0]
arr2=np.array(indf2['Coefficient'].to_list())[:,0][:,1]
indf2['Coefficient2017'] = arr1.tolist()
indf2['Coefficient2018'] = arr2.tolist()
indf2['Industries']=indf2.iloc[:,0]
first_column = indf2.pop('Industries')
indf2.insert(0, 'Industries', first_column)
indf2['Intercept']=np.array(indf2['Intercept'].to_list())[:,0].tolist()
indf2=indf2.drop(indf2.columns[1],axis=1)#remove repeated industries column
indf2=indf2.drop(columns=['Coefficient'])#remove coefficient column
indf2


Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2017,Coefficient2018
0,Activities auxiliary to financial intermediati...,2019,0.000243,0.015601,0.022774,-53.299135,4.261583,-2.454473
1,Activities of membership organisation n.e.c. (91),2019,0.000174,0.013206,0.028261,0.979978,0.865499,0.414375
2,Air transport (62),2019,0.05253,0.229194,-0.002812,0.055293,-0.416125,1.417321
3,Chemicals nec,2019,0.000208,0.014438,0.001932,0.949105,-0.103712,1.103262
4,Computer and related activities (72),2019,0.000876,0.02959,0.000253,-39.568634,0.503347,0.462031
5,Construction (45),2019,0.000324,0.017998,0.000525,0.932321,-0.728185,1.704921
6,Extraction of crude petroleum and services rel...,2019,0.003437,0.058622,-0.03554,0.793735,0.123274,0.648872
7,"Financial intermediation, except insurance and...",2019,2.2e-05,0.004709,-0.001817,0.973692,0.332456,0.500322
8,Manufacture of beverages,2019,0.005129,0.071619,0.009014,0.979018,0.068502,1.207932
9,Manufacture of electrical machinery and appara...,2019,8.7e-05,0.009308,0.000343,0.964291,0.458504,0.555336


In [None]:
indf2['R2_score'].describe()

count    32.000000
mean     -3.423974
std      13.222294
min     -53.299135
25%       0.551183
50%       0.856592
75%       0.952902
max       0.991924
Name: R2_score, dtype: float64

In [None]:
# five industries with the largest R squared
indf2.nlargest(5, ['R2_score'])

Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2017,Coefficient2018
10,"Manufacture of fabricated metal products, exce...",2019,7e-06,0.002585,0.000973,0.991924,0.400094,0.604175
15,"Manufacture of radio, television and communica...",2019,7e-06,0.002671,7.8e-05,0.98533,-0.182957,1.172883
1,Activities of membership organisation n.e.c. (91),2019,0.000174,0.013206,0.028261,0.979978,0.865499,0.414375
8,Manufacture of beverages,2019,0.005129,0.071619,0.009014,0.979018,0.068502,1.207932
27,Real estate activities (70),2019,5e-05,0.007097,-0.001203,0.976835,-0.529953,1.350316


In [None]:
indf25=indf2.nlargest(5, ['R2_score'])
indf25['Industries'].tolist()

['Manufacture of fabricated metal products, except machinery and equipment (28)',
 'Manufacture of radio, television and communication equipment and apparatus (32)',
 'Activities of membership organisation n.e.c. (91)',
 'Manufacture of beverages',
 'Real estate activities (70)']

The ranking differs somewhat from the regression that uses 2016 to 2018 to predict 2019 Environmental Intensity. However, the regressions for Activities of membership organisation n.e.c. (91) and Manufacture of fabricated metal products, except machinery and equipment (28) perform well in both.

Now, let's check whether including 2016 data gives better results for the top five industries ranked by R squared. In each cell below, the first value is for the 2016 to 2018 model and the second for the 2017 and 2018 model.

In [None]:
print(indf5['R2_score'].mean())
print(indf25['R2_score'].mean())

0.9720669285619052
0.9826170978729676


In [None]:
print(indf5['MSE_test'].mean())
print(indf25['MSE_test'].mean())

7.431229530659072e-05
0.0010735819826167424


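For the top five industries, the 2017 and 2018 regressions have a slightly higher mean R2 score (0.983 vs. 0.972) but also a higher mean test MSE (1.07e-03 vs. 7.4e-05), so the two metrics disagree. Note that the two top-five lists contain different industries, so the averages are not computed over the same set.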
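
A like-for-like comparison would pair the two models industry by industry instead of comparing two different top-five lists. The sketch below does this for the ten industries shown in both result tables, with labels and values copied as printed (truncated names are kept truncated):

```python
import pandas as pd

labels = ['Activities auxiliary to financial intermediati...',
          'Activities of membership organisation n.e.c. (91)',
          'Air transport (62)', 'Chemicals nec',
          'Computer and related activities (72)', 'Construction (45)',
          'Extraction of crude petroleum and services rel...',
          'Financial intermediation, except insurance and...',
          'Manufacture of beverages',
          'Manufacture of electrical machinery and appara...']

# R2_score values copied from the two result tables above (first ten rows only).
r2_2016_2018 = pd.Series([-368.570788, 0.981829, 0.243137, 0.793452, 0.807159,
                          0.937546, 0.061144, 0.974641, 0.930428, 0.891901], index=labels)
r2_2017_2018 = pd.Series([-53.299135, 0.979978, 0.055293, 0.949105, -39.568634,
                          0.932321, 0.793735, 0.973692, 0.979018, 0.964291], index=labels)

paired = pd.DataFrame({'R2_2016_2018': r2_2016_2018, 'R2_2017_2018': r2_2017_2018})
paired['improved'] = paired['R2_2017_2018'] > paired['R2_2016_2018']
print(paired)
print('improved with two input years:', int(paired['improved'].sum()), 'of', len(paired))
```

With the full tables, the same pairing could be built by merging `indf` and `indf2` on the `Industries` column.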

# Industry regressions -- Past years Environmental Intensity with growth rate

Next, we add the growth rate of each input year as a further predictor of environmental intensity and check whether the regression performs better.

In [None]:
df_c = df_industry_count4.copy()
def calculateIndustriesGrowthRegressions(outcomeYear, pastYears, df_c):
    industry_regressions = {}
    for i in np.unique(df_c['Industry(Exiobase)']):
        years = sorted(pastYears)  # use the function argument instead of the notebook-level `years`
        for year in years:
            data = df_c[(df_c['Industry(Exiobase)'] == i) & (df_c['Year'] == year)]
            data = data.loc[:,['CompanyName','Env_intensity','Environmental_Growth']]
            data.rename(columns={'Env_intensity': f'Env_intensity_{year}','Environmental_Growth': f'Environmental_Growth_{year}'}, inplace=True) 
            if(year == min(years)):
                data1 = pd.DataFrame(data)
            else:
                data2 = pd.merge(data1, data, on=["CompanyName"])
                data1 = data2.copy()
        # drop companies with a missing intensity or growth value in any input year
        data2 = data2.dropna()
        if (data2.shape[0])> 10:
            data3 = df_c[df_c['Year'] == outcomeYear]
            data3 = data3[['CompanyName','Env_intensity']]
            data3.rename(columns={'Env_intensity': f'Env_intensity_{outcomeYear}'}, inplace=True) 
            data3 = pd.merge(data3, data2, on=["CompanyName"])

            filter_col = [col for col in data3 if ((col.startswith('Env_intensity') and not(col.endswith(f'{outcomeYear}'))))or ((col.startswith('Environmental_Growth_') and not(col.endswith(f'{outcomeYear}'))))]
            outcome_col = [col for col in data3 if (col.startswith('Env_intensity') and col.endswith(f'{outcomeYear}'))]
            X=data3[filter_col]
            y=data3[outcome_col]

            regr = linear_model.LinearRegression()
            X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
            regr.fit(X_train, y_train)
            y_train_pred = regr.predict(X_train)
            y_pred = regr.predict(X_test)
        else: 
            continue
        industry_regressions[i] = {'OutcomeYear':outcomeYear, 'MSE_test': metrics.mean_squared_error(y_test, y_pred),'RMSE_test':sqrt(metrics.mean_squared_error(y_test, y_pred)),'Intercept': regr.intercept_,'Coefficient': regr.coef_, 
                                   'R2_score':metrics.r2_score(y_test, y_pred)}

    data_items = industry_regressions.items()
    data_list = list(data_items)

    df = pd.DataFrame(data_list)

    df=pd.concat([df.drop(columns=df.columns[1], axis=1), df.iloc[:,1].apply(pd.Series)], axis=1)

    return df
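
Because each year's frame carries both an intensity and a growth column, the successive merges interleave the two. The feature matrix, and therefore `regr.coef_`, is ordered intensity 2016, growth 2016, intensity 2017, and so on. A small sketch with one made-up company (`year_frame` is a hypothetical helper):

```python
import pandas as pd

def year_frame(year):
    """One made-up company with both columns for a given year."""
    return pd.DataFrame({
        'CompanyName': ['A'],
        f'Env_intensity_{year}': [-0.01],
        f'Environmental_Growth_{year}': [5.0],
    })

merged = year_frame(2016)
for year in (2017, 2018):
    merged = pd.merge(merged, year_frame(year), on=['CompanyName'])

feature_order = [c for c in merged.columns if c != 'CompanyName']
print(feature_order)
```

Any labels attached to the coefficients afterwards need to follow this interleaved order.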

## Input data from 2016 to 2018

In [None]:
years = [2016, 2017, 2018]
gindf=calculateIndustriesGrowthRegressions(2019, years, df_c)
arr1=np.array(gindf['Coefficient'].to_list())[:,0][:,0]
arr2=np.array(gindf['Coefficient'].to_list())[:,0][:,1]
arr3=np.array(gindf['Coefficient'].to_list())[:,0][:,2]
arr4=np.array(gindf['Coefficient'].to_list())[:,0][:,3]
arr5=np.array(gindf['Coefficient'].to_list())[:,0][:,4]
arr6=np.array(gindf['Coefficient'].to_list())[:,0][:,5]
# coef_ follows the feature order: intensity and growth for 2016, then 2017, then 2018
gindf['Coefficient2016'] = arr1.tolist()
gindf['Coefficientgrowth2016'] = arr2.tolist()
gindf['Coefficient2017'] = arr3.tolist()
gindf['Coefficientgrowth2017'] = arr4.tolist()
gindf['Coefficient2018'] = arr5.tolist()
gindf['Coefficientgrowth2018'] = arr6.tolist()
gindf['Industries']=gindf.iloc[:,0]
first_column = gindf.pop('Industries')
gindf.insert(0, 'Industries', first_column)
gindf['Intercept']=np.array(gindf['Intercept'].to_list())[:,0].tolist()
gindf=gindf.drop(gindf.columns[1],axis=1)#remove repeated industries column
gindf=gindf.drop(columns=['Coefficient'])#remove coefficient column
gindf

Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2016,Coefficientgrowth2016,Coefficient2017,Coefficientgrowth2017,Coefficient2018,Coefficientgrowth2018
0,Activities auxiliary to financial intermediati...,2019,0.008146,0.090253,0.184351,-1806.457281,-76.671009,0.005231518,299.823319,0.01188214,-208.056443,-0.009757
1,Activities of membership organisation n.e.c. (91),2019,0.000182,0.013481,-0.014931,0.984255,-1.441925,-0.0005700398,2.637142,3.600161e-05,-0.193111,-0.001095
2,Air transport (62),2019,0.156553,0.395668,0.004236,-0.483075,3.718195,-0.008604827,-4.32746,-0.007325014,1.546423,-3.9e-05
3,Chemicals nec,2019,0.00052,0.02281,-0.00251,0.743259,-0.791492,-7.529472e-05,0.360464,0.001375325,1.417159,0.000169
4,Computer and related activities (72),2019,4e-06,0.001926,-0.00016,0.715955,0.29275,-5.419177e-06,0.824918,3.782132e-05,-0.269905,-3.6e-05
5,Construction (45),2019,3.3e-05,0.005708,0.002028,0.967457,-0.236089,-7.235401e-06,-0.335368,0.0001002582,1.524905,0.000108
6,Extraction of crude petroleum and services rel...,2019,0.003995,0.063207,-0.088699,0.481081,0.718391,-0.0005110948,-0.124293,-0.002503166,-0.038587,-0.002001
7,"Financial intermediation, except insurance and...",2019,2e-05,0.004494,0.000522,0.596149,-0.48005,2.765622e-05,0.499465,2.569836e-05,0.862389,0.00022
8,Manufacture of beverages,2019,0.010042,0.100209,0.000235,0.958872,0.09205,0.0002111523,-0.651669,-4.401866e-05,1.602156,0.000207
9,Manufacture of electrical machinery and appara...,2019,0.000452,0.021261,-0.004139,0.878469,0.529052,-9.172356e-05,-0.204613,-0.0003618167,0.586138,-0.000247


As before, we check the function against a manual fit for one industry, Air transport (62).

In [None]:
years = [2016, 2017, 2018,2019]
df_c = df_industry_count4.copy()
for year in years:
  data = df_c[df_c['Year'] == year]
  data = data.loc[:,['CompanyName','Env_intensity','Industry(Exiobase)','Environmental_Growth']]
  data.rename(columns={'Env_intensity': f'Env_intensity_{year}','Environmental_Growth': f'Environmental_Growth_{year}'}, inplace=True) 
  if(year == min(years)):
    data1 = pd.DataFrame(data)
  else:
    data2 = pd.merge(data1, data, on=["CompanyName",'Industry(Exiobase)'])
    data1 = data2.copy()
data = data2[data2['Industry(Exiobase)'] == 'Air transport (62)']
# feature order matches the function: intensity and growth for 2016, then 2017, then 2018
X=data[['Env_intensity_2016','Environmental_Growth_2016','Env_intensity_2017','Environmental_Growth_2017','Env_intensity_2018','Environmental_Growth_2018']]
y=data['Env_intensity_2019']
regr = linear_model.LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
regr.fit(X_train, y_train)
y_train_pred = regr.predict(X_train)
y_pred = regr.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (metrics.mean_squared_error(y_train, y_train_pred),
                metrics.mean_squared_error(y_test, y_pred)))
print('R2 score:', metrics.r2_score(y_test, y_pred))
print('Model intercept: ', regr.intercept_)
print('Model coefficients: ', regr.coef_)


MSE train: 0.000, test: 0.157
R2 score: -0.4830752229570636
Model intercept:  0.004235999337293073
Model coefficients:  [ 3.71819540e+00 -8.60482710e-03 -4.32745978e+00 -7.32501446e-03
  1.54642276e+00 -3.88414716e-05]


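The result matches the Air transport (62) row of the table: the R2 score, intercept, and coefficients are identical, and the coefficients appear in the same order (intensity then growth for 2016, 2017, and 2018).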

In [None]:
gindf['R2_score'].describe()

count      30.000000
mean      -60.406912
std       329.784213
min     -1806.457281
25%        -0.002568
50%         0.729607
75%         0.873619
max         0.985070
Name: R2_score, dtype: float64

In [None]:
# five industries with the largest R squared
gindf5=gindf.nlargest(5, ['R2_score'])
gindf5

Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2016,Coefficientgrowth2016,Coefficient2017,Coefficientgrowth2017,Coefficient2018,Coefficientgrowth2018
27,"Recreational, cultural and sporting activities...",2019,3e-06,0.001857,0.000945,0.98507,0.239191,6e-05,-0.377784,1.4e-05,1.078239,1.2e-05
1,Activities of membership organisation n.e.c. (91),2019,0.000182,0.013481,-0.014931,0.984255,-1.441925,-0.00057,2.637142,3.6e-05,-0.193111,-0.001095
5,Construction (45),2019,3.3e-05,0.005708,0.002028,0.967457,-0.236089,-7e-06,-0.335368,0.0001,1.524905,0.000108
10,"Manufacture of fabricated metal products, exce...",2019,2.6e-05,0.005104,-0.000575,0.961553,-0.256612,9.9e-05,1.883722,0.000161,-0.649369,-0.000307
8,Manufacture of beverages,2019,0.010042,0.100209,0.000235,0.958872,0.09205,0.000211,-0.651669,-4.4e-05,1.602156,0.000207


In [None]:
gindf5['Industries'].tolist()

['Recreational, cultural and sporting activities (92)',
 'Activities of membership organisation n.e.c. (91)',
 'Construction (45)',
 'Manufacture of fabricated metal products, except machinery and equipment (28)',
 'Manufacture of beverages']

In [None]:
print(gindf5['R2_score'].mean())
print(indf5['R2_score'].mean())

0.9714415807968415
0.9720669285619052


In [None]:
print(gindf5['MSE_test'].mean())
print(indf5['MSE_test'].mean())

0.002057148325698516
7.431229530659072e-05


Comparing the top five industries of each model, adding growth rates does not help when 2016 to 2018 are used as inputs: the mean R squared is essentially unchanged (0.9714 vs. 0.9721), while the mean test MSE is much higher (2.06e-03 vs. 7.4e-05).
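
A mean over five industries can be dominated by a single one, and MSE is not scale-free: industries with larger intensities naturally have larger squared errors. The sketch below uses the `MSE_test` values from the top-five table of the growth-rate model to show how much one industry contributes:

```python
# MSE_test values copied from the top-five table for the growth-rate model (gindf5).
mse_top5 = {
    'Recreational, cultural and sporting activities (92)': 3e-06,
    'Activities of membership organisation n.e.c. (91)': 0.000182,
    'Construction (45)': 3.3e-05,
    'Manufacture of fabricated metal products, except machinery and equipment (28)': 2.6e-05,
    'Manufacture of beverages': 0.010042,
}

total = sum(mse_top5.values())
largest = max(mse_top5, key=mse_top5.get)
share = mse_top5[largest] / total
print(f'mean MSE_test: {total / len(mse_top5):.6f}')
print(f'{largest}: {share:.1%} of the total')
```

Almost all of the mean comes from one industry, so the averaged MSE says little about the other four.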

Next, we repeat the comparison using only 2017 and 2018 as inputs.

## Input data from 2017 and 2018

In [None]:
years = [2017, 2018]
gindf2=calculateIndustriesGrowthRegressions(2019, years, df_c)
arr1=np.array(gindf2['Coefficient'].to_list())[:,0][:,0]
arr2=np.array(gindf2['Coefficient'].to_list())[:,0][:,1]
arr3=np.array(gindf2['Coefficient'].to_list())[:,0][:,2]
arr4=np.array(gindf2['Coefficient'].to_list())[:,0][:,3]
# coef_ follows the feature order: intensity 2017, growth 2017, intensity 2018, growth 2018
gindf2['Coefficient2017'] = arr1.tolist()
gindf2['Coefficientgrowth2017'] = arr2.tolist()
gindf2['Coefficient2018'] = arr3.tolist()
gindf2['Coefficientgrowth2018'] = arr4.tolist()
gindf2['Industries']=gindf2.iloc[:,0]
first_column = gindf2.pop('Industries')
gindf2.insert(0, 'Industries', first_column)
gindf2['Intercept']=np.array(gindf2['Intercept'].to_list())[:,0].tolist()
gindf2=gindf2.drop(gindf2.columns[1],axis=1)#remove repeated industries column
gindf2=gindf2.drop(columns=['Coefficient'])#remove coefficient column
gindf2


Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2017,Coefficientgrowth2017,Coefficient2018,Coefficientgrowth2018
0,Activities auxiliary to financial intermediati...,2019,0.003648323,0.060401,0.07374,-808.539089,-1.821584,0.004127,2.688033,-0.002260783
1,Activities of membership organisation n.e.c. (91),2019,0.02300809,0.151684,0.003164,-1.641548,1.866384,7.9e-05,-0.724753,-0.002402852
2,Air transport (62),2019,0.0526227,0.229396,-0.010415,-0.217167,0.133816,-0.000368,0.841349,-0.001390586
3,Chemicals nec,2019,0.0003869982,0.019672,0.007529,0.830493,0.011047,0.000518,0.98701,-1.476433e-05
4,Computer and related activities (72),2019,1.92434e-06,0.001387,-0.002482,0.903975,2.89421,-1.1e-05,-2.455119,-0.0001511477
5,Construction (45),2019,0.0003310563,0.018195,0.000552,0.932729,-1.04467,5.8e-05,1.89511,0.0002202825
6,Extraction of crude petroleum and services rel...,2019,0.007255946,0.085182,-0.085129,0.749181,0.346909,-9.6e-05,0.185284,-0.001494215
7,"Financial intermediation, except insurance and...",2019,0.0001593699,0.012624,0.000447,0.984304,0.269207,-1.7e-05,0.623544,0.0001470719
8,Manufacture of beverages,2019,0.0004419836,0.021023,0.007999,0.998192,0.015528,1e-05,1.241038,3.335071e-05
9,Manufacture of electrical machinery and appara...,2019,0.000120417,0.010973,0.002729,0.965995,0.461843,0.000133,0.580764,-0.0001329785


In [None]:
gindf2['R2_score'].describe()

count     31.000000
mean     -26.029181
std      145.256397
min     -808.539089
25%        0.375208
50%        0.831757
75%        0.961589
max        0.998192
Name: R2_score, dtype: float64

In [None]:
# five industries with the largest R squared
gindf25=gindf2.nlargest(5, ['R2_score'])
gindf25

Unnamed: 0,Industries,OutcomeYear,MSE_test,RMSE_test,Intercept,R2_score,Coefficient2017,Coefficientgrowth2017,Coefficient2018,Coefficientgrowth2018
8,Manufacture of beverages,2019,0.0004419836,0.021023,0.007999,0.998192,0.015528,1e-05,1.241038,3.335071e-05
28,Renting of machinery and equipment without ope...,2019,3.115047e-07,0.000558,-0.000465,0.99179,-1.200695,-1e-05,2.116699,0.0001007476
11,Manufacture of machinery and equipment n.e.c. ...,2019,1.706126e-05,0.004131,0.003181,0.990599,0.594594,5.6e-05,0.561005,-4.094574e-05
15,"Manufacture of radio, television and communica...",2019,3.195515e-06,0.001788,-0.000127,0.989351,-0.004038,-2.2e-05,0.981032,8.940465e-07
19,Other service activities (93),2019,0.0001165047,0.010794,-0.006368,0.986606,-0.089004,0.0002,0.956525,0.000199027


In [None]:
gindf25['Industries'].tolist()

['Manufacture of beverages',
 'Renting of machinery and equipment without operator and of personal and household goods (71)',
 'Manufacture of machinery and equipment n.e.c. (29)',
 'Manufacture of radio, television and communication equipment and apparatus (32)',
 'Other service activities (93)']

In [None]:
print(gindf25['R2_score'].mean())
print(gindf5['R2_score'].mean())

0.9913073627104925
0.9714415807968415


In [None]:
print(gindf25['MSE_test'].mean())
print(gindf5['MSE_test'].mean())

0.00011581132909609158
0.002057148325698516


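Accordingly, for the top five industries of the growth-rate models, the regression with 2017 and 2018 data performs better than the regression with 2016 to 2018 data: the mean R2 score is higher (0.991 vs. 0.971) and the mean test MSE is lower (1.2e-04 vs. 2.06e-03).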

In [None]:
print(gindf25['R2_score'].mean())
print(indf25['R2_score'].mean())

0.9913073627104925
0.9826170978729676


In [None]:
print(gindf25['MSE_test'].mean())
print(indf25['MSE_test'].mean())

0.00011581132909609158
0.0010735819826167424


For the top five industries ranked by R2 score with 2017 and 2018 as inputs, adding growth rates improves the regressions: the mean R2 score is higher (0.991 vs. 0.983) and the mean test MSE is lower (1.2e-04 vs. 1.07e-03).

# Conclusion

* Among the top five industries ranked by R2 score, regressions that use 2017 and 2018 to forecast 2019 generally do better than those that use 2016 to 2018. With growth rates, both the mean R2 score and the mean test MSE improve. Without growth rates, the mean R2 score is higher but the mean test MSE is also higher.

* Adding growth rates to the independent variables improves the top-five averages when 2017 and 2018 are the inputs, but not when 2016 to 2018 are the inputs.

* Performance differs hugely across industries, with R2 scores ranging from strongly negative to above 0.99. The model for "Manufacture of fabricated metal products, except machinery and equipment (28)" is among the most consistent: it ranks in the top five by R2 score in three of the four configurations.
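
How consistently an industry ranks near the top can be counted from the four top-five lists printed in this notebook. A short sketch, with the names copied from those lists:

```python
from collections import Counter

metal = 'Manufacture of fabricated metal products, except machinery and equipment (28)'
member = 'Activities of membership organisation n.e.c. (91)'
beverages = 'Manufacture of beverages'
machinery = 'Manufacture of machinery and equipment n.e.c. (29)'
radio = 'Manufacture of radio, television and communication equipment and apparatus (32)'

# Top-five lists copied from the four result tables in this notebook.
top5 = {
    'intensity, 2016-2018': [
        member, 'Research and development (73)',
        'Financial intermediation, except insurance and pension funding (65)',
        metal, machinery],
    'intensity, 2017-2018': [
        metal, radio, member, beverages, 'Real estate activities (70)'],
    'intensity + growth, 2016-2018': [
        'Recreational, cultural and sporting activities (92)',
        member, 'Construction (45)', metal, beverages],
    'intensity + growth, 2017-2018': [
        beverages,
        'Renting of machinery and equipment without operator and of personal and household goods (71)',
        machinery, radio, 'Other service activities (93)'],
}

appearances = Counter(name for names in top5.values() for name in names)
for name, n in appearances.most_common():
    print(n, name)
```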

We will continue our analysis in the 'PredictingTimeSeries - GHG Scope' notebook.