# Fixed Effect Regressions (Industry and Country) to Explain Environmental Intensity (Environmental Costs/Sales) in Python


In this notebook, we estimate three fixed-effects regressions, based on:

1) Industry
2) Country
3) Industry and Country

First, we will clean the dataset and select the columns needed for our analysis. Let's start by importing the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn import datasets, linear_model
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
import seaborn as sns
import matplotlib.pyplot as plt

We read in the dataframe and remove the spaces from the column names.

In [2]:
df = pd.read_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Sprint #5 - Midterm presentations/Final-Sample-External-with-ISINs.csv')

# Remove spaces from every column name
df.columns = [column.replace(' ', '') for column in df.columns]
df.head()

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed
0,GB00BMX64W89,2019,Saga plc,United Kingdom,Activities auxiliary to financial intermediati...,-2.89%,-13.03%,"(31,842,309)","(31,150,754)","(7,184)",...,"(170,776)","(1,059)",(5),(1),"(3,585)",(6),71,71,"(1,297)",1%
1,MYL1818OO003,2019,BURSA MALAYSIA BHD,Malaysia,Activities auxiliary to financial intermediati...,-1.68%,-3.47%,"(1,968,379)","(1,924,910)",(451),...,"(11,502)",(168),(1),(1),(222),(2),10,10,(79),4%
2,GB0031638363,2019,INTERTEK GROUP PLC,United Kingdom,Activities auxiliary to financial intermediati...,-1.53%,-9.49%,"(60,599,272)","(59,281,663)","(13,774)",...,"(324,960)","(3,804)",(17),(4),"(6,861)",(20),254,254,"(2,470)",1%
3,ZAE000079711,2019,JSE LIMITED,South Africa,Activities auxiliary to financial intermediati...,-1.46%,,"(2,290,124)","(2,239,814)",(510),...,"(12,200)",(901),(0),(1),(253),(0),(3),(3),(93),2%
4,FR0006174348,2019,BUREAU VERITAS SA,France,Activities auxiliary to financial intermediati...,-0.70%,-5.10%,"(39,978,650)","(39,107,612)","(9,330)",...,"(214,438)","(4,116)",(38),(9),"(4,607)",(45),586,586,"(1,633)",3%


We are only interested in the columns Year, Country, Industry(Exiobase), EnvironmentalIntensity(Sales), and TotalEnvironmentalCost. We also convert the last two columns from strings to numbers: the percentage becomes a float and the parenthesized cost becomes a signed integer.

In [3]:
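# Work on an explicit copy so the column assignments below do not act on a slice
df = df[['Year', 'Country', 'Industry(Exiobase)', 'EnvironmentalIntensity(Sales)', 'TotalEnvironmentalCost']].copy()

def percent_to_float(s):
    # '-2.89%' -> -0.0289
    return float(s.strip('%')) / 100.0

replace_dict = {'(':'',')':'', ' ' : '', ',' : ''}
def paranthesis_to_minus(value):
    # Accounting notation: '(31,842,309)' -> -31842309.
    # Values without parentheses are kept positive instead of being negated.
    negative = '(' in value
    for i, j in replace_dict.items():
        value = value.replace(i, j)
    value = int(value)
    return -value if negative else value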

df['EnvironmentalIntensity(Sales)'] = df['EnvironmentalIntensity(Sales)'].apply(percent_to_float)
df['TotalEnvironmentalCost'] = df['TotalEnvironmentalCost'].apply(paranthesis_to_minus)
df['Revenue'] = df['TotalEnvironmentalCost'] / df['EnvironmentalIntensity(Sales)']
df.rename(columns={'Industry(Exiobase)':'Ind','EnvironmentalIntensity(Sales)' : 'Environmental_Intensity'},inplace=True)


print(f'The dataset contains information for {len(df.Country.unique())} countries and {len(df.Ind.unique())} industries from {df.Year.min()} to {df.Year.max()}')
df.head()

The dataset contains information for 129 countries and 111 industries from 2010 to 2019


Unnamed: 0,Year,Country,Ind,Environmental_Intensity,TotalEnvironmentalCost,Revenue
0,2019,United Kingdom,Activities auxiliary to financial intermediati...,-0.0289,-31842309,1101810000.0
1,2019,Malaysia,Activities auxiliary to financial intermediati...,-0.0168,-1968379,117165400.0
2,2019,United Kingdom,Activities auxiliary to financial intermediati...,-0.0153,-60599272,3960737000.0
3,2019,South Africa,Activities auxiliary to financial intermediati...,-0.0146,-2290124,156857800.0
4,2019,France,Activities auxiliary to financial intermediati...,-0.007,-39978650,5711236000.0
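The two string conversions used in the cell above can be checked on a few hand-written inputs. The following is a minimal sketch, not the notebook's own code: the helper names `parse_percent` and `parse_accounting` are hypothetical, and it assumes that parentheses mark negative values (accounting notation), as the mix of `71` and `(3)` in the SDG columns of the raw file suggests.

```python
def parse_percent(text):
    """Convert a percentage string such as '-2.89%' to a fraction."""
    return float(text.strip().rstrip('%')) / 100.0


def parse_accounting(text):
    """Convert '(31,842,309)' to -31842309 and a plain '71' to 71."""
    cleaned = text.strip().replace(',', '').replace(' ', '')
    if cleaned.startswith('(') and cleaned.endswith(')'):
        # Parentheses denote a negative amount
        return -int(cleaned[1:-1])
    return int(cleaned)


# Sample strings copied from the raw table shown earlier
samples = ['(31,842,309)', '(451)', '71', '(0)']
parsed = [parse_accounting(s) for s in samples]
print(parsed)
print(parse_percent('-2.89%'))
```

Checking the sign explicitly matters because Revenue is computed as cost divided by intensity, so a wrongly signed cost would flip the sign of the derived revenue.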


We proceed to encode industry and country as integer category codes, which we later expand into dummy variables.

In [4]:
df.Ind = df.Ind.astype('category')
df.Country = df.Country.astype('category')
df['Ind_cat'] = df.Ind.cat.codes
df['Country_cat'] = df.Country.cat.codes
df.head(3)

Unnamed: 0,Year,Country,Ind,Environmental_Intensity,TotalEnvironmentalCost,Revenue,Ind_cat,Country_cat
0,2019,United Kingdom,Activities auxiliary to financial intermediati...,-0.0289,-31842309,1101810000.0,55,127
1,2019,Malaysia,Activities auxiliary to financial intermediati...,-0.0168,-1968379,117165400.0,55,98
2,2019,United Kingdom,Activities auxiliary to financial intermediati...,-0.0153,-60599272,3960737000.0,55,127
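Because the text columns are dropped later, it can help to keep a lookup from each integer code back to its label. A minimal sketch on toy data (the DataFrame `toy` and the dictionary `code_to_label` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy data; the country names are taken from the sample rows above
toy = pd.DataFrame({'Country': ['United Kingdom', 'Malaysia', 'United Kingdom', 'France']})
toy['Country'] = toy['Country'].astype('category')
toy['Country_cat'] = toy['Country'].cat.codes

# For string data the categories are sorted, and each code is the
# position of the label in that sorted list
code_to_label = dict(enumerate(toy['Country'].cat.categories))
print(toy)
print(code_to_label)
```

With such a dictionary, a coefficient name like `Country_cat_2` can be mapped back to the country it refers to when reading the regression tables.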


We end up with a dataset with the following variables:

* Year = Year of data
* Country_cat = Country code created to represent each unique country
    - We will drop the 'Country' column and use this one instead
* Ind_cat = Industry code (0-110) created to represent each unique industry
    - We will drop the 'Ind' column and use this one instead
* Environmental_Intensity = Environmental Costs/Sales for each year
* TotalEnvironmentalCost = Total Environmental Costs in US dollars
* Revenue = Sales in US dollars for each year (derived from "Environmental_Intensity" and "TotalEnvironmentalCost")



We require at least three companies in each industry so that the fixed effect is representative and useful. With only one company in an industry, the industry fixed effect would simply reproduce that company's value and would not really be applicable.

In [5]:
grouped_by = df.groupby('Ind')[['Ind_cat']].count().reset_index()
grouped_by.columns = ['Industry','Number_of_companies']
# Industries that fall short of the three-company minimum
data = grouped_by.loc[grouped_by.Number_of_companies < 3].sort_values(by='Number_of_companies')
data.head()

Unnamed: 0,Industry,Number_of_companies
63,Cultivation of cereal grains nec,1
68,"Forestry, logging and related service activiti...",1
64,Education (80),2
107,Sea and coastal water transport,2


We won't consider these four industries in our model, which leaves 107 industries.

In [6]:
industries = ['Cultivation of cereal grains nec','Forestry, logging and related service activities (02)','Education (80)','Sea and coastal water transport']
df = df[~df.Ind.isin(industries)]
print(f'Now, we are only considering {len(df.Ind.unique())} industries for fixed effects')

Now, we are only considering 107 industries for fixed effects
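The same filter can be written without typing the industry names by hand. This is a sketch on toy data, assuming a hypothetical threshold variable `min_obs`; it keeps every row whose industry appears at least `min_obs` times.

```python
import pandas as pd

# Toy data: industry 'B' has a single row and should be dropped
toy = pd.DataFrame({
    'Ind': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C'],
    'Environmental_Intensity': [-0.10, -0.12, -0.08, -0.50, -0.02, -0.03, -0.01, -0.02],
})

min_obs = 3  # hypothetical threshold, matching the "at least three" rule
# Number of rows per industry, aligned to each row of the frame
counts = toy['Ind'].map(toy['Ind'].value_counts())
filtered = toy[counts >= min_obs].copy()

kept = sorted(filtered['Ind'].unique())
print(kept, len(filtered))
```

Deriving the list from the counts avoids mismatches when an industry name is truncated in the printed table, as happens with the forestry label above.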


Now, we will proceed to work with the fixed effects models. 

## Baseline Regression
* Estimate a regression: *Environmental_Intensity = constant*
* The coefficient estimate will give the average Environmental Intensity for the 14,509 observations

In [7]:
df['constant'] = 1

X = df[['constant']]
y = df['Environmental_Intensity']

sm.OLS(y, X).fit().summary()

0,1,2,3
Dep. Variable:,Environmental_Intensity,R-squared:,-0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,
Date:,"Wed, 05 May 2021",Prob (F-statistic):,
Time:,16:12:29,Log-Likelihood:,-1651.2
No. Observations:,14509,AIC:,3304.0
Df Residuals:,14508,BIC:,3312.0
Df Model:,0,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
constant,-0.1131,0.002,-50.235,0.000,-0.117,-0.109

0,1,2,3
Omnibus:,10336.188,Durbin-Watson:,0.515
Prob(Omnibus):,0.0,Jarque-Bera (JB):,178604.594
Skew:,-3.29,Prob(JB):,0.0
Kurtosis:,18.879,Cond. No.,1.0


As expected, the R-squared and Adj. R-squared are zero because the model has no features that could explain the environmental intensity.

From this regression, we are interested in the constant coefficient, which tells us that the average environmental intensity is around -0.1131. With no explanatory variables, the constant is simply the sample mean of the dependent variable.
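That an intercept-only least-squares fit returns the sample mean, with an R-squared of zero, can be verified numerically. A minimal sketch using NumPy on synthetic data (the values are random stand-ins, not the notebook's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for Environmental_Intensity
y = rng.normal(loc=-0.11, scale=0.27, size=1000)
X = np.ones((y.size, 1))  # a single column of ones: the constant

# Ordinary least squares with only the constant
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(beta[0], y.mean(), r_squared)
```

The fitted value is the same for every observation, so the residual sum of squares equals the total sum of squares and nothing is left to be "explained".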

## Industry fixed effect

Now we turn to the industry fixed effect to see how much of the variation in environmental intensity can be explained by a company's industry.


We will drop unnecessary columns and create dummy variables for the industries.

In [8]:
df.drop(columns=['Country','Ind'], inplace=True)
df_ind = df.copy()
# dtype=int keeps the dummies as 0/1 integers regardless of the pandas version
df_ind = pd.get_dummies(df_ind, columns=['Ind_cat'], dtype=int)
df_ind

Unnamed: 0,Year,Environmental_Intensity,TotalEnvironmentalCost,Revenue,Country_cat,constant,Ind_cat_0,Ind_cat_1,Ind_cat_2,Ind_cat_3,...,Ind_cat_100,Ind_cat_101,Ind_cat_102,Ind_cat_103,Ind_cat_104,Ind_cat_105,Ind_cat_106,Ind_cat_108,Ind_cat_109,Ind_cat_110
0,2019,-0.0289,-31842309,1.101810e+09,127,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2019,-0.0168,-1968379,1.171654e+08,98,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2019,-0.0153,-60599272,3.960737e+09,127,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2019,-0.0146,-2290124,1.568578e+08,116,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019,-0.0070,-39978650,5.711236e+09,78,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14510,2010,-0.0171,-259674701,1.518566e+10,28,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14511,2010,-0.0139,-164612070,1.184259e+10,28,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14512,2010,-0.0101,-38125940,3.774846e+09,61,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14513,2010,-0.0042,-21863235,5.205532e+09,50,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Select the industry dummies by name rather than by column position
X = df_ind.filter(like='Ind_cat_')
y = df_ind['Environmental_Intensity']


sm.OLS(y, X).fit().summary()

0,1,2,3
Dep. Variable:,Environmental_Intensity,R-squared:,0.431
Model:,OLS,Adj. R-squared:,0.427
Method:,Least Squares,F-statistic:,103.1
Date:,"Wed, 05 May 2021",Prob (F-statistic):,0.0
Time:,16:12:29,Log-Likelihood:,2444.7
No. Observations:,14509,AIC:,-4675.0
Df Residuals:,14402,BIC:,-3864.0
Df Model:,106,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Ind_cat_0,-0.0052,0.016,-0.329,0.742,-0.036,0.026
Ind_cat_1,-0.1176,0.018,-6.456,0.000,-0.153,-0.082
Ind_cat_2,-0.1955,0.019,-10.306,0.000,-0.233,-0.158
Ind_cat_3,-0.1418,0.013,-10.639,0.000,-0.168,-0.116
Ind_cat_4,-0.1125,0.036,-3.102,0.002,-0.184,-0.041
Ind_cat_5,-0.0075,0.012,-0.636,0.525,-0.030,0.016
Ind_cat_6,-0.0424,0.010,-4.335,0.000,-0.062,-0.023
Ind_cat_7,-0.3120,0.059,-5.267,0.000,-0.428,-0.196
Ind_cat_8,-0.0162,0.103,-0.158,0.875,-0.217,0.185

0,1,2,3
Omnibus:,7318.999,Durbin-Watson:,0.836
Prob(Omnibus):,0.0,Jarque-Bera (JB):,299578.443
Skew:,-1.743,Prob(JB):,0.0
Kurtosis:,24.986,Cond. No.,14.1


From these summary results, we can see that about 43% of the variation in environmental intensity can be explained by the company's industry. These results highlight the link between a firm's business operations and its environmental intensity.

This suggests that the industry in which a company operates is a primary factor in determining the amount of pollution it puts out into the environment.
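A regression on a full set of group dummies fits each observation with its group mean, so the R-squared equals the share of total variance that lies between groups. A minimal NumPy sketch on synthetic data (four made-up "industries", not the notebook's dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(4), 100)            # 4 synthetic industries, 100 rows each
group_effect = np.array([-0.30, -0.10, -0.05, -0.01])
y = group_effect[groups] + rng.normal(scale=0.05, size=groups.size)

D = np.eye(4)[groups]  # one dummy column per group, no separate constant
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

# Each dummy coefficient is the mean of y within that group
group_means = np.array([y[groups == g].mean() for g in range(4)])

fitted = D @ beta
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - np.sum((y - fitted) ** 2) / ss_tot
between_share = np.sum((group_means[groups] - y.mean()) ** 2) / ss_tot

print(beta, r_squared, between_share)
```

Read this way, each `Ind_cat_*` coefficient in the table above is the average environmental intensity of that industry.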

## Country fixed effect

Now we turn to the country fixed effect to see how much of the variation in environmental intensity can be explained by the country in which a company is located.

In [10]:
df_ctry = df.copy()
# dtype=int keeps the dummies as 0/1 integers regardless of the pandas version
df_ctry = pd.get_dummies(df_ctry, columns=['Country_cat'], dtype=int)
df_ctry

Unnamed: 0,Year,Environmental_Intensity,TotalEnvironmentalCost,Revenue,Ind_cat,constant,Country_cat_0,Country_cat_1,Country_cat_2,Country_cat_3,...,Country_cat_119,Country_cat_120,Country_cat_121,Country_cat_122,Country_cat_123,Country_cat_124,Country_cat_125,Country_cat_126,Country_cat_127,Country_cat_128
0,2019,-0.0289,-31842309,1.101810e+09,55,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,2019,-0.0168,-1968379,1.171654e+08,55,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2019,-0.0153,-60599272,3.960737e+09,55,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,2019,-0.0146,-2290124,1.568578e+08,55,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019,-0.0070,-39978650,5.711236e+09,55,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14510,2010,-0.0171,-259674701,1.518566e+10,53,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14511,2010,-0.0139,-164612070,1.184259e+10,53,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14512,2010,-0.0101,-38125940,3.774846e+09,54,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14513,2010,-0.0042,-21863235,5.205532e+09,54,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# Select the country dummies by name rather than by column position
X = df_ctry.filter(like='Country_cat_')
y = df_ctry['Environmental_Intensity']


sm.OLS(y, X).fit().summary()

0,1,2,3
Dep. Variable:,Environmental_Intensity,R-squared:,0.078
Model:,OLS,Adj. R-squared:,0.07
Method:,Least Squares,F-statistic:,9.524
Date:,"Wed, 05 May 2021",Prob (F-statistic):,1.0299999999999999e-168
Time:,16:12:29,Log-Likelihood:,-1060.9
No. Observations:,14509,AIC:,2380.0
Df Residuals:,14380,BIC:,3358.0
Df Model:,128,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Country_cat_0,-0.0139,0.092,-0.150,0.880,-0.195,0.167
Country_cat_1,-0.1433,0.015,-9.802,0.000,-0.172,-0.115
Country_cat_2,-0.1110,0.038,-2.941,0.003,-0.185,-0.037
Country_cat_3,-0.0748,0.044,-1.717,0.086,-0.160,0.011
Country_cat_4,0.3784,0.099,3.829,0.000,0.185,0.572
Country_cat_5,-0.0928,0.029,-3.155,0.002,-0.150,-0.035
Country_cat_6,-0.1295,0.017,-7.577,0.000,-0.163,-0.096
Country_cat_7,-0.4778,0.056,-8.570,0.000,-0.587,-0.369
Country_cat_8,-0.1421,0.047,-3.026,0.002,-0.234,-0.050

0,1,2,3
Omnibus:,9958.732,Durbin-Watson:,0.619
Prob(Omnibus):,0.0,Jarque-Bera (JB):,174659.273
Skew:,-3.107,Prob(JB):,0.0
Kurtosis:,18.821,Cond. No.,38.4


From these summary results, we can see that only about 8% of the variation in environmental intensity (adjusted R-squared of 0.07) can be explained by the company's country. Based on these results, the country in which a company operates is not a primary factor in determining the amount of pollution it puts out into the environment.

However, what if we consider country and industry together? Can these two explanatory variables better explain firms' environmental intensity? Let's see!

## Industry and Country fixed effect - Combined

In [12]:
df_merged = df.copy()
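# dtype=int keeps the dummies as 0/1 integers regardless of the pandas version
df_merged = pd.get_dummies(df_merged, columns=['Country_cat'], dtype=int)
df_merged = pd.get_dummies(df_merged, columns=['Ind_cat'], dtype=int)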
df_merged.head()

Unnamed: 0,Year,Environmental_Intensity,TotalEnvironmentalCost,Revenue,constant,Country_cat_0,Country_cat_1,Country_cat_2,Country_cat_3,Country_cat_4,...,Ind_cat_100,Ind_cat_101,Ind_cat_102,Ind_cat_103,Ind_cat_104,Ind_cat_105,Ind_cat_106,Ind_cat_108,Ind_cat_109,Ind_cat_110
0,2019,-0.0289,-31842309,1101810000.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2019,-0.0168,-1968379,117165400.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2019,-0.0153,-60599272,3960737000.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2019,-0.0146,-2290124,156857800.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019,-0.007,-39978650,5711236000.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
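# Select both dummy sets by name rather than by column position.
# Note: both full sets are included, so the columns of X are perfectly collinear.
X = df_merged.filter(regex='^(Country_cat_|Ind_cat_)')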
y = df_merged['Environmental_Intensity']

sm.OLS(y, X).fit().summary()

0,1,2,3
Dep. Variable:,Environmental_Intensity,R-squared:,0.474
Model:,OLS,Adj. R-squared:,0.465
Method:,Least Squares,F-statistic:,55.17
Date:,"Wed, 05 May 2021",Prob (F-statistic):,0.0
Time:,16:12:30,Log-Likelihood:,3007.2
No. Observations:,14509,AIC:,-5546.0
Df Residuals:,14275,BIC:,-3772.0
Df Model:,233,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Country_cat_0,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_1,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_2,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_3,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_4,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_5,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_6,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_7,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11
Country_cat_8,1.307e+11,4.25e+11,0.307,0.759,-7.03e+11,9.65e+11

0,1,2,3
Omnibus:,6845.795,Durbin-Watson:,0.914
Prob(Omnibus):,0.0,Jarque-Bera (JB):,279762.037
Skew:,-1.576,Prob(JB):,0.0
Kurtosis:,24.28,Cond. No.,926000000000000.0


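From these summary results, we can see that about 47% of the variation in environmental intensity can be explained by company industry and country together, only slightly more than the 43% explained by industry alone. These results highlight the link between a firm's business operations, its location, and its environmental intensity.

The individual coefficients in this table (all around 1.3e+11, with a condition number near 9e+14) should not be interpreted. The country dummies sum to one in every row, and so do the industry dummies, which makes the design matrix perfectly collinear. The fitted values, and hence the R-squared, do not depend on which of the equivalent coefficient vectors is reported.

This suggests that the industry and country in which a company operates are primary factors in determining the amount of pollution it puts out into the environment.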
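Very large, identical coefficients together with a huge condition number are typical of including two complete sets of dummies in one regression. The sketch below reproduces the effect on a tiny made-up design with NumPy and shows one common remedy, dropping one dummy from one of the sets. It is an illustration under these assumptions, not a rerun of the notebook's model.

```python
import numpy as np

# Made-up assignment of 8 observations to 3 countries and 2 industries
country = np.array([0, 0, 1, 1, 2, 2, 0, 1])
industry = np.array([0, 1, 0, 1, 0, 1, 1, 0])
y = np.array([-0.10, -0.30, -0.12, -0.28, -0.05, -0.20, -0.33, -0.09])

C = np.eye(3)[country]    # full set of country dummies
I = np.eye(2)[industry]   # full set of industry dummies

X_full = np.hstack([C, I])          # both sets sum to one: collinear
X_drop = np.hstack([C, I[:, 1:]])   # drop the first industry dummy

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

# The fitted values agree even though the coefficients differ
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
fit_full = X_full @ beta_full
fit_drop = X_drop @ beta_drop

print(rank_full, X_full.shape[1], rank_drop, X_drop.shape[1])
```

In pandas, passing `drop_first=True` to `pd.get_dummies` for one of the two sets has the same effect; the remaining coefficients are then measured relative to the dropped category.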

### We can add the other models you worked on in this area