## Environmental Impact dataset - Data Cleaning

In this notebook, we clean the format of a few columns in the Environmental Impact dataset retrieved from the Harvard study.

We will continue using this dataset in several notebooks.

In [1]:
import pandas as pd

df = pd.read_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/Final-Sample-External-with-ISINs.csv')
# Remove spaces from the column names so they are easier to reference
df.columns = df.columns.str.replace(' ', '', regex=False)
df.head()

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed
0,GB00BMX64W89,2019,Saga plc,United Kingdom,Activities auxiliary to financial intermediati...,-2.89%,-13.03%,-31842309,-31150754,-7184,...,-170776,-1059,-5,-1,-3585,-6,71,71,-1297,1%
1,MYL1818OO003,2019,BURSA MALAYSIA BHD,Malaysia,Activities auxiliary to financial intermediati...,-1.68%,-3.47%,-1968379,-1924910,-451,...,-11502,-168,-1,-1,-222,-2,10,10,-79,4%
2,GB0031638363,2019,INTERTEK GROUP PLC,United Kingdom,Activities auxiliary to financial intermediati...,-1.53%,-9.49%,-60599272,-59281663,-13774,...,-324960,-3804,-17,-4,-6861,-20,254,254,-2470,1%
3,ZAE000079711,2019,JSE LIMITED,South Africa,Activities auxiliary to financial intermediati...,-1.46%,,-2290124,-2239814,-510,...,-12200,-901,0,-1,-253,0,-3,-3,-93,2%
4,FR0006174348,2019,BUREAU VERITAS SA,France,Activities auxiliary to financial intermediati...,-0.70%,-5.10%,-39978650,-39107612,-9330,...,-214438,-4116,-38,-9,-4607,-45,586,586,-1633,3%


The company name and country values have spaces at the beginning and end, so we strip them.

In [2]:
# Strip leading and trailing whitespace from the text columns
df['CompanyName'] = df['CompanyName'].str.strip()
df['Country'] = df['Country'].str.strip()

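Also, the environmental intensity values are stored as percentage strings (for example, `-2.89%`) rather than numbers. We will fix that in the next cells.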

In [3]:
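def percent_to_float(s):
    # Convert a percentage string such as '-2.89%' to a fraction (-0.0289)
    return float(s.strip('%')) / 100.0

# Characters to remove from accounting-style values such as '(1,234)'
replace_dict = {'(': '', ')': '', ' ': '', ',': ''}

def paranthesis_to_minus(value):
    # Convert an accounting-style negative such as '(1,234)' to the integer -1234
    # (defined here but not applied in this notebook)
    for old, new in replace_dict.items():
        value = value.replace(old, new)
    return int(f'-{value}')

df['Env_intensity'] = df['EnvironmentalIntensity(Sales)'].apply(percent_to_float)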
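
The `EnvironmentalIntensity(OpInc)` column has missing values (see the JSE LIMITED row in the first output), and `percent_to_float` would fail on them because a missing value is not a string. As a rough sketch, assuming the column holds only percentage strings or missing values, a vectorized conversion could look like the following. The helper name `percent_series_to_float` is hypothetical and not part of the notebook.

```python
import pandas as pd

def percent_series_to_float(series):
    # Hypothetical helper: strip the trailing '%' and convert to a fraction.
    # Missing or unparseable entries become NaN instead of raising an error.
    stripped = series.str.rstrip('%')
    return pd.to_numeric(stripped, errors='coerce') / 100.0

# Sample values copied from the first rows of EnvironmentalIntensity(OpInc)
sample = pd.Series(['-13.03%', '-3.47%', '-9.49%', None, '-5.10%'])
converted = percent_series_to_float(sample)
print(converted)
```

The same helper could be applied to `EnvironmentalIntensity(Sales)` or `%Imputed`, since both use the same percentage format in the output shown.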

In [4]:
df.head()

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,Env_intensity
0,GB00BMX64W89,2019,Saga plc,United Kingdom,Activities auxiliary to financial intermediati...,-2.89%,-13.03%,-31842309,-31150754,-7184,...,-1059,-5,-1,-3585,-6,71,71,-1297,1%,-0.0289
1,MYL1818OO003,2019,BURSA MALAYSIA BHD,Malaysia,Activities auxiliary to financial intermediati...,-1.68%,-3.47%,-1968379,-1924910,-451,...,-168,-1,-1,-222,-2,10,10,-79,4%,-0.0168
2,GB0031638363,2019,INTERTEK GROUP PLC,United Kingdom,Activities auxiliary to financial intermediati...,-1.53%,-9.49%,-60599272,-59281663,-13774,...,-3804,-17,-4,-6861,-20,254,254,-2470,1%,-0.0153
3,ZAE000079711,2019,JSE LIMITED,South Africa,Activities auxiliary to financial intermediati...,-1.46%,,-2290124,-2239814,-510,...,-901,0,-1,-253,0,-3,-3,-93,2%,-0.0146
4,FR0006174348,2019,BUREAU VERITAS SA,France,Activities auxiliary to financial intermediati...,-0.70%,-5.10%,-39978650,-39107612,-9330,...,-4116,-38,-9,-4607,-45,586,586,-1633,3%,-0.007


The dataset now looks much cleaner, and we can continue our analysis with it. Before exporting the dataframe to the dataset folder, where future notebooks will access it, we add a few new columns.

## Creating the Industry Average Year column

In [5]:
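# Summary table: mean environmental intensity per industry
industry_avg = df.groupby('Industry(Exiobase)')[['Env_intensity']].mean().reset_index()
# Same industry mean, broadcast back to every row of the dataframe
df['industry_avg'] = df.groupby('Industry(Exiobase)')['Env_intensity'].transform('mean')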

In [6]:
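def create_ind_year(row):
    # Compare a company's intensity with its industry-year average:
    # 1 = above the average, 0 = equal, -1 = below (None if a value is missing)
    if row['Env_intensity'] > row['industry_avg_year']:
        return 1
    elif row['Env_intensity'] == row['industry_avg_year']:
        return 0
    elif row['Env_intensity'] < row['industry_avg_year']:
        return -1

# Mean environmental intensity per industry and year, broadcast back to each row.
# Selecting the column before transform avoids averaging the non-numeric columns.
df['industry_avg_year'] = df.groupby(['Industry(Exiobase)', 'Year'])['Env_intensity'].transform('mean')

df['Industry_indicator_year'] = df.apply(create_ind_year, axis=1)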
df.head()

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,Env_intensity,industry_avg,industry_avg_year,Industry_indicator_year
0,GB00BMX64W89,2019,Saga plc,United Kingdom,Activities auxiliary to financial intermediati...,-2.89%,-13.03%,-31842309,-31150754,-7184,...,-3585,-6,71,71,-1297,1%,-0.0289,-0.004417,0.002943,-1
1,MYL1818OO003,2019,BURSA MALAYSIA BHD,Malaysia,Activities auxiliary to financial intermediati...,-1.68%,-3.47%,-1968379,-1924910,-451,...,-222,-2,10,10,-79,4%,-0.0168,-0.004417,0.002943,-1
2,GB0031638363,2019,INTERTEK GROUP PLC,United Kingdom,Activities auxiliary to financial intermediati...,-1.53%,-9.49%,-60599272,-59281663,-13774,...,-6861,-20,254,254,-2470,1%,-0.0153,-0.004417,0.002943,-1
3,ZAE000079711,2019,JSE LIMITED,South Africa,Activities auxiliary to financial intermediati...,-1.46%,,-2290124,-2239814,-510,...,-253,0,-3,-3,-93,2%,-0.0146,-0.004417,0.002943,-1
4,FR0006174348,2019,BUREAU VERITAS SA,France,Activities auxiliary to financial intermediati...,-0.70%,-5.10%,-39978650,-39107612,-9330,...,-4607,-45,586,586,-1633,3%,-0.007,-0.004417,0.002943,-1
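
The row-wise `apply` used for the indicator can be slow on a large dataframe. One possible vectorized equivalent, sketched here on a toy frame with made-up numbers, takes `numpy.sign` of the difference between the two columns. One difference to keep in mind: rows with a missing value give NaN here rather than `None`, and the result is a float column.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with the same column names as the notebook
toy = pd.DataFrame({
    'Industry(Exiobase)': ['A', 'A', 'B', 'B'],
    'Year': [2019, 2019, 2019, 2019],
    'Env_intensity': [-0.5, -0.25, -0.125, -0.125],
})

# Industry-year mean, broadcast back to each row
toy['industry_avg_year'] = (
    toy.groupby(['Industry(Exiobase)', 'Year'])['Env_intensity'].transform('mean')
)

# sign(difference): 1 above the average, 0 equal, -1 below
toy['Industry_indicator_year'] = np.sign(
    toy['Env_intensity'] - toy['industry_avg_year']
)
print(toy)
```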


### Creating Environmental growth

Environmental Intensity Growth (%) = ((Environmental Intensity in Current Year / Environmental Intensity Last Year) - 1) * 100

In [7]:
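df = df.sort_values(by=['CompanyName', 'Year'], ascending=True)
# Year-over-year percentage change in intensity within each company;
# the first year of each company has no previous value, so it is NaN
df['Environmental_Growth'] = df.groupby('CompanyName')['Env_intensity'].pct_change() * 100
df.head()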

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,...,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,Env_intensity,industry_avg,industry_avg_year,Industry_indicator_year,Environmental_Growth
6369,DE0005545503,2016,1&1 DRILLISCH AG,Germany,Post and telecommunications (64),-0.07%,-0.82%,-539318,-525027,-169,...,-6,67,67,-22,23%,-0.0007,-0.018382,-0.01164,1,
13777,GB00B1YW4409,2010,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance an...",-0.12%,-0.11%,-1055812,-1032103,-277,...,-4,51,51,-43,10%,-0.0012,-0.020072,-0.006402,1,
12690,GB00B1YW4409,2011,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance an...",-0.16%,-0.16%,-961875,-940402,-246,...,-3,38,38,-39,9%,-0.0016,-0.020072,-0.009838,1,33.333333
11504,GB00B1YW4409,2012,3I GROUP PLC,United Kingdom,"Financial intermediation, except insurance an...",-0.15%,,-722999,-706893,-183,...,-2,27,27,-30,8%,-0.0015,-0.020072,-0.024437,1,-6.25
13501,US88579Y1010,2010,3M COMPANY,United States,Activities of membership organisation n.e.c. ...,-7.90%,-35.45%,-2105919763,-1924672080,-439506,...,-423,3772,3772,-79722,1%,-0.079,-0.117561,-0.084583,1,


In [8]:
df.to_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/Environmental_impact_cleaned.csv', index=False)
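
The absolute path used for the export only resolves on the original machine. A possible alternative, sketched below, builds the output path with `pathlib` from a base folder. The base folder here is a temporary directory so the example runs anywhere; in the project it would presumably be the repository root, which is an assumption and not something the notebook states.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for the project folder (assumption); a temporary directory keeps the sketch portable
base_dir = Path(tempfile.mkdtemp())
datasets_dir = base_dir / 'Datasets'
datasets_dir.mkdir(parents=True, exist_ok=True)

# One-row sample with values taken from the first output row
small = pd.DataFrame({'CompanyName': ['Saga plc'], 'Env_intensity': [-0.0289]})
out_path = datasets_dir / 'Environmental_impact_cleaned.csv'
small.to_csv(out_path, index=False)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_path)
print(reloaded)
```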