# World Bank's World Development Indicators vs. Environment Analysis

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', None)

In [None]:
%run ./help_funcs.ipynb

## Load all datasets

In [None]:
co2_descdf, co2_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_CO2_World_Development_Indicators/World_Bank_CO2_World_Development_Indicators_Data.csv')
othergas_descdf, othergas_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_Othergas_World_Development_Indicators/World_Bank_Othergas_World_Development_Indicators_Data.csv')
pollution_descdf, pollution_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_Pollution_World_Development_Indicators/World_Bank_Pollution_World_Development_Indicators_Data.csv')
gdp_descdf, gdp_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_GDP_World_Development_Indicators/World_Bank_GDP_World_Development_Indicators_Data.csv')
landuse_descdf, landuse_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_LandUse_World_Development_Indicators/World_Bank_LandUse_World_Development_Indicators_Data.csv')
urb_descdf, urb_df = world_bank_file_to_df('../../01_DataSources/01_Raw/World_Bank_Urbanization_World_Development_Indicators/World_Bank_Urbanization_World_Development_Indicators_Data.csv')
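The helper `world_bank_file_to_df` lives in `help_funcs.ipynb` and is not shown here. As a rough sketch of the kind of reshaping it presumably performs, the function below turns a World Bank style export into a series-description frame and a `Country`/`Year` frame with one column per series code. The function name `sketch_world_bank_to_df`, the raw column names (`Country Name`, `Series Name`, `Series Code`, `2018 [YR2018]`), the `..` missing-value marker and the toy values are all assumptions for illustration, not the notebook's actual implementation or data.

```python
import pandas as pd

def sketch_world_bank_to_df(raw):
    """Hypothetical stand-in for world_bank_file_to_df, working on a loaded frame."""
    # Description frame: one row per series code with its readable name.
    desc_df = (raw[['Series Code', 'Series Name']]
               .drop_duplicates()
               .reset_index(drop=True))

    # Year columns are assumed to be labelled like '2018 [YR2018]'.
    year_cols = [c for c in raw.columns if '[YR' in c]

    # Wide-by-year -> long (one row per country, series and year).
    long_df = raw.melt(id_vars=['Country Name', 'Series Code'],
                       value_vars=year_cols,
                       var_name='Year', value_name='Value')
    long_df['Year'] = long_df['Year'].str[:4].astype(int)
    # '..' is assumed to mark missing values; coerce it to NaN.
    long_df['Value'] = pd.to_numeric(long_df['Value'], errors='coerce')

    # Long -> one column per series code, keyed by Country and Year.
    wide_df = (long_df.pivot(index=['Country Name', 'Year'],
                             columns='Series Code', values='Value')
                      .reset_index()
                      .rename(columns={'Country Name': 'Country'}))
    wide_df.columns.name = None
    return desc_df, wide_df

# Toy input with made-up values, only to show the shapes involved.
raw = pd.DataFrame({
    'Country Name': ['Japan', 'Japan'],
    'Country Code': ['JPN', 'JPN'],
    'Series Name': ['CO2 emissions (kt)', 'CO2 emissions (metric tons per capita)'],
    'Series Code': ['EN.ATM.CO2E.KT', 'EN.ATM.CO2E.PC'],
    '2018 [YR2018]': ['100', '1.5'],
    '2019 [YR2019]': ['90', '..'],
})
desc_df, data_df = sketch_world_bank_to_df(raw)
print(data_df)
```

Keeping the series codes as column names is what lets the later cells refer to indicators such as `EN.ATM.CO2E.KT` directly.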


## Preparation/Exploratory Data Analysis (EDA)

### Climate Change/Pollution Indicators

We now look at the World Bank data on indicators that we know directly and adversely impact the environment.

#### CO2 Emissions

In [None]:
display(co2_descdf.sort_values('Series Code'))

In [None]:
visualize_nulls(co2_df)

* The World Bank data only has values from 2007 to 2019
* Only CO2 emissions and CO2 emissions per capita have enough data after 2014, while the other series have a large share of null values
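Observations like these can also be checked numerically. The sketch below is not the notebook's `visualize_nulls` (which is defined in `help_funcs.ipynb`); it is a hypothetical companion, `sketch_null_share_by_year`, that computes the share of missing values per indicator and year for a frame laid out with `Country` and `Year` key columns. The toy frame and its column names `X` and `Y` are made up.

```python
import pandas as pd

def sketch_null_share_by_year(df, key_cols=('Country', 'Year')):
    """Share of null values per indicator column, grouped by year."""
    value_cols = [c for c in df.columns if c not in key_cols]
    # isna() gives booleans; their mean within a year is the null share.
    return df[value_cols].isna().groupby(df['Year']).mean()

# Toy frame with made-up values.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'A', 'B'],
    'Year': [2018, 2018, 2019, 2019],
    'X': [1.0, 2.0, None, 4.0],
    'Y': [None, None, None, 1.0],
})
null_share = sketch_null_share_by_year(toy)
print(null_share)
```

A value of 1.0 means the indicator is entirely missing for that year, which is the pattern that motivates dropping 2020 and 2021 in the cells that follow.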

##### CO2 Emissions for each country

In [None]:
# remove records with year 2020 and 2021
co2_df = co2_df[~co2_df['Year'].isin([2020, 2021])]

In [None]:
# Get the top 10 countries with the highest average CO2 emissions 2007-2019
co2_em_df = co2_df[['Country', 'Year', 'EN.ATM.CO2E.KT']].dropna()

co2_em_df_top10_list = co2_em_df.groupby('Country')\
                                .mean()\
                                .reset_index()\
                                .sort_values('EN.ATM.CO2E.KT',ascending=False)\
                                .head(10)['Country']\
                                .tolist()

co2_em_df_top10 = co2_em_df[co2_em_df['Country'].isin(co2_em_df_top10_list)]

print(co2_em_df_top10_list)

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('Top 10 Countries on CO2 emissions')
plt.ylabel('kilotons')
ax = sns.lineplot(x='Year', y='EN.ATM.CO2E.KT', data=co2_em_df_top10, hue='Country', hue_order=co2_em_df_top10_list)
plt.show()

##### CO2 Emissions per capita for each country

In [None]:
# Get the top 10 countries with the highest average CO2 emissions per capita 2007-2019
co2_em_pc_df = co2_df[['Country', 'Year', 'EN.ATM.CO2E.PC']].dropna()

co2_em_pc_df_top10_list = co2_em_pc_df.groupby('Country')\
                                      .mean()\
                                      .reset_index()\
                                      .sort_values('EN.ATM.CO2E.PC',ascending=False)\
                                      .head(10)['Country']\
                                      .tolist()

co2_em_pc_df_top10 = co2_em_pc_df[co2_em_pc_df['Country'].isin(co2_em_pc_df_top10_list)]

print(co2_em_pc_df_top10_list)

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('Top 10 Countries on CO2 emissions per capita')
plt.ylabel('Metric Tons')
ax = sns.lineplot(x='Year', y='EN.ATM.CO2E.PC', data=co2_em_pc_df_top10, hue='Country', hue_order=co2_em_pc_df_top10_list)
plt.show()

On the other hand, for CO2 emissions per capita, Qatar leads by a significant margin over the other top countries.  Also, the top 2 countries, Qatar and Kuwait, have their CO2 emissions per capita trending downwards.

##### CO2 emissions Summary

Overall, based on the viable CO2 data obtained, we can take some of the top countries on CO2 emissions and analyze them more closely against the other data we have.

Based on CO2 emissions data, we can take the following countries:
* United States
* China
* India
* Russia
* Japan

These countries are large enough in land area and population to have a significant impact on the environment, so we look into them further.

In [None]:
countries_list = ['China', 'United States', 'India', 'Russian Federation', 'Japan']

In [None]:
co2_df = co2_df[co2_df['Country'].isin(countries_list)]

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('CO2 Emission')
plt.ylabel('kilotons')
ax = sns.lineplot(x='Year', y='EN.ATM.CO2E.KT', data=co2_df, hue='Country')
plt.show()

These countries have their emissions either trending upwards or somewhat steady, with China clearly increasing.

#### Other Gas Emissions

In [None]:
display(othergas_descdf.sort_values('Series Code'))

In [None]:
othergas_df = othergas_df[othergas_df['Country'].isin(countries_list)]
visualize_nulls(othergas_df)

* Data is only up to 2019
* Only greenhouse gases, methane and nitrous oxide have a significant amount of data in this set

In [None]:
othergas_df = othergas_df[~othergas_df['Year'].isin([2020,2021])]

##### Greenhouse gases

We check greenhouse gases further as these contribute to global warming.

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('GHG Emission')
plt.ylabel('kilotons (CO2 equivalent)')
ax = sns.lineplot(x='Year', y='EN.ATM.GHGT.KT.CE', data=othergas_df, hue='Country')
plt.show()

##### Methane

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('Methane Emission')
plt.ylabel('kilotons (CO2 equivalent)')
ax = sns.lineplot(x='Year', y='EN.ATM.METH.KT.CE', data=othergas_df, hue='Country')
plt.show()

##### Nitrous Oxide

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('Nitrous Oxide Emission')
plt.ylabel('kilotons (CO2 equivalent)')
ax = sns.lineplot(x='Year', y='EN.ATM.NOXE.KT.CE', data=othergas_df, hue='Country')
plt.show()

##### Other Gas Emissions Summary

The picture here has a lot in common with the CO2 emissions data, so we can use the same countries for both.  China and the United States are the top 2, with China's emissions mostly trending upwards.

In [None]:
countries_list = ['China', 'United States', 'India', 'Russian Federation', 'Japan']

#### Air Pollution

In [None]:
display(pollution_descdf.sort_values('Series Code'))

In [None]:
pollution_df = pollution_df[pollution_df['Country'].isin(countries_list)]
visualize_nulls(pollution_df)

We can use data from 2010 to 2017

In [None]:
pollution_df = pollution_df[pollution_df['Year'].isin(range(2010,2018))]

In [None]:
plt.clf()
fig, axes = plt.subplots(1, 2, figsize=(24,12))
axes[0].set_ylabel('micrograms per cubic meter')
axes[1].set_ylabel('% population')

axes[0].set_title('PM2.5 air pollution, mean annual exposure')
axes[1].set_title('PM2.5 air pollution, population exposed to levels exceeding WHO guideline value')

sns.lineplot(x='Year', y='EN.ATM.PM25.MC.M3', data=pollution_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='EN.ATM.PM25.MC.ZS', data=pollution_df, hue='Country', ax=axes[1])
plt.show()

In China and India, 100% of the population is already exposed to PM2.5 levels exceeding the WHO guideline value.  However, for all 5 countries, the mean annual exposure is decreasing or level (none is increasing).  The United States and Japan have clearly and sharply reduced the share of their population subjected to significant air pollution.

#### Climate Change/Pollution Indicators Summary

From these datasets, we can take the gas emissions data (CO2, GHG, methane, nitrous oxide) and the air pollution data.

| Dataset | Series Code | Series Description/Name |
| --- | --- | --- |
|CO2| EN.ATM.CO2E.KT | CO2 emissions (kt)|
|Other Gas | EN.ATM.GHGT.KT.CE | Total greenhouse gas emissions (kt of CO2 equivalent) |
|Other Gas | EN.ATM.METH.KT.CE | Methane emissions (kt of CO2 equivalent)|
|Other Gas | EN.ATM.NOXE.KT.CE | Nitrous oxide emissions (thousand metric tons of CO2 equivalent)|
|Pollution | EN.ATM.PM25.MC.M3 | PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)|
|Pollution | EN.ATM.PM25.MC.ZS | PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)|

In general, even if gas emissions have been level or increasing for the 5 countries, air pollution has been decreasing.

### World Development Indicators

We check the available data in the World Development Indicators datasets for the countries determined earlier.

#### GDP

In [None]:
gdp_df = gdp_df[gdp_df['Country'].isin(countries_list)]

In [None]:
display(gdp_descdf.sort_values('Series Code'))

In [None]:
visualize_nulls(gdp_df)

In [None]:
gdp_df = gdp_df[~gdp_df['Year'].isin([2020,2021])]

In [None]:
plt.clf()
fig, axes = plt.subplots(1, 2, figsize=(24,12))
axes[0].set_ylabel('% change')
axes[1].set_ylabel('% change')

axes[0].set_title('GDP growth (annual %)')
axes[1].set_title('GDP per capita growth (annual %)')

sns.lineplot(x='Year', y='NY.GDP.MKTP.KD.ZG', data=gdp_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='NY.GDP.PCAP.KD.ZG', data=gdp_df, hue='Country', ax=axes[1])
plt.show()

In [None]:
plt.clf()
fig, axes = plt.subplots(1, 2, figsize=(24,12))
axes[0].set_ylabel('US $')
axes[1].set_ylabel('US $')

axes[0].set_title('GDP')
axes[1].set_title('GDP per capita')

sns.lineplot(x='Year', y='NY.GDP.MKTP.CD', data=gdp_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='NY.GDP.PCAP.CD', data=gdp_df, hue='Country', ax=axes[1])
plt.show()

##### GDP Dataset Findings

We can use the GDP metrics.  The GDP and GDP per capita graphs look basically the same, as expected (per capita is just GDP divided by the population in that year).  The growth numbers show growth stabilizing over the years.  In GDP/GDP per capita, aside from Russia and Japan, which have ups and downs, GDP is trending upwards.

#### Land Use

In [None]:
landuse_df = landuse_df[landuse_df['Country'].isin(countries_list)]

In [None]:
display(landuse_descdf.sort_values('Series Code'))

In [None]:
visualize_nulls(landuse_df)

In [None]:
landuse_df = landuse_df[~landuse_df['Year'].isin([2020,2021])]

In [None]:
plt.clf()
fig, axes = plt.subplots(1, 2, figsize=(24,12))
axes[0].set_ylabel('sq. km')
axes[1].set_ylabel('% land area')

axes[0].set_title('Forest area (sq. km)')
axes[1].set_title('Forest area (% of land area)')

sns.lineplot(x='Year', y='AG.LND.FRST.K2', data=landuse_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='AG.LND.FRST.ZS', data=landuse_df, hue='Country', ax=axes[1])
plt.show()

We get the year on year change to see how these increase/decrease.

In [None]:
forest_change_df = get_yoy_change_df(landuse_df, 'AG.LND.FRST.K2', 'AG.LND.FRST.K2.CHG')
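`get_yoy_change_df` is defined in `help_funcs.ipynb` and is not shown here. A minimal sketch of a year-on-year percentage change per country, assuming the helper behaves roughly like this, could look as follows. The name `sketch_yoy_change`, the toy values and the choice to express the change in percent are assumptions.

```python
import pandas as pd

def sketch_yoy_change(df, value_col, change_col):
    """Add a year-on-year % change column, computed within each country."""
    out = df.sort_values(['Country', 'Year']).copy()
    # Previous year's value for the same country (NaN for the first year).
    prev = out.groupby('Country')[value_col].shift(1)
    out[change_col] = (out[value_col] - prev) / prev * 100
    return out

# Toy frame with made-up forest areas.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2010, 2011, 2010, 2011],
    'AG.LND.FRST.K2': [100.0, 110.0, 200.0, 190.0],
})
yoy = sketch_yoy_change(toy, 'AG.LND.FRST.K2', 'AG.LND.FRST.K2.CHG')
print(yoy)
```

Grouping by country before shifting matters: without it, the first year of one country would be compared against the last year of the previous country in the frame.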

In [None]:
plt.clf()
plt.figure(figsize=(12,12))

plt.title('Forest Area % Change')
plt.ylabel('% change')
sns.lineplot(x='Year', y='AG.LND.FRST.K2.CHG', data=forest_change_df, hue='Country')
plt.show()

##### Land Use Dataset Findings

When the annual change graph was derived, it showed that all 5 countries don't have much change in their forest area but are trending upwards in general.

#### Urbanization

##### Loading and checking the dataset

In [None]:
urb_df = urb_df[urb_df['Country'].isin(countries_list)]

In [None]:
display(urb_descdf.sort_values('Series Code'))

In [None]:
visualize_nulls(urb_df)

In [None]:
urb_df = urb_df[~urb_df['Year'].isin([2020,2021])]

In [None]:
plt.clf()
fig, axes = plt.subplots(1, 3, figsize=(36,12))
axes[0].set_ylabel('% change')
axes[1].set_ylabel('% change')
axes[2].set_ylabel('% change')

axes[0].set_title('Population growth (annual %)')
axes[1].set_title('Rural population growth (annual %)')
axes[2].set_title('Urban population growth (annual %)')

sns.lineplot(x='Year', y='SP.POP.GROW', data=urb_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='SP.RUR.TOTL.ZG', data=urb_df, hue='Country', ax=axes[1])
sns.lineplot(x='Year', y='SP.URB.GROW', data=urb_df, hue='Country', ax=axes[2])

plt.show()

##### Urbanization Dataset Findings

We can use all the metrics here for the given countries.  In particular, the annual growth percentages show some trends: population growth is slowing down, with Japan showing an increase in rural population and a decrease in urban population.

#### World Development Indicators Summary

After checking the datasets, we decided to take these metrics/indicators and see how they relate to the climate change data we checked earlier.

| Dataset | Series Code | Series Description/Name |
| --- | --- | --- |
| GDP | NY.GDP.MKTP.KD.ZG | GDP growth (annual %) |
| GDP | NY.GDP.PCAP.CD | GDP per capita (current US$) |
| Land Use | AG.LND.FRST.ZS | Forest area (% of land area) |
| Land Use | AG.LND.FRST.K2.CHG | Forest area annual change |
| Urbanization | SP.POP.GROW | Population growth (annual %) |
| Urbanization | SP.RUR.TOTL.ZG | Rural population growth (annual %) |
| Urbanization | SP.URB.GROW | Urban population growth (annual %) |

### EDA Summary

We now have an initial list of metrics from both the gas emissions/pollution data and the World Development Indicators.  However, it would be better to normalize the emissions data so that differences in scale (very large or small values) don't affect further analysis.

In [None]:
landarea_df = landuse_df[['Country', 'Year', 'AG.LND.TOTL.K2']].copy()
landarea_df = landarea_df[landarea_df['Year'].isin(range(2010,2018))]

co2_df = co2_df[co2_df['Country'].isin(countries_list)]
co2_df = co2_df[co2_df['Year'].isin(range(2010,2018))]
co2_df = co2_df[['Country', 'Year', 'EN.ATM.CO2E.KT']]

co2_per_area_df = co2_df.merge(landarea_df, how='left', on=['Country', 'Year'])
co2_per_area_df['EN.ATM.CO2E.KT.AREA'] = co2_per_area_df['EN.ATM.CO2E.KT']/co2_per_area_df['AG.LND.TOTL.K2']

In [None]:
plt.clf()
plt.figure(figsize=(12,10))
plt.title('CO2 Emission per land area')
plt.ylabel('kilotons per sq. km')
ax = sns.lineplot(x='Year', y='EN.ATM.CO2E.KT.AREA', data=co2_per_area_df, hue='Country')
plt.show()

In [None]:
othergas_per_area_df = othergas_df.merge(landarea_df, how='left', on=['Country', 'Year'])
othergas_per_area_df['EN.ATM.GHGT.KT.CE.AREA'] = othergas_per_area_df['EN.ATM.GHGT.KT.CE']/othergas_per_area_df['AG.LND.TOTL.K2']
othergas_per_area_df['EN.ATM.METH.KT.CE.AREA'] = othergas_per_area_df['EN.ATM.METH.KT.CE']/othergas_per_area_df['AG.LND.TOTL.K2']
othergas_per_area_df['EN.ATM.NOXE.KT.CE.AREA'] = othergas_per_area_df['EN.ATM.NOXE.KT.CE']/othergas_per_area_df['AG.LND.TOTL.K2']

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(36,12))
axes[0].set_ylabel('kilotons per sq. km (CO2 equivalent)')
axes[1].set_ylabel('kilotons per sq. km (CO2 equivalent)')
axes[2].set_ylabel('kilotons per sq. km (CO2 equivalent)')

axes[0].set_title('Greenhouse Gas Emissions per land area')
axes[1].set_title('Methane Emissions per land area')
axes[2].set_title('Nitrous Oxide Emissions per land area')

sns.lineplot(x='Year', y='EN.ATM.GHGT.KT.CE.AREA', data=othergas_per_area_df, hue='Country', ax=axes[0])
sns.lineplot(x='Year', y='EN.ATM.METH.KT.CE.AREA', data=othergas_per_area_df, hue='Country', ax=axes[1])
sns.lineplot(x='Year', y='EN.ATM.NOXE.KT.CE.AREA', data=othergas_per_area_df, hue='Country', ax=axes[2])
plt.show()

Now that the emissions data has been normalized, Japan comes out on top in emissions per land area, with China next.

Overall, we would use the following:

Gas Emissions/Pollution Indicators

| Dataset | Series Code | Series Description/Name |
| --- | --- | --- |
|CO2| EN.ATM.CO2E.KT.AREA | CO2 emissions per land area (kt/sq. km)|
|Other Gas | EN.ATM.GHGT.KT.CE.AREA | Total greenhouse gas emissions per land area (kt/sq. km of CO2 equivalent) |
|Other Gas | EN.ATM.METH.KT.CE.AREA | Methane emissions per land area (kt/sq. km of CO2 equivalent)|
|Other Gas | EN.ATM.NOXE.KT.CE.AREA | Nitrous oxide emissions per land area (thousand metric tons/sq. km of CO2 equivalent)|
|Pollution | EN.ATM.PM25.MC.M3 | PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)|
|Pollution | EN.ATM.PM25.MC.ZS | PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total) |

World Development Indicators

| Dataset | Series Code | Series Description/Name |
| --- | --- | --- |
| GDP | NY.GDP.MKTP.KD.ZG | GDP growth (annual %) |
| GDP | NY.GDP.PCAP.CD | GDP per capita (current US$) |
| Land Use | AG.LND.FRST.ZS | Forest area (% of land area) |
| Land Use | AG.LND.FRST.K2.CHG | Forest area annual change |
| Urbanization | SP.POP.GROW | Population growth (annual %) |
| Urbanization | SP.RUR.TOTL.ZG | Rural population growth (annual %) |
| Urbanization | SP.URB.GROW | Urban population growth (annual %) |

The years 2010 to 2017 can be used.

## Analysis Proper

### Data Frame Assembly

In [None]:
co2_per_area_df = co2_per_area_df[co2_per_area_df['Country'].isin(countries_list)]
co2_per_area_df = co2_per_area_df[co2_per_area_df['Year'].isin(range(2010,2018))]
co2_per_area_df = co2_per_area_df[['Country', 'Year', 'EN.ATM.CO2E.KT.AREA']]
print('CO2 per area df shape', co2_per_area_df.shape)

In [None]:
othergas_per_area_df = othergas_per_area_df[othergas_per_area_df['Country'].isin(countries_list)]
othergas_per_area_df = othergas_per_area_df[othergas_per_area_df['Year'].isin(range(2010,2018))]
othergas_per_area_df = othergas_per_area_df[['Country', 'Year', 'EN.ATM.GHGT.KT.CE.AREA', 'EN.ATM.METH.KT.CE.AREA', 'EN.ATM.NOXE.KT.CE.AREA']]
print('Other Gas per area df shape', othergas_per_area_df.shape)

In [None]:
pollution_df = pollution_df[pollution_df['Country'].isin(countries_list)]
pollution_df = pollution_df[pollution_df['Year'].isin(range(2010,2018))]
pollution_df = pollution_df[['Country', 'Year', 'EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']]
print('Pollution df shape', pollution_df.shape)

In [None]:
gdp_df = gdp_df[gdp_df['Country'].isin(countries_list)]
gdp_df = gdp_df[gdp_df['Year'].isin(range(2010,2018))]
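# Note: this keeps GDP per capita growth (NY.GDP.PCAP.KD.ZG), while the summary tables list GDP per capita (NY.GDP.PCAP.CD)
gdp_df = gdp_df[['Country', 'Year', 'NY.GDP.MKTP.KD.ZG', 'NY.GDP.PCAP.KD.ZG']]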
print('GDP df shape', gdp_df.shape)

In [None]:
forest_change_df = forest_change_df[forest_change_df['Country'].isin(countries_list)]
forest_change_df = forest_change_df[forest_change_df['Year'].isin(range(2010,2018))]
forest_change_df = forest_change_df[['Country', 'Year', 'AG.LND.FRST.K2.CHG']]
print('Forest df shape', forest_change_df.shape)

In [None]:
landuse_df = landuse_df[landuse_df['Country'].isin(countries_list)]
landuse_df = landuse_df[landuse_df['Year'].isin(range(2010,2018))]
landuse_df = landuse_df[['Country', 'Year', 'AG.LND.FRST.ZS']]
print('Land use df shape', landuse_df.shape)

In [None]:
urb_df = urb_df[urb_df['Country'].isin(countries_list)]
urb_df = urb_df[urb_df['Year'].isin(range(2010,2018))]
urb_df = urb_df[['Country', 'Year', 'SP.POP.GROW', 'SP.RUR.TOTL.ZG', 'SP.URB.GROW']]
print('Urbanization df shape', urb_df.shape)

In [None]:
resdf = co2_per_area_df.merge(othergas_per_area_df, how='left', on=['Country', 'Year'])\
        .merge(pollution_df, how='left', on=['Country', 'Year'])\
        .merge(forest_change_df, how='left', on=['Country', 'Year'])\
        .merge(gdp_df, how='left', on=['Country', 'Year'])\
        .merge(landuse_df, how='left', on=['Country', 'Year'])\
        .merge(urb_df, how='left', on=['Country', 'Year'])

print('Result df shape =', resdf.shape)
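Each input frame is expected to hold one row per country and year, so the chain of left merges should neither add nor drop rows. A small sketch of how that expectation could be guarded, using the `validate` argument of `DataFrame.merge` on toy frames (`left`, `right` and `right_dup` are hypothetical stand-ins, not the notebook's frames):

```python
import pandas as pd

keys = ['Country', 'Year']

# Toy frames with made-up values, one row per (Country, Year).
left = pd.DataFrame({'Country': ['Japan', 'Japan'],
                     'Year': [2010, 2011],
                     'a': [1.0, 2.0]})
right = pd.DataFrame({'Country': ['Japan', 'Japan'],
                      'Year': [2010, 2011],
                      'b': [3.0, 4.0]})

# validate='one_to_one' raises if either side repeats a key.
merged = left.merge(right, how='left', on=keys, validate='one_to_one')
print(merged.shape)

# A duplicated key on the right side is caught instead of silently
# multiplying rows in the result.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, how='left', on=keys, validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

With 5 countries and 8 years, the assembled frame would be expected to have 40 rows if every source frame has exactly one row per key.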

In [None]:
value_cols = list(resdf.columns)[2:]
wdi_cols = list(resdf.columns)[8:]

resdf[value_cols].describe()

### Correlation

We take the correlation of the metrics we have gathered and check how the gas emissions data relate to the given World Development Indicators.

In [None]:
corrdata = resdf[value_cols].corr()
plt.figure(figsize=(12,10))
sns.heatmap(data=corrdata, annot=True)

#### CO2 Emissions

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.CO2E.KT.AREA', wdi_cols)
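`plot_corr_on_col` comes from `help_funcs.ipynb` and is not shown here. Assuming it displays the correlation of one target column against a list of other columns, the underlying numbers could be obtained with `DataFrame.corrwith`, as in this sketch. The function name `sketch_corr_on_col` and the toy frame are hypothetical, and the plotting step is left out.

```python
import pandas as pd

def sketch_corr_on_col(df, target_col, other_cols):
    """Pearson correlation of target_col with each column in other_cols."""
    return df[other_cols].corrwith(df[target_col]).sort_values()

# Toy frame: 'up' rises with the target, 'down' falls with it.
toy = pd.DataFrame({
    'target': [1.0, 2.0, 3.0, 4.0],
    'up': [2.0, 4.0, 6.0, 8.0],
    'down': [4.0, 3.0, 2.0, 1.0],
})
corr = sketch_corr_on_col(toy, 'target', ['up', 'down'])
print(corr)
```

This returns the same values as the matching row of `df.corr()`, restricted to the columns of interest.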

#### Green House Gases Emissions

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.GHGT.KT.CE.AREA', wdi_cols)

#### Methane Emissions

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.METH.KT.CE.AREA', wdi_cols)

#### Nitrous Oxide Emissions

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.NOXE.KT.CE.AREA', wdi_cols)

#### Air Pollution Mean Annual Exposure

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.PM25.MC.M3', wdi_cols)

#### Air Pollution Percent Population exposed to levels exceeding WHO guideline

In [None]:
plot_corr_on_col(resdf, 'EN.ATM.PM25.MC.ZS', wdi_cols)

#### Summary

For the 5 countries:
* Gas emissions are in general positively correlated with population growth and GDP
* CO2 emissions turned out to be negatively correlated with population growth and positively correlated with forest area
* Air pollution mean exposure is negatively correlated with forest land area

This seems counter-intuitive, as it implies that a higher population growth rate would reduce CO2 emissions and that increasing forest area would increase them.  Correlation does not equate to causation, so the relationship is probably not as simple as the numbers suggest.

However, the correlations may change per country, so we look specifically at India, Japan and the United States, taking just CO2 emissions and air pollution mean annual exposure.
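One possible reason a pooled correlation looks counter-intuitive is that differences in level between countries can dominate the within-country relationship. The sketch below uses made-up numbers for two hypothetical countries `A` and `B` (columns `x` and `y` are placeholders, not indicators from this notebook): within each country the two variables move in opposite directions, yet the pooled correlation is positive.

```python
import pandas as pd

# Toy data: within each country y falls as x rises,
# but country B sits at a much higher level on both variables.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x': [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
    'y': [3.0, 2.0, 1.0, 13.0, 12.0, 11.0],
})

# Correlation over all rows, ignoring which country they belong to.
pooled_corr = toy['x'].corr(toy['y'])

# Correlation computed separately inside each country.
per_country_corr = {name: grp['x'].corr(grp['y'])
                    for name, grp in toy.groupby('Country')}

print('pooled:', pooled_corr)
print('per country:', per_country_corr)
```

Whether this effect is at play in the actual data is not established here; the sketch only illustrates why splitting the correlation by country is a reasonable next step.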

### Correlation - United States

In [None]:
res_usa_df = resdf[resdf['Country'] == 'United States']

corrdata_usa = res_usa_df[value_cols].corr()
plt.figure(figsize=(12,10))
sns.heatmap(data=corrdata_usa, annot=True)

In [None]:
plot_corr_on_col(res_usa_df, 'EN.ATM.CO2E.KT.AREA', wdi_cols)

In [None]:
plot_corr_on_col(res_usa_df, 'EN.ATM.PM25.MC.M3', wdi_cols)

#### Summary (United States)

CO2 emissions and air pollution are correlated with largely the same indicators.  In particular, both are negatively correlated with forest area.

### Correlation - Japan

In [None]:
res_jpn_df = resdf[resdf['Country'] == 'Japan']

corrdata_jpn = res_jpn_df[value_cols].corr()
plt.figure(figsize=(12,10))
sns.heatmap(data=corrdata_jpn, annot=True)

In [None]:
plot_corr_on_col(res_jpn_df, 'EN.ATM.CO2E.KT.AREA', wdi_cols)

In [None]:
plot_corr_on_col(res_jpn_df, 'EN.ATM.PM25.MC.M3', wdi_cols)

#### Summary (Japan)

For Japan, CO2 emissions and air pollution are negatively correlated with rural population growth.  In the earlier graphs, Japan did show a significant increase in rural population for a time, which is reflected in this correlation.

### Correlation - India

In [None]:
res_ind_df = resdf[resdf['Country'] == 'India']

corrdata_ind = res_ind_df[value_cols].corr()
plt.figure(figsize=(12,10))
sns.heatmap(data=corrdata_ind, annot=True)

In [None]:
plot_corr_on_col(res_ind_df, 'EN.ATM.CO2E.KT.AREA', wdi_cols)

In [None]:
plot_corr_on_col(res_ind_df, 'EN.ATM.PM25.MC.M3', wdi_cols)

#### Summary (India)

For India, only air pollution shows the negative correlation with forest area.

## Analysis Summary

In general, the data shows that gas emissions and air pollution increase along with development indicators like population growth and GDP.  These 5 countries are among the most developed in the world, which is reflected in their GDP and population.  Also, the data only spans 2010 to 2017, as this is the period that the gathered datasets have in common, and this seems too short to find a significant trend.

On the other hand, Forest Area being negatively correlated with gas emissions and pollution may be something worth looking further into as most countries in the set have this trend.
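On the point about the short time span: each per-country correlation above rests on only 8 yearly observations. One hedged way to gauge how much such a correlation could be due to chance is a permutation test, sketched below with numpy. The function `permutation_pvalue`, its parameters and the toy series are assumptions for illustration and are not part of the notebook. Yearly series are also autocorrelated, which a simple permutation test ignores, so the result should be read as a rough indication only.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link between x and y.
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy example: 8 yearly points with a perfectly linear (made-up) trend.
years = np.arange(2010, 2018)
values = 2.0 * years + 1.0
p_value = permutation_pvalue(years, values)
print(p_value)
```

Real indicator series are noisier than this toy trend, so with 8 points per country only fairly strong correlations would stand out from chance.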

## Takeaways

Overall, it may be better to look for trends on a per-country basis instead of across multiple countries.  Culture may play a role in how each country addresses environmental issues, and natural calamities are not the same in all countries.

In this analysis, however, forest area seems to be significantly associated with lower gas emissions and pollution.  Looking further into data on reforestation and even tree conditions (e.g. there may be a lot of forests in California, but significant parts of those are "dead trees" that would just be fuel for forest fires) would show in more detail whether forests really help the environment.  This kind of data is usually found in GIS (Geographic Information System) databases, which require a different way of processing/preparation before the data can be analyzed in this manner.