In [1]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='/kaggle/input/world-polluted-country-report/most-polluted-countries.csv')
df.head()

Unnamed: 0,pollution_2023,pollution_growth_Rate,country_name,ccn3,country_region,united_nation_Member,country_land_Area_in_Km,pollution_density_in_km,pollution_density_per_Mile,share_borders,pollution_Rank,mostPollutedCountries_particlePollution
0,1428627663,0.00808,India,356,Asia,True,2973190.0,480.5033,1244.5036,"AFG, BGD, BTN, MMR, CHN, NPL, PAK, LKA",1,58.08
1,1425671352,-0.00015,China,156,Asia,True,9424702.9,151.2696,391.7884,"AFG, BTN, MMR, HKG, IND, KAZ, PRK, KGZ, LAO, M...",2,39.12
2,339996563,0.00505,United States,840,North America,True,9147420.0,37.1686,96.2666,"CAN, MEX",3,9.04
3,277534122,0.00738,Indonesia,360,Asia,True,1877519.0,147.8196,382.8528,"TLS, MYS, PNG",4,51.71
4,240485658,0.01976,Pakistan,586,Asia,True,770880.0,311.9625,807.9829,"AFG, CHN, IND, IRN",5,65.81


Ranking countries by total pollution doesn't tell the whole story, since countries can differ by orders of magnitude in both area and population. Let's try to make the most of the data we have.
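One way to put the totals on a comparable footing is to divide by land area. As a quick sanity check, the sketch below rebuilds by hand the five rows shown by `df.head()` and confirms that `pollution_density_in_km` matches `pollution_2023 / country_land_Area_in_Km`. The `sample` frame, the `recomputed_density` column, and `max_gap` are names introduced here for illustration only.

```python
import pandas as pd

# Five rows copied from the df.head() output above (illustrative sample only).
sample = pd.DataFrame({
    "country_name": ["India", "China", "United States", "Indonesia", "Pakistan"],
    "pollution_2023": [1428627663, 1425671352, 339996563, 277534122, 240485658],
    "country_land_Area_in_Km": [2973190.0, 9424702.9, 9147420.0, 1877519.0, 770880.0],
    "pollution_density_in_km": [480.5033, 151.2696, 37.1686, 147.8196, 311.9625],
})

# Recompute the density as total divided by land area.
sample["recomputed_density"] = (
    sample["pollution_2023"] / sample["country_land_Area_in_Km"]
)

# Largest absolute gap between the published and the recomputed densities.
max_gap = (
    sample["recomputed_density"] - sample["pollution_density_in_km"]
).abs().max()

print(sample[["country_name", "pollution_density_in_km", "recomputed_density"]])
print(f"largest gap: {max_gap:.6f}")
```

For these five rows the two columns agree to within rounding, which suggests the density column is derived from the total and the land area rather than measured independently.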

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 12 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   pollution_2023                           96 non-null     int64  
 1   pollution_growth_Rate                    96 non-null     float64
 2   country_name                             96 non-null     object 
 3   ccn3                                     96 non-null     int64  
 4   country_region                           96 non-null     object 
 5   united_nation_Member                     96 non-null     bool   
 6   country_land_Area_in_Km                  96 non-null     float64
 7   pollution_density_in_km                  96 non-null     float64
 8   pollution_density_per_Mile               96 non-null     float64
 9   share_borders                            85 non-null     object 
 10  pollution_Rank                           96 non-null

In [3]:
df.nunique()

pollution_2023                             96
pollution_growth_Rate                      95
country_name                               96
ccn3                                       96
country_region                              6
united_nation_Member                        2
country_land_Area_in_Km                    96
pollution_density_in_km                    96
pollution_density_per_Mile                 96
share_borders                              83
pollution_Rank                             96
mostPollutedCountries_particlePollution    93
dtype: int64

In [4]:
iso_df = pd.read_csv(filepath_or_buffer='/kaggle/input/countries-iso-codes/wikipedia-iso-country-codes.csv')
iso_df.head()

Unnamed: 0,English short name lower case,Alpha-2 code,Alpha-3 code,Numeric code,ISO 3166-2
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX
2,Albania,AL,ALB,8,ISO 3166-2:AL
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ
4,American Samoa,AS,ASM,16,ISO 3166-2:AS


The pollution dataset doesn't include the ISO alpha-3 codes we need for a choropleth, so we get them from another dataset.

In [5]:
t_df = df.merge(right=iso_df, right_on='Numeric code', left_on='ccn3', how='inner')
t_df.shape

(96, 17)

Fortunately, the inner merge keeps all 96 rows, so every ccn3 value has a matching numeric code and we don't have to fix any country names.
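Comparing the row count before and after an inner merge works, but it does not say which rows were lost when something goes wrong. A minimal sketch of a more explicit check uses a left merge with `indicator=True`. The frames below are small hand-built stand-ins for `df` and `iso_df`, and the row with code 999 is made up purely to show what a miss looks like.

```python
import pandas as pd

# Numeric codes for the five countries shown in df.head(), plus one
# made-up code (999) that is NOT a real entry, to demonstrate a miss.
left = pd.DataFrame({
    "country_name": ["India", "China", "United States", "Indonesia",
                     "Pakistan", "Hypothetical"],
    "ccn3": [356, 156, 840, 360, 586, 999],
})
right = pd.DataFrame({
    "Alpha-3 code": ["IND", "CHN", "USA", "IDN", "PAK"],
    "Numeric code": [356, 156, 840, 360, 586],
})

# A left merge keeps every row from `left`; the `_merge` column records
# whether each row found a partner in `right`.
checked = left.merge(
    right, left_on="ccn3", right_on="Numeric code", how="left", indicator=True
)

# Rows flagged "left_only" have no matching numeric code.
unmatched = checked.loc[checked["_merge"] == "left_only", "country_name"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm the same thing the `(96, 17)` shape does, while a non-empty one would name the countries that need fixing.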

In [6]:
from plotly.express import choropleth
choropleth(data_frame=t_df, locations='Alpha-3 code', color='pollution_density_in_km').show()
choropleth(data_frame=t_df, locations='Alpha-3 code', color='mostPollutedCountries_particlePollution').show()

We have a problem here: on the first map, most of the pollution density values cluster near zero relative to the full color scale, so the countries are hard to tell apart. The particle pollution map shows more contrast and is more informative in this sense.
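One common response to a skewed color scale is to color by the logarithm of the value instead. The sketch below is an assumption about how that might be done, not something this notebook ran. `log_density`, `linear_ratio`, and `log_span` are names introduced here, and the rows are the five from `df.head()`.

```python
import numpy as np
import pandas as pd

# Densities for the five countries shown in df.head() (illustrative sample).
sample = pd.DataFrame({
    "Alpha-3 code": ["IND", "CHN", "USA", "IDN", "PAK"],
    "pollution_density_in_km": [480.5033, 151.2696, 37.1686, 147.8196, 311.9625],
})

# Base-10 log compresses large values and spreads out small ones.
sample["log_density"] = np.log10(sample["pollution_density_in_km"])

# How far apart the extremes are on each scale.
density = sample["pollution_density_in_km"]
linear_ratio = density.max() / density.min()
log_span = sample["log_density"].max() - sample["log_density"].min()

print(sample)
print(f"max/min on the linear scale: {linear_ratio:.1f}")
print(f"max - min on the log10 scale: {log_span:.2f}")
```

Passing `color='log_density'` to `choropleth` in place of `pollution_density_in_km` would then color the map by order of magnitude, at the cost of a legend that reads in log units.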

In [7]:
from plotly.express import scatter
scatter(data_frame=df, x='country_land_Area_in_Km', y='pollution_2023', hover_name='country_name',
        log_x=True, log_y=True, trendline='ols', color='country_region', trendline_scope='overall')

If every country in the dataset polluted about the same amount per unit of land area, all the points would cluster tightly around the trendline.
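To make that idea concrete: if the total were proportional to area, a straight-line fit of log total against log area would have a slope near 1, and each country's residual would show how far it sits above or below the line. Here is a minimal sketch with `numpy.polyfit` on the five `df.head()` rows. Five rows are far too few to draw conclusions from, and all variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the df.head() output (illustrative sample only).
names = ["India", "China", "United States", "Indonesia", "Pakistan"]
area = np.array([2973190.0, 9424702.9, 9147420.0, 1877519.0, 770880.0])
total = np.array(
    [1428627663, 1425671352, 339996563, 277534122, 240485658], dtype=float
)

# Work in log10 space, matching the log-log axes of the scatter plot.
log_area = np.log10(area)
log_total = np.log10(total)

# Least-squares straight line: log_total ~ slope * log_area + intercept.
slope, intercept = np.polyfit(log_area, log_total, deg=1)

# Positive residual: more total pollution than the line predicts for that area.
residuals = log_total - (slope * log_area + intercept)

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")
for name, resid in zip(names, residuals):
    print(f"{name:>15}: {resid:+.3f}")
```

Countries with large positive or negative residuals are the ones that break the "about the same" assumption.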

In [8]:
scatter(data_frame=df, x='pollution_density_in_km', y='mostPollutedCountries_particlePollution',
        log_x=True, log_y=True, hover_name='country_name', color='pollution_Rank')

In [9]:
scatter(data_frame=df, x='pollution_density_in_km', y='mostPollutedCountries_particlePollution',
        log_x=True, log_y=True, hover_name='country_name', color='united_nation_Member')

UN membership does not appear to separate the countries in any meaningful way here.

In [10]:
scatter(data_frame=df, x='pollution_density_in_km', y='mostPollutedCountries_particlePollution',
        log_x=True, log_y=True, hover_name='country_name', color='country_region')

This may be the nut graf: countries in Asia mostly occupy one part of the X-Y space, countries in Oceania and Europe mostly occupy another, and countries in Africa and North America are split between the two.
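A per-region summary is one way to put numbers behind that visual impression. The sketch below takes the median of both plotted columns by `country_region`, using only the five rows visible in `df.head()`, so it covers just Asia and North America and is not representative of the full 96-country dataset. `region_summary` is a name introduced here.

```python
import pandas as pd

# Five rows copied from the df.head() output (illustrative sample only).
sample = pd.DataFrame({
    "country_name": ["India", "China", "United States", "Indonesia", "Pakistan"],
    "country_region": ["Asia", "Asia", "North America", "Asia", "Asia"],
    "pollution_density_in_km": [480.5033, 151.2696, 37.1686, 147.8196, 311.9625],
    "mostPollutedCountries_particlePollution": [58.08, 39.12, 9.04, 51.71, 65.81],
})

# Median of each plotted axis per region; the median is less sensitive
# to a single extreme country than the mean.
region_summary = sample.groupby("country_region")[
    ["pollution_density_in_km", "mostPollutedCountries_particlePollution"]
].median()

print(region_summary)
```

Run against the full `df`, the same `groupby` would show whether the regional medians are as far apart as the scatter plot suggests.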