# Countries Exploratory Data Analysis (EDA)

In [3]:
from bs4 import BeautifulSoup
import csv
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import requests
import seaborn as sns
import time

## Overview
The dataset used in this project contains information about all the countries in the world. The objective of the analysis is to gain insights about population, area, languages spoken, distribution of wealth, life expectancy and death rate.

## Questions for Analysis

1. What are the 10 most populated countries?
2. What are the top 3 most populated countries by continent?
3. What is the most populated continent?
4. What are the top 20 largest countries?
5. What is the smallest continent?
6. What are the top 5 most common languages?
7. What are the 10 countries with the highest inequality?
8. What are the 10 countries with the highest death rate?
9. Life expectancy:
   - What is the overall average life expectancy?
   - What is the average life expectancy by continent?
10. What variable seems to be correlated to life expectancy the most?

## About Dataset

The dataset comes from 4 different sources:
- [REST Countries API](https://restcountries.com/)
- [World Bank Group](https://datacatalog.worldbank.org/search/dataset/0038130) (downloaded CSV)
- [Worldometer](https://www.worldometers.info/demographics/life-expectancy/) (web scraping)
- [CIA](https://www.cia.gov/the-world-factbook/field/death-rate/country-comparison/) (downloaded CSV)

### Columns description
- ```name```: country name
- ```code```: ISO 3166 country code
- ```continent```: continent of the country
- ```rank```: ranking of the country based on the GDP
- ```population```: population of the country
- ```area```: country area, in square kilometers
- ```languages```: languages spoken
- ```gdp```: Gross Domestic Product (GDP), in millions of US Dollars
- ```gdp_per_capita```: GDP per capita, computed as ```gdp``` divided by ```population``` and multiplied by 100,000 (millions of US Dollars per 100,000 people)
- ```gini```: GINI Coefficient or GINI Index. It measures inequality on a scale from 0 to 100, 0 being perfect equality and 100 being perfect inequality.
- ```death_rate```: death rate per 1000 population
- ```life_expectancy```: life expectancy for both sexes

## Step 1 - Data Collection

### Create countries CSV file from API

In [9]:
# Handle ChunkedEncodingError by adding 'try except' and adding 1 second pause between each attempt
for attempt in range(5):
    country_list = []
    
    try:
        response = requests.get('https://restcountries.com/v3.1/all')
        
        if response.status_code != 200:
            print('Failed to find data')

        else:
            countriesJson = json.loads(response.content)
    
            # Collect relevant columns
            for item in countriesJson:
                keys = item.keys()
                country = {
                    'name': item['name']['common'],
                    'cca3': item['cca3'],
                    'continent' : item['continents'][0],
                    'population': item['population'],
                    'area': item['area'],
                    'languages': ', '.join(list(item['languages'].values())) if 'languages' in keys else None,
                    'gini': list(item['gini'].values())[0] if 'gini' in keys else None
                }
                country_list.append(country)
    
            header = country_list[0].keys()
    
            ## Create/replace CSV file
            # utf-8 keeps accented country names intact; the with block closes the file on exit
            with open('Data/countries.csv', 'w', newline='', encoding='utf-8') as output_file:
                dict_writer = csv.DictWriter(output_file, header)
                dict_writer.writeheader()
                dict_writer.writerows(country_list)
                print('CSV file', output_file.name, 'created successfully')
    
        break
    except requests.exceptions.ChunkedEncodingError:
        time.sleep(1)
else:
    print('Failed to fetch data from REST countries API')

CSV file Data/countries.csv created successfully
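
The column-collection step above can be isolated into a small function, which makes the handling of missing `languages` and `gini` keys easier to check. This is a minimal sketch, not the notebook's actual cell: `flatten_country` is a hypothetical helper, and both sample records are invented to mirror only the fields the cell accesses.

```python
def flatten_country(item):
    """Pick the columns used in this notebook from one API record."""
    languages = item.get('languages')  # dict such as {'fra': 'French'}, or missing
    gini = item.get('gini')            # dict such as {'2018': 33.1}, or missing
    return {
        'name': item['name']['common'],
        'cca3': item['cca3'],
        'continent': item['continents'][0],
        'population': item['population'],
        'area': item['area'],
        'languages': ', '.join(languages.values()) if languages else None,
        'gini': next(iter(gini.values())) if gini else None,
    }


# Invented records shaped like the fields read in the cell above
full_record = {
    'name': {'common': 'Exampleland'},
    'cca3': 'EXL',
    'continents': ['Europe'],
    'population': 1000,
    'area': 50.0,
    'languages': {'fra': 'French', 'ita': 'Italian'},
    'gini': {'2018': 33.1},
}
sparse_record = {
    'name': {'common': 'Emptyland'},
    'cca3': 'EMP',
    'continents': ['Antarctica'],
    'population': 0,
    'area': 10.0,
}

full_row = flatten_country(full_record)
sparse_row = flatten_country(sparse_record)
```

Using `dict.get` avoids building the `keys` view for every record and returns `None` when a key is absent.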


### Create life expectancy CSV through web scraping

In [11]:
# Scrape the table
response = requests.get('https://www.worldometers.info/demographics/life-expectancy/')

if response.status_code != 200:
    print('Failed to fetch data.')
else:
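    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.content, 'html.parser')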
    life_exp_table = soup.find('table', {'id': 'example2'})
    
    # save the data into a dictionary
    life_exp_list = []
    
    # header
    table_header = life_exp_table.select('thead tr th')
    keys = list(map(lambda x: x.text, table_header))
    
    # body
    table_rows = life_exp_table.find('tbody').find_all('tr')
    
    for row in table_rows:
        table_data = row.select('td')
        row_data = [cell.text for cell in table_data]
        # pair each header with the cell in the same position
        life_exp_list.append(dict(zip(keys, row_data)))

  
    # create/replace CSV file
    # newline='' avoids blank rows on Windows; the with block closes the file on exit
    with open('Data/life_expectancy.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=keys)
        dict_writer.writeheader()
        dict_writer.writerows(life_exp_list)
        print('CSV file', output_file.name, 'created successfully')

CSV file Data/life_expectancy.csv created successfully
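
The scraped header cells carry stray spaces, which is why column names such as `'Life Expectancy  (both sexes) '` appear later in the notebook. One option is to normalize whitespace before writing the CSV. The sketch below is illustrative only: the header mimics the column names printed later, the row values are invented, and `clean` is a hypothetical helper. Applying it in the notebook would also require updating the column names used in the later `drop` and `rename` cells.

```python
import csv
import io

# Header text mimicking the scraped column names; the row values are invented
raw_header = ['#', 'Country', 'Life Expectancy  (both sexes) ',
              'Females  Life Expectancy ', 'Males  Life Expectancy']
raw_rows = [['1', ' Exampleland ', '80.00', '82.00', '78.00']]


def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return ' '.join(text.split())


header = [clean(h) for h in raw_header]
records = [dict(zip(header, (clean(v) for v in row))) for row in raw_rows]

# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```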


### Create dataframe from the CSV files

In [13]:
# Get number of rows/columns of each dataframe
countries = pd.read_csv('Data/countries.csv')
print('countries shape:', countries.shape)

gdp = pd.read_csv('Data/GDP.csv', on_bad_lines='error')
print('gdp shape:', gdp.shape)

life_expectancy = pd.read_csv('Data/life_expectancy.csv')
print('life_expectancy shape:', life_expectancy.shape)

death_rate = pd.read_csv('Data/Death rate.csv')
print('death_rate shape:', death_rate.shape)

countries shape: (250, 7)
gdp shape: (217, 4)
life_expectancy shape: (201, 5)
death_rate shape: (229, 7)


In [14]:
# Merge countries with gdp
countries = countries.merge(gdp, how='left', left_on='cca3', right_on='code')

In [15]:
# life_expectancy and death_rate have no country code column, so the only way to merge them
# is on the country name. The names do not always match those in the countries dataframe,
# so we rename them to match before merging.
new_list = []

for country in countries['name'].tolist():
    if country not in life_expectancy['Country'].tolist():
        new_list.append(country)

# print(new_list)

# create dictionary after searching in both CSV files
countries_to_rename = {
    "Côte d'Ivoire": "Ivory Coast",
    "Cabo Verde": "Cape Verde",
    "State of Palestine": "Palestine",
    "U.S. Virgin Islands": "United States Virgin Islands",
    "Sao Tome & Principe": "São Tomé and Príncipe",
    "Macao": "Macau",
    "Czech Republic (Czechia)": "Czechia",
    "St. Vincent & Grenadines": "Saint Vincent and the Grenadines",
    "Brunei ": "Brunei",
    # "Samoa" is not renamed: Samoa and American Samoa are two different territories
    "Congo": "Republic of the Congo"
}

life_expectancy['Country'] = life_expectancy['Country'].replace(countries_to_rename)
countries = countries.merge(life_expectancy, how='left', left_on='name', right_on='Country')
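
Finding the names that fail to match can also be done with an outer merge and `indicator=True`, which reports unmatched rows on both sides at once. This is a sketch on invented toy frames, not the notebook's data; the `life_expectancy` values are placeholders.

```python
import pandas as pd

# Toy frames: two of the three names differ between the sources
left = pd.DataFrame({'name': ['Ivory Coast', 'Brunei', 'Czechia']})
right = pd.DataFrame({
    'Country': ["Côte d'Ivoire", 'Brunei ', 'Czechia'],
    'life_expectancy': [1.0, 2.0, 3.0],  # placeholder values
})

# The _merge column says whether each row matched on both sides or only one
merged = left.merge(right, how='outer', left_on='name',
                    right_on='Country', indicator=True)

only_left = merged.loc[merged['_merge'] == 'left_only', 'name'].tolist()
only_right = merged.loc[merged['_merge'] == 'right_only', 'Country'].tolist()
```

Comparing `only_left` with `only_right` side by side shows which spellings need an entry in the rename dictionary, including hard-to-spot cases such as a trailing space.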

In [16]:
# Same method for death_rate
new_list = []

for country in countries['name'].tolist():
    if country not in death_rate['name'].tolist():
        new_list.append(country)

# print(new_list)

countries_to_rename = {
    "Cote d'Ivoire": "Ivory Coast",
    "Cabo Verde": "Cape Verde",
    "Saint Barthelemy": "Saint Barthélemy",
    "Turkey (Turkiye)": "Turkey",
    "Korea, North": "North Korea",
    "Virgin Islands": "United States Virgin Islands",
    "Sao Tome and Principe": "São Tomé and Príncipe",
    "Saint Helena, Ascension, and Tristan da Cunha": "Saint Helena, Ascension and Tristan da Cunha",
    "Micronesia, Federated States of": "Micronesia",
    "Curacao": "Curaçao",
    "Korea, South": "South Korea",
    "Falkland Islands (Islas Malvinas)": "Falkland Islands",
    "Gambia, The": "Gambia",
    "Bahamas, The": "Bahamas",
    "Congo, Republic of the": "Republic of the Congo",
    "Congo, Democratic Republic of the": "DR Congo"
}

death_rate['name'] = death_rate['name'].replace(countries_to_rename)
countries = countries.merge(death_rate, how='left', on='name')
print('Merged dataframe shape:', countries.shape)

Merged dataframe shape: (251, 22)
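
The merged dataframe has 251 rows although `countries` started with 250, which suggests that a key matched more than one row in one of the three lookup tables. That is an inference from the shapes alone. The sketch below uses invented toy frames to show how a duplicated key adds a row, and how `duplicated` or the `validate` argument of `merge` can surface it.

```python
import pandas as pd

countries_demo = pd.DataFrame({'name': ['A', 'B', 'C']})
# 'B' appears twice in the lookup table
rates_demo = pd.DataFrame({'name': ['A', 'B', 'B'],
                           'death_rate': [5.0, 6.0, 7.0]})

# A left merge repeats the left row once per matching right row
merged = countries_demo.merge(rates_demo, how='left', on='name')

# List the keys that occur more than once in the lookup table
dup_keys = rates_demo.loc[rates_demo['name'].duplicated(keep=False), 'name'] \
    .unique().tolist()

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    countries_demo.merge(rates_demo, how='left', on='name',
                         validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
```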


## Step 2 - Data Exploration

In [25]:
# Remove scientific notation and keep 2 decimals
pd.options.display.float_format = '{:.2f}'.format

In [27]:
countries.head()

Unnamed: 0,name,cca3,continent,population,area,languages,gini,code,rank,country,...,Country,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,slug,deaths/1,000 population,date_of_information,ranking,region
0,South Georgia,SGS,Antarctica,30,3903.0,English,,,,,...,,,,,,,,,,
1,Grenada,GRD,North America,112519,344.0,English,,GRD,193.0,Grenada,...,Grenada,75.37,78.5,72.52,grenada,8.4,2024.0,76.0,Central America and the Caribbean,
2,Switzerland,CHE,Europe,8654622,41284.0,"French, Swiss German, Italian, Romansh",33.1,CHE,20.0,Switzerland,...,Switzerland,84.09,85.95,82.17,switzerland,8.5,2024.0,73.0,Europe,
3,Sierra Leone,SLE,Africa,7976985,71740.0,English,35.7,SLE,169.0,Sierra Leone,...,Sierra Leone,61.96,63.7,60.23,sierra-leone,9.0,2024.0,58.0,Africa,
4,Hungary,HUN,Europe,9749763,93028.0,Hungarian,29.6,HUN,56.0,Hungary,...,Hungary,77.18,80.33,73.89,hungary,14.5,2024.0,6.0,Europe,


In [29]:
countries.tail()

Unnamed: 0,name,cca3,continent,population,area,languages,gini,code,rank,country,...,Country,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,slug,deaths/1,000 population,date_of_information,ranking,region
246,Belgium,BEL,Europe,11555997,30528.0,"German, French, Dutch",27.2,BEL,23.0,Belgium,...,Belgium,82.27,84.45,80.06,belgium,9.5,2024.0,46.0,Europe,
247,Israel,ISR,Asia,9216900,20770.0,"Arabic, Hebrew",39.0,ISR,28.0,Israel,...,Israel,82.73,84.71,80.67,israel,5.2,2024.0,189.0,Middle East,
248,New Zealand,NZL,Oceania,5084300,270467.0,"English, Māori, New Zealand Sign Language",,NZL,51.0,New Zealand,...,New Zealand,82.25,83.89,80.6,new-zealand,6.9,2024.0,126.0,Australia and Oceania,
249,Nicaragua,NIC,North America,6624554,130373.0,Spanish,46.2,NIC,127.0,Nicaragua,...,Nicaragua,75.1,77.58,72.46,nicaragua,5.1,2024.0,191.0,Central America and the Caribbean,
250,Anguilla,AIA,North America,13452,91.0,English,,,,,...,,,,,anguilla,4.7,2024.0,204.0,Central America and the Caribbean,


In [31]:
countries.sample(5)

Unnamed: 0,name,cca3,continent,population,area,languages,gini,code,rank,country,...,Country,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,slug,deaths/1,000 population,date_of_information,ranking,region
46,Cook Islands,COK,Oceania,18100,236.0,"English, Cook Islands Māori",,,,,...,,,,,cook-islands,9.4,2024.0,48.0,Australia and Oceania,
232,Cayman Islands,CYM,North America,65720,264.0,English,,CYM,160.0,Cayman Islands,...,,,,,cayman-islands,6.1,2024.0,149.0,Central America and the Caribbean,
26,Northern Mariana Islands,MNP,Oceania,57557,464.0,"Carolinian, Chamorro, English",,MNP,,Northern Mariana Islands,...,,,,,northern-mariana-islands,5.7,2024.0,172.0,Australia and Oceania,
238,Ghana,GHA,Africa,31072945,238533.0,English,43.5,GHA,81.0,Ghana,...,Ghana,65.7,68.16,63.31,ghana,5.9,2024.0,162.0,Africa,
225,Belize,BLZ,North America,397621,22966.0,"Belizean Creole, English, Spanish",53.3,BLZ,174.0,Belize,...,Belize,73.74,76.66,71.08,belize,5.0,2024.0,194.0,Central America and the Caribbean,


In [20]:
countries.dtypes

name                               object
cca3                               object
continent                          object
population                          int64
area                              float64
languages                          object
gini                              float64
code                               object
rank                              float64
country                            object
gdp                                object
#                                 float64
Country                            object
Life Expectancy  (both sexes)     float64
Females  Life Expectancy          float64
Males  Life Expectancy            float64
slug                               object
 deaths/1                         float64
000 population                    float64
date_of_information               float64
ranking                            object
region                            float64
dtype: object
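
The output above lists ` deaths/1` and `000 population` as two separate columns, and `region` as an all-numeric column. A likely explanation, assumed here rather than verified against the file, is that the header field `deaths/1,000 population` contains an unquoted comma, which shifts every following column name by one position. The sketch below reproduces the effect on an invented two-line CSV and reads it again with explicit column names; the name `death_rate` is chosen for illustration.

```python
import io

import pandas as pd

# Invented file content: the third header field contains an unquoted comma
raw = (
    "name,slug, deaths/1,000 population,date_of_information,ranking,region\n"
    "Exampleland,exampleland,8.4,2024,76,Europe\n"
)

# Naive read: 7 header names for 6 values, so names after the comma are shifted
naive = pd.read_csv(io.StringIO(raw))

# Skip the broken header line and supply the six intended names instead
names = ['name', 'slug', 'death_rate', 'date_of_information', 'ranking', 'region']
fixed = pd.read_csv(io.StringIO(raw), skiprows=1, header=None, names=names)
```

In the naive read, the region text ends up under `ranking` and `region` is empty, which matches the dtypes shown above.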

In [21]:
countries.describe()

Unnamed: 0,population,area,gini,rank,#,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,deaths/1,000 population,date_of_information,region
count,251.0,251.0,168.0,207.0,191.0,191.0,191.0,191.0,213.0,213.0,213.0,0.0
mean,30987396.75,598194.84,38.18,104.55,100.12,73.81,76.41,71.22,7.62,2023.99,113.19,
std,129422515.14,1906357.54,7.89,60.56,57.99,7.05,7.19,7.01,2.77,0.21,66.09,
min,0.0,0.44,24.6,1.0,1.0,54.64,54.94,53.36,1.4,2021.0,1.0,
25%,208785.5,1116.0,32.77,52.5,50.5,68.72,71.28,66.44,5.7,2024.0,57.0,
50%,4829764.0,64559.0,36.9,104.0,101.0,74.66,77.91,71.08,7.2,2024.0,114.0,
75%,18935324.5,367522.0,42.45,157.5,150.5,78.95,81.9,76.23,9.1,2024.0,170.0,
max,1402112000.0,17098242.0,63.0,208.0,200.0,85.63,88.26,82.97,18.6,2024.0,229.0,


In [22]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   name                            251 non-null    object 
 1   cca3                            251 non-null    object 
 2   continent                       251 non-null    object 
 3   population                      251 non-null    int64  
 4   area                            251 non-null    float64
 5   languages                       250 non-null    object 
 6   gini                            168 non-null    float64
 7   code                            216 non-null    object 
 8   rank                            207 non-null    float64
 9   country                         216 non-null    object 
 10  gdp                             216 non-null    object 
 11  #                               191 non-null    float64
 12  Country                         191 

## Step 3 - Data Preparation

In [24]:
# Print all the columns
print(countries.columns.values)

['name' 'cca3' 'continent' 'population' 'area' 'languages' 'gini' 'code'
 'rank' 'country' 'gdp' '#' 'Country' 'Life Expectancy  (both sexes) '
 'Females  Life Expectancy ' 'Males  Life Expectancy' 'slug' ' deaths/1'
 '000 population' 'date_of_information' 'ranking' 'region']


In [25]:
# Drop the extra country columns
countries = countries.drop(['country', 'Country'], axis=1)
countries.sample(5)

Unnamed: 0,name,cca3,continent,population,area,languages,gini,code,rank,gdp,#,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,slug,deaths/1,000 population,date_of_information,ranking,region
134,Tuvalu,TUV,Oceania,11792,26.0,"English, Tuvaluan",39.1,TUV,208.0,62.0,,,,,tuvalu,7.8,2024.0,94.0,Australia and Oceania,
6,Wallis and Futuna,WLF,Oceania,11750,142.0,French,,,,,,,,,wallis-and-futuna,6.0,2024.0,153.0,Australia and Oceania,
55,Réunion,REU,Africa,840974,2511.0,French,,,,,10.0,83.67,86.45,80.67,,,,,,
48,Zambia,ZMB,Africa,18383956,752612.0,English,57.1,ZMB,109.0,28163.0,165.0,66.53,68.87,64.1,zambia,5.9,2024.0,160.0,Africa,
63,Croatia,HRV,Europe,4047200,56594.0,Croatian,29.7,HRV,76.0,82689.0,53.0,78.75,81.82,75.6,croatia,13.1,2024.0,13.0,Europe,


In [26]:
# Drop the second code column and rename the first one
countries = countries.drop('code', axis=1).rename(columns={'cca3': 'code'})
countries.sample(5)

Unnamed: 0,name,code,continent,population,area,languages,gini,rank,gdp,#,Life Expectancy (both sexes),Females Life Expectancy,Males Life Expectancy,slug,deaths/1,000 population,date_of_information,ranking,region
105,Malaysia,MYS,Asia,32365998,330803.0,"English, Malay",41.1,37.0,399649.0,76.0,76.82,79.52,74.45,malaysia,5.8,2024.0,166.0,East and Southeast Asia,
160,Heard Island and McDonald Islands,HMD,Antarctica,0,412.0,English,,,,,,,,,,,,,
100,French Southern and Antarctic Lands,ATF,Antarctica,400,7747.0,French,,,,,,,,,,,,,
230,Jamaica,JAM,North America,2961161,10991.0,"English, Jamaican Patois",45.5,124.0,19423.0,131.0,71.61,74.15,69.08,jamaica,7.5,2024.0,100.0,Central America and the Caribbean,
243,Portugal,PRT,Europe,10305564,92090.0,Portuguese,33.5,48.0,287080.0,22.0,82.55,85.25,79.68,portugal,10.9,2024.0,27.0,Europe,


In [42]:
# Drop the remaining irrelevant columns and rename the life expectancy and death rate columns
countries = countries.drop(['#',
                            'Females  Life Expectancy ',
                            'Males  Life Expectancy',
                            'slug', '000 population',
                            'date_of_information',
                            'ranking',
                            'region'], axis=1)
countries = countries.rename(columns={'Life Expectancy  (both sexes) ': 'life_expectancy', ' deaths/1': 'death_rate'})
countries.sample(5)

Unnamed: 0,name,code,continent,population,area,languages,gini,rank,gdp,life_expectancy,death_rate
225,Belize,BLZ,North America,397621,22966.0,"Belizean Creole, English, Spanish",53.3,174.0,3282,73.74,5.0
64,Morocco,MAR,Africa,36910558,446550.0,"Arabic, Berber",39.5,60.0,141109,75.49,6.6
229,Liberia,LBR,Africa,5057677,111369.0,English,35.3,167.0,4332,62.32,8.3
224,Serbia,SRB,Europe,6908224,88361.0,Serbian,36.2,82.0,75187,76.94,14.9
236,Tonga,TON,Oceania,105697,747.0,"English, Tongan",37.6,202.0,500,73.07,5.0


In [44]:
# Strip leading and trailing whitespace from string columns
countries['name'] = countries['name'].str.strip()
countries['code'] = countries['code'].str.strip()
countries['continent'] = countries['continent'].str.strip()
countries.sample(5)

Unnamed: 0,name,code,continent,population,area,languages,gini,rank,gdp,life_expectancy,death_rate
39,Palestine,PSE,Asia,4803269,6220.0,Arabic,33.7,128.0,17396,,
230,Jamaica,JAM,North America,2961161,10991.0,"English, Jamaican Patois",45.5,124.0,19423,71.61,7.5
62,Greece,GRC,Europe,10715549,131990.0,Greek,32.9,54.0,238206,82.03,12.0
16,Laos,LAO,Asia,7275556,236800.0,Lao,38.8,133.0,15843,69.23,6.2
122,Iceland,ISL,Europe,366425,103000.0,Icelandic,26.1,105.0,31020,83.01,6.6


In [48]:
# Convert gdp column to float, NaN if empty.
# Casting to str first keeps this cell safe to re-run once gdp is already numeric
# (the .str accessor raises an AttributeError on a float column).
countries['gdp'] = pd.to_numeric(
    countries['gdp'].astype(str).str.replace(',', '').str.strip(),
    errors='coerce')
countries.sample(5)

In [60]:
# Check for null values
countries[countries.isnull().any(axis=1)].sample(10)

Unnamed: 0,name,code,continent,population,area,languages,gini,rank,gdp,life_expectancy,death_rate
1,Grenada,GRD,North America,112519,344.0,English,,193.0,1320.0,75.37,8.4
226,Myanmar,MMR,Asia,54409794,676578.0,Burmese,30.7,87.0,64815.0,67.1,
117,Macau,MAC,Asia,649342,30.0,"Portuguese, Chinese",,94.0,47062.0,,4.9
141,Palau,PLW,Oceania,18092,459.0,"English, Palauan",,206.0,263.0,,8.4
115,Oman,OMN,Asia,5106622,309500.0,Arabic,,66.0,108192.0,80.25,3.2
205,Saint Vincent and the Grenadines,VCT,North America,110947,389.0,English,,196.0,1066.0,,7.7
172,Antigua and Barbuda,ATG,North America,97928,442.0,English,,186.0,2033.0,77.77,5.7
26,Northern Mariana Islands,MNP,Oceania,57557,464.0,"Carolinian, Chamorro, English",,,,,5.7
218,Kuwait,KWT,Asia,4270563,17818.0,Arabic,,59.0,161772.0,80.6,2.3
52,Tokelau,TKL,Oceania,1411,12.0,"English, Samoan, Tokelauan",,,,,


In [None]:
# Remove rows with null gini and gdp values
countries = countries.dropna(subset=['gini', 'gdp'])
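
Before dropping rows, it can help to count the missing values per column and check how many rows survive. `dropna(subset=[...])` removes a row when any of the listed columns is null. The sketch below uses an invented four-row frame, not the notebook's data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'gini': [33.1, np.nan, 40.0, np.nan],
    'gdp': [100.0, 200.0, np.nan, np.nan],
    'life_expectancy': [80.0, np.nan, 70.0, 75.0],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()

# Rows missing gini OR gdp are removed; nulls in other columns are kept
kept = demo.dropna(subset=['gini', 'gdp'])
rows_dropped = len(demo) - len(kept)
```

Only row `A` has both values, so three of the four rows are removed. Nulls in `life_expectancy` do not affect the result because that column is not in `subset`.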

In [50]:
# Inspect data after cleaning
countries.dtypes

name                object
code                object
continent           object
population           int64
area               float64
languages           object
gini               float64
rank               float64
gdp                float64
life_expectancy    float64
death_rate         float64
dtype: object

In [None]:
countries.info()

In [None]:
countries.describe()

In [None]:
# Sort countries in alphabetical order and reset index
countries = countries.sort_values(by='name').reset_index(drop=True)
countries.head()

In [None]:
countries.tail()

In [None]:
# Add new column named GDP per Capita (per 100,000 people)
countries['gdp_per_capita'] = countries['gdp'] / countries['population'] * 100000
countries.sample(5)
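
Because `gdp` is expressed in millions of US Dollars, the column computed above is in millions of US Dollars per 100,000 people, not in dollars per person. The arithmetic below works through one row, using the Zambia values shown in an earlier sample output. The variable names are illustrative.

```python
# Values taken from the Zambia row displayed earlier in the notebook
gdp_millions_usd = 28163.0
population = 18383956

# Formula used in the cell above: millions of USD per 100,000 people
per_100k = gdp_millions_usd / population * 100000

# Conventional GDP per capita: US Dollars per person
usd_per_person = gdp_millions_usd * 1_000_000 / population

# The two differ only by a constant factor of 10
ratio = usd_per_person / per_100k
```

Here `usd_per_person` comes to about 1,532 US Dollars. Since the two measures differ by a constant factor, correlations computed with either one are the same.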

In [None]:
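# Reorder columns to follow the order of the columns description
countries = countries[['name', 'code', 'continent', 'rank', 'population', 'area', 'languages',
                       'gdp', 'gdp_per_capita', 'gini', 'death_rate', 'life_expectancy']]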

In [None]:
# Display correlation table
countries.corr(numeric_only=True)

## Step 4 - Data Analysis

#### Q1 - What are the 10 most populated countries?

In [None]:
# Data
top_10_pop_countries = countries \
    .nlargest(n=10, columns='population')[['name', 'population']] \
    .set_index('name')
top_10_pop_countries

In [None]:
# Visualization

# plot
sns.set(rc={'figure.figsize': (10, 5)})
ax = sns.barplot(data=top_10_pop_countries,
                 x=top_10_pop_countries.index,
                 y=top_10_pop_countries['population'],
                 hue=top_10_pop_countries.index,
                 alpha=0.8)

# labels
ax.set_title('Top 10 Most Populated Countries', fontsize=18)
ax.set_xlabel('Country')
ax.set_ylabel('Population (Billions)')
ax.xaxis.label.set_size(15)
ax.yaxis.label.set_size(15)
ax.tick_params(axis='both', labelsize=12)
ax.tick_params(axis='x', rotation=45)

# annotate plot
ax.text(1.5, 1100000000, "India has been the most populated country since 2023,\nbut the data from the API is not the most recent.")

# remove scientific notation showing at the top of the y axis
ax.yaxis.offsetText.set_visible(False)

plt.show()

#### Q2 - What are the top 3 most populated countries by continent?

In [None]:
countries \
    .sort_values(by=['continent', 'population'], ascending=[True, False]) \
    .groupby('continent') \
    .head(3)[['continent', 'name', 'population']] \
    .reset_index(drop=True)
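
The sort-then-`head` pattern above can be checked on a small invented frame. The sketch also shows what happens when a group has fewer than three rows: it returns the rows it has.

```python
import pandas as pd

demo = pd.DataFrame({
    'continent': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'population': [10, 40, 30, 20, 5, 50],
})

# Sort by continent, then by population descending, and keep 3 rows per group
top3 = demo \
    .sort_values(by=['continent', 'population'], ascending=[True, False]) \
    .groupby('continent') \
    .head(3)[['continent', 'name', 'population']] \
    .reset_index(drop=True)
```

Group `X` keeps `b`, `c` and `d`, while group `Y` keeps both of its rows.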

#### Q3 - What is the most populated continent?

In [None]:
# Data
total_pop_by_continent = countries \
    .groupby('continent')[['population']] \
    .sum() \
    .sort_values(by='population', ascending=False)
total_pop_by_continent

In [None]:
# Answer
most_populated_continent = total_pop_by_continent.head(1)
print('The most populated continent is', most_populated_continent.index[0], 'with', most_populated_continent.iloc[0,0], 'people.')

#### Q4 - What are the top 20 largest countries?

In [None]:
# Data
top_20_largest_countries = countries \
    .nlargest(n=20, columns='area')[['name', 'area']] \
    .set_index('name')
top_20_largest_countries

In [None]:
# Visualization

# plot
sns.set(rc={'figure.figsize': (10, 5)})
ax = sns.barplot(data=top_20_largest_countries,
                 x=top_20_largest_countries['area'],
                 y=top_20_largest_countries.index,
                 hue=top_20_largest_countries.index,
                 alpha=0.8,
                 palette='Paired',
                 orient='h')

# labels
ax.set_title('Top 20 Largest Countries', fontsize=18)
ax.set_xlabel('Area (Tens of Millions $km^2$)')
ax.set_ylabel('Country')
ax.xaxis.label.set_size(15)
ax.yaxis.label.set_size(15)
ax.tick_params(axis='both', labelsize=12)

# remove scientific notation showing at the end of the x axis
ax.xaxis.offsetText.set_visible(False)

plt.show()

#### Q5 - What is the smallest continent?

In [None]:
# Data
total_area_by_continent = countries \
    .groupby('continent')[['area']] \
    .sum() \
    .sort_values(by='area')
total_area_by_continent

In [None]:
# Answer
smallest_continent = total_area_by_continent.head(1)
print('The smallest continent is', smallest_continent.index[0], 'which is', round(smallest_continent.iloc[0,0]), 'square kilometers.')

#### Q6 - What are the top 5 most common languages?

In [None]:
# Data
top_languages = countries['languages'].str.split(', ', expand=True).stack().value_counts().nlargest(n=5)
top_languages = pd.DataFrame(top_languages).rename(columns={'count': 'Number of Countries'})
top_languages
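
An equivalent way to count languages is to split each string into a list and use `explode`, which produces one row per language. This is a sketch on invented data, offered as an alternative to `expand=True` followed by `stack()`.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'languages': ['English, French', 'English', None, 'French, German, English'],
})

# One row per (country, language); countries with no language data are dropped
language_counts = demo['languages'] \
    .str.split(', ') \
    .explode() \
    .dropna() \
    .value_counts()

top_2 = language_counts.nlargest(n=2)
```

Each count is the number of countries where the language is listed, not the number of speakers.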

#### Q7 - What are the 10 countries with the highest inequality?

In [None]:
top_10_inequal_countries = countries.nlargest(n=10, columns='gini')[['name', 'continent', 'gini']]
top_10_inequal_countries

#### Q8 - What are the 10 countries with the highest death rate?

In [None]:
top_10_death_rate_countries = countries.nlargest(n=10, columns='death_rate')[['name', 'continent', 'death_rate']]
top_10_death_rate_countries

#### Q9A - What is the overall average life expectancy?
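
No cell is recorded for this question. One possible approach, sketched here on invented data, is to take the mean of `life_expectancy` across countries. Such a mean treats every country equally, so a population-weighted mean is shown next to it for comparison. The column names follow the notebook; the values do not come from the dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    'life_expectancy': [80.0, 60.0, None],
    'population': [1_000_000, 3_000_000, 500_000],
})

# Unweighted mean across countries (missing values are skipped)
simple_mean = demo['life_expectancy'].mean()

# Population-weighted mean, using only rows with a life expectancy value
valid = demo.dropna(subset=['life_expectancy'])
weighted_mean = (valid['life_expectancy'] * valid['population']).sum() \
    / valid['population'].sum()
```

The two figures answer different questions: the first is the average over countries, the second is closer to the average over people.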

#### Q9B - What is the average life expectancy by continent?

In [None]:
# Data
countries.groupby('continent').agg({'life_expectancy': ['mean', 'median']})

In [None]:
# Visualization

# plot
flierprops = dict(marker='d', markersize=3, markerfacecolor='black')
sns.set(rc={'figure.figsize': (15, 4.5)})
ax = sns.boxplot(data=countries,
                 x='continent',
                 y='life_expectancy',
                 hue='continent',
                 flierprops=flierprops,
                 showmeans=True,
                 palette='Set2')

# labels
ax.set_title('Distribution of the Life Expectancy per Continent', fontsize=14)
ax.set_xlabel('Continent')
ax.set_ylabel('Life Expectancy')
ax.xaxis.label.set_size(14)
ax.yaxis.label.set_size(14)
ax.tick_params(axis='both', labelsize=9)

plt.show()

#### Q10 - What variable seems to be correlated to life expectancy the most?

In [None]:
# Display correlation
sns.reset_orig() # reset the figure size so this plot does not inherit the previous one
corr_table = countries.corr(numeric_only=True)
sns.heatmap(corr_table, annot=True, cmap='coolwarm', vmin=-1)
plt.show()

There seems to be a moderate negative correlation (-0.41) between the GINI Index and life expectancy, and a strong positive correlation (0.65) between GDP per capita and life expectancy.
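
Rather than reading the values off the heatmap, the correlations with life expectancy can be extracted and ranked by absolute value. The sketch below uses an invented five-row frame, so its coefficients differ from those quoted above.

```python
import pandas as pd

demo = pd.DataFrame({
    'life_expectancy': [60.0, 65.0, 70.0, 75.0, 80.0],
    'gdp_per_capita': [1.0, 2.0, 4.0, 7.0, 11.0],
    'gini': [50.0, 40.0, 48.0, 38.0, 42.0],
})

# Correlation of every numeric column with life_expectancy, excluding itself
corr_with_life = demo.corr(numeric_only=True)['life_expectancy'] \
    .drop('life_expectancy')

# Order by strength (absolute value) while keeping the sign
order = corr_with_life.abs().sort_values(ascending=False).index
ranked = corr_with_life.reindex(order)
```

Ranking by absolute value keeps strong negative correlations, such as the one with the GINI Index, from being overlooked.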

In [None]:
# Display plot of Life Expectancy over the 2 variables

# plot
ax = sns.pairplot(data=countries,
             x_vars=['gini', 'gdp_per_capita'],
             y_vars=['life_expectancy'],
             hue='continent',
             height=7)
ax.fig.suptitle('Life Expectancy vs the GINI Index and the GDP per Capita', fontsize=18)

# define function to plot a single regression line
def regline(x, y, **kwargs):
    sns.regplot(data=kwargs['data'], x=x.name, y=y.name, scatter=False, color=kwargs['color'], ci=None)

# call the function for each non-diagonal subplot within pairplot
ax.map_offdiag(regline, color='red', data=countries)

plt.tight_layout()
plt.show()

## Conclusion

On the left plot, life expectancy decreases as the GINI index increases, and most countries with the lowest life expectancy are in Africa.

On the right plot, life expectancy increases with GDP per capita, and the countries with the highest life expectancy are in Europe.

These correlations suggest that higher income and a more even distribution of wealth are associated with a better quality of life, which usually means a longer life.

## End