# Life Expectancy Analysis

## 1. Introduction

- Data source: https://www.kaggle.com/datasets/fredericksalazar/life-expectancy-1960-to-present-global/data

Column definitions:
- `Country Code`: Unique identifier for each country.
- `Country Name`: Official name of the country.
- `Region`: Broad geographical area (e.g., Asia, Europe, Africa).
- `Sub-Region`: More specific regional classification within the broader region.
- `Intermediate Region`: Additional granular geographical grouping when applicable.
- `Year`: The specific year to which the data pertains.
- `Life Expectancy for Women`: Average years a woman is expected to live in that country and year.
- `Life Expectancy for Men`: Average years a man is expected to live in that country and year.

## 2. Libraries

In [1]:
%run 0.0-data_projects-setup.ipynb
%run pandas-missing-extension.ipynb

In [2]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Cleaning
import janitor

# Missing Values Analysis
import missingno as msno

import warnings
warnings.filterwarnings('ignore')

## 3. Download and Cleaning Data

### 3.1 Download and First Data View

In [3]:
file_zip_path = path.data_raw_dir("life-expectancy.zip")
url = "https://www.kaggle.com/api/v1/datasets/download/fredericksalazar/life-expectancy-1960-to-present-global"

In [4]:
!curl -L -o {file_zip_path} {url}
!unzip -o {file_zip_path} -d {path.data_raw_dir()} && rm {file_zip_path}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  108k  100  108k    0     0   193k      0 --:--:-- --:--:-- --:--:--  193k
Archive:  /home/pahoalapizco/ds-projects/data_projects/data/raw/life-expectancy.zip
  inflating: /home/pahoalapizco/ds-projects/data_projects/data/raw/life_expectancy_dataset.csv  


In [5]:
file_path = path.data_raw_dir("life_expectancy_dataset.csv") 
df = pd.read_csv(file_path, delimiter=";", decimal=",")
df.sample(10, random_state=42)

Unnamed: 0,country_code,country_name,region,sub-region,intermediate-region,year,life_expectancy_women,life_expectancy_men
811,AUT,AUSTRIA,EUROPE,WESTERN EUROPE,,2015,83.7,78.8
13468,ZMB,ZAMBIA,AFRICA,SUB-SAHARAN AFRICA,EASTERN AFRICA,2009,56.62,53.83
12497,TGO,TOGO,AFRICA,SUB-SAHARAN AFRICA,WESTERN AFRICA,1983,54.15,51.13
8456,MNE,MONTENEGRO,EUROPE,SOUTHERN EUROPE,,1974,74.13,68.45
6415,VIR,ISLAS VÍRGENES (EE.UU.),AMERICAS,LATIN AMERICA AND THE CARIBBEAN,CARIBBEAN,2012,82.0,75.2
3434,ECU,ECUADOR,AMERICAS,LATIN AMERICA AND THE CARIBBEAN,SOUTH AMERICA,1992,73.33,67.01
3120,CUW,CURACAO,AMERICAS,LATIN AMERICA AND THE CARIBBEAN,CARIBBEAN,1993,75.08,70.04
7745,MDG,MADAGASCAR,AFRICA,SUB-SAHARAN AFRICA,EASTERN AFRICA,2019,68.21,63.66
4010,SWZ,ESWATINI,AFRICA,SUB-SAHARAN AFRICA,SOUTHERN AFRICA,2001,48.13,43.75
2910,CRI,COSTA RICA,AMERICAS,LATIN AMERICA AND THE CARIBBEAN,CENTRAL AMERICA,1972,69.15,64.46
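
The `delimiter=";"` and `decimal=","` arguments in the cell above suggest that the raw file is semicolon-separated and writes decimals with a comma. That is an inference from the call itself, since the raw file is not shown here. A minimal sketch on an invented two-row string (the `raw` text is hypothetical, not copied from the dataset) illustrates what those two arguments do:

```python
import io

import pandas as pd

# Hypothetical raw text: semicolon-separated, comma as the decimal mark.
raw = (
    "country_code;year;life_expectancy_women;life_expectancy_men\n"
    "AUT;2015;83,7;78,8\n"
    "ZMB;2009;56,62;53,83\n"
)

# Same parsing arguments as the notebook cell, applied to the in-memory string.
parsed = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",")

print(parsed.dtypes)
print(parsed)
```

Without `decimal=","`, values such as `83,7` would be read as strings and the life expectancy columns would end up with `object` dtype instead of `float64`.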


### 3.2 Cleaning

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13545 entries, 0 to 13544
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country_code           13545 non-null  object 
 1   country_name           13545 non-null  object 
 2   region                 13545 non-null  object 
 3   sub-region             13545 non-null  object 
 4   intermediate-region    5670 non-null   object 
 5   year                   13545 non-null  int64  
 6   life_expectancy_women  13545 non-null  float64
 7   life_expectancy_men    13545 non-null  float64
dtypes: float64(2), int64(1), object(5)
memory usage: 846.7+ KB


#### 3.2.1 Cleaning Names

Standardizing column names:

In [7]:
df = df.clean_names(case_type="snake")
df.columns

Index(['country_code', 'country_name', 'region', 'sub_region',
       'intermediate_region', 'year', 'life_expectancy_women',
       'life_expectancy_men'],
      dtype='object')

Remove accents from country names:

In [8]:
df["country_name"] = df["country_name"].str.normalize("NFKD").str.encode("ascii", errors="ignore").str.decode("utf-8")
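
To see what the normalize/encode/decode chain does to a single value, here is a minimal sketch using only the standard library. The helper name `strip_accents` is hypothetical; it mirrors the pandas `.str` chain above one string at a time. NFKD splits an accented character into a base letter plus a combining mark, and encoding to ASCII with `errors="ignore"` then drops the mark.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters with NFKD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


sample = "ISLAS VÍRGENES (EE.UU.)"
cleaned = strip_accents(sample)
print(cleaned)
```

One caveat worth keeping in mind: characters that have no ASCII decomposition (for example `Ø` or `ß`) are removed entirely rather than replaced, so it can be worth checking country names for such characters before applying this step.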

#### 3.2.2 Duplicates

In [9]:
df.duplicated().sum()

np.int64(0)

#### 3.2.3 Handling Null Values 

In [10]:
df.isnull().sum()

country_code                0
country_name                0
region                      0
sub_region                  0
intermediate_region      7875
year                        0
life_expectancy_women       0
life_expectancy_men         0
dtype: int64

Missing values analysis by region and sub-region:

In [11]:
regions = list(df["region"].unique())
missing_by_region = {}
for region in regions:
    is_na = df.loc[df["region"] == region, "intermediate_region"].isnull()

    missing_by_region[region] = {
        "total_missing": is_na.sum(),
        "total_complete": len(is_na) - is_na.sum(),
    }

pd.DataFrame(missing_by_region)

Unnamed: 0,ASIA,EUROPE,AFRICA,AMERICAS,OCEANIA
total_missing,3150,2898,378,252,1197
total_complete,0,0,3024,2646,0
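
The same per-region summary can also be built without an explicit loop. The sketch below is one possible alternative, not the notebook's method. It runs on a small invented frame (`toy` is hypothetical and its values are illustrative), grouping the boolean null mask by region:

```python
import pandas as pd

# Toy stand-in for `df`; the rows are invented for illustration only.
toy = pd.DataFrame({
    "region": ["ASIA", "ASIA", "AFRICA", "AFRICA", "AFRICA"],
    "intermediate_region": [None, None, "EASTERN AFRICA", None, "WESTERN AFRICA"],
})

# Boolean mask: True where `intermediate_region` is missing.
is_na = toy["intermediate_region"].isna()

# Summing a boolean mask per group counts the True values in each region.
summary = pd.DataFrame({
    "total_missing": is_na.groupby(toy["region"]).sum(),
    "total_complete": (~is_na).groupby(toy["region"]).sum(),
}).T

print(summary)
```

The transpose at the end puts regions in the columns, matching the layout of the table above.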


In [12]:
df.groupby(["region", "sub_region", "intermediate_region"]).size()

region    sub_region                       intermediate_region
AFRICA    SUB-SAHARAN AFRICA               EASTERN AFRICA         1134
                                           MIDDLE AFRICA           567
                                           SOUTHERN AFRICA         315
                                           WESTERN AFRICA         1008
AMERICAS  LATIN AMERICA AND THE CARIBBEAN  CARIBBEAN              1386
                                           CENTRAL AMERICA         504
                                           SOUTH AMERICA           756
dtype: int64

In [13]:
df[df["intermediate_region"].isnull()].groupby(["region", "sub_region"]).size()

region    sub_region               
AFRICA    NORTHERN AFRICA               378
AMERICAS  NORTHERN AMERICA              252
ASIA      CENTRAL ASIA                  315
          EASTERN ASIA                  441
          SOUTH-EASTERN ASIA            693
          SOUTHERN ASIA                 567
          WESTERN ASIA                 1134
EUROPE    EASTERN EUROPE                630
          NORTHERN EUROPE               756
          SOUTHERN EUROPE               945
          WESTERN EUROPE                567
OCEANIA   AUSTRALIA AND NEW ZEALAND     126
          MELANESIA                     315
          MICRONESIA                    441
          POLYNESIA                     315
dtype: int64

The United Nations classifies the world into geographical regions, sub-regions and, **in some cases**, intermediate regions to facilitate global analysis. While all countries are assigned a sub-region, not all are assigned an intermediate region, which explains the missing values in Asia, Europe, and Oceania. In this dataset only two sub-regions are divided further: Sub-Saharan Africa (Eastern, Middle, Southern, and Western Africa) and Latin America and the Caribbean (Caribbean, Central America, and South America). Northern Africa and Northern America have no intermediate region, which accounts for the 378 and 252 missing values in Africa and the Americas.


To handle missing values in the `intermediate_region` column, we have two options:
1. Impute using the corresponding `sub_region` value to preserve geographical meaning.
2. Label missing values as "Unknown".


For this analysis, I will pick option 1 because it is the more appropriate and informative choice.

In [14]:
# Only `intermediate_region` has nulls; fill it column-wise from `sub_region`
df["intermediate_region"] = df["intermediate_region"].fillna(df["sub_region"])
df.isnull().sum().sum()

np.int64(0)
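
The two imputation options discussed above can be compared side by side on a small invented frame. This is only a sketch: `toy`, `option_1`, and `option_2` are hypothetical names, and the rows are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `df` with two missing intermediate regions.
toy = pd.DataFrame({
    "sub_region": ["WESTERN EUROPE", "SUB-SAHARAN AFRICA", "EASTERN ASIA"],
    "intermediate_region": [None, "EASTERN AFRICA", None],
})

# Option 1: fall back to the row's own sub-region (values are aligned by index).
option_1 = toy["intermediate_region"].fillna(toy["sub_region"])

# Option 2: replace every missing value with the same constant label.
option_2 = toy["intermediate_region"].fillna("Unknown")

print(option_1.tolist())
print(option_2.tolist())
```

Option 1 keeps each row distinguishable by geography, whereas option 2 merges all rows without an intermediate region into a single "Unknown" group, which would pool Asia, Europe, and Oceania together in any later grouping by `intermediate_region`.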