# Life Expectancy Analysis

- Data source: https://www.kaggle.com/datasets/fredericksalazar/life-expectancy-1960-to-present-global/code

Column definitions:
- `country_code`: Unique identifier for each country.
- `country_name`: Official name of the country.
- `region`: Broad geographical area (e.g., Asia, Europe, Africa).
- `sub-region`: More specific regional classification within the broader region.
- `intermediate-region`: Additional granular geographical grouping, when applicable.
- `year`: The year to which the data pertains.
- `life_expectancy_women`: Average number of years a woman is expected to live in that country and year.
- `life_expectancy_men`: Average number of years a man is expected to live in that country and year.

## Libraries

In [1]:
%run 0.0-data_projects-setup.ipynb
%run pandas-missing-extension.ipynb

In [2]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Cleaning
import janitor

# Missing Values Analysis
import missingno as msno

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## Download and Load Data

In [3]:
file_zip_path = path.data_raw_dir("life-expectancy.zip")
url = "https://www.kaggle.com/api/v1/datasets/download/fredericksalazar/life-expectancy-1960-to-present-global"

In [4]:
!curl -L -o {file_zip_path} {url}
!unzip -o {file_zip_path} -d {path.data_raw_dir()} && rm {file_zip_path}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  108k  100  108k    0     0   250k      0 --:--:-- --:--:-- --:--:-- 3450k
Archive:  /home/pahoalapizco/ds-projects/data_projects/data/raw/life-expectancy.zip
  inflating: /home/pahoalapizco/ds-projects/data_projects/data/raw/life_expectancy_dataset.csv  


In [5]:
file_path = path.data_raw_dir("life_expectancy_dataset.csv") 
df = pd.read_csv(file_path, delimiter=";")
df.head()

|   | country_code | country_name | region | sub-region | intermediate-region | year | life_expectancy_women | life_expectancy_men |
|---|---|---|---|---|---|---|---|---|
| 0 | AFG | AFGANISTÁN | ASIA | SOUTHERN ASIA | NaN | 1960 | 3328 | 3187 |
| 1 | AFG | AFGANISTÁN | ASIA | SOUTHERN ASIA | NaN | 1961 | 3381 | 3241 |
| 2 | AFG | AFGANISTÁN | ASIA | SOUTHERN ASIA | NaN | 1962 | 3430 | 3288 |
| 3 | AFG | AFGANISTÁN | ASIA | SOUTHERN ASIA | NaN | 1963 | 3477 | 3335 |
| 4 | AFG | AFGANISTÁN | ASIA | SOUTHERN ASIA | NaN | 1964 | 3525 | 3383 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13545 entries, 0 to 13544
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   country_code           13545 non-null  object
 1   country_name           13545 non-null  object
 2   region                 13545 non-null  object
 3   sub-region             13545 non-null  object
 4   intermediate-region    5670 non-null   object
 5   year                   13545 non-null  int64 
 6   life_expectancy_women  13545 non-null  object
 7   life_expectancy_men    13545 non-null  object
dtypes: int64(1), object(7)
memory usage: 846.7+ KB
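
The `info()` output reports `life_expectancy_women` and `life_expectancy_men` as `object`, which means pandas read them as text rather than numbers. One possible cause, which is an assumption here and should be checked against the raw file, is a decimal comma (for example `33,28`). The sketch below shows one way to convert such strings; the helper name `to_numeric_decimal_comma` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def to_numeric_decimal_comma(series: pd.Series) -> pd.Series:
    """Convert strings such as "33,28" to floats; unparseable values become NaN."""
    # Assumption: a comma is the decimal separator in the raw text.
    cleaned = series.astype(str).str.strip().str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample values, used only to illustrate the conversion.
sample = pd.DataFrame({
    "life_expectancy_women": ["33,28", "33,81", "n/a"],
    "life_expectancy_men": ["31,87", "32,41", "32,88"],
})

converted = sample.apply(to_numeric_decimal_comma)
print(converted.dtypes)
```

If the decimal-comma assumption holds, passing `decimal=","` to `pd.read_csv` would be an alternative that parses the columns as floats at load time.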


In [7]:
df.isnull().sum()

country_code                0
country_name                0
region                      0
sub-region                  0
intermediate-region      7875
year                        0
life_expectancy_women       0
life_expectancy_men         0
dtype: int64
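
Only `intermediate-region` has missing values: 7875 of 13545 rows, roughly 58%. Counts are easier to compare across columns when shown alongside percentages, so the following is a minimal sketch in plain pandas. The helper name `missing_summary` and the sample frame are hypothetical and do not come from the notebook or its extension files.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean().mul(100).round(2),
    })
    # Show the columns with the most missing values first.
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical sample frame, used only to illustrate the output shape.
sample = pd.DataFrame({
    "region": ["ASIA", "EUROPE", "AFRICA", "ASIA"],
    "intermediate-region": [None, None, "EASTERN AFRICA", None],
})

summary = missing_summary(sample)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.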