### Health and Nutrition Worldwide

This dataset was downloaded from Kaggle on 7 Feb 23. The analysis has the following objectives:
- Understand how region and income group influence the health and nutrition indicators
- Understand how the health and nutrition indicators relate to one another
- Understand whether we can group the countries based on the indicators provided
- Understand which zones have improving or declining indicators

In [None]:
# !kaggle datasets download -d sivamsinghsh/health-nutrition-and-population-statistics

In [None]:
# import zipfile

# with zipfile.ZipFile('health-nutrition-and-population-statistics.zip', 'r') as zip_ref:
#     zip_ref.extractall()

Inspecting the Excel file, we can see that there are three sheets we need to work with:
- _Data_ : The whole dataset with the indicator values
- _Country_ : We will use the columns _Region_ and _Income Group_ from this sheet
- _Series_ : We will use the column _Topic_ from this sheet

In [None]:
import pandas as pd

In [7]:
# Read the three sheets in one call; a list for sheet_name returns a dict of DataFrames
sheets = pd.read_excel('HNP_StatsEXCEL.xlsx', sheet_name=['Data', 'Country', 'Series'])
indicators_dataset = sheets['Data']
country_dataset = sheets['Country']
series_dataset = sheets['Series']

With the datasets imported, let's keep only the columns relevant to our analysis.

In [9]:
country_dataset = country_dataset[['Country Code','Region','Income Group']]
series_dataset = series_dataset[['Series Code','Topic']]
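
One way these trimmed lookup tables might later be attached to the indicator rows is a left merge on the code columns. The sketch below uses tiny hand-made frames whose values are placeholders, not taken from the dataset. It assumes that `Country Code` links the _Data_ and _Country_ sheets and that `Indicator Code` in _Data_ corresponds to `Series Code` in _Series_; both assumptions should be checked against the real file.

```python
import pandas as pd

# Toy stand-ins for the three sheets; all values are made-up placeholders.
indicators = pd.DataFrame({
    'Country Code': ['AAA', 'AAA', 'BBB'],
    'Indicator Code': ['IND.1', 'IND.2', 'IND.1'],
    '1960': [1.0, None, 3.0],
})
countries = pd.DataFrame({
    'Country Code': ['AAA', 'BBB'],
    'Region': ['Region A', 'Region B'],
    'Income Group': ['Low income', 'High income'],
})
series = pd.DataFrame({
    'Series Code': ['IND.1', 'IND.2'],
    'Topic': ['Topic X', 'Topic Y'],
})

# Left merges keep every indicator row even when a code has no match.
# validate='many_to_one' raises if a lookup table contains duplicate keys.
enriched = (
    indicators
    .merge(countries, on='Country Code', how='left', validate='many_to_one')
    .merge(series, left_on='Indicator Code', right_on='Series Code',
           how='left', validate='many_to_one')
    .drop(columns='Series Code')
)
print(enriched)
```

Rows whose code has no match in a lookup table would simply get missing values in the added columns, so no indicator row is lost.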

Let's take a look at the indicators dataset now.

In [10]:
indicators_dataset.head(2)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Africa Eastern and Southern,AFE,"Adolescent fertility rate (births per 1,000 wo...",SP.ADO.TFRT,141.457567,141.603817,141.796749,141.651778,141.595374,141.593273,...,105.321998,103.629032,101.905042,100.133826,98.367869,96.574004,95.011793,93.43222,91.845198,
1,Africa Eastern and Southern,AFE,Adults (ages 15+) and children (0-14 years) li...,SH.HIV.TOTL,,,,,,,...,,,,,,,,,,


Right away we can see that the dataset has one column per year (a wide layout), which is not well suited to our analysis. Let's reshape the dataframe into a long layout, with a single column for the year and a single column for the value.

Before reshaping, as good practice, let's rename the columns to lowercase, with words separated by underscores.

In [13]:
column_mapper = {
    'Country Name':'country_name',
    'Country Code':'country_code',
    'Indicator Name':'indicator_name',
    'Indicator Code':'indicator_code'
}

indicators_dataset = indicators_dataset.rename(columns=column_mapper)
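
The explicit mapping above is easy to read, but the same convention could also be applied with a small function, which may be handy if more columns need renaming later. This is a minimal sketch: `to_snake_case` is a hypothetical helper, not part of pandas, and the toy frame only mimics the wide layout. Note that it converts every label to a string, so integer year labels would become strings.

```python
import pandas as pd

def to_snake_case(label):
    """Lowercase a column label and replace spaces with underscores."""
    # str() guards against non-string labels such as integer year columns.
    return str(label).strip().lower().replace(' ', '_')

# Empty toy frame that mimics the column layout of the wide dataset.
wide = pd.DataFrame(columns=['Country Name', 'Country Code',
                             'Indicator Name', 'Indicator Code', '1960', '1961'])

# DataFrame.rename accepts a callable that is applied to each column label.
renamed = wide.rename(columns=to_snake_case)
print(list(renamed.columns))
```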

In [14]:
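# Unpivot the year columns into one (year, value) row each; the value column is named 'value' by default
indicators_dataset = indicators_dataset.melt(id_vars=['country_name', 'country_code', 'indicator_name', 'indicator_code'], var_name='year')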

Unnamed: 0,country_name,country_code,indicator_name,indicator_code,year,value
0,Africa Eastern and Southern,AFE,"Adolescent fertility rate (births per 1,000 wo...",SP.ADO.TFRT,1960,141.457567
1,Africa Eastern and Southern,AFE,Adults (ages 15+) and children (0-14 years) li...,SH.HIV.TOTL,1960,
2,Africa Eastern and Southern,AFE,Adults (ages 15+) and children (ages 0-14) new...,SH.HIV.INCD.TL,1960,
3,Africa Eastern and Southern,AFE,Adults (ages 15+) living with HIV,SH.DYN.AIDS,1960,
4,Africa Eastern and Southern,AFE,Adults (ages 15-49) newly infected with HIV,SH.HIV.INCD,1960,
...,...,...,...,...,...,...
7305951,Zimbabwe,ZWE,Wanted fertility rate (births per woman),SP.DYN.WFRT,2021,
7305952,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,2021,
7305953,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,2021,
7305954,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,2021,60.700000
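
To make the reshape concrete, here is a small sketch on a toy frame with placeholder values. It checks that `melt` produces one row per original row and year column, then shows an optional cleanup that is not part of the steps above: casting `year` to an integer and dropping rows without a value. Whether dropping missing values is appropriate depends on the later analysis, so treat that step as an assumption.

```python
import pandas as pd

# Toy wide frame: 2 rows x 3 year columns (placeholder values).
wide = pd.DataFrame({
    'country_code': ['AAA', 'BBB'],
    'indicator_code': ['IND.1', 'IND.1'],
    '1960': [1.0, None],
    '1961': [1.5, 4.0],
    '1962': [None, 4.5],
})
id_cols = ['country_code', 'indicator_code']

long_df = wide.melt(id_vars=id_cols, var_name='year', value_name='value')

# melt yields one row per (original row, year column) pair.
assert len(long_df) == len(wide) * (wide.shape[1] - len(id_cols))

# Optional cleanup: integer years and no missing observations.
tidy = (
    long_df
    .assign(year=lambda df: df['year'].astype(int))
    .dropna(subset=['value'])
    .reset_index(drop=True)
)
print(tidy)
```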
