### Health and Nutrition Worldwide

This dataset was downloaded from Kaggle on 7 Feb 2023. The analysis has the following objectives:
- Understand how region and income group influence the health and nutrition indicators
- Understand how the health and nutrition indicators relate to one another
- Understand whether the countries can be grouped based on the indicators provided
- Understand which zones have improved or declined in their indicators

In [None]:
# !kaggle datasets download -d sivamsinghsh/health-nutrition-and-population-statistics

In [None]:
# import zipfile

# with zipfile.ZipFile('health-nutrition-and-population-statistics.zip', 'r') as zip_ref:
#     zip_ref.extractall()

Inspecting the Excel file, we can see that there are three sheets we need to work with:
- _Data_ : The whole dataset with the indicator values
- _Country_ : We will use the columns _Region_ and _Income Group_ from this dataset
- _Series_ : We will use the column _Topic_ for our analysis

In [None]:
import pandas as pd

In [None]:
indicators_dataset = pd.read_excel('HNP_StatsEXCEL.xlsx', sheet_name='Data')
country_dataset = pd.read_excel('HNP_StatsEXCEL.xlsx', sheet_name='Country')
series_dataset = pd.read_excel('HNP_StatsEXCEL.xlsx', sheet_name='Series')

With the datasets imported, let's keep only the columns relevant to our analysis.

In [None]:
country_dataset = country_dataset[['Country Code','Region','Income Group']]
series_dataset = series_dataset[['Series Code','Topic']]

Let's take a look at our datasets now

In [None]:
indicators_dataset.head(2)

In [None]:
country_dataset.head(2)

In [None]:
series_dataset.head(10)

Right away we can see that the dataset has one column per year, which is not well suited to our analysis. Let's reshape the dataframe into long format, with a single column for the year and another for the value. We will also create a second dataset containing only the most recent year, so we can assess the current state of the indicators.
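
To make the reshape concrete, here is a minimal sketch on a toy table. The frame `wide` and its values are made up for illustration and are not taken from the dataset; only the `melt` call mirrors the step described above.

```python
import pandas as pd

# Toy wide table: one column per year (values are invented for illustration)
wide = pd.DataFrame({
    "country_name": ["A", "B"],
    "indicator_name": ["x", "x"],
    "2020": [1.0, 3.0],
    "2021": [2.0, None],
})

# melt keeps the id columns and stacks every other column into (year, value) pairs
long = wide.melt(
    id_vars=["country_name", "indicator_name"],
    var_name="year",
    value_name="value",
)

# Keep only the rows for the most recent year label
latest = long.loc[long["year"] == long["year"].max()]
print(long)
```

Each country/indicator pair now appears once per year, which makes filtering by year and grouping much simpler than in the wide layout.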

Also, as a good practice, let's rename the columns to lowercase with words separated by underscores.

In [None]:
column_mapper = {
    'Country Name':'country_name',
    'Country Code':'country_code',
    'Indicator Name':'indicator_name',
    'Indicator Code':'series_code',
    'Region':'region',
    'Income Group':'income_group',
    'Series Code':'series_code',
    'Topic':'topic'
}

indicators_dataset = indicators_dataset.rename(columns=column_mapper)
country_dataset = country_dataset.rename(columns=column_mapper)
series_dataset = series_dataset.rename(columns=column_mapper)
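
As an alternative to a hand-written mapper, the same naming convention could be applied with a small function. This is only a sketch: `to_snake_case` is a hypothetical helper, and the empty `example` frame stands in for the real sheets. Note that it would produce `indicator_code`, so the join key still needs one explicit rename.

```python
import pandas as pd

def to_snake_case(name) -> str:
    """Lowercase a column label and replace spaces with underscores."""
    return str(name).strip().lower().replace(" ", "_")

# Stand-in frame with a few of the original column labels
example = pd.DataFrame(columns=["Country Name", "Indicator Code", "Income Group"])

# rename accepts a callable that is applied to every column label
example = example.rename(columns=to_snake_case)

# The indicator code is renamed explicitly so it matches the key of the series sheet
example = example.rename(columns={"indicator_code": "series_code"})
print(list(example.columns))
```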

In [None]:
indicators_dataset = indicators_dataset.melt(id_vars=['country_name','country_code','indicator_name','series_code'], var_name='year' )

In [None]:
indicators_dataset_2021 = indicators_dataset.loc[indicators_dataset['year']=='2021'].copy()
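
The filter above assumes the year labels are strings. If the Excel header were parsed as integers, comparing against `'2021'` would return an empty frame without raising an error. A hedged sketch of a type-independent version, using a made-up `long` frame with deliberately mixed labels:

```python
import pandas as pd

long = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB"],   # hypothetical codes
    "year": ["2020", "2021", 2021],          # mixed string/int labels
    "value": [1.0, 2.0, 3.0],
})

# Normalise the labels to integers so comparisons do not depend on how the header was read
long["year"] = pd.to_numeric(long["year"], errors="coerce").astype("Int64")

# Select the latest year present instead of hard-coding it
latest_year = long["year"].max()
latest = long.loc[long["year"] == latest_year].copy()
print(latest_year, len(latest))
```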

Looking at _series_dataset_, we can also see that the _Topic_ column seems to hold a main category and a secondary category. It might be helpful to have these categories separated.

In [None]:
series_dataset[['main_topic','secondary_topic']] = series_dataset['topic'].str.split(':',expand=True)
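
One caveat with splitting on every colon: a topic containing more than one colon would yield more than two columns, and the assignment would fail. A slightly more defensive sketch limits the split to the first colon and trims the surrounding whitespace. The topic labels below are illustrative, apart from 'Reproductive health'.

```python
import pandas as pd

topics = pd.DataFrame({
    "topic": ["Reproductive health", "Main: Secondary", "A: B: C"]  # illustrative labels
})

# n=1 splits on the first colon only, so at most two columns are produced
parts = topics["topic"].str.split(":", n=1, expand=True)

# Strip the space that follows the colon; topics without a colon get a missing secondary topic
topics["main_topic"] = parts[0].str.strip()
topics["secondary_topic"] = parts[1].str.strip()
print(topics)
```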

Let's start our analysis! Since we have a large number of countries and indicators, we will start by analysing the data at the top-level categories (_Region_, _Income Group_, _Main Topic_) and drill down as we see fit. We will also start with 2021 only and later analyse how the indicators evolve over time.

In [None]:
dataset_2021 = indicators_dataset_2021\
                .merge(country_dataset, how='left', on='country_code', suffixes=['_d','_c'])\
                .merge(series_dataset, how='left', on='series_code', suffixes=['_d','_s'])
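
When chaining left merges like this, it can help to check that the lookup tables have unique keys and to see which rows found no match. The sketch below uses toy frames with hypothetical codes; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

indicators = pd.DataFrame({
    "country_code": ["AAA", "BBB", "ZZZ"],  # hypothetical codes
    "value": [1.0, 2.0, 3.0],
})
countries = pd.DataFrame({
    "country_code": ["AAA", "BBB"],
    "region": ["Region 1", "Region 2"],
})

merged = indicators.merge(
    countries,
    how="left",
    on="country_code",
    validate="many_to_one",  # raises if a country code is duplicated in the lookup table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Codes that appear in the indicators but not in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "country_code"].tolist()
print(unmatched)
```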

In [44]:
dataset_2021.drop(['country_code','series_code'], axis=1, inplace=True)

In [45]:
dataset_2021.head(3)

| | country_name | indicator_name | year | value | region | income_group | topic | main_topic | secondary_topic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa Eastern and Southern | Adolescent fertility rate (births per 1,000 wo... | 2021 | | | | Reproductive health | Reproductive health | |
| 1 | Africa Eastern and Southern | Adults (ages 15+) and children (0-14 years) li... | 2021 | | | | HIV/AIDS | HIV/AIDS | |
| 2 | Africa Eastern and Southern | Adults (ages 15+) and children (ages 0-14) new... | 2021 | | | | HIV/AIDS | HIV/AIDS | |


Looking at the resulting table, we can see missing values in the columns _value_, _region_ and _income_group_. We cannot fill in _value_, since those entries are missing because the data was not collected, but we can assign a 'No Information' category to _income_group_. We can also check whether the entries without a region are themselves regions or other aggregates rather than countries, like 'Africa Eastern and Southern'.
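
Before filling anything in, it may be useful to quantify how much is missing. A minimal sketch on a made-up frame (the rows below are invented and only mimic the column names of the merged table):

```python
import pandas as pd

df = pd.DataFrame({
    "country_name": ["Aggregate X", "Aggregate X", "Country Y"],  # made-up rows
    "value": [None, None, 1.5],
    "region": [None, None, "Region 1"],
    "income_group": [None, None, "High income"],
})

# Number of missing entries in each column
missing_per_column = df.isna().sum()

# Share of missing values for each entry in country_name
missing_value_by_name = df["value"].isna().groupby(df["country_name"]).mean()

print(missing_per_column)
print(missing_value_by_name)
```

Entries whose values are entirely missing, or that have no region, are the ones worth inspecting by name.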

In [46]:
dataset_2021.loc[dataset_2021['income_group'].isna(), 'income_group'] = 'No Information'

In [48]:
dataset_2021.loc[dataset_2021['region'].isna(), 'country_name'].unique()

array(['Africa Eastern and Southern', 'Africa Western and Central',
       'Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle in