# Project:  Human Development Index (HDI) Indicators from Gapminder

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a>
    <ul>
        <li><a href="#common_countries_pandas">Common Countries (Pandas)</a></li>
        <li><a href="#common_countries_python">Common Countries (Python)</a></li>
    </ul>
</li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

The [Human Development Index](https://en.wikipedia.org/wiki/Human_Development_Index) (HDI) summarizes the standard of living in a country based on three indicators:

- Life expectancy at birth (LEI),
- Gross national income per capita (GNI),
- Mean years of education (MEI and WEI, for men and women respectively).

The Gapminder database provides datasets for each of these indicators; however, education is split into two datasets by gender rather than given as the single composite indicator used in calculating the HDI. The HDI ranges between 0 and 1, and it rises as each indicator rises. Accordingly, exploratory data analysis should reveal fairly strong positive correlations between the indicators and the HDI. It would be interesting to compare how strongly each indicator is related to the HDI.
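
As a rough illustration of the kind of check this implies, the sketch below computes pairwise Pearson correlations with pandas. The five rows are made-up placeholder values, not Gapminder data, and the column names simply follow the abbreviations used in this notebook.

```python
import pandas as pd

# Made-up placeholder values (NOT Gapminder data), one row per hypothetical country.
toy = pd.DataFrame(
    {
        "HDI": [0.35, 0.48, 0.66, 0.79, 0.90],
        "LEI": [55.0, 61.0, 70.0, 76.0, 81.0],
        "GNI": [1200.0, 3500.0, 9000.0, 21000.0, 42000.0],
    },
    index=["A", "B", "C", "D", "E"],
)

# Pairwise Pearson correlation between all columns.
corr = toy.corr()

# Correlation of each indicator with the HDI, strongest first.
print(corr["HDI"].drop("HDI").sort_values(ascending=False))
```

Reading one column of the correlation matrix like this gives a quick ranking of the indicators, although a correlation on its own says nothing about causation.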

In [1]:
import pandas as pd
import csv
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

The datasets from Gapminder, converted from Excel to CSV format, are:

- HDI: Human Development Index
- GNI: GNI per capita, PPP (current international $)
- MEI: Mean years in school (men 25 years and older)
- WEI: Mean years in school (women 25 years and older)
- LEI: Life expectancy at birth

Each dataset has one row per country and one column per year. Which countries and years are included varies among the datasets, so an immediate task is to settle on a common set of countries and a single year for analysis.
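
A minimal sketch of that standardization step: each file labels its country column differently (for example `HDI` or `Life expectancy with projections`), so one option is to rename the first column to a shared name and keep a single year. The inline CSV text is a small stand-in shaped like `HDI.csv`, and the helper `country_year` is hypothetical.

```python
import io
import pandas as pd

# Small inline stand-in shaped like HDI.csv (country column first, then year columns).
csv_text = """HDI,2009,2011
Afghanistan,0.387,0.398
Akrotiri and Dhekelia,,
Albania,0.734,0.739
"""

def country_year(csv_source, year):
    """Return a Series of values for one year, indexed by country (hypothetical helper)."""
    df = pd.read_csv(csv_source)
    # The first column holds country names under a file-specific label.
    df = df.rename(columns={df.columns[0]: "Country"}).set_index("Country")
    # Keep the requested year and drop countries with no value for it.
    return df[str(year)].dropna()

hdi_2009 = country_year(io.StringIO(csv_text), 2009)
print(hdi_2009)
```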



In [2]:
def print_head(filename):
    """
    Print the indicator name for filename and return (indicator, dataframe).
    The caller prints the head of the returned dataframe.
    """
    indicator = filename.split('.')[1].split('/')[1]
    print('\nIndicator: {}'.format(indicator))
    df = pd.read_csv(filename)
    return (indicator,df)

data_files = ['./HDI.csv',  ## Human Development Index
              './LEI.csv',  ## Life Expectancy at birth
              './MEI.csv',  ## Mean Years in School (Men)
              './WEI.csv',  ## Mean Years in School (Women)
              './GNI.csv']  ## Gross National Income

example_indicators = {}
for data_file in data_files:
    indicator, df = print_head(data_file)
    example_indicators[indicator] = df
    print(df.head())


Indicator: HDI
                     HDI   1980   1990   2000   2005   2006   2007   2008  \
0            Afghanistan  0.198  0.246  0.230  0.340  0.354  0.363  0.370   
1  Akrotiri and Dhekelia    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2                Albania    NaN  0.656  0.691  0.721  0.724  0.729  0.733   
3                Algeria  0.454  0.551  0.624  0.667  0.673  0.680  0.686   
4         American Samoa    NaN    NaN    NaN    NaN    NaN    NaN    NaN   

    2009   2011  
0  0.387  0.398  
1    NaN    NaN  
2  0.734  0.739  
3  0.691  0.698  
4    NaN    NaN  

Indicator: LEI
  Life expectancy with projections  1765  1766  1767  1768  1769  1770  1771  \
0                      Afghanistan   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1                          Albania   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2                          Algeria   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3                   American Samoa   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
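
The cell above derives the indicator name by splitting the path string on `.` and `/`, which works for paths of the form `./HDI.csv`. As a possible alternative (not what this notebook uses), `pathlib.Path.stem` gives the same name and also copes with other directory layouts. The helper name `indicator_name` and the second path are hypothetical.

```python
from pathlib import Path

def indicator_name(filename):
    """Return the file name without directory or extension (hypothetical helper)."""
    return Path(filename).stem

print(indicator_name('./HDI.csv'))
print(indicator_name('data/raw/LEI.csv'))  # hypothetical path, for illustration only
```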

In [3]:
def print_first_line(filename):
    """
    Print Header and first row of filename.
    """
    
    indicator = filename.split('.')[1].split('/')[1]
    print('\nIndicator: {}'.format(indicator))
    with open(filename,'r') as f_in:
        file_reader = csv.DictReader(f_in)
        
        first_row = next(file_reader)
        
        pprint(first_row)
        
    return (indicator,first_row)

data_files = ['./HDI.csv',  ## Human Development Index
              './LEI.csv',  ## Life Expectancy at birth
              './MEI.csv',  ## Mean Years in School (Men)
              './WEI.csv',  ## Mean Years in School (Women)
              './GNI.csv']  ## Gross National Income

example_indicators = {}
for data_file in data_files:
    indicator, first_line = print_first_line(data_file)
    example_indicators[indicator] = first_line


Indicator: HDI
OrderedDict([('HDI', 'Afghanistan'),
             ('1980', '0.198'),
             ('1990', '0.246'),
             ('2000', '0.23'),
             ('2005', '0.34'),
             ('2006', '0.354'),
             ('2007', '0.363'),
             ('2008', '0.37'),
             ('2009', '0.387'),
             ('2011', '0.398')])

Indicator: LEI
OrderedDict([('Life expectancy with projections', 'Afghanistan'),
             ('1765', ''),
             ('1766', ''),
             ('1767', ''),
             ('1768', ''),
             ('1769', ''),
             ('1770', ''),
             ('1771', ''),
             ('1772', ''),
             ('1773', ''),
             ('1774', ''),
             ('1775', ''),
             ('1776', ''),
             ('1777', ''),
             ('1778', ''),
             ('1779', ''),
             ('1780', ''),
             ('1781', ''),
             ('1782', ''),
             ('1783', ''),
             ('1784', ''),
             ('1785', ''),
            

Initial exploration of the data revealed that most of the countries in the datasets had data, although the years for which that data was available varied quite a bit.

We use data from the year 2009 because that is the latest given in the "Mean years in school" datasets.  It is also one of the most populated among the datasets, providing more data to work with.  We also print out the number of countries in each dataset for informational purposes because we will want a common set of countries for all data.
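
One way to check how well populated each year is, sketched here with a small inline table rather than the real files: `DataFrame.notna().sum()` counts the countries that have a value in each year column. The country names and values are placeholders.

```python
import io
import pandas as pd

# Placeholder table: one row per country, one column per year, blanks for missing values.
csv_text = """Country,2005,2009,2011
A,1.0,2.0,
B,,3.0,
C,4.0,5.0,6.0
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="Country")

# Number of countries with a non-missing value in each year column.
counts = df.notna().sum()
print(counts)

# Label of the year column with the most data.
best_year = counts.idxmax()
print(best_year)
```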

<a id='common_countries_pandas'></a>
### Common Countries (Pandas)
The goal here is a dataframe whose rows have the format "Country Indicator1 Indicator2 Indicator3 ...", restricted to a common set of countries so that every cell is filled with the appropriate indicator data. The open question is how to go from the separate dataframes for each indicator (each of which has a different set of countries) to one concatenated dataframe containing only the countries common to all of them.

In [14]:
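def indicator_data_file(indicator_file):
    """
    Return the country-name column and the 2009 column of indicator_file,
    dropping countries that have no 2009 value.
    """
    indicator = indicator_file.split('.')[1].split('/')[1]
    print('\nIndicator: {}'.format(indicator))
    df = pd.read_csv(indicator_file)
    ## The regex matches each file's country-name column header plus the 2009 column
    return df.filter(regex='HDI|Row Labels|Life expectancy|GNI per capita|2009').dropna()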

data = {}
countries = {}
data['HDI'] = indicator_data_file('./HDI.csv')
data['HDI'] = data['HDI'].rename(index=str, columns={"HDI": "Country"})  ## rename() returns a new dataframe, so assign it back
countries['HDI'] = data['HDI'].iloc[:,[0]]
data['LEI'] = indicator_data_file('./LEI.csv')
countries['LEI'] = data['LEI'].iloc[:,[0]]
data['MEI'] = indicator_data_file('./MEI.csv')
countries['MEI'] = data['MEI'].iloc[:,[0]]
data['WEI'] = indicator_data_file('./WEI.csv')
countries['WEI'] = data['WEI'].iloc[:,[0]]
data['GNI'] = indicator_data_file('./GNI.csv')
countries['GNI'] = data['GNI'].iloc[:,[0]]

data['HDI']

#common_countries = countries['LEI'] & countries['HDI'] & countries['MEI'] & countries['GNI']
#print("Number of common countries: {}".format(len(common_countries)))
#pd.concat([data['HDI'],data['LEI']], axis=1)


Indicator: HDI

Indicator: LEI

Indicator: MEI

Indicator: WEI

Indicator: GNI


        Country   2009
0   Afghanistan  0.387
2       Albania  0.734
3       Algeria  0.691
6        Angola  0.481
9     Argentina  0.788
10      Armenia  0.712
12    Australia  0.926
13      Austria  0.879
15      Bahamas  0.769
16      Bahrain  0.805
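
One possible way to get from per-indicator data to a single frame of common countries is an inner join on the country index. The sketch below uses `pd.concat` with `join="inner"` on three small hand-made Series. The country names and values are placeholders, and this is an illustration rather than the approach the notebook ends up using.

```python
import pandas as pd

# Placeholder per-indicator data for one year, indexed by country (not real values).
hdi = pd.Series({"A": 0.40, "B": 0.70, "C": 0.90}, name="HDI")
lei = pd.Series({"A": 58.0, "B": 72.0, "D": 80.0}, name="LEI")
gni = pd.Series({"A": 1500.0, "B": 9000.0, "C": 40000.0}, name="GNI")

# join="inner" keeps only the index labels (countries) present in every Series.
combined = pd.concat([hdi, lei, gni], axis=1, join="inner")
combined.index.name = "Country"
print(combined)
```

Because the join happens on the index, each indicator would first need its country column set as the index; countries missing from any one indicator drop out, so no cell is left empty.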


In [6]:
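def indicator_file_data(indicator_file):
    """
    Return a dictionary mapping country name to its 2009 value (as a string)
    for the countries in indicator_file that have a 2009 value.
    """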
    indicator = indicator_file.split('.')[1].split('/')[1]
    print('\nIndicator: {}'.format(indicator))
    with open(indicator_file,'r') as f_in:
        file_reader = csv.DictReader(f_in)
  
        dct = {}
        cnt = 0
        for row in file_reader:
            if row['2009'] != '':  ## Skip countries (keys) that do not have values for 2009
                if indicator == 'GNI':
                    key = row["GNI per capita, PPP (current international $)"]
                elif indicator == 'LEI':
                    key = row['Life expectancy with projections']
                elif indicator == 'HDI':
                    key = row['HDI']
                else:  ## MEI & WEI datasets have same set of countries
                    key = row['Row Labels']
                dct[key] = row['2009']
                cnt += 1  ## Number of countries in dataset

        print('Cnt: {}'.format(cnt))
        return dct

<a id='common_countries_python'></a>
### Common Countries (Python)
We can use **set intersection** to find the countries that all the datasets have in common, so that we'll have a final dataset with all fields filled in.

With plain Python, I can find the set of countries common to all the indicator datasets, use it to build dictionaries that share the same countries but hold different data points, and then, finally, write them out to a single summary file in the format (Country Indicator1 Indicator2 Indicator3 ...) described in <a href="#common_countries_pandas">Common Countries (Pandas)</a> above.

In [4]:
data = {}
countries = {}
data['HDI'] = indicator_file_data('./HDI.csv')
countries['HDI'] = set(data['HDI'].keys())
data['LEI'] = indicator_file_data('./LEI.csv')
countries['LEI'] = set(data['LEI'].keys())
data['MEI'] = indicator_file_data('./MEI.csv')
countries['MEI'] = set(data['MEI'].keys())  
data['WEI'] = indicator_file_data('./WEI.csv')
countries['WEI'] = set(data['WEI'].keys())  
data['GNI'] = indicator_file_data('./GNI.csv')
countries['GNI'] = set(data['GNI'].keys())

common_countries = countries['LEI'] & countries['HDI'] & countries['MEI'] & countries['WEI'] & countries['GNI']
print("Number of common countries: {}".format(len(common_countries)))


Indicator: HDI
Cnt: 174

Indicator: LEI
Cnt: 202

Indicator: MEI
Cnt: 175

Indicator: WEI
Cnt: 175

Indicator: GNI
Cnt: 182
Number of common countries: 155
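
The chained `&` can also be written so that it covers every dataset without listing each one by name. A small sketch with made-up country sets follows; the keys mirror the indicator names used here, while `toy_countries` and its members are placeholders.

```python
# Placeholder country sets keyed by indicator (not the real data).
toy_countries = {
    "HDI": {"A", "B", "C"},
    "LEI": {"A", "B", "C", "D"},
    "MEI": {"A", "B"},
    "WEI": {"A", "B"},
    "GNI": {"A", "B", "D"},
}

# set.intersection accepts any number of sets, so unpack them all.
toy_common = set.intersection(*toy_countries.values())
print(sorted(toy_common))

# Countries that each dataset loses when restricted to the common set.
dropped = {name: sorted(s - toy_common) for name, s in toy_countries.items()}
print(dropped)
```

Listing what each dataset loses is a cheap way to spot countries that are excluded only because of a spelling difference between files.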


We use a **dictionary comprehension** to pare down the datasets to data for the countries they share in common.

In [5]:
def get_common_country_data(dataset,common):
    """
    Return a dictionary for common countries.
    """
    return { key: dataset[key] for key in common }
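
A small sketch of the same comprehension on placeholder data, with the string values converted to floats along the way. The conversion is a variation for illustration, not something `get_common_country_data` does, and the names `toy_dataset`, `toy_common` and `subset` are hypothetical.

```python
# Placeholder data in the same shape as the dictionaries above: country -> value as a string.
toy_dataset = {"A": "0.387", "B": "0.734", "C": "0.691"}
toy_common = {"A", "C"}

# Keep only the common countries; float() converts the CSV strings to numbers.
subset = {key: float(toy_dataset[key]) for key in sorted(toy_common)}
print(subset)
```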

In [6]:
new_data = {}
new_data['HDI'] = get_common_country_data(data['HDI'],common_countries)
new_data['LEI'] = get_common_country_data(data['LEI'],common_countries)
new_data['MEI'] = get_common_country_data(data['MEI'],common_countries)
new_data['WEI'] = get_common_country_data(data['WEI'],common_countries)
new_data['GNI'] = get_common_country_data(data['GNI'],common_countries)
new_data['HDI']

{'Afghanistan': '0.387',
 'Albania': '0.734',
 'Algeria': '0.691',
 'Angola': '0.481',
 'Argentina': '0.788',
 'Armenia': '0.712',
 'Australia': '0.926',
 'Austria': '0.879',
 'Bahamas': '0.769',
 'Bahrain': '0.805',
 'Bangladesh': '0.491',
 'Belarus': '0.746',
 'Belgium': '0.883',
 'Belize': '0.696',
 'Benin': '0.422',
 'Bolivia': '0.656',
 'Bosnia and Herzegovina': '0.73',
 'Botswana': '0.626',
 'Brazil': '0.708',
 'Bulgaria': '0.766',
 'Burkina Faso': '0.326',
 'Burundi': '0.308',
 'Cambodia': '0.513',
 'Cameroon': '0.475',
 'Canada': '0.903',
 'Cape Verde': '0.564',
 'Chad': '0.323',
 'Chile': '0.798',
 'China': '0.674',
 'Colombia': '0.702',
 'Comoros': '0.43',
 'Congo, Dem. Rep.': '0.277',
 'Congo, Rep.': '0.523',
 'Costa Rica': '0.738',
 "Cote d'Ivoire": '0.397',
 'Croatia': '0.793',
 'Cyprus': '0.837',
 'Denmark': '0.891',
 'Djibouti': '0.425',
 'Ecuador': '0.716',
 'Egypt': '0.638',
 'El Salvador': '0.669',
 'Equatorial Guinea': '0.534',
 'Estonia': '0.828',
 'Ethiopia': '0.35

At this point, we write the data from the indicator datasets to a single file `HDI_indicators.csv` with the indicators as fields.

In [10]:
def collate_data(hdi_data, lei_data, mei_data, wei_data, gni_data):
    """
    Collate the indicator datasets into the output file HDI_indicators.csv
    """
    with open('./HDI_indicators.csv', 'w', newline='') as f_out:  ## newline='' is recommended for files written with the csv module
        out_colnames = ['Country','HDI','LEI','CEI','GNI']
        data_writer = csv.DictWriter(f_out, fieldnames = out_colnames)
        data_writer.writeheader()
        
        ## Merge the data dictionaries on a common key
        dicts = [hdi_data, lei_data, mei_data, wei_data, gni_data]
        
        ## Then, write the list values as fields in data_writer
        dictlist = { k: [d[k] for d in dicts] for k in dicts[0] }
        for key, value in dictlist.items():
            point = {}
            point['Country'] = key
            point['HDI'] = value[0]
            point['LEI'] = value[1]
            point['CEI'] = (float(value[2])+float(value[3]))/2.0  ## Combined education indicator: average of MEI and WEI
            point['GNI'] = value[4]
            data_writer.writerow(point)

In [11]:
## Create collated data file for analysis
collate_data(new_data['HDI'],new_data['LEI'],new_data['MEI'],new_data['WEI'],new_data['GNI'])
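
For comparison, the same collation could be sketched with pandas: a dictionary of per-indicator dictionaries becomes a DataFrame with one row per country, and the combined education column (`CEI`) is the mean of `MEI` and `WEI`. The `toy_data` values are placeholders standing in for `new_data`, and the output goes to an in-memory buffer rather than `HDI_indicators.csv`.

```python
import io
import pandas as pd

# Placeholder stand-in for new_data: indicator -> {country: value as a string}.
toy_data = {
    "HDI": {"A": "0.40", "B": "0.70"},
    "LEI": {"A": "58.0", "B": "72.0"},
    "MEI": {"A": "4.0", "B": "9.0"},
    "WEI": {"A": "2.0", "B": "8.0"},
    "GNI": {"A": "1500", "B": "9000"},
}

# Outer keys become columns, inner keys (countries) become the index.
df = pd.DataFrame(toy_data).astype(float)
df.index.name = "Country"

# Average the two education indicators into a single column.
df["CEI"] = df[["MEI", "WEI"]].mean(axis=1)

# Write the same columns as collate_data, here to a string buffer.
out = io.StringIO()
df[["HDI", "LEI", "CEI", "GNI"]].to_csv(out)
print(out.getvalue())
```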



<a id='eda'></a>
## Exploratory Data Analysis



<a id='conclusions'></a>
## Conclusions
