# Project:  Human Development Index (HDI) Indicators from Gapminder

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

The [Human Development Index](https://en.wikipedia.org/wiki/Human_Development_Index) (HDI) summarizes the standard of living in a country using three indicators:

- Life expectancy at birth (LEI),
- Gross national income per capita (GNI),
- Mean years of education (MEI and WEI, for men and women respectively).

The Gapminder database provides a dataset for each of these indicators. Education, however, comes as two datasets split by gender rather than the single composite indicator used in the calculation of the HDI. The HDI ranges between 0 and 1, and it rises as each indicator rises. Accordingly, exploratory data analysis should reveal fairly strong positive correlations between these indicators and the HDI. It would be interesting to compare how strongly each indicator is associated with the HDI.

In [1]:
import pandas as pd
import csv
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

The datasets from Gapminder, converted from Excel to CSV format, are:

- HDI: Human Development Index
- GNI: Total GNI (PPP, current international $)
- MEI: Mean years in school (men 25 years and older)
- WEI: Mean years in school (women 25 years and older)
- LEI: Life expectancy at birth

Each dataset has one row per country and one column per year. The countries and years covered vary among the datasets, so an immediate task is to settle on a common set of countries and a common year for analysis.
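One way to approach this standardization is to intersect the country labels and the year columns across the datasets. The sketch below is a minimal illustration, not the notebook's own method. It assumes each frame keeps country names in its first column and uses years as the remaining column names. The helper `common_coverage` and the toy frames are hypothetical.

```python
import pandas as pd


def common_coverage(frames):
    """Return (countries, years) shared by every DataFrame in `frames`.

    Hypothetical helper: assumes the first column of each frame holds
    country names and the remaining column names are years.
    """
    country_sets = [set(df.iloc[:, 0].dropna()) for df in frames.values()]
    year_sets = [set(map(str, df.columns[1:])) for df in frames.values()]
    countries = sorted(set.intersection(*country_sets))
    years = sorted(set.intersection(*year_sets))
    return countries, years


# Toy frames with made-up values, only to show the mechanics.
toy = {
    'A': pd.DataFrame({'A': ['X', 'Y', 'Z'],
                       '2008': [1.0, 2.0, 3.0],
                       '2009': [1.1, 2.1, 3.1]}),
    'B': pd.DataFrame({'B': ['Y', 'Z', 'W'],
                       '2009': [5.0, 6.0, 7.0],
                       '2010': [5.5, 6.5, 7.5]}),
}
countries, years = common_coverage(toy)
print(countries, years)
```

A shared label only means the row or column exists in every file. The cells themselves may still be empty, so missing values have to be handled separately.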



In [2]:
from pathlib import Path

def print_head(filename):
    """
    Read the CSV at filename and print its indicator name.

    Returns (indicator, df); the caller prints df.head().
    """
    indicator = Path(filename).stem  # './HDI.csv' -> 'HDI'
    print('\nIndicator: {}'.format(indicator))
    df = pd.read_csv(filename)
    return (indicator, df)

data_files = ['./HDI.csv',  ## Human Development Index
              './LEI.csv',  ## Life Expectancy at birth
              './MEI.csv',  ## Mean Years in School (Men)
              './WEI.csv',  ## Mean Years in School (Women)
              './GNI.csv']  ## Gross National Income

example_indicators = {}
for data_file in data_files:
    indicator, df = print_head(data_file)
    example_indicators[indicator] = df
    print(df.head())


Indicator: HDI
                     HDI   1980   1990   2000   2005   2006   2007   2008  \
0            Afghanistan  0.198  0.246  0.230  0.340  0.354  0.363  0.370   
1  Akrotiri and Dhekelia    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2                Albania    NaN  0.656  0.691  0.721  0.724  0.729  0.733   
3                Algeria  0.454  0.551  0.624  0.667  0.673  0.680  0.686   
4         American Samoa    NaN    NaN    NaN    NaN    NaN    NaN    NaN   

    2009   2011  
0  0.387  0.398  
1    NaN    NaN  
2  0.734  0.739  
3  0.691  0.698  
4    NaN    NaN  

Indicator: LEI
  Life expectancy with projections  1765  1766  1767  1768  1769  1770  1771  \
0                      Afghanistan   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1                          Albania   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2                          Algeria   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3                   American Samoa   NaN   NaN   NaN   NaN   NaN   NaN   NaN 

Initial exploration showed that most countries in the datasets have data, although the years for which data are available vary considerably.

We use data from 2009 because it is the latest year given in the "Mean years in school" datasets. It is also one of the best-populated years across the datasets, providing more data to work with. We also print the number of countries in each dataset, because we will want a common set of countries for all data.
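The choice of year can be checked by counting the non-missing values in each year column. The following is a minimal sketch under the assumption that the first column holds country names and every other column is a year. The helper `coverage_by_year` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def coverage_by_year(df):
    """Count non-missing values in each year column (all columns after the first)."""
    return df.iloc[:, 1:].notna().sum().sort_values(ascending=False)


# Toy frame with made-up values; in the notebook one could instead pass
# a frame loaded with pd.read_csv('./HDI.csv').
toy = pd.DataFrame({
    'Country': ['X', 'Y', 'Z'],
    '2008': [0.5, None, None],
    '2009': [0.6, 0.7, None],
})
coverage = coverage_by_year(toy)
print(coverage)
```

Running the same count on each of the five files would show which years are well populated in all of them.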

In [30]:
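def indicator_data_file(indicator_file):
    """
    Return the country column and the 2009 column of an indicator CSV,
    dropping rows with missing values.
    """
    df = pd.read_csv(indicator_file)
    # The country column's header differs per file, hence the alternatives.
    return df.filter(regex='HDI|Row Labels|Life expectancy|GNI per capita|2009').dropna()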

data = {}
for indicator in ['HDI', 'LEI', 'MEI', 'WEI', 'GNI']:
    data[indicator] = indicator_data_file('./{}.csv'.format(indicator))
    data[indicator].columns = ['Country', indicator]
    # Number of countries with 2009 data in this dataset
    print('{}: {} countries'.format(indicator, len(data[indicator])))

# Inner joins keep only the countries present in every dataset.
df1 = pd.merge(data['HDI'], data['LEI'], on='Country')
df2 = pd.merge(df1, data['MEI'], on='Country')
df3 = pd.merge(df2, data['WEI'], on='Country')
df4 = pd.merge(df3, data['GNI'], on='Country')
df4.to_csv("indicators.csv")

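At this point, the 2009 data from the indicator datasets are merged and written to a single file, `indicators.csv`, with one row per country and the indicators as fields. Because `pd.merge` performs an inner join by default, only countries present in all five datasets are kept.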
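With the indicators in one table, their association with the HDI can be compared directly. The sketch below uses the Pearson correlation, which is an assumption on our part rather than a method stated in this notebook. The helper `correlations_with_hdi` and the toy values are hypothetical.

```python
import pandas as pd


def correlations_with_hdi(df, target='HDI'):
    """Pearson correlation of each numeric indicator column with `target`."""
    numeric = df.select_dtypes('number')
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy values, made up purely for illustration; in the notebook one could
# instead load pd.read_csv('indicators.csv', index_col=0).
toy = pd.DataFrame({
    'Country': ['X', 'Y', 'Z', 'W'],
    'HDI': [0.40, 0.55, 0.70, 0.85],
    'LEI': [55.0, 62.0, 71.0, 79.0],
    'GNI': [1200.0, 4000.0, 9000.0, 30000.0],
})
result = correlations_with_hdi(toy)
print(result)
```

A correlation coefficient only describes linear association. It does not show that an indicator causes a change in the HDI, and a skewed variable such as GNI may need a log transform before its coefficient is comparable with the others.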

<a id='eda'></a>
## Exploratory Data Analysis

<a id='conclusions'></a>
## Conclusions
