# [ERG-190C] Homework 3: EDA Energy Access 
<br>

### Table of Contents
[Introduction](#intro)<br>
1 - [The Data](#data)<br>
2 - [Classifying Countries](#classify)<br>
3 - [Computing HDI](#compute)<br>
4 - [Country Rankings](#rank)<br>

### Introduction <a id='intro'></a>

In this homework, you will investigate the Human Development Index (HDI) and its components. The main goal for this assignment is to understand how various factors such as GNI per capita, life expectancy, and education affect HDI. 

We will accomplish this by analyzing World Bank data and utilizing exploratory data analysis (EDA). To give you a sense of how we think about each discovery we make and what next steps it leads to, we will provide comments and insights along the way.

### Topics Covered 

* Work with different file types
* Merge dataframes and perform operations to add new columns
* View data through the lens of structure, granularity, scope, temporality and faithfulness
* Understand how HDI is constructed
* Perform basic data cleaning operations on errors we have deliberately introduced into the dataset

**Dependencies:**

In [None]:
# Run this cell to set up your notebook.  Make sure utils.py is in this assignment's folder
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

from IPython.display import display, Latex, Markdown

----
## Section 1: The Data<a id='data'></a>

In this notebook, you'll be working with data from the World Bank on GNI per capita, life expectancy, and education for different countries and regions around the world. Feel free to visit the links below to play around with the visualizations on the World Bank website as well: <br>
A. GNI per capita data: https://data.worldbank.org/indicator/NY.GNP.PCAP.PP.CD <br>
B. Life expectancy data: https://data.worldbank.org/indicator/sp.dyn.le00.in <br>
C. Education data: expected years of schooling: http://hdr.undp.org/en/indicators/69706, mean years of schooling: http://hdr.undp.org/en/indicators/103006

<br>**Question 1.1:** Look through the `data` folder and then load the csv or tsv files into the homework so we can easily work with the data. The first example has been done for you.

Load the GNI metadata .csv into a dataframe.

In [None]:
#Due to the World Bank's data layout, we have to download a table of GNI info for different countries and a table of yearly GNI data
#GNI information and income groups
gni_info = pd.read_csv("data/GNI_country_metadata.csv")
gni_info.head()
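The files in the `data` folder may be comma-separated or tab-separated. The following is a minimal sketch of how `pd.read_csv` handles both, using made-up in-memory text (the column names and values are hypothetical, not taken from the real files):

```python
import io
import pandas as pd

# Hypothetical file contents standing in for real files in the data folder.
csv_text = "Country Code,Value\nAAA,1.5\nBBB,2.5\n"
tsv_text = "Country Code\tValue\nAAA\t1.5\nBBB\t2.5\n"

# A comma is the default separator, so no extra argument is needed.
csv_df = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text needs sep="\t"; otherwise each row is read as one column.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_df)
print(tsv_df)
```

With a real file you would pass the file path instead of the `io.StringIO` object.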

Load the GNI per capita (PPP) .csv into a dataframe.

In [None]:
#GNI per capita, PPP (current international $)
gni_num = ...
gni_num.head()

Let's merge the GNI metadata dataframe with the GNI PPP dataframe.

In [None]:
#Run this cell
#Merging GNI description data with GNI PPP data
#GNI per capita, PPP (current international $)
gni_all = gni_num.merge(gni_info[['Country Code', 'Region', 'Income Group']], on='Country Code')
gni_all.head()
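By default, `merge` performs an inner join, so rows whose key appears in only one table are dropped. The sketch below illustrates this on two made-up tables (all codes and values are hypothetical) and uses `indicator=True` on an outer join to show where each row came from:

```python
import pandas as pd

# Hypothetical toy tables; "CCC" and "DDD" each appear in only one of them.
left = pd.DataFrame({"Country Code": ["AAA", "BBB", "CCC"],
                     "2012": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"Country Code": ["AAA", "BBB", "DDD"],
                      "Region": ["R1", "R2", "R3"]})

# Default how="inner": only keys present in both tables survive.
inner = left.merge(right, on="Country Code")

# An outer join keeps every key; the _merge column records its origin.
outer = left.merge(right, on="Country Code", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer[["Country Code", "_merge"]])
```

Comparing the row counts before and after a merge is a quick way to spot keys that failed to match.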

----------

Next, load the .csv files containing information on life expectancy, expected years of schooling, and mean years of schooling. When drawing information from different datasets, it's good practice to have an understanding of how complete your datasets are. Print the number of unique country names included in each dataframe. 
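One possible way to count distinct names is the `nunique` method on a column. A small sketch on made-up data (the column names here are assumptions, so check the actual column names in each table):

```python
import pandas as pd

# Hypothetical long-format table: one row per country and year.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "C", "C", "C"],
    "Year": [2011, 2012, 2012, 2010, 2011, 2012],
})

n_rows = len(toy)                        # total number of rows
n_countries = toy["Country"].nunique()   # number of distinct country names

print(n_rows, n_countries)
```

The row count and the unique count differ whenever a country appears in more than one row.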

In [None]:
# Life expectancy at birth for both sexes combined (years)
life = ...
life.head()

In [None]:
# Expected years of schooling (years)
expected_edu = ...
expected_edu.head()

In [None]:
# Mean years of schooling (years)
mean_edu = ...
mean_edu.head()

What do you notice?

In [None]:
# your answer here

**Question 1.2:** 
Inspect the loaded tables and note the data type of each column.
<br>Then for each of the tables answer, what is the:
1. structure of the data?<br>
2. granularity of the data?<br>
3. scope of the data?<br>
4. temporality of the data?<br>
5. faithfulness of the data?<br>

Reminder:
* Scope - Are all countries included? Within each country can you find information about whether data are derived from a census, random sample, or other? 
* Faithfulness - Where do the data come from? Is there any reason to question it? 
* Granularity - At what level of detail is the data recorded? Is the data high granularity or low granularity (e.g. hourly data vs. yearly data)?

Answers:

1) PPP

[<i>fill out here</i>]

2) Life expectancy at birth

[<i>fill out here</i>]

3) Expected years of schooling

[<i>fill out here</i>]

4) Mean years of schooling

[<i>fill out here</i>]

**Question 1.3:** How many regions are in the GNI info table?

In [None]:
count_regions = ...

----
## Section 2: Classifying Countries<a id='classify'></a>

We can see that the GNI per capita data is higher granularity than the life expectancy and education data because it contains the country codes and regions. Let's try to merge these datasets.

<br>**Question 2.1:** Merge the life expectancy table and the GNI info table. Check that `life_info` has 1503 rows.

In [None]:
# copy column and rename so it can be merged with gni table
life['Country Name'] = life['Country or Area']

# create merged table
life_info = ...

# drop unneeded columns
life_info.drop(['Unnamed: 5', 'Value Footnotes'], axis=1, inplace=True)

**Question 2.2:** The merged shape differs from the two original data frames.  Why?  Use the function `returnNotMatches` below to investigate country names.

Use the function `returnNotMatches(a, b)` to compare two lists (don't edit this - just run it).

In [None]:
def returnNotMatches(a, b):
    a = set(a)
    b = set(b)
    return [list(b - a), list(a - b)]
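Note the order of the returned lists: the first holds items found only in `b`, the second holds items found only in `a`. A quick illustration with made-up names (the function is repeated so the snippet runs on its own):

```python
def returnNotMatches(a, b):
    a = set(a)
    b = set(b)
    return [list(b - a), list(a - b)]

# Hypothetical name lists, not taken from the datasets.
first = ["Alpha", "Beta", "Gamma"]
second = ["Beta", "Gamma", "Delta"]

# Element 0: only in the second argument; element 1: only in the first.
only_in_second, only_in_first = returnNotMatches(first, second)

print(only_in_second)
print(only_in_first)
```

Keep this order in mind when labeling the printed results.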

In [None]:
# your code here
diffs = ...
print('Countries and regions only in `life` are:\n ', ..., '\n')

print('Countries and regions only in `gni_info` are:\n ', ...)

**Question 2.3:** Merge the expected years of schooling table and the mean years of schooling table for the year 2012 to agglomerate a table of all the years of schooling. Take out unneeded columns. The final table should be called `education` and have 186 rows and contain:
1. `HDI Rank_mean` (this is the HDI rank as saved in the mean education data)
1. `HDI Rank_expected` (this is the HDI rank as saved in the expected education data)
1. `Country`
1. `2012_mean` (this is the education years as saved in the mean education data)
1. `2012_expected` (this is the education years as saved in the expected education data)

*Hint: `loc` and `rename` are helpful to get the required data*
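As a rough illustration of the hint, the sketch below selects columns with `loc`, renames them with `rename`, and shows that the `suffixes` argument of `merge` can produce the same kind of names. The toy tables and their values are made up, and the real tables may need extra cleaning:

```python
import pandas as pd

# Hypothetical wide-format tables with one column per year.
mean_toy = pd.DataFrame({"HDI Rank": ["1", "2"], "Country": ["A", "B"],
                         "2011": [9.0, 5.0], "2012": [9.1, 5.2]})
exp_toy = pd.DataFrame({"HDI Rank": ["1", "2"], "Country": ["A", "B"],
                        "2011": [15.0, 11.0], "2012": [15.2, 11.3]})

# loc with a list of labels keeps only the columns we need.
keep = ["HDI Rank", "Country", "2012"]
mean_2012 = mean_toy.loc[:, keep]
exp_2012 = exp_toy.loc[:, keep]

# rename returns a copy with the selected columns relabeled.
renamed = mean_2012.rename(columns={"HDI Rank": "HDI Rank_mean",
                                    "2012": "2012_mean"})

# Alternatively, suffixes= labels the overlapping columns during the merge.
merged = mean_2012.merge(exp_2012, on="Country",
                         suffixes=("_mean", "_expected"))
print(merged.columns.tolist())
```

Either route gives distinct column names for the two sources, which keeps the merged table unambiguous.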

In [None]:
...
education = ...

**Question 2.4:** Create a bar plot of the expected vs. mean years of schooling in 2012 for the United States. Use the method `.plot` on the data frame.

In [None]:
# your code here

**Question 2.5:** Compare the mean years of schooling in 2012 with the expected years of schooling in 2012 for the United States. What factors (e.g. income, health, etc.) do you think affect the mean vs. the expected years of schooling?

Answer: YOUR ANSWER HERE

**Question 2.6:** Create a dataframe called `life_info_2012both` from your merged `life_info` table that contains only the data for the year 2012 and for both sexes. The final table should contain 167 rows.
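Filtering on two conditions is usually done with a boolean mask. A minimal sketch on made-up data; the column names `Year` and `Subgroup` and the label `Both sexes` are assumptions, so check the actual names and values in your table:

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
toy = pd.DataFrame({
    "Country or Area": ["A", "A", "A", "B"],
    "Subgroup": ["Female", "Both sexes", "Both sexes", "Both sexes"],
    "Year": [2012, 2012, 2011, 2012],
    "Value": [70.0, 68.0, 67.5, 80.0],
})

# Each comparison gives a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (toy["Year"] == 2012) & (toy["Subgroup"] == "Both sexes")
subset = toy.loc[mask]

print(subset)
```

If the year column is stored as text rather than as a number, the comparison value needs to be a string as well.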

In [None]:
...
life_info_2012both = ...

**Question 2.7:** Merge your `life_info_2012both` table with your `education` table to create a table `life_ed_info` that aggregates almost all the tables we have worked with so far.

In [None]:
#rename columns
life_info_2012both = life_info_2012both.rename(columns={'Country or Area': 'Country'})

#merge tables
life_ed_info = ...

**Question 2.8:** Create and display a list of countries that were lost when you merged `life_info_2012both` with `education`.

In [None]:
#FILL IN ELLIPSES BELOW
diffs = ...

print('Countries and regions only in `life_info_2012both` are \n ', diffs[0],'\n') 
print('Countries and regions only in `education` are:\n ', diffs[1])

**Question 2.9:** According to our aggregate dataframe `life_ed_info`, what is the 
1. structure of the data?
1. granularity of the data?
1. scope of the data?
1. temporality of the data?
1. faithfulness of the data?

Answer: YOUR ANSWER HERE

**Question 2.10:** Group the `'HDI Rank_mean'` data by `'Income Group'`.  Summarize the data in a way that allows you to describe the relationship between HDI and income.

*Hint: the methods `groupby` and `aggregate` are useful*
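For reference, a small sketch of the `groupby` plus `aggregate` pattern on made-up data (the group labels and ranks are hypothetical):

```python
import pandas as pd

# Hypothetical table: a numeric rank and a categorical group per row.
toy = pd.DataFrame({
    "Income Group": ["High", "High", "Low", "Low", "Low"],
    "Rank": [1, 3, 150, 160, 170],
})

# Group by the category, pick the numeric column, then compute
# several summary statistics at once.
summary = toy.groupby("Income Group")["Rank"].aggregate(
    ["mean", "min", "max", "count"]
)

print(summary)
```

Passing a list of statistic names gives one output column per statistic, which makes the groups easy to compare side by side.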

In [None]:
# RUN THIS CELL (don't change anything)
# The HDI rank data are strings, so we'll convert them to numbers before proceeding.  
life_ed_info['HDI Rank_mean'] = pd.to_numeric(life_ed_info['HDI Rank_mean'])

In [None]:
#Your answer here.  Create (and display) a dataframe that summarizes your results

<br>

----

## Section 3. Computing HDI<a id='compute'></a>

<br>
<img src="hdi.png" width=800>

In this section, we will normalize each individual metric (GNI, life expectancy, education) and compute HDI based on the United Nations guide [here](http://hdr.undp.org/sites/default/files/hdr2016_technical_notes_0.pdf "UNDP HDI Notes"). "The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions."

<br>
The formula for calculating HDI is here:  
<img src="hdicalc.png" width=300>
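In case the image does not render, the calculation can be written out as follows. This is a sketch based on the general structure described in the UNDP technical notes; the specific minimum and maximum values ("goalposts") for each dimension are listed in the linked guide and are left symbolic here.

Each dimension index rescales an observed value $x$ between its goalposts:

$$I = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

For income, the natural logarithm of GNI per capita is used:

$$I_{\text{income}} = \frac{\ln(\text{GNI}) - \ln(\text{GNI}_{\min})}{\ln(\text{GNI}_{\max}) - \ln(\text{GNI}_{\min})}$$

The education index is the arithmetic mean of the indices for mean years of schooling (MYS) and expected years of schooling (EYS):

$$I_{\text{education}} = \frac{I_{\text{MYS}} + I_{\text{EYS}}}{2}$$

The HDI is the geometric mean of the three dimension indices:

$$\text{HDI} = \left(I_{\text{health}} \cdot I_{\text{education}} \cdot I_{\text{income}}\right)^{1/3}$$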

<br>
Before we proceed, we're going to load in a correctly merged version of the data from the last section (in case you ran into errors).  Note also we've massaged the data a little, so it's important to use this table.

In [None]:
# RUN THIS CELL
life_ed_gni = pd.read_csv('data/life_ed_gni.csv')
life_ed_gni.head()

**Question 3.1:** Define a function that normalizes GNI. Test this function by inputting Afghanistan's GNI per capita (PPP) in 2012 (use the `gni_all` table).

In [None]:
def normalize_GNI(gni):
    """
    Normalize GNI to get the Income Index.

    Args:
        gni: GNI per capita (PPP) of a country
            in a given year (number)

    Returns:
        The Income Index (float)
    """
    numerator = (np.log(...)-np.log(...))
    denominator = (np.log(...)-np.log(...))
    return np.divide(numerator, denominator)

Run the following cell -- if it raises an error, that means there's an error in the function.

In [None]:
#RUN THIS CELL
first_num = life_ed_gni.loc[0,'GNI']

test_gni_ans = normalize_GNI(first_num)

assert test_gni_ans == 0.4447743835010624
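The assertion above compares floats with exact equality, so an algebraically equivalent formula written in a different order could differ in the last digits and fail the check. The sketch below uses made-up numbers (not the goalposts from the technical notes) to show how `np.isclose` compares two such results within a tolerance:

```python
import numpy as np

# Hypothetical values chosen only for illustration.
x, lo, hi = 50.0, 2.0, 1000.0

# Two algebraically equivalent ways to write the same log ratio.
v1 = (np.log(x) - np.log(lo)) / (np.log(hi) - np.log(lo))
v2 = np.log(x / lo) / np.log(hi / lo)

# The two results may or may not be bit-identical, but they agree
# to within floating-point tolerance.
print(v1, v2, np.isclose(v1, v2))
```

If your function fails the assertion by a tiny margin, check whether it follows the same order of operations as the template before looking for a logic error.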

**Question 3.2:** Define a function that normalizes life expectancy. Test this function by inputting Afghanistan's life expectancy for both sexes in 2012.

In [None]:
def normalize_life(life):
    """
    Normalize life expectancy to get the Life Expectancy Index.

    Args:
        life: life expectancy at birth for both
            sexes, in years (number)

    Returns:
        The Life Expectancy Index (float)
    """
    sub = life-20
    constants = 85-20
    return np.divide(..., ...)

Test the function by running the cell below -- if it raises an error, that means there's an error with the function.

In [None]:
#RUN THIS CELL
life_num = life_ed_gni.loc[0,'Life']

test_life_ans = normalize_life(life_num)
assert test_life_ans == 0.6153846153846154

**Question 3.3:** Define a function that calculates the Education Index. Test this function by inputting Afghanistan's expected years of schooling for 2012 and mean years of schooling for 2012.

In [None]:
def normalize_ed(mean_var, exp_var):
    """
    Normalize years of schooling to get the Years of Schooling Index.

    Args:
        First variable is mean education, second is expected education.

    Returns:
        The Years of Schooling Index (float)
    """ 
    mysi = np.divide(mean_var, ...)
    eysi = np.divide(exp_var, ...)
    add = mysi+eysi
    return np.divide(add, 2)

Test the function by running the cell below -- if it raises an error, that means there's an error with the function.

In [None]:
#RUN THIS CELL
ed_nums = life_ed_gni.loc[0,['Ed_mean', 'Ed_expected']]

# Use .iloc for positional access; plain [0] on a label-indexed Series is deprecated
test_ed_ans = normalize_ed(ed_nums.iloc[0], ed_nums.iloc[1])
assert test_ed_ans == 0.38833333333333336

**Question 3.4:** Define a function that calculates the HDI. Test this function by inputting Afghanistan's normalized GNI in 2012, normalized life expectancy for both sexes in 2012, and normalized education index in 2012.

In [None]:
def calc_hdi(gni_var, life_var, ed_var):
    """
    Compute HDI from normalized gni, life and education variables.
    
    Args:
        normalized gni (first entry), life (second entry) and education (third entry).
    
    Returns: 
        The HDI (float)
    """ 
    var = ...
    return var **(np.divide(1,3))

In [None]:
#these three values were calculated using the previous three functions
assert calc_hdi(test_gni_ans, test_life_ans, test_ed_ans) == 0.4736930620781577

**Question 3.5:** Why is it important to normalize each individual metric in the HDI?

[<i>your answer here</i>]

**Question 3.6:** Use `.apply()` to create three new columns in the `life_ed_gni` data frame.

* The first new column will be normalized GNI, called 'GNI_n'
* The second new column will be normalized life, called 'Life_n'
* The third new column will be normalized Education, called 'Ed_n'

In [None]:
#FILL IN THE ELLIPSES BELOW
life_ed_gni['GNI_n']= life_ed_gni['GNI'].apply(...)
life_ed_gni['Life_n']= life_ed_gni[...].apply(...)
life_ed_gni['Ed_n']= life_ed_gni[...].apply(lambda x: normalize_ed(..., ...), axis=1)

**Question 3.7:** Find a way to check that all your normalized variables are in the range you expected.  

*Note: if you've done it right, you'll find that a few values fall slightly outside the range (by a few percent), and that's OK.*
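There are several reasonable ways to do this check. The sketch below shows two of them on made-up columns (the names `a_n` and `b_n` are hypothetical): summarizing the minimum and maximum of every column, and listing the rows that fall outside a given interval.

```python
import pandas as pd

# Hypothetical normalized columns; one value slightly exceeds 1.
toy = pd.DataFrame({"a_n": [0.10, 0.55, 1.02],
                    "b_n": [0.00, 0.40, 0.99]})

# Minimum and maximum of every column in one small table.
bounds = toy.agg(["min", "max"])

# between() is inclusive at both ends by default; ~ negates the mask,
# leaving only the rows where a_n is outside [0, 1].
out_of_range = toy[~toy["a_n"].between(0, 1)]

print(bounds)
print(out_of_range)
```

The first approach gives a quick overview, while the second identifies exactly which rows to inspect.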

In [None]:
# your code here

**Question 3.8:** Add a column to the data frame called 'HDI' that contains HDI values computed with `calc_hdi`.

Hint: As in Question 3.6, to use `.apply` with a function that takes multiple arguments you'll need to use a lambda function.  You can follow the syntax from Question 3.6 here to make it work.
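To illustrate the row-wise pattern in general terms, here is a sketch on a made-up table. The helper `geo_mean3` and the column names are hypothetical stand-ins, not part of the assignment's code:

```python
import pandas as pd

def geo_mean3(x, y, z):
    """Hypothetical helper: geometric mean of three positive numbers."""
    return (x * y * z) ** (1 / 3)

# Invented values for illustration.
toy = pd.DataFrame({"p": [1.0, 0.5], "q": [8.0, 0.5], "r": [1.0, 0.5]})

# With axis=1 the lambda receives one row at a time (as a Series),
# so individual columns can be passed as separate arguments.
toy["g"] = toy.apply(
    lambda row: geo_mean3(row["p"], row["q"], row["r"]), axis=1
)

print(toy)
```

Without `axis=1`, `apply` would pass whole columns to the lambda instead of rows, and the column lookups by name would fail.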

In [None]:
#YOUR ANSWER HERE
life_ed_gni['...'] = life_ed_gni.apply(...)
life_ed_gni.head()

**Question 3.9:** Some countries have NaN values for the HDI metrics you created.  Why are they there?  

[<i>your answer here</i>]

----

## Section 4: Country Rankings<a id='rank'></a>

<br>
We will examine how country rankings for HDI compare to rankings for individual metrics (health, education, income). 

Hint: the NaNs we discussed in Question 3.9 might get in the way of you displaying sorted data.  The `.dropna()` method might help.  
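A minimal sketch of dropping missing values before sorting, on made-up data (the column names and scores are hypothetical):

```python
import numpy as np
import pandas as pd

# Invented table with one missing score.
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "Score": [0.9, np.nan, 0.4, 0.7]})

# subset= restricts the NaN check to the column we plan to sort on.
clean = toy.dropna(subset=["Score"])

# sort_values is ascending by default; head(n) keeps the first n rows.
lowest = clean.sort_values("Score").head(2)
highest = clean.sort_values("Score", ascending=False).head(2)

print(lowest)
print(highest)
```

Calling `dropna()` with no arguments would also drop rows that have a NaN in any other column, which may remove more rows than intended.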

**Question 4.1:** Display all columns of the `life_ed_gni` data frame for the 10 countries with the lowest HDI.

In [None]:
#Bottom 10

**Question 4.2:** Display all columns of the `life_ed_gni` data frame for the 10 countries with the highest HDI.

In [None]:
#Top 10

**Question 4.3:** GNI, life expectancy and education indices are strongly correlated with each other and with HDI for the bottom and the top countries.  But the correlation is not perfect.  Describe at least two important differences between countries that you think one would miss if they simply compared them by HDI.

[<i>your answer here</i>]

----

## Submission

Congrats, you're done with homework 3!

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Then, go to the toolbar and click **File** --> **Download as** --> **.html** and submit the file through bCourses.

----

## Bibliography

- United Nations - HDI definition. http://hdr.undp.org/en/content/human-development-index-hdi

---
Notebook developed by: Melissa Ly

Data Science Modules: http://data.berkeley.edu/education/modules