<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-and-cleanup-dataset" data-toc-modified-id="Read-and-cleanup-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read and cleanup dataset</a></span></li></ul></div>

- Choose and download another dataset from UN Data (http://data.un.org/Explorer.aspx) to merge with your data and explore.
- Prepare a short (under 5 minutes) presentation of your findings.
- Report any interesting correlations you find.
- Include visualizations and consider adding interactivity with ipywidgets.
- The presentation can be done either in a Jupyter Notebook or in other presentation software, such as PowerPoint. (Check out Jupyter Slides if you have time; it lets you turn your Jupyter notebook into a slideshow.)

In [1]:
# Importing all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm

In [2]:
# Jupyter magic so we don't have to call plt.show() every time
%matplotlib inline

# Have pandas display up to 500 columns
pd.set_option('display.max_columns', 500)

# Apply seaborn's default theme
sns.set()

## Read and cleanup dataset
We are going to look at an education metric by country: the literacy rate of young people (15-24 year olds).

In [3]:
# Pandas Dataframe
literacy_df = pd.read_csv("../data/literacy_rate_15_24_year_olds.csv")

# Quick glance
display(literacy_df.head())
display(literacy_df.tail())

Unnamed: 0,Country or Area,Subgroup,Year,Source,Unit,Value,Value Footnotes
0,Afghanistan,Female 15-24 yr,2000.0,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,18.4,1.0
1,Afghanistan,Male 15-24 yr,2000.0,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,50.8,1.0
2,Albania,Female 15-24 yr,2001.0,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,99.5,2.0
3,Albania,Male 15-24 yr,2001.0,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,99.4,2.0
4,Algeria,Female 15-24 yr,2002.0,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,86.1,1.0


Unnamed: 0,Country or Area,Subgroup,Year,Source,Unit,Value,Value Footnotes
433,3,UNESCO Institute of Statistics estimates.,,,,,
434,4,India: data exclude 3 sub-divisions.,,,,,
435,5,Census. Serbia and Montenegro: data exclude Ko...,,,,,
436,6,Sri Lanka: data represent 18 of 25 districts.,,,,,
437,7,Sudan: Data are for North Sudan only.,,,,,


The tail shows that the file ends with footnote rows rather than data, so we have some cleanup to do.

In [4]:
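# Re-import, skipping the 8 footnote rows at the end of the file
# (skipfooter requires the Python parsing engine)
# Alternative without re-reading: literacy_df = literacy_df[:-8]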
literacy_df = pd.read_csv("../data/literacy_rate_15_24_year_olds.csv", skipfooter=8, engine="python")
literacy_df.tail()

Unnamed: 0,Country or Area,Subgroup,Year,Source,Unit,Value,Value Footnotes
425,Zambia,Male 15-24 yr,1990,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,67.3,2
426,Zimbabwe,Female 15-24 yr,2004,UNSD_MDGInfo CDROM_Oct2007 (International esti...,Percent,97.9,3
427,Zimbabwe,Female 15-24 yr,1992,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,94.4,2
428,Zimbabwe,Male 15-24 yr,2004,UNSD_MDGInfo CDROM_Oct2007 (International esti...,Percent,97.5,3
429,Zimbabwe,Male 15-24 yr,1992,UNSD_MDGInfo CDROM_Oct2007 (National figure),Percent,96.5,2
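
Hardcoding `skipfooter=8` works for this file, but the count would need to be rechecked for any other UN Data export. As a hedged alternative, the sketch below drops footer rows by testing whether `Value` parses as a number. The inline CSV is a toy stand-in (the footnote lines in it are hypothetical, not copied from the real file), and the approach assumes that every genuine data row has a numeric `Value`; a data row with a missing value would be dropped as well.

```python
import io
import pandas as pd

# Toy stand-in for a UN Data export: two data rows, then a footnote block.
# The two footnote lines are hypothetical placeholders.
raw = io.StringIO(
    "Country or Area,Subgroup,Year,Value,Value Footnotes\n"
    "Afghanistan,Female 15-24 yr,2000,18.4,1\n"
    "Afghanistan,Male 15-24 yr,2000,50.8,1\n"
    "fnSeqID,Footnote,,,\n"
    "1,Hypothetical footnote text.,,,\n"
)

df = pd.read_csv(raw)

# Footnote rows carry no numeric Value, so coerce and keep only rows that parse.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")
clean = df.dropna(subset=["Value"]).reset_index(drop=True)

# The empty cells in the footer forced Year to float; restore integers.
clean["Year"] = clean["Year"].astype(int)

n_footer = len(df) - len(clean)
print(f"Dropped {n_footer} footer rows")
```

Printing the number of dropped rows gives a quick check against what `tail()` showed.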


In [5]:
# Drop the Source, Unit, and Value Footnotes columns
literacy_df = literacy_df.drop(columns=['Source', 'Unit', 'Value Footnotes'])
literacy_df.head()

Unnamed: 0,Country or Area,Subgroup,Year,Value
0,Afghanistan,Female 15-24 yr,2000,18.4
1,Afghanistan,Male 15-24 yr,2000,50.8
2,Albania,Female 15-24 yr,2001,99.5
3,Albania,Male 15-24 yr,2001,99.4
4,Algeria,Female 15-24 yr,2002,86.1


In [8]:
# Renaming the columns
literacy_df.columns = ["Country", "Subgroup", "Year", "Literacy_Pct"]
literacy_df.head()

Unnamed: 0,Country,Subgroup,Year,Literacy_Pct
0,Afghanistan,Female 15-24 yr,2000,18.4
1,Afghanistan,Male 15-24 yr,2000,50.8
2,Albania,Female 15-24 yr,2001,99.5
3,Albania,Male 15-24 yr,2001,99.4
4,Algeria,Female 15-24 yr,2002,86.1
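
Assigning a list to `.columns` renames by position, so it silently mislabels data if the column order ever changes. A minimal sketch of a name-based alternative using `DataFrame.rename` follows; the two-row frame is a small stand-in built from the rows shown above.

```python
import pandas as pd

# Small stand-in for literacy_df after the drop step.
df = pd.DataFrame({
    "Country or Area": ["Afghanistan", "Afghanistan"],
    "Subgroup": ["Female 15-24 yr", "Male 15-24 yr"],
    "Year": [2000, 2000],
    "Value": [18.4, 50.8],
})

# Rename by name: only the listed columns change, and order does not matter.
renamed = df.rename(columns={
    "Country or Area": "Country",
    "Value": "Literacy_Pct",
})

print(list(renamed.columns))
```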


In [10]:
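# Split Subgroup (e.g. "Female 15-24 yr") on whitespace into three parts:
# sex, age range (stored as Year_Group), and the "yr" suffix
literacy_df[['Sex','Year_Group', 'Yr']] = literacy_df['Subgroup'].str.split(expand=True)

# Drop Yr and Subgroup, which are no longer needed
literacy_df = literacy_df.drop(['Yr', 'Subgroup'], axis=1)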
literacy_df.head()

Unnamed: 0,Country,Year,Literacy_Pct,Sex,Year_Group
0,Afghanistan,2000,18.4,Female,15-24
1,Afghanistan,2000,50.8,Male,15-24
2,Albania,2001,99.5,Female,15-24
3,Albania,2001,99.4,Male,15-24
4,Algeria,2002,86.1,Female,15-24
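
The whitespace split assumes every `Subgroup` value has exactly three tokens; a value with a different shape would make the three-column assignment fail or misalign. As a sketch of a stricter alternative, `Series.str.extract` with named groups pulls out the same two fields and leaves `NaN` wherever the pattern does not match. The regular expression is an assumption based on the two subgroup formats visible in the output above.

```python
import pandas as pd

# Stand-in for the Subgroup column, using the two formats seen in the data.
df = pd.DataFrame({"Subgroup": ["Female 15-24 yr", "Male 15-24 yr"]})

# Named groups become column names; rows that do not match yield NaN.
pattern = r"^(?P<Sex>Female|Male)\s+(?P<Year_Group>\d+-\d+)\s+yr$"
parts = df["Subgroup"].str.extract(pattern)

# Count rows the pattern failed to parse, so surprises are visible.
unmatched = int(parts["Sex"].isna().sum())
print(parts)
print(f"Unmatched rows: {unmatched}")
```

If `unmatched` is nonzero on the real data, the offending `Subgroup` values can be inspected before deciding how to handle them.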
