# Project: GapMinder Education Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>add an image 
>
>give a brief intro
Primary school definition - https://en.wikipedia.org/wiki/Primary_school

In [455]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### Let's first load the data for all the posed research questions!

### Gender Ratio of enrollment in school
>The data has the gender parity index for gross enrollment in primary and secondary education. This is the ratio of girls to boys enrolled at primary and secondary grades of public and private schools. <br />
>Source : <a href = "https://data.worldbank.org/indicator/SE.ENR.PRSC.FM.ZS">World Bank</a><br />
>No of countries : 204


In [456]:
df_gend_ratio_pr_sec_enrollment = pd.read_csv("./Data/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_enrollment.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_enrollment.shape[1])
df_gend_ratio_pr_sec_enrollment.head()

Total No of rows :  204
Total No of columns :  53


Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,,,,,,,,,,...,1.02,,,,,,,,,
1,Afghanistan,0.167,0.161,0.161,0.169,0.167,0.174,0.181,0.192,0.199,...,0.669,0.655,0.654,0.642,0.646,0.642,0.636,,,
2,Angola,,0.64,0.657,,,,,,,...,,,,,,,,,,
3,Albania,,,,,,,0.923,,0.925,...,0.99,0.982,0.977,0.982,0.994,1.0,1.02,1.02,1.02,
4,Andorra,,,,,,1.15,,,,...,,,,,,,,,,


### Gender Ratio of Number of Years in School for 188 countries
>The data gives the mean years women spent in school as a percentage of the mean years men spent in school, across primary, secondary and tertiary education. It is collected for people aged 25 to 34.  <br />
>Source : <a href = "http://www.healthmetricsandevaluation.org/">Institute for Health Metrics and Evaluation (IHME), University of Washington</a><br />
>No of countries : 188


In [457]:
df_gend_ratio_pr_sec_ter_yrs = pd.read_csv("./Data/mean_years_in_school_women_percent_men_25_to_34_years.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_ter_yrs.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_ter_yrs.shape[1])
df_gend_ratio_pr_sec_ter_yrs.head()

Total No of rows :  188
Total No of columns :  47


Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Afghanistan,15.4,15.8,15.4,15.6,15.9,16.1,16.4,16.6,16.2,...,21.5,21.9,22.2,22.3,22.6,22.9,23.1,23.4,23.5,23.7
1,Angola,51.3,51.4,51.9,52.3,52.8,53.2,53.4,53.8,54.3,...,68.5,68.9,69.5,70.1,70.5,71.2,71.7,72.2,72.9,73.3
2,Albania,87.4,87.9,88.3,88.9,89.2,89.7,90.2,90.6,91.0,...,100.0,101.0,101.0,101.0,101.0,102.0,102.0,102.0,102.0,103.0
3,Andorra,97.0,97.4,97.8,98.1,98.4,98.8,99.1,99.5,99.8,...,105.0,105.0,105.0,105.0,105.0,105.0,106.0,106.0,106.0,106.0
4,United Arab Emirates,90.9,91.4,92.0,92.4,93.0,93.6,94.0,94.5,95.2,...,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,106.0,106.0


###### Some percentage data points are > 100. At first I concluded that these must be incorrect and needed cleaning. For this ratio, however, a value above 100 simply means that women averaged more years in school than men. On digging into the World Bank website, I also found a similar note on percentages above 100, and I quote - "There are many reasons why the primary completion rate can exceed 100 percent. The numerator may include late entrants and overage children who have repeated one or more grades of primary education as well as children who entered school early, while the denominator is the number of children at the entrance age for the last grade of primary education." Hence, this isn't incorrect data and needs no correction.
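
Before deciding whether values above 100 need cleaning, it helps to see how many there are and where they occur. The sketch below does this on a small made-up frame in the same wide layout (one row per country, one column per year); the frame `sample` and its numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Made-up sample in the same wide layout (one row per country, one column per year)
sample = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2014": [23.5, 102.0, 106.0],
    "2015": [23.7, 103.0, 99.0],
})

# Compare only the numeric year columns against 100
year_cols = sample.columns.drop("country")
above_100 = sample[year_cols] > 100

# Number of values above 100 per year, and the countries with at least one such value
counts_per_year = above_100.sum()
countries_above = sample.loc[above_100.any(axis=1), "country"].tolist()
print(counts_per_year)
print(countries_above)
```

The same two lines (`> 100` followed by `sum()` or `any(axis=1)`) should carry over to `df_gend_ratio_pr_sec_ter_yrs` once the `country` column is excluded.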

### Government Expenditure on Education for 195 countries
>Data on education expenditure are received from country governments responding to the annual UIS survey on formal education or to the UNESCO-OECD-Eurostat (UOE) data collection. <br />
>Source : <a href = "http://data.uis.unesco.org/">UNESCO Institute for Statistics</a> <br />
> No of countries : 195

In [458]:
df_govt_gdp_on_edu = pd.read_csv("./Data/NATMON_DS_05102022024513338.csv")
print("Total No of rows : ", df_govt_gdp_on_edu.shape[0])
print("Total No of columns : ", df_govt_gdp_on_edu.shape[1])
df_govt_gdp_on_edu.head()

Total No of rows :  2901
Total No of columns :  9


Unnamed: 0,NATMON_IND,Indicator,LOCATION,Country,TIME,Time,Value,Flag Codes,Flags
0,XGDP_02_FSGOV,Government expenditure on pre-primary educatio...,VAT,Holy See,2016,2016,,a,Category not applicable
1,XGDP_02_FSGOV,Government expenditure on pre-primary educatio...,VAT,Holy See,2017,2017,,a,Category not applicable
2,XGDP_02_FSGOV,Government expenditure on pre-primary educatio...,VAT,Holy See,2018,2018,,a,Category not applicable
3,XGDP_02_FSGOV,Government expenditure on pre-primary educatio...,VAT,Holy See,2019,2019,,a,Category not applicable
4,XGDP_02_FSGOV,Government expenditure on pre-primary educatio...,VAT,Holy See,2020,2020,,a,Category not applicable


###### This data will be cleaned to keep only the necessary columns, and its structure will be reshaped so that it has the same rows and columns as the other dataframes.

### Employment Rate (15+ aged)
>The data consists of the percentage of all people aged 15+ who were employed that year. <br />
>Source : <a href = "https://ilostat.ilo.org/data/#">International Labour Organization</a><br />
>No of countries : 189 

In [459]:
df_emp = pd.read_csv("./Data/aged_15plus_employment_rate_percent.csv")
print("Total No of rows : ", df_emp.shape[0])
print("Total No of columns : ", df_emp.shape[1])
df_emp.head()

Total No of rows :  189
Total No of columns :  31


Unnamed: 0,country,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,42.5,42.5,42.5,42.5,42.4,42.4,42.3,42.2,42.2,...,42.3,42.4,42.5,42.7,42.9,43.0,43.2,43.4,43.5,41.5
1,Angola,75.0,75.0,75.2,75.1,74.9,74.9,74.8,74.7,74.6,...,71.7,71.8,71.8,71.9,71.9,72.0,72.1,72.1,72.1,69.6
2,Albania,57.8,58.2,56.8,55.7,54.1,53.3,54.5,53.8,52.7,...,52.0,49.4,44.7,43.7,46.0,47.9,49.3,52.0,53.4,52.7
3,United Arab Emirates,71.8,72.2,72.9,73.4,73.8,73.3,73.1,73.3,73.7,...,81.7,81.5,81.3,81.3,81.6,81.2,80.3,80.3,80.2,76.9
4,Argentina,57.3,56.9,54.9,54.0,49.5,50.7,52.5,54.1,53.1,...,56.3,56.1,56.0,55.4,55.5,55.5,55.5,55.7,55.5,49.4


### Income per person data
>The data consists of GDP per capita, which is calculated as the total amount (international dollars, fixed to 2017 prices) divided by the total population of the country. The data is adjusted for inflation and differences in the cost of living between countries, known as PPP dollars. <br />
>Source : <a href = "https://www.gapminder.org/data/documentation/gd001/">Gapminder based on World Bank</a> <br />
>No of countries : 210

In [460]:
df_income_per_person = pd.read_csv("./Data/gdppercapita_us_inflation_adjusted.csv")
print("Total No of rows : ", df_income_per_person.shape[0])
print("Total No of columns : ", df_income_per_person.shape[1])
df_income_per_person.head()

Total No of rows :  210
Total No of columns :  62


Unnamed: 0,country,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,,,,,,,,,,...,26.6k,26.2k,27.1k,27k,28.4k,28.8k,29.3k,,,
1,Afghanistan,,,,,,,,,,...,512,558,569,565,556,553,553,547,555,530
2,Angola,,,,,,,,,,...,3980,4170,4220,4270,4170,3920,3790,3600,3460,3170
3,Albania,,,,,,,,,,...,3680,3740,3780,3860,3950,4090,4250,4430,4540,4390
4,Andorra,,,,,,,,,,...,35k,33.8k,33.2k,34.7k,35.8k,37.4k,37.7k,38.3k,39k,34.3k


### Life Expectancy data
>The data gives information on the average number of years a newborn child in that particular country lives. <br />
>Source : <a href = "https://www.gapminder.org/data/documentation/gd004/">Gapminder based on World Bank</a> <br />
>No of countries : 195

In [461]:
df_life_exp = pd.read_csv("./Data/life_expectancy_years.csv")
print("Total No of rows : ", df_life_exp.shape[0])
print("Total No of columns : ", df_life_exp.shape[1])
df_life_exp.head()

Total No of rows :  195
Total No of columns :  302


Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,28.2,28.2,28.2,28.2,28.2,28.2,28.1,28.1,28.1,...,75.5,75.7,75.8,76.0,76.1,76.2,76.4,76.5,76.6,76.8
1,Angola,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,...,78.8,79.0,79.1,79.2,79.3,79.5,79.6,79.7,79.9,80.0
2,Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,87.4,87.5,87.6,87.7,87.8,87.9,88.0,88.2,88.3,88.4
3,Andorra,,,,,,,,,,...,,,,,,,,,,
4,United Arab Emirates,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,...,82.4,82.5,82.6,82.7,82.8,82.9,83.0,83.1,83.2,83.3


### Data Cleaning 

First, let's look at the datatypes of all the dataframes and convert them to proper datatypes where needed.
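
As a complement to reading each `info()` listing by eye, the dtype check can also be automated. This is a minimal sketch on made-up stand-in frames; the helper name `non_numeric_year_columns` and the dictionary `frames` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up stand-ins for the real dataframes; "income" mimics the "k" suffix issue
frames = {
    "employment": pd.DataFrame({"country": ["A", "B"], "2019": [42.5, 75.0]}),
    "income": pd.DataFrame({"country": ["A", "B"], "2019": ["26.6k", "512"]}),
}

def non_numeric_year_columns(df, id_col="country"):
    """Return the names of columns (other than id_col) that are not numeric."""
    return [
        col for col in df.columns
        if col != id_col and not pd.api.types.is_numeric_dtype(df[col])
    ]

# Map each dataframe name to the list of columns that still need conversion
report = {name: non_numeric_year_columns(df) for name, df in frames.items()}
print(report)
```

An empty list means every year column is already numeric; a non-empty list points at the columns to convert.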

In [462]:
print(df_gend_ratio_pr_sec_enrollment.info())
df_gend_ratio_pr_sec_enrollment.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 53 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  204 non-null    object 
 1   1970     34 non-null     float64
 2   1971     116 non-null    float64
 3   1972     111 non-null    float64
 4   1973     108 non-null    float64
 5   1974     102 non-null    float64
 6   1975     98 non-null     float64
 7   1976     105 non-null    float64
 8   1977     107 non-null    float64
 9   1978     101 non-null    float64
 10  1979     99 non-null     float64
 11  1980     95 non-null     float64
 12  1981     105 non-null    float64
 13  1982     99 non-null     float64
 14  1983     100 non-null    float64
 15  1984     105 non-null    float64
 16  1985     101 non-null    float64
 17  1986     109 non-null    float64
 18  1987     101 non-null    float64
 19  1988     101 non-null    float64
 20  1989     100 non-null    float64
 21  1990     99 non-

Unnamed: 0,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
count,34.0,116.0,111.0,108.0,102.0,98.0,105.0,107.0,101.0,99.0,...,136.0,123.0,132.0,133.0,129.0,127.0,123.0,115.0,56.0,3.0
mean,0.846285,0.774966,0.822279,0.837435,0.833902,0.863673,0.852486,0.858607,0.86295,0.876051,...,0.981993,0.990081,0.986591,0.985662,0.988039,0.990929,0.99326,1.00267,1.010268,1.024333
std,0.268854,0.236902,0.241692,0.23128,0.230254,0.222574,0.216638,0.211052,0.203725,0.196936,...,0.069261,0.066218,0.069373,0.070197,0.065811,0.060272,0.055941,0.043322,0.043691,0.045567
min,0.0527,0.161,0.15,0.169,0.167,0.174,0.181,0.192,0.199,0.436,...,0.669,0.655,0.654,0.642,0.646,0.642,0.636,0.728,0.89,0.973
25%,0.785,0.579,0.6245,0.655,0.657,0.6735,0.667,0.693,0.707,0.713,...,0.9745,0.977,0.97975,0.979,0.982,0.983,0.9855,0.988,0.98975,1.0065
50%,0.924,0.857,0.92,0.937,0.926,0.939,0.932,0.936,0.929,0.949,...,0.9955,0.998,1.0,1.0,1.0,1.0,1.0,1.01,1.01,1.04
75%,1.00725,0.98425,0.9945,1.01,1.0,1.01,1.01,1.01,1.01,1.01,...,1.02,1.02,1.02,1.02,1.02,1.015,1.01,1.02,1.0225,1.05
max,1.45,1.11,1.44,1.42,1.18,1.41,1.4,1.42,1.44,1.43,...,1.13,1.15,1.15,1.13,1.1,1.12,1.12,1.14,1.15,1.06


All year columns hold the gender ratio for primary and secondary school enrollment as floats for all countries. So, it's good to go.

In [463]:
print(df_gend_ratio_pr_sec_ter_yrs.info())
df_gend_ratio_pr_sec_ter_yrs.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 47 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  188 non-null    object 
 1   1970     188 non-null    float64
 2   1971     188 non-null    float64
 3   1972     188 non-null    float64
 4   1973     188 non-null    float64
 5   1974     188 non-null    float64
 6   1975     188 non-null    float64
 7   1976     188 non-null    float64
 8   1977     188 non-null    float64
 9   1978     188 non-null    float64
 10  1979     188 non-null    float64
 11  1980     188 non-null    float64
 12  1981     188 non-null    float64
 13  1982     188 non-null    float64
 14  1983     188 non-null    float64
 15  1984     188 non-null    float64
 16  1985     188 non-null    float64
 17  1986     188 non-null    float64
 18  1987     188 non-null    float64
 19  1988     188 non-null    float64
 20  1989     188 non-null    float64
 21  1990     188 non

Unnamed: 0,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
count,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0,...,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0,188.0
mean,74.87234,75.308511,75.735106,76.148404,76.589894,77.01383,77.428191,77.861702,78.284574,78.7,...,89.390426,89.669149,89.97234,90.304255,90.625532,90.894681,91.191489,91.460106,91.720213,91.962766
std,23.531585,23.535147,23.561424,23.535533,23.561574,23.56049,23.545932,23.534903,23.514179,23.467923,...,21.011729,20.870095,20.741023,20.640666,20.540589,20.402434,20.289776,20.143014,20.022101,19.865456
min,11.2,11.3,11.4,11.9,12.0,12.0,12.4,12.4,12.8,13.1,...,21.5,21.9,22.2,22.3,22.6,22.9,23.1,23.4,23.5,23.7
25%,54.7,55.325,55.975,56.5,57.0,57.675,58.3,58.85,59.4,59.95,...,78.45,78.925,79.475,79.975,80.525,81.05,81.475,81.925,82.425,82.925
50%,85.55,86.0,86.35,86.75,87.15,87.55,87.8,88.2,88.65,89.05,...,99.35,99.6,99.85,100.0,100.0,100.5,101.0,101.0,101.0,101.0
75%,93.9,94.3,94.725,95.1,95.4,95.9,96.225,96.6,96.825,97.3,...,104.0,104.0,104.0,104.0,104.25,105.0,105.0,105.0,105.0,105.0
max,129.0,129.0,129.0,129.0,129.0,129.0,130.0,130.0,129.0,130.0,...,126.0,127.0,126.0,126.0,126.0,126.0,126.0,126.0,126.0,126.0


All year columns hold the gender ratio for years attended in primary, secondary and tertiary education as floats for all countries. So, it's good to go.

In [464]:
print(df_govt_gdp_on_edu.info())
df_govt_gdp_on_edu.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2901 entries, 0 to 2900
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   NATMON_IND  2901 non-null   object 
 1   Indicator   2901 non-null   object 
 2   LOCATION    2901 non-null   object 
 3   Country     2901 non-null   object 
 4   TIME        2901 non-null   int64  
 5   Time        2901 non-null   int64  
 6   Value       2608 non-null   float64
 7   Flag Codes  315 non-null    object 
 8   Flags       315 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 204.1+ KB
None


Unnamed: 0,TIME,Time,Value
count,2901.0,2901.0,2608.0
mean,2017.81696,2017.81696,1.070092
std,1.432842,1.432842,5.215725
min,2016.0,2016.0,0.0052
25%,2017.0,2017.0,0.4066
50%,2018.0,2018.0,0.8076
75%,2019.0,2019.0,1.288763
max,2021.0,2021.0,249.22712


According to the UIS (source) website, there are 4 possible flags shown below :

\+	National Estimation <br>
a	Category not applicable <br>
n	Magnitude nil or negligible <br>
‡	UIS Estimation <br>

The last two columns (Flag Codes and Flags) relate to this information. We will be replacing the value according to these flags and later remove those columns. 

Rows whose values are national estimations or UIS estimations can be used directly without any manipulation. 'Category not applicable' only tells us that data for that year could not be captured, for example because the country does not have the education category in question. The code 'n', however, means nil or negligible magnitude, so all records flagged 'n' will have their value set to 0.0.

In [465]:
df_govt_gdp_on_edu.loc[df_govt_gdp_on_edu.Flags == 'Magnitude nil or negligible', "Value"] = 0.0
df_govt_gdp_on_edu.loc[df_govt_gdp_on_edu.Flags == 'Magnitude nil or negligible', "Value"].value_counts()

0.0    96
Name: Value, dtype: int64

The NATMON_IND and Indicator fields hold the indicator code and its description respectively. We will store a mapping between them separately and remove the Indicator column.

In [466]:
indicator_map = {}
for ind in df_govt_gdp_on_edu.NATMON_IND.unique():
    value = df_govt_gdp_on_edu.loc[df_govt_gdp_on_edu.NATMON_IND == ind, ["Indicator"]].Indicator.unique()[0]
    indicator_map[ind] = value.replace("Government expenditure on ", "").replace(" as a percentage of GDP (%)", "")
    
indicator_map

{'XGDP_02_FSGOV': 'pre-primary education',
 'XGDP_1_FSGOV': 'primary education',
 'XGDP_2T3_FSGOV': 'secondary education',
 'XGDP_5T8_FSGOV': 'tertiary education',
 'XGDP_2T4_V_FSGOV': 'secondary and post-secondary non-tertiary vocational education',
 'XGDP_4_FSGOV': 'post-secondary non-tertiary education',
 'XGDP_3_FSGOV': 'upper secondary education',
 'XGDP_2_FSGOV': 'lower secondary education'}
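
The same code-to-description mapping can also be built without an explicit loop. The sketch below is an alternative, not the approach used in this notebook; it runs on made-up rows, and the names `raw`, `pairs` and `code_to_label` are hypothetical.

```python
import pandas as pd

# Made-up rows with the same two columns as the raw UIS extract
raw = pd.DataFrame({
    "NATMON_IND": ["XGDP_1_FSGOV", "XGDP_1_FSGOV", "XGDP_5T8_FSGOV"],
    "Indicator": [
        "Government expenditure on primary education as a percentage of GDP (%)",
        "Government expenditure on primary education as a percentage of GDP (%)",
        "Government expenditure on tertiary education as a percentage of GDP (%)",
    ],
})

# Keep one row per indicator code
pairs = raw[["NATMON_IND", "Indicator"]].drop_duplicates("NATMON_IND")

# Strip the fixed prefix and suffix as literal text (regex=False)
labels = (
    pairs["Indicator"]
    .str.replace("Government expenditure on ", "", regex=False)
    .str.replace(" as a percentage of GDP (%)", "", regex=False)
)

code_to_label = dict(zip(pairs["NATMON_IND"], labels))
print(code_to_label)
```

Passing `regex=False` matters here because the suffix contains parentheses, which would otherwise be read as regex groups.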

We will remove the following columns from the dataset, as they won't be useful for further analysis for the reasons given:
1. TIME - It is a duplicate of Time, so we drop one of them
2. LOCATION - It is a code for the Country field. The Country value is common across all the other datasets for any comparison, so this code column will not be useful
3. Flag Codes - We have already adjusted the Value column based on it, hence it's not required
4. Flags - Same reason as Flag Codes
5. Indicator - We will keep the code column and remove the description

In [467]:
df_govt_gdp_on_edu.drop(["TIME", "LOCATION", "Flag Codes", "Flags", "Indicator"], axis = 1, inplace = True)
df_govt_gdp_on_edu.rename(columns = {"NATMON_IND": "EduSysIndicator"},inplace = True)

In [468]:
df_govt_gdp_on_edu = df_govt_gdp_on_edu.pivot(index=['EduSysIndicator', 'Country'], columns='Time', values='Value')
df_govt_gdp_on_edu
# df_govt_gdp_on_edu.loc[("XGDP_1_FSGOV", slice(None)), :].reset_index().drop("EduSysIndicator", axis =1) # to get indicator-specific data
# df_govt_gdp_on_edu.loc[(slice(None),"United States of America"), :].reset_index().drop("Country", axis =1)  # to get country-specific data
# .replace(indicator_map)  # to swap indicator codes for their descriptions

Unnamed: 0_level_0,Time,2016,2017,2018,2019,2020,2021
EduSysIndicator,Country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
XGDP_02_FSGOV,Afghanistan,0.00000,0.00000,,,,
XGDP_02_FSGOV,Andorra,0.41140,0.38322,0.38590,0.37259,,
XGDP_02_FSGOV,Argentina,0.47815,0.47920,0.54614,0.49245,0.51280,
XGDP_02_FSGOV,Armenia,0.32860,0.32483,,,0.32909,0.37479
XGDP_02_FSGOV,Aruba,0.31257,,,,,
...,...,...,...,...,...,...,...
XGDP_5T8_FSGOV,United States of America,1.21170,1.46188,1.27900,1.35691,,
XGDP_5T8_FSGOV,Uruguay,1.10021,1.09746,1.14138,1.08645,1.20701,
XGDP_5T8_FSGOV,Uzbekistan,,,0.66318,,,
XGDP_5T8_FSGOV,Vanuatu,,,,0.16051,0.29341,


All year columns hold the % of GDP spent on education as floats for all countries. So, it's good to go.
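
Because the pivoted frame has a two-level row index (EduSysIndicator, Country), rows for a single indicator or a single country can be pulled out with `DataFrame.xs`. The sketch below illustrates this on made-up numbers; `long_df`, `wide`, `primary` and `country_a` are hypothetical names, and the values are not from the UIS data.

```python
import pandas as pd

# Made-up long-format rows shaped like the cleaned expenditure data
long_df = pd.DataFrame({
    "EduSysIndicator": ["XGDP_1_FSGOV", "XGDP_1_FSGOV", "XGDP_5T8_FSGOV", "XGDP_5T8_FSGOV"],
    "Country": ["A", "A", "A", "B"],
    "Time": [2016, 2017, 2016, 2016],
    "Value": [1.5, 1.6, 0.9, 1.2],
})

# Same reshaping as in the notebook: one row per (indicator, country), one column per year
wide = long_df.pivot(index=["EduSysIndicator", "Country"], columns="Time", values="Value")

# All countries for one indicator (the indicator level is dropped from the result)
primary = wide.xs("XGDP_1_FSGOV", level="EduSysIndicator")

# All indicators for one country (the country level is dropped from the result)
country_a = wide.xs("A", level="Country")

print(primary)
print(country_a)
```

Combinations that have no row in the long data, such as tertiary spending for country A in 2017 here, show up as NaN after the pivot.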

In [469]:
print(df_emp.info())
df_emp.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 31 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  189 non-null    object 
 1   1991     189 non-null    float64
 2   1992     189 non-null    float64
 3   1993     189 non-null    float64
 4   1994     189 non-null    float64
 5   1995     189 non-null    float64
 6   1996     189 non-null    float64
 7   1997     189 non-null    float64
 8   1998     189 non-null    float64
 9   1999     189 non-null    float64
 10  2000     189 non-null    float64
 11  2001     189 non-null    float64
 12  2002     189 non-null    float64
 13  2003     189 non-null    float64
 14  2004     189 non-null    float64
 15  2005     189 non-null    float64
 16  2006     189 non-null    float64
 17  2007     189 non-null    float64
 18  2008     189 non-null    float64
 19  2009     189 non-null    float64
 20  2010     189 non-null    float64
 21  2011     189 non

Unnamed: 0,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
count,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0,...,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0,189.0
mean,58.230159,58.121164,57.801587,57.757143,57.679365,57.511111,57.517989,57.473016,57.360317,57.343386,...,57.262434,57.297354,57.275132,57.374074,57.521164,57.571429,57.766138,57.940212,58.055556,55.587302
std,11.896678,11.89682,11.94695,11.912027,11.964471,11.964199,11.89607,11.874309,11.832428,11.838322,...,11.916831,11.924699,11.847428,11.787706,11.686243,11.649396,11.585708,11.586904,11.467977,11.255951
min,33.0,32.9,33.1,32.5,30.7,32.1,33.2,32.4,31.4,30.6,...,33.1,32.9,32.4,31.9,32.3,32.6,32.7,32.1,32.8,30.9
25%,51.1,50.4,49.9,50.1,49.5,49.5,49.4,49.6,48.8,49.1,...,50.0,50.1,49.7,49.7,49.8,49.5,50.1,50.5,50.4,48.6
50%,57.2,56.9,56.8,57.1,56.5,56.4,56.5,56.5,56.3,56.5,...,57.2,57.4,57.3,57.2,57.8,57.6,58.0,58.4,58.3,55.6
75%,64.8,64.1,64.6,64.7,64.8,64.7,64.6,64.5,64.7,64.6,...,64.2,64.3,63.8,64.1,64.3,64.3,64.4,65.1,65.1,62.1
max,90.5,89.8,88.7,87.8,86.5,86.4,86.1,85.8,85.4,85.0,...,86.9,87.8,87.0,87.0,87.8,87.2,86.7,86.6,86.7,83.3


All year columns hold the employment % for people aged 15+ as floats for all countries. So, it's good to go.

In [470]:
print(df_income_per_person.info())
df_income_per_person.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 62 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   country  210 non-null    object
 1   1960     87 non-null     object
 2   1961     91 non-null     object
 3   1962     91 non-null     object
 4   1963     91 non-null     object
 5   1964     91 non-null     object
 6   1965     95 non-null     object
 7   1966     98 non-null     object
 8   1967     99 non-null     object
 9   1968     101 non-null    object
 10  1969     101 non-null    object
 11  1970     111 non-null    object
 12  1971     111 non-null    object
 13  1972     111 non-null    object
 14  1973     111 non-null    object
 15  1974     113 non-null    object
 16  1975     115 non-null    object
 17  1976     117 non-null    object
 18  1977     123 non-null    object
 19  1978     123 non-null    object
 20  1979     124 non-null    object
 21  1980     136 non-null    object
 22  19

Unnamed: 0,country,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
count,210,87,91,91,91,91,95,98,99,101,...,205,204,206,206,209,205,205,204,202,195
unique,210,82,82,84,89,84,91,90,97,99,...,191,192,190,193,194,195,190,187,189,183
top,Aruba,1160,1200,1170,1140,1570,14.5k,1540,4110,4960,...,14.1k,11.5k,11.7k,1340,35.8k,15.8k,13.6k,38.3k,11.4k,4050
freq,1,2,4,3,2,2,2,3,2,2,...,3,3,2,3,3,2,4,3,3,3


The income per person data has all the expected float/integer fields stored as objects (strings). This is due to the presence of "k" in the numbers, denoting thousands. So a value of 29.3k for Aruba in 2017 means 29,300. <br/>
Let's convert the strings to their true numeric values and run the info and describe methods again to ensure the columns have the correct datatypes.

In [471]:
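for column in df_income_per_person.columns[1:]:
    # Flag the values written with a "k" suffix (thousands), e.g. "26.6k"
    is_thousand = df_income_per_person[column].str.contains('k', na = False)
    # Strip the trailing "k" and convert the remaining text to numbers
    # (assigning the result back avoids relying on inplace replace on a column selection)
    df_income_per_person[column] = pd.to_numeric(df_income_per_person[column].replace("k$", "", regex = True))
    # Scale the flagged values back up to their true magnitude
    df_income_per_person[column] = np.where(is_thousand, df_income_per_person[column] * 1000, df_income_per_person[column])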

print(df_income_per_person.info())    
df_income_per_person.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 62 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  210 non-null    object 
 1   1960     87 non-null     float64
 2   1961     91 non-null     float64
 3   1962     91 non-null     float64
 4   1963     91 non-null     float64
 5   1964     91 non-null     float64
 6   1965     95 non-null     float64
 7   1966     98 non-null     float64
 8   1967     99 non-null     float64
 9   1968     101 non-null    float64
 10  1969     101 non-null    float64
 11  1970     111 non-null    float64
 12  1971     111 non-null    float64
 13  1972     111 non-null    float64
 14  1973     111 non-null    float64
 15  1974     113 non-null    float64
 16  1975     115 non-null    float64
 17  1976     117 non-null    float64
 18  1977     123 non-null    float64
 19  1978     123 non-null    float64
 20  1979     124 non-null    float64
 21  1980     136 non

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
count,87.0,91.0,91.0,91.0,91.0,95.0,98.0,99.0,101.0,101.0,...,205.0,204.0,206.0,206.0,209.0,205.0,205.0,204.0,202.0,195.0
mean,4612.390805,4625.912088,4755.043956,4894.846154,5153.417582,5193.757895,5276.938776,5444.919192,5711.950495,6001.821782,...,15152.302439,15255.112745,15296.980583,15515.296117,16643.210526,16001.765854,16241.204878,16445.264706,16670.455446,14731.615385
std,6739.589804,6849.582546,7013.03381,7148.905823,7623.04719,7714.854185,8040.309538,8499.493232,8687.57234,9058.009338,...,21812.83481,21757.362983,22216.919841,22564.684704,24944.157812,23080.08781,23233.41283,23805.179498,24361.427576,21901.503397
min,179.0,145.0,145.0,155.0,156.0,156.0,156.0,144.0,145.0,152.0,...,316.0,319.0,325.0,328.0,306.0,294.0,286.0,282.0,278.0,271.0
25%,791.0,780.0,799.0,815.5,828.5,920.0,844.25,819.5,844.0,873.0,...,2170.0,2187.5,2215.0,2210.0,2140.0,2190.0,2290.0,2335.0,2442.5,2265.0
50%,1730.0,1660.0,1620.0,1790.0,1850.0,1840.0,1835.0,1770.0,1800.0,1840.0,...,5410.0,5690.0,5950.0,6090.0,6180.0,6020.0,6190.0,6230.0,6435.0,5660.0
75%,4140.0,4040.0,4345.0,4725.0,5015.0,4990.0,4982.5,5230.0,5810.0,6230.0,...,19400.0,19050.0,18525.0,18475.0,20000.0,19700.0,19900.0,19800.0,19600.0,17250.0
max,37900.0,38800.0,39500.0,39000.0,42300.0,43300.0,48600.0,54000.0,54100.0,54900.0,...,140000.0,139000.0,151000.0,160000.0,167000.0,170000.0,163000.0,171000.0,182000.0,159000.0


The structure is now similar to the other dataframes, with country and years as columns. This is a little different due to the presence of a MultiIndex, but it can be used to select relevant data easily. All year columns hold income per person data as floats for all countries after the transformations. So, it's good to go.
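
As a cross-check of the "k" conversion, the same rule can be written as a small scalar helper and tried on a few strings. This is only a sketch; `parse_k_suffix` is a hypothetical name and is not used elsewhere in the notebook.

```python
def parse_k_suffix(text):
    """Convert strings such as "512" or "35k" to a float; a trailing "k" means thousands."""
    text = text.strip()
    if text.endswith("k"):
        return float(text[:-1]) * 1000
    return float(text)

examples = ["512", "35k", "26.6k"]
parsed = [parse_k_suffix(value) for value in examples]
print(parsed)
```

Values with a decimal part, such as "26.6k", may carry tiny floating-point rounding after the multiplication, so comparisons against expected values are safer with a tolerance.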

In [472]:
print(df_life_exp.info(verbose=True))
df_life_exp.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 302 columns):
 #    Column   Dtype  
---   ------   -----  
 0    country  object 
 1    1800     float64
 2    1801     float64
 3    1802     float64
 4    1803     float64
 5    1804     float64
 6    1805     float64
 7    1806     float64
 8    1807     float64
 9    1808     float64
 10   1809     float64
 11   1810     float64
 12   1811     float64
 13   1812     float64
 14   1813     float64
 15   1814     float64
 16   1815     float64
 17   1816     float64
 18   1817     float64
 19   1818     float64
 20   1819     float64
 21   1820     float64
 22   1821     float64
 23   1822     float64
 24   1823     float64
 25   1824     float64
 26   1825     float64
 27   1826     float64
 28   1827     float64
 29   1828     float64
 30   1829     float64
 31   1830     float64
 32   1831     float64
 33   1832     float64
 34   1833     float64
 35   1834     float64
 36   1835     float

Unnamed: 0,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
count,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0,...,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0,186.0
mean,31.503763,31.463441,31.480108,31.385484,31.460753,31.586559,31.644086,31.598387,31.385484,31.313441,...,83.361828,83.476344,83.600538,83.717742,83.838172,83.955376,84.076344,84.193548,84.312903,84.430645
std,3.80951,3.801217,3.932344,3.955872,3.928388,4.003874,4.102694,3.974506,4.08023,4.033412,...,5.803782,5.797854,5.788922,5.777904,5.770755,5.766333,5.756555,5.750616,5.743805,5.741341
min,23.4,23.4,23.4,19.6,23.4,23.4,23.4,23.4,12.5,13.4,...,66.4,66.5,66.7,66.8,66.9,67.0,67.1,67.2,67.3,67.4
25%,29.025,28.925,28.9,28.9,28.925,29.025,29.025,29.025,28.925,28.825,...,79.65,79.75,79.925,80.025,80.15,80.325,80.425,80.525,80.7,80.8
50%,31.75,31.65,31.55,31.5,31.55,31.65,31.75,31.75,31.55,31.5,...,84.0,84.1,84.25,84.3,84.5,84.6,84.7,84.8,84.9,85.0
75%,33.875,33.9,33.875,33.675,33.775,33.875,33.975,33.975,33.775,33.675,...,87.775,87.875,87.975,88.075,88.175,88.3,88.4,88.5,88.675,88.775
max,42.9,40.3,44.4,44.8,42.8,44.3,45.8,43.6,43.5,41.7,...,93.4,93.5,93.6,93.7,93.8,94.0,94.1,94.2,94.3,94.4


All year columns hold life expectancy data as floats for all countries. So, it's good to go.

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### How many children complete their primary education in the 5 countries with the lowest income? What is the distribution among boys and girls?
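
One possible first step for this question is to identify the five countries with the lowest income in a chosen year. The sketch below shows the idea with `DataFrame.nsmallest` on made-up figures; the frame `income`, the reference year and all numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up income figures in the same wide layout as df_income_per_person
income = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F", "G"],
    "2019": [555.0, 3460.0, 4540.0, 278.0, 39000.0, 410.0, 1200.0],
})

year = "2019"  # assumed reference year; the notebook does not fix one

# Rows with the five smallest values in the chosen year, lowest first
lowest_five = income.nsmallest(5, year)["country"].tolist()
print(lowest_five)
```

The resulting list of countries could then be used to filter the education dataframes with `isin`.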

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!
