<a href="https://colab.research.google.com/github/jaya-shankar/education-impact/blob/master/life_expectancy_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!rm -rf education-impact
!rm education-impact

rm: cannot remove 'education-impact': No such file or directory


In [2]:
!git clone https://github.com/jaya-shankar/education-impact.git


Cloning into 'education-impact'...
remote: Enumerating objects: 270, done.[K
remote: Counting objects: 100% (270/270), done.[K
remote: Compressing objects: 100% (229/229), done.[K
remote: Total 270 (delta 129), reused 132 (delta 38), pack-reused 0[K
Receiving objects: 100% (270/270), 1.82 MiB | 16.49 MiB/s, done.
Resolving deltas: 100% (129/129), done.


In [None]:
!pip install tensorflow_decision_forests
!pip install wurlitzer
!pip install seaborn

In [4]:
import pandas as pd
import numpy as np
import math
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import train_test_split
from wurlitzer import sys_pipes



In [16]:
#@title Default title text
root = "education-impact/" 
datasets_path = {
                    "infant_mortality"              :  root+ "datasets/Infant_Mortality_Rate.csv",
                    "child_mortality"               :  root+ "datasets/child_mortality_0_5_year_olds_dying_per_1000_born.csv",
                    "children_per_woman"            :  root+ "datasets/children_per_woman_total_fertility.csv",
                    "co2_emissions"                 :  root+"datasets/co2_emissions_tonnes_per_person.csv",
                    "population"                    :  root+ "datasets/converted_pop.csv",
                    "food_supply"                   :  root+ "datasets/food_supply_kilocalories_per_person_and_day.csv",
                    "gdp_per_captia"                :  root+ "datasets/gdp_per_capita_yearly_growth.csv",
                    "Avg_daily_income_ppp"          :  root+ "datasets/mincpcap_cppp.csv",
                    "gini_index"                    :  root+ "datasets/gini.csv",
                    "life_expectancy"               :  root+ "datasets/life_expectancy_years.csv",
                    "malnutrition"                  :  root+ "datasets/malnutrition_weight_for_age_percent_of_children_under_5.csv",
                    "poverty_index"                 :  root+ "datasets/mincpcap_cppp.csv",
                    "maternal_mortality"            :  root+ "datasets/mmr_who.csv",
                    "people_in_poverty"             :  root+ "datasets/number_of_people_in_poverty.csv",
                    "primary_completion"            :  root+ "datasets/primary_school_completion_percent_of_girls.csv",
                    "ratio_b/g_in_primary"          :  root+ "datasets/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv",
                    "wcde-25--34"                   :  root+ "datasets/wcde-25--34.csv",
                    "wcde-Incomplete_Primary"       :  root+ "datasets/wcde-Incomplete Primary.csv",
                    "wcde-Lower_Secondary"          :  root+ "datasets/wcde-Lower Secondary.csv",
                    "wcde-Post_Secondary"           :  root+ "datasets/wcde-Post Secondary.csv",
                    "wcde_female-Incomplete_Primary":  root+ "datasets/wcde_female-Incomplete Primary.csv",
                    "wcde_female-Lower_Secondary"   :  root+ "datasets/wcde_female-Lower Secondary.csv",
                    "wcde_female-Post_Secondary"    :  root+ "datasets/wcde_female-Post Secondary.csv",
                }

In [17]:
datasets = [
            "infant_mortality",
            "life_expectancy",
            "child_mortality",
            "co2_emissions",
            "Avg_daily_income_ppp",
            "wcde-Incomplete_Primary",
            "wcde-Lower_Secondary",
            "population",
            "wcde_female-Lower_Secondary"
            ]

In [18]:
PREDICT_FUTURE  = 10
OUTPUTS         = ['life_expectancy']
                   

In [19]:
# to find out how many countries each dataset has
countries_count = None
least_dataset_path   = None
for dataset in datasets:
  df = pd.read_csv(datasets_path[dataset])
  count = len(set(df.Country.unique()))
  if not countries_count:
    countries_count = count
  elif countries_count>count:
    countries_count = count
    least_dataset_path   = datasets_path[dataset]
  print(f"{'Factor: ' + dataset:<30} count: {count}")
print(f"{'To use: ' + least_dataset_path:<30} count: {countries_count}")

Factor: infant_mortality       count: 266
Factor: life_expectancy        count: 195
Factor: child_mortality        count: 197
Factor: co2_emissions          count: 194
Factor: Avg_daily_income_ppp   count: 195
Factor: wcde-Incomplete_Primary count: 202
Factor: wcde-Lower_Secondary   count: 202
Factor: population             count: 197
Factor: wcde_female-Lower_Secondary count: 202
To use: education-impact/datasets/co2_emissions_tonnes_per_person.csv count: 194


In [30]:
common_countries = set()
for dataset in datasets:
  countries_list = list(pd.read_csv(datasets_path[dataset]).Country)
  countries_list = set(map(lambda x: x.lower(), countries_list))
  if common_countries == set():
    common_countries = countries_list
  else:
    common_countries = common_countries.intersection(countries_list)
len(common_countries)

154

In [31]:
# creating a list of all countries & years
countries = list(common_countries)
years     = [y for y in range(1960,2015-PREDICT_FUTURE+1)]

In [32]:
keys=[]
for y in years:
  for c in countries:
    keys.append((c,str(y)))

In [33]:
big_dic = {k : [] for k in keys}
for dataset in datasets:
  df = pd.read_csv(datasets_path[dataset])
  df["Country"] = df["Country"].str.lower()
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][k[1]])
    except:
      big_dic[k].append(np.NaN)
 

In [34]:
for output_path in OUTPUTS:
  df = pd.read_csv(datasets_path[output_path])
  df["Country"] = df["Country"].str.lower()
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][str(int(k[1])+PREDICT_FUTURE)])
    except:
      big_dic[k].append(np.NaN)

In [35]:
columns = [k for k in datasets ]
output_columns = ["o_"+o for o in OUTPUTS]
columns.extend(output_columns)

In [36]:
input_df = pd.DataFrame.from_dict(big_dic,orient='index', columns = columns)
output_df = input_df[["o_"+o for o in OUTPUTS]]
input_df.drop(labels=["o_"+o for o in OUTPUTS], axis = 1, inplace=True)

In [37]:
X_train, X_test, y_train, y_test = train_test_split(input_df, output_df, test_size=0.30, random_state=43)

In [38]:
X_train.isna().sum()

infant_mortality               569
life_expectancy                  0
child_mortality                  0
co2_emissions                  120
Avg_daily_income_ppp             0
wcde-Incomplete_Primary          0
wcde-Lower_Secondary             0
population                       0
wcde_female-Lower_Secondary      0
dtype: int64

In [39]:
y_train.isna().sum()

o_life_expectancy    0
dtype: int64

In [40]:
y_train

Unnamed: 0,o_life_expectancy
"(cambodia, 1999)",66.5
"(panama, 1980)",76.4
"(south africa, 1999)",55.7
"(tonga, 1963)",66.9
"(italy, 1966)",73.1
...,...
"(papua new guinea, 2000)",62.8
"(albania, 1975)",71.9
"(nepal, 1974)",54.0
"(togo, 1982)",58.1


In [43]:
combine_dfs("o_"+OUTPUTS[0],X_test,y_test)

Unnamed: 0,infant_mortality,life_expectancy,child_mortality,co2_emissions,Avg_daily_income_ppp,wcde-Incomplete_Primary,wcde-Lower_Secondary,population,wcde_female-Lower_Secondary,o_life_expectancy
"(papua new guinea, 1984)",69.3,62.2,98.0,0.5150,2.10,0.00,0.00,3980000,0.00,63.1
"(togo, 1963)",150.9,47.9,256.0,0.0652,2.80,88.34,11.50,1630000,4.00,52.8
"(belize, 1973)",62.5,68.5,85.4,1.1300,6.07,33.40,57.34,129000,57.38,71.7
"(bhutan, 1998)",63.1,65.4,86.5,0.6510,3.66,73.24,20.08,564000,14.70,70.3
"(zambia, 2002)",79.6,45.9,142.0,0.1790,2.75,36.36,42.82,11000000,40.18,58.5
...,...,...,...,...,...,...,...,...,...,...
"(myanmar, 1998)",68.0,58.1,94.0,0.1760,1.29,37.40,49.08,45600000,45.56,57.7
"(vanuatu, 1986)",35.6,63.1,44.7,0.4410,4.49,42.48,48.86,133000,46.58,63.3
"(mozambique, 1979)",177.8,50.2,264.0,0.2320,1.49,87.62,11.86,11300000,5.76,51.4
"(niger, 1965)",,42.3,314.0,0.0234,2.42,98.30,1.70,3910000,0.90,43.1


In [42]:
def combine_dfs(label,X,y):
  frames      = [X,y[label]]
  le_model_df = pd.concat(frames,axis=1)
  le_model_df.dropna(subset=[label],inplace=True)
  return le_model_df

In [44]:
for output in OUTPUTS:
  train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output,X_train,y_train), label="o_"+output, task=tfdf.keras.Task.REGRESSION)
  model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
  
  model.fit(x=train_ds)
  # Convert it to a TensorFlow dataset
  test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output,X_train,y_train), label="o_"+output, task=tfdf.keras.Task.REGRESSION)

  # Evaluate the model
  model.compile(metrics=["mse"])
  # Evaluate the model on the test dataset.
  evaluation = model.evaluate(test_ds, return_dict=True)
  print(output.upper())
  print(evaluation)
  print()
  print(f"MSE: {evaluation['mse']}")
  print(f"RMSE: {math.sqrt(evaluation['mse'])}")
  print()

LIFE_EXPECTANCY
{'loss': 0.0, 'mse': 1.5712116956710815}

MSE: 1.5712116956710815
RMSE: 1.2534798345689817



In [45]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)

In [None]:
# %set_cell_height 300

model.summary()

# RNN Algorithm

from the above output
- **malnutrition & people in povery** have least no of countries
- **infant mortality & gdp per captia** have highest no of countries

*Doubt:* Does having more data for one factor will make the decision tree bias?


###Steps
1. create a csv file such that each row contains all values of particular year & country present
2. the output for each row is year + 40 years corresponding value 
    1. **outputs** - life expectany, education level, gdp




From above output
- if we dont drop any rows our table size = 4256 entries
- if we drop rows containing any if all of outputs missing then our table size = 3039 entries
- if we drop rows containing any one of output missing then our table size = 1745 entries

so, I think its is better to go with second choice and build different models, but not sure it will not effect performance of the model


now we have the dataframe containing both inputs and ouputs,our next step is
1. split the data into train & test data
  1. try to split data based on continents to reduce bias
2. build DF model using tensorflow
3. check the accuracy of the model