<a href="https://colab.research.google.com/github/jaya-shankar/education-impact/blob/master/life_expectancy_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!rm -rf education-impact
!rm education-impact

rm: cannot remove 'education-impact': No such file or directory


In [3]:
!git clone https://github.com/jaya-shankar/education-impact.git


Cloning into 'education-impact'...
remote: Enumerating objects: 264, done.[K
remote: Counting objects: 100% (264/264), done.[K
remote: Compressing objects: 100% (223/223), done.[K
remote: Total 264 (delta 125), reused 132 (delta 38), pack-reused 0[K
Receiving objects: 100% (264/264), 1.81 MiB | 11.53 MiB/s, done.
Resolving deltas: 100% (125/125), done.


In [None]:
!pip install tensorflow_decision_forests
!pip install wurlitzer
!pip install seaborn

In [7]:
#@title Default title text
root = "education-impact/" 
datasets_path = {
                    "infant_mortality"              :  root+ "datasets/Infant_Mortality_Rate.csv",
                    "child_mortality"               :  root+ "datasets/child_mortality_0_5_year_olds_dying_per_1000_born.csv",
                    "children_per_woman"            :  root+ "datasets/children_per_woman_total_fertility.csv",
                    "co2_emissions"                 :  root+"datasets/co2_emissions_tonnes_per_person.csv",
                    "population"                    :  root+ "datasets/converted_pop.csv",
                    "food_supply"                   :  root+ "datasets/food_supply_kilocalories_per_person_and_day.csv",
                    "gdp_per_captia"                :  root+ "datasets/gdp_per_capita_yearly_growth.csv",
                    "Avg_daily_income_ppp"          :  root+ "datasets/mincpcap_cppp.csv",
                    "gini_index"                    :  root+ "datasets/gini.csv",
                    "life_expectancy"               :  root+ "datasets/life_expectancy_years.csv",
                    "malnutrition"                  :  root+ "datasets/malnutrition_weight_for_age_percent_of_children_under_5.csv",
                    "poverty_index"                 :  root+ "datasets/mincpcap_cppp.csv",
                    "maternal_mortality"            :  root+ "datasets/mmr_who.csv",
                    "people_in_poverty"             :  root+ "datasets/number_of_people_in_poverty.csv",
                    "primary_completion"            :  root+ "datasets/primary_school_completion_percent_of_girls.csv",
                    "ratio_b/g_in_primary"          :  root+ "datasets/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv",
                    "wcde-25--34"                   :  root+ "datasets/wcde-25--34.csv",
                    "wcde-Incomplete_Primary"       :  root+ "datasets/wcde-Incomplete Primary.csv",
                    "wcde-Lower_Secondary"          :  root+ "datasets/wcde-Lower Secondary.csv",
                    "wcde-Post_Secondary"           :  root+ "datasets/wcde-Post Secondary.csv",
                    "wcde_female-Incomplete_Primary":  root+ "datasets/wcde_female-Incomplete Primary.csv",
                    "wcde_female-Lower_Secondary"   :  root+ "datasets/wcde_female-Lower Secondary.csv",
                    "wcde_female-Post_Secondary"    :  root+ "datasets/wcde_female-Post Secondary.csv",
                }

In [8]:
datasets = [
            "infant_mortality",
            # "life_expectancy",
            "child_mortality",
            "co2_emissions",
            "Avg_daily_income_ppp",
            "wcde-Incomplete_Primary",
            "wcde-Lower_Secondary",
            "population",
            "wcde_female-Lower_Secondary"
            ]

In [9]:
PREDICT_FUTURE  = 10
OUTPUTS         = ['life_expectancy']
                   

In [10]:
import pandas as pd
import numpy as np
import math
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import train_test_split
from wurlitzer import sys_pipes



In [44]:
# to find out how many countries each dataset has
countries_count = None
least_dataset_path   = None
for dataset in datasets:
  df = pd.read_csv(datasets_path[dataset])
  count = len(set(df.Country.unique()))
  if not countries_count:
    countries_count = count
  elif countries_count>count:
    countries_count = count
    least_dataset_path   = datasets_path[dataset]
  print(f"{'Factor: ' + dataset:<30} count: {count}")
print(f"{'To use: ' + least_dataset_path:<30} count: {countries_count}")

Factor: infant_mortality       count: 266
Factor: child_mortality        count: 197
Factor: co2_emissions          count: 194
Factor: Avg_daily_income_ppp   count: 195
Factor: wcde-Incomplete_Primary count: 202
Factor: wcde-Lower_Secondary   count: 202
Factor: population             count: 197
Factor: wcde_female-Lower_Secondary count: 202
To use: education-impact/datasets/co2_emissions_tonnes_per_person.csv count: 194


In [46]:
# creating a list of all countries & years
countries = list(pd.read_csv(least_dataset_path).Country.unique())
years     = [y for y in range(1960,2015-PREDICT_FUTURE+1)]

In [47]:
keys=[]
for y in years:
  for c in countries:
    keys.append((c,str(y)))

In [48]:
big_dic = {k : [] for k in keys}
for dataset in datasets:
  df = pd.read_csv(datasets_path[dataset])
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][k[1]])
    except:
      big_dic[k].append(np.NaN)
 

In [49]:
for output_path in OUTPUTS:
  df = pd.read_csv(datasets_path[output_path])
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][str(int(k[1])+PREDICT_FUTURE)])
    except:
      big_dic[k].append(np.NaN)

In [50]:
columns = [k for k in datasets ]
output_columns = ["o_"+o for o in OUTPUTS]
columns.extend(output_columns)

In [51]:
input_df = pd.DataFrame.from_dict(big_dic,orient='index', columns = columns)
output_df = input_df[["o_"+o for o in OUTPUTS]]
input_df.drop(labels=["o_"+o for o in OUTPUTS], axis = 1, inplace=True)

In [52]:
X_train, X_test, y_train, y_test = train_test_split(input_df, output_df, test_size=0.30, random_state=43)

In [53]:
X_train.isna().sum()

infant_mortality               1259
child_mortality                   0
co2_emissions                   316
Avg_daily_income_ppp             43
wcde-Incomplete_Primary        1165
wcde-Lower_Secondary           1165
population                        0
wcde_female-Lower_Secondary    1165
dtype: int64

In [54]:
y_train.isna().sum()

o_life_expectancy    43
dtype: int64

In [55]:
y_train

Unnamed: 0,o_life_expectancy
"(Burundi, 1975)",49.3
"(Bosnia and Herzegovina, 2001)",76.4
"(Philippines, 1973)",66.3
"(Algeria, 1988)",70.2
"(Mauritius, 1973)",68.8
...,...
"(Serbia, 2003)",74.9
"(Namibia, 1970)",58.9
"(Chile, 2001)",79.0
"(Thailand, 1971)",66.9


In [56]:
combine_dfs("o_"+OUTPUTS[0],X_test,y_test)

Unnamed: 0,infant_mortality,child_mortality,co2_emissions,Avg_daily_income_ppp,wcde-Incomplete_Primary,wcde-Lower_Secondary,population,wcde_female-Lower_Secondary,o_life_expectancy
"(Albania, 2005)",17.8,19.20,1.380,7.70,3.70,45.00,3090000,45.90,78.1
"(Bangladesh, 2002)",57.3,78.20,0.242,3.05,41.88,44.36,132000000,45.94,72.3
"(Vietnam, 1971)",53.9,80.80,0.551,1.60,,,44500000,,66.6
"(Malaysia, 1977)",30.5,37.30,1.770,9.50,22.30,53.04,12800000,51.20,71.0
"(Malaysia, 2000)",8.7,10.20,5.460,15.60,7.80,26.20,23200000,22.80,74.4
...,...,...,...,...,...,...,...,...,...
"(Iceland, 1995)",4.2,5.21,9.220,34.70,0.00,54.80,268000,55.10,81.6
"(Guatemala, 1989)",61.6,83.80,0.463,5.55,56.08,36.06,9050000,31.88,66.4
"(Micronesia, Fed. Sts., 2002)",39.8,51.00,1.340,7.33,,,107000,,64.2
"(Haiti, 1986)",110.4,161.00,0.128,3.68,73.20,23.98,6480000,20.00,55.7


In [57]:
def combine_dfs(label,X,y):
  frames      = [X,y[label]]
  le_model_df = pd.concat(frames,axis=1)
  le_model_df.dropna(subset=[label],inplace=True)
  return le_model_df

In [58]:
for output in OUTPUTS:
  train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output,X_train,y_train), label="o_"+output, task=tfdf.keras.Task.REGRESSION)
  model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
  with sys_pipes():
    model.fit(x=train_ds)
  # Convert it to a TensorFlow dataset
  test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output,X_train,y_train), label="o_"+output, task=tfdf.keras.Task.REGRESSION)

  # Evaluate the model
  model.compile(metrics=["mse"])
  # Evaluate the model on the test dataset.
  evaluation = model.evaluate(test_ds, return_dict=True)
  print(output.upper())
  print(evaluation)
  print()
  print(f"MSE: {evaluation['mse']}")
  print(f"RMSE: {math.sqrt(evaluation['mse'])}")
  print()



[INFO kernel.cc:736] Start Yggdrasil model training
[INFO kernel.cc:737] Collect training examples
[INFO kernel.cc:392] Number of batches: 97
[INFO kernel.cc:393] Number of examples: 6203
[INFO kernel.cc:759] Dataset:
Number of records: 6203
Number of columns: 9

Number of columns by type:
	NUMERICAL: 9 (100%)

Columns:

NUMERICAL: 9 (100%)
	0: "Avg_daily_income_ppp" NUMERICAL mean:12.3176 min:0.223 max:263 sd:18.3475
	1: "child_mortality" NUMERICAL mean:96.2245 min:3.2 max:423 sd:86.2446
	2: "co2_emissions" NUMERICAL num-nas:289 (4.65904%) mean:4.33371 min:0 max:99.5 sd:7.33579
	3: "infant_mortality" NUMERICAL num-nas:1216 (19.6034%) mean:60.2604 min:2.5 max:228.4 sd:47.9376
	4: "population" NUMERICAL mean:2.53245e+07 min:4630 max:1.33e+09 sd:1.02028e+08
	5: "wcde-Incomplete_Primary" NUMERICAL num-nas:1122 (18.088%) mean:33.1418 min:0 max:99.86 sd:32.8705
	6: "wcde-Lower_Secondary" NUMERICAL num-nas:1122 (18.088%) mean:35.2819 min:0 max:92.3 sd:20.4949
	7: "wcde_female-Lower_Secondary

LIFE_EXPECTANCY
{'loss': 0.0, 'mse': 2.161480665206909}

MSE: 2.161480665206909
RMSE: 1.4701974919060736



In [59]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)

In [None]:
# %set_cell_height 300

model.summary()

# RNN Algorithm

from the above output
- **malnutrition & people in povery** have least no of countries
- **infant mortality & gdp per captia** have highest no of countries

*Doubt:* Does having more data for one factor will make the decision tree bias?


###Steps
1. create a csv file such that each row contains all values of particular year & country present
2. the output for each row is year + 40 years corresponding value 
    1. **outputs** - life expectany, education level, gdp




From above output
- if we dont drop any rows our table size = 4256 entries
- if we drop rows containing any if all of outputs missing then our table size = 3039 entries
- if we drop rows containing any one of output missing then our table size = 1745 entries

so, I think its is better to go with second choice and build different models, but not sure it will not effect performance of the model


now we have the dataframe containing both inputs and ouputs,our next step is
1. split the data into train & test data
  1. try to split data based on continents to reduce bias
2. build DF model using tensorflow
3. check the accuracy of the model