<a href="https://colab.research.google.com/github/jaya-shankar/education-impact/blob/jaya-shankar/randomForest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!rm -rf education-impact
!rm education-impact

rm: cannot remove 'education-impact': No such file or directory


In [None]:
!git clone https://github.com/jaya-shankar/education-impact.git


In [None]:
!pip install tensorflow_decision_forests
!pip install wurlitzer

In [4]:
#@title Default title text
root = "education-impact/" 
datasets_path = {
                    "infant_mortality"              :  root+ "datasets/Infant_Mortality_Rate.csv",
                    "child_mortality"               :  root+ "datasets/child_mortality_0_5_year_olds_dying_per_1000_born.csv",
                    "children_per_woman"            :  root+ "datasets/children_per_woman_total_fertility.csv",
                    "co2_emissions"                 :  root+"datasets/co2_emissions_tonnes_per_person.csv",
                    "population"                    :  root+ "datasets/converted_pop.csv",
                    "food_supply"                   :  root+ "datasets/food_supply_kilocalories_per_person_and_day.csv",
                    "gdp_per_captia"                :  root+ "datasets/gdp_per_capita_yearly_growth.csv",
                    "gini_index"                    :  root+ "datasets/gini.csv",
                    "life_expectancy"               :  root+ "datasets/life_expectancy_years.csv",
                    "malnutrition"                  :  root+ "datasets/malnutrition_weight_for_age_percent_of_children_under_5.csv",
                    "poverty_index"                 :  root+ "datasets/mincpcap_cppp.csv",
                    "maternal_mortality"            :  root+ "datasets/mmr_who.csv",
                    "people_in_poverty"             :  root+ "datasets/number_of_people_in_poverty.csv",
                    "primary_completion"            :  root+ "datasets/primary_school_completion_percent_of_girls.csv",
                    "ratio_b/g_in_primary"          :  root+ "datasets/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv",
                    "wcde-25--34"                   :  root+ "datasets/wcde-25--34.csv",
                    "wcde-Lower_Secondary"          :  root+ "datasets/wcde-Lower Secondary.csv",
                }

In [52]:
datasets = [
            "infant_mortality",
            "child_mortality",
            "co2_emissions",
            "population",
            "food_supply",
            "gdp_per_captia",
            "gini_index",
            "life_expectancy",
            "poverty_index",
            "primary_completion",
            "ratio_b/g_in_primary",
            "wcde-Lower_Secondary"
            ]

In [51]:
PREDICT_FUTURE  = 10
OUTPUTS         = [
                  #  'life_expectancy', 
                  #  'gdp_per_captia', 
                   'primary_completion' 
                   ]

In [53]:
import pandas as pd
import numpy as np
import math
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import train_test_split
from wurlitzer import sys_pipes

In [54]:
# to find out how many countries each dataset has
countries_arr = []
for path in datasets_path:
  df = pd.read_csv(datasets_path[path])
  print(f"{'Factor: ' + path:<30} count: {len(set(df.Country.unique()))}")
  

Factor: infant_mortality       count: 266
Factor: child_mortality        count: 197
Factor: children_per_woman     count: 202
Factor: co2_emissions          count: 194
Factor: population             count: 197
Factor: food_supply            count: 179
Factor: gdp_per_captia         count: 221
Factor: gini_index             count: 195
Factor: life_expectancy        count: 195
Factor: malnutrition           count: 156
Factor: poverty_index          count: 195
Factor: maternal_mortality     count: 184
Factor: people_in_poverty      count: 145
Factor: primary_completion     count: 195
Factor: ratio_b/g_in_primary   count: 200
Factor: wcde-25--34            count: 202
Factor: wcde-Lower_Secondary   count: 202


from the above output
- **malnutrition & people in povery** have least no of countries
- **infant mortality & gdp per captia** have highest no of countries

*Doubt:* Does having more data for one factor will make the decision tree bias?


###Steps
1. create a csv file such that each row contains all values of particular year & country present
2. the output for each row is year + 40 years corresponding value 
    1. **outputs** - life expectany, education level, gdp




In [55]:
# creating a list of all countries & years
countries = list(pd.read_csv('education-impact/datasets/Infant_Mortality_Rate.csv').Country.unique())
years     = [y for y in range(1960,2015-PREDICT_FUTURE+1)]

In [56]:
keys=[]
for y in years:
  for c in countries:
    keys.append((c,str(y)))

In [57]:
big_dic = {k : [] for k in keys}
for dataset in datasets:
  df = pd.read_csv(datasets_path[dataset])
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][k[1]])
    except:
      big_dic[k].append(np.NaN)
 

In [58]:
for output_path in OUTPUTS:
  df = pd.read_csv(datasets_path[output_path])
  df.set_index("Country", inplace=True)
  for k in keys:
    try:
      big_dic[k].append(df.loc[k[0]][str(int(k[1])+PREDICT_FUTURE)])
    except:
      big_dic[k].append(np.NaN)

In [59]:
columns = [k for k in datasets ]
output_columns = ["o_"+o for o in OUTPUTS]
columns.extend(output_columns)

In [60]:
input_df = pd.DataFrame.from_dict(big_dic,orient='index', columns = columns)
output_df = input_df[["o_"+o for o in OUTPUTS]]
input_df.drop(labels=["o_"+o for o in OUTPUTS], axis = 1, inplace=True)

From above output
- if we dont drop any rows our table size = 4256 entries
- if we drop rows containing any if all of outputs missing then our table size = 3039 entries
- if we drop rows containing any one of output missing then our table size = 1745 entries

so, I think its is better to go with second choice and build different models, but not sure it will not effect performance of the model


now we have the dataframe containing both inputs and ouputs,our next step is
1. split the data into train & test data
  1. try to split data based on continents to reduce bias
2. build DF model using tensorflow
3. check the accuracy of the model

In [61]:
X_train, X_test, y_train, y_test = train_test_split(input_df, output_df, test_size=0.30, random_state=43)

In [62]:
def combine_dfs(label):
  frames      = [X_train,y_train[label]]
  le_model_df = pd.concat(frames,axis=1)
  le_model_df.dropna(subset=[label],inplace=True)
  return le_model_df

In [63]:
for output in OUTPUTS:
  train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output), label="o_"+output, task=tfdf.keras.Task.REGRESSION)
  model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
  with sys_pipes():
    model.fit(x=train_ds)
  # Convert it to a TensorFlow dataset
  test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(combine_dfs("o_"+output), label="o_"+output, task=tfdf.keras.Task.REGRESSION)

  # Evaluate the model
  model.compile(metrics=["mse"])
  # Evaluate the model on the test dataset.
  evaluation = model.evaluate(test_ds, return_dict=True)
  print(output.upper())
  print(evaluation)
  print()
  print(f"MSE: {evaluation['mse']}")
  print(f"RMSE: {math.sqrt(evaluation['mse'])}")
  print()



[INFO kernel.cc:736] Start Yggdrasil model training
[INFO kernel.cc:737] Collect training examples
[INFO kernel.cc:392] Number of batches: 39
[INFO kernel.cc:393] Number of examples: 2459
[INFO kernel.cc:759] Dataset:
Number of records: 2459
Number of columns: 13

Number of columns by type:
	NUMERICAL: 13 (100%)

Columns:

NUMERICAL: 13 (100%)
	0: "child_mortality" NUMERICAL num-nas:38 (1.54534%) mean:89.5996 min:3.2 max:357 sd:83.7571
	1: "co2_emissions" NUMERICAL num-nas:91 (3.70069%) mean:4.47529 min:0.0116 max:99.5 sd:8.8863
	2: "food_supply" NUMERICAL num-nas:377 (15.3314%) mean:2551.97 min:1310 max:3730 sd:502.066
	3: "gdp_per_captia" NUMERICAL num-nas:1 (0.0406669%) mean:1.8866 min:-45.9 max:118 sd:7.10564
	4: "gini_index" NUMERICAL num-nas:38 (1.54534%) mean:41.0166 min:18.4 max:75.8 sd:10.2856
	5: "infant_mortality" NUMERICAL num-nas:112 (4.5547%) mean:57.0608 min:2.5 max:223.9 sd:46.2949
	6: "life_expectancy" NUMERICAL num-nas:45 (1.83001%) mean:64.1988 min:9.5 max:81.7 sd:10

PRIMARY_COMPLETION
{'loss': 0.0, 'mse': 28.090229034423828}

MSE: 28.090229034423828
RMSE: 5.300021606977072



In [64]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)