# Economic Well-Being Prediction Challenge

File name: EconomicAI.ipynb

Author: kogni7

Date: July/August 2021

## Contents
* 1 Preparation
* 2 Data
* 3 Training
* 4 Prediction and Submission

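This notebook uses only the data sets provided by ZINDI; no external data is added. The features describe each surveyed location: country, year, urban or rural setting, built-up area, population density, land cover, nighttime lights, and the distances to the capital and to the shoreline. The task is to predict economic well-being, given in the `Target` column.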

The file system for this project is:

* EconomicAI (root)
    * EconomicAI.ipynb (this notebook)
    * Data
        * Train.csv
        * Test.csv
        * SampleSubmission.csv
    * Submission
        * One submission directory per version, named by the version label (for example VERSION_12)
            * submission.csv
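
The layout above maps to Drive paths such as `EconomicAI/Submission/VERSION_12/submission.csv`. As a minimal sketch, the path can be built in one place from the version label. `submission_path` is a hypothetical helper for illustration and is not part of the notebook.

```r
# Hypothetical helper: build the Drive path of a submission file from the
# version label, following the directory layout described above.
submission_path <- function(version, root = "EconomicAI") {
  file.path(root, "Submission", version, "submission.csv")
}

submission_path("VERSION_12")
```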

This Jupyter notebook uses an R kernel and runs in Google Colab without special configuration. The GPU is disabled.

This notebook uses CatBoost (catboost.ai).

## 1 Preparation
### Time

In [1]:
start_time <- Sys.time()

### Installations and Libraries

In [2]:
# R Version
print(R.version.string)

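# Install googledrive! (Reference: https://gist.github.com/jobdiogenes/235620928c84e604c6e56211ccf681f0)
# In Colab (detected by a Colab-specific file), patch httr's is_interactive()
# to return TRUE so that the OAuth prompt of drive_auth() can be answered.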
if (file.exists("/usr/local/lib/python3.7/dist-packages/google/colab/_ipython.py")) {
  install.packages("R.utils")
  library("R.utils")
  library("httr")
  my_check <- function() {return(TRUE)}
  reassignInPackage("is_interactive", pkgName = "httr", my_check) 
  options(rlang_interactive=TRUE)
}
packages <- c("googledrive")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}
library("googledrive")
drive_auth(use_oob=TRUE, cache=TRUE)

# Install CatBoost!
install.packages('devtools')
devtools::install_url('https://github.com/catboost/catboost/releases/download/v0.26/catboost-R-Linux-0.26.tgz', INSTALL_opts = c("--no-multiarch", "--no-test-load"))
library(catboost)

# Install caret!
install.packages("caret", dependencies = TRUE)
library(caret)

# Seed
SEED = 42

# Version
VERSION = "VERSION_12"

[1] "R version 4.1.0 (2021-05-18)"


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘R.oo’, ‘R.methodsS3’


Loading required package: R.oo

Loading required package: R.methodsS3

R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.

R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.


Attaching package: ‘R.oo’


The following object is masked from ‘package:R.methodsS3’:

    throw


The following objects are masked from ‘package:methods’:

    getClasses, getMethods


The following objects are masked from ‘package:base’:

    attach, detach, load, save


R.utils v2.10.1 (2020-08-26 22:50:31 UTC) successfully loaded. See ?R.utils for help.


Attaching package: ‘R.utils’


The following object is masked from ‘package:utils’:

    timestamp


The following objects are masked from ‘package:base’:

    cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,




Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Downloading package from url: https://github.com/catboost/catboost/releases/download/v0.26/catboost-R-Linux-0.26.tgz




✔  checking for file ‘/tmp/Rtmp4yNRU6/remotes4024ce3636/catboost/DESCRIPTION’
─  preparing ‘catboost’: (395ms)
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘catboost_0.26.tar.gz’
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘sass’, ‘jquerylib’, ‘bitops’, ‘numDeriv’, ‘SQUAREM’, ‘httpuv’, ‘xtable’, ‘sourcetools’, ‘later’, ‘promises’, ‘bslib’, ‘R.cache’, ‘caTools’, ‘TH.data’, ‘profileModel’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘plotrix’, ‘lava’, ‘shiny’, ‘miniUI’, ‘styler’, ‘classInt’, ‘labelled’, ‘gplots’, ‘libcoin’, ‘matrixStats’, ‘multcomp’, ‘lazyeval’, ‘iterators’, ‘gower’, ‘timeDate’, ‘brglm’, ‘gtools’, ‘lme4’, ‘qvcalc’, ‘Formula’, ‘plotmo’, ‘TeachingDemos’, ‘prodlim’, ‘combinat’, ‘questionr’, ‘ROCR’, ‘mvtnorm’, ‘modeltools’, ‘strucchange’, ‘coin’, ‘zoo’, ‘sandwich’, ‘ISwR’, ‘corpcor’, ‘rex’, ‘foreach’, ‘plyr’, ‘ModelMetrics’, ‘reshape2’, ‘recipes’, ‘pROC’, ‘BradleyTerry2’, ‘e1071’, ‘earth’, ‘fastICA’, ‘gam’, ‘ipred’, ‘kernlab’, ‘klaR’, ‘ellipse’, ‘mda’, ‘mlbench’, ‘MLmetrics’, ‘party’, ‘pls’, ‘proxy’, ‘randomFo

## 2 Data

In [3]:
drive_download("EconomicAI/Data/Train.csv")
drive_download("EconomicAI/Data/Test.csv")
drive_download("EconomicAI/Data/SampleSubmission.csv")

File downloaded:
  * Train.csv
Saved locally as:
  * Train.csv

File downloaded:
  * Test.csv
Saved locally as:
  * Test.csv

File downloaded:
  * SampleSubmission.csv
Saved locally as:
  * SampleSubmission.csv



In [4]:
train = read.csv2("/content/Train.csv", header = TRUE, sep = ",", dec = ".", fill = TRUE)
head(train)

,ID,country,year,urban_or_rural,ghsl_water_surface,ghsl_built_pre_1975,ghsl_built_1975_to_1990,ghsl_built_1990_to_2000,ghsl_built_2000_to_2014,ghsl_not_built_up,ghsl_pop_density,landcover_crops_fraction,landcover_urban_fraction,landcover_water_permanent_10km_fraction,landcover_water_seasonal_10km_fraction,nighttime_lights,dist_to_capital,dist_to_shoreline,Target
,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ID_AAIethGy,Ethiopia,2016,R,0,0.0,0.0,5.549359e-05,0.000536438,0.9994081,12.14613,25.489659,0.8794843,0.0,0.0,0.0,278.788451,769.3384,0.132782655
2,ID_AAYiaCeL,Ethiopia,2005,R,0,0.0,0.0001098293,0.0,1.830489e-05,0.9998719,113.80672,64.136053,0.6014272,0.0,0.005426636,0.0,200.986978,337.1352,0.004898371
3,ID_AAdurmKj,Mozambique,2009,R,0,0.0,0.0,0.0,0.0,1.0,0.0,4.400096,0.1319001,0.0,0.00307795,0.0,642.594208,169.9138,0.097319538
4,ID_AAgNHles,Malawi,2015,R,0,0.0001405108,0.0001813272,0.0002543559,0.0002283301,0.9991955,5.21332,25.379371,2.0171358,11.293841067,0.131035181,0.0,365.349451,613.5916,0.304107443
5,ID_AAishfND,Guinea,2012,U,0,0.0116487475,0.0175603698,0.01738268,0.09987471,0.8535335,31.73466,5.08162,22.8159838,0.005047406,0.130475446,1.461894,222.867189,192.9264,0.605328378
6,ID_AAnetgMr,Ethiopia,2016,U,0,0.0086234041,0.0194090158,0.05988562,0.08268153,0.8294004,203.58051,24.629433,31.2357077,0.0,0.008222504,22.98197,9.803702,487.7909,0.463882051
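
The cell above calls `read.csv2` but overrides its separator and decimal mark with `sep = ","` and `dec = "."`, which are the defaults of `read.csv`. The sketch below, using a small inline CSV with made-up values, illustrates that both calls parse the same text into the same data frame.

```r
# Toy CSV text standing in for Train.csv (values are illustrative only).
csv_text <- "ID,country,Target\nID_1,Ethiopia,0.13\nID_2,Malawi,0.30"

# The notebook's call style: read.csv2 with comma separator and dot decimal.
a <- read.csv2(text = csv_text, header = TRUE, sep = ",", dec = ".", fill = TRUE)

# The shorter equivalent: read.csv already uses these settings by default.
b <- read.csv(text = csv_text)

isTRUE(all.equal(a, b))
```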


In [5]:
test = read.csv2("/content/Test.csv", header = TRUE, sep = ",", dec = ".", fill = TRUE)
head(test)

,ID,country,year,urban_or_rural,ghsl_water_surface,ghsl_built_pre_1975,ghsl_built_1975_to_1990,ghsl_built_1990_to_2000,ghsl_built_2000_to_2014,ghsl_not_built_up,ghsl_pop_density,landcover_crops_fraction,landcover_urban_fraction,landcover_water_permanent_10km_fraction,landcover_water_seasonal_10km_fraction,nighttime_lights,dist_to_capital,dist_to_shoreline
,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ID_AAcismbB,Democratic Republic of Congo,2007,R,0.0,0.0,0.0,0.0005227648,0.0001306912,0.9993465,0.6607576,0.9909424,0.1322639,0.06905451,0.03262657,0.0,1249.29472,1364.5921
2,ID_AAeBMsji,Democratic Republic of Congo,2007,U,0.0,0.004238547,0.0002378193,0.0012266313,0.0028597346,0.9914373,6.4415469,5.4613653,0.4523995,0.0,0.0,0.0,821.019579,1046.0287
3,ID_AAjFMjzy,Uganda,2011,U,0.00735941,0.5256816,0.1327951,0.095416039,0.0423747725,0.196373,587.5164577,2.8818109,87.387991,3.24848286,3.9503742,60.070041,3.620455,906.0573
4,ID_AAmMOEEC,Burkina Faso,2010,U,0.0,8.936764e-05,3.574706e-05,0.0015192499,0.0013583881,0.9969972,35.1417616,33.8789266,4.1664369,0.0,0.1307269,1.333999,109.493969,775.1392
5,ID_ABguzDxp,Zambia,2007,R,0.0,0.0001383505,0.0006225773,0.0006146425,0.0029614469,0.995663,3.442449,33.4919942,3.4371285,0.1335626,0.12899741,0.502203,133.643319,835.5915
6,ID_ABomWihE,Angola,2015,R,0.0,4.520264e-05,1.375733e-05,0.0,0.0002147828,0.9997263,0.7884931,26.5510233,0.4565683,0.09740531,0.05435116,0.0,592.234658,375.9099


## 3 Training

In [6]:
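# Convert the categorical columns to factors; numeric columns are copied unchanged.
# The ID column is not used as a feature.
country = as.factor(train$country)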
year = as.factor(train$year)
urban_or_rural = as.factor(train$urban_or_rural)
ghsl_water_surface = train$ghsl_water_surface
ghsl_built_pre_1975 = train$ghsl_built_pre_1975
ghsl_built_1975_to_1990 = train$ghsl_built_1975_to_1990
ghsl_built_1990_to_2000 = train$ghsl_built_1990_to_2000
ghsl_built_2000_to_2014 = train$ghsl_built_2000_to_2014
ghsl_not_built_up = train$ghsl_not_built_up
ghsl_pop_density = train$ghsl_pop_density
landcover_crops_fraction = train$landcover_crops_fraction
landcover_urban_fraction = train$landcover_urban_fraction
landcover_water_permanent_10km_fraction = train$landcover_water_permanent_10km_fraction
landcover_water_seasonal_10km_fraction = train$landcover_water_seasonal_10km_fraction
nighttime_lights = train$nighttime_lights
dist_to_capital = train$dist_to_capital
dist_to_shoreline = train$dist_to_shoreline
Target = train$Target

# Rebuild train with the predictors only; Target stays a separate vector.
rm(train)
train = data.frame(country,
                   year,
                   urban_or_rural,
                   ghsl_water_surface,
                   ghsl_built_pre_1975,
                   ghsl_built_1975_to_1990,
                   ghsl_built_1990_to_2000,
                   ghsl_built_2000_to_2014,
                   ghsl_not_built_up,
                   ghsl_pop_density,
                   landcover_crops_fraction,
                   landcover_urban_fraction,
                   landcover_water_permanent_10km_fraction,
                   landcover_water_seasonal_10km_fraction,
                   nighttime_lights,
                   dist_to_capital,
                   dist_to_shoreline)
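
The cell above converts `country`, `year`, and `urban_or_rural` with `as.factor` on the training data, and the prediction cell does the same on the test data independently. `as.factor` derives the levels from the values present, so the same label can receive a different integer code in each data set. Whether that affects the model depends on how it encodes factors, which is not verified here. The sketch below uses made-up vectors (all names are hypothetical) to show the effect and one way to give both factors a shared level set.

```r
# Toy stand-ins for train$country and test$country (illustrative values).
train_country <- c("Kenya", "Nigeria", "Kenya")
test_country  <- c("Angola", "Kenya")

# Independent conversion: "Kenya" can get a different integer code in each factor.
as.integer(as.factor(train_country))[1]  # code of "Kenya" in the training factor
as.integer(as.factor(test_country))[2]   # code of "Kenya" in the test factor

# Shared level set: the union of the labels seen in either data set.
shared_levels <- sort(union(train_country, test_country))
train_f <- factor(train_country, levels = shared_levels)
test_f  <- factor(test_country,  levels = shared_levels)

# With shared levels, a label has the same code in both factors.
identical(levels(train_f), levels(test_f))
```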

In [7]:
summary(train)

     country          year      urban_or_rural ghsl_water_surface
 Nigeria :2695   2014   :2931   R:14061        Min.   :0.00000   
 Kenya   :2626   2011   :2309   U: 7393        1st Qu.:0.00000   
 Tanzania:2450   2008   :2196                  Median :0.00000   
 Malawi  :1957   2015   :2103                  Mean   :0.02826   
 Ethiopia:1721   2010   :1969                  3rd Qu.:0.00000   
 Ghana   :1419   2013   :1728                  Max.   :0.96996   
 (Other) :8586   (Other):8218                                    
 ghsl_built_pre_1975 ghsl_built_1975_to_1990 ghsl_built_1990_to_2000
 Min.   :0.0000000   Min.   :0.0000000       Min.   :0.0000000      
 1st Qu.:0.0000000   1st Qu.:0.0000000       1st Qu.:0.0000428      
 Median :0.0001975   Median :0.0007092       Median :0.0010009      
 Mean   :0.0382224   Mean   :0.0286437       Mean   :0.0126889      
 3rd Qu.:0.0079866   3rd Qu.:0.0098682       3rd Qu.:0.0081277      
 Max.   :0.8771158   Max.   :0.6850103       Max.   :0.515

In [8]:
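# 5-fold cross-validation to estimate the out-of-sample error.
control <- trainControl(method = "cv", number = 5)

# Single-row grid: every CatBoost hyperparameter is fixed, so caret only
# estimates the performance of this one configuration.
grid <- expand.grid(depth = c(5),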
                    learning_rate = c(0.05),
                    iterations = c(1000),
                    l2_leaf_reg = c(6.0),
                    rsm = c(1),
                    border_count = c(254))

# Fix the RNG seed before training.
set.seed(SEED)
model <- train(train,
               Target,
               method = catboost.caret,
               logging_level = "Silent",
               tuneGrid = grid,
               trControl = control)

print(model)

Catboost 

21454 samples
   17 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 17163, 17164, 17162, 17164, 17163 
Resampling results:

  RMSE        Rsquared   MAE       
  0.08653146  0.8017513  0.06330817

Tuning parameter 'depth' was held constant at a value of 5
Tuning

Tuning parameter 'rsm' was held constant at a value of 1
Tuning parameter 'border_count' was held constant at a value of 254
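
The resampling results above report RMSE, Rsquared, and MAE. As a rough sketch of what these numbers measure, the code below computes them for one vector of predictions with base R. It assumes that Rsquared is the squared Pearson correlation between predictions and observations, which is the default in caret's `postResample`. The values are made up for illustration.

```r
# Toy observed and predicted values (illustrative only).
obs  <- c(0.10, 0.30, 0.50, 0.70)
pred <- c(0.20, 0.30, 0.40, 0.80)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
rsq  <- cor(pred, obs)^2            # squared Pearson correlation

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

With cross-validation, metrics of this kind are computed on each held-out fold and then averaged.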


## 4 Prediction and Submission

In [9]:
# Prepare the test data exactly like the training data (no Target column).
country = as.factor(test$country)
year = as.factor(test$year)
urban_or_rural = as.factor(test$urban_or_rural)
ghsl_water_surface = test$ghsl_water_surface
ghsl_built_pre_1975 = test$ghsl_built_pre_1975
ghsl_built_1975_to_1990 = test$ghsl_built_1975_to_1990
ghsl_built_1990_to_2000 = test$ghsl_built_1990_to_2000
ghsl_built_2000_to_2014 = test$ghsl_built_2000_to_2014
ghsl_not_built_up = test$ghsl_not_built_up
ghsl_pop_density = test$ghsl_pop_density
landcover_crops_fraction = test$landcover_crops_fraction
landcover_urban_fraction = test$landcover_urban_fraction
landcover_water_permanent_10km_fraction = test$landcover_water_permanent_10km_fraction
landcover_water_seasonal_10km_fraction = test$landcover_water_seasonal_10km_fraction
nighttime_lights = test$nighttime_lights
dist_to_capital = test$dist_to_capital
dist_to_shoreline = test$dist_to_shoreline

# Rebuild test with the same predictor columns, in the same order, as train.
rm(test)
test = data.frame(country,
                  year,
                  urban_or_rural,
                  ghsl_water_surface,
                  ghsl_built_pre_1975,
                  ghsl_built_1975_to_1990,
                  ghsl_built_1990_to_2000,
                  ghsl_built_2000_to_2014,
                  ghsl_not_built_up,
                  ghsl_pop_density,
                  landcover_crops_fraction,
                  landcover_urban_fraction,
                  landcover_water_permanent_10km_fraction,
                  landcover_water_seasonal_10km_fraction,
                  nighttime_lights,
                  dist_to_capital,
                  dist_to_shoreline)

# Predict the target for every test row with the fitted model.
prediction <- predict(model, test)
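
The `Target` values shown in `head(train)` all lie between 0 and 1. If the target is in fact bounded to that interval (an assumption; the notebook does not state the range), regression predictions that fall slightly outside it could be clipped before submission. A minimal sketch, with `clip01` as a hypothetical helper and made-up input values:

```r
# Hypothetical helper: clamp numeric predictions to the interval [0, 1].
clip01 <- function(x) pmin(pmax(x, 0), 1)

raw <- c(-0.02, 0.17, 0.66, 1.03)  # illustrative values, not model output
clip01(raw)
```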

In [10]:
SampleSubmission = read.csv2("/content/SampleSubmission.csv", header = TRUE, sep = ",", dec = ".", fill = TRUE)
head(SampleSubmission)

,ID,Target
,<chr>,<int>
1,ID_AAcismbB,0
2,ID_AAeBMsji,0
3,ID_AAjFMjzy,0
4,ID_AAmMOEEC,0
5,ID_ABguzDxp,0
6,ID_ABomWihE,0


In [11]:
SampleSubmission$Target = prediction
head(SampleSubmission)

,ID,Target
,<chr>,<dbl>
1,ID_AAcismbB,0.1694372
2,ID_AAeBMsji,0.1774752
3,ID_AAjFMjzy,0.6594629
4,ID_AAmMOEEC,0.3941256
5,ID_ABguzDxp,0.2518752
6,ID_ABomWihE,0.2508821


In [12]:
write.csv(SampleSubmission, "/content/submission.csv", row.names = FALSE, quote = FALSE)
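
The predictions are assigned to `SampleSubmission$Target` by position, which relies on the test rows and the sample submission rows being in the same order (the first six IDs shown above do match). Before a file like this is uploaded, it can be worth checking that the prediction vector has the expected length and contains no missing values. The sketch below uses a hypothetical helper, `check_predictions`, and toy data.

```r
# Hypothetical check: TRUE when the predictions are numeric, have one value
# per row of the sample submission, and contain no missing values.
check_predictions <- function(prediction, sample) {
  is.numeric(prediction) &&
    length(prediction) == nrow(sample) &&
    !anyNA(prediction)
}

# Toy stand-ins for SampleSubmission and the prediction vector.
sample_df <- data.frame(ID = c("ID_a", "ID_b"), Target = c(0, 0))
toy_pred  <- c(0.17, 0.66)

check_predictions(toy_pred, sample_df)
```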

In [13]:
drive_mkdir(paste0("/EconomicAI/Submission/", VERSION))

Created Drive file:
  * VERSION_12: 1RZVv9g2kQExkzM5DxdBsLaibzgLpMolQ
with MIME type:
  * application/vnd.google-apps.folder



In [14]:
drive_upload("/content/submission.csv", paste0("EconomicAI/Submission/", VERSION, "/submission.csv"))

Local file:
  * /content/submission.csv
uploaded into Drive file:
  * submission.csv: 1m6Kw-hge8ByhB538a4VZHzNcQTuCpLWP
with MIME type:
  * text/csv



In [15]:
end_time <- Sys.time()

round(end_time - start_time, 1)

Time difference of 20.6 mins