# INFO-F-422 -  Statistical Foundations of Machine Learning 

### Couchard Darius - Parent Paul - Donne Stefano

## Pump it Up: Data Mining the Water Table
####  April 29, 2021


# 1) Data Pre-Processing




In [1]:
training_set<-read.csv("../Data/TrainingSet/4910797b-ee55-40a7-8668-10efd5c1b960.csv",header=TRUE) # loads the training set csv file (it's magic)
dim(training_set) # dimension of the set 
names(training_set) # names of the variables
training_set


id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
<int>,<dbl>,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<chr>,<int>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
69572,6000,2011-03-14,Roman,1390,Roman,34.93809,-9.85632177,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
8776,0,2013-03-06,Grumeti,1399,GRUMETI,34.69877,-2.14746569,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
34310,25,2013-02-25,Lottery Club,686,World vision,37.46066,-3.82132853,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
67743,0,2013-01-28,Unicef,263,UNICEF,38.48616,-11.15529772,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
19728,0,2011-07-13,Action In A,0,Artisan,31.13085,-1.82535885,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
9944,20,2011-03-13,Mkinga Distric Coun,0,DWE,39.17280,-4.76558728,Tajiri,0,...,per bucket,salty,salty,enough,enough,other,other,unknown,communal standpipe multiple,communal standpipe
19816,0,2012-10-01,Dwsp,0,DWSP,33.36241,-3.76636472,Kwa Ngomho,0,...,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
54551,0,2012-10-09,Rwssp,0,DWE,32.62062,-4.22619802,Tushirikiane,0,...,unknown,milky,milky,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
53934,0,2012-11-03,Wateraid,0,Water Aid,32.71110,-5.14671181,Kwa Ramadhan Musa,0,...,never pay,salty,salty,seasonal,seasonal,machine dbh,borehole,groundwater,hand pump,hand pump
46144,0,2011-08-03,Isingiro Ho,0,Artisan,30.62699,-1.25705061,Kwapeto,0,...,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump


## How to enhance the data set :
First, it's needed to remove empty values in the table: each NaN or empty cell has to be removed or replaced, it is decided to map Nan <- 0
Then, modifications have to be made depending on the nature of the data :
* If a column (variable) consists in continous numerical values : standardization is applied such has obtaining a new column with a mean value of 0 and a standard deviation of 1
* If a column is made of ordinal categorical variable (hierarchy between categories) : map each string to a numerical value 
* In case of nominal categorical variable : apply one hot encoding -> create new column (with binary values) for each category
But some other cases have to be assesed :
* the name of the water point (**wpt_name**) isn't relevant to use as it is, as every water point has a unique or no name -> so either drop this column or map it as 1 : has name , 0 : no name
* the funder or installer variable cannot be addressed with one hot encoding as there is 1900 different funders, many of them having only 1 installation. So two solutions exists : drop the column (loss of information) or create new categories for funders (eg: number of installation ma

## Methods are implement to apply these changes :

In [42]:
# METHOD TO REASSIGN EMPTY VALUES
NaN_handler <- function(column_name) { # input : column_name (name of the variable)
    training_set[training_set[,column_name] == "", ] <- 0 # select row where column element is empty string "" and assign to 0
    training_set[is.na(training_set[,column_name])] <- 0 # select row where column element is NaN and assign to 0
}
NaN_handler("funder") # TEST

In [48]:
# STANDARDIZATION METHOD FOR VARIABLESS WITH CONTINUOUS NUMERICAL VALUES
Standardization <- function(column_name){ # input : column_name (name of the variable)
    mean_col = mean(training_set[,column_name], na.rm = TRUE) # mean of the variable
    sd_col = sd(training_set[,column_name], na.rm = TRUE) # standard deviation of the variable
    training_set[,column_name]<-(training_set[,column_name]-mean_col)/sd_col # apply the transformation
    # now for the whole column : mean = 0 and sd = 1
}
NaN_handler("gps_height") # TEST 
Standardization("gps_height") # TEST


In [12]:
require(tidyr)
require(dplyr)
# HANDLING OF NOMINAL CATEGORICAL VARIABLES
# before using : change notebook IOPub data rate limit with Jupyter  notebook --NotebookApp.iopub_data_rate_limit=jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
Nom_cat_handle <- function(column_name){
    training_set %>% mutate(value = 1)  %>% spread(column_name, value,  fill = 0 )
}
Nom_cat_handle("source_type") # TEST

# THERE IS 1897 different funders ! many funding only 1 installation : it's needed to categorize them

id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,source_class,waterpoint_type,waterpoint_type_group,borehole,dam,other,rainwater harvesting,river/lake,shallow well,spring
<int>,<dbl>,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<chr>,<int>,...,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
69572,6000,2011-03-14,Roman,1390,Roman,34.93809,-9.85632177,none,0,...,groundwater,communal standpipe,communal standpipe,0,0,0,0,0,0,1
8776,0,2013-03-06,Grumeti,1399,GRUMETI,34.69877,-2.14746569,Zahanati,0,...,surface,communal standpipe,communal standpipe,0,0,0,1,0,0,0
34310,25,2013-02-25,Lottery Club,686,World vision,37.46066,-3.82132853,Kwa Mahundi,0,...,surface,communal standpipe multiple,communal standpipe,0,1,0,0,0,0,0
67743,0,2013-01-28,Unicef,263,UNICEF,38.48616,-11.15529772,Zahanati Ya Nanyumbu,0,...,groundwater,communal standpipe multiple,communal standpipe,1,0,0,0,0,0,0
19728,0,2011-07-13,Action In A,0,Artisan,31.13085,-1.82535885,Shuleni,0,...,surface,communal standpipe,communal standpipe,0,0,0,1,0,0,0
9944,20,2011-03-13,Mkinga Distric Coun,0,DWE,39.17280,-4.76558728,Tajiri,0,...,unknown,communal standpipe multiple,communal standpipe,0,0,1,0,0,0,0
19816,0,2012-10-01,Dwsp,0,DWSP,33.36241,-3.76636472,Kwa Ngomho,0,...,groundwater,hand pump,hand pump,1,0,0,0,0,0,0
54551,0,2012-10-09,Rwssp,0,DWE,32.62062,-4.22619802,Tushirikiane,0,...,groundwater,hand pump,hand pump,0,0,0,0,0,1,0
53934,0,2012-11-03,Wateraid,0,Water Aid,32.71110,-5.14671181,Kwa Ramadhan Musa,0,...,groundwater,hand pump,hand pump,1,0,0,0,0,0,0
46144,0,2011-08-03,Isingiro Ho,0,Artisan,30.62699,-1.25705061,Kwapeto,0,...,groundwater,hand pump,hand pump,0,0,0,0,0,1,0


In [11]:
# TEST
training_set<-training_set[,!(names(training_set) %in% "wpt_name")] # drop the desired column