# INFO-F-422 -  Statistical Foundations of Machine Learning 

### Couchard Darius - Parent Paul - Donne Stefano

## Pump it Up: Data Mining the Water Table
####  April 29, 2021


# 1) Data Pre-Processing




In [1]:
training_set<-read.csv("../Data/TrainingSet/4910797b-ee55-40a7-8668-10efd5c1b960.csv",header=TRUE) # loads the training set csv file (it's magic)
dim(training_set) # dimension of the set 
names(training_set) # names of the variables

traninin_labels<-read.csv("../Data/TrainingLabel/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv", header=TRUE) # Loads the corresponding labels


## Analysis of each variable
All the variables will be analysed one by one, as some of them aren't representative of the problem and can be omitted.

In [8]:
sm = table(training_set["scheme_management"])
sm = as.data.frame(sm)
names(sm)[1] <- "Scheme Management"
sm

| Scheme Management (`<fct>`) | Freq (`<int>`) |
|---|---|
| | 3877 |
| Company | 1061 |
| | 1 |
| Other | 766 |
| Parastatal | 1680 |
| Private operator | 1063 |
| SWC | 97 |
| Trust | 72 |
| VWC | 36793 |
| Water authority | 3153 |


## How to enhance the data set:
First, empty values in the table have to be handled: each NaN or empty cell must be removed or replaced. For now, they are mapped to 0.

Then, modifications have to be made depending on the nature of the data:
* If a column (variable) consists of continuous numerical values: standardization is applied so that the new column has a mean of 0 and a standard deviation of 1 (**gps_height**)
* If a column holds an ordinal categorical variable (hierarchy between categories): map each string to a numerical value (**water_quality**)
* In the case of a nominal categorical variable: apply one-hot encoding, i.e. create a new binary column for each category (**source_type**)
<br/>
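
The ordinal and nominal encodings listed above can be sketched on a toy data frame. The category names and their ranking below are assumptions made for illustration, not the real levels of **water_quality** or **source_type**.

```r
# Toy data: category names are illustrative, not the real levels of the data set
toy <- data.frame(
    quality = c("good", "poor", "fair", "good"),
    source  = c("river", "spring", "river", "dam"),
    stringsAsFactors = FALSE
)

# Ordinal variable: map each category to its rank (the ranking is an assumption)
quality_order <- c("poor" = 1, "fair" = 2, "good" = 3)
toy$quality_num <- unname(quality_order[toy$quality])

# Nominal variable: one binary column per category (one-hot encoding)
categories <- sort(unique(toy$source))
one_hot <- sapply(categories, function(level) as.integer(toy$source == level))
colnames(one_hot) <- paste0("source_", categories)

# Encoded table: the original string columns are replaced by numeric ones
toy <- cbind(toy[, "quality_num", drop = FALSE], one_hot)
toy
```

Each row has exactly one of the `source_*` columns set to 1, while `quality_num` keeps the order between categories in a single column.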

But some other cases have to be assessed:
* the name of the water point (**wpt_name**) isn't relevant to use as it is, as every water point has either a unique name or no name, so either drop this column or map it as 1: has name, 0: no name
* the **funder** or **installer** variable cannot be addressed with one-hot encoding, as there are about 1900 different funders, many of them having only 1 installation. So two solutions exist: drop the column (loss of information) or create new categories for funders (e.g. number of installations per funder -> 1, 1-10, 10+)
* ...
* todo : address variable redundancy (eg: **source** vs **source_type**)
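
For the redundancy item, one possible check is whether every value of the finer variable always appears with the same value of the coarser one. If it does, the coarser column adds no information. The sketch below uses toy vectors and a hypothetical helper `is_determined_by`; it is not part of the notebook.

```r
# Hypothetical helper: TRUE when each value of `fine` always goes with the same value of `coarse`
is_determined_by <- function(coarse, fine) {
    pairs <- unique(data.frame(coarse = coarse, fine = fine))
    !any(duplicated(pairs$fine))
}

# Toy vectors standing in for source and source_type
toy_source      <- c("river", "lake", "spring", "river", "dam")
toy_source_type <- c("river/lake", "river/lake", "spring", "river/lake", "dam")

is_determined_by(coarse = toy_source_type, fine = toy_source)  # TRUE
is_determined_by(coarse = toy_source, fine = toy_source_type)  # FALSE
```

In this toy case the coarse column can be rebuilt from the fine one, so only one of the two would need to be kept.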

## Methods implemented to apply these changes:

In [None]:
# METHOD TO REASSIGN EMPTY VALUES
NaN_handler <- function(column_name) { # input : column_name (name of the variable)
    col <- training_set[,column_name]
    if (is.factor(col)) col <- as.character(col) # a factor cannot hold the new value 0
    col[is.na(col) | col == ""] <- 0 # only the cells of this column that are NA or empty string "" are set to 0
    training_set[,column_name] <- col # the other columns of these rows are left untouched
    return(training_set)
}
training_set<-NaN_handler("funder") # TEST

In [None]:
# STANDARDIZATION METHOD FOR VARIABLES WITH CONTINUOUS NUMERICAL VALUES
Standardization <- function(column_name){ # input : column_name (name of the variable)
    mean_col = mean(training_set[,column_name], na.rm = TRUE) # mean of the variable
    sd_col = sd(training_set[,column_name], na.rm = TRUE) # standard deviation of the variable
    training_set[,column_name]<-(training_set[,column_name]-mean_col)/sd_col # apply the transformation
    # now for the whole column : mean = 0 and sd = 1
    return(training_set)
}
training_set<-NaN_handler("gps_height") # TEST 
training_set<-Standardization("gps_height") # TEST
# also, amount_tsh, longitude, latitude, ...
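
Since the test set is pre-processed later, it may be useful to keep the mean and standard deviation computed on the training data and reuse them. A minimal sketch on toy values follows; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not functions of the notebook.

```r
# Hypothetical helpers: compute the parameters once, then apply them to any vector
fit_standardizer <- function(x) {
    list(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}
apply_standardizer <- function(x, params) {
    (x - params$mean) / params$sd
}

train_height <- c(1390, 1399, 686, 263, 0)  # toy values
test_height  <- c(500, 1200)                # toy values

params    <- fit_standardizer(train_height)
train_std <- apply_standardizer(train_height, params)
test_std  <- apply_standardizer(test_height, params)  # same shift and scale as the training data

c(mean(train_std), sd(train_std))  # approximately 0 and 1
```

The test values are shifted and scaled with the training parameters, so their own mean and standard deviation are generally not 0 and 1.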

In [None]:
library(tidyr)
library(dplyr)
# HANDLING OF NOMINAL CATEGORICAL VARIABLES
# before using : raise the notebook IOPub data rate limit by starting Jupyter with
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
Nom_cat_handle <- function(column_name){
    # one new binary column per category; the result has to be assigned, otherwise the set is returned unchanged
    training_set <- training_set %>% mutate(value = 1)  %>% spread(column_name, value,  fill = 0 )
    return(training_set)
}
training_set<-Nom_cat_handle("source_type") # TEST

# THERE ARE 1897 different funders! Many fund only 1 installation, so they have to be grouped into categories instead of one-hot encoded

In [None]:
funder_occurency<-as.data.frame(table(training_set[,"funder"])) # data frame containing the number of occurrences of each funder
funder_occurency<-arrange(funder_occurency,Freq) # sorted in ascending frequency order (arrange() comes from dplyr)
write.csv(funder_occurency,"../Data/PreProcess/funder_occ.csv",row.names=FALSE) # stores occurrences as an actual CSV file for later test_set pre-processing
funder_occurency[as.integer(nrow(funder_occurency)/3),2] # thresh 1 
funder_occurency[as.integer(2*nrow(funder_occurency)/3),2] # thresh 2
# the two thresholds split the data frame in 3 equal parts : (the choice of 3 parts is arbitrary)
# - funders having opened 1 water pump
# - funders having opened 2 or 3 water pumps
# - funders having opened more than 3 water pumps
# The funder column can now be transformed, where every funder is now assigned to an ordinal categorical variable (1,2 or 3)
# 0 is already assigned by default to the rows without funder names

# reassign each funder value to its new category (vectorized; the column is read as character because a factor cannot hold the new values)
funder_names <- as.character(training_set[,"funder"])
val <- funder_occurency[match(funder_names, funder_occurency[,1]),2] # number of occurrences of the funder of each row
funder_category <- ifelse(val>3, 3, ifelse(val>1, 2, 1))
funder_category[is.na(val) | funder_names %in% c("", "0")] <- 0 # rows without funder name stay in category 0
training_set[,"funder"] <- funder_category

# this process can be done for other very large nominal categorical variables
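
The same idea can be wrapped in a reusable function for other large nominal variables. This is a sketch under assumptions: `bin_by_frequency` is a hypothetical helper, the default breaks reuse the thresholds found above (1, 2 to 3, more than 3), and the toy installer names are made up.

```r
# Hypothetical helper: replace each category by an ordinal bin based on its frequency
bin_by_frequency <- function(values, breaks = c(0, 1, 3, Inf)) {
    values  <- as.character(values)
    counts  <- table(values)                                    # occurrences of each category
    per_row <- as.integer(counts[match(values, names(counts))]) # frequency of the category of each row
    bins    <- cut(per_row, breaks = breaks, labels = FALSE)    # (0,1] -> 1, (1,3] -> 2, (3,Inf] -> 3
    bins[is.na(values) | values == ""] <- 0                     # rows without a name go to category 0
    bins
}

toy_installer <- c("A", "B", "B", "C", "C", "C", "C", "")
bin_by_frequency(toy_installer)
```

On the toy vector, `A` falls in bin 1, `B` in bin 2, `C` in bin 3 and the empty name in bin 0.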

In [None]:
# TEST : remap wpt_name with 0: no name, 1: has name
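# only the wpt_name column is modified; "none", empty, already zeroed and NA entries count as "no name"
training_set[,"wpt_name"] <- ifelse(as.character(training_set[,"wpt_name"]) %in% c("none", "", "0", NA), 0, 1)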

In [None]:
# TEST
training_set<-training_set[,!(names(training_set) %in% "wpt_name")] # drop the desired column