# INFO-F-422 - Statistical Foundations of Machine Learning

### Couchard Darius - Parent Paul - Donne Stefano

## Pump it Up: Data Mining the Water Table
####  April 29, 2021


# 2) Data Pre-Processing




In [1]:
require(tidyr)
require(plyr)
require(dplyr)
library(mltools)
library(data.table)

training_set<-read.csv("../Data/TrainingSet/4910797b-ee55-40a7-8668-10efd5c1b960.csv",header=TRUE) # loads the training set csv file
dim(training_set) # dimensions of the set (rows, columns)
names(training_set) # names of the variables

training_labels<-read.csv("../Data/TrainingLabel/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv", header=TRUE) # loads the corresponding labels


Loading required package: tidyr

Loading required package: plyr

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘mltools’


The following object is masked from ‘package:tidyr’:

    replace_na



Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last




## How to enhance the data set

After analyzing the data set and assessing the relevance of each variable, the data must be cleaned and standardized.

First, missing values must be handled (imputation): each NaN or empty cell has to be removed or replaced. Different solutions exist:
* Map missing values to 0 for nominal categorical variables
* Replace missing values by the column mean for numerical variables

Then, further transformations depend on the nature of the data:
* If a column (variable) consists of continuous numerical values: standardization is applied, giving a new column with a mean of 0 and a standard deviation of 1 (**longitude**)
* If a column holds an ordinal categorical variable (there is a hierarchy between categories): each string is mapped to a numerical value (**water_quality**)
* If a column holds a nominal categorical variable: one-hot encoding is applied, creating a new binary column for each category (**source_type**)
<br/>
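As a small illustration of the ordinal case combined with mean imputation, a named lookup vector can map each category to its rank. The toy vector and the ordering of the levels below are assumptions made for the example, not values taken from the data set.

```r
# Toy ordinal variable (hypothetical values, not from the competition data)
quality <- c("good", "salty", "good", "unknown", "milky")

# Named lookup vector: category -> rank; categories absent from the lookup become NA
rank_lookup <- c(salty = 0, milky = 1, good = 2)
quality_rank <- unname(rank_lookup[quality])

# Impute the missing rank with the mean of the observed ranks
quality_rank[is.na(quality_rank)] <- mean(quality_rank, na.rm = TRUE)
print(quality_rank)
```

Leaving "unknown" out of the lookup turns it into `NA` directly, so the imputation step only has to look for `NA`.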


## Helper functions implementing these changes

In [2]:
# REPLACE MISSING VALUES OF A CATEGORICAL VARIABLE BY 0
NaN_handler_categorical <- function(column_name) { # input : column_name (name of the variable)
    col <- training_set[,column_name]
    # which() ignores NA comparisons, so the assignment does not fail when the column also contains NA
    training_set[which(col == ""),column_name] <- 0 # select rows where the element is the empty string "" and assign 0
    training_set[is.na(col),column_name] <- 0 # select rows where the element is NA/NaN and assign 0
    return(training_set)
}


# REPLACE MISSING VALUES OF A NUMERICAL VARIABLE BY THE COLUMN MEAN
NaN_handler_num <- function(column_name) { # input : column_name (name of the variable)
    mean_col <- mean(training_set[,column_name], na.rm = TRUE) # mean of the observed values
    training_set[is.na(training_set[,column_name]),column_name] <- mean_col # fill the missing values
    return(training_set)
}
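
The helpers above read `training_set` from the global environment and return a modified copy. A possible variant, sketched here under the hypothetical name `impute_mean`, takes the data frame as an explicit argument so that the same function could also be applied to another table such as a test set. The toy data frame is invented for the example.

```r
# Hypothetical helper: same idea as NaN_handler_num, but the data frame is an explicit argument
impute_mean <- function(df, column_name) {
    col_mean <- mean(df[[column_name]], na.rm = TRUE)       # mean of the observed values
    df[[column_name]][is.na(df[[column_name]])] <- col_mean # fill the gaps with that mean
    df
}

toy <- data.frame(id = 1:4, x = c(1, NA, 3, NA)) # toy data, not the real training set
toy <- impute_mean(toy, "x")
print(toy$x)
```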



In [3]:
# STANDARDIZATION METHOD FOR VARIABLES WITH CONTINUOUS NUMERICAL VALUES
Standardization <- function(column_name){ # input : column_name (name of the variable)
    mean_col <- mean(training_set[,column_name], na.rm = TRUE) # mean of the variable
    sd_col <- sd(training_set[,column_name], na.rm = TRUE) # standard deviation of the variable
    training_set[,column_name]<-(training_set[,column_name]-mean_col)/sd_col # apply the transformation
    # now for the whole column : mean = 0 and sd = 1
    return(training_set)
}
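
The transformation applied by `Standardization` is the usual z-score, $z = (x - \bar{x}) / s$, where $s$ is the sample standard deviation returned by `sd()`. A quick check on toy values (invented for the example) confirms the resulting mean and standard deviation and compares the result with base R's `scale()`:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9) # toy values
z <- (x - mean(x)) / sd(x)     # same transformation as Standardization()

# The result has mean 0 and standard deviation 1 (up to floating-point error)
stopifnot(isTRUE(all.equal(mean(z), 0)), isTRUE(all.equal(sd(z), 1)))

# base::scale() performs the same centering and scaling
stopifnot(isTRUE(all.equal(z, as.numeric(scale(x)))))
```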

In [4]:
# HANDLING OF NOMINAL CATEGORICAL VARIABLES (ONE HOT ENCODING)
# before using : raise the notebook IOPub data rate limit, e.g. jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
Nom_cat_handle <- function(column_name){ # input : column_name (name of the variable)
    # add a helper column of 1s, then spread it into one binary column per category (0 where the category does not apply)
    training_set <- training_set %>% mutate(value = 1)  %>% spread(column_name, value,  fill = 0 )
    return(training_set)
}
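
With this approach the new columns are named after the category values alone. If two encoded variables share a category label (for example a generic label such as "other"), or if a category is the empty string, the resulting column names may clash or be empty. A minimal base R sketch of a variant that prefixes each new column with the variable name is given below; the helper name `one_hot_prefixed`, the "missing" label and the toy data are assumptions made for the example.

```r
# Hypothetical helper: one-hot encode `column_name`, prefixing each new column with the variable name
one_hot_prefixed <- function(df, column_name) {
    values <- as.character(df[[column_name]])
    values[is.na(values) | values == ""] <- "missing" # assumed label for empty cells
    for (level in sort(unique(values))) {
        df[[paste(column_name, level, sep = "_")]] <- as.integer(values == level)
    }
    df[[column_name]] <- NULL # drop the original column
    df
}

# Toy data (invented): both variables contain the label "other"
toy <- data.frame(id = 1:4,
                  payment = c("other", "monthly", "", "other"),
                  source_type = c("spring", "other", "other", "spring"))
toy <- one_hot_prefixed(toy, "payment")
toy <- one_hot_prefixed(toy, "source_type")
print(names(toy))
```

The prefix keeps `payment_other` and `source_type_other` distinct, at the cost of longer column names.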


## Drop unused variables

In [5]:
# COLUMNS TO DROP 
column_to_drop<-c("wpt_name","amount_tsh","date_recorded","gps_height","num_private","public_meeting","recorded_by",
                 "scheme_name","quantity_group","source_class","subvillage","waterpoint_type",
                 "extraction_type","extraction_type_group","water_quality","source", "payment_type", "management")
#TODO : add other columns to drop
training_set<-training_set[,!(names(training_set) %in% column_to_drop)] # drop the desired column

In [6]:
# MANAGE SIMPLE NOMINAL CATEGORICAL VARIABLE
#TODO : apply one-hot encoding to other needed variables
training_set<-Nom_cat_handle("basin") # apply one-hot-encoding to the basin related column
# ...

## Population variable

In [7]:
# MANAGE POPULATION VARIABLE
# A population of 0 is treated as missing and replaced by the mean population of the region
# column 1 = region code, column 2 = population mean in this region
region_code_frame <- data.frame("region" = unique(training_set$region_code),"mean_pop" = NA)
for(row in 1:nrow(region_code_frame)){
    sel <- training_set[which(training_set[,"region_code"]==region_code_frame[row,"region"]),"population"] # populations recorded in this region
    region_code_frame[row,"mean_pop"] <- mean(sel[sel!=0],na.rm = TRUE) # mean of the non-zero populations
}

# TODO : some region population means are NaN, find a solution (same problem with region and district_code variables)
# temporarily, the NaN are replaced by the mean of the other values in region_code_frame
region_code_frame[which(is.na(region_code_frame[,"mean_pop"])),"mean_pop"] <- mean(region_code_frame[,"mean_pop"],na.rm=TRUE)

# replace the zero populations by the mean of their region
index <- which(training_set[,"population"]==0)
# match() looks the region code up in the "region" column; using the code itself as a row number would select the wrong row
region_row <- match(training_set[index,"region_code"], region_code_frame[,"region"])
training_set[index,"population"] <- region_code_frame[region_row,"mean_pop"]

# Now Standardize Population Variable 
training_set<-Standardization("population")

#TODO: Manage other tricky variables
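
The same group-wise imputation can also be written without an explicit loop. The sketch below uses base R's `ave()` on an invented toy data frame. Note that, as an assumption of this example, regions without any recorded value fall back to the overall mean of the recorded populations, which differs slightly from the mean of the region means used in the cell above.

```r
# Toy data (hypothetical): a population of 0 means "not recorded"
toy <- data.frame(region_code = c(1, 1, 1, 5, 5, 99),
                  population  = c(100, 0, 300, 0, 50, 0))

pop <- ifelse(toy$population == 0, NA, toy$population) # treat 0 as missing

# Mean of the recorded values within each region, repeated for every row of that region
region_mean <- ave(pop, toy$region_code, FUN = function(v) mean(v, na.rm = TRUE))

# Regions without any recorded value (mean is NaN) fall back to the overall mean
region_mean[is.nan(region_mean)] <- mean(pop, na.rm = TRUE)

toy$population <- ifelse(is.na(pop), region_mean, pop)
print(toy$population)
```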

## Permit variable

In [8]:
# MANAGE BOOLEAN VARIABLE

# variable PERMIT
# remap True : 1 , False : 0, Missing "" : NA
training_set$permit <- mapvalues(training_set$permit, 
          from=c("True","False",""), 
          to=c(1,0,NA))
training_set <- transform(training_set, permit = as.numeric(permit)) # transform column data type (char to numeric)
# replace missing value by mean of column
training_set<-NaN_handler_num("permit")
# Standardize Variable
training_set<-Standardization("permit")


## Scheme management variable

In [9]:
# One hot encoding
training_set <- Nom_cat_handle("scheme_management")

The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.


## Construction year variable

In [10]:
# Replace the construction's year by age
max_year <- max(training_set$construction_year)
# Replace 0 values by NaN
training_set$construction_year <- mapvalues(training_set$construction_year, from=0, to=NaN)
# Changes construction year by age
training_set$construction_year <- max_year - training_set$construction_year
# Computes mean age
mean_age <- mean(na.omit(training_set$construction_year))
# Replace NaN by the mean age
training_set$construction_year <- mapvalues(training_set$construction_year, from=NaN, to=mean_age)

# Standardize age values
training_set <- Standardization("construction_year")

# Rename column to age (assigning to names() of a subset would leave the data frame unchanged)
names(training_set)[names(training_set) == "construction_year"] <- "age"
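
The year-to-age conversion can be followed step by step on a few toy values (invented for the example), where 0 stands for a construction year that was not recorded:

```r
years <- c(1990, 0, 2010, 2013, 0)          # toy values; 0 means "year not recorded"
years[years == 0] <- NA                     # mark the unrecorded years as missing
age <- max(years, na.rm = TRUE) - years     # age relative to the most recent construction year
age[is.na(age)] <- mean(age, na.rm = TRUE)  # impute the missing ages with the mean age
print(age)
```

The age is measured relative to the most recent construction year in the data rather than to the current date, which matches the use of `max_year` above.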

## Extraction type class variable

In [11]:
# One hot encoding
training_set <- Nom_cat_handle("extraction_type_class")

## Management group variable

In [12]:
# One hot encode
training_set <- Nom_cat_handle("management_group")

## Payment variable

In [13]:
# One hot encode
training_set <- Nom_cat_handle("payment")

## Water quality group variable

In [14]:
# Integer encoding ("unknown" is mapped to NA so that it can be imputed below)
training_set$quality_group <- mapvalues(training_set$quality_group,
                                        from=c("milky", "good", "salty", "colored", "unknown", "fluoride"),
                                        to=c(2,3,0,1,NA,4))

training_set$quality_group <- as.numeric(training_set$quality_group)

# compute the mean of the known values
quality_mean <- mean(training_set$quality_group, na.rm = TRUE)

# replace the missing values by the mean (after the encoding, no "unknown" string is left to remap)
training_set$quality_group[is.na(training_set$quality_group)] <- quality_mean

# Standardize data
training_set <- Standardization("quality_group")



## Water quantity variable

In [15]:
# Integer encoding ("unknown" is mapped to NA so that it can be imputed below)
training_set$quantity <- mapvalues(training_set$quantity,
                                    from=c("enough", "insufficient", "dry", "seasonal", "unknown"),
                                    to=c(3, 1, 0, 2, NA))
training_set$quantity <- as.numeric(training_set$quantity)

# compute the mean of the known values
quantity_mean <- mean(training_set$quantity, na.rm = TRUE)

# replace the missing values by the mean (after the encoding, no "unknown" string is left to remap)
training_set$quantity[is.na(training_set$quantity)] <- quantity_mean

# Standardize data
training_set <- Standardization("quantity")



## Source type variable

In [16]:
# One hot encode
training_set <- Nom_cat_handle("source_type")

## Water point type variable

In [17]:
# One hot encode
training_set <- Nom_cat_handle("waterpoint_type_group")

## Write preprocessed data

In [None]:
# Write the pre-processed data into a new Excel file (write_xlsx always writes the xlsx format, whatever the file extension)
library("writexl")
write_xlsx(training_set,"../Data/PreProcess/processed_training_data.xls") 