# **EARTHQUAKE DAMAGE PREDICTION**

https://www.drivendata.org/competitions/57/nepal-earthquake/page/136/

* Andrea Morales Garzón `andreamgmg@correo.ugr.es`
* Ithiel Piñero Darias `ithiel@correo.ugr.es`
* Paula Villa Martín `pvilla@correo.ugr.es`
* Antonio Manjavacas Lucas `manjavacas@correo.ugr.es`

Based on factors related to building location and construction, the goal of this work is to predict the level of damage caused to buildings in Nepal by the 2015 Gorkha earthquake.

The data were collected through surveys carried out by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal. This survey is one of the largest post-disaster datasets ever collected, and it contains valuable information on earthquake impacts, household conditions, and socio-economic and demographic statistics.

We will try to predict the ordinal variable `damage_grade`, which represents the level of damage to the buildings hit by the earthquake:

* `damage_grade` = 1 represents low damage;
* `damage_grade` = 2 represents medium damage;
* `damage_grade` = 3 represents almost complete destruction of the building.


In [None]:
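# JVM options for rJava/RWeka. They only take effect if set before the JVM starts,
# i.e. before library(RWeka) is called. -Xmx8192m raises the Java heap limit to 8 GB.
options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
gc()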

,used,(Mb),gc trigger,(Mb),max used,(Mb)
Ncells,532751,28.5,1195072,63.9,685560,36.7
Vcells,1012850,7.8,8388608,64.0,1771067,13.6


Install the required packages:

In [None]:
install.packages('tidyverse')
install.packages('NoiseFiltersR')
install.packages('caret')
install.packages('RWeka')
install.packages('nortest')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘numDeriv’, ‘SQUAREM’, ‘lava’, ‘prodlim’, ‘iterators’, ‘data.table’, ‘gower’, ‘ipred’, ‘timeDate’, ‘RWekajars’, ‘igraph’, ‘foreach’, ‘plyr’, ‘ModelMetrics’, ‘reshape2’, ‘recipes’, ‘pROC’, ‘RWeka’, ‘kknn’, ‘caret’, ‘e1071’, ‘randomForest’, ‘rJava’




# **Preprocessing**

Load the libraries:

In [None]:
set.seed(42)

library(tidyverse)
library(NoiseFiltersR)
library(caret)
library(RWeka)
library(nortest)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.0.5     ✔ dplyr   1.0.3
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift




Load the data:

In [None]:
TRAIN_VALUES_ID = '15ykpkKIJNKlEXQQ3taspRjUZ2sJN5zS_'
TRAIN_LABELS_ID = '1nrNVfj9NmNvwPhXuucUBYhh-FmODCCBK'
TEST_VALUES_ID = '1_GpX1sh7XkJLm-kyOpcObzXICW5Z-tb_'

load_file <- function(id) {
  read_csv(sprintf('https://docs.google.com/uc?id=%s&export=download', id), col_types=cols())
}

train_data <- load_file(TRAIN_VALUES_ID)
train_labels <- load_file(TRAIN_LABELS_ID)
test_data <- load_file(TEST_VALUES_ID)

ids <- test_data$building_id

Convert the categorical columns to factors:

In [None]:
cols_to_factor <- c(9:15, 27)

train_data[cols_to_factor] <-
  lapply(train_data[cols_to_factor], factor)
test_data[cols_to_factor] <-
  lapply(test_data[cols_to_factor], factor)

train_labels$damage_grade <- factor(train_labels$damage_grade)
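
The positional indices `c(9:15, 27)` depend on the exact column order of the CSV files. As a hedged alternative, the conversion can be keyed on column type instead. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than the competition data, and assumes the categorical columns are exactly the ones read as character.

```r
# Toy stand-in for the real data; values are invented for illustration.
toy <- data.frame(
  building_id = c(4, 8, 12),
  age = c(25, 0, 5),
  foundation_type = c("r", "r", "w"),
  roof_type = c("n", "n", "q"),
  stringsAsFactors = FALSE
)

# Select columns by type instead of by position.
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

When train and test sets are converted separately, a value that appears in only one of them produces factors with different levels; passing a shared `levels =` argument to `factor()` avoids that.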

Category grouping. The training features are first merged with their labels on `building_id`:

In [None]:
train_data <- merge(x=train_data,y=train_labels,by='building_id') 

head(train_data)

,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,⋯,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
1,4,30,266,1224,1,25,5,2,t,r,⋯,0,0,0,0,0,0,0,0,0,2
2,8,17,409,12182,2,0,13,7,t,r,⋯,0,0,0,0,0,0,0,0,0,3
3,12,17,716,7056,2,5,12,6,o,r,⋯,0,0,0,0,0,0,0,0,0,3
4,16,4,651,105,2,80,5,4,n,r,⋯,0,0,0,0,0,0,0,0,0,2
5,17,3,1387,3909,5,40,5,10,t,r,⋯,0,0,0,0,0,0,0,0,0,2
6,25,26,1132,6645,2,0,6,6,t,w,⋯,0,0,0,0,0,0,0,0,0,1


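The superstructure indicators (`cement-mortar-stone`, `cement-mortar-brick`, `timber`, `bamboo`, `rc-non-engineered`, `rc-engineered`, `other`, `adobe-mud`, `mud-mortar-brick`, `mud-mortar-stone` and `stone-flag`) are grouped into two categories:

* **ROBUST**: `cement-mortar-stone`, `cement-mortar-brick`, `timber`, `bamboo`, `rc-non-engineered`, `rc-engineered` and `other`.
* **NON-ROBUST**: `adobe-mud`, `mud-mortar-brick`, `mud-mortar-stone` and `stone-flag`.

A building is labelled non-robust as soon as any of the non-robust indicators is set, even if it also uses a robust material.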

In [None]:
group_superstructure <- function(data) {
  data <- data %>% mutate(
    superstructure =
      ifelse(
        has_superstructure_adobe_mud == 1 |
          has_superstructure_mud_mortar_brick == 1 |
          has_superstructure_mud_mortar_stone == 1 |
          has_superstructure_stone_flag == 1,
        "non-robust",
        "robust"
      )
  )
  
  data$superstructure <- as.factor(data$superstructure)
  # Drop the original indicator columns and return the result explicitly
  data %>% select(-starts_with('has_superstructure'))
}

train_data <- group_superstructure(train_data)
test_data <- group_superstructure(test_data)
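
To make the "any non-robust material wins" rule concrete, here is a minimal base-R sketch on invented rows. The data frame `toy` and the shortened column list are assumptions for illustration; the notebook's `group_superstructure()` remains the actual implementation.

```r
# Toy rows (invented): each column is a 0/1 material indicator.
toy <- data.frame(
  has_superstructure_adobe_mud = c(1, 0, 0),
  has_superstructure_mud_mortar_stone = c(0, 0, 1),
  has_superstructure_cement_mortar_brick = c(0, 1, 1),
  has_superstructure_timber = c(0, 0, 0)
)

# Subset of the non-robust indicators, enough for this illustration.
non_robust_cols <- c(
  "has_superstructure_adobe_mud",
  "has_superstructure_mud_mortar_stone"
)

# A row is non-robust if at least one non-robust indicator equals 1.
is_non_robust <- rowSums(toy[non_robust_cols] == 1) > 0
superstructure <- factor(ifelse(is_non_robust, "non-robust", "robust"))
superstructure
```

The third row combines mud-mortar stone with cement-mortar brick and is still classed as non-robust, which matches the behaviour of the `ifelse()` condition above.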

Grouping of the secondary-use indicators into the categories `HOUSING`, `GOVERNANCE`, `AGRICULTURE`, `SERVICES`, `INDUSTRY` and `NONE`:

* **HOUSING**: `hotel`, `rental`.
* **GOVERNANCE**: `gov_office`, `institution`.
* **AGRICULTURE**: `agriculture`.
* **SERVICES**: `police`, `school`, `health_post`.
* **INDUSTRY**: `industry`.
* **NONE**: no listed secondary use (this also covers `other`, which the code does not map to any group).

In [None]:
group_secondary_use <- function(data) {
  data <- data %>% mutate(
    secondary_use =
      ifelse(
        has_secondary_use_hotel == 1 |
          has_secondary_use_rental == 1,
        'housing',
        ifelse(
          has_secondary_use_gov_office == 1 |
            has_secondary_use_institution == 1,
          'governance',
          ifelse(
            has_secondary_use_agriculture == 1,
            'agriculture',
            ifelse(
              has_secondary_use_use_police == 1 |
                has_secondary_use_school == 1 |
                has_secondary_use_health_post == 1,
              'services',
              ifelse(has_secondary_use_industry == 1, 'industry', 'none')
            )
          )
        )
      )
  )
  
  data$secondary_use <- as.factor(data$secondary_use)
  # Drop the original indicator columns and return the result explicitly
  data %>% select(-starts_with('has_secondary'))
}

train_data <- group_secondary_use(train_data)
test_data <- group_secondary_use(test_data)
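
The nested `ifelse()` calls encode a priority order: the first matching group wins (housing, then governance, agriculture, services, industry). As a sketch of the same idea in a flatter form, `dplyr::case_when()` evaluates its conditions top to bottom. The example uses an invented data frame `toy` with only three of the indicator columns, so it is an illustration rather than a drop-in replacement.

```r
library(dplyr)

# Toy rows (invented); only a subset of the indicator columns is shown.
toy <- data.frame(
  has_secondary_use_hotel = c(1, 0, 0, 0),
  has_secondary_use_agriculture = c(1, 1, 0, 0),
  has_secondary_use_industry = c(0, 0, 1, 0)
)

toy <- toy %>%
  mutate(
    # Conditions are checked top to bottom; the first TRUE wins,
    # mirroring the nesting order of an ifelse() chain.
    secondary_use = factor(case_when(
      has_secondary_use_hotel == 1 ~ "housing",
      has_secondary_use_agriculture == 1 ~ "agriculture",
      has_secondary_use_industry == 1 ~ "industry",
      TRUE ~ "none"
    ))
  )

toy$secondary_use
```

The first toy row has both the hotel and the agriculture flag set and is assigned to `housing`, because that condition is listed first.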

# **Model: C4.5**

In [None]:
head(train_data)
colnames(train_data)

,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,ground_floor_type,other_floor_type,position,plan_configuration,legal_ownership_status,count_families,damage_grade,superstructure,secondary_use
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<fct>,<fct>,<fct>
1,4,30,266,1224,1,25,5,2,t,r,n,f,j,s,d,v,0,2,non-robust,none
2,8,17,409,12182,2,0,13,7,t,r,n,f,q,s,d,v,1,3,non-robust,none
3,12,17,716,7056,2,5,12,6,o,r,q,f,q,s,d,v,1,3,non-robust,none
4,16,4,651,105,2,80,5,4,n,r,n,f,q,s,d,v,1,2,non-robust,none
5,17,3,1387,3909,5,40,5,10,t,r,n,f,q,o,d,v,1,2,non-robust,none
6,25,26,1132,6645,2,0,6,6,t,w,n,f,x,s,d,a,1,1,robust,none


Information gain of each predictor with respect to `damage_grade`:

In [None]:
gain <- InfoGainAttributeEval(damage_grade~ ., train_data)

gain
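
For intuition about what this score measures, the information gain of a nominal attribute can be computed by hand as the class entropy minus the conditional entropy given the attribute. The sketch below is a plain base-R illustration on invented data; the helper names `entropy` and `info_gain` are hypothetical, and it does not reproduce how Weka handles numeric attributes (which it discretises first).

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a nominal attribute x with respect to class y:
# H(y) minus the weighted average of H(y | x = v) over the values v of x.
info_gain <- function(x, y) {
  groups <- split(y, x, drop = TRUE)
  weights <- vapply(groups, length, integer(1)) / length(y)
  cond_entropy <- sum(weights * vapply(groups, entropy, numeric(1)))
  entropy(y) - cond_entropy
}

# Toy data (invented): `superstructure` separates the classes perfectly,
# `roof_type` carries no information about them.
y <- factor(c(1, 1, 3, 3))
superstructure <- factor(c("robust", "robust", "non-robust", "non-robust"))
roof_type <- factor(c("n", "q", "n", "q"))

info_gain(superstructure, y)  # 1 bit, equal to H(y)
info_gain(roof_type, y)       # 0 bits
```

Under this reading, a threshold such as 0.1 keeps only attributes that remove at least a tenth of a bit of uncertainty about the damage grade.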

Feature selection:

In [None]:
# Keep only the predictors whose information gain exceeds 0.1.
# Selecting by name avoids relying on column positions.
selected <- names(gain)[gain > 0.1]

train_data <- train_data[, c(selected, "damage_grade")]
test_data <- test_data[, selected]

In [None]:
cat("Variables selected for prediction: \n")
colnames(test_data)

Variables selected for prediction: 


## **Best model: C4.5 with M=25 & C=0.1**

In [None]:
# Set the selected hyperparameters:
# M = minimum number of instances per leaf, C = pruning confidence threshold
jctrl <- Weka_control(M=25,C=0.1)

# Build the model
modelC4.5 = J48(damage_grade~. ,train_data, control=jctrl)

# 10-fold cross-validation
cv_resul = evaluate_Weka_classifier(modelC4.5,numFolds=10)
cv_resul

# Predict on the test set
modelC4.5.pred = predict(modelC4.5, newdata = test_data)

=== 10 Fold Cross Validation ===

=== Summary ===

Correctly Classified Instances      189036               72.5385 %
Incorrectly Classified Instances     71565               27.4615 %
Kappa statistic                          0.4741
Mean absolute error                      0.2625
Root mean squared error                  0.3639
Relative absolute error                 70.9491 %
Root relative squared error             84.6086 %
Total Number of Instances           260601     

=== Confusion Matrix ===

      a      b      c   <-- classified as
  10412  14411    301 |      a = 1
   4916 126302  17041 |      b = 2
    363  34533  52322 |      c = 3
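
The summary figures can be re-derived from the confusion matrix, which also exposes per-class behaviour that the overall accuracy hides. The sketch below copies the matrix from the output above; the variable names are arbitrary.

```r
# Confusion matrix copied from the cross-validation output above
# (rows = true class, columns = predicted class).
cm <- matrix(
  c(10412,  14411,   301,
     4916, 126302, 17041,
      363,  34533, 52322),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("1", "2", "3"), predicted = c("1", "2", "3"))
)

# Accuracy; for single-label multiclass problems this equals micro F1.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall and precision.
recall <- diag(cm) / rowSums(cm)
precision <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(rbind(recall, precision), 4)
```

Recall is roughly 0.41 for class 1 against roughly 0.85 for class 2, so most errors come from low-damage and destroyed buildings being predicted as medium damage.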

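## **Alternative: C4.5 2-vs-all & 1-vs-3 with clustering by feature similarity**

This approach scored worse on the test set than the model proposed above (0.7194 micro F1). The code below is disabled and kept only for reference; it relies on `train_data_2_vs_A`, which is not defined in this notebook.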

In [None]:
# Disabled experiment, kept for reference only.

# cluster_buildings <- function(data, k = round(sqrt(nrow(data))), iters = 10000) {
#   # num_data <- data %>% select_if(is.numeric)
#   # cat_data <- data %>% select_if(is.factor)
#
#   # ohe_cat <- one_hot(as.data.table(cat_data))
#   # data <- data.frame(num_data, ohe_cat)
#
#   # kmeans(data, centers = k, iter.max = iters, algorithm = 'MacQueen')
# }
#
# # kmeans_results_tr <- cluster_buildings(train_data %>% select(-damage_grade))
# # train_data <- train_data %>% mutate(cluster=kmeans_results_tr$cluster)
#
# # kmeans_results_ts <- cluster_buildings(test_data)
# # test_data <- test_data %>% mutate(cluster=kmeans_results_ts$cluster)
#
# a <- InfoGainAttributeEval(damage_grade~ ., train_data_2_vs_A)
#
# jctrl <- Weka_control(M=25,C=0.1)
#
# modelC4.5 = J48(damage_grade~. ,train_data_2_vs_A[,names(a[a>0.1])], control=jctrl)
#
# modelC4.5.pred_2vsA = predict(modelC4.5, newdata = test_data[,names(a[a>0.1])])
#
# res <- as.data.frame(cbind(ids, modelC4.5.pred_2vsA))
#
# colnames(res) <- c("building_id", "damage_grade")
#
# train_data_1_vs_3 <- train_data[which(train_data$damage_grade != 2),]
# train_data_1_vs_3$damage_grade <- as.factor(ifelse(train_data_1_vs_3$damage_grade == 1, 1,0))
#
# a <- InfoGainAttributeEval(damage_grade~ ., train_data_1_vs_3)
#
# modelC4.5 = J48(damage_grade~. ,train_data_1_vs_3[,names(a[a>0.1])], control=jctrl)
#
# modelC4.5.pred_1vs3 = predict(modelC4.5, newdata = test_data[,names(a[a>0.1])])
#
# res$pred1vs3 <- modelC4.5.pred_1vs3
#
# res$damage_grade <- ifelse(res$damage_grade==2,2,ifelse(res$pred1vs3==0,3,1))
#
# table(res$damage_grade)
#
# res <- res[,1:2]
#
# write.csv(res,"submit.csv", row.names = FALSE)
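
The way the two binary classifiers are combined in that experiment can be sketched on invented predictions. The vectors `is_class_2` and `one_vs_three` below are hypothetical stand-ins for the outputs of the 2-vs-all and 1-vs-3 models, using the same 1/0 coding as the disabled code (1 means class 1, 0 means class 3).

```r
# Toy predictions (invented) for five buildings.
# Stage 1: does the building belong to class 2? (2-vs-all)
is_class_2 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
# Stage 2: among the rest, 1 means class 1 and 0 means class 3 (1-vs-3).
one_vs_three <- c(1, 1, 0, 0, 0)

# Stage 1 decides first; stage 2 is consulted only when stage 1 says "not 2".
damage_grade <- ifelse(is_class_2, 2, ifelse(one_vs_three == 1, 1, 3))
damage_grade
```

Note that the fourth building is assigned class 2 regardless of what the 1-vs-3 model says, so any mistake in the first stage cannot be corrected by the second.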

# **Results: 0.7239 micro F1 on the DrivenData test set**

In [None]:
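cat("Prediction \n")
table(modelC4.5.pred)

# Store the final prediction; convert the factor through as.character()
# so the submitted values are the class labels, not the internal factor codes
pred <- data.frame(
  building_id = ids,
  damage_grade = as.integer(as.character(modelC4.5.pred))
)

write.csv(pred,"submit.csv", row.names = FALSE)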

modelC4.5.pred
    1     2     3 
 5205 58399 23264 
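
One detail worth keeping in mind when writing predictions to a submission file: combining a factor with `cbind()` or `as.integer()` yields its internal level codes, not its labels. With levels `1`, `2`, `3` the two coincide, but that is a property of this particular target. The sketch below uses an invented factor whose labels do not start at 1 to show the difference.

```r
# Factor whose labels do not start at 1 (invented example).
pred <- factor(c("2", "3", "2"), levels = c("2", "3"))

as.integer(pred)                # internal codes: 1 2 1
as.integer(as.character(pred))  # labels as numbers: 2 3 2

# cbind() on a factor keeps the internal codes, not the labels.
cbind(id = c(10, 11, 12), pred)[, "pred"]
```

Converting through `as.character()` first is a cheap way to stay safe if the set of predicted classes ever changes.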