# **EARTHQUAKE DAMAGE PREDICTION**

https://www.drivendata.org/competitions/57/nepal-earthquake/page/136/

* Andrea Morales Garzón `andreamgmg@correo.ugr.es`
* Ithiel Piñero Darias `ithiel@correo.ugr.es`
* Paula Villa Martín `pvilla@correo.ugr.es`
* Antonio Manjavacas Lucas `manjavacas@correo.ugr.es`

Based on factors related to the location and construction of buildings, the goal of this work is to predict the level of damage caused to buildings in Nepal by the 2015 Gorkha earthquake.

The data was collected through surveys conducted by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal. This survey is one of the largest post-disaster datasets ever collected, and it contains valuable information on earthquake impacts, household conditions, and socio-economic and demographic statistics.

We will try to predict the ordinal variable `damage_grade`, which represents the level of damage to the buildings hit by the earthquake:

* `damage_grade` = 1 represents low damage;
* `damage_grade` = 2 represents a medium amount of damage;
* `damage_grade` = 3 represents almost complete destruction of the building.
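
Since `damage_grade` is ordinal, it can be encoded as an ordered factor when the ordering matters. The following is a minimal sketch on toy values; the labels `low`, `medium` and `high` are illustrative names, not part of the dataset, and the models later in this notebook treat the variable as a plain (unordered) factor.

```r
# Toy labels; the real ones come from the competition's label file
damage_grade <- c(3, 2, 3, 1, 2)

# Illustrative level names (assumption: not defined by the dataset)
damage_ordered <- factor(
  damage_grade,
  levels = 1:3,
  labels = c("low", "medium", "high"),
  ordered = TRUE
)

# Ordered factors support comparisons between levels
severe <- damage_ordered >= "medium"
table(damage_ordered)
```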


In [None]:
options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
gc()

          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2936402 156.9    5303322 283.3  5303322 283.3
Vcells 17466907 133.3   54036049 412.3 81938379 625.2


In [None]:
install.packages('tidyverse')
install.packages('NoiseFiltersR')
install.packages('caret')
install.packages('RWeka')
install.packages('MLmetrics')
install.packages('UBL')
install.packages('mltools')
install.packages('data.table')

# **Preprocessing**

In [None]:
set.seed(42)

library(tidyverse)
library(NoiseFiltersR)
library(caret)
library(RWeka)
library(MLmetrics)
library(UBL)
library(mltools)
library(data.table)

Data loading:

In [None]:
TRAIN_VALUES_ID = '15ykpkKIJNKlEXQQ3taspRjUZ2sJN5zS_'
TRAIN_LABELS_ID = '1nrNVfj9NmNvwPhXuucUBYhh-FmODCCBK'
TEST_VALUES_ID = '1_GpX1sh7XkJLm-kyOpcObzXICW5Z-tb_'

load_file <- function(id) {
  read_csv(sprintf('https://docs.google.com/uc?id=%s&export=download', id), col_types=cols())
}

train_values <- load_file(TRAIN_VALUES_ID)
train_labels <- load_file(TRAIN_LABELS_ID)
test_values <- load_file(TEST_VALUES_ID)

test_ids <- test_values$building_id
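
Later in the notebook the labels are attached to the features by position, which assumes that both files list the buildings in the same order. A hedged sketch of how that assumption could be checked, and of a key-based join as an alternative, using made-up toy tables rather than the real files:

```r
# Toy stand-ins for train_values and train_labels (made-up ids and values)
values <- data.frame(building_id = c(11, 22, 33), age = c(30, 10, 5))
labels <- data.frame(building_id = c(33, 11, 22), damage_grade = c(1, 3, 2))

# Attaching columns by position is only safe if the ids line up row by row
same_order <- identical(values$building_id, labels$building_id)

# Joining on the key gives the right pairing regardless of row order
joined <- merge(values, labels, by = "building_id")
joined
```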

Variable conversion:

In [None]:
cols_to_factor <- c(9:15, 27)

train_values[cols_to_factor] <-
  lapply(train_values[cols_to_factor], factor)
test_values[cols_to_factor] <-
  lapply(test_values[cols_to_factor], factor)

train_labels$damage_grade <- factor(train_labels$damage_grade)
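
The conversion above selects columns by position. As a sketch of an alternative that does not depend on column order, the same idea can be written with column names; the toy data frame below is an assumption made for illustration only.

```r
# Toy data frame with one numeric and two categorical columns
toy <- data.frame(
  age = c(30, 10),
  roof_type = c("n", "q"),
  position = c("t", "s"),
  stringsAsFactors = FALSE
)

# Select the categorical columns by name instead of by index
categorical <- c("roof_type", "position")
toy[categorical] <- lapply(toy[categorical], factor)

sapply(toy, class)
```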

Category grouping:

In [None]:
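# Replace either of two labels (given as regular expressions) with a new label
group_label <- function(x, label1, label2, new_label) {
  x <- sub(label1, new_label, x)
  sub(label2, new_label, x)
}

# Merge two categories of a column and return the result as a factor
group_cat <- function(data, var, label1, label2, grouped_label) {
  as.factor(group_label(as.character(data[[var]]), label1, label2, grouped_label))
}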

train_values$foundation_type <-
  group_cat(train_values, "foundation_type", "^u$", "^w$", "u+w")
train_values$ground_floor_type <-
  group_cat(train_values, "ground_floor_type", "^f$", "^x$", "f+x")
train_values$ground_floor_type <-
  group_cat(train_values, "ground_floor_type", "^m$", "^z$", "m+z")

train_values$plan_configuration <-
  sub("^a$", "a+c+m+o+u", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^c$", "a+c+m+o+u", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^m$", "a+c+m+o+u", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^o$", "a+c+m+o+u", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^u$", "a+c+m+o+u", train_values$plan_configuration)

train_values$plan_configuration <-
  sub("^d$", "d+n+q", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^n$", "d+n+q", train_values$plan_configuration)
train_values$plan_configuration <-
  sub("^q$", "d+n+q", train_values$plan_configuration)

train_values$roof_type <-
  group_cat(train_values, "roof_type", "^n$", "^q$", "n+q")
train_values$other_floor_type <-
  group_cat(train_values, "other_floor_type", "^q$", "^x$", "q+x")
train_values$legal_ownership_status <-
  group_cat(train_values, "legal_ownership_status", "^r$", "^v$", "r+v")

test_values$foundation_type <-
  group_cat(test_values, "foundation_type", "^u$", "^w$", "u+w")
test_values$ground_floor_type <-
  group_cat(test_values, "ground_floor_type", "^f$", "^x$", "f+x")
test_values$ground_floor_type <-
  group_cat(test_values, "ground_floor_type", "^m$", "^z$", "m+z")

test_values$plan_configuration <-
  sub("^a$", "a+c+m+o+u", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^c$", "a+c+m+o+u", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^m$", "a+c+m+o+u", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^o$", "a+c+m+o+u", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^u$", "a+c+m+o+u", test_values$plan_configuration)

test_values$plan_configuration <-
  sub("^d$", "d+n+q", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^n$", "d+n+q", test_values$plan_configuration)
test_values$plan_configuration <-
  sub("^q$", "d+n+q", test_values$plan_configuration)

test_values$roof_type <-
  group_cat(test_values, "roof_type", "^n$", "^q$", "n+q")
test_values$other_floor_type <-
  group_cat(test_values, "other_floor_type", "^q$", "^x$", "q+x")
test_values$legal_ownership_status <-
  group_cat(test_values, "legal_ownership_status", "^r$", "^v$", "r+v")

train_values$plan_configuration <-
  as.factor(train_values$plan_configuration)
test_values$plan_configuration <-
  as.factor(test_values$plan_configuration)
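
The repeated `sub()` calls above could be condensed with a small helper that maps several old labels to one new label in a single call. This is only a sketch: `collapse_levels` is a hypothetical helper, not part of the notebook or of any package, and it is shown on a toy vector.

```r
# Hypothetical helper: `groups` maps each new label to the labels it absorbs
collapse_levels <- function(x, groups) {
  x <- as.character(x)
  for (new_label in names(groups)) {
    x[x %in% groups[[new_label]]] <- new_label
  }
  factor(x)
}

# Toy values using the same groupings as plan_configuration above
plan <- c("a", "d", "q", "u", "s", "c")
grouped <- collapse_levels(
  plan,
  list("a+c+m+o+u" = c("a", "c", "m", "o", "u"), "d+n+q" = c("d", "n", "q"))
)
grouped
```

Labels that belong to no group (here `s`) are left unchanged.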

Grouping of `superstructure`:

* **ROBUST**: `cement-mortar-stone`, `cement-mortar-brick`, `timber`, `bamboo`, `rc-non-engineered`, `rc-engineered` and `other`.
* **NON-ROBUST**: `adobe-mud`, `mud-mortar-brick`, `mud-mortar-stone` and `stone-flag`.

In [None]:
group_superstructure <- function(data) {
  data <- data %>% mutate(
    superstructure =
      ifelse(
        has_superstructure_adobe_mud == 1 |
          has_superstructure_mud_mortar_brick == 1 |
          has_superstructure_mud_mortar_stone == 1 |
          has_superstructure_stone_flag == 1,
        "non-robust",
        "robust"
      )
  )
  
  data$superstructure <- as.factor(data$superstructure)
  # Drop the original binary flags and return the transformed data
  data %>% select(-starts_with('has_superstructure'))
}

train_values <- group_superstructure(train_values)
test_values <- group_superstructure(test_values)
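
To illustrate the rule above on a small scale: a building counts as non-robust as soon as any non-robust material flag is set, even if a robust material is also present. The toy data frame below is made up and only includes three of the flags.

```r
# Toy flags (made-up values); row 1 has both adobe/mud and timber
toy <- data.frame(
  has_superstructure_adobe_mud = c(1, 0, 0),
  has_superstructure_mud_mortar_stone = c(0, 0, 1),
  has_superstructure_timber = c(1, 1, 0)
)

non_robust <- c(
  "has_superstructure_adobe_mud",
  "has_superstructure_mud_mortar_stone"
)

# Any non-robust flag set to 1 makes the whole building non-robust
toy$superstructure <- factor(
  ifelse(rowSums(toy[non_robust] == 1) > 0, "non-robust", "robust")
)
toy$superstructure
```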

Grouping of categorical variables: secondary use:

* **HOUSING**: `hotel`, `rental`.
* **GOVERNANCE**: `gov_office`, `institution`.
* **AGRICULTURE**: `agriculture`.
* **SERVICES**: `police`, `school`, `health_post`.
* **INDUSTRY**: `industry`.
* **NONE**.

In [None]:
train_values %>% select(starts_with('has_secondary')) %>% names

group_secondary_use <- function(data) {
  data <- data %>% mutate(
    secondary_use =
      ifelse(
        has_secondary_use_hotel == 1 |
          has_secondary_use_rental == 1,
        'housing',
        ifelse(
          has_secondary_use_gov_office == 1 |
            has_secondary_use_institution == 1,
          'governance',
          ifelse(
            has_secondary_use_agriculture == 1,
            'agriculture',
            ifelse(
              has_secondary_use_use_police == 1 |
                has_secondary_use_school == 1 |
                has_secondary_use_health_post == 1,
              'services',
              ifelse(has_secondary_use_industry == 1, 'industry', 'none')
            )
          )
        )
      )
  )
  
  data$secondary_use <- as.factor(data$secondary_use)
  # Drop the original binary flags and return the transformed data
  data %>% select(-starts_with('has_secondary'))
}

train_values <- group_secondary_use(train_values)
test_values <- group_secondary_use(test_values)
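
The nested `ifelse()` calls above encode a priority order: the first matching group wins. A possible way to make that order explicit is sketched below with a hypothetical helper, `assign_first_match`, on a toy data frame that only contains three of the flags.

```r
# Toy flags (made-up values); row 4 matches two groups
toy <- data.frame(
  has_secondary_use_hotel = c(1, 0, 0, 0),
  has_secondary_use_agriculture = c(0, 1, 0, 1),
  has_secondary_use_school = c(0, 0, 0, 1)
)

# Groups listed from highest to lowest priority
groups <- list(
  housing = "has_secondary_use_hotel",
  agriculture = "has_secondary_use_agriculture",
  services = "has_secondary_use_school"
)

# Hypothetical helper: label each row with the first group that has a flag set
assign_first_match <- function(data, groups, default = "none") {
  use <- rep(default, nrow(data))
  # Walk from lowest to highest priority so earlier groups overwrite later ones
  for (label in rev(names(groups))) {
    hit <- rowSums(data[groups[[label]]] == 1) > 0
    use[hit] <- label
  }
  factor(use)
}

secondary_use <- assign_first_match(toy, groups)
secondary_use
```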

# **Model: RIPPER**

In [None]:
train_data <- data.frame(train_values, damage_grade = train_labels$damage_grade)
head(train_data)

,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,ground_floor_type,other_floor_type,position,plan_configuration,legal_ownership_status,count_families,superstructure,secondary_use,damage_grade
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<fct>,<fct>,<fct>
1,802906,6,487,12198,2,30,6,5,t,r,n+q,f+x,q+x,t,d+n+q,r+v,1,non-robust,none,3
2,28830,8,900,2812,2,10,8,7,o,r,n+q,f+x,q+x,s,d+n+q,r+v,1,non-robust,none,2
3,94947,21,363,8973,2,10,5,5,t,r,n+q,f+x,q+x,t,d+n+q,r+v,1,non-robust,none,3
4,590882,22,418,10694,2,10,6,5,t,r,n+q,f+x,q+x,s,d+n+q,r+v,1,non-robust,none,2
5,201944,11,131,1488,3,30,8,9,t,r,n+q,f+x,q+x,s,d+n+q,r+v,1,non-robust,none,3
6,333020,8,558,6089,2,10,9,5,t,r,n+q,f+x,q+x,s,d+n+q,r+v,1,non-robust,agriculture,2


Clustering:

In [None]:
cluster_buildings <-
  function(data, k = round(sqrt(nrow(data))), iters = 10000) {
    # Numeric columns are used as-is (unscaled, and including building_id)
    num_data <- data %>% select_if(is.numeric)
    cat_data <- data %>% select_if(is.factor)
    
    # One-hot encode the categorical columns so that k-means can use them
    ohe_cat <- one_hot(as.data.table(cat_data))
    data <- data.frame(num_data, ohe_cat)
    
    kmeans(data, centers = k, iter.max = iters, algorithm = 'MacQueen')
  }

kmeans_results_tr <- cluster_buildings(train_data %>% select(-damage_grade))
train_data <- train_data %>% mutate(cluster = kmeans_results_tr$cluster)

# Caveat: the test set is clustered independently of the training set, so its
# cluster ids (and its default k) do not correspond to the training clusters
kmeans_results_ts <- cluster_buildings(test_values)
test_values <- test_values %>% mutate(cluster = kmeans_results_ts$cluster)
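
Because k-means numbers its clusters arbitrarily, ids obtained from two separate runs are not comparable. One common alternative, shown here only as a sketch and not as what this notebook does, is to fit k-means on the training data and assign each test row to the nearest training centroid. The data below is synthetic and `assign_to_centroids` is a hypothetical helper.

```r
set.seed(42)

# Synthetic training data: two well-separated groups of 20 points each
train_x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
km <- kmeans(train_x, centers = 2, nstart = 5)

# Hypothetical helper: index of the closest centroid (squared Euclidean distance)
assign_to_centroids <- function(x, centers) {
  apply(x, 1, function(row) which.min(colSums((t(centers) - row)^2)))
}

# Test rows reuse the training centroids, so ids mean the same in both sets
test_x <- rbind(c(0.1, -0.2), c(4.8, 5.1))
test_cluster <- assign_to_centroids(test_x, km$centers)
test_cluster
```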

Information gain:

In [None]:
InfoGainAttributeEval(damage_grade ~ ., train_data)
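
For intuition, the information gain of a categorical attribute is the entropy of the class minus the entropy that remains after splitting by that attribute. The base R sketch below computes it on toy vectors; it only covers categorical attributes and is not meant to reproduce the Weka evaluator's handling of numeric ones.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of categorical attribute x with respect to class y
info_gain <- function(x, y) {
  parts <- split(y, as.character(x))
  conditional <- sum(sapply(
    parts,
    function(part) length(part) / length(y) * entropy(part)
  ))
  entropy(y) - conditional
}

# Toy example: one attribute determines the class, the other is uninformative
y <- c("1", "1", "3", "3")
gain_perfect <- info_gain(c("a", "a", "b", "b"), y)
gain_useless <- info_gain(c("a", "b", "a", "b"), y)
c(perfect = gain_perfect, useless = gain_useless)
```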

## **a) Model 2 vs ALL**

In [None]:
train_data_2_vs_A <- train_data
train_data_2_vs_A$damage_grade <- as.factor(ifelse(train_data$damage_grade == 2, 1, 0))

# InfoGainAttributeEval(damage_grade ~ ., train_data_2_vs_A)
# names(train_data_2_vs_A)

train_data_2_vs_A <- RandUnderClassif(damage_grade ~ ., dat = train_data_2_vs_A, C.perc = list('0'=.67, '1'=.51))
table(train_data_2_vs_A$damage_grade)

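ripper_2_vs_A <- JRip(damage_grade ~ ., data = train_data_2_vs_A)
pred <- predict(ripper_2_vs_A, newdata = test_values)

# No newdata is given, so the metrics below are computed on the training data
evaluate_Weka_classifier(ripper_2_vs_A, class=TRUE)

# cbind() coerces the factor `pred` to its integer codes: "0" -> 1, "1" -> 2.
# A value of 2 in damage_grade therefore marks a predicted damage grade of 2.
res <- as.data.frame(cbind(test_ids, pred))
colnames(res) <- c('building_id', 'damage_grade')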



    0     1 
75269 75612 


=== Summary ===

Correctly Classified Instances       98435               65.2402 %
Incorrectly Classified Instances     52446               34.7598 %
Kappa statistic                          0.3047
Mean absolute error                      0.4419
Root mean squared error                  0.47  
Relative absolute error                 88.3755 %
Root relative squared error             94.0082 %
Total Number of Instances           150881     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.605    0.301    0.667      0.605    0.635      0.306    0.677     0.660     0
                 0.699    0.395    0.640      0.699    0.668      0.306    0.677     0.622     1
Weighted Avg.    0.652    0.348    0.654      0.652    0.652      0.306    0.677     0.641     

=== Confusion Matrix ===

     a     b   <-- classified as
 45556 29713 |     a = 0
 22733 52879 |     b = 1
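
The summary figures above can be recomputed from the confusion matrix, where rows are the actual classes and columns the predicted ones. A short sketch that copies the four counts from the output:

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm <- matrix(
  c(45556, 29713,
    22733, 52879),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / rowSums(cm)   # TP rate per class
precision <- diag(cm) / colSums(cm)
f1        <- 2 * precision * recall / (precision + recall)

round(accuracy, 4)
round(rbind(precision, recall, f1), 3)
```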

## **b) Model 1 vs 3**

In [None]:
train_data_1_vs_3 <- train_data[which(train_data$damage_grade != 2), ]
train_data_1_vs_3$damage_grade <- as.factor(ifelse(train_data_1_vs_3$damage_grade == 1, 1, 0))

# InfoGainAttributeEval(damage_grade ~ ., train_data_1_vs_3)
# names(train_data_1_vs_3)

info_1_vs_3 <- InfoGainAttributeEval(damage_grade ~ ., train_data_1_vs_3)
vars_1_vs_3 <- names(which(info_1_vs_3 >= 0.1))
train_data_1_vs_3 <- train_data_1_vs_3 %>% select(damage_grade, all_of(vars_1_vs_3))

train_data_1_vs_3 <- RandUnderClassif(damage_grade ~ ., dat = train_data_1_vs_3)
table(train_data_1_vs_3$damage_grade)

ripper_1_vs_3 <- JRip(damage_grade ~ ., data = train_data_1_vs_3)
pred <- predict(ripper_1_vs_3, newdata = test_values)

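# Evaluated on the (under-sampled) training data, since no newdata is given
evaluate_Weka_classifier(ripper_1_vs_3, class=TRUE)

# Keep grade 2 where the first model predicted it (stored as integer code 2);
# otherwise use the second model: class 1 -> grade 1, class 0 -> grade 3
res$pred1vs3 <- pred
res$damage_grade <- ifelse(res$damage_grade == 2, 2, ifelse(res$pred1vs3 == 0, 3, 1))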



    0     1 
25124 25124 


=== Summary ===

Correctly Classified Instances       45723               90.9947 %
Incorrectly Classified Instances      4525                9.0053 %
Kappa statistic                          0.8199
Mean absolute error                      0.1579
Root mean squared error                  0.281 
Relative absolute error                 31.5826 %
Root relative squared error             56.1984 %
Total Number of Instances            50248     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.881    0.061    0.935      0.881    0.907      0.821    0.925     0.920     0
                 0.939    0.119    0.887      0.939    0.913      0.821    0.925     0.880     1
Weighted Avg.    0.910    0.090    0.911      0.910    0.910      0.821    0.925     0.900     

=== Confusion Matrix ===

     a     b   <-- classified as
 22128  2996 |     a = 0
  1529 23595 |     b = 1
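
The two binary models are combined as a cascade: the first decides whether a building is grade 2, and only the remaining buildings are split into grade 1 or 3 by the second. A minimal sketch of that combination on toy predictions follows; `combine_cascade` is a hypothetical helper that compares factor labels directly instead of relying on integer codes.

```r
# Hypothetical helper: merge "2 vs ALL" and "1 vs 3" predictions into grades
combine_cascade <- function(pred_2_vs_all, pred_1_vs_3) {
  is_grade_2 <- as.character(pred_2_vs_all) == "1"
  is_grade_1 <- as.character(pred_1_vs_3) == "1"
  ifelse(is_grade_2, 2L, ifelse(is_grade_1, 1L, 3L))
}

# Toy predictions from both models for four buildings
p_2_vs_all <- factor(c("1", "0", "0", "1"), levels = c("0", "1"))
p_1_vs_3   <- factor(c("1", "1", "0", "0"), levels = c("0", "1"))

grades <- combine_cascade(p_2_vs_all, p_1_vs_3)
grades
```

The second model's output is ignored wherever the first one predicts grade 2.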

# **Results**

In [None]:
res <- res[, 1:2]
table(res$damage_grade)

write.csv(res, 'test_labels.csv', row.names = FALSE)


    1     2     3 
 8594 48913 29361 