# Richter's Predictor: Modeling Earthquake Damage

## 1. Business Understanding

### 1.1 Problem Statement

In April 2015, the magnitude 7.8 Gorkha earthquake occurred near the Gorkha district of Gandaki Pradesh, Nepal. Almost 9,000 lives were lost, millions of people were instantly made homeless, and $10 billion in damages (about half of Nepal's nominal GDP) were incurred. Following the earthquake, Nepal carried out a massive household survey using mobile technology to assess building damage in the earthquake-affected districts and to identify beneficiaries eligible for government assistance for housing reconstruction.

In the years since, the Nepalese government has worked intensely to help rebuild the affected districts' infrastructure. Throughout this process, the National Planning Commission, along with Kathmandu Living Labs and the Central Bureau of Statistics, has generated one of the largest post-disaster datasets ever collected, containing valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics.

#### 1.1.1 Objective

- Create a model that can predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

#### 1.1.2 Specific Objectives

#### 1.1.3 Metric of Success

The best performing model will be selected based on:

   - An F1 Score > 0.80
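Because `damage_grade` has three classes, an F1 score has to be averaged across them in some way, and the averaging is not stated here. The sketch below assumes micro-averaging via scikit-learn's `f1_score` and shows macro-averaging for comparison; the label lists are made-up values for illustration only.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted damage grades (1 = low, 2 = medium, 3 = severe)
y_true = [1, 2, 2, 3, 3, 2, 1, 3]
y_pred = [1, 2, 3, 3, 2, 2, 1, 3]

# Micro-averaging pools true/false positives over all classes before computing F1
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Macro-averaging computes F1 per class and takes the unweighted mean
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"micro F1: {micro_f1:.3f}, macro F1: {macro_f1:.3f}")
```

For single-label multi-class problems such as this one, the micro-averaged F1 equals plain accuracy, so the choice of averaging affects how the 0.80 threshold should be read.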

## 2. Data Understanding

For this project we were provided with a dataset that mainly consists of information on the buildings' structure and their legal ownership. Each row in the dataset represents a specific building in the region that was hit by the Gorkha earthquake.
There are 39 columns in this dataset, of which the building_id column is a unique and random identifier.

The remaining 38 features are described as follows:

-  geo_level_1_id, geo_level_2_id, geo_level_3_id (type: int): geographic region in which the building exists, from largest (level 1) to most specific sub-region (level 3). Possible values: level 1: 0-30, level 2: 0-1427, level 3: 0-12567.

-  count_floors_pre_eq (type: int): number of floors in the building before the earthquake.
-  age (type: int): age of the building in years.
-  area_percentage (type: int): normalized area of the building footprint.
-  height_percentage (type: int): normalized height of the building footprint.
-  land_surface_condition (type: categorical): surface condition of the land where the building was built. Possible values: n, o, t.
-  foundation_type (type: categorical): type of foundation used while building. Possible values: h, i, r, u, w.
-  roof_type (type: categorical): type of roof used while building. Possible values: n, q, x.
-  ground_floor_type (type: categorical): type of the ground floor. Possible values: f, m, v, x, z.
-  other_floor_type (type: categorical): type of construction used in the floors above the ground floor (excluding the roof). Possible values: j, q, s, x.
-  position (type: categorical): position of the building. Possible values: j, o, s, t.
-  plan_configuration (type: categorical): building plan configuration. Possible values: a, c, d, f, m, n, o, q, s, u.
-  has_superstructure_adobe_mud (type: binary): flag variable that indicates if the superstructure was made of Adobe/Mud.
-  has_superstructure_mud_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Stone.
-  has_superstructure_stone_flag (type: binary): flag variable that indicates if the superstructure was made of Stone.
-  has_superstructure_cement_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Stone.
-  has_superstructure_mud_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Brick.
-  has_superstructure_cement_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Brick.
-  has_superstructure_timber (type: binary): flag variable that indicates if the superstructure was made of Timber.
-  has_superstructure_bamboo (type: binary): flag variable that indicates if the superstructure was made of Bamboo.
-  has_superstructure_rc_non_engineered (type: binary): flag variable that indicates if the superstructure was made of non-engineered reinforced concrete.
-  has_superstructure_rc_engineered (type: binary): flag variable that indicates if the superstructure was made of engineered reinforced concrete.
-  has_superstructure_other (type: binary): flag variable that indicates if the superstructure was made of any other material.
-  legal_ownership_status (type: categorical): legal ownership status of the land where building was built. Possible values: a, r, v, w.
-  count_families (type: int): number of families that live in the building.
-  has_secondary_use (type: binary): flag variable that indicates if the building was used for any secondary purpose.
-  has_secondary_use_agriculture (type: binary): flag variable that indicates if the building was used for agricultural purposes.
-  has_secondary_use_hotel (type: binary): flag variable that indicates if the building was used as a hotel.
-  has_secondary_use_rental (type: binary): flag variable that indicates if the building was used for rental purposes.
-  has_secondary_use_institution (type: binary): flag variable that indicates if the building was used as a location of any institution.
-  has_secondary_use_school (type: binary): flag variable that indicates if the building was used as a school.
-  has_secondary_use_industry (type: binary): flag variable that indicates if the building was used for industrial purposes.
-  has_secondary_use_health_post (type: binary): flag variable that indicates if the building was used as a health post.
-  has_secondary_use_gov_office (type: binary): flag variable that indicates if the building was used as a government office.
-  has_secondary_use_use_police (type: binary): flag variable that indicates if the building was used as a police station.
-  has_secondary_use_other (type: binary): flag variable that indicates if the building was secondarily used for other purposes.

We have also been provided with three datasets:

- test_values.csv - contains the test features (contains 39 columns)
- train_values.csv - contains the train features (contains 39 columns)
- train_labels.csv - contains the train target labels

The labels dataset contains a single column, 'damage_grade', which represents the level of damage to a building that was hit by the earthquake. There are three damage grades:

- 1 represents low damage
- 2 represents a medium amount of damage
- 3 represents almost complete destruction

## 3. Loading Relevant Libraries & Data

### 3.1 Loading Libraries

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder

### 3.2 Loading Data

In [2]:
# load the train values
features = pd.read_csv('Data/train_values.csv', index_col="building_id")

#previewing the features dataset
features.head()

building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
802906,6,487,12198,2,30,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
28830,8,900,2812,2,10,8,7,o,r,n,...,0,0,0,0,0,0,0,0,0,0
94947,21,363,8973,2,10,5,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
590882,22,418,10694,2,10,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
201944,11,131,1488,3,30,8,9,t,r,n,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# load the train target labels
target = pd.read_csv('Data/train_labels.csv', index_col="building_id")

# previewing the target dataset
target.head()

building_id,damage_grade
802906,3
28830,2
94947,3
590882,2
201944,3
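
Since the success metric depends on how well each damage grade is predicted, it can help to look at how the grades are distributed. A minimal sketch using a small stand-in frame (the values are illustrative, not the real label counts):

```python
import pandas as pd

# Stand-in for the labels frame; the real one is loaded from train_labels.csv
toy_target = pd.DataFrame(
    {"damage_grade": [3, 2, 3, 2, 3, 1, 2, 2]},
    index=pd.Index(range(8), name="building_id"),
)

# Share of each damage grade, ordered by grade rather than by frequency
grade_share = toy_target["damage_grade"].value_counts(normalize=True).sort_index()
print(grade_share)
```

Applied to the real `target` frame, the same call would show whether any grade is rare enough to warrant resampling.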


Merge the features and target datasets.

In [4]:
# merge the datasets
data = features.merge(target, how="inner", on="building_id")

# preview the data
data.head()

building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,...,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
802906,6,487,12198,2,30,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,3
28830,8,900,2812,2,10,8,7,o,r,n,...,0,0,0,0,0,0,0,0,0,2
94947,21,363,8973,2,10,5,5,t,r,n,...,0,0,0,0,0,0,0,0,0,3
590882,22,418,10694,2,10,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,2
201944,11,131,1488,3,30,8,9,t,r,n,...,0,0,0,0,0,0,0,0,0,3
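
The merge relies on every building_id appearing exactly once in both frames. One way to guard that assumption is the `validate` argument of `DataFrame.merge`. The sketch below uses small stand-in frames and joins on the index directly, which is a variation on the call above rather than the notebook's own code.

```python
import pandas as pd

# Stand-in frames indexed by building_id, mirroring the shape of the real data
toy_features = pd.DataFrame(
    {"age": [30, 10, 10], "count_floors_pre_eq": [2, 2, 3]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)
toy_target = pd.DataFrame(
    {"damage_grade": [3, 2, 3]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side
merged = toy_features.merge(
    toy_target,
    how="inner",
    left_index=True,
    right_index=True,
    validate="one_to_one",
)

# An inner join that loses no rows means every building has a label
assert len(merged) == len(toy_features)
print(merged)
```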


### 3.3 Previewing Data

In [5]:
# checking the shape of the data

print(f'The data has {data.shape[0]} rows and {data.shape[1]} columns')

The data has 260601 rows and 39 columns


In [6]:
# checking the data types of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 802906 to 747594
Data columns (total 39 columns):
 #   Column                                  Non-Null Count   Dtype 
---  ------                                  --------------   ----- 
 0   geo_level_1_id                          260601 non-null  int64 
 1   geo_level_2_id                          260601 non-null  int64 
 2   geo_level_3_id                          260601 non-null  int64 
 3   count_floors_pre_eq                     260601 non-null  int64 
 4   age                                     260601 non-null  int64 
 5   area_percentage                         260601 non-null  int64 
 6   height_percentage                       260601 non-null  int64 
 7   land_surface_condition                  260601 non-null  object
 8   foundation_type                         260601 non-null  object
 9   roof_type                               260601 non-null  object
 10  ground_floor_type                       260601 non-

Generate the descriptive statistics of the data.

In [7]:
data.describe()

,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,has_superstructure_adobe_mud,has_superstructure_mud_mortar_stone,has_superstructure_stone_flag,...,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
count,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,...,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0,260601.0
mean,13.900353,701.074685,6257.876148,2.129723,26.535029,8.018051,5.434365,0.088645,0.761935,0.034332,...,0.033626,0.008101,0.00094,0.000361,0.001071,0.000188,0.000146,8.8e-05,0.005119,2.238272
std,8.033617,412.710734,3646.369645,0.727665,73.565937,4.392231,1.918418,0.284231,0.4259,0.182081,...,0.180265,0.089638,0.030647,0.018989,0.032703,0.013711,0.012075,0.009394,0.071364,0.611814
min,0.0,0.0,0.0,1.0,0.0,1.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,7.0,350.0,3073.0,2.0,10.0,5.0,4.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,12.0,702.0,6270.0,2.0,15.0,7.0,5.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,21.0,1050.0,9412.0,2.0,30.0,9.0,6.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,30.0,1427.0,12567.0,9.0,995.0,100.0,32.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0


The output above summarizes the numerical columns of our dataframe, giving the count, mean, standard deviation, minimum, quartiles, and maximum of each column. The minimum and maximum values are helpful in identifying outliers: age, for instance, has a 75th percentile of 30 but a maximum of 995.
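
One common way to turn the quartiles reported by `describe()` into an outlier flag is the 1.5 × IQR rule. The sketch below applies it to a small made-up series of building ages; it is an illustration of the rule, not necessarily the method used later in this notebook.

```python
import pandas as pd

# Made-up building ages; 995 mimics the extreme maximum seen in the summary
ages = pd.Series([0, 5, 10, 10, 15, 15, 20, 30, 30, 995], name="age")

# First and third quartiles, and the interquartile range between them
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```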

## 4. Data Cleaning

### 4.1 Missing Values

- Checking for missing values

In [8]:
# Define a function to get missing data

def missing_data(data: pd.DataFrame) -> pd.Series:
    """
    Find the columns that have missing values and return, for each of them,
    the number of rows with missing data.
    """
    # count the missing values in every column
    missing_counts = data.isna().sum()

    # keep only the columns that have at least one missing value
    return missing_counts[missing_counts > 0]

In [9]:
# Getting the sum of missing values per column
missing_data(data).to_frame()

(empty DataFrame: a single column, 0, and no rows)


The output is empty, so none of the 39 columns in our dataset contains missing values.

### 4.2 Duplicates

- checking for duplicates

In [10]:
# checking for duplicates

print(f"The data has {data.duplicated().sum()} duplicated rows")

The data has 12319 duplicated rows


In [12]:
# exploring the duplicates

duplicates = data[data.duplicated()]

duplicates.head()

building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,...,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
242723,20,508,11256,2,10,9,6,t,w,q,...,0,0,0,0,0,0,0,0,0,1
461019,21,111,11714,2,10,9,5,t,r,q,...,0,0,0,0,0,0,0,0,0,3
505127,6,1108,5909,2,30,4,7,t,r,n,...,0,0,0,0,0,0,0,0,0,2
553217,27,269,11121,2,15,10,7,n,r,n,...,0,0,0,0,0,0,0,0,0,3
369695,10,1382,5036,2,15,5,5,t,r,n,...,0,0,0,0,0,0,0,0,0,3


The duplicated rows will be dropped to reduce noise in our dataset.

In [15]:
# drop the duplicates

data.drop_duplicates(inplace=True)

In [16]:
# Confirming if there are duplicates

print(f"The data has {data.duplicated().sum()} duplicated rows")

The data has 0 duplicated rows
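
Note that `DataFrame.duplicated()` compares column values only, so the building_id index plays no part in the check. A small sketch with made-up rows illustrates this, along with a non-mutating alternative to `inplace=True`:

```python
import pandas as pd

# Made-up rows: ids 101 and 202 differ, but their column values are identical
toy = pd.DataFrame(
    {"age": [10, 10, 30], "roof_type": ["n", "n", "q"], "damage_grade": [2, 2, 3]},
    index=pd.Index([101, 202, 303], name="building_id"),
)

# The index is ignored, so the second of the two identical rows counts as a duplicate
n_duplicates = toy.duplicated().sum()

# keep="first" (the default) retains the first occurrence of each repeated row
deduplicated = toy.drop_duplicates(keep="first")
print(n_duplicates, list(deduplicated.index))
```

In other words, the rows removed above are distinct buildings that happen to share every feature value and the same damage grade.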
