# Richter's Predictor: Modeling Earthquake Damage
This project is a part of the [Data Science Retreat](https://datascienceretreat.com) mini competition. 

The challenge, Richter's Predictor: Modeling Earthquake Damage, is hosted by DRIVENDATA.

## 1. Problem Statement

Based on aspects of building location and construction, our goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

The data was collected through surveys by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal. This survey is one of the largest post-disaster datasets ever collected, containing valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics.

![NepalMap26Apr.png](attachment:NepalMap26Apr.png)

## 2. Data Collection

The data is available from the [Richter's Predictor: Modeling Earthquake Damage](https://www.drivendata.org/competitions/57/nepal-earthquake/data/) challenge hosted by [DRIVENDATA](https://www.drivendata.org).

### Features
The dataset mainly consists of information on the buildings' structure and their legal ownership. Each row in the dataset represents a specific building in the region that was hit by the Gorkha earthquake.

There are 39 columns in this dataset, where the `building_id` column is a unique and random identifier. The remaining 38 features are described in the section below. Categorical variables have been obfuscated as random lowercase ASCII characters. The appearance of the same character in distinct columns does not imply the same original value.
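
Because the same letter in two categorical columns does not refer to the same original value, each column should be encoded on its own. The following is a minimal sketch, assuming pandas is used for the encoding. The rows in `toy` are made up for illustration; only the column names come from the feature list.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "building_id": [101, 102, 103],
        "age": [10, 25, 5],
        "roof_type": ["n", "q", "x"],
        "other_floor_type": ["q", "x", "j"],
    }
)

categorical_cols = ["roof_type", "other_floor_type"]

# get_dummies prefixes every indicator column with its source column name,
# so "q" in roof_type and "q" in other_floor_type stay separate features.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(sorted(encoded.columns))
```

Keeping the column name in the prefix (`roof_type_q` versus `other_floor_type_q`) avoids accidentally treating the two `q` values as the same category.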

### Description
1. `geo_level_1_id, geo_level_2_id, geo_level_3_id` (type: int): geographic region in which building exists, from largest (level 1) to most specific sub-region (level 3). Possible values: level 1: 0-30, level 2: 0-1427, level 3: 0-12567.
2. `count_floors_pre_eq` (type: int): number of floors in the building before the earthquake.
3. `age` (type: int): age of the building in years.
4. `area_percentage` (type: int): normalized area of the building footprint.
5. `height_percentage` (type: int): normalized height of the building footprint.
6. `land_surface_condition` (type: categorical): surface condition of the land where the building was built. Possible values: n, o, t.
7. `foundation_type` (type: categorical): type of foundation used while building. Possible values: h, i, r, u, w.
8. `roof_type` (type: categorical): type of roof used while building. Possible values: n, q, x.
9. `ground_floor_type` (type: categorical): type of the ground floor. Possible values: f, m, v, x, z.
10. `other_floor_type` (type: categorical): type of construction used in floors above the ground floor (except the roof). Possible values: j, q, s, x.
11. `position` (type: categorical): position of the building. Possible values: j, o, s, t.
12. `plan_configuration` (type: categorical): building plan configuration. Possible values: a, c, d, f, m, n, o, q, s, u.
13. `has_superstructure_adobe_mud` (type: binary): flag variable that indicates if the superstructure was made of Adobe/Mud.
14. `has_superstructure_mud_mortar_stone` (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Stone.
15. `has_superstructure_stone_flag` (type: binary): flag variable that indicates if the superstructure was made of Stone.
16. `has_superstructure_cement_mortar_stone` (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Stone.
17. `has_superstructure_mud_mortar_brick` (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Brick.
18. `has_superstructure_cement_mortar_brick` (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Brick.
19. `has_superstructure_timber` (type: binary): flag variable that indicates if the superstructure was made of Timber.
20. `has_superstructure_bamboo` (type: binary): flag variable that indicates if the superstructure was made of Bamboo.
21. `has_superstructure_rc_non_engineered` (type: binary): flag variable that indicates if the superstructure was made of non-engineered reinforced concrete.
22. `has_superstructure_rc_engineered` (type: binary): flag variable that indicates if the superstructure was made of engineered reinforced concrete.
23. `has_superstructure_other` (type: binary): flag variable that indicates if the superstructure was made of any other material.
24. `legal_ownership_status` (type: categorical): legal ownership status of the land where building was built. Possible values: a, r, v, w.
25. `count_families` (type: int): number of families that live in the building.
26. `has_secondary_use` (type: binary): flag variable that indicates if the building was used for any secondary purpose.
27. `has_secondary_use_agriculture` (type: binary): flag variable that indicates if the building was used for agricultural purposes.
28. `has_secondary_use_hotel` (type: binary): flag variable that indicates if the building was used as a hotel.
29. `has_secondary_use_rental` (type: binary): flag variable that indicates if the building was used for rental purposes.
30. `has_secondary_use_institution` (type: binary): flag variable that indicates if the building was used as a location of any institution.
31. `has_secondary_use_school` (type: binary): flag variable that indicates if the building was used as a school.
32. `has_secondary_use_industry` (type: binary): flag variable that indicates if the building was used for industrial purposes.
33. `has_secondary_use_health_post` (type: binary): flag variable that indicates if the building was used as a health post.
34. `has_secondary_use_gov_office` (type: binary): flag variable that indicates if the building was used as a government office.
35. `has_secondary_use_use_police` (type: binary): flag variable that indicates if the building was used as a police station.
36. `has_secondary_use_other` (type: binary): flag variable that indicates if the building was secondarily used for other purposes.


### Target
We're trying to predict the ordinal variable `damage_grade`, which represents the level of damage to a building that was hit by the earthquake. There are 3 grades of damage:

1. represents low damage
2. represents a medium amount of damage
3. represents almost complete destruction


### Initial Data Understanding
Here, we get to know the data and work out the next steps for modelling.

The goal of this step is to identify which data cleaning, preparation, and encoding steps are needed before the data can be used in a model.

Since we are predicting damage_grade, this is a supervised ordinal/multiclass classification problem.
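
A first look at the data can be sketched as a small helper that reports non-numeric columns (candidates for encoding), missing values (candidates for cleaning), and the class balance of the target. This is a sketch under assumptions: `summarize` is a hypothetical helper, not part of the project, and the `toy` frame contains made-up values.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "damage_grade") -> dict:
    """Collect the basic facts needed to plan cleaning and encoding."""
    features = df.drop(columns=[target])
    return {
        "n_rows": len(df),
        # Non-numeric columns need encoding before most models can use them.
        "non_numeric_columns": features.select_dtypes(exclude="number").columns.tolist(),
        # Count of missing values per column, including the target.
        "missing_per_column": df.isna().sum().to_dict(),
        # Share of each damage grade, to spot class imbalance.
        "class_share": df[target].value_counts(normalize=True).sort_index().to_dict(),
    }


# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "age": [10, 25, None, 5],
        "roof_type": ["n", "q", "x", "n"],
        "damage_grade": [1, 2, 2, 3],
    }
)

summary = summarize(toy)
print(summary)
```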

## 3. Data Cleaning, Exploratory Data Analysis (EDA), Feature Engineering, Modelling 

We are using [PyCaret](https://pycaret.readthedocs.io/en/latest/index.html).

PyCaret's classification module (`pycaret.classification`) is a supervised machine learning module for binary and multi-class classification problems. We use it for data cleaning, model comparison, model creation, hyperparameter tuning, and prediction.

PyCaret is a low-code library that can replace hundreds of lines of code with only a few, which makes experiments much faster to set up and run.

* `pycaret.classification.setup`

The `setup()` function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline.

PyCaret automatically infers the data types of all features and, after `setup()` is executed, displays a table containing the features and their inferred data types.

* `pycaret.classification.compare_models`

The `compare_models()` function trains all estimators available in the model library and evaluates their performance using cross-validation. The output of this function is a score grid with average cross-validated scores.

![model_comparison.png](attachment:model_comparison.png)

* `pycaret.classification.create_model`
 
The `create_model()` function trains and evaluates the performance of a given estimator using cross validation. The output of this function is a grid of scores. 

We are using `lightgbm` - Light Gradient Boosting Machine as estimator.
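
The three steps described so far (`setup()`, `compare_models()`, `create_model()`) can be sketched as one function. This is a minimal sketch, not the exact notebook code: `build_baseline` is a hypothetical wrapper, `train_df` is assumed to be a dataframe that already contains the features and the `damage_grade` column, and the seed value is arbitrary. Depending on the installed PyCaret version, `setup()` may also ask for interactive confirmation of the inferred data types.

```python
def build_baseline(train_df, target="damage_grade", seed=42):
    """Run setup, compare all models, and train a LightGBM baseline."""
    # Imported inside the function so that defining it does not require PyCaret.
    from pycaret.classification import compare_models, create_model, setup

    # Initialize the experiment; session_id fixes the random seed.
    setup(data=train_df, target=target, session_id=seed)

    # Score grid of all available estimators, sorted by cross-validated F1.
    best_overall = compare_models(sort="F1")

    # Train the estimator chosen in this project: Light Gradient Boosting Machine.
    lightgbm_model = create_model("lightgbm")

    return best_overall, lightgbm_model
```

Calling `build_baseline(train_df)` returns both the top model from the comparison and the LightGBM baseline, so the two can be checked against each other before tuning.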

* `pycaret.classification.tune_model`

The `tune_model()` function tunes the hyperparameters of a given estimator. The output of this function is a score grid with the cross-validated scores of the best selected model.

We tuned the `learning_rate`, `n_estimators`, `num_leaves`, `reg_alpha`, and `reg_lambda` hyperparameters.

* `pycaret.classification.predict_model`

The `predict_model()` function predicts Target and Score (probability of predicted class) using a trained model. 

* `pycaret.classification.plot_model`

The `plot_model()` function analyzes the performance of a trained or tuned model on the test dataset.
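
Tuning, plotting, and prediction can be sketched in the same way. The hyperparameter names in `CUSTOM_GRID` are the ones listed above, but the candidate values are hypothetical, and so are the `tune_and_evaluate` helper and the `n_iter` default. The sketch assumes `setup()` has already been run in the current session and that `model` was returned by `create_model()`.

```python
# Hypothetical search space: the names come from the text, the values do not.
CUSTOM_GRID = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "num_leaves": [31, 63, 127],
    "reg_alpha": [0.0, 0.1, 1.0],
    "reg_lambda": [0.0, 0.1, 1.0],
}


def tune_and_evaluate(model, unseen_df=None, n_iter=10):
    """Tune a trained model, plot its diagnostics, and generate predictions."""
    # Imported inside the function so that defining it does not require PyCaret.
    from pycaret.classification import plot_model, predict_model, tune_model

    # Random search over CUSTOM_GRID, keeping the candidate with the best F1.
    tuned = tune_model(model, custom_grid=CUSTOM_GRID, optimize="F1", n_iter=n_iter)

    # Diagnostic plots computed on the hold-out split created by setup().
    for plot_name in ("confusion_matrix", "class_report", "feature"):
        plot_model(tuned, plot=plot_name)

    # Without `data`, predict_model scores the hold-out split.
    holdout_predictions = predict_model(tuned)

    # With `data`, it predicts on new rows such as the competition test set.
    new_predictions = None
    if unseen_df is not None:
        new_predictions = predict_model(tuned, data=unseen_df)

    return tuned, holdout_predictions, new_predictions
```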

## 4. Plots

### 1. Baseline Model

![plot_baseline_feature_importance.png](attachment:plot_baseline_feature_importance.png)

![plot_baseline_confusion_matrix.png](attachment:plot_baseline_confusion_matrix.png)

![plot_tuned_class_report.png](attachment:plot_tuned_class_report.png)

### 2. Tuned Model

![plot_tuned_class_report.png](attachment:plot_tuned_class_report.png)

## 5. Performance metric
To measure the performance of our algorithms, we use the F1 score, which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate a binary classifier, but since we have three possible labels we use a variant called the micro-averaged F1 score.
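
To make the metric concrete, here is a small from-scratch sketch of the micro-averaged F1 score. It pools true positives, false positives, and false negatives over all classes before computing precision and recall. The labels in `y_true` and `y_pred` are made up for illustration and are not results from this project.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN across classes, then compute F1."""
    tp = fp = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == prediction:
            tp += 1  # correct: a true positive for that class
        else:
            fp += 1  # a false positive for the predicted class
            fn += 1  # a false negative for the true class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up damage grades, for illustration only.
y_true = [1, 2, 3, 2, 2, 3, 1, 2]
y_pred = [1, 2, 2, 2, 3, 3, 2, 2]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every building gets exactly one label, each wrong prediction counts once as a false positive and once as a false negative, so the micro-averaged F1 score equals plain accuracy.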

After tuning the hyperparameters, we were able to achieve:
### F1 score = 0.7364

![F1_score.png](attachment:F1_score.png)