# **DSR 36 - Mini Competition (Oct 5-7th, 2023): Richter's Earthquake Damage Predictor**

## Team Remote: Boris, David, Petr, Milosz

![Our Team](images/team.png)


- Working with agile principles
- Using Google Meet and Jira
- Dividing roles: Admin (Milosz), EDA/Viz (Petr), Modeling (Boris), Feature Engineering (David)


## Git Repository

https://github.com/miloszpaul/minicomp


![Our Repo](images/repo.png)


# In this competition, we aim to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal

![Map](images/map.png)

# Our given dataset includes 38 features for about 260k buildings.

# Our main hypothesis is that destruction is based on building **location** and **construction**. 


![Building types](images/building_types.jpg)
Source: https://doi.org/10.1016/j.engstruct.2016.04.043


MISSING Image that correlates location and construction features of final model 


# The **target variable** is _'damage_grade'_, which represents the level of damage to the building. There are 3 grades: 

# 1 (low damage)
# 2 (medium damage)
# 3 (almost complete destruction)
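
Since the three grades are unlikely to be equally frequent, it helps to look at their shares before modeling. The following is a minimal sketch, assuming the target is available as a pandas Series of grades. The toy values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy stand-in for the 'damage_grade' column (illustrative values only)
damage_grade = pd.Series([1, 2, 2, 3, 2, 3, 2, 1, 3, 2], name="damage_grade")

# Absolute counts and relative shares per grade, ordered by grade
counts = damage_grade.value_counts().sort_index()
shares = damage_grade.value_counts(normalize=True).sort_index()

distribution = pd.DataFrame({"count": counts, "share": shares})
print(distribution)
```

A strongly skewed table here would be a reason to prefer a stratified split and a metric that is not dominated by the majority grade.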

MISSING Image of distribution of final target variable model  

# Our model is ... and received a test score of ....

![Submission](images/submission.png)

# In this presentation we walk you through the .. steps of our model. 

## 0. Import modules and data

In [95]:
# Modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [96]:
# Import scripts
import helper_functions # Various helper functions
import log_regression # Simple regression model
import lgb_optimized # Optimized LightGBM model

In [97]:
# Load data
X, y, X_test = helper_functions.imports()

## 1. Perform exploratory **data analysis** to understand the dataset's characteristics, distributions, and relationships between variables.



In [98]:
print(f"Shape of the DataFrame X, containing the features for training: {X.shape}")
print(f"Shape of the DataFrame y, containing the target values for training: {y.shape}")
print(f"Shape of the DataFrame X_test, containing the features for the prediction: {X_test.shape}")

Shape of the DataFrame X, containing the features for training: (260601, 39)
Shape of the DataFrame y, containing the target values for training: (260601, 2)
Shape of the DataFrame X_test, containing the features for the prediction: (86868, 39)


In [99]:
X.head()

|   | building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [100]:
y.head()

|   | building_id | damage_grade |
|---|---|---|
| 0 | 802906 | 3 |
| 1 | 28830 | 2 |
| 2 | 94947 | 3 |
| 3 | 590882 | 2 |
| 4 | 201944 | 3 |


In [101]:
X_test.head()


|   | building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300051 | 17 | 596 | 11307 | 3 | 20 | 7 | 6 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 99355 | 6 | 141 | 11987 | 2 | 25 | 13 | 5 | t | r | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 890251 | 22 | 19 | 10044 | 2 | 5 | 4 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 745817 | 26 | 39 | 633 | 1 | 0 | 19 | 3 | t | r | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 421793 | 17 | 289 | 7970 | 3 | 15 | 8 | 7 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Check whether X and X_test have the same columns

In [102]:
X.columns

Index(['building_id', 'geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
       'count_floors_pre_eq', 'age', 'area_percentage', 'height_percentage',
       'land_surface_condition', 'foundation_type', 'roof_type',
       'ground_floor_type', 'other_floor_type', 'position',
       'plan_configuration', 'has_superstructure_adobe_mud',
       'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_non_engineered',
       'has_superstructure_rc_engineered', 'has_superstructure_other',
       'legal_ownership_status', 'count_families', 'has_secondary_use',
       'has_secondary_use_agriculture', 'has_secondary_use_hotel',
       'has_secondary_use_rental', 'has_secondary_use_institution',
       'has_secondary_use_school', 'has_secondary_use_i

In [103]:
# helper_functions does not define test_column_equality, so compare the column indexes directly.
X.columns.equals(X_test.columns)
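
A column check for this step could also report which columns differ, not just whether the two frames match. The following is a sketch only: `column_differences` is a hypothetical name and is not part of `helper_functions`, and the toy frames are made up.

```python
import pandas as pd

def column_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report columns that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
        # True only if both frames list the same columns in the same order
        "same_order": list(left.columns) == list(right.columns),
    }

# Toy frames (hypothetical data)
train = pd.DataFrame({"building_id": [1], "age": [10], "roof_type": ["n"]})
test = pd.DataFrame({"building_id": [2], "age": [5]})

diff = column_differences(train, test)
print(diff)
```

Applied to the notebook's frames, the call would be `column_differences(X, X_test)`.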

In [None]:
X.dtypes

building_id                                int64
geo_level_1_id                             int64
geo_level_2_id                             int64
geo_level_3_id                             int64
count_floors_pre_eq                        int64
age                                        int64
area_percentage                            int64
height_percentage                          int64
land_surface_condition                    object
foundation_type                           object
roof_type                                 object
ground_floor_type                         object
other_floor_type                          object
position                                  object
plan_configuration                        object
has_superstructure_adobe_mud               int64
has_superstructure_mud_mortar_stone        int64
has_superstructure_stone_flag              int64
has_superstructure_cement_mortar_stone     int64
has_superstructure_mud_mortar_brick        int64
has_superstructure_c

In [None]:
X.nunique()

has_superstructure_adobe_mud               2
age                                       42
count_floors_pre_eq                        9
area_percentage                           84
height_percentage                         27
has_secondary_use                          2
has_superstructure_cement_mortar_brick     2
has_superstructure_timber                  2
has_superstructure_bamboo                  2
dtype: int64

Note: From the description, we assume that Geo Level 3 is the most precise and Geo Level 1 the least precise. Following that logic, we expect more unique values in Level 3 than in Level 1.

In [104]:
print(f"Unique data points in geo_level_1_id: {X.loc[:, 'geo_level_1_id'].nunique()}")
print(f"Unique data points in geo_level_2_id: {X.loc[:, 'geo_level_2_id'].nunique()}")
print(f"Unique data points in geo_level_3_id: {X.loc[:, 'geo_level_3_id'].nunique()}")

Unique data points in geo_level_1_id: 31
Unique data points in geo_level_2_id: 1414
Unique data points in geo_level_3_id: 11595
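
The counts are consistent with the assumed hierarchy, but they do not show that the levels are actually nested. One way to probe this is to check whether every finer-level ID belongs to exactly one coarser-level ID. The sketch below runs on made-up IDs, and the function name `is_nested` is hypothetical.

```python
import pandas as pd

def is_nested(df: pd.DataFrame, fine: str, coarse: str) -> bool:
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    parents_per_child = df.groupby(fine)[coarse].nunique()
    return bool((parents_per_child == 1).all())

# Toy frame with the three geo columns (illustrative IDs only)
geo = pd.DataFrame({
    "geo_level_1_id": [6, 6, 8, 8],
    "geo_level_2_id": [487, 487, 900, 901],
    "geo_level_3_id": [12198, 12199, 2812, 2813],
})

level3_in_level2 = is_nested(geo, "geo_level_3_id", "geo_level_2_id")
level2_in_level1 = is_nested(geo, "geo_level_2_id", "geo_level_1_id")
level1_in_level2 = is_nested(geo, "geo_level_1_id", "geo_level_2_id")
print(level3_in_level2, level2_in_level1, level1_in_level2)
```

If the real data passed the first two checks, the geo levels could be treated as a true hierarchy during feature engineering.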


## 2. **Clean the data** to prepare it for modeling.

In [112]:
# First cleaning (before the split, if required)
# For this competition we focus on predicting 'damage_grade', so we extract this target column from y.
# The guard keeps the cell safe to re-run: once y is a Series, y['damage_grade'] raises a KeyError.
if isinstance(y, pd.DataFrame):
    y = y['damage_grade']

In [106]:
# Split the data into training and validation sets to evaluate our model's performance.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
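
If the damage grades turn out to be imbalanced, a stratified split keeps their proportions similar in both parts. The sketch below uses toy data and the `stratify` argument of `train_test_split`. Whether stratification helps on this dataset is an assumption and has not been checked in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with grades in a 1:2:1 ratio (illustrative only)
X_toy = pd.DataFrame({"age": range(20)})
y_toy = pd.Series([1] * 5 + [2] * 10 + [3] * 5, name="damage_grade")

# stratify=y_toy preserves the grade proportions in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = y_tr.value_counts().sort_index().to_dict()
valid_counts = y_va.value_counts().sort_index().to_dict()
print(train_counts, valid_counts)
```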

In [113]:
# More cleaning (after split)

In [114]:
# Columns to drop at the end of cleaning
drop_cols = []

# Helper: list the user-defined names in the notebook namespace
def ids():
    names = [name for name in globals() if not name.startswith("_")]
    print(names[5:])  # skip the first five entries, which are not user-defined
ids()

# Merge on freshly loaded frames: X and y may already have been reduced
# (feature mask, target Series) and then no longer carry 'building_id'.
X_raw, y_raw, _ = helper_functions.imports()
print(X_raw.shape, 'damage_grade' in X_raw.columns)
data = pd.merge(left=y_raw, right=X_raw, on='building_id', how='inner')
print(data.shape, 'damage_grade' in data.columns)  # the merge should add one column

# Expose each column name as a global variable holding its own name
for col in data.columns:
    globals()[col] = col

['open', 'pd', 'train_test_split', 'f1_score', 'helper_functions', 'log_regression', 'lgb_optimized', 'X', 'y', 'X_test', 'X_train', 'X_valid', 'y_train', 'y_valid', 'mask', 'X_test_pred', 'model', 'drop_cols', 'ids']

## 3. **Visualization**:...
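
The hypothesis that construction matters can be inspected before plotting by tabulating damage grades per category. The sketch below uses `pd.crosstab` on made-up rows. `foundation_type` is a column of the dataset, but the values and shares shown here are illustrative only.

```python
import pandas as pd

# Toy rows (not the competition data)
toy = pd.DataFrame({
    "foundation_type": ["r", "r", "r", "r", "i", "i", "w", "w"],
    "damage_grade":    [3,   3,   2,   3,   1,   2,   2,   1],
})

# Row-normalised shares: each foundation type's row sums to 1
grade_share = pd.crosstab(
    toy["foundation_type"], toy["damage_grade"], normalize="index"
)
print(grade_share.round(2))
```

With matplotlib installed, `grade_share.plot(kind="bar", stacked=True)` would draw the same table as a stacked bar chart.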

## 4. Generate a **LightGBM model** and optimize it for better predictive performance.



In [None]:
# Initially, we limit the dataset to a subset of relevant features for testing tree generation. 
# These features are selected based on our domain knowledge and preliminary analysis.


mask = ['has_superstructure_adobe_mud', 'age','count_floors_pre_eq','area_percentage','height_percentage','has_secondary_use',
        'has_superstructure_cement_mortar_brick', 'has_superstructure_timber', 'has_superstructure_bamboo'] 


# Apply the feature mask
X = X[mask]

X_train = X_train[mask]
X_valid = X_valid[mask]

# Prepare the test data with the same feature mask
X_test_pred = X_test[mask]

In [None]:
# Initialize a LightGBM (LGBM) model using the training and validation sets.

model = lgb_optimized.LGBM(X_train, X_valid, y_train, y_valid)


# Optimize the LightGBM model and feed it the test data to create predictions.

y_pred = model.optimization(X, y, X_test_pred)

[I 2023-10-07 12:49:43,496] A new study created in memory with name: no-name-e60ba60f-b407-4d51-8c31-70b0da29ad2c


[1]	valid_0's multi_logloss: 0.88078
[2]	valid_0's multi_logloss: 0.862107
[3]	valid_0's multi_logloss: 0.850587
[4]	valid_0's multi_logloss: 0.842973
[5]	valid_0's multi_logloss: 0.83772
[6]	valid_0's multi_logloss: 0.833926
[7]	valid_0's multi_logloss: 0.831138
[8]	valid_0's multi_logloss: 0.829048
[9]	valid_0's multi_logloss: 0.827415
[10]	valid_0's multi_logloss: 0.826062
[11]	valid_0's multi_logloss: 0.825129
[12]	valid_0's multi_logloss: 0.824381
[13]	valid_0's multi_logloss: 0.823578
[14]	valid_0's multi_logloss: 0.823005
[15]	valid_0's multi_logloss: 0.8225
[16]	valid_0's multi_logloss: 0.822128
[17]	valid_0's multi_logloss: 0.821817
[18]	valid_0's multi_logloss: 0.821545
[19]	valid_0's multi_logloss: 0.821434
[20]	valid_0's multi_logloss: 0.82126
[21]	valid_0's multi_logloss: 0.821174
[22]	valid_0's multi_logloss: 0.821075
[23]	valid_0's multi_logloss: 0.820985
[24]	valid_0's multi_logloss: 0.820863
[25]	valid_0's multi_logloss: 0.820786
[26]	valid_0's multi_logloss: 0.820788


[I 2023-10-07 12:49:45,378] Trial 0 finished with value: 0.2974233034669327 and parameters: {'learning_rate': 0.24821531815903775, 'subsample': 0.7176022592079963, 'num_leaves': 42, 'min_data_in_leaf': 12, 'max_depth': 14, 'lambda_l2': 0.5608186055546341}. Best is trial 0 with value: 0.2974233034669327.


[1]	valid_0's multi_logloss: 0.904382
[2]	valid_0's multi_logloss: 0.893879
[3]	valid_0's multi_logloss: 0.885342
[4]	valid_0's multi_logloss: 0.878216
[5]	valid_0's multi_logloss: 0.871906
[6]	valid_0's multi_logloss: 0.866577
[7]	valid_0's multi_logloss: 0.862006
[8]	valid_0's multi_logloss: 0.858027
[9]	valid_0's multi_logloss: 0.854585
[10]	valid_0's multi_logloss: 0.851601
[11]	valid_0's multi_logloss: 0.849005
[12]	valid_0's multi_logloss: 0.846646
[13]	valid_0's multi_logloss: 0.844604
[14]	valid_0's multi_logloss: 0.84271
[15]	valid_0's multi_logloss: 0.841059
[16]	valid_0's multi_logloss: 0.839479
[17]	valid_0's multi_logloss: 0.838169
[18]	valid_0's multi_logloss: 0.836947
[19]	valid_0's multi_logloss: 0.835859
[20]	valid_0's multi_logloss: 0.834876
[21]	valid_0's multi_logloss: 0.833994
[22]	valid_0's multi_logloss: 0.833163
[23]	valid_0's multi_logloss: 0.83232
[24]	valid_0's multi_logloss: 0.831655
[25]	valid_0's multi_logloss: 0.831031
[26]	valid_0's multi_logloss: 0.8303

[I 2023-10-07 12:49:47,221] Trial 1 finished with value: 0.3043686805702116 and parameters: {'learning_rate': 0.08924342069901042, 'subsample': 0.8798255995621543, 'num_leaves': 23, 'min_data_in_leaf': 10, 'max_depth': 10, 'lambda_l2': 0.18174168412601532}. Best is trial 1 with value: 0.3043686805702116.


Number of finished trials: 2
Best trial:
--------------------------------
Best F1 Score: 0.3043686805702116
--------------------------------


AttributeError: 'LGBM' object has no attribute 'optimization'

## 5. **Evaluation**
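
One way to fill in this step is to score validation predictions with the `f1_score` function imported at the top of the notebook. The sketch below assumes a micro-averaged F1, which for single-label multiclass data equals plain accuracy. The label lists are made up.

```python
from sklearn.metrics import f1_score

# Toy validation labels and predictions (illustrative only)
y_valid_toy = [1, 2, 3, 2, 2, 3]
y_pred_toy = [1, 2, 2, 2, 3, 3]

# Micro averaging pools true/false positives over all three grades
micro_f1 = f1_score(y_valid_toy, y_pred_toy, average="micro")

# Cross-check against accuracy computed by hand
accuracy = sum(t == p for t, p in zip(y_valid_toy, y_pred_toy)) / len(y_valid_toy)
print(f"micro F1: {micro_f1:.4f}, accuracy: {accuracy:.4f}")
```

In the notebook, the same call would take `y_valid` and the model's predictions for `X_valid`.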

## 6. **Export**

In [None]:
# Export the model's predictions to create a CSV file following DrivenData.org data standards.

helper_functions.write_output(X_test, y_pred)
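
The implementation of `write_output` is not shown in this notebook. The sketch below illustrates what such a helper might do, assuming the submission needs the two columns `building_id` and `damage_grade` (the same columns as the label file). The function name `build_submission` is hypothetical, and the predictions are made up.

```python
import io
import pandas as pd

def build_submission(X_test: pd.DataFrame, y_pred) -> pd.DataFrame:
    """Pair each test building_id with its predicted damage grade."""
    if len(X_test) != len(y_pred):
        raise ValueError("X_test and y_pred must have the same length")
    return pd.DataFrame({
        "building_id": X_test["building_id"].to_numpy(),
        "damage_grade": pd.Series(y_pred).astype(int).to_numpy(),
    })

# Toy test frame and predictions (illustrative only)
X_test_toy = pd.DataFrame({"building_id": [300051, 99355, 890251], "age": [20, 25, 5]})
submission = build_submission(X_test_toy, [3, 2, 2])

# Write to an in-memory buffer here; a file path works the same way
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name instead of the buffer to `to_csv` would produce the file to upload.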