# **DSR 36 - Mini Competition (Oct 5-7th, 2023): Richter's Earthquake Damage Predictor**

## Team Boris, David, Petr, Milosz

![Our Team](images/team.png)

We worked according to agile principles, coordinating through Google Meet and Jira. Roles were divided as follows: Admin: Milosz; EDA/Viz: Petr; Modeling: Boris; Feature Engineering: David.

Git Repository

![Our Repo](images/repo.png)




# In this competition, we aim to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal

![Map](images/map.png)

# Our given dataset includes 38 features related to 162k buildings. Our main hypothesis is that destruction is based on building **location** and **construction**. 

Image that correlates location and construction features of final model 

# The **target variable** is 'damage_grade,' which represents the level of damage to the building. There are 3 grades: 1 (low damage), 2 (medium damage), and 3 (almost complete destruction).

Image of distribution of final target variable model  

# Our model is ... and received a test score of ....

Image of test score 

# In this presentation we walk you through the .. steps of our model. 

0. Import modules and data

In [1]:
# Modules
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Import scripts
import helper_functions # various helper functions
import lgb_optimized # LightGBM model with Optuna hyperparameter optimization

In [3]:
# Load data
X, y, X_test = helper_functions.imports()

1. Perform exploratory **data analysis** to understand the dataset's characteristics, distributions, and relationships between variables.
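A minimal sketch of what a first pass at this step could look like. The toy frame below is made up and only borrows a few column names from the dataset; `X_toy` and `y_toy` are hypothetical stand-ins, not the competition data.

```python
import pandas as pd

# Toy stand-in for the feature table and labels (all values are made up).
X_toy = pd.DataFrame({
    "age": [5, 20, 35, 10, 50, 15],
    "count_floors_pre_eq": [1, 2, 3, 2, 2, 1],
    "has_superstructure_adobe_mud": [0, 1, 1, 0, 1, 0],
})
y_toy = pd.Series([1, 2, 3, 2, 3, 1], name="damage_grade")

# Summary statistics (count, mean, std, quartiles) for each numeric feature.
summary = X_toy.describe()

# Share of each damage grade, to see how balanced the target is.
grade_share = y_toy.value_counts(normalize=True).sort_index()

# Mean damage grade for each value of a binary construction flag.
mean_grade_by_adobe = y_toy.groupby(X_toy["has_superstructure_adobe_mud"]).mean()

print(summary)
print(grade_share)
print(mean_grade_by_adobe)
```

On the real data, the same three calls give a quick view of feature ranges, class balance, and whether a construction flag is associated with higher damage.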



2. **Clean the data** to prepare it for modeling.

In [4]:
# First cleaning (before the split, if required).
# For this competition, we focus on predicting 'damage_grade', so we extract this target variable from the dataset.

y = y['damage_grade']

In [5]:
# Split the data into training and validation sets to evaluate our model's performance.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
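The split above is purely random. If the three damage grades are unevenly represented, a stratified split is one option worth considering. The sketch below uses made-up toy labels (not the competition data) to show that passing `stratify` keeps the class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with made-up class proportions: 10% / 60% / 30%.
y_toy = pd.Series([1] * 10 + [2] * 60 + [3] * 30, name="damage_grade")
X_toy = pd.DataFrame({"age": range(len(y_toy))})

# stratify=y_toy makes both parts reproduce the label proportions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = y_tr.value_counts().sort_index().to_dict()
valid_counts = y_va.value_counts().sort_index().to_dict()
print(train_counts, valid_counts)
```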

In [6]:
# More cleaning (after split)

3. **Visualization**:...

4. Generate a **LightGBM model** and optimize its hyperparameters for better predictive performance.



In [7]:
# Initially, we limit the dataset to a subset of relevant features for testing tree generation. 
# These features are selected based on our domain knowledge and preliminary analysis.


mask = ['has_superstructure_adobe_mud', 'age','count_floors_pre_eq','area_percentage','height_percentage','has_secondary_use',
        'has_superstructure_cement_mortar_brick', 'has_superstructure_timber', 'has_superstructure_bamboo'] 


# Apply the feature mask
X = X[mask]

X_train = X_train[mask]
X_valid = X_valid[mask]

# Prepare the test data with the same feature mask
X_test_pred = X_test[mask]

In [8]:
# Initialize a LightGBM (LGBM) model using the training and validation sets.

model = lgb_optimized.LGBM(X_train, X_valid, y_train, y_valid)


# Optimize the LightGBM model's hyperparameters and feed it with test data to create predictions.

y_pred = model.optimization(X, y, X_test_pred)

[I 2023-10-07 12:04:02,974] A new study created in memory with name: no-name-938f97b9-6a3a-4e5a-93c1-f28aab5e5a7b


[1]	valid_0's multi_logloss: 0.902354
[2]	valid_0's multi_logloss: 0.890471
[3]	valid_0's multi_logloss: 0.880951
[4]	valid_0's multi_logloss: 0.873128
[5]	valid_0's multi_logloss: 0.866468
[6]	valid_0's multi_logloss: 0.860815
[7]	valid_0's multi_logloss: 0.856026
[8]	valid_0's multi_logloss: 0.851891
[9]	valid_0's multi_logloss: 0.84837
[10]	valid_0's multi_logloss: 0.845279
[11]	valid_0's multi_logloss: 0.84259
[12]	valid_0's multi_logloss: 0.840243
[13]	valid_0's multi_logloss: 0.838192
[14]	valid_0's multi_logloss: 0.836367
[15]	valid_0's multi_logloss: 0.834763
[16]	valid_0's multi_logloss: 0.833314
[17]	valid_0's multi_logloss: 0.832069
[18]	valid_0's multi_logloss: 0.830939
[19]	valid_0's multi_logloss: 0.829935
[20]	valid_0's multi_logloss: 0.829008
[21]	valid_0's multi_logloss: 0.828209
[22]	valid_0's multi_logloss: 0.827459
[23]	valid_0's multi_logloss: 0.826823
[24]	valid_0's multi_logloss: 0.826162
[25]	valid_0's multi_logloss: 0.825618
[26]	valid_0's multi_logloss: 0.8250

[I 2023-10-07 12:04:05,561] Trial 0 finished with value: 0.29876633218856125 and parameters: {'learning_rate': 0.08931994015058015, 'subsample': 0.7092792624159049, 'num_leaves': 85, 'min_data_in_leaf': 12, 'max_depth': 14, 'lambda_l2': 0.07911213951880502}. Best is trial 0 with value: 0.29876633218856125.


[1]	valid_0's multi_logloss: 0.900866
[2]	valid_0's multi_logloss: 0.88866
[3]	valid_0's multi_logloss: 0.878986
[4]	valid_0's multi_logloss: 0.871135
[5]	valid_0's multi_logloss: 0.8648
[6]	valid_0's multi_logloss: 0.859672
[7]	valid_0's multi_logloss: 0.855483
[8]	valid_0's multi_logloss: 0.851899
[9]	valid_0's multi_logloss: 0.848856
[10]	valid_0's multi_logloss: 0.846287
[11]	valid_0's multi_logloss: 0.844054
[12]	valid_0's multi_logloss: 0.841888
[13]	valid_0's multi_logloss: 0.84003
[14]	valid_0's multi_logloss: 0.838403
[15]	valid_0's multi_logloss: 0.837039
[16]	valid_0's multi_logloss: 0.835889
[17]	valid_0's multi_logloss: 0.834813
[18]	valid_0's multi_logloss: 0.833806
[19]	valid_0's multi_logloss: 0.832946
[20]	valid_0's multi_logloss: 0.831989
[21]	valid_0's multi_logloss: 0.831301
[22]	valid_0's multi_logloss: 0.830727
[23]	valid_0's multi_logloss: 0.830045
[24]	valid_0's multi_logloss: 0.829553
[25]	valid_0's multi_logloss: 0.829009
[26]	valid_0's multi_logloss: 0.828602

[I 2023-10-07 12:04:06,888] Trial 1 finished with value: 0.30486751980967364 and parameters: {'learning_rate': 0.12030806996526378, 'subsample': 0.7109852232756951, 'num_leaves': 16, 'min_data_in_leaf': 7, 'max_depth': 11, 'lambda_l2': 0.3133180823931917}. Best is trial 1 with value: 0.30486751980967364.


[1]	valid_0's multi_logloss: 0.876644
[2]	valid_0's multi_logloss: 0.857609
[3]	valid_0's multi_logloss: 0.846364
[4]	valid_0's multi_logloss: 0.839009
[5]	valid_0's multi_logloss: 0.83417
[6]	valid_0's multi_logloss: 0.830742
[7]	valid_0's multi_logloss: 0.82836
[8]	valid_0's multi_logloss: 0.82654
[9]	valid_0's multi_logloss: 0.825083
[10]	valid_0's multi_logloss: 0.823896
[11]	valid_0's multi_logloss: 0.823034
[12]	valid_0's multi_logloss: 0.822378
[13]	valid_0's multi_logloss: 0.821823
[14]	valid_0's multi_logloss: 0.821568
[15]	valid_0's multi_logloss: 0.821268
[16]	valid_0's multi_logloss: 0.821057
[17]	valid_0's multi_logloss: 0.820857
[18]	valid_0's multi_logloss: 0.820683
[19]	valid_0's multi_logloss: 0.820613
[20]	valid_0's multi_logloss: 0.820577
[21]	valid_0's multi_logloss: 0.82055
[22]	valid_0's multi_logloss: 0.820434
[23]	valid_0's multi_logloss: 0.820338
[24]	valid_0's multi_logloss: 0.820301
[25]	valid_0's multi_logloss: 0.820225
[26]	valid_0's multi_logloss: 0.820131

[I 2023-10-07 12:04:08,740] Trial 2 finished with value: 0.2938163120431304 and parameters: {'learning_rate': 0.27099492502377376, 'subsample': 0.8093985898327212, 'num_leaves': 59, 'min_data_in_leaf': 8, 'max_depth': 10, 'lambda_l2': 0.963010979101808}. Best is trial 1 with value: 0.30486751980967364.


[1]	valid_0's multi_logloss: 0.891707
[2]	valid_0's multi_logloss: 0.875365
[3]	valid_0's multi_logloss: 0.86357
[4]	valid_0's multi_logloss: 0.854835
[5]	valid_0's multi_logloss: 0.848285
[6]	valid_0's multi_logloss: 0.843228
[7]	valid_0's multi_logloss: 0.839212
[8]	valid_0's multi_logloss: 0.836024
[9]	valid_0's multi_logloss: 0.83346
[10]	valid_0's multi_logloss: 0.83143
[11]	valid_0's multi_logloss: 0.829704
[12]	valid_0's multi_logloss: 0.828233
[13]	valid_0's multi_logloss: 0.826998
[14]	valid_0's multi_logloss: 0.825916
[15]	valid_0's multi_logloss: 0.825017
[16]	valid_0's multi_logloss: 0.824363
[17]	valid_0's multi_logloss: 0.823747
[18]	valid_0's multi_logloss: 0.823268
[19]	valid_0's multi_logloss: 0.822876
[20]	valid_0's multi_logloss: 0.822487
[21]	valid_0's multi_logloss: 0.821947
[22]	valid_0's multi_logloss: 0.821688
[23]	valid_0's multi_logloss: 0.821374
[24]	valid_0's multi_logloss: 0.821159
[25]	valid_0's multi_logloss: 0.820963
[26]	valid_0's multi_logloss: 0.82078

[W 2023-10-07 12:04:09,791] Trial 3 failed with parameters: {'learning_rate': 0.15874286016585107, 'subsample': 0.9946390527704619, 'num_leaves': 71, 'min_data_in_leaf': 20, 'max_depth': 15, 'lambda_l2': 0.3660971273142003} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/Users/miloszrosi/anaconda3/envs/dsr-setup/lib/python3.10/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/Users/miloszrosi/Downloads/minicomp/lgb_optimized.py", line 59, in _objective
    model = lgb.train(
  File "/Users/miloszrosi/anaconda3/envs/dsr-setup/lib/python3.10/site-packages/lightgbm/engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "/Users/miloszrosi/anaconda3/envs/dsr-setup/lib/python3.10/site-packages/lightgbm/basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
KeyboardInterrupt
[W 2023-10-07 12:04:09,794] Trial 3 failed with value None.


[35]	valid_0's multi_logloss: 0.820052
[36]	valid_0's multi_logloss: 0.820011
[37]	valid_0's multi_logloss: 0.819977
[38]	valid_0's multi_logloss: 0.819959
[39]	valid_0's multi_logloss: 0.819937
[40]	valid_0's multi_logloss: 0.819957


KeyboardInterrupt: 

5. **Evaluation**
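DrivenData scores this competition with the micro-averaged F1 score, so a validation check along the following lines may be useful. The label arrays here are made up for illustration; in the notebook they would be `y_valid` and the model's predictions for `X_valid`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted damage grades for eight buildings.
y_true_toy = [1, 2, 2, 3, 2, 1, 3, 2]
y_pred_toy = [1, 2, 3, 3, 2, 2, 3, 2]

# Micro-averaging pools true/false positives over all classes.
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")

# For single-label multiclass problems, micro F1 coincides with accuracy.
accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(micro_f1, accuracy)
```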

6. **Export**

In [None]:
# Export the model's predictions to a CSV file in the DrivenData.org submission format.

helper_functions.write_output(X_test, y_pred)
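`helper_functions.write_output` is not shown here. A sketch of what such a helper might do follows, assuming that the test frame is indexed by `building_id` and that the submission file has the two columns `building_id` and `damage_grade`. The function name `write_output_sketch` and the toy values are hypothetical.

```python
import io

import numpy as np
import pandas as pd


def write_output_sketch(X_test, y_pred, target):
    """Write predictions as a two-column CSV: building_id, damage_grade."""
    submission = pd.DataFrame(
        {"damage_grade": np.asarray(y_pred).astype(int)},
        index=X_test.index,  # assumed to hold the building ids
    )
    submission.index.name = "building_id"
    submission.to_csv(target)
    return submission


# Toy usage: write to an in-memory buffer instead of a file on disk.
X_test_toy = pd.DataFrame({"age": [10, 25]}, index=[300051, 99355])
buffer = io.StringIO()
write_output_sketch(X_test_toy, [3, 2], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path such as `"submission.csv"` as `target` would write the same content to disk.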