# **DSR 36 - Mini Competition (Oct 5-7th, 2023): Richter's Earthquake Damage Predictor**

![Earthquake](images/earthquake.jpg)


## Team Remote: Boris, David, Petr, Milosz

![Our Team](images/team.png)


- Working with agile principles
- Using Google Meet and Jira
- Dividing roles: Admin (Milosz), EDA/Viz (Petr), Modeling (Boris), Feature Engineering (David)


## Git Repository

https://github.com/miloszpaul/minicomp


![Our Repo](images/repo.png)


# In this competition, we aim to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal

![Map](images/map.png)

# Our dataset includes 38 features for roughly 260k buildings in the training set.

# Our main hypothesis is that destruction is based on building **location** and **construction**. 

## Location (geo_level_1_id)

![Map](images/nepal.png)
Source: Wikimedia 

## Construction types (foundation_type, roof_type, ground_floor_type, etc)


![Building types](images/building_types.jpg)
Source: https://doi.org/10.1016/j.engstruct.2016.04.043


MISSING Image that correlates location and construction features of final model 


# The **target variable** is _'damage_grade'_, which represents the level of damage to the building. There are 3 grades: 

# 1 (low damage)
# 2 (medium damage)
# 3 (almost complete destruction)

MISSING Image of distribution of final target variable model  

# Our model is ... and received a test score of ....

![Submission](images/submission.png)

# In this presentation we walk you through the .. steps of our model. 

# 0. Import modules and data

In [176]:
# Modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [177]:
# Import scripts
import helper_functions  # Various helper functions
import log_regression  # Simple regression model
import lgb_optimized  # LightGBM model (gradient-boosted trees)

In [178]:
# Load data
X, y, X_test = helper_functions.imports()

## 1. Perform exploratory **data analysis** to understand the dataset's characteristics, distributions, and relationships between variables.



In [179]:
print(f"Shape of X, the DataFrame with the training features: {X.shape}")
print(f"Shape of y, the DataFrame with the training target: {y.shape}")
print(f"Shape of X_test, the DataFrame with the features for the prediction: {X_test.shape}")

Shape of X, the DataFrame with the training features: (260601, 39)
Shape of y, the DataFrame with the training target: (260601, 2)
Shape of X_test, the DataFrame with the features for the prediction: (86868, 39)


In [180]:
X.head()

Unnamed: 0,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
0,802906,6,487,12198,2,30,6,5,t,r,...,0,0,0,0,0,0,0,0,0,0
1,28830,8,900,2812,2,10,8,7,o,r,...,0,0,0,0,0,0,0,0,0,0
2,94947,21,363,8973,2,10,5,5,t,r,...,0,0,0,0,0,0,0,0,0,0
3,590882,22,418,10694,2,10,6,5,t,r,...,0,0,0,0,0,0,0,0,0,0
4,201944,11,131,1488,3,30,8,9,t,r,...,0,0,0,0,0,0,0,0,0,0


In [181]:
y.head()

Unnamed: 0,building_id,damage_grade
0,802906,3
1,28830,2
2,94947,3
3,590882,2
4,201944,3


In [182]:
X_test.head()


Unnamed: 0,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
0,300051,17,596,11307,3,20,7,6,t,r,...,0,0,0,0,0,0,0,0,0,0
1,99355,6,141,11987,2,25,13,5,t,r,...,1,0,0,0,0,0,0,0,0,0
2,890251,22,19,10044,2,5,4,5,t,r,...,0,0,0,0,0,0,0,0,0,0
3,745817,26,39,633,1,0,19,3,t,r,...,0,0,1,0,0,0,0,0,0,0
4,421793,17,289,7970,3,15,8,7,t,r,...,0,0,0,0,0,0,0,0,0,0


## Check whether _X_ and _X_test_ have the same columns

In [183]:
X.columns

Index(['building_id', 'geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
       'count_floors_pre_eq', 'age', 'area_percentage', 'height_percentage',
       'land_surface_condition', 'foundation_type', 'roof_type',
       'ground_floor_type', 'other_floor_type', 'position',
       'plan_configuration', 'has_superstructure_adobe_mud',
       'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_non_engineered',
       'has_superstructure_rc_engineered', 'has_superstructure_other',
       'legal_ownership_status', 'count_families', 'has_secondary_use',
       'has_secondary_use_agriculture', 'has_secondary_use_hotel',
       'has_secondary_use_rental', 'has_secondary_use_institution',
       'has_secondary_use_school', 'has_secondary_use_i

In [184]:
helper_functions.test_column_equality(X, X_test)

AttributeError: module 'helper_functions' has no attribute 'test_column_equality'
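
The helper called above is not available in `helper_functions`, so the check did not run. A minimal sketch of such a check with plain pandas follows; the function names `columns_match` and `column_differences` and the toy frames are illustrative assumptions, not the project's actual helper.

```python
import pandas as pd


def columns_match(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True when both frames have the same column names in the same order."""
    return list(left.columns) == list(right.columns)


def column_differences(left: pd.DataFrame, right: pd.DataFrame):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)


# Toy frames standing in for X and X_test (hypothetical values)
train = pd.DataFrame({"building_id": [1], "age": [10], "roof_type": ["n"]})
test = pd.DataFrame({"building_id": [2], "age": [5]})

same = columns_match(train, test)
only_train, only_test = column_differences(train, test)
print(same, only_train, only_test)
```

On the real data, the equivalent call would be `columns_match(X, X_test)`; listing the differences makes a mismatch easier to debug than a bare boolean.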

In [None]:
X.dtypes

building_id                                int64
geo_level_1_id                             int64
geo_level_2_id                             int64
geo_level_3_id                             int64
count_floors_pre_eq                        int64
age                                        int64
area_percentage                            int64
height_percentage                          int64
land_surface_condition                    object
foundation_type                           object
roof_type                                 object
ground_floor_type                         object
other_floor_type                          object
position                                  object
plan_configuration                        object
has_superstructure_adobe_mud               int64
has_superstructure_mud_mortar_stone        int64
has_superstructure_stone_flag              int64
has_superstructure_cement_mortar_stone     int64
has_superstructure_mud_mortar_brick        int64
has_superstructure_c

In [None]:
X.nunique()

has_superstructure_adobe_mud               2
age                                       42
count_floors_pre_eq                        9
area_percentage                           84
height_percentage                         27
has_secondary_use                          2
has_superstructure_cement_mortar_brick     2
has_superstructure_timber                  2
has_superstructure_bamboo                  2
dtype: int64

Note: From the description, we assume Geo Level 3 is the most precise and Geo Level 1 the least precise. Following that logic, we expect more unique data points in Level 3 than in Level 1.

In [None]:
print(f"Unique data points in geo_level_1_id: {X.loc[:, 'geo_level_1_id'].nunique()}")
print(f"Unique data points in geo_level_2_id: {X.loc[:, 'geo_level_2_id'].nunique()}")
print(f"Unique data points in geo_level_3_id: {X.loc[:, 'geo_level_3_id'].nunique()}")

Unique data points in geo_level_1_id: 31
Unique data points in geo_level_2_id: 1414
Unique data points in geo_level_3_id: 11595


!!!MISSING (Report) on MISSING DATA? 
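
A missing-data report can be produced with a few lines of pandas. The sketch below is an assumption about how such a report might look; `missing_report` and the toy frame are hypothetical and not part of the project scripts.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame with a few gaps (hypothetical values)
toy = pd.DataFrame({
    "age": [10, np.nan, 30, 40],
    "roof_type": ["n", "q", None, None],
    "count_floors_pre_eq": [2, 3, 1, 2],
})
report = missing_report(toy)
print(report)
```

Running `missing_report(X)` and `missing_report(X_test)` would give the report for the competition data; an empty result means there are no missing values to handle.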

## 2. **Clean the data** to prepare it for modeling.

In [None]:
# First cleaning (before the split, if required)
# For this competition, we initially focus on predicting 'damage_grade', so we extract this target variable from the dataset.
# The guard keeps the cell safe to re-run once y is already a Series.
if isinstance(y, pd.DataFrame):
    y = y['damage_grade']

In [None]:
# Split data in training and validation set to evaluate our model's performance.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

## _geo_level2_ mean encodings

In [None]:
geo_levels = [col for col in X.columns if col.startswith("geo")]
print(*geo_levels, sep="\n")

In [None]:
mean_encodings = data.groupby('geo_level_2_id')['damage_grade'].mean()
pd.DataFrame(mean_encodings).head()


NameError: name 'data' is not defined
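
The cell above fails because `data` was never created. The sketch below shows the mean-encoding idea on toy data, assuming that `data` was meant to be the training features joined with the target. Computing the means on training rows only, and falling back to the global mean for IDs not seen in training, are our assumptions and not necessarily what the final model used.

```python
import pandas as pd

# Toy training and test frames (hypothetical values)
train = pd.DataFrame({
    "geo_level_2_id": [487, 487, 900, 900, 363],
    "damage_grade": [3, 2, 2, 2, 3],
})
test = pd.DataFrame({"geo_level_2_id": [487, 900, 9999]})

# Mean damage grade per geo_level_2_id, learned from the training rows only
mean_encodings = train.groupby("geo_level_2_id")["damage_grade"].mean()
global_mean = train["damage_grade"].mean()

# Map the learned means onto both frames; unseen IDs fall back to the global mean
train["geo_level_2_enc"] = train["geo_level_2_id"].map(mean_encodings)
test["geo_level_2_enc"] = test["geo_level_2_id"].map(mean_encodings).fillna(global_mean)

print(test)
```

Fitting the encoding on the training split before mapping it onto the validation and test sets avoids leaking the validation targets into the features.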

In [None]:
for df in X, X_test:
    df["geo_level_2_enc"] = df['geo_level_2_id'].map(mean_encodings)
    # print(df[["geo_level_2_id", "geo_level_2_enc"]].sort_values("geo_level_2_id")[:100], end="\n")

print(geo_levels)

KeyError: 'geo_level_2_id'

## _geo_level1_ dummies

In [None]:
dummies = pd.get_dummies(X["geo_level_1_id"], prefix="geo_level_cat")
X = pd.concat([X, dummies], axis=1)
dummies = pd.get_dummies(X_test["geo_level_1_id"], prefix="geo_level_cat")
X_test = pd.concat([X_test, dummies], axis=1)

print(
    X.shape,    
    X_test.shape, sep="\n"
)

KeyError: 'geo_level_1_id'

In [None]:
geo_columns = [col for col in X.columns if col.startswith('geo_level_cat')]
print(X.groupby("geo_level_1_id").first()[geo_columns].iloc[:6,:6])

KeyError: 'geo_level_1_id'

## _foundation_type_ dummies

In [None]:
X = pd.get_dummies(X, columns=["foundation_type"], drop_first=False)
X_test = pd.get_dummies(X_test, columns=["foundation_type"], drop_first=False)

print(X.shape)
print(X_test.shape)
foundation_cols = [col for col in X.columns if col.startswith("foundation")]
print(
    X[foundation_cols].columns,
    X_test[foundation_cols].columns,
    sep="\n")

KeyError: "None of [Index(['foundation_type'], dtype='object')] are in the [columns]"

In [None]:
X = pd.get_dummies(X, columns=["roof_type"], drop_first=False)
X_test = pd.get_dummies(X_test, columns=["roof_type"], drop_first=False)

print(X.shape)
print(X_test.shape)
cols = [col for col in X.columns if col.startswith("roof_type")]
print(
    X[cols].columns,
    X_test[cols].columns,
    sep="\n")

KeyError: "None of [Index(['roof_type'], dtype='object')] are in the [columns]"

In [None]:
X = pd.get_dummies(X, columns=["ground_floor_type"], drop_first=False)
X_test = pd.get_dummies(X_test, columns=["ground_floor_type"], drop_first=False)

print(X.shape)
print(X_test.shape)
cols = [col for col in X.columns if col.startswith("ground_floor_type")]
print(
    X[cols].columns,
    X_test[cols].columns,
    sep="\n")

KeyError: "None of [Index(['ground_floor_type'], dtype='object')] are in the [columns]"
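
Encoding `X` and `X_test` separately with `pd.get_dummies` can produce different columns when a category appears in only one of the two frames. A minimal sketch of one way to guard against this, using toy frames with hypothetical values, is to reindex the test frame to the training columns:

```python
import pandas as pd

# Toy frames: category "x" appears only in the training data (hypothetical values)
train = pd.DataFrame({"roof_type": ["n", "q", "x"], "age": [10, 20, 30]})
test = pd.DataFrame({"roof_type": ["n", "n", "q"], "age": [5, 15, 25]})

train_enc = pd.get_dummies(train, columns=["roof_type"])
test_enc = pd.get_dummies(test, columns=["roof_type"])

# Reindex the test frame to the training columns; categories that are
# absent from the test data become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.shape, test_enc.shape)
```

After this step both frames have the same columns in the same order, which is what a fitted model expects at prediction time.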

## One-hot encodings

In [None]:
# One-hot encode the categorical columns and drop the finer-grained geo levels
encode_cols = ["foundation_type", "roof_type", "ground_floor_type", "geo_level_1_id"]
drop_geo_cols = ["geo_level_2_id", "geo_level_3_id"]
X = pd.get_dummies(X, columns=encode_cols, drop_first=False).drop(columns=drop_geo_cols)
X_test = pd.get_dummies(X_test, columns=encode_cols, drop_first=False).drop(columns=drop_geo_cols)

In [None]:
for df in X, X_test:
    print(
        df.shape,
        # df.columns,
        df.info()
    )

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260601 entries, 0 to 260600
Data columns (total 9 columns):
 #   Column                                  Non-Null Count   Dtype
---  ------                                  --------------   -----
 0   has_superstructure_adobe_mud            260601 non-null  int64
 1   age                                     260601 non-null  int64
 2   count_floors_pre_eq                     260601 non-null  int64
 3   area_percentage                         260601 non-null  int64
 4   height_percentage                       260601 non-null  int64
 5   has_secondary_use                       260601 non-null  int64
 6   has_superstructure_cement_mortar_brick  260601 non-null  int64
 7   has_superstructure_timber               260601 non-null  int64
 8   has_superstructure_bamboo               260601 non-null  int64
dtypes: int64(9)
memory usage: 17.9 MB
(260601, 9) None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86868 entries, 0 to 86867
Data colum

## Prepare data for analysis

In [None]:
# set index to building_id
for df in y, X, X_test:
    df.set_index("building_id", inplace=True)

drop_cols = ["land_surface_condition", "other_floor_type", "position", 
             "plan_configuration", "legal_ownership_status"]
for df in X, X_test:
    df.drop(columns=drop_cols, inplace=True)

KeyError: "None of ['building_id'] are in the columns"

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_test_pred = X_test

# for k,v in {"X":X, X_train, X_valid, X_test}:
for name,df in {"X":X, "X_train":X_train, "X_valid": X_valid, "X_test": X_test}.items():
    print(name, *df.shape, sep="\t")

X	260601	9
X_train	208480	9
X_valid	52121	9
X_test	86868	77


## 3. **Visualizations**

In [None]:
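# correlation heatmap of predictors
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# building_id is an identifier, not a predictor; ignore it if it is already the index
corr = X.select_dtypes(include=[np.number]).drop(columns="building_id", errors="ignore").corr()
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Feature Correlation Heatmap")
sns.heatmap(corr, ax=ax)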

## **Plot:** Correlation heatmap of predictors

## **Observations:** The secondary-use features are highly correlated with each other, and ground floor and height are highly correlated

## **Actions:** 1) eliminate one variable (ground floor OR height), or 2) keep both, accepting that their individual coefficients cannot be interpreted reliably, because their combined predictive power may be worth keeping
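
To decide which variables to eliminate, the highly correlated pairs can be listed directly from the correlation matrix. This is a minimal sketch; the 0.8 threshold, the function name `correlated_pairs` and the toy columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame: the first two columns move together, the third does not (hypothetical values)
toy = pd.DataFrame({
    "count_floors_pre_eq": [1, 2, 3, 4, 5],
    "height_percentage": [2, 4, 6, 8, 11],
    "age": [30, 5, 40, 10, 25],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Tree-based models such as LightGBM usually tolerate correlated features, so dropping one of a pair matters mostly for the linear baseline.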

In [185]:
# for each predictor, show histogram and 100% stacked histogram
import visualisation_functions  # project script with the plotting helpers

visualisation_functions.hist2x2_f(X_train, y_train, cols_include=['geo_level_1_id', 'age'], cols_ignore=[])

## **Plot I:** Histograms of geo_id

## **Observations:** The epicenter of the earthquake appears to be in region 17 (geo_level_1_id = 17)

## **Actions:** This confirms our hypothesis that location is important

## **Plot II:** Histograms of age 

## **Observations:** 1) Newer buildings tend to suffer low damage (grade 1), i.e. the share of low damage decreases with age; 1.1) the variation around age 200 is explained by the very low number of buildings of that age; 2) the outliers at age 1000 are old protected buildings


## **Actions:** Use a non-linear model (_lightgbm_)


In [186]:
# pairwise correlation
# Concatenate X_train and y_train for correlation calculation
import visualisation_functions  # project script with the plotting helpers

visualisation_functions.plot_pairwise_cor(X_train, y_train)

## **Plot:** Pairwise correlation of each feature

## **Observations:** Outliers (for instance in the superstructure features) are highly correlated with the target

## **Actions:** Do not blindly exclude outliers

In [194]:
# feature importance default plot
import lightgbm as lgb

fitted_model = model.get_model()
lgb.plot_importance(fitted_model)

## **Plot:** Feature importance of our most relevant model, the LightGBM classifier

## **Observations:** The default importance counts how often a feature is used for a split, e.g. how frequently the trees split on age

## **Actions:** 

## 4. Generate a **LightGBM model** and optimize it for better predictive performance.



In [None]:
# Initially, we limit the dataset to a subset of relevant features for testing tree generation. 
# These features are selected based on our domain knowledge and preliminary analysis.


mask = ['has_superstructure_adobe_mud', 'age','count_floors_pre_eq','area_percentage','height_percentage','has_secondary_use',
        'has_superstructure_cement_mortar_brick', 'has_superstructure_timber', 'has_superstructure_bamboo'] 


# Apply the feature mask
X = X[mask]

X_train = X_train[mask]
X_valid = X_valid[mask]

# Prepare the test data with the same feature mask
X_test_pred = X_test[mask]

In [None]:
# Initialize a LightGBM (LGBM) model using the training and validation sets.

model = lgb_optimized.LGBM(X_train, X_valid, y_train, y_valid)


# Optimize the LightGBM model and feed it the test data to create predictions.

y_pred = model.optimization(X, y, X_test_pred)

[I 2023-10-07 13:50:13,836] A new study created in memory with name: no-name-0c5a9c9b-3887-4349-b110-2fe8f8691a7d


[1]	valid_0's multi_logloss: 0.889345
[2]	valid_0's multi_logloss: 0.873301
[3]	valid_0's multi_logloss: 0.862211
[4]	valid_0's multi_logloss: 0.854359
[5]	valid_0's multi_logloss: 0.848449
[6]	valid_0's multi_logloss: 0.843939
[7]	valid_0's multi_logloss: 0.840582
[8]	valid_0's multi_logloss: 0.837772
[9]	valid_0's multi_logloss: 0.835388
[10]	valid_0's multi_logloss: 0.833687
[11]	valid_0's multi_logloss: 0.832322
[12]	valid_0's multi_logloss: 0.831086
[13]	valid_0's multi_logloss: 0.830206
[14]	valid_0's multi_logloss: 0.829291
[15]	valid_0's multi_logloss: 0.828517
[16]	valid_0's multi_logloss: 0.827949
[17]	valid_0's multi_logloss: 0.827334
[18]	valid_0's multi_logloss: 0.826787
[19]	valid_0's multi_logloss: 0.826356
[20]	valid_0's multi_logloss: 0.82609
[21]	valid_0's multi_logloss: 0.825711
[22]	valid_0's multi_logloss: 0.825501
[23]	valid_0's multi_logloss: 0.825115
[24]	valid_0's multi_logloss: 0.824962
[25]	valid_0's multi_logloss: 0.824711
[26]	valid_0's multi_logloss: 0.824

[I 2023-10-07 13:50:16,139] Trial 0 finished with value: 0.2996488939199171 and parameters: {'learning_rate': 0.2088837713596667, 'subsample': 0.9895708837679428, 'num_leaves': 56, 'min_data_in_leaf': 5, 'max_depth': 5, 'lambda_l2': 0.0019107565692958461}. Best is trial 0 with value: 0.2996488939199171.


[100]	valid_0's multi_logloss: 0.820918
[1]	valid_0's multi_logloss: 0.891678
[2]	valid_0's multi_logloss: 0.876861
[3]	valid_0's multi_logloss: 0.867622
[4]	valid_0's multi_logloss: 0.860887
[5]	valid_0's multi_logloss: 0.85583
[6]	valid_0's multi_logloss: 0.851918
[7]	valid_0's multi_logloss: 0.848564
[8]	valid_0's multi_logloss: 0.845679
[9]	valid_0's multi_logloss: 0.843656
[10]	valid_0's multi_logloss: 0.841882
[11]	valid_0's multi_logloss: 0.840287
[12]	valid_0's multi_logloss: 0.839198
[13]	valid_0's multi_logloss: 0.837882
[14]	valid_0's multi_logloss: 0.836802
[15]	valid_0's multi_logloss: 0.835787
[16]	valid_0's multi_logloss: 0.835094
[17]	valid_0's multi_logloss: 0.834112
[18]	valid_0's multi_logloss: 0.833665
[19]	valid_0's multi_logloss: 0.833136
[20]	valid_0's multi_logloss: 0.832662
[21]	valid_0's multi_logloss: 0.832363
[22]	valid_0's multi_logloss: 0.831801
[23]	valid_0's multi_logloss: 0.8315
[24]	valid_0's multi_logloss: 0.831143
[25]	valid_0's multi_logloss: 0.8308

[I 2023-10-07 13:50:17,664] Trial 1 finished with value: 0.30482914756048424 and parameters: {'learning_rate': 0.24567376174643976, 'subsample': 0.8651750307524942, 'num_leaves': 76, 'min_data_in_leaf': 12, 'max_depth': 3, 'lambda_l2': 0.4208698856132629}. Best is trial 1 with value: 0.30482914756048424.


[95]	valid_0's multi_logloss: 0.824568
[96]	valid_0's multi_logloss: 0.824482
[97]	valid_0's multi_logloss: 0.824458
[98]	valid_0's multi_logloss: 0.824381
[99]	valid_0's multi_logloss: 0.824347
[100]	valid_0's multi_logloss: 0.824327
Number of finished trials: 2
Best trial:
--------------------------------
Best F1 Score: 0.30482914756048424
--------------------------------


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, dtype=self.classes_.dtype, warn=True)




AttributeError: 'LGBM' object has no attribute 'optimization'

## 5. **Evaluation**
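
A minimal sketch of the evaluation step, assuming the competition scores submissions with the micro-averaged F1 and using `f1_score` as imported in step 0. The label lists are toy values, not our validation results.

```python
from sklearn.metrics import f1_score

# Toy validation labels and predictions (hypothetical values)
y_valid_toy = [3, 2, 3, 2, 1, 2]
y_pred_toy = [3, 2, 2, 2, 1, 3]

# For single-label multi-class problems, micro-averaged F1 equals accuracy
score = f1_score(y_valid_toy, y_pred_toy, average="micro")
print(f"Micro F1 on the toy validation set: {score:.4f}")
```

With the real data, the call would compare `y_valid` with the model's predictions for `X_valid`.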

## 6. **Export**

In [None]:
# Export the model's predictions to create a CSV file following DrivenData.org data standards.

helper_functions.write_output(X_test, y_pred)
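
The body of `write_output` is not shown in this notebook. The sketch below illustrates what such an export might do, assuming the submission file needs a `building_id` column and a `damage_grade` column; `build_submission` and the example values are hypothetical.

```python
import io

import pandas as pd


def build_submission(building_ids, predictions) -> pd.DataFrame:
    """Pair each building_id with its predicted damage_grade as an integer."""
    submission = pd.DataFrame({
        "building_id": list(building_ids),
        "damage_grade": list(predictions),
    })
    return submission.astype({"damage_grade": int})


# Toy IDs and predictions (hypothetical values)
submission = build_submission([300051, 99355, 890251], [3, 2, 2])

# Write to an in-memory buffer here; pass a file path to to_csv for a real export
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing with `index=False` keeps the pandas row index out of the file, so the CSV contains only the two submission columns.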