# HACKEREARTH: #6 - Predict the damage to the building
- **Competition** : [here](https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-6-1/machine-learning/predict-the-energy-used-612632a9-3f496e7f/)

- **Leaderboard** : [here](https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-6-1/leaderboard/)

- **Data**        : [Download](https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-challenge-6-1/predict-the-energy-used-612632a9-3f496e7f/a490e594-6-Dataset.zip)

```
Opened At : Jun 16, 2018, 09:00 PM IST
Closed At : Aug 15, 2018, 11:55 PM IST
Rank      : 44
```

## Problem Statement:
Determining the degree of damage done to buildings after an earthquake can help identify safe and unsafe buildings, thus avoiding deaths and injuries resulting from aftershocks. Leveraging the power of machine learning is one viable option that can potentially prevent massive loss of life while making rescue efforts easier and more efficient. In this challenge, we provide you with the before and after details of nearly one million buildings affected by an earthquake. The damage to a building is categorized into five grades, each depicting the extent of damage done to the building. Given building details, your task is to build a model that predicts the extent of damage done to a building after an earthquake.

---
## Code
### 1. Load libraries
#### Additional things
- Remove warnings
- Pandas maximum columns display = 1000
- Matplotlib inline

In [None]:
import pandas as pd
import math
import numpy as np
import warnings
import seaborn as sns
import glob
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 1000)
%matplotlib inline

### 2. Load data

In [None]:
data = pd.read_csv('../data/train.csv')
building_structure = pd.read_csv('../data/Building_Structure.csv')
building_ownership_use = pd.read_csv('../data/Building_Ownership_Use.csv')

In [None]:
building_structure.drop(['district_id', 'vdcmun_id'], axis = 1, inplace = True)
building_ownership_use.drop(['district_id', 'vdcmun_id', 'ward_id'], axis = 1, inplace = True)

In [None]:
test = pd.read_csv('../data/test.csv')

In [None]:
data.shape

In [None]:
building_structure.shape

In [None]:
building_ownership_use.shape

In [None]:
test.shape

### 3. Merge data

In [None]:
data = data.set_index('building_id').join(building_structure.set_index('building_id')).reset_index()
data = data.set_index('building_id').join(building_ownership_use.set_index('building_id')).reset_index()

In [None]:
test = test.set_index('building_id').join(building_structure.set_index('building_id')).reset_index()
test = test.set_index('building_id').join(building_ownership_use.set_index('building_id')).reset_index()
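
The joins above are left joins on `building_id`, so they rely on each building appearing at most once in the auxiliary tables. A minimal sketch on made-up toy frames (the values are illustrative, not taken from the competition data) of how `merge` with `validate` could guard against accidental row duplication:

```python
import pandas as pd

# Toy stand-ins for train.csv and Building_Structure.csv (values are made up).
train_toy = pd.DataFrame({'building_id': ['a', 'b', 'c'],
                          'damage_grade': ['Grade 1', 'Grade 3', 'Grade 5']})
structure_toy = pd.DataFrame({'building_id': ['c', 'a', 'b'],
                              'age_building': [12, 30, 7]})

# how='left' keeps every training row; validate raises MergeError if
# building_id is duplicated on either side.
merged_toy = train_toy.merge(structure_toy, on='building_id',
                             how='left', validate='one_to_one')
```

If the row count after the merge differs from the row count before it, the key is not unique in one of the tables.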

In [None]:
del building_structure
del building_ownership_use

In [None]:
data.shape

In [None]:
test.shape

In [None]:
data.to_csv('../data/full_train.csv', index = False)
test.to_csv('../data/full_test.csv', index = False)

### 4. EDA

#### 4.1 Check for missing values
- has_repair_started has approximately 5% missing values in both train and test
  - replace missing values with 2, so they are treated as a separate category
- count_families has 1 missing value in the train data
  - replace it with 1

In [None]:
data.isnull().sum(axis = 0)

In [None]:
test.isnull().sum(axis = 0)

In [None]:
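# Assign through .loc; chained indexing may write to a temporary copy
data.loc[data['count_families'].isnull(), 'count_families'] = 1

In [None]:
# Missing has_repair_started values get their own category (2)
data.loc[data['has_repair_started'].isnull(), 'has_repair_started'] = 2
test.loc[test['has_repair_started'].isnull(), 'has_repair_started'] = 2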
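
A small sketch of the same checks on a made-up frame: `isnull().mean()` reports the share of missing values per column, and `fillna` with a dict fills each column with its own replacement. The replacement values follow the notes above; the data itself is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with the two columns that have gaps in the real data.
toy = pd.DataFrame({'has_repair_started': [0.0, 1.0, np.nan, 0.0],
                    'count_families': [1.0, np.nan, 2.0, 1.0]})

# Share of missing values per column (isnull().sum() gives raw counts).
missing_share = toy.isnull().mean()

# Fill each column with its own replacement: 2 marks "unknown" for
# has_repair_started, 1 is used for count_families.
filled = toy.fillna({'has_repair_started': 2, 'count_families': 1})
```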

### 5. Create model data
- Separate independent and dependent data
- Label encoding for categorical variables

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
drop_cols = ['building_id']

In [None]:
independent_cols = [x for x in data.columns if x not in ['damage_grade'] + drop_cols]
target = 'damage_grade'

In [None]:
# Copy so that the label encoding below does not write into a view of data
X = data[independent_cols].copy()
y = np.array(data[target])

In [None]:
y = np.array([int(value.split()[1]) for value in y])

In [None]:
categorical_cols = X.columns[X.dtypes == 'object']
numeric_cols = X.columns[X.dtypes != 'object']

In [None]:
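# Fit each encoder on train and test values together, so that a category
# gets the same integer code in both frames
for column in categorical_cols:
    le = LabelEncoder()
    le.fit(pd.concat([X[column], test[column]]))
    X[column] = le.transform(X[column])
    test[column] = le.transform(test[column])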
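
A label encoder assigns integers by the sorted unique values it sees, so the mapping depends on the data it was fitted on. A minimal sketch with made-up category values (not the competition's) of how separately fitted encoders can disagree, and how fitting once on the union keeps train and test aligned:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categories; 'brick' happens to be absent from the test column.
train_col = pd.Series(['brick', 'mud', 'timber'])
test_col = pd.Series(['mud', 'timber', 'mud'])

# Fitted separately, 'mud' becomes 1 in train but 0 in test.
sep_train = LabelEncoder().fit_transform(train_col)
sep_test = LabelEncoder().fit_transform(test_col)

# Fitted once on the union, every category has a single code.
shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
shared_train = shared.transform(train_col)
shared_test = shared.transform(test_col)
```

When train and test contain exactly the same set of categories, both approaches give the same codes; the shared fit only matters when the sets differ.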

### 6. Modelling

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.7, stratify = y, random_state = 294056)

#### 6.1 Decision Tree

In [None]:
clf1 = DecisionTreeClassifier()
cross_val_score(clf1, X, y, scoring = make_scorer(f1_score, average='weighted'), cv = 10)
# array([0.72248245, 0.72073848, 0.7182898 , 0.71662196, 0.71920304, 0.72171188, 0.72030592, 0.71891392, 0.71888828, 0.71699247])
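
The scorer above uses the weighted F1: the F1 score is computed per class and then averaged with weights proportional to each class's number of true samples. A minimal hand-rolled sketch of that definition on made-up labels (the helper `weighted_f1` is hypothetical and only mirrors what `f1_score(..., average='weighted')` is documented to compute):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(y_true)

# Made-up grades for six buildings.
y_true_toy = [1, 1, 2, 2, 2, 3]
y_pred_toy = [1, 2, 2, 2, 3, 3]
score = weighted_f1(y_true_toy, y_pred_toy)
```

Because the weights follow class frequency, frequent grades dominate this metric more than they would under a macro average.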

In [None]:
clf1.fit(X, y)
dt_pred = clf1.predict(test.drop(['building_id'], axis = 1))
dt_pred = ['Grade ' + str(pred) for pred in dt_pred]
dt_sub = pd.DataFrame({'building_id' : test['building_id'], 'damage_grade' : dt_pred})
dt_sub.to_csv('../submissions/dt_sub1.csv', index = False)

#### 6.2 Random Forest

In [None]:
clf2 = RandomForestClassifier(n_estimators = 1500)
cv_scores = cross_val_score(clf2, X, y, scoring = make_scorer(f1_score, average='weighted'), cv = 5)
print(np.mean(cv_scores))
# for 10 trees: array([0.75505873, 0.75331655, 0.75546319, 0.75023971, 0.75276322])
# for 500 trees: array([0.77158428, 0.76935488, 0.77136891, 0.76922603, 0.76947412])
# for 700 trees: array([0.771809  , 0.76984322, 0.77153816, 0.76925957, 0.76927937])

In [None]:
clf2.fit(X, y)
rf_pred = clf2.predict(test.drop(['building_id'], axis = 1))
rf_pred = ['Grade ' + str(pred) for pred in rf_pred]
rf_sub = pd.DataFrame({'building_id' : test['building_id'], 'damage_grade' : rf_pred})
rf_sub.to_csv('../submissions/rf_sub3.csv', index = False)

#### 6.3 LightGBM

In [None]:
# create dataset for lightgbm
# LightGBM multiclass labels must lie in [0, num_class), so shift grades 1-5 to 0-4
lgb_train = lgb.Dataset(X_train, y_train - 1)
lgb_valid = lgb.Dataset(X_valid, y_valid - 1, reference = lgb_train)

In [None]:
# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_error',
    'num_leaves': 50,
    'learning_rate': 0.05,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'max_depth' : -1,
    'num_class' : 5
}

In [None]:
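def lgb_f1(pred, data):
    # Custom eval for lgb.train: returns (name, value, is_higher_better)
    label = data.get_label()
    # Flat multiclass scores are grouped by class, so reshape column-major;
    # an already 2-D (n_samples, 5) array is left unchanged
    pred = np.reshape(pred, (len(label), 5), order = 'F')
    pred = np.argmax(pred, axis = 1)
    fs = f1_score(label, pred, average = 'weighted')
    return 'fscore', fs, True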
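
The reshape in `lgb_f1` matters because older LightGBM releases pass a custom multiclass metric a flat score vector grouped by class (all samples for class 0, then class 1, and so on), while newer releases pass a 2-D array. A NumPy-only sketch with made-up scores, assuming that layout, showing that a column-major (`order='F'`) reshape recovers one row per sample in the first case and leaves the second untouched:

```python
import numpy as np

n_samples, n_classes = 3, 5
# Made-up class scores, one row per sample.
scores = np.arange(n_samples * n_classes, dtype=float).reshape(n_samples, n_classes)

# Class-major flattening: all samples for class 0, then class 1, ...
flat = scores.T.ravel()

# Column-major reshape restores one row per sample.
restored = np.reshape(flat, (n_samples, n_classes), order='F')

# An array that is already 2-D passes through unchanged.
unchanged = np.reshape(scores, (n_samples, n_classes), order='F')

# A plain row-major reshape of the flat vector would mix up samples and classes.
scrambled = np.reshape(flat, (n_samples, n_classes))
```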

In [None]:
clf3 = lgb.train(params,
            lgb_train,
            num_boost_round=5000,
            valid_sets=[lgb_train, lgb_valid],
            early_stopping_rounds = 100,
            verbose_eval=20,
            feval = lgb_f1)

In [None]:
lgb_pred = ['Grade ' + str(pred + 1) for pred in np.argmax(clf3.predict(test.drop(['building_id'], axis = 1)), axis = 1)]

In [None]:
lgb_sub = pd.DataFrame({'building_id': test['building_id'], 'damage_grade': lgb_pred})
lgb_sub.to_csv('../submissions/lgb_sub2.csv', index = False)
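
The targets arrive as strings such as `Grade 3`, while the multiclass objective works with class indices starting at 0, hence the `+ 1` when building the submission. A small sketch of that round trip on made-up values (the helper names are hypothetical):

```python
import numpy as np

def grade_to_index(labels):
    """'Grade 3' -> 2 (zero-based class index)."""
    return np.array([int(label.split()[1]) - 1 for label in labels])

def proba_to_grade(proba):
    """Pick the most probable class per row and format it as 'Grade N'."""
    return ['Grade ' + str(idx + 1) for idx in np.argmax(proba, axis=1)]

labels_toy = ['Grade 1', 'Grade 5', 'Grade 3']
indices = grade_to_index(labels_toy)

# Made-up probabilities: one row per building, one column per grade.
proba_toy = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                      [0.1, 0.1, 0.1, 0.1, 0.6],
                      [0.2, 0.2, 0.5, 0.05, 0.05]])
grades = proba_to_grade(proba_toy)
```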