# HACKEREARTH: #6 - Predict the damage to the building
- **Competition** : [here](https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-6-1/machine-learning/predict-the-energy-used-612632a9-3f496e7f/)

- **Leaderboard** : [here](https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-6-1/leaderboard/)

- **Data**        : [Download](https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-challenge-6-1/predict-the-energy-used-612632a9-3f496e7f/a490e594-6-Dataset.zip)

```
Opened At : Jun 16, 2018, 09:00 PM IST
Closed At : Aug 15, 2018, 11:55 PM IST
Rank      : XX
```

## Problem Statement:
Determining the degree of damage done to buildings after an earthquake can help identify safe and unsafe buildings, thus avoiding deaths and injuries from aftershocks. Machine learning is one viable option that can potentially prevent massive loss of life while making rescue efforts easier and more efficient. This challenge provides the before and after details of nearly one million buildings affected by an earthquake. The damage to a building is categorized into five grades, each depicting the extent of the damage. Given the building details, the task is to build a model that predicts the extent of damage done to a building by the earthquake.

---
## Code
### 1. Load libraries
#### Additional things
- Remove warnings
- Pandas maximum columns display = 1000
- Matplotlib inline

In [1]:
import pandas as pd
import math
import numpy as np
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 1000)
%matplotlib inline

### 2. Load data

In [2]:
data = pd.read_csv('../data/train.csv')
building_structure = pd.read_csv('../data/Building_Structure.csv')
building_ownership_use = pd.read_csv('../data/Building_Ownership_Use.csv')

In [3]:
building_structure.drop(['district_id', 'vdcmun_id'], axis = 1, inplace = True)
building_ownership_use.drop(['district_id', 'vdcmun_id', 'ward_id'], axis = 1, inplace = True)

In [4]:
test = pd.read_csv('../data/test.csv')

In [5]:
data.shape

(631761, 14)

In [6]:
building_structure.shape

(1052948, 27)

In [7]:
building_ownership_use.shape

(1052948, 14)

In [8]:
test.shape

(421175, 13)

### 3. Merge data

In [9]:
data = data.set_index('building_id').join(building_structure.set_index('building_id')).reset_index()
data = data.set_index('building_id').join(building_ownership_use.set_index('building_id')).reset_index()

In [10]:
test = test.set_index('building_id').join(building_structure.set_index('building_id')).reset_index()
test = test.set_index('building_id').join(building_ownership_use.set_index('building_id')).reset_index()

In [11]:
del building_structure
del building_ownership_use

In [12]:
data.shape

(631761, 53)

In [13]:
test.shape

(421175, 52)
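
The unchanged row counts above (631761 and 421175) suggest that each `building_id` matches at most one row in the lookup tables. A minimal sketch of how that assumption could be checked explicitly is shown below. It swaps the notebook's `join` for `DataFrame.merge` with `validate`, and the frames `train` and `structure` and their values are hypothetical toy stand-ins, not the competition data.

```python
import pandas as pd

# Toy stand-ins for train.csv and Building_Structure.csv (values are made up).
train = pd.DataFrame({
    "building_id": ["a1", "a2", "a3"],
    "damage_grade": ["Grade 1", "Grade 5", "Grade 3"],
})
structure = pd.DataFrame({
    "building_id": ["a1", "a2", "a3", "a4"],
    "age_building": [10, 25, 3, 40],
})

# A left merge keeps every training row. validate="many_to_one" raises a
# MergeError if building_id is duplicated in the lookup table, which is
# the situation that would silently add rows.
merged = train.merge(structure, on="building_id", how="left", validate="many_to_one")

# The row count must be unchanged and every row should have found a match.
assert len(merged) == len(train)
assert merged["age_building"].notnull().all()
print(merged.shape)
```

If the lookup table did contain duplicate keys, the merge would fail immediately instead of producing a larger frame that is only noticed later.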

In [15]:
data.to_csv('../data/full_train.csv', index = False)
test.to_csv('../data/full_test.csv', index = False)

### 4. EDA

#### 4.1 Check for missing
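- `has_repair_started` has approximately 5% missing values in both train and test
  - missing values are replaced with 2, so that "missing" is treated as its own category
- `count_families` has 1 missing value in the train data; it is replaced with 1

The `isnull()` outputs below were captured after the fills were applied (cells 21 and 22 ran after cells 19 and 20), which is why `has_repair_started` shows 0 there.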
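
One way the missing-value shares noted above could be measured is with `isnull().mean()`, which returns the fraction of missing entries per column. The sketch below is an illustration on a hypothetical toy frame `df` (the values are made up), followed by the sentinel fill described in the notes.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative, not taken from the competition data.
df = pd.DataFrame({
    "has_repair_started": [0.0, 1.0, np.nan, 0.0],
    "count_families": [1.0, 2.0, 1.0, np.nan],
    "age_building": [5, 12, 30, 7],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])

# Fill with a sentinel (2) so that "missing" becomes its own category,
# and with 1 for the family count, as described in the notes.
df["has_repair_started"] = df["has_repair_started"].fillna(2)
df["count_families"] = df["count_families"].fillna(1)
```

Using a sentinel keeps the information that the value was absent, which a mean or mode fill would discard.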

In [21]:
data.isnull().sum(axis = 0)

building_id                               0
area_assesed                              0
damage_grade                              0
district_id                               0
has_geotechnical_risk                     0
has_geotechnical_risk_fault_crack         0
has_geotechnical_risk_flood               0
has_geotechnical_risk_land_settlement     0
has_geotechnical_risk_landslide           0
has_geotechnical_risk_liquefaction        0
has_geotechnical_risk_other               0
has_geotechnical_risk_rock_fall           0
has_repair_started                        0
vdcmun_id                                 0
ward_id                                   0
count_floors_pre_eq                       0
count_floors_post_eq                      0
age_building                              0
plinth_area_sq_ft                         0
height_ft_pre_eq                          0
height_ft_post_eq                         0
land_surface_condition                    0
foundation_type                 

In [22]:
test.isnull().sum(axis = 0)

building_id                               0
area_assesed                              0
district_id                               0
has_geotechnical_risk                     0
has_geotechnical_risk_fault_crack         0
has_geotechnical_risk_flood               0
has_geotechnical_risk_land_settlement     0
has_geotechnical_risk_landslide           0
has_geotechnical_risk_liquefaction        0
has_geotechnical_risk_other               0
has_geotechnical_risk_rock_fall           0
has_repair_started                        0
vdcmun_id                                 0
ward_id                                   0
count_floors_pre_eq                       0
count_floors_post_eq                      0
age_building                              0
plinth_area_sq_ft                         0
height_ft_pre_eq                          0
height_ft_post_eq                         0
land_surface_condition                    0
foundation_type                           0
roof_type                       

In [19]:
data['count_families'] = data['count_families'].fillna(1)

In [20]:
data['has_repair_started'] = data['has_repair_started'].fillna(2)
test['has_repair_started'] = test['has_repair_started'].fillna(2)

#### 4.2 Analyze each independent column w.r.t target

In [33]:
data['damage_grade'].value_counts(normalize = True)

Grade 5    0.333710
Grade 4    0.240984
Grade 3    0.193567
Grade 2    0.134678
Grade 1    0.097062
Name: damage_grade, dtype: float64

- damage_grade: No severe class imbalance; class shares range from about 10% (Grade 1) to about 33% (Grade 5)
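
One way to quantify how uneven the classes are is the ratio between the largest and smallest class share. The sketch below uses the shares printed above and also derives inverse-frequency weights, which follow the same formula as scikit-learn's `class_weight='balanced'` option. Whether such weights were used in this project is not stated, so treat them as an optional illustration; the names `shares`, `imbalance_ratio` and `class_weights` are introduced here.

```python
import pandas as pd

# Class shares copied from the value_counts output above.
shares = pd.Series({
    "Grade 5": 0.333710,
    "Grade 4": 0.240984,
    "Grade 3": 0.193567,
    "Grade 2": 0.134678,
    "Grade 1": 0.097062,
})

# Ratio of the most frequent class to the least frequent one.
imbalance_ratio = shares.max() / shares.min()

# "Balanced" weights: inversely proportional to class share, so that
# weight * share equals 1 / n_classes for every class.
class_weights = 1.0 / (len(shares) * shares)

print(round(imbalance_ratio, 2))
print(class_weights.round(2))
```

A ratio of roughly 3.4 to 1 is mild, so plain accuracy-style metrics remain informative, but Grade 1 is still the class a model is most likely to under-predict.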

In [31]:
pd.crosstab(data['area_assesed'], data['damage_grade'])

| area_assesed \ damage_grade | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 |
|---|---|---|---|---|---|
| Both | 54020 | 77827 | 111187 | 118108 | 21850 |
| Building removed | 130 | 175 | 266 | 514 | 130261 |
| Exterior | 6714 | 6680 | 10166 | 31867 | 43603 |
| Interior | 263 | 324 | 503 | 570 | 158 |
| Not able to inspect | 193 | 78 | 166 | 1185 | 14953 |


- area_assesed:
    - About 99% of 'Building removed' rows (130,261 of 131,346) belong to Grade 5
    - About 90% of 'Not able to inspect' rows (14,953 of 16,575) belong to Grade 5
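
The observations above are easier to read off when each row of the crosstab is converted to shares. The sketch below rebuilds the table from the counts shown and normalizes by row; the names `counts` and `row_shares` are introduced here for illustration.

```python
import pandas as pd

grades = ["Grade 1", "Grade 2", "Grade 3", "Grade 4", "Grade 5"]

# Counts copied from the crosstab above
# (rows: area_assesed, columns: damage_grade).
counts = pd.DataFrame(
    [
        [54020, 77827, 111187, 118108, 21850],
        [130, 175, 266, 514, 130261],
        [6714, 6680, 10166, 31867, 43603],
        [263, 324, 503, 570, 158],
        [193, 78, 166, 1185, 14953],
    ],
    index=["Both", "Building removed", "Exterior", "Interior", "Not able to inspect"],
    columns=grades,
)

# Divide each row by its total to get the grade distribution per category.
row_shares = counts.div(counts.sum(axis=1), axis=0)

print(row_shares["Grade 5"].round(3))
```

On the raw data, `pd.crosstab(data['area_assesed'], data['damage_grade'], normalize='index')` should give the same row shares directly.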