# Richter's Predictor: Modeling Earthquake Damage
## MMAI 869 - Team College

URL: https://www.drivendata.org/competitions/57/nepal-earthquake/

Hi Team. Let's use this notebook to record our work for the ML challenge.

Please make a copy of this notebook and try building your models. Once you are done, please copy your code and post it as a section (one # in markdown) in this notebook.

Remember to append the name of your model to the list **models** and the predicted target of the test data to the dataframe **benchmark**.

The last section of this notebook shows the score of each model, so we can review each other's attempts and improve our models.

I have created 2 sample models for your reference.

You may condense some sections for easy viewing.

### Timeline
Nov 15: Taught on Classifiers  
Nov 21: Taught on Ensembles (Boosting)  
Nov 29: Taught on Feature Engineering and Hyperparameter Tuning  
Nov 30 - Dec 06: Build models on your own or in small groups  
Dec 07: Submit ideas on different parts of the presentation (point form is okay)  
Dec 08 - Dec 13: Another iteration of modelling  
Dec 14: Decide on the final model and provide another round of ideas on different parts of the presentation  
Dec 16: Kenny provides the ppt / notebook (80%) for Jing to review  
Dec 17: (Team Presentation for Agile Project Management)  
Dec 18: Jing finalises the presentation (95%)  
Dec 20: Presentation  


# Preparation

### Import Library

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsClassifier

### Import Data files

In [34]:
train_values = pd.read_csv("train_values.csv")
train_values.set_index('building_id',inplace=True)
train_labels = pd.read_csv("train_labels.csv")
train_labels.set_index('building_id',inplace=True)
test_values = pd.read_csv("test_values.csv")
test_values.set_index('building_id',inplace=True)

train = train_values.join(train_labels)


In [35]:
train.head()

| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other | damage_grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |


In [36]:
all_columns = list(train.columns)
x_columns = list(train.columns[:-1])  

pdx = train[x_columns]
pdy = train['damage_grade']

# Let's use 869 as the common random seed
# split data into 70/30 training to testing
x_train,x_test,y_train,y_test = train_test_split(pdx,pdy,train_size = 0.7,random_state=869)

models = ['Perfect'] # Keep a list of the models we tried
benchmark = pd.DataFrame(y_test)

benchmark.columns = models # The true labels become the 'Perfect' column; each model's predicted y for the test data is added later for benchmarking

# x_columns has all names of all features
# x_train are values of the features for training data
# y_train are targets for training data
# x_test are values of the features for testing data
# y_test are targets for the testing data

# we will likely need to transform the large number of categorical variables (target encoding?)
# we may need Optuna for hyperparameter tuning

# the scoring method for evaluation
scoring = "f1_micro"
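
The target-encoding idea noted in the comments above could look roughly like the following. This is only a minimal sketch on toy data: the helper `target_encode`, its `smoothing` parameter, and the choice to treat `damage_grade` as a numeric value to average are all assumptions for illustration, not something the team has settled on.

```python
import pandas as pd

def target_encode(train_col, train_target, apply_col, smoothing=10.0):
    """Smoothed mean target encoding, fitted on training data only (hypothetical helper)."""
    global_mean = train_target.mean()
    stats = train_target.groupby(train_col).agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink more.
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    # Categories never seen in training fall back to the global mean.
    return apply_col.map(encoded).fillna(global_mean)

# Toy stand-in for one geo_id column and the target
toy = pd.DataFrame({
    "geo_level_1_id": [6, 6, 8, 8, 8, 21],
    "damage_grade": [3, 2, 2, 2, 1, 3],
})
new_rows = pd.Series([6, 8, 99], name="geo_level_1_id")  # 99 is unseen in training

encoded_new = target_encode(toy["geo_level_1_id"], toy["damage_grade"], new_rows, smoothing=1.0)
print(encoded_new)
```

To avoid leaking the target, the encoding would need to be fitted inside each cross-validation fold rather than on the full training set.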


### Helper functions for model training and result recording - by Jing

In [37]:
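# define a function that wraps the preprocessing and the model in a pipeline and tunes it with GridSearchCV
# (it could rely on some global variables rather than taking everything as a parameter)
def train_model( X, y, model, model_name, param_grid, scoring, data_prep, cv=5 ):
    # define the pipeline: preprocessing step followed by the estimator
    pipeline = Pipeline(steps=[("data_prep", data_prep),
                              ("model", model)])

    # use GridSearchCV to find the best hyperparameters
    gs = GridSearchCV(pipeline,
                    param_grid=param_grid,
                    scoring=scoring,
                    cv=cv)

    # fit the model
    gs.fit(X, y)

    # print the score of the refitted best model on the training data itself
    # (the mean cross-validated score is available as gs.best_score_)
    print("{} training data average f1-score: {}".format(model_name, gs.score(X, y)))

    # return the fitted search object in case it is needed
    return gs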
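
The helper prints `gs.score(X, y)`, which evaluates the refitted best model on the same data it was trained on, so it can look much better than a held-out score. A minimal sketch on synthetic noise (all names and data here are illustrative, not from the notebook) compares it with `best_score_`, the mean cross-validated score:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(869)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.choice([1, 2, 3], size=200)  # labels are pure noise

# With k=1 every training point is its own nearest neighbour
gs_toy = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1]},
                      scoring="f1_micro", cv=5)
gs_toy.fit(X_toy, y_toy)

train_score = gs_toy.score(X_toy, y_toy)  # scored on the training data itself
cv_score = gs_toy.best_score_             # mean score over the 5 held-out folds
print("training score:", train_score)
print("cross-validated score:", cv_score)
```

For model comparison, `best_score_` or the test-set score is usually the more honest number.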


In [38]:
# define the function to predict on data and record the result in benchmark
# (appends to the global list models and adds a column to the global dataframe benchmark)

def pred_target(model_name, model, X):
    y_pred = model.predict(X)
    # Register model and save result for comparison
    models.append(model_name)
    benchmark[model_name] = y_pred

# Null Model that Assigns 2 as the Prediction - Kenny, Nov 11

### Feature Engineering

In [39]:
# Code for feature engineering

### Modelling

In [40]:
# Code for modelling and tuning with cross-validation
# scoring='f1_micro' can be used in the cross_val_score function to tune with respect to the micro-averaged F1

# Please add the predicted y of the test data to the dataframe benchmark and append the name of the model to models
models.append('All 2')
benchmark['All 2']=2
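
As a rough illustration of the `cross_val_score` hint in the comments above, the sketch below scores a constant "always 2" classifier with `scoring="f1_micro"`. The data is synthetic and the variable names are made up for this example; only the scoring string comes from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(869)
X_toy = rng.normal(size=(300, 4))
# Synthetic labels with class shares loosely similar to damage_grade
y_toy = rng.choice([1, 2, 3], size=300, p=[0.10, 0.57, 0.33])

# Null model that always predicts grade 2
null_model = DummyClassifier(strategy="constant", constant=2)

cv_scores = cross_val_score(null_model, X_toy, y_toy, cv=5, scoring="f1_micro")
print("fold scores:", cv_scores)
print("mean micro F1:", cv_scores.mean())
```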

# Plain Logistic Regression, ignore geo_id - Kenny, Nov 11

### Feature Engineering - Plain Logistic Regression

In [41]:
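x_train_dummies = pd.get_dummies(x_train).iloc[:,3:]  # transform categorical to dummy and ignore geo_id
# same for the test data, then align its columns to the training columns in case a category is missing from one split
x_test_dummies = pd.get_dummies(x_test).iloc[:,3:].reindex(columns=x_train_dummies.columns, fill_value=0)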


### Modelling - Plain Logistic Regression without geo_id

In [42]:
# import library
from sklearn.linear_model import LogisticRegression

# fit the model
model_LR = LogisticRegression(max_iter=100000)
model_LR.fit(x_train_dummies, y_train)

# predict on test set
y_pred = model_LR.predict(x_test_dummies)

# Register model and save result for comparison
models.append('Plain LR - no geo_id')
benchmark['Plain LR - no geo_id'] = y_pred

# KNN, geo_id as continuous - Jing, Nov 12

In [43]:
# get category columns
cat_columns = train.select_dtypes(include=['object']).columns

In [44]:
# don't use get_dummies; use OneHotEncoder instead: fit it on the training data once, then reuse the same encoder for training and the final test
# use ColumnTransformer to apply transformers to selected columns of an array or pandas DataFrame

ct = ColumnTransformer([("one-hot-encoder", OneHotEncoder(), cat_columns)], remainder ="passthrough")

# fit the column transformer first (GridSearchCV refits it inside the pipeline anyway)
ct.fit(x_train)
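
To illustrate the point in the comments about preferring `OneHotEncoder` over `get_dummies`, here is a small sketch on toy frames (the values `"n"`, `"q"`, `"x"` and the variable names are invented for the example). `handle_unknown="ignore"` is one option for categories that appear only at prediction time; the notebook's encoder uses the defaults.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_toy = pd.DataFrame({"roof_type": ["n", "q", "n"]})
test_toy = pd.DataFrame({"roof_type": ["n", "x"]})  # "x" is never seen in training

# get_dummies derives its columns from whatever data it is given,
# so the two frames end up with different columns.
train_cols = list(pd.get_dummies(train_toy).columns)
test_cols = list(pd.get_dummies(test_toy).columns)
print(train_cols, test_cols)

# An encoder fitted once on the training data always emits the same columns;
# with handle_unknown="ignore" an unseen category becomes an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy)
test_encoded = encoder.transform(test_toy).toarray()
print(encoder.categories_)
print(test_encoded)
```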

In [45]:
# build separate model
# build model for KNN
model_name = "KNN - geo_id as continuous"
model = KNeighborsClassifier()

k_range = list(range(1, 10))
params = dict(model__n_neighbors=k_range)

In [46]:
# run the grid search for KNN and keep the fitted search object

gs = train_model(x_train, y_train, model, model_name, params, scoring, ct)

KNN - geo_id as continuous training data average f1-score: 0.7706008113145488


In [47]:
# predict the value and put into benchmark

pred_target(model_name, gs, x_test)

# Your Modelling

In [48]:
# x_columns has all names of all features
# x_train are values of the features for training data
# y_train are targets for training data
# x_test are values of the features for testing data
# y_test are targets for the testing data

# Import any libraries you need in your own code
# scoring='f1_micro' can be used in the cross_val_score function to tune with respect to the micro-averaged F1

# Please add the predicted y of the test data to the dataframe benchmark and append the name of the model to models

# You may refer to the examples above

# Performance Evaluation (Micro Averaged F1 Score) using test data

In [49]:
for model_name in models:
    print(model_name,"- Micro Average F1 = ",
            f1_score(y_test, benchmark[model_name], average='micro'))

Perfect - Micro Average F1 =  1.0
All 2 - Micro Average F1 =  0.567439659252248
Plain LR - no geo_id - Micro Average F1 =  0.5860119466366508
KNN - geo_id as continuous - Micro Average F1 =  0.7081899694299126
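
For single-label multiclass problems like this one, the micro-averaged F1 is the same number as plain accuracy, which is why the "All 2" model scores close to the share of grade 2 in the test split. A quick check on made-up labels (the lists below are illustrative only):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true_toy = [1, 2, 2, 3, 3, 2, 1, 2]
y_pred_toy = [2, 2, 2, 3, 2, 2, 1, 3]

micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")
accuracy = accuracy_score(y_true_toy, y_pred_toy)

# 5 of the 8 predictions match, so both metrics are 0.625
print(micro_f1, accuracy)
```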


# --------------------
# For Presentation (12 minutes)

## What cleaning and preprocessing steps did you try? Which worked, which didn’t?

### Data are not balanced - See below

In [50]:
# check the output to see whether balanced
train["damage_grade"].value_counts()

2    148259
3     87218
1     25124
Name: damage_grade, dtype: int64

2: 148259/260601 = 0.57  
3: 87218/260601 = 0.33  
1: 25124/260601 = 0.096  
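
Given the imbalance above, one option (not something the notebook currently does) is to pass `stratify` to `train_test_split` so both splits keep roughly the same class shares. A minimal sketch on toy labels that loosely mimic those shares; the names `X_toy`, `y_toy` and the toy counts are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 57% grade 2, 33% grade 3, 10% grade 1
y_toy = pd.Series([2] * 57 + [3] * 33 + [1] * 10, name="damage_grade")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=869, stratify=y_toy
)

# Class shares in each split stay close to the overall shares
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```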

** Add your views here

## What feature engineering and selection steps did you try? Which worked, which didn’t?

** Add your views here

## Which ML algorithms did you try? How well did they work?

** Add your views here

## What hyperparameter tuning procedure did you try? What range of values did you consider? How much did they help performance?

** Add your views here

## Describe the drivers (i.e., feature importances) of your model’s performance. What did your model “learn”? Describe your best model, including confusion matrices and the associated metrics.

** Add your views here

## What are the next steps?

** Add your views here

## Lessons learned?

** Add your views here