Team CoronaBoost
477631,  
Øyvind Samuelsen, Mikkel Nygard
Challenge: 3

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor

In [2]:
# Load train/test data
train = pd.read_csv('data/challenge3_train.csv', index_col='id')
test = pd.read_csv('data/challenge3_test.csv', index_col='id')

In [3]:
# Helper lists for column types
features_nom_low = ['f5']           # nominal, low cardinality (<= 3 categories)
features_nom_high = ['f12', 'f28']  # nominal, high cardinality (>= 26 categories)
features_ord_alph = ['f15', 'f20']

# columns to perform one hot encoding on 
ohe_columns = features_ord_alph + features_nom_low + features_nom_high

## Data cleaning
These cleaning steps are based on our data exploration. After each step we checked whether the score of a basic XGBoost model improved, to decide whether to keep the step.

In [4]:
# Fix 0-value noise: replace zeros with the most common value in the column.
impute_0_columns = ['f3', 'f18', 'f21']

for column in impute_0_columns:
    # mode() returns a Series (ties are possible), so take its first entry as a scalar
    train.loc[train[column] == 0, column] = train[column].mode().iloc[0]
    test.loc[test[column] == 0, column] = test[column].mode().iloc[0]

# Replace the invalid value -1 in the 'month' column f11 with the most common month
train.loc[train['f11'] == -1, 'f11'] = train['f11'].mode().iloc[0]
test.loc[test['f11'] == -1, 'f11'] = test['f11'].mode().iloc[0]
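
As a small illustration of the imputation step: `Series.mode()` returns a Series rather than a scalar, because several values can tie for most common, so one entry has to be picked before assigning. The sketch below uses a made-up frame `toy` (hypothetical values, not the challenge data) and assumes that `0` marks noise.

```python
import pandas as pd

# Toy frame with a non-default index, similar to an 'id' index (hypothetical values).
toy = pd.DataFrame({'f3': [2, 0, 2, 5, 0, 2]}, index=[10, 11, 12, 13, 14, 15])

# mode() returns a Series; iloc[0] selects the first mode as a scalar.
most_common = toy['f3'].mode().iloc[0]

# Replace the noisy zeros with that scalar.
toy.loc[toy['f3'] == 0, 'f3'] = most_common
```

Assigning the scalar avoids index alignment between the mode Series and the rows being overwritten.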

## Feature engineering
As with data cleaning, we tested each step to see whether it improved the model's score.

In [5]:
# There is a positive correlation for target 1 between f6 and f25, which is not present in target 0.
train['f6_f25'] = train['f6'].fillna(1)**2*train['f25'].fillna(1)**2
test['f6_f25'] = test['f6'].fillna(1)**2*test['f25'].fillna(1)**2
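
To make the arithmetic of this interaction feature concrete, here is a sketch on made-up numbers (the values are hypothetical). Filling missing values with 1 makes a missing factor neutral in the product, so the feature falls back to the square of the other column.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): one complete row, one with f6 missing, one with f25 missing.
toy = pd.DataFrame({'f6': [2.0, np.nan, 3.0], 'f25': [3.0, 4.0, np.nan]})

# Same expression as in the cell above: squared product with NaN treated as 1.
toy['f6_f25'] = toy['f6'].fillna(1) ** 2 * toy['f25'].fillna(1) ** 2

# Row by row: (2*3)^2 = 36, (1*4)^2 = 16, (3*1)^2 = 9
print(toy)
```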

In [6]:
# We hypothesise that column f11 encodes months as 0-11,
# so we add a cyclical (sin/cos) encoding of it.
def cyc_enc(df, col, period):
    df[col + '_sin'] = np.sin(2 * np.pi * df[col] / period)
    df[col + '_cos'] = np.cos(2 * np.pi * df[col] / period)
    return df

# Twelve distinct values (0-11) form a cycle of length 12;
# a period of 11 would map months 0 and 11 to the same point.
train = cyc_enc(train, 'f11', 12)
test = cyc_enc(test, 'f11', 12)
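
The following sketch shows how the choice of period affects a cyclical encoding, assuming twelve distinct month values 0-11 as hypothesised above. With a period of 12, months 11 and 0 are adjacent but distinct points on the unit circle, whereas a period of 11 places both on the same point. The helper `cyc_point` is introduced here for illustration only.

```python
import numpy as np

def cyc_point(value, period):
    """Return the (sin, cos) point for `value` on a cycle of length `period`."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Period 12: the distance from month 11 to month 0 equals one ordinary month step.
dist_12 = np.linalg.norm(cyc_point(11, 12) - cyc_point(0, 12))
step_12 = np.linalg.norm(cyc_point(1, 12) - cyc_point(0, 12))

# Period 11: month 11 wraps fully around and coincides with month 0.
dist_11 = np.linalg.norm(cyc_point(11, 11) - cyc_point(0, 11))

print(dist_12, step_12, dist_11)
```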

In [7]:
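# Columns with dtype 'object' must be represented numerically.
# One-hot encoding gave us the best result compared to label encoding and target encoding.
train = pd.get_dummies(train, columns=ohe_columns)
test = pd.get_dummies(test, columns=ohe_columns)

# Encoding the two sets separately can produce different columns when a category
# occurs in only one of them, so align test to the training features.
test = test.reindex(columns=train.columns.drop('target'), fill_value=0)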
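
Encoding train and test separately can yield different column sets when a category appears in only one of them. The sketch below, on made-up frames `train_toy` and `test_toy` (hypothetical data), shows one way to align the test columns to the training features with `reindex`; it is one possible approach, not necessarily the only sensible one.

```python
import pandas as pd

# Toy data (hypothetical): 'd' occurs only in test, 'b' and 'c' only in train.
train_toy = pd.DataFrame({'f5': ['a', 'b', 'c'], 'target': [0, 1, 0]})
test_toy = pd.DataFrame({'f5': ['a', 'd']})

train_ohe = pd.get_dummies(train_toy, columns=['f5'])
test_ohe = pd.get_dummies(test_toy, columns=['f5'])

# Keep exactly the training feature columns, in training order.
# Columns missing from test are filled with 0; test-only columns are dropped.
feature_cols = train_ohe.columns.drop('target')
test_ohe = test_ohe.reindex(columns=feature_cols, fill_value=0)

print(list(test_ohe.columns))
```

After alignment the test frame has the same features in the same order as the training frame, which is what the fitted model expects at prediction time.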

# Modeling
We use XGBoost with optimized hyperparameters.

In [8]:
# Separate the features (X) from the target (y)
X = train.drop(['target'], axis=1)
y = train['target']

## Training the final model, predicting the test set, and writing the delivery CSV

In [9]:
%%time

# Fit the model with the best parameters on all training data, then predict the test data.
# Parameters were found with Hyperopt, as detailed in the long document.
xgbmodel = XGBRegressor(
    n_estimators=300,
    scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),  # ratio of negative to positive examples
    colsample_bytree=0.7541691347538767,
    eval_metric='auc',
    gamma=0.4511313995198146,
    learning_rate=0.03432377692457561,
    max_depth=9,
    min_child_weight=4.0,
    reg_alpha=0.33753466341724986,
    reg_lambda=0.3322614438833425,
    subsample=0.8,
)

xgbmodel.fit(X, y, verbose=False)
predictions = xgbmodel.predict(test)

CPU times: user 11min 14s, sys: 2.63 s, total: 11min 17s
Wall time: 3min 43s
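
The `scale_pos_weight` argument in the cell above is computed as the ratio of negative to positive examples in the target. A sketch on a made-up target vector `y_toy` (hypothetical values) shows the computation and an equivalent form using `value_counts`:

```python
import pandas as pd

# Toy target (hypothetical): six negatives and two positives.
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

# Same expression as in the model cell: count of class 0 over count of class 1.
ratio = len(y_toy[y_toy == 0]) / len(y_toy[y_toy == 1])

# Equivalent computation from the class counts.
counts = y_toy.value_counts()
ratio_alt = counts.loc[0] / counts.loc[1]

print(ratio, ratio_alt)
```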


In [10]:
# to_csv returns None when given a path, so keep the DataFrame in its own variable
submission = pd.DataFrame({
    'id': test.index,
    'target': predictions,
})
submission.to_csv('477631+477578_predictions.txt', index=False)
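
A quick way to check the delivery format without touching the file system is to write to an in-memory buffer. The sketch below uses made-up ids and predictions (hypothetical values). It also shows that `DataFrame.to_csv` returns `None` when given a path or buffer, so the frame has to be kept in its own variable if it is needed afterwards.

```python
import io
import pandas as pd

# Toy submission (hypothetical ids and scores).
submission_toy = pd.DataFrame({'id': [101, 102], 'target': [0.25, 0.75]})

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
result = submission_toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```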