### This notebook contains the code to generate the submission for the "Pump it Up: Data Mining the Water Table" competition.

We use the preprocessed training data and corresponding labels, a held-out validation split, and the test data. We need to predict the ordinal variable 'status_group', encoded as 0, 1, 2 (non functional, functional needs repair, functional). The error metric used in the competition is the classification rate (the fraction of predictions that are correct).
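
The metric is simple enough to state as code. The following is a minimal, dependency-free sketch; the function name `classification_rate` is an illustrative assumption, and the result should match the `np.mean(y_pred == y_true)` expression used in the cells of this notebook:

```python
def classification_rate(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (made up for illustration): 4 of 5 predictions match.
print(classification_rate([0, 1, 2, 2, 0], [0, 1, 2, 0, 0]))  # 0.8
```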

In this notebook we train an XGBoost classifier.

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier

In [None]:
X_train = pd.read_csv('../prep_data/X_train.csv')
y_train = pd.read_csv('../prep_data/y_train.csv')
X_val = pd.read_csv('../prep_data/X_val.csv')
y_val = pd.read_csv('../prep_data/y_val.csv')

In [None]:
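# Baseline model: n_estimators is an upper bound, since early stopping on the
# validation set ends training after 5 rounds without improvement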
model = XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=-1, early_stopping_rounds=5, random_state=42)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          verbose=False)

In [17]:
y_pred = model.predict(X_val)
# Fraction of validation predictions that match the true labels
class_rate = np.mean(y_pred == y_val.values.ravel())
print(f'Classification rate: {class_rate:.2f}')

Classification rate: 0.80


In [18]:
# Coarse grid search over n_estimators and learning_rate
param_grid = {
    'n_estimators': [500, 1000, 1200],
    'learning_rate': [0.01, 0.05, 0.1]
}

scores = []
for n_est in param_grid["n_estimators"]:
    for lr in param_grid["learning_rate"]:
        model = XGBClassifier(n_estimators=n_est, learning_rate=lr, n_jobs=-1, early_stopping_rounds=5, random_state=42)
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  verbose=False)
        y_pred = model.predict(X_val)
        class_rate = np.mean(y_pred == y_val.values.ravel())
        scores.append(class_rate)
        print(f'n_estimators: {n_est}, learning_rate: {lr}, classification rate: {class_rate:.4f}')
print(f'Best classification rate: {max(scores):.4f}')

n_estimators: 500, learning_rate: 0.01, classification rate: 0.7594
n_estimators: 500, learning_rate: 0.05, classification rate: 0.7903
n_estimators: 500, learning_rate: 0.1, classification rate: 0.8013
n_estimators: 1000, learning_rate: 0.01, classification rate: 0.7762
n_estimators: 1000, learning_rate: 0.05, classification rate: 0.8013
n_estimators: 1000, learning_rate: 0.1, classification rate: 0.8012
n_estimators: 1200, learning_rate: 0.01, classification rate: 0.7780
n_estimators: 1200, learning_rate: 0.05, classification rate: 0.8030
n_estimators: 1200, learning_rate: 0.1, classification rate: 0.8012
Best classification rate: 0.8030
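
The loop above records only the scores, so the best parameter pair has to be read off the printed log. Below is a sketch of one way to keep parameters and score together. The names `grid_search`, `evaluate` and `toy_evaluate` are hypothetical, and the toy scoring function merely stands in for fitting the XGBoost model and scoring it on the validation set:

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination; return (best_params, best_score, results)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(**params)))
    # max() returns the first maximal element, so ties go to the earliest combination.
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_evaluate(n_estimators, learning_rate):
    # Hypothetical stand-in for "fit the model, predict on X_val, compute the
    # classification rate"; it peaks at learning_rate=0.05 and ignores n_estimators.
    return 1.0 - abs(learning_rate - 0.05)


param_grid = {
    "n_estimators": [500, 1000, 1200],
    "learning_rate": [0.01, 0.05, 0.1],
}
best_params, best_score, results = grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

In the notebook itself, `evaluate` would wrap the model construction, `fit`, `predict` and classification-rate lines from the loop body.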


In [19]:
# Best so far: n_estimators=1200, learning_rate=0.05. Search a finer grid around that point.
param_grid = {
    'n_estimators': [1100, 1200, 1300],
    'learning_rate': [0.03, 0.04, 0.05, 0.06, 0.07]
}

scores = []
for n_est in param_grid["n_estimators"]:
    for lr in param_grid["learning_rate"]:
        if lr == 0.05 and n_est == 1200:
            continue  # already evaluated in the coarse grid above
        model = XGBClassifier(n_estimators=n_est, learning_rate=lr, n_jobs=-1, early_stopping_rounds=5, random_state=42)
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  verbose=False)
        y_pred = model.predict(X_val)
        class_rate = np.mean(y_pred == y_val.values.ravel())
        scores.append(class_rate)
        print(f'n_estimators: {n_est}, learning_rate: {lr}, classification rate: {class_rate:.4f}')
print(f'Best classification rate: {max(scores):.4f}')

n_estimators: 1100, learning_rate: 0.03, classification rate: 0.7966
n_estimators: 1100, learning_rate: 0.04, classification rate: 0.7995
n_estimators: 1100, learning_rate: 0.05, classification rate: 0.8030
n_estimators: 1100, learning_rate: 0.06, classification rate: 0.8013
n_estimators: 1100, learning_rate: 0.07, classification rate: 0.8023
n_estimators: 1200, learning_rate: 0.03, classification rate: 0.7977
n_estimators: 1200, learning_rate: 0.04, classification rate: 0.7997
n_estimators: 1200, learning_rate: 0.06, classification rate: 0.8013
n_estimators: 1200, learning_rate: 0.07, classification rate: 0.8023
n_estimators: 1300, learning_rate: 0.03, classification rate: 0.7976
n_estimators: 1300, learning_rate: 0.04, classification rate: 0.7997
n_estimators: 1300, learning_rate: 0.05, classification rate: 0.8030
n_estimators: 1300, learning_rate: 0.06, classification rate: 0.8013
n_estimators: 1300, learning_rate: 0.07, classification rate: 0.8023
Best classification rate: 0.8030
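
Several configurations tie at the top of this grid. The sketch below makes the tie-break explicit, using a subset of the scores printed in this notebook. The rule "among equal scores, prefer the fewest estimators" and the variable names are illustrative assumptions. The scores are rounded to four decimals as printed, so the tie is only as exact as the log:

```python
# (n_estimators, learning_rate) -> validation classification rate, copied from
# the printed results; the (1200, 0.05) entry comes from the coarse grid.
scores = {
    (1100, 0.05): 0.8030,
    (1100, 0.07): 0.8023,
    (1200, 0.05): 0.8030,
    (1200, 0.07): 0.8023,
    (1300, 0.05): 0.8030,
    (1300, 0.07): 0.8023,
}

best_score = max(scores.values())
tied = [params for params, score in scores.items() if score == best_score]
# Assumed tie-break: tuples compare element-wise, so min() picks the
# smallest n_estimators among the top scorers.
best_n_estimators, best_learning_rate = min(tied)
print(tied)
print(best_n_estimators, best_learning_rate)
```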


In [20]:
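# At learning_rate=0.05, n_estimators of 1100, 1200 and 1300 all score 0.8030
# (early stopping likely ends training before the cap), so take the smallest: 1100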
model_fin = XGBClassifier(n_estimators=1100, learning_rate=0.05, n_jobs=-1, early_stopping_rounds=5, random_state=42)
model_fin.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)

In [21]:
y_pred = model_fin.predict(X_val)
class_rate = np.mean(y_pred == y_val.values.ravel())
print(f'Final classification rate: {class_rate:.4f}')

Final classification rate: 0.8030


In [None]:
# Load test data
X_test = pd.read_csv('../prep_data/X_test.csv')

# Prepare submission
output = pd.DataFrame(X_test["id"])
X_test.drop(columns=["id"], inplace=True)

y_test = model_fin.predict(X_test)
output["status_group"] = y_test
# Map the numeric classes back to the original label strings
output["status_group"] = output["status_group"].map({0: "non functional", 1: "functional needs repair", 2: "functional"})
output.head()

      id    status_group
0  50785      functional
1  51630      functional
2  17168      functional
3  45559  non functional
4  49871      functional


In [None]:
# Save to csv
output.to_csv('../submissions/submission_xgb.csv', index=False)

### Final note:

After submission, the resulting score was 0.8030 as well, matching the validation score. The best score on the leaderboard is 0.8299. I'm ranked 4580/7059 (at time of writing). Given that the best result is only about 2.7 percentage points more accurate (roughly 3% in relative terms), my result seems quite good as well.
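
The gap to the top of the leaderboard can be expressed in absolute or relative terms. A quick check with the figures quoted above (the variable names are illustrative):

```python
my_score = 0.8030    # submission score quoted above
best_score = 0.8299  # best leaderboard score quoted above
rank, n_entries = 4580, 7059

abs_gap_pp = (best_score - my_score) * 100              # gap in percentage points
rel_gap_pct = (best_score - my_score) / my_score * 100  # gap relative to my score, in percent
rank_pct = rank / n_entries * 100                       # rank as a percentage of all entries

print(f"absolute gap: {abs_gap_pp:.2f} percentage points")
print(f"relative gap: {rel_gap_pct:.2f}%")
print(f"rank: top {rank_pct:.1f}% of entries")
```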