# Benchmark

**Introduction:**
Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting one of these three classes, and building a smart understanding of which waterpoints will fail, can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

This is an intermediate-level competition hosted by [DrivenData][1]! All code and support scripts are in the [GitHub repo][2].

[1]: https://www.drivendata.org/competitions/7/ "Link to Competition Page"
[2]: https://github.com/msampathkumar/datadriven_pumpit "User Code"

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from scripts.tools import data_transformations, df_check_stats, game, sam_pickle_save, sam_pickle_load

np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)

%matplotlib inline

In [2]:
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')

df_check_stats(RAW_X, RAW_y, RAW_TEST_X)

# bool columns
tmp = ['public_meeting', 'permit']
RAW_X[tmp] = RAW_X[tmp].fillna(True)
RAW_TEST_X[tmp] = RAW_TEST_X[tmp].fillna(True)

# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()

# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('Other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('Other')

# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X)

sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration0_")
df_check_stats(X, y, TEST_X)

Data Frame Shape: (59400, 39) TotColumns: 39 ObjectCols: 0
Data Frame Shape: (59400, 1) TotColumns: 1 ObjectCols: 0
Data Frame Shape: (14850, 39) TotColumns: 39 ObjectCols: 0
SAVE PREFIX USED:  tmp/Iteration0_
Data Frame Shape: (59400, 39) TotColumns: 39 ObjectCols: 0
Numpy Array Size: 59400
Data Frame Shape: (14850, 39) TotColumns: 39 ObjectCols: 0
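
The missing-value handling in the cell above can be tried on a toy frame. This is a minimal sketch, not the competition data: the values are invented, and `toy`, `text_col`, and `amount` are hypothetical names introduced only for illustration. It follows the same two steps, filling the boolean columns with `True` and every object column with `'Other'`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for RAW_X; all values are invented for illustration.
toy = pd.DataFrame({
    'public_meeting': pd.Series([True, np.nan, False], dtype=object),
    'permit': pd.Series([np.nan, False, True], dtype=object),
    'text_col': pd.Series(['a', np.nan, 'b'], dtype=object),  # hypothetical column
    'amount': [1.0, 2.0, 3.0],                                # hypothetical column
})

# Step 1: boolean columns, missing values become True.
tmp = ['public_meeting', 'permit']
toy[tmp] = toy[tmp].fillna(True)

# Step 2: every remaining object column, missing values become 'Other'.
obj_cols = toy.dtypes[toy.dtypes == 'O'].index.tolist()
toy[obj_cols] = toy[obj_cols].fillna('Other')

print(toy)
print('missing values left:', int(toy.isna().sum().sum()))
```

Filling missing booleans with `True` is a modelling choice rather than a neutral default, so it is worth revisiting if these columns turn out to matter.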


In [3]:
# benchmark: random forest ('rf')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf = game(X_train, X_test, y_train, y_test, algo='rf')

------------------------------------------------
AC Score: 0.984848484848 F1 Score: 0.984906895824
------------------------------------------------
AC Score: 0.799865319865 F1 Score: 0.806462319165
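
The `game` helper lives in `scripts.tools` and is not shown in this notebook. The sketch below is an assumption about what such a helper might do, not its actual code: it fits a classifier and prints accuracy and F1 for the training split and then the held-out split. The name `fit_and_report`, the weighted F1 average, and the synthetic data are all hypothetical choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def fit_and_report(clf, X_train, X_test, y_train, y_test):
    """Fit clf, then print and return (accuracy, weighted F1) per split."""
    clf.fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in (('train', X_train, y_train),
                                 ('test', X_test, y_test)):
        pred = clf.predict(X_part)
        scores[name] = (accuracy_score(y_part, pred),
                        f1_score(y_part, pred, average='weighted'))
        print('AC Score:', scores[name][0], 'F1 Score:', scores[name][1])
    return scores


# Synthetic three-class data standing in for the pump features (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, n_classes=3,
                                     random_state=42)

# stratify keeps the class proportions the same in both splits.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                      random_state=42, stratify=y_demo)

demo_scores = fit_and_report(
    RandomForestClassifier(n_estimators=50, random_state=42),
    Xtr, Xte, ytr, yte)
```

Reporting both splits side by side makes it easy to spot a model that memorises the training data but scores much lower on unseen rows.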


In [4]:
# benchmark: gradient boosting ('gb')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf = game(X_train, X_test, y_train, y_test, algo='gb')

------------------------------------------------
AC Score: 0.755757575758 F1 Score: 0.77760116502
------------------------------------------------
AC Score: 0.754074074074 F1 Score: 0.776074770727


In [5]:
# benchmark: k-nearest neighbours ('knn'), reusing the split from the previous cell
knn_clf = game(X_train, X_test, y_train, y_test, algo='knn')

------------------------------------------------
AC Score: 0.72430976431 F1 Score: 0.733751767199
------------------------------------------------
AC Score: 0.594276094276 F1 Score: 0.607978244208
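
To compare the three benchmarks at a glance, the printed accuracies can be put side by side. This is a rough sketch that assumes the first output line of each cell is the training split and the second is the held-out split, which the notebook does not state explicitly; the names `accuracy`, `gaps`, and `best_on_test` are introduced here for illustration.

```python
# (train, test) accuracy copied from the benchmark cell outputs above.
# Reading the two printed lines as train then test is an assumption.
accuracy = {
    'rf': (0.984848484848, 0.799865319865),
    'gb': (0.755757575758, 0.754074074074),
    'knn': (0.72430976431, 0.594276094276),
}

# A large train-test gap is a common sign of overfitting.
gaps = {algo: train - test for algo, (train, test) in accuracy.items()}
best_on_test = max(accuracy, key=lambda algo: accuracy[algo][1])

for algo, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f'{algo}: train-test accuracy gap = {gap:.3f}')
print('highest held-out accuracy:', best_on_test)
```

Under that reading, the random forest has the highest held-out accuracy but also the widest gap between splits, while gradient boosting scores almost the same on both.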
