# Refinement

**Introduction:**
Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting one of these three classes, based on a smart understanding of which waterpoints will fail, can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

This is also an intermediate-level competition by [DrivenData][1]! All code and support scripts are in the [GitHub repo][2].

[1]: https://www.drivendata.org/competitions/7/ "Link to Competition Page"
[2]: https://github.com/msampathkumar/datadriven_pumpit "User Code"

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV

from scripts.tools import df_check_stats, game, sam_pickle_save, sam_pickle_load

np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)

%matplotlib inline

In [25]:
crazy_list = dir()

In [26]:
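for each in dir():
    if each not in crazy_list:
        # NOTE: `del each` only unbinds the loop variable `each`; the object
        # it names stays in the namespace. Removing it would need
        # `del globals()[each]`.
        del each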

print('Length of dir():', len(dir()))

Length of dir(): 80
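
The cleanup loop above uses `del each`, which removes only the name `each`, not the variable whose name it holds. A minimal sketch of name-based cleanup follows. It uses a plain dict as a stand-in for the notebook namespace (in a notebook that dict would be `globals()`), and the variable names are hypothetical:

```python
# Stand-in for the notebook namespace; in a live session use globals().
namespace = {"keep_me": 1}

# Snapshot of the names that should survive the cleanup.
baseline = set(namespace)

# Simulate scratch variables created after the snapshot.
namespace["scratch_a"] = [1, 2, 3]
namespace["scratch_b"] = "temp"

# Iterate over a copy of the keys, because the dict is modified in the loop.
for name in list(namespace):
    if name not in baseline:
        del namespace[name]  # removes the entry itself, not a loop variable

remaining = sorted(namespace)
print(remaining)
```

Taking the snapshot as a set keeps the membership test cheap. Iterating over `list(namespace)` avoids the "dictionary changed size during iteration" error.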


In [27]:
X, y, TEST_X = sam_pickle_load(prefix="tmp/Iteration2_final_")
df_check_stats(X, y, TEST_X)

LOAD PREFIX USED:  tmp/Iteration2_final_
Data Frame Shape: (59400, 43) TotColumns: 43 ObjectCols: 0
Numpy Array Size: 59400
Data Frame Shape: (14850, 43) TotColumns: 43 ObjectCols: 0


In [33]:
# preprocess dataset, split into training and test part
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=.25, random_state=42)

# MultiClass

### RF

In [30]:
clf = RandomForestClassifier(random_state=192)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
print('AC Score:', scores.mean())

AC Score: 0.798063835306


In [None]:
clf = OneVsOneClassifier(RandomForestClassifier(random_state=192))
scores = cross_val_score(clf, X, y, cv=5)
print('OneVsOneClassifier - RF:', sum(scores) / 5)

clf = OneVsRestClassifier(RandomForestClassifier(random_state=192))
scores = cross_val_score(clf, X, y, cv=5)
print('OneVsRestClassifier - RF:', sum(scores) / 5)

OneVsOneClassifier - RF: 0.800959522006
OneVsRestClassifier - RF: 0.800572057251


### GB

In [None]:
clf = GradientBoostingClassifier(random_state=192)
scores = cross_val_score(clf, X, y, cv=5)
print('GB:', sum(scores) / 5)

In [11]:
clf = OneVsOneClassifier(GradientBoostingClassifier(random_state=192), n_jobs=3)
scores = cross_val_score(clf, X, y, cv=5)
print('OneVsOneClassifier - GB:', sum(scores) / 5)

clf = OneVsRestClassifier(GradientBoostingClassifier(random_state=192), n_jobs=3)
scores = cross_val_score(clf, X, y, cv=5)
print('OneVsRestClassifier - GB:', sum(scores) / 5)

OneVsOneClassifier - GB: 0.750084423814
OneVsRestClassifier - GB: 0.74818216327


For GBT, the multiclass wrappers (`OneVsOneClassifier` and `OneVsRestClassifier`) only reduce performance, and we also noticed at run time that they take considerably longer to train.

For RF, which had generalisation issues, the same wrappers perform slightly better: about 0.801 versus 0.798 for the plain classifier.
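
Part of the extra run time comes from how many binary models each wrapper trains. The sketch below counts them on a synthetic three-class dataset. The data and the small `n_estimators` value are assumptions made for illustration; this is not the pump dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic stand-in with three classes, like status_group.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Small forest so the sketch runs quickly.
base = RandomForestClassifier(n_estimators=10, random_state=192)

ovo = OneVsOneClassifier(base).fit(X_demo, y_demo)
ovr = OneVsRestClassifier(base).fit(X_demo, y_demo)

# One-vs-one trains one model per pair of classes: k * (k - 1) / 2.
n_ovo = len(ovo.estimators_)
# One-vs-rest trains one model per class: k.
n_ovr = len(ovr.estimators_)

print("one-vs-one binary models:", n_ovo)
print("one-vs-rest binary models:", n_ovr)
```

With three classes both wrappers fit three binary models per training run, where the plain classifier fits one. Under 5-fold cross-validation that cost is multiplied by the number of folds.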

# Fine Tuning

In [None]:
print('Length of dir():', len(dir()))

for each in dir():
    if each not in crazy_list:
        # NOTE: this unbinds only the loop variable `each`, not the named object.
        del each

print('Length of dir():', len(dir()))

### Random Forest

In [9]:
X, y, TEST_X = sam_pickle_load(prefix="tmp/Iteration2_final_")
df_check_stats(X, y, TEST_X)

LOAD PREFIX USED:  tmp/Iteration2_final_
Data Frame Shape: (59400, 43) TotColumns: 43 ObjectCols: 0
Numpy Array Size: 59400
Data Frame Shape: (14850, 43) TotColumns: 43 ObjectCols: 0
--


In [5]:
parameters = {
    'n_estimators': [10, 50, 100, 150, 200],
    'class_weight': ['balanced_subsample', 'balanced'],
    'criterion': ['gini', 'entropy'],
    'max_features': ['log2', 'auto', 25],
    'random_state': [192]
}

# clf_rf = RandomForestClassifier(n_estimators=150, criterion='entropy', class_weight="balanced_subsample", n_jobs=-1, random_state=192)
# 0.81346801346801345

GS_CV = RandomizedSearchCV(RandomForestClassifier(), parameters)

GS_CV.fit(X, y)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_features': ['log2', 'auto', 25], 'random_state': [192], 'n_estimators': [10, 50, 100, 150, 200], 'class_weight': ['balanced_subsample', 'balanced'], 'criterion': ['gini', 'entropy']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [7]:
print(GS_CV.best_params_, GS_CV.best_score_)

cv_results = pd.DataFrame(GS_CV.cv_results_, columns=[u'mean_fit_time', u'mean_score_time', u'mean_test_score',
       u'mean_train_score', u'param_class_weight', u'param_criterion',
       u'param_max_features', u'param_n_estimators', u'params'])

{'class_weight': 'balanced', 'random_state': 192, 'n_estimators': 200, 'criterion': 'gini', 'max_features': 'auto'} 0.807053872054
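
`RandomizedSearchCV` evaluates only `n_iter` sampled combinations (10 here, per the repr above), not the full grid. A small sketch using `ParameterGrid` shows how many combinations the RF search space contains. The variable names `rf_parameters` and `n_candidates` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the RF cell above.
rf_parameters = {
    'n_estimators': [10, 50, 100, 150, 200],
    'class_weight': ['balanced_subsample', 'balanced'],
    'criterion': ['gini', 'entropy'],
    'max_features': ['log2', 'auto', 25],
    'random_state': [192],
}

# ParameterGrid enumerates every combination without fitting anything.
n_candidates = len(ParameterGrid(rf_parameters))
n_iter = 10  # number of candidates the randomized search samples

print("total combinations:", n_candidates)
print("fraction explored:", n_iter / n_candidates)
```

The randomized search therefore covers one sixth of the 60 combinations, so a different run may land on a different best setting.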


### GBT

In [14]:
GradientBoostingClassifier?

    - If "auto", then `max_features=sqrt(n_features)`.
    - If "sqrt", then `max_features=sqrt(n_features)`.
    - If "log2", then `max_features=log2(n_features)`.
    - If None, then `max_features=n_features`.

In [22]:
parameters = {
    'learning_rate': [0.7, 0.5, 0.1, 0.05, 0.01], # default - 0.1
    'max_depth': [2, 3, 5, 7], # default - 3
    'max_features': [5, 10, 15],
    'random_state': [192]
}

In [23]:
GS_CV = RandomizedSearchCV(GradientBoostingClassifier(), parameters)

GS_CV.fit(X, y)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_features': [5, 10, 15], 'max_depth': [2, 3, 5, 7], 'learning_rate': [0.7, 0.5, 0.1, 0.05, 0.01], 'random_state': [192]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [35]:
GS_CV

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_features': [5, 10, 15], 'max_depth': [2, 3, 5, 7], 'learning_rate': [0.7, 0.5, 0.1, 0.05, 0.01], 'random_state': [192]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [36]:
GS_CV.best_params_

{'learning_rate': 0.5, 'max_depth': 7, 'max_features': 5, 'random_state': 192}
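
The searches above leave `random_state=None` on `RandomizedSearchCV` itself, so the sampled candidates can differ between runs. The sketch below fixes that seed and ranks all sampled candidates, not just the best one. It runs on synthetic data with a reduced search space, and `n_iter`, `cv`, and `n_estimators` are assumed values chosen only to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class stand-in for the pump data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)

# Reduced version of the GBT search space from the notebook.
param_distributions = {
    'learning_rate': [0.5, 0.1, 0.05],
    'max_depth': [2, 3],
    'max_features': [5, 10, 15],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=192),
    param_distributions,
    n_iter=4,          # number of sampled candidates
    cv=3,
    random_state=192,  # makes the candidate sampling repeatable
)
search.fit(X_demo, y_demo)

# Rank every sampled candidate by its mean cross-validated score.
ranked = (
    pd.DataFrame(search.cv_results_)
    .sort_values('rank_test_score')
    [['params', 'mean_test_score', 'std_test_score']]
)
print(ranked)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top candidate is really better than the runner-up or within fold-to-fold noise.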

# Results

### RF

In [20]:
GS_CV.best_params_

{'class_weight': 'balanced',
 'criterion': 'gini',
 'max_features': 'auto',
 'n_estimators': 200,
 'random_state': 192}

In [10]:
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=200, criterion='gini', max_features='auto', random_state=192))
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

[ 0.81794  0.81357  0.81288  0.81044  0.80872]


In [11]:
sum(scores)/ 5

0.8127102008672592

### GB

Best parameters found by the randomized search:

{'learning_rate': 0.5, 'max_depth': 7, 'max_features': 5, 'random_state': 192}

In [39]:
clf = GradientBoostingClassifier(learning_rate=0.5, max_depth=7, max_features=5, random_state=192)

In [40]:
scores = cross_val_score(clf, X, y, cv=5)
print('Score:', scores)
print('Avg Score:', sum(scores)/5)

Score: [ 0.79825  0.79522  0.8      0.79891  0.79694]
Avg Score: 0.7979


# Submit

In [12]:
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=200, criterion='gini', max_features='auto', random_state=192))
clf = clf.fit(X, y)

In [16]:
import pickle

# load the fitted encoder that maps integer classes back to status labels
with open('tmp/le.pkl', 'rb') as f:
    le = pickle.load(f)

In [17]:

# saving the index
test_ids = TEST_X.index

# predicting the values
predictions = clf.predict(TEST_X)
print(predictions.shape)

# Converting int to its respective Labels
predictions_labels = le.inverse_transform(predictions)

# setting up column name & save file
sub = pd.DataFrame(predictions_labels, columns=['status_group'])
sub.insert(loc=0, column='id', value=test_ids)
sub.to_csv('submit.csv', index=False)
sub.head()

(14850,)


|   | id | status_group |
|---|-------|----------------|
| 0 | 50785 | non functional |
| 1 | 51630 | functional |
| 2 | 17168 | functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |
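
The label-decoding and file-writing steps can be checked in isolation with a small sketch. It assumes `le` is a scikit-learn `LabelEncoder` fitted on the three status labels. The encoder `le_demo` and the hard-coded predictions are hypothetical stand-ins, and the ids are taken from the first rows of the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the pickled encoder; classes are sorted
# alphabetically, so 0 = functional, 1 = functional needs repair,
# 2 = non functional.
le_demo = LabelEncoder().fit(
    ['functional', 'functional needs repair', 'non functional']
)

# In the notebook these come from TEST_X.index and clf.predict(TEST_X).
test_ids = [50785, 51630, 17168]
predictions = [2, 0, 0]

sub = pd.DataFrame({
    'id': test_ids,
    'status_group': le_demo.inverse_transform(predictions),
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = sub.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the `id` and `status_group` columns.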
