### Random Forests

Random forests are an ensemble learning technique: many decision trees are built, and their predictions are averaged (for regression) or combined by majority vote (for classification) to give the final prediction. Each tree is trained with some randomness, typically a bootstrap sample of the rows and a random subset of the features at each split. This decorrelates the trees and reduces the variance of the ensemble at the cost of a small increase in bias.
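To make the aggregation step concrete, here is a minimal sketch of how per-tree predictions for a single sample could be combined. The helper names and the toy labels are hypothetical. Note that scikit-learn's forest averages per-tree class probabilities rather than counting hard votes, so this illustrates the idea only, not the library's implementation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions for one sample.

    Ties go to the label that was seen first (Counter keeps insertion order).
    """
    return Counter(tree_predictions).most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Return the mean of the per-tree numeric predictions (regression case)."""
    return sum(tree_predictions) / float(len(tree_predictions))

# Hypothetical outputs of three trees for the same sample.
print(majority_vote(["a", "b", "a"]))
print(average_prediction([1.0, 2.0, 3.0]))
```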

We do basic feature extraction: we keep only 10 of the 39 features and convert the categorical ones to a 1-of-k (one-hot) encoding. Even with this basic feature extraction, we get a prediction accuracy of 79%.
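As a small illustration of 1-of-k encoding, the sketch below applies `pd.get_dummies` to a toy column. The column name matches the notebook, but the rows are made up. The `prefix` argument is an optional addition that the notebook does not use; it keeps indicator columns from different source features distinguishable when they share a value such as `unknown`.

```python
import pandas as pd

# Toy frame: the column name comes from the notebook, the rows are made up.
toy = pd.DataFrame({"quantity": ["insufficient", "seasonal", "insufficient", "unknown"]})

# One indicator column per distinct category value, named "<prefix>_<value>".
encoded = pd.get_dummies(toy["quantity"], prefix="quantity")

print(list(encoded.columns))
# Cast to int so the indicators print as 0/1 regardless of pandas version.
print(encoded.astype(int).values.tolist())
```

Each row has exactly one indicator set, which is what "1-of-k" refers to.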

By including more features and performing some more advanced feature engineering, we can reach prediction accuracies up to 83% (not shown in this notebook).

In [1]:
from sklearn.tree import DecisionTreeClassifier as Tree
from sklearn.ensemble import RandomForestClassifier as Forest
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib qt

In [4]:
train_data = pd.read_csv("WaterPump-training-values.csv")
train_labels = pd.read_csv("WaterPump-training-labels.csv")
N = train_data.shape[0]

In [23]:
#picking features that we want to keep
features = ['longitude','latitude','gps_height','population','construction_year','water_quality','quantity','region_code',
           'source','waterpoint_type']
train = train_data[features]
#converting categorical features to 1-of-k representation
train1 = pd.concat([train, pd.get_dummies(train['water_quality']), pd.get_dummies(train['quantity']),
                    pd.get_dummies(train['source']), pd.get_dummies(train['waterpoint_type'])], axis=1)
#removing the categorical features after we converted them
#(region_code is dropped as well, since it was not converted)
#drop returns a new frame here; combining inplace=True with assignment would leave train1 as None
train1 = train1.drop(['water_quality','quantity','region_code', 'source', 'waterpoint_type'], axis=1)

In [24]:
#separating dataset into a training set (~90%) and a withheld test set (~10%) for validation
train_mask = np.random.uniform(0, 1, len(train1)) <= 0.9
train = train1[train_mask]
trainLabels = train_labels[train_mask]
test = train1[~train_mask]
testLabels = train_labels[~train_mask]
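The Boolean mask in the cell above is drawn from an unseeded generator, so the split (and the reported accuracy) changes slightly on every run. The sketch below shows one way to make such a split reproducible. The fixed seed, the row count, and the variable names are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

# A fixed seed is an assumption; the notebook does not seed its generator.
rng = np.random.RandomState(0)
n_rows = 1000  # hypothetical number of rows

# True for roughly 90% of the rows (training), False for the rest (withheld).
train_mask = rng.uniform(0, 1, n_rows) <= 0.9
test_mask = ~train_mask

# The two masks are disjoint and together cover every row.
print(train_mask.sum(), test_mask.sum())
```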

In [26]:
#training the random forest
forest = Forest(n_estimators=100,criterion='gini')
forest.fit(train,trainLabels['status_group'])

RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [29]:
#making predictions on the withheld data
preds = forest.predict(test)
#fraction of withheld rows whose predicted status matches the true label
accuracy = (preds == testLabels['status_group'].values).mean()
print(accuracy)

0.790258449304


The random forest predicts the status of about 79% of the withheld pumps correctly!
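An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes accuracy as the fraction of matching predictions and compares it with always predicting the most common label. The function names and the toy labels are hypothetical, and the baseline is computed on the same labels purely for simplicity.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / float(len(labels))

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / float(len(labels))

# Toy data, not taken from the water pump dataset.
toy_labels = ["a", "a", "a", "b", "b"]
toy_preds = ["a", "a", "b", "b", "b"]

print(accuracy(toy_preds, toy_labels))
print(majority_baseline(toy_labels))
```

A model is only useful to the extent that its accuracy exceeds this baseline.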

Below we can look at which features have the most predictive power. Features with high importance tend to be used nearer to the root of the decision trees. We can also reuse this ranking when selecting features for our model!
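For intuition about what "predictive power" means here: with `criterion='gini'`, a split is scored by how much it lowers the Gini impurity of the labels, and scikit-learn's impurity-based importances aggregate these decreases per feature across the trees. The sketch below computes the decrease for a single split on toy labels. The function names are hypothetical, and it omits the weighting and normalization the library applies.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = float(len(labels))
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["a", "a", "b", "b"]
# A split that separates the classes perfectly removes all impurity.
print(impurity_decrease(parent, ["a", "a"], ["b", "b"]))
# A split that leaves both children as mixed as the parent removes none.
print(impurity_decrease(parent, ["a", "b"], ["a", "b"]))
```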

In [33]:
#importance of each data feature that we kept
#materialized as a list so the pairs can be sorted or inspected more than once
importances = list(zip(forest.feature_importances_,list(train.columns.values)))

In [35]:
sorted(importances,key=lambda x: x[0])

[(4.157248750361075e-05, 'dam'),
 (8.5802760153608722e-05, 'fluoride abandoned'),
 (0.00041373472890740434, 'unknown'),
 (0.00045397805951696829, 'cattle trough'),
 (0.00065079516735029065, 'fluoride'),
 (0.00079645670256850995, 'other'),
 (0.001171466300827778, 'salty abandoned'),
 (0.0012568216446781314, 'hand dtw'),
 (0.0012682113324976728, 'coloured'),
 (0.0014836707289752279, 'milky'),
 (0.0015663430362202501, 'improved spring'),
 (0.0021102186909650288, 'dam'),
 (0.0022838528373634297, 'unknown'),
 (0.0037050525654342709, 'salty'),
 (0.0037206671572899861, 'lake'),
 (0.0041189533059318925, 'rainwater harvesting'),
 (0.0057334878694563027, 'river'),
 (0.0065885452857120906, 'soft'),
 (0.0068209060906524446, 'seasonal'),
 (0.0070876581603355419, 'machine dbh'),
 (0.0086162268978782607, 'shallow well'),
 (0.0088090275722504281, 'communal standpipe multiple'),
 (0.0088500016424302025, 'spring'),
 (0.0088694981054384843, 'unknown'),
 (0.013998056932112398, 'insufficient'),
 (0.0147532