# Tanzanian Waterpoint Analysis and Classification

# Problem Description

The Tanzanian Ministry of Water needs to ensure that clean, potable water is available to communities across Tanzania using limited resources.

That water can be provided by improving the maintenance of existing waterpoints and by expanding the number of waterpoints.

If we can accurately classify a waterpoint, the Ministry will have a better understanding of their existing infrastructure, and because of cost savings, will be able to reallocate existing resources to expand the water infrastructure.

The Ministry needs to be able to predict which of three classes each waterpoint belongs to: functional, functional but in need of some repairs, or non-functional.

# Data

Data is provided by [Taarifa](http://taarifa.org/) and the [Tanzanian Ministry of Water](http://maji.go.tz/) originally as part of a competition hosted by [DrivenData](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

# EDA / Data Cleaning

Initial EDA and Data Cleaning of the Waterpoint data is in [eda.ipynb](eda.ipynb "eda").

Visualization of Cleaned Data is in [data_viz.ipynb](data_viz.ipynb 'visualizations').

# Models

The following classification models were trained in the listed order.

* [Baseline - Logistic Regression](baseline_model.ipynb)
* [Logistic Regression with Oversampling](log_reg_smote.ipynb)
* [Decision Tree with Oversampling](dec_tree.ipynb)
* [Random Forest with Oversampling](rand_forest.ipynb)
* [XGBoost with Oversampling](xg_boost.ipynb)
* Random Forest with Oversampling and Tuning
    * [First Attempt](rand_forest-cv6.ipynb)
    * [Attempt using High Cardinality Categorical Features](rand_forest-cv3.ipynb)
    * [Attempt using SMOTENC - had the best results](rand_forest_cv8.ipynb)
    
A comparison of the primary metrics for the above models is in [model_metrics_comparison.ipynb](model_metrics_comparison.ipynb).

The most important features as determined by the best performing classifier are [here](feature_importance.ipynb).

# Conclusion

**Using a Random Forest Classifier I achieved an accuracy score of 0.73 with recall values of 0.74, 0.59 and 0.73 for the three classes, 'functional', 'functional needs repairs' and 'non functional' respectively.  The class with a recall of 0.59 was the minority class.**

**Being able to predict the status of a waterpoint with 73% accuracy will be a great help to the Ministry of Water, allowing them to prioritize site visits.**

*Recommendations*

* **I recommend the Ministry of Water use the predictive model to create a prioritization strategy for their waterpoint site visits. Waterpoints that are predicted to be 'functional needs repairs' and 'non functional' should receive priority status. Waterpoints predicted to be 'functional' should be visited when possible for two reasons: routine inspection, and the chance that the waterpoint is one of those that was not correctly predicted.**


* **The Ministry can use the cost savings from a more efficient maintenance operation to expand the water infrastructure.**


* **The Ministry can use the accuracy of the model and its improved maintenance program as a selling point when soliciting international aid.**

To achieve the above results I took the following steps:
1. REVIEW THE PROVIDED DATA: 

I found close to 60,000 instances with 39 features, most of which were nominal categorical. Of those features several were redundant, either containing exactly the same information, or the same information in a slightly different categorization. Digging deeper into the features, I found that 7 of the features contained NaN values and 24 of the features had placeholder values like 0 or 'unknown'. There were 3 possible classes for each instance, making this a multiclass classification problem.

2. CLEAN: 

I didn't want to drop any instances if possible, so I needed to find ways to impute the missing data, or as a last resort drop a feature. In all but one feature, I found a reasonable way to impute the data. The feature that I dropped was missing 70% of its data, and I didn't see any way to impute its data without introducing a significant bias. For all the other features with missing or placeholder data, I chose to impute the regional mean, median or mode depending on the feature type and distribution. For continuous data with a close to normal distribution I used the mean, for continuous data that had a skewed distribution I used the median, and for categorical data I used the mode. In a few cases the entire region was missing values; in those cases I took the mean, median or mode from the whole dataset instead of the region. Two of the categorical features contained names with inconsistent capitalization, so I made all the entries lowercase.
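The regional imputation with a dataset-wide fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from eda.ipynb: the function name `impute_by_region`, its parameters, and the column names in the usage example are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_by_region(df, column, region_col="region", strategy="median", placeholder=None):
    """Fill missing values in `column` with a per-region statistic.

    Falls back to the whole-dataset statistic when a region has no usable
    values. All names here are illustrative assumptions.
    """
    out = df.copy()
    if placeholder is not None:
        # Treat placeholder values (e.g. 0 or 'unknown') as missing.
        out[column] = out[column].mask(out[column] == placeholder)

    def stat(series):
        series = series.dropna()
        if series.empty:
            return np.nan                 # nothing to learn from in this group
        if strategy == "mean":
            return series.mean()
        if strategy == "median":
            return series.median()
        return series.mode().iloc[0]      # "mode", for categorical features

    regional = out.groupby(region_col)[column].apply(stat)
    overall = stat(out[column])           # dataset-wide fallback value
    fill = out[region_col].map(regional).fillna(overall)
    out[column] = out[column].fillna(fill)
    return out

# Toy example: region "b" has only placeholder zeros, so it falls back
# to the dataset-wide median.
toy = pd.DataFrame({"region": ["a", "a", "a", "b", "b"],
                    "population": [10, 0, 30, 0, 0]})
imputed = impute_by_region(toy, "population", strategy="median", placeholder=0)
```

Passing `strategy="mean"` or `strategy="mode"` selects the statistic for near-normal continuous features or categorical features respectively.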

3. DATA EXPLORATION / FEATURE SELECTION: 

While examining the data during the cleaning process I became familiar with the values and did feature selection concurrently. I found the following categories of features: Useless to the Model, Missing Too Much Data, Redundant, Potentially Relevant, and Potentially Relevant but High Cardinality. In the Useless to the Model category, I found an irrelevant ID, a feature containing the same value for every instance, and the names given to waterpoints. In the Missing Too Much Data category, I found the one feature mentioned above that was missing 70% of its data. In the Redundant category, I found features that were exact copies of each other, near copies of each other, or grouped the data in a way that overlapped other regional groupings in a confusing way. In the Potentially Relevant category, I found a number of nominal categorical features and a few continuous numerical ones. In the Potentially Relevant but High Cardinality category, I found several levels of granularity for regions in Tanzania, and the list of funders for the waterpoint projects.

After selecting the Potentially Relevant features I looked deeper into them to see if there was much correlation to the classes.  While some values of many features showed a tendency to one of the majority classes, in more cases those values showed similar distributions to the distribution of class values.  If the classes were to be successfully predicted, a combination of the features would be needed.

4. PREDICTIVE MODELING: 

I chose a baseline model and several classifiers and then tuned the most promising of them.

The predictive metrics I chose to focus on were Recall and Accuracy, with the classifiers using One Vs. Rest. I wanted each of the classes to have high but even Recall and a high Accuracy. Having even Recall values was of particular importance because the classes were highly imbalanced, with two majority classes and one minority class. I chose Recall because I wanted to focus on minimizing the False Negatives, while keeping overall Accuracy high.
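As a rough illustration of these two metrics, per-class Recall and overall Accuracy can be computed with scikit-learn as below. The helper name `recall_and_accuracy`, the exact label strings, and the toy predictions are assumptions made for the example; they are not results from the notebooks.

```python
from sklearn.metrics import accuracy_score, recall_score

# Label strings are assumed here; use whatever the target column contains.
LABELS = ["functional", "functional needs repair", "non functional"]

def recall_and_accuracy(y_true, y_pred, labels=LABELS):
    """Return ({label: recall}, accuracy) for a multiclass prediction."""
    # average=None gives one Recall value per class instead of a single average.
    recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    return dict(zip(labels, recalls)), accuracy_score(y_true, y_pred)

# Toy labels only, to show the call.
f, r, n = LABELS
y_true = [f, f, f, f, r, r, n, n, n, n]
y_pred = [f, f, f, n, r, f, n, n, f, n]
per_class_recall, accuracy = recall_and_accuracy(y_true, y_pred)
```

Reporting Recall per class, rather than a single averaged number, is what makes a weak minority class visible.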

In all the models, I used scikit-learn's StandardScaler to scale the continuous data and OneHotEncoder to encode the categorical data.
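One common way to combine the two transformers is a ColumnTransformer, sketched below. Whether the notebooks wire them together this way is an assumption, and the column names and the `handle_unknown="ignore"` setting are illustrative choices only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(continuous_cols, categorical_cols):
    """Scale continuous columns and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("scale", StandardScaler(), continuous_cols),
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

# Toy frame with hypothetical column names.
X = pd.DataFrame({
    "gps_height": [100.0, 200.0, 300.0],
    "basin": ["a", "b", "a"],
})
preprocessor = build_preprocessor(["gps_height"], ["basin"])
Xt = preprocessor.fit_transform(X)  # 1 scaled column + 2 one-hot columns
```

Fitting the preprocessor on the training split only, and reusing it to transform the test split, keeps test data from leaking into the scaling statistics.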

For the baseline model I chose a basic *Logistic Regression Classifier*.  As was to be expected it did a pretty good job predicting the majority classes but was useless at predicting the minority class.

Next I ran the same *Logistic Regression Classifier*, but with *SMOTE* oversampling. By creating synthetic data for the minority class I hoped to improve on the results. There was improvement: Recall for the 3 classes was pretty even, but overall Accuracy had dropped.
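To show what "creating synthetic data for the minority class" means, here is a simplified NumPy sketch of the interpolation idea behind SMOTE. It is not the implementation used in the notebooks (which presumably relied on a library version of SMOTE), and the function name `smote_like_oversample` is hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples
    and their k nearest minority neighbours. Needs at least two samples."""
    X = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(X) - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X), size=n_new)    # starting sample for each new row
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its neighbours
    gap = rng.random((n_new, 1))                  # position along the connecting segment
    return X[base] + gap * (X[picked] - X[base])

# Toy minority class: the four corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(corners, n_new=10, k=2)
```

Every synthetic row lies on a line segment between two real minority samples, which is why this kind of interpolation only makes sense for continuous features.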

Next I ran a *Decision Tree Classifier*, again with *SMOTE* oversampling. In this case the two majority classes saw their Recall increase but the minority class's Recall decreased, while the overall Accuracy increased close to the level of the baseline model. The training set accuracy was significantly higher than the test set accuracy, indicating overfitting.

Next I ran a *Random Forest Classifier* with *SMOTE* oversampling.  It produced increases in the Recall for all the classes and increased the overall Accuracy above the baseline model.   The training set accuracy was significantly higher than the test set accuracy, indicating overfitting.

Next I ran an *XGBoost Classifier* with *SMOTE* oversampling.  It increased the Recall of the minority class, but reduced the Recall of the two majority classes, while significantly reducing the overall Accuracy.

Of the above classifiers, the *Random Forest Classifier* seemed the most promising, so I proceeded to tune its hyperparameters in hopes of improving its results, in particular to reduce the amount of overfitting. I used RandomizedSearchCV to provide K-fold cross-validation and a random selection of hyperparameters to reduce computation time. I attempted to tune the hyperparameters in several ways: some using SMOTE, some using SMOTENC (which is supposed to give better results for non-continuous features, which I had), and some including 4 one-hot encoded high cardinality features. Each method showed improvement over the untuned Random Forest, but the methods did not differ significantly amongst themselves. The end result was an increase in the Recall of the minority class, decreases in the Recall of the majority classes and a slight decrease in the overall Accuracy, returning it to the baseline level.
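A randomized search of this kind might look like the sketch below. The parameter grid, `n_iter`, `cv`, the `recall_macro` scoring choice, and the synthetic dataset are all assumptions for illustration; the values actually searched are in the tuning notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: three classes, one of them a small minority.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, weights=[0.55, 0.07, 0.38], random_state=0,
)

# Hypothetical search space; limiting depth and raising the minimum leaf
# size are typical ways to rein in an overfitting forest.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                 # sample 5 combinations instead of trying all 72
    cv=3,                     # 3-fold cross-validation
    scoring="recall_macro",   # unweighted mean of the per-class Recall values
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Sampling a handful of combinations trades some thoroughness for a large cut in computation compared with an exhaustive grid search.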

**Future Work**

* Investigate Misclassifications

As in many multiclass classification problems, not all misclassifications are equal. In this case, for example, a 'Functional Needs Repair' waterpoint being misclassified as 'non functional' would not be as bad as misclassifying it as 'functional'. In the first case the maintenance worker would visit a waterpoint expecting to repair a completely broken waterpoint, but would instead find a functional waterpoint in need of repair, while in the second case the visit would likely be delayed because the waterpoint was predicted to be working. There are three categories of predictions in this situation: correct, wrong but not a problem, and wrong but a problem. The instances that were classified 'wrong but a problem' would need to be reviewed to find what might be confusing the model, and additional real or synthetic data used to retrain the model.
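One possible way to start this investigation is to tally predictions into the three categories using a severity matrix, as sketched below. Only the 'needs repair' row follows directly from the example above; the other off-diagonal entries, the label strings, and the function name are assumptions that would need to be agreed with the Ministry.

```python
import numpy as np

LABELS = ["functional", "functional needs repair", "non functional"]
CATEGORIES = ["correct", "wrong_ok", "wrong_problem"]

# Rows = true class, columns = predicted class.
# 0 = correct, 1 = wrong but not a problem, 2 = wrong but a problem.
SEVERITY = np.array([
    [0, 1, 1],  # truly functional: assumed to cost only an unnecessary visit
    [2, 0, 1],  # truly needs repair: predicted 'functional' delays the visit
    [2, 1, 0],  # truly non functional: assumed same logic as the row above
])

def severity_counts(y_true, y_pred, labels=LABELS):
    """Count predictions per severity category."""
    index = {label: i for i, label in enumerate(labels)}
    counts = {name: 0 for name in CATEGORIES}
    for true_label, pred_label in zip(y_true, y_pred):
        level = SEVERITY[index[true_label], index[pred_label]]
        counts[CATEGORIES[level]] += 1
    return counts

# Toy labels only, to show the call.
f, r, n = LABELS
counts = severity_counts([f, r, r, n, n], [f, n, f, r, f])
```

Filtering the test set to the `wrong_problem` rows would give the list of instances to review first.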

* Gather Maintenance Records for the Future

While the current model does a good job, as the waterpoints are visited for repairs, that information will need to be integrated into the model so that the same waterpoints aren't receiving the same predictions year to year. An attempt should be made to locate historical maintenance records so they can be integrated into the model along with the future information.

* Further Split the non-functional waterpoints into visit or don't visit

The most important feature as found by the Random Forest Classifier was whether the water quantity was dry. It overwhelmingly indicated a non-functional waterpoint; however, it represented less than a third of the non-functional waterpoints. When considering how to prioritize which waterpoints to visit for maintenance, additional classifiers could be added to the model to indicate whether to visit a waterpoint. A non-functional dry waterpoint does not need to be visited because it would only become a functional dry waterpoint and the local population would not have access to any more water. An attempt should be made to train the model to further classify non-functional waterpoints as visit or ignore. Dry is an obvious characteristic, but a classification model may be able to determine if there are others, or combinations of others, to assist in further optimizing waterpoint maintenance operations.
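Until such a classifier exists, the dry-waterpoint rule alone could be applied as a simple filter on the predictions. The sketch below is a rule-based starting point, not a trained model; the column names `predicted_status` and `quantity`, the label strings, and the function name are assumptions.

```python
import pandas as pd

def add_priority_visit_flag(df, status_col="predicted_status", quantity_col="quantity"):
    """Flag waterpoints for a priority visit, skipping non-functional dry ones."""
    out = df.copy()
    needs_attention = out[status_col].isin(["functional needs repair", "non functional"])
    # Repairing a dry, non-functional waterpoint would not yield any more water.
    dry_and_broken = (out[status_col] == "non functional") & (out[quantity_col] == "dry")
    out["priority_visit"] = needs_attention & ~dry_and_broken
    return out

# Toy predictions with hypothetical values.
predictions = pd.DataFrame({
    "predicted_status": ["functional", "functional needs repair",
                         "non functional", "non functional"],
    "quantity": ["enough", "enough", "dry", "insufficient"],
})
flagged = add_priority_visit_flag(predictions)
```

A trained visit-or-ignore classifier could later replace the hard-coded `dry_and_broken` rule with conditions learned from the data.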
