# Detecting Tanzanian Water Wells in Need of Repair

This project aims to help the Tanzanian government (and support organizations) detect which water wells are in need of repair. Well reliability is a substantial issue in Tanzania, which experiences severe dry seasons and where most of the population lives in rural communities.

We performed a thorough EDA of the dataset and built several models to detect whether a water well is in need of repair. We tried 5 different classification models and settled on a Random Forest Classifier as the best performer: it had a higher overall F1 score than the alternatives, as well as the best recall for the 'needs repair' category. Higher recall means fewer false negatives. We believe recall is the best way to evaluate the model because a false negative (a well that needs repairs being labeled as not needing repairs) could have devastating effects on a local community, whereas a false positive (a well that does not need repairs being flagged for repair), while a waste of resources, would not have nearly as drastic an effect.
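
To make the recall argument concrete, here is a minimal sketch that counts false negatives and false positives by hand and derives recall, precision, and F1 for the 'needs repair' class. The label lists are made up for illustration and are not project data.

```python
# Made-up labels for illustration only; not taken from the project data.
y_true = ["needs repair", "needs repair", "needs repair", "needs repair",
          "functional", "functional", "functional",
          "functional", "functional", "functional"]
y_pred = ["needs repair", "needs repair", "needs repair", "functional",
          "needs repair", "needs repair", "functional",
          "functional", "functional", "functional"]

positive = "needs repair"
pairs = list(zip(y_true, y_pred))

tp = sum(t == positive and p == positive for t, p in pairs)  # repairs correctly flagged
fn = sum(t == positive and p != positive for t, p in pairs)  # broken wells that were missed
fp = sum(t != positive and p == positive for t, p in pairs)  # working wells flagged anyway

recall = tp / (tp + fn)        # share of wells needing repair that the model catches
precision = tp / (tp + fp)     # share of flagged wells that really need repair
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Each missed well (`fn`) lowers recall directly, which is why recall on 'needs repair' is the number we watch most closely.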

### Load data and necessary packages

In [22]:
%load_ext autoreload
%autoreload 2
import functions_used as func
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report



In [5]:
features = pd.read_csv("~/ds/proj3/tanzania-water-wells/data/raw/training-set-values.csv")
targets = pd.read_csv("~/ds/proj3/tanzania-water-wells/data/raw/training-labels.csv")
X_test = pd.read_csv("~/ds/proj3/tanzania-water-wells/data/raw/test-set.csv")

In [6]:
# Collapse the targets 'non functional' and 'functional needs repair' into a single 'needs repair' class
targets['status_group'] = targets['status_group'].map({'non functional':'needs repair',
                                                       'functional needs repair':'needs repair',
                                                       'functional':'functional'})
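
One property of `Series.map` worth keeping in mind here: any label that is missing from the dictionary becomes `NaN`. The sketch below, run on a toy frame rather than the real labels, shows a quick check that the mapping covers every value.

```python
import pandas as pd

# Toy labels for illustration; the real values come from training-labels.csv.
toy = pd.DataFrame({"status_group": ["functional", "non functional",
                                     "functional needs repair", "functional"]})

mapping = {"non functional": "needs repair",
           "functional needs repair": "needs repair",
           "functional": "functional"}
binary = toy["status_group"].map(mapping)

# Series.map yields NaN for any value not found in the dict, so count those.
unmapped = int(binary.isna().sum())
counts = binary.value_counts().to_dict()
print(unmapped, counts)
```

An `unmapped` count of zero confirms that every original label landed in one of the two classes.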

### Build the model

In [7]:
# Columns kept for modeling; the last entry, 'status_group', is the target rather than a feature
features_list = ['basin', 'region', 'scheme_management', 'scheme_name',
       'extraction_type', 'management', 'payment', 'water_quality', 'quantity',
       'source', 'waterpoint_type', 'gps_height', 'longitude', 'latitude',
       'region_code', 'district_code', 'population', 'construction_year', 'status_group']

In [20]:
# Initialize a one-hot encoder; categories unseen during fitting are encoded as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')
# Train/test split the data, then join features and targets before preprocessing.
# Note: this rebinds X_test, replacing the competition test set loaded earlier.
X_train, X_test, y_train, y_test = train_test_split(features, targets, random_state=42)
joined_train = X_train.join(y_train, lsuffix='_l', rsuffix='_r')
joined_train_processed, y_train = func.model_preprocessing(joined_train, features_list, ohe, train=True)
# Do the same for the test data. Joining first ensures the correct rows are dropped from both X and y.
joined_test = X_test.join(y_test, lsuffix='_l', rsuffix='_r')
joined_test_processed, y_test = func.model_preprocessing(joined_test, features_list, ohe, train=False)
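
The internals of `func.model_preprocessing` are not shown in this notebook. As a rough sketch of the pattern the `train` flag suggests (an assumption, not the actual implementation), the encoder is fitted on training rows only and then reused on test rows, with `handle_unknown='ignore'` covering categories that appear only in the test set. The toy frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames for illustration; column names mirror two entries of features_list.
train = pd.DataFrame({"basin": ["Rufiji", "Pangani", "Rufiji"],
                      "quantity": ["enough", "dry", "enough"]})
test = pd.DataFrame({"basin": ["Pangani", "Lake Victoria"],  # second value is unseen
                     "quantity": ["dry", "enough"]})

ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(train)  # learn the categories from training rows only
test_encoded = ohe.transform(test)        # reuse them; unseen values encode as all zeros

print(train_encoded.shape, test_encoded.shape)
print(test_encoded.toarray())
```

Fitting only on the training split keeps information from the test rows out of the encoder, and both matrices end up with the same columns, so the fitted model can score the test set directly.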

In [23]:
# Initialize a Random Forest Classifier object
rfc = RandomForestClassifier(n_estimators=50, random_state=42, bootstrap=True, max_depth=50)
# Fit the model to the training data
rfc.fit(joined_train_processed, y_train)
# Make predictions on the held-out test data
rfc_predicts = rfc.predict(joined_test_processed)
# Score the model (mean accuracy on the test data)
rfc.score(joined_test_processed, y_test)

0.8200598177644849

In [24]:
# Display classification metrics using the scikit-learn functions imported above
print(classification_report(y_test, rfc_predicts))
confusion_matrix(y_test, rfc_predicts)