# Tanzanian Water Well Status Predictions

#### Authors: Kyle Dufrane and Brad Horn

### Overview

#### This project analyzes the Tanzanian Water Wells dataset released by the Tanzanian Government. The dataset includes 59,400 rows, each representing a unique well within the Tanzanian Government's jurisdiction. Our target is broken down into three categories:

* Functional
* Non Functional
* Functional Needs Repair

#### We will attempt to predict each well's status by using Exploratory Data Analysis (EDA) and building classification models tuned on the parameters that have the largest impact on predictive performance.


### Business Understanding

#### Flatiron LLC has recently been awarded a contract to maintain wells in Tanzania. They're looking for a system to help develop preventative maintenance schedules by predicting pump failures and replacement schedules to better serve their client. Flatiron LLC would like key insights on:

* Regional impact on wells
* Areas with low water quantity
* Factors that negatively impact wells


### Data Understanding

#### This dataset comes with two applicable files: training_set_labels and training_set_values. During our EDA we will join these tables to give us one file to work with. The values dataset has 40 columns in total: an 'id' column plus the 39 predictive features described below.

* amount_tsh : Total static head (amount water available to waterpoint)
* date_recorded : The date the row was entered
* funder : Who funded the well
* gps_height : Altitude of the well
* installer : Organization that installed the well
* longitude : GPS coordinate
* latitude : GPS coordinate
* wpt_name : Name of the waterpoint if there is one
* num_private : Private use or not
* basin : Geographic water basin
* subvillage : Geographic location
* region : Geographic location
* region_code : Geographic location (coded)
* district_code : Geographic location (coded)
* lga : Geographic location
* ward : Geographic location
* population : Population around the well
* public_meeting : True/False
* recorded_by : Group entering this row of data
* scheme_management : Who operates the waterpoint
* scheme_name : Who operates the waterpoint
* permit : If the waterpoint is permitted
* construction_year : Year the waterpoint was constructed
* extraction_type : The kind of extraction the waterpoint uses
* extraction_type_group : The kind of extraction the waterpoint uses
* extraction_type_class : The kind of extraction the waterpoint uses
* management : How the waterpoint is managed
* management_group : How the waterpoint is managed
* payment : What the water costs
* payment_type : What the water costs
* water_quality : The quality of the water
* quality_group : The quality of the water
* quantity : The quantity of water
* quantity_group : The quantity of water
* source : The source of the water
* source_type : The source of the water
* source_class : The source of the water
* waterpoint_type : The kind of waterpoint
* waterpoint_type_group : The kind of waterpoint

#### To start we will import all of our needed libraries and dive into our datasets.

In [1]:
# Import needed libraries

# Import libraries needed for EDA and visualizations
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Import pickle to save fitted models so each model only needs to be run once.
import pickle

# Import needed SKLearn libraries for modeling, imputing, and pipelines
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, plot_confusion_matrix
from xgboost import XGBClassifier

import eli5

#Import py files
#Use the functions in the py file for preprocessing
import sys
sys.path.insert(0, 'src/')
import functions
from src.functions import *

# pd.set_option('display.max_columns', 999)



### Import data

In [2]:
# Import training labels CSV
df_training_labels = pd.read_csv('data/Training_set_labels.csv')

In [3]:
df_training_labels.shape

(59400, 2)

In [4]:
df_training_labels.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [5]:
df_training_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            59400 non-null  int64 
 1   status_group  59400 non-null  object
dtypes: int64(1), object(1)
memory usage: 928.2+ KB


In [6]:
df_training_labels['status_group'].value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

#### Reviewing the above information, we do not have any nulls in this dataset, which is a good start as these are our targets. We can see that this dataset has two columns, one of which is 'id'. Hopefully we can use this to join our tables later in the EDA process.

#### The big catch here is the **class imbalance**. We will have to adjust our model to compensate for the small number of observations, especially in the 'functional needs repair' category (4,317 of 59,400 rows).
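
#### To put numbers on the imbalance, the sketch below turns the class counts shown above into proportions and into "balanced" class weights, computed as n_samples / (n_classes * class_count). The weighting formula is a common heuristic and an assumption here, not something the notebook has committed to.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({
    "functional": 32259,
    "non functional": 22824,
    "functional needs repair": 4317,
})

# Share of each class in the training labels.
proportions = counts / counts.sum()

# "Balanced" heuristic: n_samples / (n_classes * class_count).
balanced_weights = counts.sum() / (len(counts) * counts)

print(proportions.round(3))
print(balanced_weights.round(3))
```

#### The rarest class receives the largest weight, which is one way a classifier can be told to pay more attention to it.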

### Now we'll review our predictors within Training_set_values.csv

In [7]:
# Import training values CSV
df_training_values = pd.read_csv('data/Training_set_values.csv')
df_training_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [8]:
df_training_values.shape

(59400, 40)

In [9]:
df_training_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

#### Reviewing the above output, we have a few columns with null values. Going forward we will review these columns and decide whether to impute their values or drop them. We can also see an 'id' column, which should allow us to join our tables.
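
#### A minimal sketch of a guarded join on 'id', using made-up rows rather than the real files. The `validate` argument is an optional safety check added for illustration; it raises an error if an id repeats in either table.

```python
import pandas as pd

# Toy stand-ins for the values and labels tables (made-up rows).
values = pd.DataFrame({"id": [3, 1, 2], "amount_tsh": [0.0, 25.0, 6000.0]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises MergeError if an id repeats in either table.
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")

# If the inner join keeps every row, every id found its label.
all_matched = len(merged) == len(values)
print(merged)
print("all ids matched:", all_matched)
```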

#### Below we will get a clearer understanding of which columns are missing values.

In [10]:
df_training_values.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

#### Of the 39 features, 7 are missing values. A few items stand out:

* funder and installer have nearly equal numbers of missing values
* subvillage has the fewest missing values of the affected columns
* scheme_name is missing almost half of its values

#### Since scheme_name is missing almost half of its data, we will drop this column during the cleanup further down.
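
#### The counts above can be converted to shares to make the comparison explicit. The sketch below reuses the counts from the output; the 40% cutoff is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above.
n_rows = 59400
missing = pd.Series({
    "funder": 3635, "installer": 3655, "subvillage": 371,
    "public_meeting": 3334, "scheme_management": 3877,
    "scheme_name": 28166, "permit": 3056,
})

# Fraction of rows missing per column, largest first.
missing_share = (missing / n_rows).sort_values(ascending=False)

# Hypothetical cutoff: flag columns missing more than 40% of their values.
threshold = 0.40
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share.round(3))
print(to_drop)
```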

In [11]:
# df_training_values.drop(['scheme_name', 'wpt_name'], axis=1, inplace=True)

In [12]:
df = df_training_values.merge(df_training_labels, on='id')

In [13]:
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [14]:
df.drop('id', axis=1, inplace=True)

#### As seen above, 7 columns are missing data. Let's take a deeper look at these columns.

In [15]:
# Columns with missing values (scheme_name is left out because it will be dropped),
# plus the target column for reference
missing_values = ['funder', 'installer', 'subvillage', 'public_meeting',
                  'scheme_management', 'permit', 'status_group']

# Creating a dataframe with above missing_values
df[missing_values]

Unnamed: 0,funder,installer,subvillage,public_meeting,scheme_management,permit,status_group
0,Roman,Roman,Mnyusi B,True,VWC,False,functional
1,Grumeti,GRUMETI,Nyamara,,Other,True,functional
2,Lottery Club,World vision,Majengo,True,VWC,True,functional
3,Unicef,UNICEF,Mahakamani,True,VWC,True,non functional
4,Action In A,Artisan,Kyanyamisa,True,,True,functional
...,...,...,...,...,...,...,...
59395,Germany Republi,CES,Kiduruni,True,Water Board,True,functional
59396,Cefa-njombe,Cefa,Igumbilo,True,VWC,True,functional
59397,,,Madungulu,True,VWC,False,functional
59398,Malec,Musa,Mwinyi,True,VWC,True,functional


In [16]:
df[missing_values].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   funder             55765 non-null  object
 1   installer          55745 non-null  object
 2   subvillage         59029 non-null  object
 3   public_meeting     56066 non-null  object
 4   scheme_management  55523 non-null  object
 5   permit             56344 non-null  object
 6   status_group       59400 non-null  object
dtypes: object(7)
memory usage: 3.6+ MB


In [17]:
df[missing_values].isna().sum()

funder               3635
installer            3655
subvillage            371
public_meeting       3334
scheme_management    3877
permit               3056
status_group            0
dtype: int64

#### We can now see that all of these features are of dtype object, which narrows down our options for dealing with the missing values. What are these features composed of?

#### To start, let's take a look at our previously mentioned insight that funder and installer have close to the same number of missing values.
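
#### One way to check whether the two columns tend to be missing on the same rows is to cross-tabulate their missingness indicators. The rows below are made up; in the notebook the same calls would apply to df[['funder', 'installer']] before any fill step.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the funder and installer columns.
toy = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef", np.nan, "Malec"],
    "installer": ["Roman", np.nan, "UNICEF", "Artisan", np.nan],
})

# Cross-tabulate the two missingness indicators.
overlap = pd.crosstab(
    toy["funder"].isna(), toy["installer"].isna(),
    rownames=["funder missing"], colnames=["installer missing"],
)

# Rows where both values are missing at once.
both_missing = int((toy["funder"].isna() & toy["installer"].isna()).sum())

print(overlap)
print("rows missing both:", both_missing)
```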

In [18]:
df.fillna('Unknown', inplace=True)

In [19]:
drop_columns = [
    'date_recorded', 'wpt_name', 'recorded_by', 'scheme_name',
    'waterpoint_type_group', 'source_class', 'source',
    'quantity_group', 'quality_group', 'payment_type',
    'management_group', 'extraction_type', 'extraction_type_group',
]

df.drop(drop_columns, axis=1, inplace=True)

In [20]:
df['construction_year'] = df['construction_year'].replace(0, df['construction_year'].median())
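
#### The line above computes the median over all rows, including the zeros that stand for "not recorded". The sketch below, on made-up years, contrasts that with a median taken over recorded years only. Which variant is preferable is a judgment call; this is an illustration, not a correction of the notebook's results.

```python
import numpy as np
import pandas as pd

# Made-up construction years; 0 stands for "not recorded".
years = pd.Series([0, 0, 1986, 1999, 2009, 2010])

# Median over all rows, zeros included (what replace(0, median()) uses).
median_with_zeros = years.median()

# Median over recorded years only: mask the zeros as NaN first.
recorded = years.replace(0, np.nan)
median_recorded = recorded.median()

# Fill the unrecorded years with the recorded-only median.
imputed = recorded.fillna(median_recorded).astype(int)

print(median_with_zeros, median_recorded)
print(imputed.tolist())
```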

In [21]:
df

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,num_private,basin,subvillage,region,...,permit,construction_year,extraction_type_class,management,payment,water_quality,quantity,source_type,waterpoint_type,status_group
0,6000.0,Roman,1390,Roman,34.938093,-9.856322,0,Lake Nyasa,Mnyusi B,Iringa,...,False,1999,gravity,vwc,pay annually,soft,enough,spring,communal standpipe,functional
1,0.0,Grumeti,1399,GRUMETI,34.698766,-2.147466,0,Lake Victoria,Nyamara,Mara,...,True,2010,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe,functional
2,25.0,Lottery Club,686,World vision,37.460664,-3.821329,0,Pangani,Majengo,Manyara,...,True,2009,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe multiple,functional
3,0.0,Unicef,263,UNICEF,38.486161,-11.155298,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,...,True,1986,submersible,vwc,never pay,soft,dry,borehole,communal standpipe multiple,non functional
4,0.0,Action In A,0,Artisan,31.130847,-1.825359,0,Lake Victoria,Kyanyamisa,Kagera,...,True,1986,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,10.0,Germany Republi,1210,CES,37.169807,-3.253847,0,Pangani,Kiduruni,Kilimanjaro,...,True,1999,gravity,water board,pay per bucket,soft,enough,spring,communal standpipe,functional
59396,4700.0,Cefa-njombe,1212,Cefa,35.249991,-9.070629,0,Rufiji,Igumbilo,Iringa,...,True,1996,gravity,vwc,pay annually,soft,enough,river/lake,communal standpipe,functional
59397,0.0,Unknown,0,Unknown,34.017087,-8.750434,0,Rufiji,Madungulu,Mbeya,...,False,1986,handpump,vwc,pay monthly,fluoride,enough,borehole,hand pump,functional
59398,0.0,Malec,0,Musa,35.861315,-6.378573,0,Rufiji,Mwinyi,Dodoma,...,True,1986,handpump,vwc,never pay,soft,insufficient,shallow well,hand pump,functional


In [22]:
funder_mask = df['funder'].map(df['funder'].value_counts()) < 800
installer_mask = df['installer'].map(df['installer'].value_counts()) < 100
subvillage_mask = df['subvillage'].map(df['subvillage'].value_counts()) < 5
lga_mask = df['lga'].map(df['lga'].value_counts()) < 200


df['funder'] = df['funder'].mask(funder_mask, 'other')
# df['lga'] = df['lga'].mask(lga_mask, 'other')

# df['installer'] = df['installer'].mask(installer_mask, 'other')
# df['subvillage'] = df['subvillage'].mask(subvillage_mask, 'other')
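
#### The masking pattern in the cell above groups infrequent categories under a single label. A small sketch on made-up funder names shows the mechanics; the cutoff of 2 is hypothetical (the notebook uses 800 for funder).

```python
import pandas as pd

# Made-up funder names.
funder = pd.Series(["Govt", "Govt", "Govt", "Unicef", "Unicef", "Roman", "Malec"])

min_count = 2  # hypothetical cutoff for this toy example

# Map each row to the frequency of its own category, then mask the rare ones.
freq = funder.map(funder.value_counts())
bucketed = funder.mask(freq < min_count, "other")

print(bucketed.value_counts())
```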

## Feature Selection

In [23]:
# temp_df = df[(df['longitude'] != 0) & (df['latitude'] != 0)]

In [24]:
# plt.figure(figsize=(15,15))
# plt.scatter(x=temp_df['longitude'], y=temp_df['latitude'])
# plt.plot

In [25]:
# step = 0.5
# to_bin = lambda x: '%0.3f' % (np.floor(x / step) * step)
# df["lat_long_bin"] = df['latitude'].map(to_bin) + 'x' + \
#               df['longitude'].map(to_bin)

In [26]:
# df['lat_long_bin'].unique()

In [27]:
# objects_ = ['region_code', 'district_code', 'construction_year']
df[['public_meeting', 'permit']] = df[['public_meeting','permit']].astype(bool)

# df[objects_] = df[objects_].astype('object')
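
#### One caveat with the boolean cast above: the missing values were earlier filled with the string 'Unknown', and any non-empty string converts to True. The sketch below demonstrates this on a made-up column. The string-based alternative is only one possible option and is an assumption, not the notebook's approach.

```python
import pandas as pd

# Made-up column: real booleans plus the 'Unknown' placeholder.
public_meeting = pd.Series([True, False, "Unknown"], dtype=object)

# Any non-empty string is truthy, so 'Unknown' silently becomes True.
as_bool = public_meeting.astype(bool)

# Alternative (an assumption): keep three explicit categories as strings.
as_category = public_meeting.astype(str)

print(as_bool.tolist())
print(as_category.tolist())
```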

In [28]:
# for col in df.select_dtypes(['int', 'float']):
#     # if col != 'longitude' or col != 'latitude':
#     df[f'{col}_log'] = (df[col] - df[col].min()+1).transform(np.log)

In [29]:
# drop = ['amount_tsh', 'gps_height', 'num_private',
#         'population', 'longitude_log', 'latitude_log']
# df.drop(drop, axis=1, inplace=True)

In [30]:
corr_df = df.corr()

df_corr=corr_df.abs().stack().reset_index().sort_values(0, ascending=False)
df_corr['pairs'] = list(zip(df_corr.level_0, df_corr.level_1))
df_corr.set_index(['pairs'], inplace = True)
df_corr.drop(columns=['level_1', 'level_0'], inplace = True)
df_corr.columns = ['cc']
df_corr.drop_duplicates(inplace=True)
df_corr = df_corr[df_corr['cc'] < 1.0000]
df_corr.head(10)

Unnamed: 0_level_0,cc
pairs,Unnamed: 1_level_1
"(region_code, district_code)",0.678602
"(latitude, longitude)",0.425802
"(gps_height, construction_year)",0.296245
"(latitude, region_code)",0.221018
"(district_code, latitude)",0.20102
"(construction_year, longitude)",0.188632
"(gps_height, region_code)",0.183521
"(gps_height, district_code)",0.171233
"(district_code, longitude)",0.151398
"(gps_height, longitude)",0.149155


In [31]:
drop_cols = ['longitude', 'population', 'permit', 'region_code', 'gps_height', 'latitude']

df.drop(drop_cols, axis=1, inplace=True)
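
#### The correlation ranking used in this section can also be built by keeping only the upper triangle of the correlation matrix, so each pair appears exactly once without relying on drop_duplicates (which compares values and could merge two different pairs that happen to share the same coefficient). This is a sketch on synthetic data with made-up column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # strongly related to a
    "c": rng.normal(size=200),                 # unrelated noise
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs)
```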

In [32]:
corr_df = df.corr()

df_corr=corr_df.abs().stack().reset_index().sort_values(0, ascending=False)
df_corr['pairs'] = list(zip(df_corr.level_0, df_corr.level_1))
df_corr.set_index(['pairs'], inplace = True)
df_corr.drop(columns=['level_1', 'level_0'], inplace = True)
df_corr.columns = ['cc']
df_corr.drop_duplicates(inplace=True)
df_corr = df_corr[df_corr['cc'] < 1.0000]
df_corr.head(10)

Unnamed: 0_level_0,cc
pairs,Unnamed: 1_level_1
"(public_meeting, construction_year)",0.038857
"(construction_year, amount_tsh)",0.036297
"(construction_year, district_code)",0.027986
"(district_code, amount_tsh)",0.023599
"(public_meeting, amount_tsh)",0.015798
"(district_code, public_meeting)",0.012133
"(num_private, construction_year)",0.009013
"(public_meeting, num_private)",0.008618
"(district_code, num_private)",0.004478
"(amount_tsh, num_private)",0.002944


In [33]:
# X = df.drop(['status_group'], axis=1)
# y = df['status_group']
#
# X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)
#
# numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
# cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()
#
# numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])
#
# cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')
#
# preprocessor = ColumnTransformer(transformers=[
#     ('num', numeric_transformer, numeric_features),
#     ('cat', cat_transformer, cat_features)])
#
# rf = RandomForestClassifier(random_state=42)
#
# rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
#                          ('feature_selection', SequentialFeatureSelector(rf, n_features_to_select='auto', tol=0.1, direction='forward', cv=5, n_jobs=-1)),
#                          ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))])
#
# rf_clf.fit(X_train, y_train)
#
# y_hat = rf_clf.predict(X_test)
#
# print('Accuracy:', accuracy_score(y_test, y_hat))

# sfs = SequentialFeatureSelector(rf, n_features_to_select='auto', tol=0.1, direction='forward', cv=5, n_jobs=-1)
# sfs.fit(X_train,y_train)

In [34]:
df

Unnamed: 0,amount_tsh,funder,installer,num_private,basin,subvillage,region,district_code,lga,ward,...,scheme_management,construction_year,extraction_type_class,management,payment,water_quality,quantity,source_type,waterpoint_type,status_group
0,6000.0,other,Roman,0,Lake Nyasa,Mnyusi B,Iringa,5,Ludewa,Mundindi,...,VWC,1999,gravity,vwc,pay annually,soft,enough,spring,communal standpipe,functional
1,0.0,other,GRUMETI,0,Lake Victoria,Nyamara,Mara,2,Serengeti,Natta,...,Other,2010,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe,functional
2,25.0,other,World vision,0,Pangani,Majengo,Manyara,4,Simanjiro,Ngorika,...,VWC,2009,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe multiple,functional
3,0.0,Unicef,UNICEF,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,63,Nanyumbu,Nanyumbu,...,VWC,1986,submersible,vwc,never pay,soft,dry,borehole,communal standpipe multiple,non functional
4,0.0,other,Artisan,0,Lake Victoria,Kyanyamisa,Kagera,1,Karagwe,Nyakasimbi,...,Unknown,1986,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,10.0,other,CES,0,Pangani,Kiduruni,Kilimanjaro,5,Hai,Masama Magharibi,...,Water Board,1999,gravity,water board,pay per bucket,soft,enough,spring,communal standpipe,functional
59396,4700.0,other,Cefa,0,Rufiji,Igumbilo,Iringa,4,Njombe,Ikondo,...,VWC,1996,gravity,vwc,pay annually,soft,enough,river/lake,communal standpipe,functional
59397,0.0,Unknown,Unknown,0,Rufiji,Madungulu,Mbeya,7,Mbarali,Chimala,...,VWC,1986,handpump,vwc,pay monthly,fluoride,enough,borehole,hand pump,functional
59398,0.0,other,Musa,0,Rufiji,Mwinyi,Dodoma,4,Chamwino,Mvumi Makulu,...,VWC,1986,handpump,vwc,never pay,soft,insufficient,shallow well,hand pump,functional


In [35]:
from sklearn.preprocessing import LabelEncoder
X = df.drop(['status_group'], axis=1)
y = df['status_group']

# le = LabelEncoder()
# y = le.fit_transform(y)

# ohe = OneHotEncoder(drop='first', handle_unknown='ignore')
# ohe_ = ohe.fit_transform(X.select_dtypes(['object', 'bool'])).toarray()
# ohe_ = pd.DataFrame(ohe_, columns=ohe.get_feature_names_out())
# X = pd.concat([X.select_dtypes(['int', 'float']), ohe_], axis=1)

In [36]:
idx_val = []

for idx, col in enumerate(X.select_dtypes(['object', 'bool']).columns.to_list()):
    for idx_, col_ in enumerate(X.columns.to_list()):
        if col == col_:
            idx_val.append(idx_)

In [37]:
idx_val

[1, 2, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19]
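
#### The positional indices of the categorical columns can also be collected without the nested loop by asking the column index for each position directly. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with two numeric and two categorical/boolean columns.
X_toy = pd.DataFrame({
    "amount_tsh": [0.0, 25.0],
    "funder": pd.Series(["Roman", "Unicef"], dtype=object),
    "num_private": [0, 0],
    "public_meeting": [True, False],
})

cat_cols = X_toy.select_dtypes(["object", "bool"]).columns

# get_loc returns the integer position of a column label.
idx_val_toy = [X_toy.columns.get_loc(col) for col in cat_cols]

print(idx_val_toy)
```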

In [38]:
X

Unnamed: 0,amount_tsh,funder,installer,num_private,basin,subvillage,region,district_code,lga,ward,public_meeting,scheme_management,construction_year,extraction_type_class,management,payment,water_quality,quantity,source_type,waterpoint_type
0,6000.0,other,Roman,0,Lake Nyasa,Mnyusi B,Iringa,5,Ludewa,Mundindi,True,VWC,1999,gravity,vwc,pay annually,soft,enough,spring,communal standpipe
1,0.0,other,GRUMETI,0,Lake Victoria,Nyamara,Mara,2,Serengeti,Natta,True,Other,2010,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe
2,25.0,other,World vision,0,Pangani,Majengo,Manyara,4,Simanjiro,Ngorika,True,VWC,2009,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe multiple
3,0.0,Unicef,UNICEF,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,63,Nanyumbu,Nanyumbu,True,VWC,1986,submersible,vwc,never pay,soft,dry,borehole,communal standpipe multiple
4,0.0,other,Artisan,0,Lake Victoria,Kyanyamisa,Kagera,1,Karagwe,Nyakasimbi,True,Unknown,1986,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,10.0,other,CES,0,Pangani,Kiduruni,Kilimanjaro,5,Hai,Masama Magharibi,True,Water Board,1999,gravity,water board,pay per bucket,soft,enough,spring,communal standpipe
59396,4700.0,other,Cefa,0,Rufiji,Igumbilo,Iringa,4,Njombe,Ikondo,True,VWC,1996,gravity,vwc,pay annually,soft,enough,river/lake,communal standpipe
59397,0.0,Unknown,Unknown,0,Rufiji,Madungulu,Mbeya,7,Mbarali,Chimala,True,VWC,1986,handpump,vwc,pay monthly,fluoride,enough,borehole,hand pump
59398,0.0,other,Musa,0,Rufiji,Mwinyi,Dodoma,4,Chamwino,Mvumi Makulu,True,VWC,1986,handpump,vwc,never pay,soft,insufficient,shallow well,hand pump


In [None]:
# from imblearn.over_sampling import SMOTENC
#
# X_smote, y_smote = SMOTENC(categorical_features=idx_val,n_jobs=-1).fit_resample(X,y)
# # X_adasyn, y_adasyn = ADASYN(n_jobs=-1).fit_resample(X,y)

In [45]:
# smote_data = pd.concat([X_smote, y_smote], axis=1)
# smote_data.to_csv('data/smote_data.csv')


In [None]:
smote_data = pd.read_csv('data/smote_data.csv')

In [None]:

# X_train, X_test, y_train, y_test = train_test_split(X_smote,y_smote, random_state=42)
# numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
# cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()
#
# numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])
#
# cat_transformer = OneHotEncoder(drop='first', handle_unknown='infrequent_if_exist')
#
# preprocessor = ColumnTransformer(transformers=[
#     ('num', numeric_transformer, numeric_features),
#     ('cat', cat_transformer, cat_features)])
#
# rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
#                          ('classifier', RandomForestClassifier(random_state=42))])
#
# rf_clf.fit(X_train, y_train)
# y_hat = rf_clf.predict(X_test)
#
# print('Accuracy:', accuracy_score(y_test, y_hat))

print('Accuracy: 0.8661706964248812')

In [None]:
# from sklearn.model_selection import cross_val_score
#
# scores = cross_val_score(rf_clf, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)

print('[0.85837294, 0.8554109 , 0.85781207, 0.85801874, 0.85788096]')
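
#### The five fold scores are easier to compare with the hold-out accuracy once summarized. The sketch below reuses the scores printed above.

```python
import numpy as np

# Fold accuracies copied from the output above.
scores = np.array([0.85837294, 0.8554109, 0.85781207, 0.85801874, 0.85788096])

mean_score = scores.mean()
std_score = scores.std(ddof=1)  # sample standard deviation across folds

print(f"mean accuracy: {mean_score:.4f}")
print(f"std across folds: {std_score:.4f}")
```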

In [None]:
numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='infrequent_if_exist')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

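param_grid = {
    # n_estimators must be >= 1, so the range starts at 100 rather than 0
    'classifier__n_estimators': range(100,5000,100),
    'classifier__max_depth': [None, 500, 1000, 1500, 2500, 5000],
    # min_samples_split must be >= 2 when given as an integer
    'classifier__min_samples_split': range(2,50,2),
    'classifier__min_samples_leaf': range(1,50,2),
    'classifier__max_features': ['sqrt','log2', None],
    'classifier__bootstrap': [True, False],
    # NOTE: oob_score=True is only valid with bootstrap=True; sampled
    # combinations that pair it with bootstrap=False will fail to fit
    'classifier__oob_score': [True, False],
    'classifier__ccp_alpha': np.linspace(0,1,9),
    # 'classifier__class_weight': ['balanced', 'balanced_subsample', None]
}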

grid_search_rf = RandomizedSearchCV(clf,param_grid, n_iter=10, n_jobs=-1, scoring='accuracy')

grid_search_rf.fit(X_train, y_train)

y_hat = grid_search_rf.predict(X_test)

print('Accuracy Score:', accuracy_score(y_test, y_hat))
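
#### Random forest settings have a few constraints that a random search can violate: n_estimators must be at least 1, an integer min_samples_split must be at least 2, and oob_score=True requires bootstrap=True. One way to encode the last constraint is to pass a list of dictionaries, which RandomizedSearchCV also accepts as param_distributions. The ranges below are hypothetical and only illustrate the structure.

```python
from sklearn.model_selection import ParameterSampler

# Settings shared by both branches of the search space (hypothetical ranges).
shared = {
    "classifier__n_estimators": list(range(100, 1001, 100)),  # must be >= 1
    "classifier__min_samples_split": list(range(2, 51, 4)),   # must be >= 2
    "classifier__min_samples_leaf": list(range(1, 50, 4)),
    "classifier__max_features": ["sqrt", "log2", None],
}

# oob_score=True is only valid together with bootstrap=True, so the two
# settings are coupled by splitting the space into two dictionaries.
param_distributions = [
    {**shared, "classifier__bootstrap": [True], "classifier__oob_score": [True, False]},
    {**shared, "classifier__bootstrap": [False], "classifier__oob_score": [False]},
]

# Draw candidates the same way a randomized search would.
samples = list(ParameterSampler(param_distributions, n_iter=20, random_state=42))
print(samples[0])
```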

In [None]:
submission = pd.read_csv('data/Test_set.csv')
submission_ = pd.DataFrame(submission['id'])
submission = submission[X_train.columns.to_list()]
submission['public_meeting'] = submission['public_meeting'].astype('bool')

In [None]:
pred = rf_clf.predict(submission)

In [None]:
final_submission = pd.concat([submission_, pd.DataFrame(pred)], axis=1)

In [None]:
final_submission.rename(columns={0:'status_group'}, inplace=True)

In [None]:
final_submission

In [None]:
keys = submission_.id
final_submission.set_index('id', inplace=True)
final_submission = final_submission.reindex(list(keys))

In [None]:
final_submission.reset_index(inplace=True)

In [None]:

# id = pd.DataFrame(submission.index)
# id
# final_sub = pd.concat([id, submission_pred], axis = 1)
# final_sub.set_index('id', inplace=True)
# # final_sub.reset_index(inplace=True, drop=True)
final_submission.to_csv('final_submit_1.csv', index=False)

In [None]:
final_submission

In [None]:
X = df.drop(['status_group', 'num_private'], axis=1)
y = df['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)
numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))])

rf_clf.fit(X_train, y_train)

y_hat = rf_clf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_hat))
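
#### The stratify=y argument used in the split above keeps the class proportions the same in the training and test sets, which matters for the small 'functional needs repair' class. A sketch on made-up labels with a similar imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with roughly the imbalance seen in this dataset.
y_toy = np.array(
    ["functional"] * 540 + ["non functional"] * 380 + ["functional needs repair"] * 80
)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

def share(labels, name):
    """Fraction of labels equal to the given class name."""
    return float(np.mean(labels == name))

minority = "functional needs repair"
print("train share:", share(y_tr, minority))
print("test share:", share(y_te, minority))
```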

In [None]:
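# get_feature_names_out replaces the deprecated get_feature_names
feat_import_cat_names = list(rf_clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(cat_features))
# The ColumnTransformer outputs the 'num' columns first, then the 'cat' columns,
# so the names must be listed in that same order
feat_import = numeric_features + feat_import_cat_names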

eli5.explain_weights(rf_clf.named_steps['classifier'], top=50, feature_names=feat_import, feature_filter=lambda x: x != '<BIAS>')

In [None]:
X = df.select_dtypes('object').drop(['status_group'], axis=1)
y = df['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)

numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

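# LogisticRegression is not imported at the top of the notebook, so import it here
from sklearn.linear_model import LogisticRegression

# Logistic regression on the categorical features
lr_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', LogisticRegression(random_state=42))])

lr_clf.fit(X_train, y_train)

y_hat = lr_clf.predict(X_test)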

print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
X = df.select_dtypes('object').drop(['status_group'], axis=1)
y = df['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)

numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier(random_state=42))])

xgb_clf.fit(X_train, y_train)

y_hat = xgb_clf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier(random_state=42))])

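# Search space for the random forest. n_estimators must be >= 1 and
# min_samples_split must be >= 2, so both ranges start at valid values.
param_grid = {
    'classifier__n_estimators': range(100, 5000, 100),
    'classifier__max_depth': [None],
    'classifier__min_samples_split': range(2, 50, 2),
    'classifier__min_samples_leaf': range(1, 50, 2),
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__bootstrap': [True, False],
    # oob_score stays at its default: it requires bootstrap=True and does not change predictions
    'classifier__ccp_alpha': np.linspace(0, 1, 9),
    'classifier__class_weight': ['balanced', 'balanced_subsample', None]
}

# Randomized search: samples n_iter=25 candidates from the space above
grid_search_rf = RandomizedSearchCV(clf, param_grid, n_iter=25, n_jobs=-1,
                                    scoring='accuracy', random_state=42)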

grid_search_rf.fit(X_train, y_train)

In [None]:
grid_search_rf.best_params_

In [None]:
y_hat = grid_search_rf.predict(X_test)

accuracy_score(y_test, y_hat)
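
Accuracy alone can hide weak performance on the smallest class (here, wells that are functional but need repair). The sketch below shows one way to look at per-class results with a confusion matrix and a classification report. It runs on synthetic three-class data from `make_classification`, not on the wells dataset; the class weights and all `demo`-style names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data standing in for the well statuses
# (assumed class proportions, chosen only for illustration)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_classes=3,
    weights=[0.55, 0.38, 0.07], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo)

demo_clf = RandomForestClassifier(n_estimators=100, random_state=42)
demo_clf.fit(X_tr, y_tr)
y_pred = demo_clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each class, returned as a nested dict
report = classification_report(y_te, y_pred, output_dict=True, zero_division=0)
for label in ['0', '1', '2']:
    print('class', label, 'recall:', round(report[label]['recall'], 3))
```

On the real data the same two calls could be applied to `y_test` and `y_hat` from the cell above.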

In [None]:
submission_ = pd.read_csv('data/Test_set.csv')
keys = submission_.id

# Put the prepared submission features back in the original id order
submission.set_index('id', inplace=True)
submission = submission.reindex(list(keys))

submission_pred = num_clf.predict(submission)

# One row per id, with the predicted status_group as the only column
final_sub = pd.DataFrame({'status_group': submission_pred}, index=submission.index)
final_sub = final_sub.rename_axis('id')

# Write the index as well, otherwise the id column is missing from the CSV
final_sub.to_csv('new_final_submit.csv')

In [None]:
final_sub

In [None]:
# feat_import_cat_names = list(num_clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names(input_features = cat_features))
feat_import = numeric_features

eli5.explain_weights(num_clf.named_steps['classifier'], top=50, feature_names=feat_import, feature_filter=lambda x: x != '<BIAS>')

In [None]:
# low_weight = ['population', 'amount_tsh', 'district_code', 'region_code', 'num_private']
#
# X = df.drop(['status_group','district_code', 'region_code', 'num_private'], axis=1)
# y = df['status_group']
#
# X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)
#
# numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
# cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()
#
# numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])
#
# cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')
#
# preprocessor = ColumnTransformer(transformers=[
#     ('num', numeric_transformer, numeric_features)])
# # ('cat', cat_transformer, cat_features)])
#
# num_clf = Pipeline(steps=[('preprocessor', preprocessor),
#                           ('classifier', RandomForestClassifier(random_state=42))])
#
# num_clf.fit(X_train, y_train)
#
# y_hat = num_clf.predict(X_test)
#
# print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    # ('num', numeric_transformer, numeric_features)])
    ('cat', cat_transformer, cat_features)])

cat_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

cat_clf.fit(X_train, y_train)

y_hat = cat_clf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (1.0+) replaces it
feat_import_cat_names = list(cat_clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(cat_features))
feat_import = feat_import_cat_names

eli5.explain_weights(cat_clf.named_steps['classifier'], top=50, feature_names=feat_import, feature_filter=lambda x: x != '<BIAS>')

In [None]:
numeric_features = X_train.select_dtypes(['int64', 'float64']).columns.to_list()
cat_features = X_train.select_dtypes(['object', 'bool']).columns.to_list()

numeric_transformer = Pipeline(steps=[('scaler',  StandardScaler())])

cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, cat_features)])

cat_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier(random_state=42))])

cat_clf.fit(X_train, y_train)

y_hat = cat_clf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_hat))

In [None]:
feat_import_cat_names = list(cat_clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(cat_features))
# ColumnTransformer outputs the 'num' columns first, then the 'cat' columns
feat_import = numeric_features + feat_import_cat_names

eli5.explain_weights(cat_clf.named_steps['classifier'], top=50, feature_names=feat_import, feature_filter=lambda x: x != '<BIAS>')

In [None]:
df['funder'].value_counts()