# IS4242 Group Assignment Part 1
**November 11, 2020**

## Changelog from submission 2 to submission 3
    - Changed from NN to Random Forest
Random Forest is known to perform well out of the box on classification tasks. It has low bias and moderate variance, and it is relatively easy to tune because it has few hyperparameters. Adding more trees does not by itself make the forest overfit, although individual trees can still overfit if they are grown too deep. It also lets us inspect feature importances.
    https://builtin.com/data-science/random-forest-algorithm

#### Name: LECK WEI SHENG IAN
#### NUS ID: A0168177R
#### Name: WOO KENG THONG
#### NUS ID: A0167991L

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided with information about the waterpoints in order to label them.

The labels in this dataset are simple. There are three possible values:
1. functional - the waterpoint is operational and there are no repairs needed
2. functional needs repair - the waterpoint is operational, but needs repairs
3. non functional - the waterpoint is not operational

The format for the submission file is simply the row id and the predicted label. 
- id	status_group
- 50785	functional
- 51630	functional

CSV would thus look like
- id,status_group
- 50785,functional
- 51630,functional
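
A minimal sketch of producing a file in this format with pandas. The `predictions` frame is a hypothetical stand-in for real model output, and the text is written to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd

# Hypothetical predictions keyed by waterpoint id (ids taken from the example rows above).
predictions = pd.DataFrame(
    {"id": [50785, 51630], "status_group": ["functional", "functional"]}
)

# index=False keeps the pandas row index out of the file, leaving only id and label.
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing a file path to `to_csv` instead of the buffer would write the submission file directly.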

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Preprocessing & Feature Engineering

Load data, and merge data and labels together into one dataframe

In [2]:
labels = pd.read_csv('training-set-labels.csv')
df = pd.read_csv('training-set-values.csv')
test_df = pd.read_csv('test-set-values.csv')

df = pd.merge(df, labels, on='id')

Explore data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Check similar predictors and eliminate highly correlated predictors to reduce redundancy and reduce multicollinearity

In [4]:
df.groupby(['region','region_code']).size()

region         region_code
Arusha         2              3024
               24              326
Dar es Salaam  7               805
Dodoma         1              2201
Iringa         11             5294
Kagera         18             3316
Kigoma         16             2816
Kilimanjaro    3              4379
Lindi          8               300
               18                8
               80             1238
Manyara        21             1583
Mara           20             1969
Mbeya          12             4639
Morogoro       5              4006
Mtwara         9               390
               90              917
               99              423
Mwanza         17               55
               19             3047
Pwani          6              1609
               40                1
               60             1025
Rukwa          15             1808
Ruvuma         10             2640
Shinyanga      11                6
               14               20
               17           

Drop `region_code`: it appears to identify regions, but it cannot stand on its own because the same code occurs in different regions (for example, code 18 appears under both Kagera and Lindi). The `drop` list records every dropped column so that the same columns can later be dropped from the test values.

In [5]:
drop = []
drop.append('region_code')

df.drop('region_code',axis=1,inplace=True)
drop

['region_code']
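
The ambiguity can also be checked programmatically rather than by reading the grouped output. This is a minimal sketch on a toy frame (a handful of region/code pairs copied from the output above), not the full dataset.

```python
import pandas as pd

# Toy subset of (region, region_code) pairs seen in the groupby output above.
toy = pd.DataFrame({
    "region": ["Arusha", "Arusha", "Lindi", "Lindi", "Kagera", "Shinyanga", "Mwanza"],
    "region_code": [2, 24, 8, 18, 18, 17, 17],
})

# Count how many distinct regions share each code; more than one means the code is ambiguous.
regions_per_code = toy.groupby("region_code")["region"].nunique()
ambiguous_codes = regions_per_code[regions_per_code > 1].index.tolist()
print(ambiguous_codes)
```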

In [6]:
df.describe()

| | id | amount_tsh | gps_height | longitude | latitude | num_private | district_code | population | construction_year |
|---|---|---|---|---|---|---|---|---|---|
| count | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 |
| mean | 37115.131768 | 317.650385 | 668.297239 | 34.077427 | -5.706033 | 0.474141 | 5.629747 | 179.909983 | 1300.652475 |
| std | 21453.128371 | 2997.574558 | 693.11635 | 6.567432 | 2.946019 | 12.23623 | 9.633649 | 471.482176 | 951.620547 |
| min | 0.0 | 0.0 | -90.0 | 0.0 | -11.64944 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18519.75 | 0.0 | 0.0 | 33.090347 | -8.540621 | 0.0 | 2.0 | 0.0 | 0.0 |
| 50% | 37061.5 | 0.0 | 369.0 | 34.908743 | -5.021597 | 0.0 | 3.0 | 25.0 | 1986.0 |
| 75% | 55656.5 | 20.0 | 1319.25 | 37.178387 | -3.326156 | 0.0 | 5.0 | 215.0 | 2004.0 |
| max | 74247.0 | 350000.0 | 2770.0 | 40.345193 | -2e-08 | 1776.0 | 80.0 | 30500.0 | 2013.0 |


Remove `num_private` as it does not seem to be meaningful: its 25th, 50th and 75th percentiles are all 0, so at least three quarters of its values are zero.

In [7]:
drop.append('num_private')

df.drop('num_private',axis=1,inplace=True)
drop

['region_code', 'num_private']
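
The share of zeros can be measured directly instead of inferred from the quartiles. A small sketch on made-up values (the `toy` series is illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative column that is mostly zeros, like num_private.
toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 3, 0, 12], name="num_private")

# Fraction of rows equal to zero, plus the same quartiles that describe() reports.
zero_share = (toy == 0).mean()
quartiles = toy.quantile([0.25, 0.5, 0.75])
print(zero_share)
print(quartiles)
```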

Check for null values in data

In [8]:
df.isnull().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
subvillage                 371
region                       0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity

Deal with columns containing null values.
- #### funder

In [9]:
df['funder'].value_counts()

Government Of Tanzania    9084
Danida                    3114
Hesawa                    2202
Rwssp                     1374
World Bank                1349
                          ... 
Unhcr/government             1
Mbozi Secondary School       1
Rilayo Water Project         1
Nmdc India                   1
Japan Food Aid               1
Name: funder, Length: 1897, dtype: int64

Keep the top 5 `funder` values and set the rest, including missing values, to `other`.<br>
Collapsing the long tail of the 1897 distinct funders into `other` reduces computational cost and mitigates overfitting to rare categories.

In [10]:
def update_funder(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='Rwssp':
        return 'rwssp'
    elif row['funder']=='World Bank':
        return 'bank'
    else:
        return 'other'

df['funder'] = df.apply(lambda row: update_funder(row), axis=1)

df.groupby(['funder','status_group']).size()

funder  status_group           
bank    functional                   545
        functional needs repair       97
        non functional               707
danida  functional                  1713
        functional needs repair      159
        non functional              1242
gov     functional                  3720
        functional needs repair      701
        non functional              4663
hesawa  functional                   936
        functional needs repair      232
        non functional              1034
other   functional                 24540
        functional needs repair     3019
        non functional             14718
rwssp   functional                   805
        functional needs repair      109
        non functional               460
dtype: int64
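
The same top-5 grouping can be written without hard-coding the category names. The helper below is a sketch under that assumption: `keep_top_n` is a hypothetical name, and unlike the cell above it keeps the original spelling of the retained values instead of renaming them.

```python
import pandas as pd

def keep_top_n(series, n=5, other="other"):
    """Keep the n most frequent values; map everything else, including NaN, to `other`."""
    # value_counts() ignores NaN, so missing values can never be among the top n.
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)

toy = pd.Series(["A", "A", "A", "B", "B", "C", None, "D"])
grouped = keep_top_n(toy, n=2)
print(grouped.tolist())
```

Computing `top` once on the training data and reusing it for the test data would keep both frames on the same category set.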

Deal with columns containing null values.
- #### installer

In [11]:
df.installer.value_counts()

DWE                  17402
Government            1825
RWE                   1206
Commu                 1060
DANIDA                1050
                     ...  
Secondary school         1
Samweli Kitana           1
Centra Government        1
Muhochi Kissaka          1
KOYI                     1
Name: installer, Length: 2145, dtype: int64

Keep the top 5 `installer` values and set the rest, including missing values, to `other`.<br>
Collapsing the long tail of the 2145 distinct installers into `other` reduces computational cost and mitigates overfitting to rare categories.

In [12]:
def update_installer(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

df['installer'] = df.apply(lambda row: update_installer(row), axis=1)

df.groupby(['installer','status_group']).size()

installer  status_group           
commu      functional                   724
           functional needs repair       32
           non functional               304
danida     functional                   542
           functional needs repair       83
           non functional               425
dwe        functional                  9433
           functional needs repair     1622
           non functional              6347
gov        functional                   535
           functional needs repair      256
           non functional              1034
other      functional                 20721
           functional needs repair     2187
           non functional             13949
rwe        functional                   304
           functional needs repair      137
           non functional               765
dtype: int64

Deal with columns containing null values.
- #### subvillage

In [13]:
df.subvillage.value_counts()

Madukani      508
Shuleni       506
Majengo       502
Kati          373
Mtakuja       262
             ... 
Mponda          1
Sanungu B       1
Nhunguruma      1
Mgomo           1
Matumika        1
Name: subvillage, Length: 19287, dtype: int64

There are 19287 unique `subvillage` values, and the largest group has only 508 rows. As the dataset has 59400 rows, the number of distinct values is about a third of the number of rows. Such a high-cardinality column is unlikely to be a meaningful feature, so it will be dropped.

In [14]:
drop.append('subvillage')
df.drop('subvillage',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage']
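
The reasoning above amounts to a few ratios: 19287 / 59400 is roughly 0.32 distinct values per row, and 508 / 59400 is under 1% for the largest group. A sketch of a helper that reports these for any column follows; `cardinality_summary` is a hypothetical name and the data is a toy series.

```python
import pandas as pd

def cardinality_summary(series):
    """Summarise how fragmented a categorical column is."""
    n_rows = len(series)
    counts = series.value_counts()  # NaN is excluded from the counts
    return {
        "n_unique": int(series.nunique()),
        "unique_ratio": series.nunique() / n_rows,
        "largest_group_share": counts.iloc[0] / n_rows,
        "missing_share": series.isna().mean(),
    }

toy = pd.Series(["x", "y", "z", "x", None, "w", "v", "u"])
summary = cardinality_summary(toy)
print(summary)
```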

Deal with columns containing null values.
- #### public_meeting

In [15]:
df.public_meeting.value_counts()

True     51011
False     5055
Name: public_meeting, dtype: int64

In [16]:
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
False           functional                  2173
                functional needs repair      442
                non functional              2440
True            functional                 28408
                functional needs repair     3719
                non functional             18884
dtype: int64

Convert `public_meeting` to binary predictor and impute with mode.

In [17]:
def convert_public_meeting(row):
    if row['public_meeting']==True:
        return 1
    elif row['public_meeting']==False:
        return 0
    else:
        return np.nan
    
df['public_meeting'] = df.apply(lambda row: convert_public_meeting(row), axis=1)
df['public_meeting'].fillna(df['public_meeting'].mode().item(),inplace=True)
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
0.0             functional                  2173
                functional needs repair      442
                non functional              2440
1.0             functional                 30086
                functional needs repair     3875
                non functional             20384
dtype: int64
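
The conversion and mode imputation can also be done column-wise instead of row by row. This is a sketch on a toy series, assuming the column holds `True`, `False` and missing values as in the output above.

```python
import numpy as np
import pandas as pd

# Toy boolean column with one missing entry.
toy = pd.Series([True, False, np.nan, True, True], dtype="object")

# map() sends True/False to 1/0; values not in the mapping (here NaN) stay NaN.
as_binary = toy.map({True: 1, False: 0})

# Fill the remaining NaN with the most frequent value.
filled = as_binary.fillna(as_binary.mode().iloc[0])
print(filled.tolist())
```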

Deal with columns containing null values.
- #### scheme_management

In [18]:
df.scheme_management.value_counts()

VWC                 36793
WUG                  5206
Water authority      3153
WUA                  2883
Water Board          2748
Parastatal           1680
Private operator     1063
Company              1061
Other                 766
SWC                    97
Trust                  72
None                    1
Name: scheme_management, dtype: int64

Keep the top 5 `scheme_management` values and set the rest, including missing values, to `other`.<br>
Collapsing the smaller categories into `other` reduces computational cost and mitigates overfitting to rare categories.

In [19]:
def update_scheme_management(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='WUA':
        return 'wua'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    else:
        return 'other'  
    
df['scheme_management'] = df.apply(lambda row: update_scheme_management(row), axis=1)
df.groupby(['scheme_management','status_group']).size()

scheme_management  status_group           
other              functional                  4627
                   functional needs repair      513
                   non functional              3477
vwc                functional                 18960
                   functional needs repair     2334
                   non functional             15499
water_auth         functional                  1618
                   functional needs repair      448
                   non functional              1087
water_bd           functional                  2053
                   functional needs repair      111
                   non functional               584
wua                functional                  1995
                   functional needs repair      239
                   non functional               649
wug                functional                  3006
                   functional needs repair      672
                   non functional              1528
dtype: int64

Deal with columns containing null values.
- #### scheme_name

In [20]:
df.scheme_name.value_counts()

K                         682
None                      644
Borehole                  546
Chalinze wate             405
M                         400
                         ... 
Lerang'wa waterbsupply      1
Shirimatunda                1
Makondeko line              1
Kikole water supply         1
Community                   1
Name: scheme_name, Length: 2696, dtype: int64

There are 2696 unique `scheme_name` values, and the largest group has only 682 rows. Additionally, 28166 of the 59400 rows (nearly half) are null in this column. With so many missing values and such a fragmented remainder, it is unlikely to be a meaningful feature, so it will be dropped.

In [21]:
drop.append('scheme_name')

df.drop('scheme_name',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage', 'scheme_name']

Deal with columns containing null values.
- #### permit

In [22]:
df.permit.value_counts()

True     38852
False    17492
Name: permit, dtype: int64

In [23]:
df.groupby(['permit','status_group']).size()

permit  status_group           
False   functional                  9045
        functional needs repair     1320
        non functional              7127
True    functional                 21541
        functional needs repair     2697
        non functional             14614
dtype: int64

Convert `permit` to binary predictor and impute with mode.

In [24]:
def convert_permit(row):
    if row['permit']==True:
        return 1
    elif row['permit']==False:
        return 0
    else:
        return np.nan
    
df['permit'] = df.apply(lambda row: convert_permit(row), axis=1)
df['permit'].fillna(df['permit'].mode().item(),inplace=True)
df.groupby(['permit','status_group']).size()

permit  status_group           
0.0     functional                  9045
        functional needs repair     1320
        non functional              7127
1.0     functional                 23214
        functional needs repair     2997
        non functional             15697
dtype: int64

With all nulls handled, check for other invalid data that is immediately obvious, namely zeros used as placeholders, in relevant columns such as `population`, `gps_height`, `amount_tsh` and `construction_year`. These zeros are converted to NaN so that they can be imputed.

In [25]:
df['gps_height'].replace(0, np.nan, inplace=True)
df['population'].replace(0, np.nan, inplace=True)
df['amount_tsh'].replace(0, np.nan, inplace=True)
df['construction_year'].replace(0, np.nan, inplace=True)
df.isnull().sum()

id                           0
amount_tsh               41639
date_recorded                0
funder                       0
gps_height               20438
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
region                       0
district_code                0
lga                          0
ward                         0
population               21381
public_meeting               0
recorded_by                  0
scheme_management            0
permit                       0
construction_year        20709
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity_group               0
source                       0
source_t

`gps_height` is likely to be location-dependent, as heights within the same area are likely to be similar.<br>
`amount_tsh` is also likely to be location-dependent, as the amount of water that can be drawn in the same area is likely to be similar.<br>
`population` is also likely to be location-dependent, as communities in the same area are subject to similar living conditions.<br>
Under these assumptions, we treat the three predictors as depending on `region` and `district_code`, which, judging by their names, indicate a general area.<br>
Hence, missing values of `gps_height`, `amount_tsh` and `population` are imputed with the mean within each `region` and `district_code`, falling back to the `region` mean and then to the overall mean.<br>

#### There is no clear indication that the other predictors imputed earlier are related to geographic area, so they were not imputed by `region`/`district_code`.

In [26]:
df['amount_tsh'].fillna(df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df['amount_tsh'].mean(), inplace=True)
df['gps_height'].fillna(df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df['gps_height'].mean(), inplace=True)
df['population'].fillna(df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df.groupby(['region'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df['population'].mean(), inplace=True)
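
The three-step fallback above follows one pattern per column, which can be sketched as a small helper. `impute_by_group_mean` is a hypothetical name and the frame is toy data. A group with no observed values has a NaN mean, which is why the coarser fallbacks are needed.

```python
import numpy as np
import pandas as pd

def impute_by_group_mean(frame, column, levels):
    """Fill NaN in `column` with group means, trying each grouping in `levels`
    in order, then falling back to the overall mean."""
    out = frame.copy()
    for keys in levels:
        # transform('mean') returns the group mean aligned to every row.
        out[column] = out[column].fillna(out.groupby(keys)[column].transform("mean"))
    out[column] = out[column].fillna(out[column].mean())
    return out[column]

toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "C"],
    "district_code": [1, 1, 2, 1, 1, 1],
    "gps_height": [100.0, np.nan, np.nan, 300.0, 500.0, np.nan],
})
imputed = impute_by_group_mean(
    toy, "gps_height", levels=[["region", "district_code"], ["region"]]
)
print(imputed.tolist())
```

In the toy data the second row is filled at district level, the third at region level, and the last from the overall mean.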

`construction_year` (as a numeric predictor) is also imputed with its mean value. We assume that there is no relation between region and construction year, as resource constraints make it unlikely that all construction within a geographic area occurs in the same period.

In [27]:
df['construction_year'].fillna(df['construction_year'].mean(),inplace=True)

df.isnull().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
basin                    0
region                   0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
status_group             0
d

`construction_year` gives the year in which the waterpoint was constructed and is a good source for feature engineering. The longer a waterpoint has been in operation, the more likely it is to be non functional or in need of repair. <br> Convert `construction_year` and `date_recorded` into the number of years the waterpoint has been in operation, and drop both features afterwards, as `operational_years` is likely to be a more useful predictor.

In [28]:
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
df['operational_years'] = df.date_recorded.dt.year - df.construction_year

df.drop('date_recorded', axis=1, inplace=True)
df.drop('construction_year', axis=1, inplace=True)
drop.append('date_recorded')
drop.append('construction_year')
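
A small sketch of the age calculation on toy rows (the dates and years are made up). Because missing construction years were mean-imputed, some ages in the real data will be fractional. A negative age is also possible if a record's `date_recorded` precedes its `construction_year`, so a quick count of negatives is a reasonable sanity check.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04", "2012-10-01"],
    "construction_year": [1999.0, 2013.0, 1986.0],
})

# Parse the date strings, then subtract the construction year from the recording year.
toy["date_recorded"] = pd.to_datetime(toy["date_recorded"])
toy["operational_years"] = toy["date_recorded"].dt.year - toy["construction_year"]

# Count rows where the recording year precedes the construction year.
n_negative = int((toy["operational_years"] < 0).sum())
print(toy["operational_years"].tolist(), n_negative)
```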

## Take the same pre-processing steps for test data

In [29]:
test_df.funder.value_counts()

Government Of Tanzania    2215
Danida                     793
Hesawa                     580
World Bank                 352
Kkkt                       336
                          ... 
Medicine Lumundi             1
Ikela Wa                     1
Cbhi                         1
Crs                          1
Ba                           1
Name: funder, Length: 980, dtype: int64

In [30]:
def update_funder_test(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='World Bank':
        return 'bank'
    elif row['funder']=='Kkkt':
        return 'kkkt'
    else:
        return 'other'

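# Note: this calls the training-set helper update_funder, not update_funder_test,
# so 'Rwssp' maps to 'rwssp' and 'Kkkt' falls into 'other', matching the training categories.
test_df['funder'] = test_df.apply(lambda row: update_funder(row), axis=1)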

In [31]:
test_df.installer.value_counts()

DWE                    4349
Government              457
RWE                     292
Commu                   287
DANIDA                  255
                       ... 
Local technical tec       1
Water Department          1
CH                        1
HIMA                      1
KOYI                      1
Name: installer, Length: 1091, dtype: int64

In [32]:
def update_installer_test(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

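# Note: this calls the training-set helper update_installer; update_installer_test
# above encodes the same mapping, so the result is identical.
test_df['installer'] = test_df.apply(lambda row: update_installer(row), axis=1)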

In [33]:
test_df['public_meeting'] = test_df.apply(lambda row: convert_public_meeting(row), axis=1)
test_df['public_meeting'].fillna(test_df['public_meeting'].mode().item(),inplace=True)

In [34]:
test_df.scheme_management.value_counts()

VWC                 9124
WUG                 1290
Water authority      822
Water Board          714
WUA                  668
Parastatal           444
Company              280
Private operator     263
Other                230
SWC                   26
Trust                 20
Name: scheme_management, dtype: int64

In [35]:
def update_scheme_management_test(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    elif row['scheme_management']=='WUA':
        return 'wua'
    else:
        return 'other'  

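# Note: this calls the training-set helper update_scheme_management; update_scheme_management_test
# above maps the same five values, so the result is identical.
test_df['scheme_management'] = test_df.apply(lambda row: update_scheme_management(row), axis=1)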

Convert `permit` to binary predictor and impute with mode.

In [36]:
test_df['permit'] = test_df.apply(lambda row: convert_permit(row), axis=1)
test_df['permit'].fillna(test_df['permit'].mode().item(),inplace=True)

As with the training data, convert zeros used as placeholders to NaN in `population`, `gps_height`, `amount_tsh` and `construction_year` so that they can be imputed.

In [37]:
test_df['gps_height'].replace(0, np.nan, inplace=True)
test_df['population'].replace(0, np.nan, inplace=True)
test_df['amount_tsh'].replace(0, np.nan, inplace=True)
test_df['construction_year'].replace(0, np.nan, inplace=True)
test_df.isnull().sum()

id                           0
amount_tsh               10410
date_recorded                0
funder                       0
gps_height                5211
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                  99
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                5453
public_meeting               0
recorded_by                  0
scheme_management            0
scheme_name               7092
permit                       0
construction_year         5260
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

As with the training data, `gps_height`, `amount_tsh` and `population` are assumed to be location-dependent.<br>
Missing values are therefore imputed with the mean within each `region` and `district_code`, falling back to the `region` mean and then to the overall mean, all computed on the test set.<br>

#### There is no clear indication that the other predictors imputed earlier are related to geographic area, so they were not imputed by `region`/`district_code`.

In [38]:
test_df['amount_tsh'].fillna(test_df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df['amount_tsh'].mean(), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df['gps_height'].mean(), inplace=True)
test_df['population'].fillna(test_df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df.groupby(['region'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df['population'].mean(), inplace=True)

As with the training data, `construction_year` is imputed with its overall mean, on the assumption that construction year is unrelated to region.

In [39]:
test_df['construction_year'].fillna(test_df['construction_year'].mean(),inplace=True)

As with the training data, convert `construction_year` and `date_recorded` into `operational_years`, the number of years the waterpoint has been in operation. Both source columns are dropped in the next step.

In [40]:
test_df['date_recorded'] = pd.to_datetime(test_df['date_recorded'])
test_df['operational_years'] = test_df.date_recorded.dt.year - test_df.construction_year

Drop columns

In [41]:
for i in drop:
    test_df.drop(i, axis=1, inplace=True)
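
The loop above can also be expressed as a single call. A sketch on a toy frame follows; the `errors="ignore"` argument is an optional safeguard added here (not in the original) so that a name missing from the frame does not raise.

```python
import pandas as pd

# Toy frame and drop list; 'missing_col' is deliberately absent from the frame.
drop_cols = ["region_code", "num_private", "missing_col"]
toy = pd.DataFrame({"id": [1, 2], "region_code": [2, 24], "num_private": [0, 0]})

# Drop every listed column in one call, skipping names that are not present.
cleaned = toy.drop(columns=drop_cols, errors="ignore")
print(cleaned.columns.tolist())
```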

Export preprocessed data

In [42]:
pd.DataFrame(df).to_csv("clean.csv", index=False)
pd.DataFrame(test_df).to_csv("clean_test.csv", index=False)

## Model building (Third Submission) - RF

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold

Read Data

In [44]:
X_test = pd.read_csv('clean_test.csv')
X_test

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,basin,region,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,operational_years
0,50785,7702.857143,other,1996.000000,other,35.290799,-4.059696,Dinamu Secondary School,Internal,Manyara,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other,1.0
1,51630,1325.000000,gov,1569.000000,dwe,36.656709,-3.309214,Kimnyak,Pangani,Arusha,...,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,13.0
2,17168,1440.961538,other,1567.000000,other,34.767863,-5.004344,Puma Secondary,Internal,Singida,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other,3.0
3,45559,398.333333,other,267.000000,other,38.058046,-9.418672,Kwa Mzee Pange,Ruvuma / Southern Coast,Lindi,...,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other,26.0
4,49871,500.000000,other,1260.000000,other,35.006123,-10.950412,Kwa Mzee Turuka,Ruvuma / Southern Coast,Ruvuma,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14845,39307,1417.980296,danida,34.000000,other,38.852669,-6.582841,Kwambwezi,Wami / Ruvu,Pwani,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,23.0
14846,18990,1000.000000,other,679.918735,other,37.451633,-5.350428,Bonde La Mkondoa,Pangani,Tanga,...,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,17.0
14847,28749,1440.961538,other,1476.000000,other,34.739804,-4.585587,Bwawani,Internal,Singida,...,soft,good,insufficient,insufficient,dam,dam,surface,communal standpipe,communal standpipe,3.0
14848,33492,4152.061856,other,998.000000,dwe,35.432732,-10.584159,Kwa John,Lake Nyasa,Ruvuma,...,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,4.0


In [45]:
X_train = pd.read_csv('clean.csv')
y_train = X_train.pop('status_group')
X_test = pd.read_csv('clean_test.csv')

In [46]:
X_train.isnull().sum()

id                       0
amount_tsh               0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
basin                    0
region                   0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
operational_years        0
dtype: int64

### Encode categorical variables as integer codes
https://stackoverflow.com/questions/40336502/want-to-know-the-diff-among-pd-factorize-pd-get-dummies-sklearn-preprocessing/40338956

In [47]:
# Columns holding string categories that must become integer codes
categorical_cols = ['funder', 'installer', 'wpt_name', 'basin', 'region', 'lga', 'ward',
                    'recorded_by', 'scheme_management', 'extraction_type',
                    'extraction_type_group', 'extraction_type_class', 'management',
                    'management_group', 'payment', 'payment_type', 'water_quality',
                    'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
                    'source_class', 'waterpoint_type', 'waterpoint_type_group']

# Factorize the training columns and keep each column's categories for the test set
train_categories = {}
for col in categorical_cols:
    X_train[col], train_categories[col] = pd.factorize(X_train[col])

In [48]:
# Reuse the training categories so a given string maps to the same code in both sets;
# values that never appear in the training data are encoded as -1
for col in categorical_cols:
    X_test[col] = pd.Categorical(X_test[col], categories=train_categories[col]).codes
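
A note on `pd.factorize`: codes are assigned in order of first appearance, so the same string can receive different codes in two dataframes that are factorized independently. The sketch below uses made-up toy values (not the real data) to show the effect, and one possible way to keep the mapping aligned by reusing the training categories through `pd.Categorical`.

```python
import pandas as pd

# Toy frames with hypothetical values, only for illustration
train = pd.DataFrame({"basin": ["Pangani", "Internal", "Pangani"]})
test = pd.DataFrame({"basin": ["Internal", "Rufiji", "Pangani"]})

# Factorizing each frame on its own: codes follow order of first appearance
train_codes, train_uniques = pd.factorize(train["basin"])
test_codes_separate, _ = pd.factorize(test["basin"])

# Reusing the training categories keeps the mapping aligned;
# a value unseen in training ("Rufiji") becomes -1
test_codes_shared = pd.Categorical(test["basin"], categories=train_uniques).codes

print(list(train_codes))          # [0, 1, 0]   Pangani=0, Internal=1
print(list(test_codes_separate))  # [0, 1, 2]   Internal=0, so the mapping differs
print(list(test_codes_shared))    # [1, -1, 0]  same mapping as the training frame
```

With separate factorization, a tree split learned on "Pangani = 0" would be applied to "Internal" rows at prediction time, which is why a shared mapping matters.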

Tune hyperparameters

https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html <br>
Random Forest generally works well with its default settings, so most hyperparameters are left unchanged.  <br>
Increased `min_samples_split` from the default of `2` to `6` to reduce overfitting. <br>
Set `oob_score` to `True` to obtain an out-of-bag estimate of accuracy. <br>
Set `random_state` to `1` so that results are reproducible and differences between runs come from the parameters rather than the random seed. <br>
https://stackoverflow.com/questions/55070918/does-setting-a-random-state-in-sklearns-randomforestclassifier-bias-your-model <br>
Set `n_jobs` to `-1` to use all CPU cores. <br>
Varied `n_estimators` over 500, 750 and 1000 to see which number of trees performs best. <br>
Grid search is used instead of random search because the grid has only three candidates, so each of them can be evaluated exhaustively. <br>
https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85
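
As a cheaper complement to cross-validated grid search, the effect of `n_estimators` can also be observed through the out-of-bag score while trees are added to a single forest. The sketch below is an illustration on synthetic data from `make_classification`, not the waterpoint data; the tree counts and the names `X_demo`, `y_demo`, `rf_demo` and `oob_by_trees` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem standing in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=1)

# warm_start=True keeps the trees already fitted and only adds new ones
rf_demo = RandomForestClassifier(min_samples_split=6,
                                 oob_score=True,
                                 warm_start=True,
                                 random_state=1)

oob_by_trees = {}
for n in [50, 100, 150]:
    rf_demo.set_params(n_estimators=n)
    rf_demo.fit(X_demo, y_demo)
    oob_by_trees[n] = rf_demo.oob_score_  # OOB accuracy with n trees

print(oob_by_trees)
```

If the out-of-bag accuracy stops improving as trees are added, larger forests mainly add fitting time.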

In [49]:
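from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Base model: only min_samples_split deviates from the defaults
rf = RandomForestClassifier(min_samples_split=6,
                            oob_score=True,
                            random_state=1,
                            n_jobs=-1)

# Candidate numbers of trees
param_grid = {"n_estimators": [500, 750, 1000]}

# 5-fold cross-validated grid search scored on accuracy
gs = GridSearchCV(estimator=rf,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5,
                  n_jobs=-1)

# Note: X_train still contains the `id` column at this point
gs = gs.fit(X_train, y_train.values.ravel())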

In [50]:
print(gs.best_score_)
print(gs.best_params_)
print(gs.cv_results_)

0.8112289562289561
{'n_estimators': 500}
{'mean_fit_time': array([134.46012821, 179.96803913, 123.86182895]), 'std_fit_time': array([ 0.9109661 ,  6.25060535, 61.21095462]), 'mean_score_time': array([1.41031013, 6.76451693, 0.94903069]), 'std_score_time': array([0.35396978, 1.00884801, 0.19385634]), 'param_n_estimators': masked_array(data=[500, 750, 1000],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_estimators': 500}, {'n_estimators': 750}, {'n_estimators': 1000}], 'split0_test_score': array([0.81599327, 0.81531987, 0.81574074]), 'split1_test_score': array([0.80959596, 0.81010101, 0.81010101]), 'split2_test_score': array([0.81321549, 0.81363636, 0.81363636]), 'split3_test_score': array([0.81060606, 0.81052189, 0.80993266]), 'split4_test_score': array([0.80673401, 0.80622896, 0.80639731]), 'mean_test_score': array([0.81122896, 0.81116162, 0.81116162]), 'std_test_score': array([0.00315925, 0.00313945, 0.00323811]), 'rank_test_
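
The raw `cv_results_` dictionary is hard to read. In the output above, the three mean test scores differ by less than 0.0001, well inside the fold-to-fold standard deviation of about 0.003, so the choice of 500 trees is close to a tie. A minimal sketch of a more readable summary follows, using a small synthetic dataset and tiny tree counts as stand-ins (assumptions) so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

gs_demo = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                       param_grid={"n_estimators": [10, 20, 30]},
                       scoring="accuracy",
                       cv=3)
gs_demo.fit(X_demo, y_demo)

# Keep only the columns needed to compare candidates, best rank first
summary = (pd.DataFrame(gs_demo.cv_results_)
           [["param_n_estimators", "mean_test_score",
             "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))

print(summary.to_string(index=False))
```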

Fit model

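`oob_score_` is reported because each sample is scored only by trees that did not see it during training, which makes it a less optimistic estimate than the score measured on the training data itself (for a classifier, `score` returns accuracy rather than $R^2$).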
https://stats.stackexchange.com/questions/288699/r2-score-vs-oob-score-random-forest
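
The gap between the two estimates can be illustrated on synthetic data. This is a sketch under stated assumptions: `make_classification` with label noise (`flip_y=0.2`) replaces the real dataset, and the names `rf_demo`, `train_acc` and `oob_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Noisy synthetic data: 20% of labels are assigned at random (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, flip_y=0.2,
                                     random_state=1)

rf_demo = RandomForestClassifier(n_estimators=100,
                                 oob_score=True,
                                 random_state=1).fit(X_demo, y_demo)

train_acc = rf_demo.score(X_demo, y_demo)  # accuracy on data the trees have seen
oob_acc = rf_demo.oob_score_               # accuracy from trees that did not see each sample

print("training accuracy: %.4f" % train_acc)
print("out-of-bag accuracy: %.4f" % oob_acc)
```

The training accuracy is typically far higher than the out-of-bag accuracy on noisy data, because fully grown trees can memorize their own bootstrap samples.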

In [51]:
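# Drop the identifier so it is not used as a feature
X_train.drop('id', axis=1, inplace=True)
rf = RandomForestClassifier(criterion='gini',
                            min_samples_split=6,
                            n_estimators=500,
                            max_features='sqrt',  # same as the former 'auto' for classifiers; 'auto' was removed in scikit-learn 1.3
                            oob_score=True,
                            random_state=1,
                            n_jobs=-1)

rf.fit(X_train, y_train.values.ravel())
print("%.4f" % rf.oob_score_)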

0.8162


Inspect feature importances to see which features matter most, as a reference for the next submission

In [52]:
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:10]

|    | variable | importance |
|----|----------|------------|
| 4  | longitude | 0.091504 |
| 5  | latitude | 0.08943 |
| 26 | quantity | 0.076828 |
| 27 | quantity_group | 0.065496 |
| 6  | wpt_name | 0.057683 |
| 33 | operational_years | 0.053906 |
| 2  | gps_height | 0.051665 |
| 11 | ward | 0.047778 |
| 12 | population | 0.037041 |
| 31 | waterpoint_type | 0.03278 |
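
`feature_importances_` is based on impurity reduction, which tends to favour features with many distinct values (continuous coordinates or high-cardinality codes such as `wpt_name`). Permutation importance on held-out data is one way to cross-check the ranking. The sketch below runs on synthetic data with hypothetical column names `f0` to `f5`; it is an illustration, not a result for the waterpoint dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names (assumption)
cols = [f"f{i}" for i in range(6)]
X_arr, y_demo = make_classification(n_samples=400, n_features=6,
                                    n_informative=3, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=cols)

# Hold out a validation split so importance is measured on unseen rows
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, random_state=1)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_fit, y_fit)

# Shuffle one column at a time and record the mean drop in accuracy
result = permutation_importance(rf_demo, X_val, y_val,
                                n_repeats=5, random_state=1)

comparison = pd.DataFrame({"impurity": rf_demo.feature_importances_,
                           "permutation": result.importances_mean},
                          index=cols).sort_values("permutation", ascending=False)
print(comparison)
```

Features that rank high on impurity importance but near zero on permutation importance are candidates for closer inspection before the next submission.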


Make prediction on test data

In [53]:
idx=X_test['id']
X_test.drop(['id'],axis=1, inplace=True)
y_pred = rf.predict(X_test)

Create new dataframe with predictions and id

In [54]:
# Pair each test id with its predicted label, in submission column order
y_pred = pd.DataFrame({'id': idx, 'status_group': y_pred})

Create new csv for submission

In [55]:
# y_pred is already a DataFrame; write it without the index column
y_pred.to_csv("submission_rf.csv", index=False)
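
Before uploading, it may be worth checking the file against the submission format described at the top of the notebook (columns `id,status_group` and one of three labels). The helper `check_submission` below is a hypothetical sketch, not part of the original workflow; it is demonstrated on in-memory CSV text rather than the real file.

```python
import io
import pandas as pd

# The three labels listed in the task description
VALID_LABELS = {"functional", "functional needs repair", "non functional"}

def check_submission(csv_source, expected_rows=None):
    """Return a list of problems found in a submission CSV (empty if none)."""
    sub = pd.read_csv(csv_source)
    problems = []
    if list(sub.columns) != ["id", "status_group"]:
        problems.append(f"unexpected columns: {list(sub.columns)}")
        return problems  # remaining checks need both columns
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    if sub["status_group"].isna().any():
        problems.append("missing labels")
    unknown = set(sub["status_group"].dropna()) - VALID_LABELS
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    if expected_rows is not None and len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    return problems

good = io.StringIO("id,status_group\n50785,functional\n51630,functional\n")
bad = io.StringIO("id,status_group\n1,functional\n1,broken\n")

print(check_submission(good, expected_rows=2))  # []
print(check_submission(bad))  # ['duplicate ids', "unknown labels: ['broken']"]
```

For the real file, the call would be `check_submission("submission_rf.csv", expected_rows=len(X_test))`.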