# IS4242 Group Assignment Part 1
**November 11, 2020**

## Instructions

+ 2 Parts: Predictive Analytics (30 marks) & Prescriptive Analytics (10 marks)
+ Do all required data exploration - preprocessing - feature engineering - model building and evaluation steps
+ Submit first entry - predict on the given test data - see leaderboard position, then improve upon previous entry to improve test accuracy. Each team must have at least 2 submissions

+ At least one of the submitted models:
  + Must be a neural network implemented using PyTorch
  + Must use automated hyperparameter tuning
+ It is important to __explain each step__ you perform in preprocessing, feature engineering, model training. Ask yourself why you are performing the step and write the reason. While you may use any online resource, you have to cite them AND explanation should be in our own words.
+ __50% marks - explanation, 50% marks - code__
+ __Bonus points if your team has rank < 500__
+ **Submission deadline: November 11, 2020; 11:59 am**

#### Name: LECK WEI SHENG IAN
#### NUS ID: A0168177R
#### Name: WOO KENG THONG
#### NUS ID: A0167991L

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided information about the waterpoints in order label them.

The labels in this dataset are simple. There are three possible values:
1. functional - the waterpoint is operational and there are no repairs needed
2. functional needs repair - the waterpoint is operational, but needs repairs
3. non functional - the waterpoint is not operational

The format for the submission file is simply the row id and the predicted label. 
- id	status_group
- 50785	functional
- 51630	functional

CSV would thus look like
- id,status_group
- 50785,functional
- 51630,functional

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Preprocessing & Feature Engineering - 10

Load data, and merge data and labels together into one dataframe

In [2]:
labels = pd.read_csv('training-set-labels.csv')
df = pd.read_csv('training-set-values.csv')
test_df = pd.read_csv('test-set-values.csv')

df = pd.merge(df, labels, on='id')

Explore data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Check similar predictors and eliminate highly correlated predictors to reduce redundancy

In [4]:
df.groupby(['region','region_code']).size()

region         region_code
Arusha         2              3024
               24              326
Dar es Salaam  7               805
Dodoma         1              2201
Iringa         11             5294
Kagera         18             3316
Kigoma         16             2816
Kilimanjaro    3              4379
Lindi          8               300
               18                8
               80             1238
Manyara        21             1583
Mara           20             1969
Mbeya          12             4639
Morogoro       5              4006
Mtwara         9               390
               90              917
               99              423
Mwanza         17               55
               19             3047
Pwani          6              1609
               40                1
               60             1025
Rukwa          15             1808
Ruvuma         10             2640
Shinyanga      11                6
               14               20
               17           

Drop `region_code` as it seems to be identify regions, yet is not able to stand on its own as there are identical region codes in different regions. `drop` list is compiled for each column dropped for subsequent use with test values.

In [5]:
drop = []
drop.append('region_code')

df.drop('region_code',axis=1,inplace=True)
drop

['region_code']

In [6]:
df.groupby(['extraction_type','extraction_type_group','extraction_type_class']).size()

extraction_type            extraction_type_group  extraction_type_class
afridev                    afridev                handpump                  1770
cemo                       other motorpump        motorpump                   90
climax                     other motorpump        motorpump                   32
gravity                    gravity                gravity                  26780
india mark ii              india mark ii          handpump                  2400
india mark iii             india mark iii         handpump                    98
ksb                        submersible            submersible               1415
mono                       mono                   motorpump                 2865
nira/tanira                nira/tanira            handpump                  8154
other                      other                  other                     6430
other - mkulima/shinyanga  other handpump         handpump                     2
other - play pump          other hand

In [7]:
df.groupby(['management','management_group']).size()

management        management_group
company           commercial            685
other             other                 844
other - school    other                  99
parastatal        parastatal           1768
private operator  commercial           1971
trust             commercial             78
unknown           unknown               561
vwc               user-group          40507
water authority   commercial            904
water board       user-group           2933
wua               user-group           2535
wug               user-group           6515
dtype: int64

In [8]:
df.groupby(['payment','payment_type']).size()

payment                payment_type
never pay              never pay       25348
other                  other            1054
pay annually           annually         3642
pay monthly            monthly          8300
pay per bucket         per bucket       8985
pay when scheme fails  on failure       3914
unknown                unknown          8157
dtype: int64

In [9]:
df.groupby(['quantity','quantity_group']).size()

quantity      quantity_group
dry           dry                6246
enough        enough            33186
insufficient  insufficient      15129
seasonal      seasonal           4050
unknown       unknown             789
dtype: int64

In [10]:
df.groupby(['source','source_type','source_class']).size()

source                source_type           source_class
dam                   dam                   surface           656
hand dtw              borehole              groundwater       874
lake                  river/lake            surface           765
machine dbh           borehole              groundwater     11075
other                 other                 unknown           212
rainwater harvesting  rainwater harvesting  surface          2295
river                 river/lake            surface          9612
shallow well          shallow well          groundwater     16824
spring                spring                groundwater     17021
unknown               other                 unknown            66
dtype: int64

In [11]:
df.groupby(['waterpoint_type','waterpoint_type_group']).size()

waterpoint_type              waterpoint_type_group
cattle trough                cattle trough              116
communal standpipe           communal standpipe       28522
communal standpipe multiple  communal standpipe        6103
dam                          dam                          7
hand pump                    hand pump                17488
improved spring              improved spring            784
other                        other                     6380
dtype: int64

In [12]:
# Above this can drop more but I didn't because I thought can use this to improve second submission besides using a different model

In [13]:
df.describe()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,district_code,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,37115.131768,317.650385,668.297239,34.077427,-5.706033,0.474141,5.629747,179.909983,1300.652475
std,21453.128371,2997.574558,693.11635,6.567432,2.946019,12.23623,9.633649,471.482176,951.620547
min,0.0,0.0,-90.0,0.0,-11.64944,0.0,0.0,0.0,0.0
25%,18519.75,0.0,0.0,33.090347,-8.540621,0.0,2.0,0.0,0.0
50%,37061.5,0.0,369.0,34.908743,-5.021597,0.0,3.0,25.0,1986.0
75%,55656.5,20.0,1319.25,37.178387,-3.326156,0.0,5.0,215.0,2004.0
max,74247.0,350000.0,2770.0,40.345193,-2e-08,1776.0,80.0,30500.0,2013.0


Remove `num_private` as it does not seem to be meaningful - mostly zeros at 25%, 50% and 75%.

In [14]:
drop.append('num_private')

df.drop('num_private',axis=1,inplace=True)
drop

['region_code', 'num_private']

In [15]:
type(drop)

list

Check for null values in data

In [16]:
df.isnull().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
subvillage                 371
region                       0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity

Deal with columns containing null values.
- #### funder

In [17]:
df.funder.value_counts()

Government Of Tanzania    9084
Danida                    3114
Hesawa                    2202
Rwssp                     1374
World Bank                1349
                          ... 
John Skwese                  1
Municipal Council            1
Prince Medium School         1
Wsdo                         1
Kegocha                      1
Name: funder, Length: 1897, dtype: int64

Keep top 5 funders and set the rest to other, including missing values. Then, check the difference between the funders.

In [18]:
def update_funder(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='Rwssp':
        return 'rwssp'
    elif row['funder']=='World Bank':
        return 'bank'
    else:
        return 'other'
# Explain why top 5, and also include explanation for why reverting the rest to other
df['funder'] = df.apply(lambda row: update_funder(row), axis=1)

df.groupby(['funder','status_group']).size()

funder  status_group           
bank    functional                   545
        functional needs repair       97
        non functional               707
danida  functional                  1713
        functional needs repair      159
        non functional              1242
gov     functional                  3720
        functional needs repair      701
        non functional              4663
hesawa  functional                   936
        functional needs repair      232
        non functional              1034
other   functional                 24540
        functional needs repair     3019
        non functional             14718
rwssp   functional                   805
        functional needs repair      109
        non functional               460
dtype: int64

Observe the percentage of functional waterpoints by each funder

In [19]:
functional_fund_bank = (len(df[(df['status_group'] == 'functional') & (df['funder']=='bank')]))/(len(df[df['funder']=='bank']))*100
functional_fund_danida = (len(df[(df['status_group'] == 'functional') & (df['funder']=='danida')]))/(len(df[df['funder']=='danida']))*100
functional_fund_gov = (len(df[(df['status_group'] == 'functional') & (df['funder']=='gov')]))/(len(df[df['funder']=='gov']))*100
functional_fund_hesawa = (len(df[(df['status_group'] == 'functional') & (df['funder']=='hesawa')]))/(len(df[df['funder']=='hesawa']))*100
functional_fund_rwssp = (len(df[(df['status_group'] == 'functional') & (df['funder']=='rwssp')]))/(len(df[df['funder']=='rwssp']))*100
functional_fund_other = (len(df[(df['status_group'] == 'functional') & (df['funder']=='other')]))/(len(df[df['funder']=='other']))*100

print('functional bank waterpoints = ', round(functional_fund_bank,3),'%')
print('functional danida waterpoints = ', round(functional_fund_danida,3),'%')
print('functional gov waterpoints = ', round(functional_fund_gov,3),'%')
print('functional hesawa waterpoints = ', round(functional_fund_hesawa,3),'%')
print('functional rwssp waterpoints = ', round(functional_fund_rwssp,3),'%')
print('functional other waterpoints = ', round(functional_fund_other,3),'%')

functional bank waterpoints =  40.4 %
functional danida waterpoints =  55.01 %
functional gov waterpoints =  40.951 %
functional hesawa waterpoints =  42.507 %
functional rwssp waterpoints =  58.588 %
functional other waterpoints =  58.046 %


Deal with columns containing null values.
- #### installer

In [20]:
df.installer.value_counts()

DWE                        17402
Government                  1825
RWE                         1206
Commu                       1060
DANIDA                      1050
                           ...  
DWE/Ubalozi wa Marekani        1
VICF                           1
Schoo                          1
DMMD                           1
TCRS/ TASSAF                   1
Name: installer, Length: 2145, dtype: int64

Keep top 5 installers and set the rest to other, including missing values. Then, check the difference between the installers.

In [21]:
def update_installer(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

df['installer'] = df.apply(lambda row: update_installer(row), axis=1)

df.groupby(['installer','status_group']).size()

installer  status_group           
commu      functional                   724
           functional needs repair       32
           non functional               304
danida     functional                   542
           functional needs repair       83
           non functional               425
dwe        functional                  9433
           functional needs repair     1622
           non functional              6347
gov        functional                   535
           functional needs repair      256
           non functional              1034
other      functional                 20721
           functional needs repair     2187
           non functional             13949
rwe        functional                   304
           functional needs repair      137
           non functional               765
dtype: int64

Observe the percentage of functional waterpoints by each installer

In [22]:
functional_install_commu = (len(df[(df['status_group'] == 'functional') & (df['installer']=='commu')]))/(len(df[df['installer']=='commu']))*100
functional_install_danida = (len(df[(df['status_group'] == 'functional') & (df['installer']=='danida')]))/(len(df[df['installer']=='danida']))*100
functional_install_dwe = (len(df[(df['status_group'] == 'functional') & (df['installer']=='dwe')]))/(len(df[df['installer']=='dwe']))*100
functional_install_gov = (len(df[(df['status_group'] == 'functional') & (df['installer']=='gov')]))/(len(df[df['installer']=='gov']))*100
functional_install_rwe = (len(df[(df['status_group'] == 'functional') & (df['installer']=='rwe')]))/(len(df[df['installer']=='rwe']))*100
functional_install_other = (len(df[(df['status_group'] == 'functional') & (df['installer']=='other')]))/(len(df[df['installer']=='other']))*100

print('functional commu waterpoints = ', round(functional_install_commu,3),'%')
print('functional danida waterpoints = ', round(functional_install_danida,3),'%')
print('functional dwe waterpoints = ', round(functional_install_dwe,3),'%')
print('functional gov waterpoints = ', round(functional_install_gov,3),'%')
print('functional rwe waterpoints = ', round(functional_install_rwe,3),'%')
print('functional other waterpoints = ', round(functional_install_other,3),'%')

functional commu waterpoints =  68.302 %
functional danida waterpoints =  51.619 %
functional dwe waterpoints =  54.206 %
functional gov waterpoints =  29.315 %
functional rwe waterpoints =  25.207 %
functional other waterpoints =  56.22 %


Note that there are some differences between the functional waterpoints funded and installed by the same organisations - gov/danida.

Deal with columns containing null values.
- #### subvillage

In [23]:
df.subvillage.value_counts()

Madukani    508
Shuleni     506
Majengo     502
Kati        373
Mtakuja     262
           ... 
Kiniha        1
Itanga        1
Buhanuzi      1
Mwanya A      1
Omoche B      1
Name: subvillage, Length: 19287, dtype: int64

There are 19287 unique <code>subvillage</code>, of which the largest group is only 508. It is unlikely to be a meaningful feature, and will thus be dropped.

In [24]:
drop.append('subvillage')
# Explain why 19k unique subvillage is not meaningful - eg dataset size only 50k
df.drop('subvillage',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage']

Deal with columns containing null values.
- #### public_meeting

In [25]:
df.public_meeting.value_counts()

True     51011
False     5055
Name: public_meeting, dtype: int64

In [26]:
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
False           functional                  2173
                functional needs repair      442
                non functional              2440
True            functional                 28408
                functional needs repair     3719
                non functional             18884
dtype: int64

Convert `public_meeting` to binary predictor and impute with median.

In [27]:
def convert_public_meeting(row):
    if row['public_meeting']==True:
        return 1
    elif row['public_meeting']==False:
        return 0
    else:
        return np.nan
    
df['public_meeting'] = df.apply(lambda row: convert_public_meeting(row), axis=1)
df['public_meeting'].fillna(df['public_meeting'].median(),inplace=True)
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
0.0             functional                  2173
                functional needs repair      442
                non functional              2440
1.0             functional                 30086
                functional needs repair     3875
                non functional             20384
dtype: int64

Deal with columns containing null values.
- #### scheme_management

In [28]:
df.scheme_management.value_counts()

VWC                 36793
WUG                  5206
Water authority      3153
WUA                  2883
Water Board          2748
Parastatal           1680
Private operator     1063
Company              1061
Other                 766
SWC                    97
Trust                  72
None                    1
Name: scheme_management, dtype: int64

Keep top 5 scheme management and set the rest to other, including missing values. Then, check the difference between the scheme management.

In [29]:
def update_scheme_management(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='WUA':
        return 'wua'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    else:
        return 'other'  
# Consider not keeping just the top 5. Might not be necessary given the limited number 
df['scheme_management'] = df.apply(lambda row: update_scheme_management(row), axis=1)

df.groupby(['scheme_management','status_group']).size()

scheme_management  status_group           
other              functional                  4627
                   functional needs repair      513
                   non functional              3477
vwc                functional                 18960
                   functional needs repair     2334
                   non functional             15499
water_auth         functional                  1618
                   functional needs repair      448
                   non functional              1087
water_bd           functional                  2053
                   functional needs repair      111
                   non functional               584
wua                functional                  1995
                   functional needs repair      239
                   non functional               649
wug                functional                  3006
                   functional needs repair      672
                   non functional              1528
dtype: int64

Observe the percentage of functional waterpoints by each scheme management

In [30]:
functional_vwc = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='vwc')]))/(len(df[df['scheme_management']=='vwc']))*100
functional_water_auth = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='water_auth')]))/(len(df[df['scheme_management']=='water_auth']))*100
functional_water_bd = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='water_bd')]))/(len(df[df['scheme_management']=='water_bd']))*100
functional_wua = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='wua')]))/(len(df[df['scheme_management']=='wua']))*100
functional_wug = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='wug')]))/(len(df[df['scheme_management']=='wug']))*100
functional_other = (len(df[(df['status_group'] == 'functional') & (df['scheme_management']=='other')]))/(len(df[df['scheme_management']=='other']))*100

print('functional vwc waterpoints = ', round(functional_vwc,3),'%')
print('functional water_auth waterpoints = ', round(functional_water_auth,3),'%')
print('functional water_bd waterpoints = ', round(functional_water_bd,3),'%')
print('functional wua waterpoints = ', round(functional_wua,3),'%')
print('functional wug waterpoints = ', round(functional_wug,3),'%')
print('functional other waterpoints = ', round(functional_other,3),'%')

functional vwc waterpoints =  51.532 %
functional water_auth waterpoints =  51.316 %
functional water_bd waterpoints =  74.709 %
functional wua waterpoints =  69.199 %
functional wug waterpoints =  57.741 %
functional other waterpoints =  53.696 %


Deal with columns containing null values.
- #### scheme_name

In [31]:
df.scheme_name.value_counts()

K                               682
None                            644
Borehole                        546
Chalinze wate                   405
M                               400
                               ... 
Mtumbei mpopera water supply      1
Itoo water supply                 1
BL Siha Sec                       1
Ihanda spring box                 1
Mjimwema                          1
Name: scheme_name, Length: 2696, dtype: int64

There are 2696 unique `scheme_name`, of which the largest group is only 682. Additionally, there are 28166 null values in this column. It is unlikely for this column to yield meaningful results and we will be dropping it as well.

In [32]:
drop.append('scheme_name')

df.drop('scheme_name',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage', 'scheme_name']

Deal with columns containing null values.
- #### permit

In [33]:
df.permit.value_counts()

True     38852
False    17492
Name: permit, dtype: int64

In [34]:
df.groupby(['permit','status_group']).size()

permit  status_group           
False   functional                  9045
        functional needs repair     1320
        non functional              7127
True    functional                 21541
        functional needs repair     2697
        non functional             14614
dtype: int64

Convert `permit` to binary predictor and impute with median.

In [35]:
def convert_permit(row):
    if row['permit']==True:
        return 1
    elif row['permit']==False:
        return 0
    else:
        return np.nan
    
df['permit'] = df.apply(lambda row: convert_permit(row), axis=1)
df['permit'].fillna(df['permit'].median(),inplace=True)
df.groupby(['permit','status_group']).size()

permit  status_group           
0.0     functional                  9045
        functional needs repair     1320
        non functional              7127
1.0     functional                 23214
        functional needs repair     2997
        non functional             15697
dtype: int64

Having removed all null, ensure that there is no other invalid data $-$ 0 $-$ that can be immediately obvious for relevant columns such as `population`, `gps_height`, `amount_tsh` and `construction_year`.

In [36]:
df['gps_height'].replace(0, np.nan, inplace=True)
df['population'].replace(0, np.nan, inplace=True)
df['amount_tsh'].replace(0, np.nan, inplace=True)
df['construction_year'].replace(0, np.nan, inplace=True)
df.isnull().sum()

id                           0
amount_tsh               41639
date_recorded                0
funder                       0
gps_height               20438
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
region                       0
district_code                0
lga                          0
ward                         0
population               21381
public_meeting               0
recorded_by                  0
scheme_management            0
permit                       0
construction_year        20709
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity_group               0
source                       0
source_t

Assuming `water_tsh`, `gps_height` and `population` are affected by `region_code` and `district_code`, impute missing values with mean values.

In [37]:
df['amount_tsh'].fillna(df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df['amount_tsh'].mean(), inplace=True)
df['gps_height'].fillna(df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df['gps_height'].mean(), inplace=True)
df['population'].fillna(df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df.groupby(['region'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df['population'].mean(), inplace=True)
# Explain why region/district, check correlation. Explain why not using region etc for the other imputations.

`construction_year` can also be imputed with mean value

In [38]:
df['construction_year'].fillna(df['construction_year'].mean(),inplace=True)

df.isnull().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
basin                    0
region                   0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
status_group             0
d

Convert `construction_year` and `date_recorded` into the number of years the waterpoint has been in operation for, and drop both features after 

In [39]:
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
df['operational_years'] = df.date_recorded.dt.year - df.construction_year

df.drop('date_recorded', axis=1, inplace=True)
df.drop('construction_year', axis=1, inplace=True)
drop.append('date_recorded')
drop.append('construction_year')

#### Take the same steps for test data

In [40]:
test_df.funder.value_counts()

Government Of Tanzania         2215
Danida                          793
Hesawa                          580
World Bank                      352
Kkkt                            336
                               ... 
Aveda                             1
Ripat                             1
Water Project Mbawala Chini       1
Omari Abdallah                    1
Nyanokwi                          1
Name: funder, Length: 980, dtype: int64

In [41]:
def update_funder_test(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='World Bank':
        return 'bank'
    elif row['funder']=='Kkkt':
        return 'kkkt'
    else:
        return 'other'

test_df['funder'] = test_df.apply(lambda row: update_funder(row), axis=1)

In [42]:
test_df.installer.value_counts()

DWE                  4349
Government            457
RWE                   292
Commu                 287
DANIDA                255
                     ... 
UNIVERSAL COMPANY       1
Matimo Sangi            1
Hesewa                  1
Hadija Makame           1
NWE                     1
Name: installer, Length: 1091, dtype: int64

In [43]:
def update_installer_test(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

test_df['installer'] = test_df.apply(lambda row: update_installer(row), axis=1)

In [44]:
test_df['public_meeting'] = test_df.apply(lambda row: convert_public_meeting(row), axis=1)
test_df['public_meeting'].fillna(test_df['public_meeting'].median(),inplace=True)

In [45]:
test_df.scheme_management.value_counts()

VWC                 9124
WUG                 1290
Water authority      822
Water Board          714
WUA                  668
Parastatal           444
Company              280
Private operator     263
Other                230
SWC                   26
Trust                 20
Name: scheme_management, dtype: int64

In [46]:
def update_scheme_management_test(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    elif row['scheme_management']=='WUA':
        return 'wua'
    else:
        return 'other'  

test_df['scheme_management'] = test_df.apply(lambda row: update_scheme_management(row), axis=1)

In [47]:
test_df['permit'] = test_df.apply(lambda row: convert_permit(row), axis=1)
test_df['permit'].fillna(test_df['permit'].median(),inplace=True)

In [48]:
test_df['gps_height'].replace(0, np.nan, inplace=True)
test_df['population'].replace(0, np.nan, inplace=True)
test_df['amount_tsh'].replace(0, np.nan, inplace=True)
test_df['construction_year'].replace(0, np.nan, inplace=True)
test_df.isnull().sum()

id                           0
amount_tsh               10410
date_recorded                0
funder                       0
gps_height                5211
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                  99
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                5453
public_meeting               0
recorded_by                  0
scheme_management            0
scheme_name               7092
permit                       0
construction_year         5260
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [49]:
test_df['amount_tsh'].fillna(test_df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df['amount_tsh'].mean(), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df['gps_height'].mean(), inplace=True)
test_df['population'].fillna(test_df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df.groupby(['region'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df['population'].mean(), inplace=True)

In [50]:
test_df['construction_year'].fillna(test_df['construction_year'].mean(),inplace=True)

In [51]:
test_df['date_recorded'] = pd.to_datetime(test_df['date_recorded'])
test_df['operational_years'] = test_df.date_recorded.dt.year - test_df.construction_year

Drop columns

In [52]:
for i in drop:
    test_df.drop(i, axis=1, inplace=True)

Export preprocessed data

In [53]:
pd.DataFrame(df).to_csv("clean.csv", index=False)
pd.DataFrame(test_df).to_csv("clean_test.csv", index=False)

## Model building (First Submission) - 10

In [199]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder


Read Data

In [156]:
X_train = pd.read_csv('clean.csv')
y_train = X_train.pop('status_group')
X_test = pd.read_csv('clean_test.csv')

X_train.drop('Unnamed: 0', axis=1, inplace=True)
X_test.drop('Unnamed: 0', axis=1, inplace=True)

In [157]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   funder                 59400 non-null  object 
 3   gps_height             59400 non-null  float64
 4   installer              59400 non-null  object 
 5   longitude              59400 non-null  float64
 6   latitude               59400 non-null  float64
 7   wpt_name               59400 non-null  object 
 8   basin                  59400 non-null  object 
 9   region                 59400 non-null  object 
 10  district_code          59400 non-null  int64  
 11  lga                    59400 non-null  object 
 12  ward                   59400 non-null  object 
 13  population             59400 non-null  float64
 14  public_meeting         59400 non-null  float64
 15  re

### Encode labels into categorical variables
https://stackoverflow.com/questions/40336502/want-to-know-the-diff-among-pd-factorize-pd-get-dummies-sklearn-preprocessing/40338956

In [158]:
le = LabelEncoder()

X_train['funder'] = pd.factorize(X_train['funder'])[0]
X_train['installer'] = pd.factorize(X_train['installer'])[0]
X_train['wpt_name'] = pd.factorize(X_train['wpt_name'])[0]
X_train['basin'] = pd.factorize(X_train['basin'])[0]
X_train['region'] = pd.factorize(X_train['region'])[0]
X_train['lga'] = pd.factorize(X_train['lga'])[0]
X_train['ward'] = pd.factorize(X_train['ward'])[0]
X_train['recorded_by'] = pd.factorize(X_train['recorded_by'])[0]
X_train['scheme_management'] = pd.factorize(X_train['scheme_management'])[0]
X_train['extraction_type'] = pd.factorize(X_train['extraction_type'])[0]
X_train['extraction_type_group'] = pd.factorize(X_train['extraction_type_group'])[0]
X_train['extraction_type_class'] = pd.factorize(X_train['extraction_type_class'])[0]
X_train['management'] = pd.factorize(X_train['management'])[0]
X_train['management_group'] = pd.factorize(X_train['management_group'])[0]
X_train['payment'] = pd.factorize(X_train['payment'])[0]
X_train['payment_type'] = pd.factorize(X_train['payment_type'])[0]
X_train['water_quality'] = pd.factorize(X_train['water_quality'])[0]
X_train['quality_group'] = pd.factorize(X_train['quality_group'])[0]
X_train['quantity'] = pd.factorize(X_train['quantity'])[0]
X_train['quantity_group'] = pd.factorize(X_train['quantity_group'])[0]
X_train['source'] = pd.factorize(X_train['source'])[0]
X_train['source_type'] = pd.factorize(X_train['source_type'])[0]
X_train['source_class'] = pd.factorize(X_train['source_class'])[0]
X_train['waterpoint_type'] = pd.factorize(X_train['waterpoint_type'])[0]
X_train['waterpoint_type_group'] = pd.factorize(X_train['waterpoint_type_group'])[0]

In [159]:
X_test['funder'] = pd.factorize(X_test['funder'])[0]
X_test['installer'] = pd.factorize(X_test['installer'])[0]
X_test['wpt_name'] = pd.factorize(X_test['wpt_name'])[0]
X_test['basin'] = pd.factorize(X_test['basin'])[0]
X_test['region'] = pd.factorize(X_test['region'])[0]
X_test['lga'] = pd.factorize(X_test['lga'])[0]
X_test['ward'] = pd.factorize(X_test['ward'])[0]
X_test['recorded_by'] = pd.factorize(X_test['recorded_by'])[0]
X_test['scheme_management'] = pd.factorize(X_test['scheme_management'])[0]
X_test['extraction_type'] = pd.factorize(X_test['extraction_type'])[0]
X_test['extraction_type_group'] = pd.factorize(X_test['extraction_type_group'])[0]
X_test['extraction_type_class'] = pd.factorize(X_test['extraction_type_class'])[0]
X_test['management'] = pd.factorize(X_test['management'])[0]
X_test['management_group'] = pd.factorize(X_test['management_group'])[0]
X_test['payment'] = pd.factorize(X_test['payment'])[0]
X_test['payment_type'] = pd.factorize(X_test['payment_type'])[0]
X_test['water_quality'] = pd.factorize(X_test['water_quality'])[0]
X_test['quality_group'] = pd.factorize(X_test['quality_group'])[0]
X_test['quantity'] = pd.factorize(X_test['quantity'])[0]
X_test['quantity_group'] = pd.factorize(X_test['quantity_group'])[0]
X_test['source'] = pd.factorize(X_test['source'])[0]
X_test['source_type'] = pd.factorize(X_test['source_type'])[0]
X_test['source_class'] = pd.factorize(X_test['source_class'])[0]
X_test['waterpoint_type'] = pd.factorize(X_test['waterpoint_type'])[0]
X_test['waterpoint_type_group'] = pd.factorize(X_test['waterpoint_type_group'])[0]

In [99]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   funder                 59400 non-null  int64  
 3   gps_height             59400 non-null  float64
 4   installer              59400 non-null  int64  
 5   longitude              59400 non-null  float64
 6   latitude               59400 non-null  float64
 7   wpt_name               59400 non-null  int64  
 8   basin                  59400 non-null  int64  
 9   region                 59400 non-null  int64  
 10  district_code          59400 non-null  int64  
 11  lga                    59400 non-null  int64  
 12  ward                   59400 non-null  int64  
 13  population             59400 non-null  float64
 14  public_meeting         59400 non-null  float64
 15  re

Tune Hyperparameters

Explain why hyperparameters are chosen. Mention what trials you took. Which values you tried. If you changed from default, explain why
https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html <br>
Random Forest generally works well on default settings. Hence most of the features will not be touched.  <br>
Increased `min_samples_split` to `6` to reduce over-fitting - in comparison to default 2. <br>
Set `oob_score` to `True` to estimate accuracy <br>
Set `random_state` to `1` to ensure performance is not affected by random initial state <br>
https://stackoverflow.com/questions/55070918/does-setting-a-random-state-in-sklearns-randomforestclassifier-bias-your-model <br>
Set `n_jobs` to `-1` to use all CPUs<br>
Staggered `n_estimators` to observe which relative amount of trees is best

In [100]:
rf = RandomForestClassifier(min_samples_split=6,
                                oob_score=True,
                                random_state=1,
                                n_jobs=-1)

param_grid = {"n_estimators" : [500, 750, 1000]}

gs = GridSearchCV(estimator=rf,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5,
                  n_jobs=-1)

gs = gs.fit(X_train, y_train.values.ravel())

In [101]:
print(gs.best_score_)
print(gs.best_params_)
print(gs.cv_results_)

0.8005892255892256
{'n_estimators': 500}
{'mean_fit_time': array([40.66975927, 46.07213914, 50.15852165]), 'std_fit_time': array([0.18662643, 0.09424675, 0.20301247]), 'mean_score_time': array([2.29038763, 1.56403172, 1.82366753]), 'std_score_time': array([0.11621261, 0.07472312, 0.02482247]), 'param_n_estimators': masked_array(data=[500, 750, 1000],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_estimators': 500}, {'n_estimators': 750}, {'n_estimators': 1000}], 'split0_test_score': array([0.80178451, 0.80218855, 0.80205387]), 'split1_test_score': array([0.79939394, 0.79841751, 0.79838384]), 'mean_test_score': array([0.80058923, 0.80030303, 0.80021886]), 'std_test_score': array([0.00119529, 0.00188552, 0.00183502]), 'rank_test_score': array([1, 2, 3])}


Fit model

In [111]:
X_train.drop('id',axis=1,inplace=True)
rf = RandomForestClassifier(criterion='gini',
                                min_samples_split=6,
                                n_estimators=1000,
                                max_features='auto',
                                oob_score=True,
                                random_state=1,
                                n_jobs=-1)
                            
rf.fit(X_train, y_train.values.ravel())
print("%.4f" % rf.oob_score_)

0.8168


Inspect feature importance

In [112]:
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:10]
# Screenshot of submission on board at the end

Unnamed: 0,variable,importance
4,longitude,0.091455
5,latitude,0.089158
26,quantity,0.074807
27,quantity_group,0.068052
6,wpt_name,0.057641
33,operational_years,0.053453
2,gps_height,0.051422
11,ward,0.047653
12,population,0.037149
31,waterpoint_type,0.032905


Generate submission file

In [114]:
idx=X_test['id']
X_test.drop(['id'],axis=1, inplace=True)
y_pred = rf.predict(X_test)

In [115]:
y_pred=pd.DataFrame(y_pred)
y_pred['id']=idx
y_pred.columns=['status_group','id']
y_pred=y_pred[['id','status_group']]

In [167]:
pd.DataFrame(y_pred).to_csv("submission_rf.csv", index=False)

## Steps taken to improve accuracy (Second Submission) - 10

In [227]:
X_train = pd.read_csv('clean.csv')
y_train = X_train.pop('status_group')
X_test = pd.read_csv('clean_test.csv')

X_train.drop('Unnamed: 0', axis=1, inplace=True)
X_test.drop('Unnamed: 0', axis=1, inplace=True)

In [229]:
le = LabelEncoder()

X_train['funder'] = pd.factorize(X_train['funder'])[0]
X_train['installer'] = pd.factorize(X_train['installer'])[0]
X_train['wpt_name'] = pd.factorize(X_train['wpt_name'])[0]
X_train['basin'] = pd.factorize(X_train['basin'])[0]
X_train['region'] = pd.factorize(X_train['region'])[0]
X_train['lga'] = pd.factorize(X_train['lga'])[0]
X_train['ward'] = pd.factorize(X_train['ward'])[0]
X_train['recorded_by'] = pd.factorize(X_train['recorded_by'])[0]
X_train['scheme_management'] = pd.factorize(X_train['scheme_management'])[0]
X_train['extraction_type'] = pd.factorize(X_train['extraction_type'])[0]
X_train['extraction_type_group'] = pd.factorize(X_train['extraction_type_group'])[0]
X_train['extraction_type_class'] = pd.factorize(X_train['extraction_type_class'])[0]
X_train['management'] = pd.factorize(X_train['management'])[0]
X_train['management_group'] = pd.factorize(X_train['management_group'])[0]
X_train['payment'] = pd.factorize(X_train['payment'])[0]
X_train['payment_type'] = pd.factorize(X_train['payment_type'])[0]
X_train['water_quality'] = pd.factorize(X_train['water_quality'])[0]
X_train['quality_group'] = pd.factorize(X_train['quality_group'])[0]
X_train['quantity'] = pd.factorize(X_train['quantity'])[0]
X_train['quantity_group'] = pd.factorize(X_train['quantity_group'])[0]
X_train['source'] = pd.factorize(X_train['source'])[0]
X_train['source_type'] = pd.factorize(X_train['source_type'])[0]
X_train['source_class'] = pd.factorize(X_train['source_class'])[0]
X_train['waterpoint_type'] = pd.factorize(X_train['waterpoint_type'])[0]
X_train['waterpoint_type_group'] = pd.factorize(X_train['waterpoint_type_group'])[0]

In [230]:
X_test['funder'] = pd.factorize(X_test['funder'])[0]
X_test['installer'] = pd.factorize(X_test['installer'])[0]
X_test['wpt_name'] = pd.factorize(X_test['wpt_name'])[0]
X_test['basin'] = pd.factorize(X_test['basin'])[0]
X_test['region'] = pd.factorize(X_test['region'])[0]
X_test['lga'] = pd.factorize(X_test['lga'])[0]
X_test['ward'] = pd.factorize(X_test['ward'])[0]
X_test['recorded_by'] = pd.factorize(X_test['recorded_by'])[0]
X_test['scheme_management'] = pd.factorize(X_test['scheme_management'])[0]
X_test['extraction_type'] = pd.factorize(X_test['extraction_type'])[0]
X_test['extraction_type_group'] = pd.factorize(X_test['extraction_type_group'])[0]
X_test['extraction_type_class'] = pd.factorize(X_test['extraction_type_class'])[0]
X_test['management'] = pd.factorize(X_test['management'])[0]
X_test['management_group'] = pd.factorize(X_test['management_group'])[0]
X_test['payment'] = pd.factorize(X_test['payment'])[0]
X_test['payment_type'] = pd.factorize(X_test['payment_type'])[0]
X_test['water_quality'] = pd.factorize(X_test['water_quality'])[0]
X_test['quality_group'] = pd.factorize(X_test['quality_group'])[0]
X_test['quantity'] = pd.factorize(X_test['quantity'])[0]
X_test['quantity_group'] = pd.factorize(X_test['quantity_group'])[0]
X_test['source'] = pd.factorize(X_test['source'])[0]
X_test['source_type'] = pd.factorize(X_test['source_type'])[0]
X_test['source_class'] = pd.factorize(X_test['source_class'])[0]
X_test['waterpoint_type'] = pd.factorize(X_test['waterpoint_type'])[0]
X_test['waterpoint_type_group'] = pd.factorize(X_test['waterpoint_type_group'])[0]

Adding more parameters to be tuned automatically.
`n_estimators` - number of trees in the forest.

We are using `oob_score_` to optimize as it is generally unbiased, whereas $R^2$ might overfit on training data.
https://stats.stackexchange.com/questions/288699/r2-score-vs-oob-score-random-forest

In [222]:
def rf_gs_score_ax(parameterization, weight=None):

    p_names = ['max_features', 'min_samples_split',
              'min_samples_leaf', 'n_estimators']
    params = {}
    
    for p in p_names:
        params[p] = parameterization.get(p)
    
    print(params)

    rf = RandomForestClassifier(criterion='gini',
                                max_features=params['max_features'],
                                min_samples_split=params['min_samples_split'],
                                min_samples_leaf=params['min_samples_leaf'],
                                n_estimators=params['n_estimators'],
                                oob_score=True,
                                random_state=1,
                                n_jobs=-1)
    rf.fit(X_train, y_train)
    
    print(rf.oob_score_)
    return rf.oob_score_

def evaluate_rf(parameters):
    return {"rf_gs": rf_gs_score_ax(parameters)}

In [173]:
parameters=[
    {
        "name": "max_features",
        "type": "range",
        "bounds": [1, 34],
        "log_scale": False,
    },
    {
        "name": "min_samples_split",
        "type": "range",
        "bounds": [2, 10],
    },
    {
        "name": "min_samples_leaf",
        "type": "range",
        "bounds": [1, 10],
    },
    {
        "name": "n_estimators",
        "type": "range",
        "bounds": [100, 1500],
    },
]

In [224]:
from ax import optimize

best_parameters, values, experiment, model = optimize(
    parameters=parameters,
    evaluation_function=rf_gs_score_ax,
    objective_name='rf_gs',
    total_trials=15
)

[INFO 11-09 23:58:03] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter max_features. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:58:03] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter min_samples_split. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:58:03] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter min_samples_leaf. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:58:03] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter n_estimators. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' 

{'max_features': 31, 'min_samples_split': 6, 'min_samples_leaf': 10, 'n_estimators': 1285}


[INFO 11-10 00:00:15] ax.service.managed_loop: Running optimization trial 2...


0.804040404040404
{'max_features': 15, 'min_samples_split': 6, 'min_samples_leaf': 5, 'n_estimators': 680}


[INFO 11-10 00:01:00] ax.service.managed_loop: Running optimization trial 3...


0.8114478114478114
{'max_features': 22, 'min_samples_split': 2, 'min_samples_leaf': 1, 'n_estimators': 908}


[INFO 11-10 00:02:22] ax.service.managed_loop: Running optimization trial 4...


0.8109090909090909
{'max_features': 11, 'min_samples_split': 5, 'min_samples_leaf': 2, 'n_estimators': 1387}
0.81503367003367


[INFO 11-10 00:03:36] ax.service.managed_loop: Running optimization trial 5...


{'max_features': 30, 'min_samples_split': 9, 'min_samples_leaf': 1, 'n_estimators': 1262}


[INFO 11-10 00:05:59] ax.service.managed_loop: Running optimization trial 6...


0.8132323232323232
{'max_features': 15, 'min_samples_split': 7, 'min_samples_leaf': 1, 'n_estimators': 1363}


[INFO 11-10 00:07:29] ax.service.managed_loop: Running optimization trial 7...


0.8151851851851852
{'max_features': 6, 'min_samples_split': 7, 'min_samples_leaf': 2, 'n_estimators': 1256}


[INFO 11-10 00:08:16] ax.service.managed_loop: Running optimization trial 8...


0.8146801346801347
{'max_features': 14, 'min_samples_split': 7, 'min_samples_leaf': 2, 'n_estimators': 1500}


[INFO 11-10 00:09:45] ax.service.managed_loop: Running optimization trial 9...


0.8146464646464646
{'max_features': 7, 'min_samples_split': 6, 'min_samples_leaf': 1, 'n_estimators': 1485}


[INFO 11-10 00:10:43] ax.service.managed_loop: Running optimization trial 10...


0.816010101010101
{'max_features': 1, 'min_samples_split': 4, 'min_samples_leaf': 1, 'n_estimators': 1500}


[INFO 11-10 00:11:19] ax.service.managed_loop: Running optimization trial 11...


0.8132323232323232
{'max_features': 11, 'min_samples_split': 6, 'min_samples_leaf': 1, 'n_estimators': 1500}


[INFO 11-10 00:12:33] ax.service.managed_loop: Running optimization trial 12...


0.8157239057239057
{'max_features': 5, 'min_samples_split': 7, 'min_samples_leaf': 1, 'n_estimators': 1500}


[INFO 11-10 00:13:23] ax.service.managed_loop: Running optimization trial 13...


0.8156734006734007
{'max_features': 7, 'min_samples_split': 6, 'min_samples_leaf': 1, 'n_estimators': 1359}


[INFO 11-10 00:14:15] ax.service.managed_loop: Running optimization trial 14...


0.8155892255892256
{'max_features': 15, 'min_samples_split': 10, 'min_samples_leaf': 1, 'n_estimators': 1045}


[INFO 11-10 00:15:24] ax.service.managed_loop: Running optimization trial 15...


0.814006734006734
{'max_features': 8, 'min_samples_split': 6, 'min_samples_leaf': 1, 'n_estimators': 1500}
0.8155555555555556


In [225]:
best_parameters

{'max_features': 7,
 'min_samples_split': 6,
 'min_samples_leaf': 1,
 'n_estimators': 1485}

In [226]:
values

({'rf_gs': 0.8160100950904645}, {'rf_gs': {'rf_gs': 9.21624643215365e-13}})

In [232]:
X_train.drop('id',axis=1,inplace=True)
rf = RandomForestClassifier(criterion='gini',
                                max_features=best_parameters['max_features'],
                                min_samples_split=best_parameters['min_samples_split'],
                                min_samples_leaf=best_parameters['min_samples_leaf'],
                                n_estimators=best_parameters['n_estimators'],
                                oob_score=True,
                                random_state=1,
                                n_jobs=-1)
                            
rf.fit(X_train, y_train.values.ravel())
print("%.4f" % rf.oob_score_)

0.8166


In [233]:
idx=X_test['id']
X_test.drop(['id'],axis=1, inplace=True)
y_pred = rf.predict(X_test)

In [234]:
y_pred=pd.DataFrame(y_pred)
y_pred['id']=idx
y_pred.columns=['status_group','id']
y_pred=y_pred[['id','status_group']]

In [235]:
pd.DataFrame(y_pred).to_csv("submission_rf_ax.csv", index=False)

## Model 3 - Using AX

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical =True)

In [216]:
y_train

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

#### Define objective function for hyperparameter optimization
https://www.justintodata.com/hyperparameter-tuning-with-python-complete-step-by-step-guide/

In [218]:
def xgboost_cv_score_ax(parameterization, weight=None):
    NFOLD = 7
    NUM_BOOST_ROUND = 500

    p_names = ['learning_rate', 'max_depth' 'subsample', 'min_split_loss', 'min_child_weight', 'colsample_bytree', 
              'colsample_bylevel', 'colsample_bynode', 'lambda', 'alpha']
    params = {}
    params['objective'] = 'reg:squarederror'
    
    for p in p_names:
        params[p] = parameterization.get(p)
    print(params)
    # K-Fold cross validation score.
    cv_results = xgb.cv(dtrain=dtrain,
                        params=params,
                        nfold=NFOLD,
                        num_boost_round=NUM_BOOST_ROUND,
                        metrics="roc_auc", 
                        as_pandas=True,
                        seed=987)
    print(cv_results)
    return roc_auc_score(cv_results)

def evaluate_xgboost(parameters):
    return {"xgboost_cv": xgboost_cv_score_ax(parameters)}

Define the parameters for each hyperparameter

In [162]:
parameters=[
    {
        "name": "learning_rate",
        "type": "range",
        "bounds": [0.075, 0.7],
        "log_scale": False,
    },
    {
        "name": "max_depth",
        "type": "range",
        "bounds": [6, 14],
    },
    {
        "name": "subsample",
        "type": "range",
        "bounds": [0.0, 1.0],
    },
    {
        "name": "min_split_loss",
        "type": "range",
        "bounds": [0.0, 50.0],
    },
    {
        "name": "min_child_weight",
        "type": "range",
        "bounds": [0.0, 50.0],
    },
    {
        "name": "colsample_bytree",
        "type": "range",
        "bounds": [0.0, 1.0],
    },
    {
        "name": "colsample_bylevel",
        "type": "range",
        "bounds": [0.0, 1.0],
    },
    {
        "name": "colsample_bynode",
        "type": "range",
        "bounds": [0.0, 1.0],
    },
    {
        "name": "lambda",
        "type": "range",
        "bounds": [0.0, 10.0],
    },
    {
        "name": "alpha",
        "type": "range",
        "bounds": [0.0, 10.0],
    },
]

In [219]:
from ax import optimize

best_parameters, values, experiment, model = optimize(
    parameters=parameters,
    evaluation_function=evaluate,
    objective_name='xgboost_cv',
    total_trials=15
)

[INFO 11-09 23:41:45] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter max_features. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:41:45] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter min_samples_split. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:41:45] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter min_samples_leaf. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 11-09 23:41:45] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter n_estimators. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' 

{'objective': 'reg:squarederror', 'learning_rate': None, 'max_depthsubsample': None, 'min_split_loss': None, 'min_child_weight': None, 'colsample_bytree': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'lambda': None, 'alpha': None}


[INFO 11-09 23:43:20] ax.service.managed_loop: Running optimization trial 2...


{'objective': 'reg:squarederror', 'learning_rate': None, 'max_depthsubsample': None, 'min_split_loss': None, 'min_child_weight': None, 'colsample_bytree': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'lambda': None, 'alpha': None}


KeyboardInterrupt: 

In [164]:
best_parameters

{'learning_rate': 0.270311491212343,
 'max_depth': 7,
 'subsample': 0.5823466491431419,
 'min_split_loss': 26.322423993172862,
 'min_child_weight': 33.617789631982745,
 'colsample_bytree': 0.0,
 'colsample_bylevel': 0.9018853233320182,
 'colsample_bynode': 0.07086578553329038,
 'lambda': 0.5263585357863017,
 'alpha': 7.875683627497082}

In [165]:
values

({'xgboost_cv': 0.5627194686242211},
 {'xgboost_cv': {'xgboost_cv': 1.1637216544583767e-05}})

In [166]:
from sklearn.metrics import mean_squared_error

xgb_model = xgb.train(params=best_parameters, dtrain=dtrain, num_boost_round=500)

dtest = xgb.DMatrix(X_test)
xgb_test_pred = xgb_model.predict(dtest)
xgb_test_pred

array([0.06752759, 0.34840482, 0.19542778, ..., 0.33143944, 0.50460416,
       0.6522131 ], dtype=float32)

In [None]:
# def model_for_submission(features, target, test):
#     if __name__ == '__main__':

#          best_params = {'learning_rate': [0.075],
#                         'max_depth': [14],
#                         'min_samples_leaf': [16],
#                         'max_features': [1.0],
#                         'n_estimators': [100]}                      

#          estimator = GridSearchCV(estimator=GradientBoostingClassifier(),
#                                  param_grid=best_params,
#                                  n_jobs=-1)

#          estimator.fit(features, target)     

#          predictions = estimator.predict(test)

#          data = {'id': a, 'status_group': predictions}

#          submit = pd.DataFrame(data=data)

#          vals_to_replace = {2:'functional', 1:'functional needs repair',
#                            0:'non functional'}

#          submit.status_group = submit.status_group.replace(vals_to_replace)        

#          submit=submit[['id','status_group']]
         
#          pd.DataFrame(submit).to_csv("submission_xg.csv")


In [None]:
# Run model for submission.

# model_for_submission(features, target, test)

## NN with AX