# IS4242 Group Assignment Part 1
**November 11, 2020**

## Instructions

+ 2 Parts: Predictive Analytics (30 marks) & Prescriptive Analytics (10 marks)
+ Do all required data exploration - preprocessing - feature engineering - model building and evaluation steps
+ Submit first entry - predict on the given test data - see leaderboard position, then improve upon previous entry to improve test accuracy. Each team must have at least 2 submissions

+ At least one of the submitted models:
  + Must be a neural network implemented using PyTorch
  + Must use automated hyperparameter tuning
+ It is important to __explain each step__ you perform in preprocessing, feature engineering, model training. Ask yourself why you are performing the step and write the reason. While you may use any online resource, you have to cite them AND explanation should be in our own words.
+ __50% marks - explanation, 50% marks - code__
+ __Bonus points if your team has rank < 500__
+ **Submission deadline: November 11, 2020; 11:59 am**

#### Name: LECK WEI SHENG IAN
#### NUS ID: A0168177R
#### Name: WOO KENG THONG
#### NUS ID: A0167991L

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided information about the waterpoints in order label them.

The labels in this dataset are simple. There are three possible values:
1. functional - the waterpoint is operational and there are no repairs needed
2. functional needs repair - the waterpoint is operational, but needs repairs
3. non functional - the waterpoint is not operational

The format for the submission file is simply the row id and the predicted label. 
- id	status_group
- 50785	functional
- 51630	functional

CSV would thus look like
- id,status_group
- 50785,functional
- 51630,functional

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Preprocessing & Feature Engineering

Load data, and merge data and labels together into one dataframe

In [2]:
labels = pd.read_csv('training-set-labels.csv')
df = pd.read_csv('training-set-values.csv')
test_df = pd.read_csv('test-set-values.csv')

df = pd.merge(df, labels, on='id')

Explore data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Check similar predictors and eliminate highly correlated predictors to reduce redundancy and reduce multicollinearity

In [4]:
df.groupby(['region','region_code']).size()

region         region_code
Arusha         2              3024
               24              326
Dar es Salaam  7               805
Dodoma         1              2201
Iringa         11             5294
Kagera         18             3316
Kigoma         16             2816
Kilimanjaro    3              4379
Lindi          8               300
               18                8
               80             1238
Manyara        21             1583
Mara           20             1969
Mbeya          12             4639
Morogoro       5              4006
Mtwara         9               390
               90              917
               99              423
Mwanza         17               55
               19             3047
Pwani          6              1609
               40                1
               60             1025
Rukwa          15             1808
Ruvuma         10             2640
Shinyanga      11                6
               14               20
               17           

Drop `region_code` as it seems to be identify regions, yet is not able to stand on its own as there are identical region codes in different regions. `drop` list is compiled for each column dropped for subsequent use with test values.

In [5]:
drop = []
drop.append('region_code')

df.drop('region_code',axis=1,inplace=True)
drop

['region_code']

In [6]:
df.describe()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,district_code,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,37115.131768,317.650385,668.297239,34.077427,-5.706033,0.474141,5.629747,179.909983,1300.652475
std,21453.128371,2997.574558,693.11635,6.567432,2.946019,12.23623,9.633649,471.482176,951.620547
min,0.0,0.0,-90.0,0.0,-11.64944,0.0,0.0,0.0,0.0
25%,18519.75,0.0,0.0,33.090347,-8.540621,0.0,2.0,0.0,0.0
50%,37061.5,0.0,369.0,34.908743,-5.021597,0.0,3.0,25.0,1986.0
75%,55656.5,20.0,1319.25,37.178387,-3.326156,0.0,5.0,215.0,2004.0
max,74247.0,350000.0,2770.0,40.345193,-2e-08,1776.0,80.0,30500.0,2013.0


Remove `num_private` as it does not seem to be meaningful - mostly zeros at 25%, 50% and 75%.

In [7]:
drop.append('num_private')

df.drop('num_private',axis=1,inplace=True)
drop

['region_code', 'num_private']

Check for null values in data

In [8]:
df.isnull().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
subvillage                 371
region                       0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity

Deal with columns containing null values.
- #### funder

In [9]:
df['funder'].value_counts()

Government Of Tanzania     9084
Danida                     3114
Hesawa                     2202
Rwssp                      1374
World Bank                 1349
                           ... 
Mzee Salum Bakari Darus       1
Mwanamisi Ally                1
Morad                         1
Ndolezi                       1
Samwel                        1
Name: funder, Length: 1897, dtype: int64

Keep top 5 `funder` and set the rest to `other`, including missing values.<br>
Using only the main 5 `funder` and reverting the rest to `other` is to reduce computational cost and mitigate overfitting.

In [10]:
def update_funder(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='Rwssp':
        return 'rwssp'
    elif row['funder']=='World Bank':
        return 'bank'
    else:
        return 'other'

df['funder'] = df.apply(lambda row: update_funder(row), axis=1)

df.groupby(['funder','status_group']).size()

funder  status_group           
bank    functional                   545
        functional needs repair       97
        non functional               707
danida  functional                  1713
        functional needs repair      159
        non functional              1242
gov     functional                  3720
        functional needs repair      701
        non functional              4663
hesawa  functional                   936
        functional needs repair      232
        non functional              1034
other   functional                 24540
        functional needs repair     3019
        non functional             14718
rwssp   functional                   805
        functional needs repair      109
        non functional               460
dtype: int64

Deal with columns containing null values.
- #### installer

In [11]:
df.installer.value_counts()

DWE                 17402
Government           1825
RWE                  1206
Commu                1060
DANIDA               1050
                    ...  
Atlas                   1
TGT                     1
RDDC                    1
Lindi contractor        1
Siza Mayengo            1
Name: installer, Length: 2145, dtype: int64

Keep top 5 `installer` and set the rest to `other`, including missing values.<br>
Using only the main 5 `installer` and reverting the rest to `other` is to reduce computational cost and mitigate overfitting.

In [12]:
def update_installer(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

df['installer'] = df.apply(lambda row: update_installer(row), axis=1)

df.groupby(['installer','status_group']).size()

installer  status_group           
commu      functional                   724
           functional needs repair       32
           non functional               304
danida     functional                   542
           functional needs repair       83
           non functional               425
dwe        functional                  9433
           functional needs repair     1622
           non functional              6347
gov        functional                   535
           functional needs repair      256
           non functional              1034
other      functional                 20721
           functional needs repair     2187
           non functional             13949
rwe        functional                   304
           functional needs repair      137
           non functional               765
dtype: int64

Deal with columns containing null values.
- #### subvillage

In [13]:
df.subvillage.value_counts()

Madukani     508
Shuleni      506
Majengo      502
Kati         373
Mtakuja      262
            ... 
Kasunga        1
Seketule       1
Itegetela      1
Mulungu A      1
Nyamusta       1
Name: subvillage, Length: 19287, dtype: int64

There are 19287 unique `subvillage`, of which the largest group is only 508. As the total dataset only has around 59400 values, about a third of the data is unique. It is thus unlikely to be a meaningful feature, and will be dropped.

In [14]:
drop.append('subvillage')
df.drop('subvillage',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage']

Deal with columns containing null values.
- #### public_meeting

In [15]:
df.public_meeting.value_counts()

True     51011
False     5055
Name: public_meeting, dtype: int64

In [16]:
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
False           functional                  2173
                functional needs repair      442
                non functional              2440
True            functional                 28408
                functional needs repair     3719
                non functional             18884
dtype: int64

Convert `public_meeting` to binary predictor and impute with mode.

In [17]:
def convert_public_meeting(row):
    if row['public_meeting']==True:
        return 1
    elif row['public_meeting']==False:
        return 0
    else:
        return np.nan
    
df['public_meeting'] = df.apply(lambda row: convert_public_meeting(row), axis=1)
df['public_meeting'].fillna(df['public_meeting'].mode().item(),inplace=True)
df.groupby(['public_meeting','status_group']).size()

public_meeting  status_group           
0.0             functional                  2173
                functional needs repair      442
                non functional              2440
1.0             functional                 28408
                functional needs repair     3719
                non functional             18884
dtype: int64

Deal with columns containing null values.
- #### scheme_management

In [18]:
df.scheme_management.value_counts()

VWC                 36793
WUG                  5206
Water authority      3153
WUA                  2883
Water Board          2748
Parastatal           1680
Private operator     1063
Company              1061
Other                 766
SWC                    97
Trust                  72
None                    1
Name: scheme_management, dtype: int64

Keep top 5 `scheme_management` and set the rest to `other`, including missing values.<br>
Using only the main 5 `scheme_management` and reverting the rest to `other` is to reduce computational cost and mitigate overfitting.

In [19]:
def update_scheme_management(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='WUA':
        return 'wua'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    else:
        return 'other'  
    
df['scheme_management'] = df.apply(lambda row: update_scheme_management(row), axis=1)
df.groupby(['scheme_management','status_group']).size()

scheme_management  status_group           
other              functional                  4627
                   functional needs repair      513
                   non functional              3477
vwc                functional                 18960
                   functional needs repair     2334
                   non functional             15499
water_auth         functional                  1618
                   functional needs repair      448
                   non functional              1087
water_bd           functional                  2053
                   functional needs repair      111
                   non functional               584
wua                functional                  1995
                   functional needs repair      239
                   non functional               649
wug                functional                  3006
                   functional needs repair      672
                   non functional              1528
dtype: int64

Deal with columns containing null values.
- #### scheme_name

In [20]:
df.scheme_name.value_counts()

K                           682
None                        644
Borehole                    546
Chalinze wate               405
M                           400
                           ... 
Kwam                          1
Nabaiye pipe broken           1
Kifumangao Water supply       1
Fufulamsuri water supply      1
Kapu chini water supply       1
Name: scheme_name, Length: 2696, dtype: int64

There are 2696 unique `scheme_name`, of which the largest group is only 682. Additionally, there are 28166 null values in this column. As the total dataset only has around 59400 values, nearly half of the data is either unique or missing. It is thus unlikely to be a meaningful feature, and will be dropped.

In [21]:
drop.append('scheme_name')

df.drop('scheme_name',axis=1,inplace=True)
drop

['region_code', 'num_private', 'subvillage', 'scheme_name']

Deal with columns containing null values.
- #### permit

In [22]:
df.permit.value_counts()

True     38852
False    17492
Name: permit, dtype: int64

In [23]:
df.groupby(['permit','status_group']).size()

permit  status_group           
False   functional                  9045
        functional needs repair     1320
        non functional              7127
True    functional                 21541
        functional needs repair     2697
        non functional             14614
dtype: int64

Convert `permit` to binary predictor and impute with mode.

In [24]:
def convert_permit(row):
    if row['permit']==True:
        return 1
    elif row['permit']==False:
        return 0
    else:
        return np.nan
    
df['permit'] = df.apply(lambda row: convert_permit(row), axis=1)
df['permit'].fillna(df['permit'].mode().item(),inplace=True)
df.groupby(['permit','status_group']).size()

permit  status_group           
0.0     functional                  9045
        functional needs repair     1320
        non functional              7127
1.0     functional                 21541
        functional needs repair     2697
        non functional             14614
dtype: int64

Having removed all null, ensure that there is no other invalid data $-$ 0 $-$ that can be immediately obvious for relevant columns such as `population`, `gps_height`, `amount_tsh` and `construction_year`.

In [25]:
df['gps_height'].replace(0, np.nan, inplace=True)
df['population'].replace(0, np.nan, inplace=True)
df['amount_tsh'].replace(0, np.nan, inplace=True)
df['construction_year'].replace(0, np.nan, inplace=True)
df.isnull().sum()

id                           0
amount_tsh               41639
date_recorded                0
funder                       0
gps_height               20438
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
basin                        0
region                       0
district_code                0
lga                          0
ward                         0
population               21381
public_meeting            3334
recorded_by                  0
scheme_management            0
permit                    3056
construction_year        20709
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity                     0
quantity_group               0
source                       0
source_t

`gps_height` is likely to be location-dependent as it is the height of an area is likely to be similar.<br>
`amount_tsh` is also likely to be location-dependent as the amount of water that can be drawn in the same area is likely to be similar.<br>
`population` is also likely to be location-dependent as communities in the same area will be subject to similar living conditions.<br>
Holding the above assumptions, we will assume that the three predictors are affected by `region` and `district_code`, which based on the name, indicate a general area.<br>
Hence, `region` and `district_code` will also be used when imputing missing data with mean values for `gps_height`, `amount_tsh` and `population`.<br>

#### There is no clear indication whether the other predictors imputed earlier are relevant to geographic area, and hence were not imputed with `region`/`district_code`.

In [26]:
df['amount_tsh'].fillna(df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
df['amount_tsh'].fillna(df['amount_tsh'].mean(), inplace=True)
df['gps_height'].fillna(df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
df['gps_height'].fillna(df['gps_height'].mean(), inplace=True)
df['population'].fillna(df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df.groupby(['region'])['population'].transform('mean'), inplace=True)
df['population'].fillna(df['population'].mean(), inplace=True)

`construction_year` (as a numeric predictor) is also imputed with mean value. We assume that there is no relation between region and construction year as it is unlikely construction within each geographic area occurs in the same period due to resource constraint.

In [27]:
df['construction_year'].fillna(df['construction_year'].mean(),inplace=True)

df.isnull().sum()

id                          0
amount_tsh                  0
date_recorded               0
funder                      0
gps_height                  0
installer                   0
longitude                   0
latitude                    0
wpt_name                    0
basin                       0
region                      0
district_code               0
lga                         0
ward                        0
population                  0
public_meeting           3334
recorded_by                 0
scheme_management           0
permit                   3056
construction_year           0
extraction_type             0
extraction_type_group       0
extraction_type_class       0
management                  0
management_group            0
payment                     0
payment_type                0
water_quality               0
quality_group               0
quantity                    0
quantity_group              0
source                      0
source_type                 0
source_cla

`construction_year` provides the year that it is constructed and it is a good source for feature engineering. The longer a water point is operational, the more likely it is for the water point to be non functional or needs repair. <br> Convert `construction_year` and `date_recorded` into the number of years the waterpoint has been in operation for, and drop both features after as the `operational years` is likely to be a more useful predictor

In [28]:
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
df['operational_years'] = df.date_recorded.dt.year - df.construction_year

df.drop('date_recorded', axis=1, inplace=True)
df.drop('construction_year', axis=1, inplace=True)
drop.append('date_recorded')
drop.append('construction_year')

## Take the same pre-processing steps for test data

In [29]:
test_df.funder.value_counts()

Government Of Tanzania    2215
Danida                     793
Hesawa                     580
World Bank                 352
Kkkt                       336
                          ... 
Mount Meru Flowers           1
H/w                          1
Private Co                   1
Agness                       1
Jaic                         1
Name: funder, Length: 980, dtype: int64

In [30]:
def update_funder_test(row):
    if row['funder']=='Government Of Tanzania':
        return 'gov'
    elif row['funder']=='Danida':
        return 'danida'
    elif row['funder']=='Hesawa':
        return 'hesawa'
    elif row['funder']=='World Bank':
        return 'bank'
    elif row['funder']=='Kkkt':
        return 'kkkt'
    else:
        return 'other'

test_df['funder'] = test_df.apply(lambda row: update_funder(row), axis=1)

In [31]:
test_df.installer.value_counts()

DWE                               4349
Government                         457
RWE                                292
Commu                              287
DANIDA                             255
                                  ... 
Tober and friends from Austral       1
Regional water                       1
DE &                                 1
AIC                                  1
Atlas Company                        1
Name: installer, Length: 1091, dtype: int64

In [32]:
def update_installer_test(row):
    if row['installer']=='DWE':
        return 'dwe'
    elif row['installer']=='Government':
        return 'gov'
    elif row['installer']=='RWE':
        return 'rwe'
    elif row['installer']=='Commu':
        return 'commu'
    elif row['installer']=='DANIDA':
        return 'danida'
    else:
        return 'other'  

test_df['installer'] = test_df.apply(lambda row: update_installer(row), axis=1)

In [33]:
test_df['public_meeting'] = test_df.apply(lambda row: convert_public_meeting(row), axis=1)
test_df['public_meeting'].fillna(test_df['public_meeting'].mode().item(),inplace=True)

In [34]:
test_df.scheme_management.value_counts()

VWC                 9124
WUG                 1290
Water authority      822
Water Board          714
WUA                  668
Parastatal           444
Company              280
Private operator     263
Other                230
SWC                   26
Trust                 20
Name: scheme_management, dtype: int64

In [35]:
def update_scheme_management_test(row):
    if row['scheme_management']=='VWC':
        return 'vwc'
    elif row['scheme_management']=='WUG':
        return 'wug'
    elif row['scheme_management']=='Water authority':
        return 'water_auth'
    elif row['scheme_management']=='Water Board':
        return 'water_bd'
    elif row['scheme_management']=='WUA':
        return 'wua'
    else:
        return 'other'  

test_df['scheme_management'] = test_df.apply(lambda row: update_scheme_management(row), axis=1)

Convert `permit` to binary predictor and impute with mode.

In [36]:
test_df['permit'] = test_df.apply(lambda row: convert_permit(row), axis=1)
test_df['permit'].fillna(test_df['permit'].mode().item(),inplace=True)

Having removed all null, ensure that there is no other invalid data $-$ 0 $-$ that can be immediately obvious for relevant columns such as `population`, `gps_height`, `amount_tsh` and `construction_year`.

In [37]:
test_df['gps_height'].replace(0, np.nan, inplace=True)
test_df['population'].replace(0, np.nan, inplace=True)
test_df['amount_tsh'].replace(0, np.nan, inplace=True)
test_df['construction_year'].replace(0, np.nan, inplace=True)
test_df.isnull().sum()

id                           0
amount_tsh               10410
date_recorded                0
funder                       0
gps_height                5211
installer                    0
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                  99
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                5453
public_meeting             821
recorded_by                  0
scheme_management            0
scheme_name               7092
permit                     737
construction_year         5260
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

`gps_height` is likely to be location-dependent as it is the height of an area is likely to be similar.<br>
`amount_tsh` is also likely to be location-dependent as the amount of water that can be drawn in the same area is likely to be similar.<br>
`population` is also likely to be location-dependent as communities in the same area will be subject to similar living conditions.<br>
Holding the above assumptions, we will assume that the three predictors are affected by `region` and `district_code`, which based on the name, indicate a general area.<br>
Hence, `region` and `district_code` will also be used when imputing missing data with mean values for `gps_height`, `amount_tsh` and `population`.<br>

#### There is no clear indication whether the other predictors imputed earlier are relevant to geographic area, and hence were not imputed with `region`/`district_code`.

In [38]:
test_df['amount_tsh'].fillna(test_df.groupby(['region', 'district_code'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df.groupby(['region'])['amount_tsh'].transform('mean'), inplace=True)
test_df['amount_tsh'].fillna(test_df['amount_tsh'].mean(), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region', 'district_code'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df.groupby(['region'])['gps_height'].transform('mean'), inplace=True)
test_df['gps_height'].fillna(test_df['gps_height'].mean(), inplace=True)
test_df['population'].fillna(test_df.groupby(['region', 'district_code'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df.groupby(['region'])['population'].transform('mean'), inplace=True)
test_df['population'].fillna(test_df['population'].mean(), inplace=True)

`construction_year` (as a numeric predictor) is also imputed with mean value. We assume that there is no relation between region and construction year as it is unlikely construction within each geographic area occurs in the same period due to resource constraint.

In [39]:
test_df['construction_year'].fillna(test_df['construction_year'].mean(),inplace=True)

`construction_year` provides the year that it is constructed and it is a good source for feature engineering. The longer a water point is operational, the more likely it is for the water point to be non functional or needs repair. <br> Convert `construction_year` and `date_recorded` into the number of years the waterpoint has been in operation for, and drop both features after as the `operational years` is likely to be a more useful predictor

In [40]:
test_df['date_recorded'] = pd.to_datetime(test_df['date_recorded'])
test_df['operational_years'] = test_df.date_recorded.dt.year - test_df.construction_year

Drop columns

In [41]:
for i in drop:
    test_df.drop(i, axis=1, inplace=True)

Export preprocessed data

In [42]:
pd.DataFrame(df).to_csv("clean.csv", index=False)
pd.DataFrame(test_df).to_csv("clean_test.csv", index=False)

## Model building (First Submission) - NN

### Acknowledgements
Changing normal datatypes to tensors: 
https://towardsdatascience.com/deep-learning-on-dataframes-with-pytorch-66b21be54ef6
https://stackoverflow.com/questions/44617871/how-to-convert-a-list-of-strings-into-a-tensor-in-pytorch

pytorch nn model
https://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/

In [43]:
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

import torch.optim as optim
from numpy import vstack
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder

from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import CrossEntropyLoss

Label encode predictors

In [44]:
#SG encoder is declared so that it can be used to inverse transform the predicted result
sg_encoder = LabelEncoder()
df = pd.read_csv('clean.csv')

# Label encode all predictors for the training data
for col in df.columns:
    if df.dtypes[col] == "object" and col != 'status_group':
        df[col] = LabelEncoder().fit_transform(df[col])
    if col == 'status_group':
        df['status_group'] = sg_encoder.fit_transform(df['status_group'])

cols_at_end = ['status_group']
df = df[[c for c in df if c not in cols_at_end] 
        + [c for c in cols_at_end if c in df]]

#Store it into a temporary csv
pd.DataFrame(df).to_csv("clean_nn1.csv")
print(df.shape)

print(df.status_group.unique())

(59400, 36)
[0 2 1]


In [45]:
df

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,basin,region,...,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,operational_years,status_group
0,69572,6000.000000,4,1390.000000,4,34.938093,-9.856322,37399,1,3,...,2,1,1,8,6,0,1,1,12.000000,0
1,8776,374.652174,4,1399.000000,4,34.698766,-2.147466,37195,4,9,...,2,2,2,5,3,1,1,1,3.000000,0
2,34310,25.000000,4,686.000000,4,37.460664,-3.821329,14572,5,8,...,2,1,1,0,1,1,2,1,4.000000,0
3,67743,100.454545,4,263.000000,4,38.486161,-11.155298,37285,7,12,...,2,0,0,3,0,0,2,1,27.000000,2
4,19728,1392.735151,4,1057.545585,4,31.130847,-1.825359,35529,4,4,...,2,3,3,5,3,1,1,1,14.185314,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.000000,4,1210.000000,4,37.169807,-3.253847,513,5,6,...,2,1,1,8,6,0,1,1,14.000000,0
59396,27263,4700.000000,4,1212.000000,4,35.249991,-9.070629,24074,6,3,...,2,1,1,6,4,1,1,1,15.000000,0
59397,37057,1392.735151,4,1057.545585,4,34.017087,-8.750434,27926,6,10,...,1,1,1,3,0,0,4,3,14.185314,0
59398,31282,1392.735151,4,1057.545585,4,35.861315,-6.378573,29693,6,2,...,2,2,2,7,5,0,4,3,14.185314,0


Repeat for test data

In [46]:
df = pd.read_csv('clean_test.csv')

# Label encode all predictors
for col in df.columns:
    if df.dtypes[col] == "object" and col != 'status_group':
        df[col] = LabelEncoder().fit_transform(df[col])

#Store it into a temporary csv
print(df.shape)
pd.DataFrame(df).to_csv("clean_test_nn1.csv")

(14850, 35)


In [47]:
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = pd.read_csv(path)
        
        #Drop the unused columns. Unnamed: 0 is generated after saving the dataset.
        df = df.drop('Unnamed: 0', axis=1)
        df = df.drop('id', axis=1)
        
        #Assign x to all input values
        self.X = df.values[:, :-1]
        
        #Assign y to all target values
        self.y = df.values[:, -1]
        
        # ensure input data is floats
        self.X = self.X.astype('float32')        
 
    # number of rows in the dataset
    def __len__(self):
        return len(self.X)
 
    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]
 
    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])
 
# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        
        #Determine the input and output of each layer. Could also be passed as params for optimization
        layers = [300,200,100]
        total_layers = []
        input_size = n_inputs
        
        for i in layers:
            total_layers.append(nn.Linear(input_size, i))
            total_layers.append(nn.ReLU(inplace=True))
            total_layers.append(nn.BatchNorm1d(i))
            total_layers.append(nn.Dropout(0.2))
            input_size = i
        
        total_layers.append(nn.Linear(layers[-1], 3))

        self.layers = nn.Sequential(*total_layers)


 
    # forward propagate input
    def forward(self, X):
        X = self.layers(X)
        return X

In [48]:
class CSVTestDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path)

        # Drop unused columns
        df = df.drop('id', axis=1)
        df = df.drop('Unnamed: 0', axis=1)
        if 'Unnamed: 0.1' in df.columns:
            df = df.drop('Unnamed: 0.1', axis=1)
        print('CSVTestDataset =', df.shape)
        print(df.columns)
        
        #Assign x all input values
        self.X = df.values[:, :]
        # ensure input data is floats
        self.X = self.X.astype('float32')
 
    # number of rows in the dataset
    def __len__(self):
        return len(self.X)
 
    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx]]
 
    # returns all inputs from test dataset
    def get_test(self):
        return self.X;

In [49]:
#   prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl


In [50]:
path = 'clean_nn1.csv'
train_dl, test_dl = prepare_data(path)

In [51]:
## helper train/fit function

def train(model, parameterization, train_dl):
    
    #Gradient descent optimizer
    optimizer = optim.SGD(model.parameters(), lr=parameterization["lr"], momentum=parameterization["momentum"])
    
    #Use cross entropy loss function
    criterion = CrossEntropyLoss()
    
    for epoch in range(2):  # loop over the dataset multiple times

        running_loss = 0.0
        for i, (inputs, targets) in enumerate(train_dl, 0):

            # clear the gradients
            optimizer.zero_grad()
            
            # compute the model output
            yhat = model(inputs)
            
            # calculate loss
            loss = criterion(yhat, targets.long())

            # credit assignment
            loss.backward()
            
            # update model weights
            optimizer.step()
            
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
        
    return model

            
## helper function to evaluate the accuracy for the tested model
def evaluate(model, test_dl):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        _, predicted = torch.max(yhat, axis=1)

        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()

        # round to class values
        yhat = yhat.round()

        # store predictions
        predictions.append(yhat)
        
        # transform 1d data and store eg. [0,1,2] => [[1,0,0] [0,1,0] [0,0,1]]
        actual = actual.astype(int)
        act = np.zeros((actual.size, actual.max()+1))
        act[np.arange(actual.size),actual] = 1
        actuals.append(act)

    predictions, actuals = vstack(predictions), vstack(actuals)

    # transform result to a numpy array of results eg. [0,1,2]
    actuals = np.argmax(actuals, axis=1)
    predictions = np.argmax(predictions, axis=1)
    
    # Determine accuracy
    acc = np.sum(predictions == actuals) / actuals.shape[0]
    print('Accuracy of model:' , np.sum(predictions == actuals)/ actuals.shape[0])

    return acc

# make a class prediction for one row of data
def _predict(data, model):
    predictions = list()
    for i, inputs in enumerate(data):
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        predictions.append(yhat)

    # Get prediction numpy arr with 3 columns
    prediction_list = vstack(predictions)
    
    # Get prediction results eg [0,1,2]
    results = np.argmax(prediction_list, axis=1)
    return results

In [52]:
# The code below is to test if my model is working properly.
model = MLP(34)
print(model.parameters())
# Small step size for sgd
trained_model = train(model, {"lr": 1e-6, "momentum": 0.5}, train_dl)
acc = evaluate(trained_model,test_dl)


<generator object Module.parameters at 0x000001E9124A2F20>
Accuracy of model: 0.5411182532394654


In [53]:
# Get test dataset from the csv file.
test_dataset = CSVTestDataset('clean_test_nn1.csv')

# Get self.x of CSVTestDataset
test_df = test_dataset.get_test()

# Pass test_df into dataloader
test_df = DataLoader(test_df, batch_size=10, shuffle=False)

# Predict results
test_results = _predict(test_df, trained_model)

CSVTestDataset = (14850, 34)
Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'wpt_name', 'basin', 'region', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'permit', 'extraction_type',
       'extraction_type_group', 'extraction_type_class', 'management',
       'management_group', 'payment', 'payment_type', 'water_quality',
       'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
       'source_class', 'waterpoint_type', 'waterpoint_type_group',
       'operational_years'],
      dtype='object')


In [54]:
#Check data prediction
print(test_results)
np.unique(test_results)
results_numpy = sg_encoder.inverse_transform(test_results)
print(results_numpy)

[0 0 0 ... 0 0 0]
['functional' 'functional' 'functional' ... 'functional' 'functional'
 'functional']


In [55]:
# Get dataframe
test_data = pd.read_csv('clean_test_nn1.csv')

# Create dataframe with predictions and id
submission_df = pd.DataFrame(results_numpy, columns=['status_group'])
submission_df['id'] = test_data.id
submission_df = submission_df[['id','status_group']]

# Create new csv
pd.DataFrame(submission_df).to_csv("nnmodel_results.csv", index=False)