This script accomplishes a few key tasks, in this order:
 - fill in missing values
 - convert all categorical features to numeric (using dummy variables)
 - scale all the features

In [33]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.base import TransformerMixin

# Preprocessing the features 

We define the following class, which imputes missing numerical features with the column mean and missing categorical features with the most frequently occurring value.

In [4]:
class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in the column.

    Columns of other types are imputed with the mean of the column.
    """

    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
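
To make the imputation rule concrete, here is a minimal sketch on a made-up two-column frame. The column values are hypothetical and only borrow names from the dataset. The sketch decides per column with `pd.api.types.is_numeric_dtype` instead of the `dtype == np.dtype('O')` test used in the class, which is an assumption made so the example does not depend on how a given pandas version stores strings; the fill rule itself (most frequent value for non-numeric columns, mean otherwise) is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column, each with a gap.
toy = pd.DataFrame({
    "permit": ["True", "False", "True", None],
    "tsh": [10.0, np.nan, 30.0, 20.0],
})

# One fill value per column: the mode for non-numeric columns, the mean otherwise.
fill = pd.Series(
    [
        toy[c].mean() if pd.api.types.is_numeric_dtype(toy[c])
        else toy[c].value_counts().index[0]
        for c in toy
    ],
    index=toy.columns,
)

# fillna with a Series matches fill values to columns by label.
imputed = toy.fillna(fill)
print(imputed)
```

The missing `permit` entry becomes the most frequent value and the missing `tsh` entry becomes the mean of the three observed values.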

In [14]:
wells_features_train = pd.read_csv('processed/wells_features_train.csv')

In [15]:
wells_features_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 30 columns):
gps_height               59400 non-null object
longitude                59400 non-null float64
latitude                 59400 non-null float64
basin                    59400 non-null object
region                   59400 non-null object
lga                      59400 non-null object
public_meeting           56066 non-null object
scheme_management        55523 non-null object
permit                   56344 non-null object
construction_year        59400 non-null object
extraction_type          59400 non-null object
extraction_type_group    59400 non-null object
management_group         59400 non-null object
payment_type             59400 non-null object
quality_group            59400 non-null object
quantity_group           59400 non-null object
source_type              59400 non-null object
source_class             59400 non-null object
waterpoint_type          59400 non-null obj

The initial shortlist of features did not deal with missing values. We check how many there are and then apply the correction.

In [16]:
wells_features_train.isnull().sum()

gps_height                  0
longitude                   0
latitude                    0
basin                       0
region                      0
lga                         0
public_meeting           3334
scheme_management        3877
permit                   3056
construction_year           0
extraction_type             0
extraction_type_group       0
management_group            0
payment_type                0
quality_group               0
quantity_group              0
source_type                 0
source_class                0
waterpoint_type             0
tsh                         0
tsh_zero                    0
funded_by                   0
data_collec_at              0
installer_cat               0
wpt_name_cat                0
num_private_cat             0
subvillage_cat              0
ward_cat                    0
pop                         0
pop_zero                    0
dtype: int64

In [17]:
wells_features_train = DataFrameImputer().fit_transform(wells_features_train)

We now check that no missing values remain.

In [18]:
wells_features_train.isnull().sum()

gps_height               0
longitude                0
latitude                 0
basin                    0
region                   0
lga                      0
public_meeting           0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
management_group         0
payment_type             0
quality_group            0
quantity_group           0
source_type              0
source_class             0
waterpoint_type          0
tsh                      0
tsh_zero                 0
funded_by                0
data_collec_at           0
installer_cat            0
wpt_name_cat             0
num_private_cat          0
subvillage_cat           0
ward_cat                 0
pop                      0
pop_zero                 0
dtype: int64

Now we convert all categorical features to dummy variables.

In [19]:
wells_features_train = pd.get_dummies(wells_features_train)

Now check that the dummies have been created: the column count should have grown beyond the original 30.

In [22]:
wells_features_train.shape

(59400, 279)
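
The jump from 30 to 279 columns follows from how `pd.get_dummies` works: each categorical column is replaced by one indicator column per distinct value, while numeric columns pass through unchanged. A small sketch on made-up data (the values below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one categorical column with two distinct values
# and one numeric column.
toy = pd.DataFrame({
    "basin": ["Lake Victoria", "Pangani", "Pangani"],
    "longitude": [33.1, 37.4, 37.2],
})

encoded = pd.get_dummies(toy)

# "basin" is replaced by one indicator column per distinct value;
# "longitude" is kept as it is.
print(encoded.columns.tolist())
print(encoded.shape)
```

Two columns become three here, so a column with many distinct values can add a large number of features.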

The next step is to center and scale the data.

In [23]:
wells_features_train = StandardScaler().fit_transform(wells_features_train)
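
As an illustration of what centering and scaling do, the sketch below reproduces in plain NumPy the per-column computation that `StandardScaler` performs with its default settings: subtract the column mean and divide by the population standard deviation (`ddof=0`), leaving constant columns unscaled. The array `X` is made up for the example. Note also that `fit_transform` returns a NumPy array, so `wells_features_train` no longer carries column names after this cell.

```python
import numpy as np

# Hypothetical toy matrix: the third column is constant.
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 60.0, 5.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0      # avoid dividing by zero for constant columns

scaled = (X - mean) / std

print(scaled.mean(axis=0))  # approximately zero in every column
print(scaled.std(axis=0))   # one for non-constant columns, zero for the constant one
```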

Now we need to store this matrix so that it can be loaded again during model exploration.

In [34]:
np.save('processed/wells_feature_matrix', wells_features_train)
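
Because `np.save` appends `.npy` when the file name has no such extension, the matrix has to be read back from the path ending in `.npy`. A minimal round-trip sketch, using a temporary directory and a small made-up array in place of the real feature matrix:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the scaled feature matrix.
matrix = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "wells_feature_matrix")
    np.save(stem, matrix)             # written to "<stem>.npy"

    saved_path = stem + ".npy"
    exists = os.path.exists(saved_path)
    loaded = np.load(saved_path)      # fully read into memory here

print(exists, loaded.shape)
```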

# Preprocessing the labels

In [25]:
wells_labels_train = pd.read_csv('processed/wells_labels_train.csv')

In [26]:
wells_labels_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 1 columns):
status    59400 non-null int64
dtypes: int64(1)
memory usage: 464.1 KB


The labels are already numeric and have no missing values, so there is nothing to do in this case.