## Initial Pre-processing

This notebook walks through my pre-processing steps for the water main break data. I download and import the most recently available data from the [regional database](https://open-kitchenergis.opendata.arcgis.com/datasets/KitchenerGIS::water-main-breaks/about), then drop the unnecessary columns so we can focus on the features that will be essential for the model later on.

Let's dive in!

In [100]:
# Basic imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
%matplotlib inline

In [101]:
# Import our most recent data and check out a sample
break_data = pd.read_csv("../data/raw/water_data.csv")
break_data.sample(10)


Columns (7,18,20,21,40,69) have mixed types. Specify dtype option on import or set low_memory=False.
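
The mixed-types warning above appears when pandas infers column types chunk by chunk and the chunks disagree. Here is a minimal sketch of the two remedies the message suggests, using a tiny in-memory CSV as a stand-in for the real file. The column names `ID` and `ZONE` are made up for illustration; `ZONE` mixes numbers and text the way `REL_CLEANING_SUBAREA` does with values like `6B`.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical columns, invented values)
csv_text = "ID,ZONE\n1,2\n2,33\n3,6B\n"

# Option 1: read the whole file before inferring dtypes
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the dtype of a known mixed column up front
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"ZONE": "string"})

print(df_typed.dtypes)
```

Declaring the dtype is the more explicit of the two, since every value in the column then has the same type regardless of file size.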



Unnamed: 0,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,CW_SERVICE_REQUEST,STATUS,...,CRITICALITY,REL_CLEANING_AREA,REL_CLEANING_SUBAREA,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
9233,10352,768,1997-01-27 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,8.0,5.0,2,N,N,4.05,N,N,2e58b906-6cae-43f4-9336-28d19429d7f8,19.936346
5841,9480,1043,1998-08-11 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,7.0,5.0,33,N,N,8.5,N,N,0312c44c-fbf9-4df9-9de4-bb1f65223fcb,46.189461
3205,8819,1725,2012-01-15 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,5.0,1.0,31,N,N,7.75,N,N,fb74fd7b-c60d-47e9-aad9-61f22d5d9f1d,1.527551
8593,10184,2221,2016-08-07 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,1.0,30,N,N,9.1,N,N,6bcac044-db58-492a-b1ea-d6fcd6d10772,113.920065
3294,8844,709,2000-04-14 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,4.0,29,N,N,3.93,N,N,8cb2d0b3-4376-4189-b07e-8b38b14380f9,7.337425
10117,39681,142154,2021-05-28 03:20:10,MAIN,Open,Closed,4-8 hours,0-50,,REPAIR COMPLETED,...,0.0,1.0,24,N,N,9.35,N,N,df5482a9-3846-461e-bd3a-0cdd233333ca,29.079649
8185,10083,1438,2010-01-08 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,0.0,1.0,6,N,N,7.75,N,N,9a5dfbd7-455c-4f34-b2fb-18a8d625cfb4,7.811732
244,8058,1866,2013-01-30 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,1.0,8,N,N,8.5,N,N,c03432c2-4b95-4cf9-9f11-606628358ffb,4.568067
6960,9766,1841,2013-03-04 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,4.0,6B,N,N,7.75,N,N,6954a697-b9e4-434e-b693-44784b998176,7.620663
3436,8876,1926,2014-02-13 00:00:00,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,-1.0,5.0,9,N,N,-1.0,N,N,b5641d4f-9c3f-462a-8c77-8b8e900fbb98,1.507739


In [102]:
break_data.shape

(10721, 80)

In [103]:
break_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10721 entries, 0 to 10720
Data columns (total 80 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   OBJECTID                     10721 non-null  int64  
 1   WATBREAKINCIDENTID           10721 non-null  int64  
 2   INCIDENT_DATE                10721 non-null  object 
 3   BREAK_TYPE                   10721 non-null  object 
 4   ROAD_CLOSED                  10721 non-null  object 
 5   SIDEWALK_CLOSED              10721 non-null  object 
 6   HOUR_IMPACTED                10721 non-null  object 
 7   UNITS_IMPACTED               1126 non-null   object 
 8   CW_SERVICE_REQUEST           75 non-null     float64
 9   STATUS                       10721 non-null  object 
 10  STATUS_DATE                  10097 non-null  float64
 11  WORKORDER                    6092 non-null   float64
 12  RETURN_TO_NORMAL             382 non-null    float64
 13  BREAK_NATURE    

In [104]:
break_data.columns

Index(['OBJECTID', 'WATBREAKINCIDENTID', 'INCIDENT_DATE', 'BREAK_TYPE',
       'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'HOUR_IMPACTED', 'UNITS_IMPACTED',
       'CW_SERVICE_REQUEST', 'STATUS', 'STATUS_DATE', 'WORKORDER',
       'RETURN_TO_NORMAL', 'BREAK_NATURE', 'BREAK_APPARENT_CAUSE',
       'REPAIR_TYPE', 'NEW_SECTION_LENGTH', 'MAINTENANCE_DESC',
       'VALVES_CLOSED', 'VALVES_OPENED', 'HYDRANTS_CALLED_OUT',
       'HYDRANTS_CALLED_BACK_IN', 'POSITIVE_PRESSURE_MAINTANED',
       'AIR_GAP_MAINTANED', 'DISINFECTED', 'MECHANICAL_REMOVAL',
       'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING', 'ANODE_INSTALLED',
       'BREAK_CATEGORIZATION', 'BACTERIA_TESTING_DATE',
       'HEALTH_DEPT_NOTIFICATION', 'MOECC_SAC_NOTIFICATION',
       'SAC_REFERENCE_NO', 'LOCAL_MOE_OFFICE', 'BWA_DWA', 'BWA_DWA_DECLARED',
       'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'ROADSEGMENTID',
       'CIVIC_NUMBER', 'STREET', 'ASSETID', 'ASSET_DEPTH', 'FROST_DEPTH',
       'ASSET_SIZE', 'ASSET_YEAR_INSTALLED',

In [106]:
def print_null_values(df, column=None):
    """Print the number of null values per column, or for one named column."""
    if column is None:
        # print null values for all columns
        for col in df.columns:
            null_values = df[col].isnull().sum()
            print(f"{col} - ", null_values)
    else:
        # print null values for a single column
        null_values = df[column].isnull().sum()
        print(f"{column} - ", null_values)

In [107]:
# break_data.isna().sum()
print_null_values(break_data)

OBJECTID -  0
WATBREAKINCIDENTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
ROAD_CLOSED -  0
SIDEWALK_CLOSED -  0
HOUR_IMPACTED -  0
UNITS_IMPACTED -  9595
CW_SERVICE_REQUEST -  10646
STATUS -  0
STATUS_DATE -  624
WORKORDER -  4629
RETURN_TO_NORMAL -  10339
BREAK_NATURE -  507
BREAK_APPARENT_CAUSE -  691
REPAIR_TYPE -  10138
NEW_SECTION_LENGTH -  10639
MAINTENANCE_DESC -  10721
VALVES_CLOSED -  10717
VALVES_OPENED -  10721
HYDRANTS_CALLED_OUT -  10695
HYDRANTS_CALLED_BACK_IN -  10699
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
BACTERIA_TESTING_DATE -  10684
HEALTH_DEPT_NOTIFICATION -  10721
MOECC_SAC_NOTIFICATION -  10712
SAC_REFERENCE_NO -  10721
LOCAL_MOE_OFFICE -  10712
BWA_DWA -  0
BWA_DWA_DECLARED -  10713
PROCEEDURES_FOLLOWED -  0
RECORD_CHANGE_REQD -  0
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_DEP

In [91]:
# print the columns whose null count exceeds the threshold (default: 70 percent of rows)
def print_null_thresh_values(df, column=None, threshold=0.7):
    if column is None:
        # print null values for all columns
        for col in df.columns:
            null_values = df[col].isnull().sum()
            if null_values > threshold * len(df):
                print(f"{col} - ", null_values)
    else:
        # print null values for a single column
        null_values = df[column].isnull().sum()
        print(f"{column} - ", null_values)

print_null_thresh_values(break_data)

UNITS_IMPACTED -  9595
CW_SERVICE_REQUEST -  10646
RETURN_TO_NORMAL -  10339
REPAIR_TYPE -  10138
NEW_SECTION_LENGTH -  10639
MAINTENANCE_DESC -  10721
VALVES_CLOSED -  10717
VALVES_OPENED -  10721
HYDRANTS_CALLED_OUT -  10695
HYDRANTS_CALLED_BACK_IN -  10699
BACTERIA_TESTING_DATE -  10684
HEALTH_DEPT_NOTIFICATION -  10721
MOECC_SAC_NOTIFICATION -  10712
SAC_REFERENCE_NO -  10721
LOCAL_MOE_OFFICE -  10712
BWA_DWA_DECLARED -  10713
ASSET_DEPTH -  10665
FROST_DEPTH -  10440
LINED_DATE -  10586
CONSULTANT -  7557
BRIDGE_DETAILS -  10717


In [108]:
def drop_null_columns(df, threshold=0.7):
    for col in df.columns:
        null_values = df[col].isnull().sum()
        if null_values > threshold * len(df):
            df.drop(col, axis=1, inplace=True)

drop_null_columns(break_data)
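
As an aside, the same filter can be written without a loop. The sketch below is an alternative, not the function used in this notebook: `drop_sparse_columns` is a hypothetical name, the toy frame stands in for `break_data`, and it returns a new frame instead of modifying the input in place.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.7):
    """Return a copy of df without columns whose share of nulls exceeds threshold."""
    null_share = df.isna().mean()  # fraction of nulls per column
    keep = null_share[null_share <= threshold].index
    return df[keep].copy()

# Toy frame standing in for break_data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% null, so it is dropped
    "c": [np.nan, 1.0, 2.0, 3.0],        # 25% null, so it is kept
})
trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```

Returning a copy keeps the raw frame available for comparison, at the cost of some extra memory.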

In [109]:
break_data.columns

Index(['OBJECTID', 'WATBREAKINCIDENTID', 'INCIDENT_DATE', 'BREAK_TYPE',
       'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'HOUR_IMPACTED', 'STATUS',
       'STATUS_DATE', 'WORKORDER', 'BREAK_NATURE', 'BREAK_APPARENT_CAUSE',
       'POSITIVE_PRESSURE_MAINTANED', 'AIR_GAP_MAINTANED', 'DISINFECTED',
       'MECHANICAL_REMOVAL', 'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING',
       'ANODE_INSTALLED', 'BREAK_CATEGORIZATION', 'BWA_DWA',
       'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'ROADSEGMENTID',
       'CIVIC_NUMBER', 'STREET', 'ASSETID', 'ASSET_SIZE',
       'ASSET_YEAR_INSTALLED', 'ASSET_MATERIAL', 'ASSET_EXISTS', 'GLOBALID',
       'longitude', 'latitude', 'OBJECTID.1', 'WATMAINID', 'STATUS.1',
       'PRESSURE_ZONE', 'ROADSEGMENTID.1', 'MAP_LABEL', 'CATEGORY',
       'PIPE_SIZE', 'MATERIAL', 'LINED', 'LINED_MATERIAL', 'INSTALLATION_DATE',
       'ACQUISITION', 'OWNERSHIP', 'BRIDGE_MAIN', 'CRITICALITY',
       'REL_CLEANING_AREA', 'REL_CLEANING_SUBAREA', 'UNDERSIZED',
       'SHALLOW

In [110]:
break_data.drop(['WATBREAKINCIDENTID', 'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'STATUS_DATE', 'WORKORDER',
                 'BWA_DWA', 'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'OBJECTID.1', 'WATMAINID', 'STATUS.1',
                 'ROADSEGMENTID.1', 'ACQUISITION', 'OWNERSHIP', 'REL_CLEANING_AREA', 'REL_CLEANING_SUBAREA',], axis=1, inplace=True)

In [111]:
break_data.shape

(10721, 43)

In [112]:
break_data.head()

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
0,1,2017-12-01 15:15:00,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,1923-01-01 00:00:00,N,6.0,N,N,3.85,N,N,7032c165-215f-4cf6-ae35-65c9320aab4e,138.194599
1,1,2017-12-01 15:15:00,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,2017-01-11 00:00:00,N,-1.0,N,N,9.85,N,N,791c47f5-9a9a-48b1-94bb-cad82bec682c,4.721263
2,1,2017-12-01 15:15:00,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,2017-01-11 00:00:00,N,-1.0,N,N,9.85,N,N,fbc17d4d-3636-4b58-823b-54b3cbc77470,1.399993
3,1,2017-12-01 15:15:00,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,2017-01-11 00:00:00,N,-1.0,N,N,5.93,N,N,e67cbabe-4883-4c85-854b-f0367fff4f1a,1.579268
4,7874,2001-03-26 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1965-01-01 00:00:00,N,5.0,N,N,7.85,N,Y,c99abdb4-c3cc-409d-be83-5b42e4029bc9,12.798598


In [113]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  507
BREAK_APPARENT_CAUSE -  691
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


Let's investigate the values in the `BREAK_APPARENT_CAUSE` and `BREAK_NATURE` variables. There aren't many missing values, so I should be able to impute them easily.
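
A compact way to see the categories and the missing count in one table is `value_counts(dropna=False)`. This is a small sketch on an invented series, not the real column:

```python
import numpy as np
import pandas as pd

# Invented values standing in for break_data.BREAK_APPARENT_CAUSE
causes = pd.Series(
    ["AGE", "OTHER", np.nan, "OTHER", np.nan, "SOILS"],
    name="BREAK_APPARENT_CAUSE",
)

# dropna=False keeps NaN as its own row in the counts
counts = causes.value_counts(dropna=False)
missing = int(causes.isna().sum())

print(counts)
print("missing:", missing)
```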

In [114]:
break_data.BREAK_APPARENT_CAUSE.unique()

array(['AGE', 'OTHER', 'COMBINATION', 'CORROSION', 'SOILS', 'UNKNOWN',
       'PRESSURE', 'FAULTY INSTALL', nan], dtype=object)

In [115]:
break_data.BREAK_APPARENT_CAUSE.value_counts()

OTHER             7962
AGE                825
COMBINATION        503
CORROSION          405
UNKNOWN            132
PRESSURE            92
SOILS               72
FAULTY INSTALL      39
Name: BREAK_APPARENT_CAUSE, dtype: int64

I'll fill the `NaN` values with `'UNKNOWN'`.

In [116]:
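# assign the result back instead of using inplace=True on a column selection
break_data['BREAK_APPARENT_CAUSE'] = break_data['BREAK_APPARENT_CAUSE'].fillna('UNKNOWN')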

In [117]:
break_data.BREAK_APPARENT_CAUSE.value_counts()


OTHER             7962
AGE                825
UNKNOWN            823
COMBINATION        503
CORROSION          405
PRESSURE            92
SOILS               72
FAULTY INSTALL      39
Name: BREAK_APPARENT_CAUSE, dtype: int64

In [118]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      7936
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
OTHER: WATER SERVICE                           36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [119]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER: WATER SERVICE': 'WATER SERVICE'})

In [120]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      7936
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [121]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER', nan,
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [122]:
# assign the result back instead of using inplace=True on a column selection
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].fillna('UNKNOWN')

In [123]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      8443
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [124]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [125]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


In [126]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    10114
CATEGORY 2       65
Name: BREAK_CATEGORIZATION, dtype: int64

In [127]:
break_data.BREAK_CATEGORIZATION.unique()

array(['CATEGORY 2', 'CATEGORY 1', nan], dtype=object)

In [128]:
def fill_null_values(df, column, value):
    # fill null values in the specified column with the specified value
    df[column] = df[column].fillna(value)

In [129]:
# break_data.BREAK_CATEGORIZATION.fillna('UNKNOWN', inplace=True)
fill_null_values(break_data, 'BREAK_CATEGORIZATION', 'UNKNOWN')

In [130]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    10114
UNKNOWN         542
CATEGORY 2       65
Name: BREAK_CATEGORIZATION, dtype: int64

In [131]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER': 'UNKNOWN'})

In [132]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [133]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      8489
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [134]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


In [135]:
break_data['CIVIC_NUMBER'].sample(5)

10666    206.0
10423    607.0
2672        37
3884        14
9953      38.0
Name: CIVIC_NUMBER, dtype: object

In [136]:
break_data.drop(['CIVIC_NUMBER'], axis=1, inplace=True)

We can see that some `STREET` names are missing. It would be best to explore these rows and see whether we can impute them. They still contain `latitude` and `longitude` values, so we may be able to identify them that way. Let's see what we can do...

In [137]:
no_street = break_data.loc[break_data.STREET.isna()]
no_street

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
548,8140,2009-10-22 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,,,,,,,,,,
1066,8278,2001-03-14 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1974-09-01 00:00:00,N,6.0,N,N,4.9,N,N,f9f9bd80-ceff-4ce0-951c-03c02a4cc6e7,31.843715
1067,8278,2001-03-14 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1974-09-01 00:00:00,N,0.0,N,N,8.5,N,N,9e411899-71a2-441f-bd0d-ccec7a529549,1.63387
1068,8278,2001-03-14 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1974-09-01 00:00:00,N,6.0,N,N,8.5,N,N,0e30f363-a96b-49c5-a8a0-e6ac38693bc6,5.867424
1900,8478,2014-01-10 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1977-11-01 00:00:00,N,6.0,N,N,8.5,N,N,a8bc6f06-948d-49b9-9fa6-b3ff49af237f,6.594864
1901,8478,2014-01-10 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1977-11-01 00:00:00,N,7.0,N,N,6.9,N,N,ee2352cf-7fdf-4b1c-ab21-fcf4a3c636bf,117.352822
1902,8478,2014-01-10 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1977-11-01 00:00:00,N,7.0,N,N,8.5,N,N,e0706f51-cc0b-4d07-9535-93597255e037,7.917657
2660,8678,2011-12-15 15:38:27,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,,,,,,,,,,
5659,9428,2011-12-15 15:38:28,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,,,,,,,,,,
6294,9592,2011-12-15 15:38:28,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,,,,,,,,,,


Looking at the latitude and longitude of the rows with missing street names, there are actually only 5 distinct locations. We can plot them on a map and see where they are.
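
One way to count distinct locations is to drop duplicate coordinate pairs. The sketch below uses a toy frame in place of `no_street`; the coordinates are invented for illustration.

```python
import pandas as pd

# Toy stand-in for no_street (coordinates are made up)
toy_no_street = pd.DataFrame({
    "OBJECTID": [8278, 8278, 8278, 8478, 8478],
    "latitude": [43.45, 43.45, 43.45, 43.42, 43.42],
    "longitude": [-80.49, -80.49, -80.49, -80.47, -80.47],
})

# Each unique (latitude, longitude) pair is one location
distinct_points = toy_no_street[["latitude", "longitude"]].drop_duplicates()
print(len(distinct_points))
```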

In [138]:
import plotly
import plotly.express as px

In [139]:
from config import token  # local config module holding the Mapbox access token
px.set_mapbox_access_token(token)
fig = px.scatter_mapbox(data_frame=no_street, lat='latitude', lon='longitude')
fig.update_layout(mapbox_style="carto-positron", mapbox_accesstoken=token)
fig.show()

In [140]:
# drop the rows with no street
break_data.dropna(subset=['STREET'], inplace=True)

In [141]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  707
ASSET_YEAR_INSTALLED -  726
ASSET_MATERIAL -  707
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  62
MAP_LABEL -  62
CATEGORY -  62
PIPE_SIZE -  62
MATERIAL -  62
LINED -  62
LINED_MATERIAL -  62
INSTALLATION_DATE -  63
BRIDGE_MAIN -  62
CRITICALITY -  62
UNDERSIZED -  62
SHALLOW_MAIN -  62
CONDITION_SCORE -  62
OVERSIZED -  62
CLEANED -  62
GlobalID -  62
Shape__Length -  62


In [142]:
break_data.ASSET_SIZE.value_counts()

150.0     6429
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

So there are 707 missing values for `ASSET_SIZE`. We could simply fill the nulls with the mean, the most frequent value, or 0. An approach that I think may be more appropriate is to visualize where the known asset sizes occur on a map, grouped by street name, check whether the null values fall on the same streets, and then fill each missing size from the neighbours it is near.
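
To make the neighbour idea concrete, here is a rough sketch that copies the size of the nearest row with a known size. It is an illustration only and is not applied to `break_data` in this notebook: the toy coordinates are invented, and distance is measured in raw degrees, which is an approximation that ignores the difference between a degree of latitude and a degree of longitude.

```python
import numpy as np
import pandas as pd

# Toy rows with invented coordinates; the last two have unknown sizes
toy = pd.DataFrame({
    "latitude": [43.4500, 43.4510, 43.4200, 43.4502, 43.4201],
    "longitude": [-80.4900, -80.4910, -80.4700, -80.4902, -80.4701],
    "ASSET_SIZE": [150.0, 150.0, 300.0, np.nan, np.nan],
})

known = toy[toy["ASSET_SIZE"].notna()]
unknown = toy[toy["ASSET_SIZE"].isna()]

# Pairwise squared distances: one row per unknown point, one column per known point
coords_unknown = unknown[["latitude", "longitude"]].to_numpy()
coords_known = known[["latitude", "longitude"]].to_numpy()
diff = coords_unknown[:, None, :] - coords_known[None, :, :]
nearest = (diff ** 2).sum(axis=2).argmin(axis=1)

# Copy the size of the closest known row into each unknown row
toy.loc[unknown.index, "ASSET_SIZE"] = known["ASSET_SIZE"].to_numpy()[nearest]
print(toy)
```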

In [143]:
# pd.set_option("display.max_rows", None)
break_data.loc[break_data.ASSET_SIZE.isna()]

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
29,7881,2015-03-12 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2012-11-28 00:00:00,N,-1.0,N,N,9.35,N,N,6610f5df-daf4-4740-a895-6998427c4db1,17.641098
30,7881,2015-03-12 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,067a573a-2de1-4b0c-a9dd-e1c47a4c215a,95.003715
31,7881,2015-03-12 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,91af793e-7013-4ec2-81cd-5124d597c909,89.270670
32,7881,2015-03-12 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,e7350b7b-8bb5-4abb-86ef-a07c09f6978e,84.284014
76,8007,2003-04-28 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,2edaca94-c56e-44ea-b779-47ba2a9c9d43,155.740656
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9722,19841,2020-01-17 10:24:37,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,067a573a-2de1-4b0c-a9dd-e1c47a4c215a,95.003715
9723,19841,2020-01-17 10:24:37,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,91af793e-7013-4ec2-81cd-5124d597c909,89.270670
9724,19841,2020-01-17 10:24:37,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,e7350b7b-8bb5-4abb-86ef-a07c09f6978e,84.284014
9969,33925,2021-01-16 08:30:32,MAIN,4-8 hours,CANCELLED,UNKNOWN,UNKNOWN,Y,Y,Y,...,2021-06-01 00:00:00,N,-1.0,N,N,-1.00,N,N,e55dd405-e462-4dd7-b2fb-b2a2e59ea19e,15.511798


Next I'll break down the asset sizes alongside their asset IDs and street names, to see whether any rows with a null asset size share a street with rows whose size is known. If they do, one option is to impute the average asset size of the same street. For now this is only a hypothesis; I'm not sure it will work, but it's worth a try.
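
A street-level fill along these lines could be sketched with `groupby` and `transform`. The toy data is invented, and I use the median here as an assumption of mine, since a mean of 150 and 300 would give a pipe size that doesn't exist; swap in `"mean"` for a literal average.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the STREET / ASSET_SIZE columns (values invented)
toy = pd.DataFrame({
    "STREET": ["WEBER ST E", "WEBER ST E", "WEBER ST E", "REX DR", "REX DR"],
    "ASSET_SIZE": [150.0, np.nan, 150.0, np.nan, np.nan],
})

# Median known size per street, aligned row by row with the original frame
street_median = toy.groupby("STREET")["ASSET_SIZE"].transform("median")

# Only rows with a missing size take the street-level value
toy["ASSET_SIZE_FILLED"] = toy["ASSET_SIZE"].fillna(street_median)
print(toy)
```

Streets with no known size at all stay `NaN`, so a fallback such as the overall mode would still be needed for them.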

In [144]:
asset_size = break_data[['STREET', 'ASSETID', 'ASSET_SIZE']]
# cherry picking null asset sizes to see which streets we could look at
asset_size.loc[asset_size.ASSET_SIZE.isna()].sample(10)

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
6767,FLORENCE AVE,14690,
5286,OTTAWA ST S,64330,
3802,WEBER ST E,40880,
348,WEBER ST E,79238,
7238,REX DR,32260,
7239,REX DR,32260,
9708,THALER AVE,37620,
5600,PATTANDON AVE,63500,
1476,OTTAWA ST S,64330,
524,FLORENCE AVE,14700,


Next I'll see if any asset ID appears with both null and non-null asset sizes. If so, it should be safe to impute the null values with the size recorded for the same ID. If not, I'll have to explore other options.
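
Rather than inspecting streets one at a time, the same question can be asked of every ID at once. This is a sketch on toy data: the ID `99999` is invented to show what a recoverable case would look like, and it is not a claim about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ASSETID / ASSET_SIZE columns
toy = pd.DataFrame({
    "ASSETID": [41030, 41030, 79238, 79238, 99999, 99999],
    "ASSET_SIZE": [150.0, 150.0, np.nan, np.nan, 200.0, np.nan],
})

# For each ID: is any size missing, and are all sizes missing?
is_missing = toy["ASSET_SIZE"].isna()
per_id = is_missing.groupby(toy["ASSETID"]).agg(["any", "all"])

# An ID is recoverable when some, but not all, of its sizes are missing
mixed_ids = per_id[per_id["any"] & ~per_id["all"]].index.tolist()
print(mixed_ids)
```

An empty list on the real data would mean that no missing size can be filled from another row with the same ID.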

In [145]:
asset_size[asset_size['STREET'] == 'FLORENCE AVE']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
97,FLORENCE AVE,14690,
98,FLORENCE AVE,14690,
99,FLORENCE AVE,14690,
100,FLORENCE AVE,14690,
101,FLORENCE AVE,14690,
...,...,...,...
9648,FLORENCE AVE,14700,
9649,FLORENCE AVE,14700,
9650,FLORENCE AVE,14700,
9651,FLORENCE AVE,14700,


Unfortunately there are no matching IDs for this street. Let's check a few more streets to see whether that holds elsewhere...

In [146]:
asset_size[asset_size['STREET'] == 'OTTAWA ST S']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
149,OTTAWA ST S,95460,
150,OTTAWA ST S,95460,
151,OTTAWA ST S,95460,
152,OTTAWA ST S,95460,
153,OTTAWA ST S,95460,
...,...,...,...
10560,OTTAWA ST S,58380,450.0
10561,OTTAWA ST S,58380,450.0
10562,OTTAWA ST S,58380,450.0
10563,OTTAWA ST S,58380,450.0


In [147]:
# display street names with no asset size
streets_with_na = asset_size.STREET[asset_size['ASSET_SIZE'].isna()].unique()
print(streets_with_na)

['MILL ST' 'FLORENCE AVE' 'OTTAWA ST S' 'WEBER ST E' 'OTTAWA ST N'
 'ST CLAIR AVE' 'BRIDGE ST E' 'FERGUS AVE' 'CORAL CRES' 'THALER AVE'
 'BOEHMER ST' 'MAUSSER AVE' 'VALEWOOD PL' 'WINDOM RD' 'REX DR'
 'NORFOLK CRES' 'KEHL ST' 'HEIMAN ST' 'HEBEL PL' 'GUERIN AVE' 'BECKER ST'
 'STIRLING AVE S' 'MAURICE ST' 'SOUTHILL DR' 'WALKER ST' 'EIGHTH AVE'
 'FAIRMOUNT RD' 'HUBER ST' 'GOLFVIEW PL' 'PATTANDON AVE' 'HOFFMAN ST'
 'SYDNEY ST S' 'ANN ST' 'EDWIN ST' 'SCHWEITZER ST']


In [148]:
asset_size[asset_size['STREET'] == 'WEBER ST E']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
107,WEBER ST E,41030,150.0
108,WEBER ST E,41030,150.0
347,WEBER ST E,79238,
348,WEBER ST E,79238,
349,WEBER ST E,79238,
...,...,...,...
9687,WEBER ST E,41030,150.0
10020,WEBER ST E,41030,150.0
10021,WEBER ST E,41030,150.0
10311,WEBER ST E,41030,150.0


Well, there we have it: each asset ID maps to a single asset size, whether that size is recorded or missing, so a missing size can't be recovered from another row with the same ID.

I will fill in the `NaN` values with the most frequent number (mode) for now.

In [149]:
asset_size.ASSET_SIZE.value_counts()

150.0     6429
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

In [150]:
# fill asset size with the mode of the column
break_data['ASSET_SIZE'] = break_data['ASSET_SIZE'].fillna(break_data['ASSET_SIZE'].mode()[0])

In [151]:
break_data.ASSET_SIZE.value_counts()

150.0     7136
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

In [152]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  726
ASSET_MATERIAL -  707
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  62
MAP_LABEL -  62
CATEGORY -  62
PIPE_SIZE -  62
MATERIAL -  62
LINED -  62
LINED_MATERIAL -  62
INSTALLATION_DATE -  63
BRIDGE_MAIN -  62
CRITICALITY -  62
UNDERSIZED -  62
SHALLOW_MAIN -  62
CONDITION_SCORE -  62
OVERSIZED -  62
CLEANED -  62
GlobalID -  62
Shape__Length -  62


In [153]:
break_data.ASSET_YEAR_INSTALLED.nunique()

94

In [155]:
# compare the asset year installed with the installation date
break_data[['ASSET_YEAR_INSTALLED', 'INSTALLATION_DATE', 'INCIDENT_DATE']].sample(10)

Unnamed: 0,ASSET_YEAR_INSTALLED,INSTALLATION_DATE,INCIDENT_DATE
2720,1967.0,2012-03-22 00:00:00,2009-06-13 00:00:00
6942,1953.0,1953-01-01 00:00:00,1999-01-08 00:00:00
6057,1984.0,1984-04-01 00:00:00,2008-12-01 00:00:00
3238,1969.0,1968-09-01 00:00:00,2009-05-01 00:00:00
3613,1963.0,1970-06-01 00:00:00,2014-10-12 00:00:00
6305,2010.0,2010-06-14 00:00:00,2002-11-30 00:00:00
2249,1966.0,1966-01-01 00:00:00,2006-10-22 00:00:00
3081,1960.0,2020-10-09 00:00:00,2000-11-30 00:00:00
7809,1961.0,1961-01-01 00:00:00,2007-02-12 00:00:00
3004,1953.0,1956-01-01 00:00:00,2018-02-06 00:00:00


In [157]:
# convert asset year installed to datetime
break_data['ASSET_YEAR_INSTALLED'] = pd.to_datetime(break_data['ASSET_YEAR_INSTALLED'], format='%Y')

In [158]:
# which rows have an incident date earlier than the asset install year?
break_data.loc[break_data['INCIDENT_DATE'] < break_data['ASSET_YEAR_INSTALLED']]

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
25,7880,2006-10-30 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2009-02-11 00:00:00,N,6.0,N,N,9.35,N,Y,b6486254-bf01-4343-93fa-608b0c845148,39.047301
26,7880,2006-10-30 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2009-02-11 00:00:00,N,7.0,N,N,9.35,N,Y,e5e49db1-b6e9-43dc-a935-5b6923046c8f,6.587164
27,7880,2006-10-30 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2009-02-11 00:00:00,N,7.0,N,N,9.35,N,Y,d568e5e0-3369-4a08-88b8-0f6dd60adaa3,167.541757
28,7880,2006-10-30 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2009-02-11 00:00:00,N,6.0,N,N,9.35,N,Y,b05f7c24-4e9d-428a-81ed-66b852999af5,13.508361
46,7886,2009-10-27 00:00:00,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1989-05-31 00:00:00,N,7.0,N,N,9.10,N,N,d17c2b3f-80e5-4f49-8e14-0a35428d8774,81.320859
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9399,10400,2011-01-25 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,COMBINATION,Y,Y,Y,...,2011-11-14 00:00:00,N,6.0,N,N,9.35,N,N,6af381a1-f4c5-4b5c-93c4-276ef2dcd843,2.893684
9400,10400,2011-01-25 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,COMBINATION,Y,Y,Y,...,2011-11-14 00:00:00,N,6.0,N,N,9.35,N,N,3464eb9e-e992-4674-b67c-a9ef3780d391,5.926080
9401,10400,2011-01-25 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,COMBINATION,Y,Y,Y,...,2016-08-15 00:00:00,N,-1.0,N,N,9.85,N,N,0e9798ee-6e55-4f4a-8487-caa1ab208b72,8.934507
9402,10400,2011-01-25 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,COMBINATION,Y,Y,Y,...,2018-03-07 00:00:00,N,-1.0,N,N,6.25,N,N,6043427d-6975-4470-8e75-4e977fee669b,65.764831


In [159]:
# drop the rows where the incident date is less than the asset year installed
break_data.drop(break_data[break_data['INCIDENT_DATE'] < break_data['ASSET_YEAR_INSTALLED']].index, inplace=True)

In [160]:
break_data.shape

(9294, 42)
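One detail about the filter above: a comparison against `NaT` evaluates to `False`, so rows with a missing `ASSET_YEAR_INSTALLED` are not caught by it and survive the drop. A small sketch on made-up toy data (the dates are illustrative only):

```python
import pandas as pd

# Toy frame: one break before install, one valid row, one missing install year
toy = pd.DataFrame({
    "INCIDENT_DATE": pd.to_datetime(["2006-10-30", "2015-03-02", "2009-02-14"]),
    "ASSET_YEAR_INSTALLED": pd.to_datetime(["2009-01-01", "1953-01-01", None]),
})

# Comparisons involving NaT are False, so the third row is not flagged
before_install = toy["INCIDENT_DATE"] < toy["ASSET_YEAR_INSTALLED"]

# Keep everything that is not flagged; the NaT row is still here
kept = toy.loc[~before_install]
print(before_install.tolist())
print(kept)
```

This is why the missing install years still need their own step afterwards.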

In [161]:
# show the missing asset year installed rows with their installation date
break_data[['ASSET_YEAR_INSTALLED', 'INSTALLATION_DATE']][break_data['ASSET_YEAR_INSTALLED'].isna()].sample(30)

Unnamed: 0,ASSET_YEAR_INSTALLED,INSTALLATION_DATE
9708,NaT,2021-05-07 00:00:00
9416,NaT,2018-12-01 00:00:00
7397,NaT,2018-10-02 00:00:00
764,NaT,2015-02-09 00:00:00
4067,NaT,2021-09-01 00:00:00
38,NaT,2019-06-21 00:00:00
7641,NaT,2020-06-18 00:00:00
3221,NaT,2018-10-29 00:00:00
2298,NaT,2015-02-09 00:00:00
5270,NaT,2022-02-11 00:00:00


In [162]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  726
ASSET_MATERIAL -  707
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  59
MAP_LABEL -  59
CATEGORY -  59
PIPE_SIZE -  59
MATERIAL -  59
LINED -  59
LINED_MATERIAL -  59
INSTALLATION_DATE -  60
BRIDGE_MAIN -  59
CRITICALITY -  59
UNDERSIZED -  59
SHALLOW_MAIN -  59
CONDITION_SCORE -  59
OVERSIZED -  59
CLEANED -  59
GlobalID -  59
Shape__Length -  59


From the studies I've read, the year a pipe was installed (or rather, the age of the pipe) is a critical factor in predicting breaks and time to failure. Because of that, I would rather drop the observations with no install year than guess at them.
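To make "age of the pipe" concrete, here is a rough sketch of how it could later be derived from the two date columns. The toy values are taken from the sample shown earlier, and `AGE_AT_BREAK_YEARS` is a hypothetical column name that does not exist in the dataset.

```python
import pandas as pd

# Toy rows based on the earlier sample (indices 6942 and 6057)
toy = pd.DataFrame({
    "INCIDENT_DATE": pd.to_datetime(["1999-01-08", "2008-12-01"]),
    "ASSET_YEAR_INSTALLED": pd.to_datetime(["1953-01-01", "1984-01-01"]),
})

# Whole-year age of the pipe when it broke
# (AGE_AT_BREAK_YEARS is a hypothetical name used only for this illustration)
toy["AGE_AT_BREAK_YEARS"] = (
    toy["INCIDENT_DATE"].dt.year - toy["ASSET_YEAR_INSTALLED"].dt.year
)
print(toy)
```

Without an install year this value cannot be computed at all, which is the reason for dropping those rows.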

In [163]:
break_data.dropna(axis=0, subset=['ASSET_YEAR_INSTALLED'], inplace=True)

In [164]:
print(break_data.shape)
print_null_values(break_data)

(8568, 42)
OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  0
ASSET_MATERIAL -  0
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  55
MAP_LABEL -  55
CATEGORY -  55
PIPE_SIZE -  55
MATERIAL -  55
LINED -  55
LINED_MATERIAL -  55
INSTALLATION_DATE -  56
BRIDGE_MAIN -  55
CRITICALITY -  55
UNDERSIZED -  55
SHALLOW_MAIN -  55
CONDITION_SCORE -  55
OVERSIZED -  55
CLEANED -  55
GlobalID -  55
Shape__Length -  55


In [165]:
for col in break_data.columns:
    if break_data[col].isnull().sum() > 0:
        print(col, break_data[col].unique())

PRESSURE_ZONE ['KIT 4' 'KIT 5' 'KIT 2E' 'BRIDGEPORT' nan 'KIT 6' 'KIT 2W' 'RAW NO ZONE'
 'MANNHEIM']
MAP_LABEL ['138.2m 150mm  CI' '4.7m 25mm  PVC' '1.4m 25mm  PVC' ...
 '88.1m 150mm  DI' '9.3m 150mm  DI' '75.5m 150mm  DI']
CATEGORY ['TREATED' 'UNTREATED' nan]
PIPE_SIZE [ 150.   25.  450.  200.  300.  750.  250.   nan  600.  100.    0.  900.
   50.  400. 1200.   38.]
MATERIAL ['CI' 'PVC' 'DI' 'HDPE' 'CPP' 'PVCB' 'ST' nan 'PVCO' 'AC' 'COP']
LINED ['NO' nan 'YES']
LINED_MATERIAL ['NONE' nan 'EPS' 'CEMENT']
INSTALLATION_DATE ['1923-01-01 00:00:00' '2017-01-11 00:00:00' '1965-01-01 00:00:00'
 '1966-01-01 00:00:00' '2010-06-14 00:00:00' '1959-01-01 00:00:00'
 '2020-08-17 00:00:00' '2020-09-02 00:00:00' '1973-01-01 00:00:00'
 '1975-01-01 00:00:00' '1971-05-01 00:00:00' '2000-06-01 00:00:00'
 '1987-01-01 00:00:00' '1962-01-01 00:00:00' '1961-01-01 00:00:00'
 '2017-07-01 00:00:00' '1951-06-01 00:00:00' '1955-01-01 00:00:00'
 '1992-11-01 00:00:00' '1985-06-01 00:00:00' '1974-08-01 00:00:00'
 '1

I'm fairly sure that if we drop the rows with a missing `INSTALLATION_DATE` instead of imputing them, that will also get rid of the rest of the missing values.
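That hunch can be checked before dropping anything. A minimal sketch on made-up toy data, assuming the question is whether every row with a null in some other column is also missing `INSTALLATION_DATE`:

```python
import pandas as pd

# Toy frame: nulls in the other columns line up with missing installation dates
toy = pd.DataFrame({
    "INSTALLATION_DATE": ["1953-01-01", None, "1970-06-01", None],
    "PRESSURE_ZONE": ["KIT 4", None, "KIT 5", None],
    "MATERIAL": ["CI", None, "DI", None],
})

missing_install = toy["INSTALLATION_DATE"].isna()

# Rows that have a null in any column other than INSTALLATION_DATE
other_cols = toy.columns.drop("INSTALLATION_DATE")
rows_with_other_nulls = toy[other_cols].isna().any(axis=1)

# True when no row has a null elsewhere while still having an installation date
covered = bool((rows_with_other_nulls & ~missing_install).sum() == 0)
print(covered)
```

If `covered` comes back `True` on the real data, dropping on `INSTALLATION_DATE` alone is enough.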

In [166]:
break_data.dropna(axis=0, subset=['INSTALLATION_DATE'], inplace=True)

In [167]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  0
ASSET_MATERIAL -  0
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  0
MAP_LABEL -  0
CATEGORY -  0
PIPE_SIZE -  0
MATERIAL -  0
LINED -  0
LINED_MATERIAL -  0
INSTALLATION_DATE -  0
BRIDGE_MAIN -  0
CRITICALITY -  0
UNDERSIZED -  0
SHALLOW_MAIN -  0
CONDITION_SCORE -  0
OVERSIZED -  0
CLEANED -  0
GlobalID -  0
Shape__Length -  0


In [168]:
print(break_data.shape)
break_data.sample(15)

(8512, 42)


Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
3780,8964,2009-02-14 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2019-06-26 00:00:00,N,-1.0,N,N,9.85,N,N,3449a379-3fdf-4c6e-b6dd-64e52a02f867,41.000366
4018,9018,2015-03-02 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1953-01-01 00:00:00,N,5.0,N,N,7.75,N,N,1e06a4de-0161-42ab-ae4d-f8790b373502,11.756875
10303,47681,2022-01-17 20:58:57,MAIN,4-8 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,AGE,Y,Y,Y,...,1970-01-01 00:00:00,N,6.0,N,N,4.25,N,N,336aab5c-5608-4ddb-827e-10f560cb13c9,135.27158
4287,9094,2015-02-15 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,2022-09-01 00:00:00,N,-1.0,N,N,-1.0,N,N,2d59f197-12fc-489f-a893-1f62368bf2ea,27.749843
3851,8979,2015-03-29 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:00:00,N,6.0,N,N,8.5,N,N,92e1cd14-6c95-43bb-80ea-4fd59cad1472,85.125441
6960,9766,2013-03-04 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1954-01-01 00:00:00,N,6.0,N,N,7.75,N,N,6954a697-b9e4-434e-b693-44784b998176,7.620663
5070,9285,2001-03-11 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1974-10-01 00:00:00,N,7.0,N,N,8.5,N,N,3e532a9a-ccd8-4db1-9399-7c874e41215b,96.481475
2009,8512,2008-01-12 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1967-09-01 00:00:00,N,5.0,N,N,4.65,N,N,e1b5fe4c-e995-4511-960d-91444f778bd6,138.684027
9886,29761,2020-10-31 10:58:26,MAIN,4-8 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,UNKNOWN,Y,Y,Y,...,2012-04-11 00:00:00,N,5.0,N,N,9.35,N,N,819b9c16-6d83-4850-a5f4-fc508d4d883a,0.941614
5543,9403,2005-05-01 00:00:00,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1987-01-01 00:00:00,N,6.0,N,N,9.35,N,N,79185981-d95e-44a0-91da-17ff2a50cdcf,66.518626


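Lucky for us, dropping the rows with no `ASSET_YEAR_INSTALLED` also took care of the missing values in the `ASSET_MATERIAL` column, and dropping the rows with no `INSTALLATION_DATE` cleared everything else. Now we finally have a clean dataset for some EDA and feature engineering.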

In [169]:
# show the rows where the incident date is less than the asset year installed
break_data.loc[break_data['INCIDENT_DATE'] < break_data['ASSET_YEAR_INSTALLED']]

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length


In [171]:
break_data.columns

Index(['OBJECTID', 'INCIDENT_DATE', 'BREAK_TYPE', 'HOUR_IMPACTED', 'STATUS',
       'BREAK_NATURE', 'BREAK_APPARENT_CAUSE', 'POSITIVE_PRESSURE_MAINTANED',
       'AIR_GAP_MAINTANED', 'DISINFECTED', 'MECHANICAL_REMOVAL',
       'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING', 'ANODE_INSTALLED',
       'BREAK_CATEGORIZATION', 'ROADSEGMENTID', 'STREET', 'ASSETID',
       'ASSET_SIZE', 'ASSET_YEAR_INSTALLED', 'ASSET_MATERIAL', 'ASSET_EXISTS',
       'GLOBALID', 'longitude', 'latitude', 'PRESSURE_ZONE', 'MAP_LABEL',
       'CATEGORY', 'PIPE_SIZE', 'MATERIAL', 'LINED', 'LINED_MATERIAL',
       'INSTALLATION_DATE', 'BRIDGE_MAIN', 'CRITICALITY', 'UNDERSIZED',
       'SHALLOW_MAIN', 'CONDITION_SCORE', 'OVERSIZED', 'CLEANED', 'GlobalID',
       'Shape__Length'],
      dtype='object')

In [170]:
break_data.to_csv('../data/interim/cleaned_break_data.csv', index=False)
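The manual checks from the cells above (no nulls left, no break dated before its install year) could also be bundled into one function and run before the export. This is only a sketch: `check_clean` is a hypothetical helper, not something defined in this notebook, and the frames are toy data.

```python
import pandas as pd

def check_clean(df: pd.DataFrame) -> bool:
    """Return True if df has no nulls and no break dated before its install year.

    Hypothetical helper for illustration only.
    """
    no_nulls = not df.isna().any().any()
    no_early_breaks = not (df["INCIDENT_DATE"] < df["ASSET_YEAR_INSTALLED"]).any()
    return no_nulls and no_early_breaks

# Toy frame that passes both checks
good = pd.DataFrame({
    "INCIDENT_DATE": pd.to_datetime(["2015-03-02", "2001-03-11"]),
    "ASSET_YEAR_INSTALLED": pd.to_datetime(["1953-01-01", "1974-01-01"]),
})

# Toy frame with a break recorded before the install year
bad = pd.DataFrame({
    "INCIDENT_DATE": pd.to_datetime(["2006-10-30"]),
    "ASSET_YEAR_INSTALLED": pd.to_datetime(["2009-01-01"]),
})

print(check_clean(good))
print(check_clean(bad))
```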