## Initial Pre-processing

This notebook demonstrates my pre-processing steps for water main break data. I download and import the most recently available data from the [regional database](https://open-kitchenergis.opendata.arcgis.com/datasets/KitchenerGIS::water-main-breaks/about) and drop the unnecessary columns so we can focus on the features that are going to be essential for our model later on.

Let's dive in!

In [84]:
# Basic imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
%matplotlib inline

In [85]:
# Import our most recent data and check out a sample
break_data = pd.read_csv("../data/raw/water_data.csv")
break_data.sample(10)


Columns (7,18,20,21,40,69) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,CW_SERVICE_REQUEST,STATUS,...,CRITICALITY,REL_CLEANING_AREA,REL_CLEANING_SUBAREA,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
6115,9549,1377,1236643200000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,5.0,1.0,31.0,N,N,9.35,N,N,959775e3-1507-415a-9952-6a98953f61ee,13.305506
3633,8922,2014,1415232000000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,5.0,3.0,N,N,7.75,N,N,46ebcd70-c014-45e7-92c8-4a746182a72c,1.428562
2261,8578,1336,1231718400000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,5.0,1.0,20.0,N,N,9.35,N,N,b78a9daf-ccbe-4912-a462-a4602bfdaf99,0.935479
8464,10148,497,997488000000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,-1.0,5.0,6.0,N,N,9.85,N,N,66989675-9c3e-469d-95d0-4ea009a1300e,11.080928
3069,8782,897,971481600000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,7.0,1.0,24.0,N,N,7.85,N,N,cf0ea71d-871c-4700-9abd-6f66782a38df,29.997047
162,8032,1835,1363824000000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,,,,,,,,,,
5484,9392,400,1044144000000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,5.0,4.0,9.0,N,N,6.15,N,N,8891cd2f-e90c-4ecb-aa48-c4dd7b9822aa,116.810379
5266,9343,720,1052352000000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,7.0,6.0,0.0,N,N,9.35,N,Y,05d872e6-9648-4290-ac2d-6a5e4342ea6e,15.738052
2518,8646,2128,1429747200000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,7.0,5.0,17.0,N,N,8.5,N,N,1c7879a2-5e0b-4de6-9d08-a257e5d25767,144.349237
3135,8794,542,1008028800000,MAIN,Open,Open,8-12 hours,,,REPAIR COMPLETED,...,6.0,5.0,8.0,N,N,7.75,N,N,52061d52-c2d1-47e1-80d2-25d3e6d69787,7.619966
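
The `DtypeWarning` printed above names two remedies: pass `low_memory=False`, or specify `dtype` on import. Here is a minimal sketch of both, using a tiny in-memory CSV as a stand-in for the real file. The rows are invented for illustration; only the `UNITS_IMPACTED` column name comes from the data.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/raw/water_data.csv: UNITS_IMPACTED mixes
# numbers and text, the kind of column that can trigger the warning.
csv_text = "OBJECTID,UNITS_IMPACTED\n1,12\n2,unknown\n3,\n"

# Option 1: parse the file in one pass so each column gets a single dtype.
whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the dtype explicitly for columns known to be mixed.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"UNITS_IMPACTED": "string"})

print(whole.dtypes)
print(typed.dtypes)
```

Option 2 is more work up front, but it documents which columns are expected to hold text.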


In [86]:
break_data.shape

(10721, 80)

In [99]:
break_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10721 entries, 0 to 10720
Data columns (total 43 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   OBJECTID                     10721 non-null  int64  
 1   INCIDENT_DATE                10721 non-null  int64  
 2   BREAK_TYPE                   10721 non-null  object 
 3   HOUR_IMPACTED                10721 non-null  object 
 4   STATUS                       10721 non-null  object 
 5   BREAK_NATURE                 10214 non-null  object 
 6   BREAK_APPARENT_CAUSE         10030 non-null  object 
 7   POSITIVE_PRESSURE_MAINTANED  10721 non-null  object 
 8   AIR_GAP_MAINTANED            10721 non-null  object 
 9   DISINFECTED                  10721 non-null  object 
 10  MECHANICAL_REMOVAL           10721 non-null  object 
 11  FLUSHING_EXCAVATION          10721 non-null  object 
 12  HIGHER_VELOCITY_FLUSHING     10721 non-null  object 
 13  ANODE_INSTALLED 

In [87]:
break_data.columns

Index(['OBJECTID', 'WATBREAKINCIDENTID', 'INCIDENT_DATE', 'BREAK_TYPE',
       'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'HOUR_IMPACTED', 'UNITS_IMPACTED',
       'CW_SERVICE_REQUEST', 'STATUS', 'STATUS_DATE', 'WORKORDER',
       'RETURN_TO_NORMAL', 'BREAK_NATURE', 'BREAK_APPARENT_CAUSE',
       'REPAIR_TYPE', 'NEW_SECTION_LENGTH', 'MAINTENANCE_DESC',
       'VALVES_CLOSED', 'VALVES_OPENED', 'HYDRANTS_CALLED_OUT',
       'HYDRANTS_CALLED_BACK_IN', 'POSITIVE_PRESSURE_MAINTANED',
       'AIR_GAP_MAINTANED', 'DISINFECTED', 'MECHANICAL_REMOVAL',
       'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING', 'ANODE_INSTALLED',
       'BREAK_CATEGORIZATION', 'BACTERIA_TESTING_DATE',
       'HEALTH_DEPT_NOTIFICATION', 'MOECC_SAC_NOTIFICATION',
       'SAC_REFERENCE_NO', 'LOCAL_MOE_OFFICE', 'BWA_DWA', 'BWA_DWA_DECLARED',
       'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'ROADSEGMENTID',
       'CIVIC_NUMBER', 'STREET', 'ASSETID', 'ASSET_DEPTH', 'FROST_DEPTH',
       'ASSET_SIZE', 'ASSET_YEAR_INSTALLED',

In [88]:
break_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10721 entries, 0 to 10720
Data columns (total 80 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   OBJECTID                     10721 non-null  int64  
 1   WATBREAKINCIDENTID           10721 non-null  int64  
 2   INCIDENT_DATE                10721 non-null  int64  
 3   BREAK_TYPE                   10721 non-null  object 
 4   ROAD_CLOSED                  10721 non-null  object 
 5   SIDEWALK_CLOSED              10721 non-null  object 
 6   HOUR_IMPACTED                10721 non-null  object 
 7   UNITS_IMPACTED               1126 non-null   object 
 8   CW_SERVICE_REQUEST           75 non-null     float64
 9   STATUS                       10721 non-null  object 
 10  STATUS_DATE                  10097 non-null  float64
 11  WORKORDER                    6092 non-null   float64
 12  RETURN_TO_NORMAL             382 non-null    float64
 13  BREAK_NATURE    

In [89]:
def print_null_values(df, column=None):
    if column is None:
        # print null values for all columns
        for col in df.columns:
            null_values = df[col].isnull().sum()
            print(f"{col} - ", null_values)
    else:
        # print null values for a single column
        null_values = df[column].isnull().sum()
        print(f"{column} - ", null_values)

In [90]:
# break_data.isna().sum()
print_null_values(break_data)

OBJECTID -  0
WATBREAKINCIDENTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
ROAD_CLOSED -  0
SIDEWALK_CLOSED -  0
HOUR_IMPACTED -  0
UNITS_IMPACTED -  9595
CW_SERVICE_REQUEST -  10646
STATUS -  0
STATUS_DATE -  624
WORKORDER -  4629
RETURN_TO_NORMAL -  10339
BREAK_NATURE -  507
BREAK_APPARENT_CAUSE -  691
REPAIR_TYPE -  10138
NEW_SECTION_LENGTH -  10639
MAINTENANCE_DESC -  10721
VALVES_CLOSED -  10717
VALVES_OPENED -  10721
HYDRANTS_CALLED_OUT -  10695
HYDRANTS_CALLED_BACK_IN -  10699
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
BACTERIA_TESTING_DATE -  10684
HEALTH_DEPT_NOTIFICATION -  10721
MOECC_SAC_NOTIFICATION -  10712
SAC_REFERENCE_NO -  10721
LOCAL_MOE_OFFICE -  10712
BWA_DWA -  0
BWA_DWA_DECLARED -  10713
PROCEEDURES_FOLLOWED -  0
RECORD_CHANGE_REQD -  0
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_DEP

In [91]:
# print the columns where more than `threshold` (default 70 percent) of the values are null
def print_null_thresh_values(df, column=None, threshold=0.7):
    if column is None:
        # print null values for all columns
        for col in df.columns:
            null_values = df[col].isnull().sum()
            if null_values > threshold * len(df):
                print(f"{col} - ", null_values)
    else:
        # print null values for a single column
        null_values = df[column].isnull().sum()
        print(f"{column} - ", null_values)

print_null_thresh_values(break_data)

UNITS_IMPACTED -  9595
CW_SERVICE_REQUEST -  10646
RETURN_TO_NORMAL -  10339
REPAIR_TYPE -  10138
NEW_SECTION_LENGTH -  10639
MAINTENANCE_DESC -  10721
VALVES_CLOSED -  10717
VALVES_OPENED -  10721
HYDRANTS_CALLED_OUT -  10695
HYDRANTS_CALLED_BACK_IN -  10699
BACTERIA_TESTING_DATE -  10684
HEALTH_DEPT_NOTIFICATION -  10721
MOECC_SAC_NOTIFICATION -  10712
SAC_REFERENCE_NO -  10721
LOCAL_MOE_OFFICE -  10712
BWA_DWA_DECLARED -  10713
ASSET_DEPTH -  10665
FROST_DEPTH -  10440
LINED_DATE -  10586
CONSULTANT -  7557
BRIDGE_DETAILS -  10717


In [92]:
def drop_null_columns(df, threshold=0.7):
    for col in df.columns:
        null_values = df[col].isnull().sum()
        if null_values > threshold * len(df):
            df.drop(col, axis=1, inplace=True)

drop_null_columns(break_data)
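
The same thresholding can be written without an explicit loop. This is a sketch on a toy frame: the data is invented, the helper name `drop_sparse_columns` is hypothetical, and only the 0.7 threshold comes from the cell above. `isna().mean()` gives the fraction of nulls per column, and selecting columns returns a new frame instead of mutating the original.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.7):
    """Return a copy of df without columns whose null fraction exceeds threshold."""
    null_fraction = df.isna().mean()  # per-column share of missing values
    keep = null_fraction[null_fraction <= threshold].index
    return df.loc[:, keep].copy()

# Toy example: column "b" is 75% null and is dropped; "a" and "c" are kept.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, "x", "y", "z"],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Returning a new frame keeps the raw data available for comparison, which helps when experimenting with different thresholds.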

In [93]:
break_data.columns

Index(['OBJECTID', 'WATBREAKINCIDENTID', 'INCIDENT_DATE', 'BREAK_TYPE',
       'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'HOUR_IMPACTED', 'STATUS',
       'STATUS_DATE', 'WORKORDER', 'BREAK_NATURE', 'BREAK_APPARENT_CAUSE',
       'POSITIVE_PRESSURE_MAINTANED', 'AIR_GAP_MAINTANED', 'DISINFECTED',
       'MECHANICAL_REMOVAL', 'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING',
       'ANODE_INSTALLED', 'BREAK_CATEGORIZATION', 'BWA_DWA',
       'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'ROADSEGMENTID',
       'CIVIC_NUMBER', 'STREET', 'ASSETID', 'ASSET_SIZE',
       'ASSET_YEAR_INSTALLED', 'ASSET_MATERIAL', 'ASSET_EXISTS', 'GLOBALID',
       'longitude', 'latitude', 'OBJECTID.1', 'WATMAINID', 'STATUS.1',
       'PRESSURE_ZONE', 'ROADSEGMENTID.1', 'MAP_LABEL', 'CATEGORY',
       'PIPE_SIZE', 'MATERIAL', 'LINED', 'LINED_MATERIAL', 'INSTALLATION_DATE',
       'ACQUISITION', 'OWNERSHIP', 'BRIDGE_MAIN', 'CRITICALITY',
       'REL_CLEANING_AREA', 'REL_CLEANING_SUBAREA', 'UNDERSIZED',
       'SHALLOW

In [94]:
break_data.drop(['WATBREAKINCIDENTID', 'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'STATUS_DATE', 'WORKORDER',
                 'BWA_DWA', 'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'OBJECTID.1', 'WATMAINID', 'STATUS.1',
                 'ROADSEGMENTID.1', 'ACQUISITION', 'OWNERSHIP', 'REL_CLEANING_AREA', 'REL_CLEANING_SUBAREA',], axis=1, inplace=True)

In [95]:
break_data.shape

(10721, 43)

In [96]:
# mains_data = pd.read_csv("../data/raw/Water_Mains.csv")
# breaks_data = pd.read_csv("../data/raw/Water_Main_Breaks.csv")

In [97]:
# take the INSTALLATION_DATE from Water_Mains.csv and the INCIDENT_DATE from Water_Main_Breaks.csv and add them into this dataframe to replace the same columns that are there now
# they should match up with the ROADSEGMENTID
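
The plan in the comments above could be sketched as a left merge on `ROADSEGMENTID`. The two small frames below are invented stand-ins for `Water_Main_Breaks.csv` and `Water_Mains.csv`. The sketch also assumes each road segment maps to a single main, which the real data may not satisfy.

```python
import pandas as pd

# Hypothetical stand-ins for the two raw files named in the commented-out cell.
breaks = pd.DataFrame({
    "OBJECTID": [1, 2, 3],
    "ROADSEGMENTID": [10, 20, 30],
    "INCIDENT_DATE": [1236643200000, 1415232000000, 997488000000],
})
mains = pd.DataFrame({
    "ROADSEGMENTID": [10, 20],
    "INSTALLATION_DATE": [-757382400000, 31536000000],
})

# The left merge keeps every break. validate= raises an error if a segment
# matches more than one main, instead of silently duplicating break rows.
merged = breaks.merge(
    mains, on="ROADSEGMENTID", how="left", validate="many_to_one", indicator=True
)
print(merged)
```

The `_merge` column added by `indicator=True` shows which breaks found no matching main.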

In [98]:
break_data['INCIDENT_DATE'] = pd.to_datetime(break_data['INCIDENT_DATE'], unit='s')
break_data['INSTALLATION_DATE'] = pd.to_datetime(break_data['INSTALLATION_DATE'], unit='s')

OutOfBoundsDatetime: cannot convert input with unit 's'
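
The error above is what you would expect if the raw integers are epoch milliseconds rather than seconds. Values such as 1236643200000 in the earlier sample are about a thousand times too large for second-resolution timestamps. The 1970-01-01 timestamps in later outputs appear consistent with the same integers being read as nanoseconds, which is the pandas default unit. A minimal sketch, assuming the millisecond interpretation is right:

```python
import pandas as pd

# Values copied from the INCIDENT_DATE column of the sample shown earlier.
raw = pd.Series([1236643200000, 1415232000000, 997488000000])

# Interpret the integers as milliseconds since the Unix epoch. This is an
# assumption based on their magnitude, not on the dataset's documentation.
incident_date = pd.to_datetime(raw, unit="ms")
print(incident_date.dt.year.tolist())
```

If the assumption holds, the same `unit="ms"` call should also handle `INSTALLATION_DATE`, including negative values for mains installed before 1970.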

In [29]:
break_data.head()

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
0,1,1970-01-01 00:25:12.141300,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,1969-12-31 23:35:16.771200,N,6.0,N,N,3.85,N,N,7032c165-215f-4cf6-ae35-65c9320aab4e,138.194599
1,1,1970-01-01 00:25:12.141300,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,1970-01-01 00:24:44.092800,N,-1.0,N,N,9.85,N,N,791c47f5-9a9a-48b1-94bb-cad82bec682c,4.721263
2,1,1970-01-01 00:25:12.141300,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,1970-01-01 00:24:44.092800,N,-1.0,N,N,9.85,N,N,fbc17d4d-3636-4b58-823b-54b3cbc77470,1.399993
3,1,1970-01-01 00:25:12.141300,MAIN,12-16 hours,REPAIR COMPLETED,CORROSION AND FITTING/JOINT,AGE,N,N,Y,...,1970-01-01 00:24:44.092800,N,-1.0,N,N,5.93,N,N,e67cbabe-4883-4c85-854b-f0367fff4f1a,1.579268
4,7874,1970-01-01 00:16:25.564800,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1969-12-31 23:57:22.233600,N,5.0,N,N,7.85,N,Y,c99abdb4-c3cc-409d-be83-5b42e4029bc9,12.798598


In [32]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  507
BREAK_APPARENT_CAUSE -  691
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


Let's look at the values in the `BREAK_APPARENT_CAUSE` and `BREAK_NATURE` columns. Neither has many missing values (691 and 507 of 10,721 rows respectively), so I might be able to impute them easily.

In [33]:
break_data.BREAK_APPARENT_CAUSE.unique()

array(['AGE', 'OTHER', 'COMBINATION', 'CORROSION', 'SOILS', 'UNKNOWN',
       'PRESSURE', 'FAULTY INSTALL', nan], dtype=object)

In [34]:
break_data.BREAK_APPARENT_CAUSE.value_counts()

OTHER             7962
AGE                825
COMBINATION        503
CORROSION          405
UNKNOWN            132
PRESSURE            92
SOILS               72
FAULTY INSTALL      39
Name: BREAK_APPARENT_CAUSE, dtype: int64

I'll fill the `NaN` values with `'UNKNOWN'`.

In [35]:
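# assign the result back; an in-place fillna on a selected column may not update the frame in newer pandas
break_data['BREAK_APPARENT_CAUSE'] = break_data['BREAK_APPARENT_CAUSE'].fillna('UNKNOWN')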

In [36]:
break_data.BREAK_APPARENT_CAUSE.value_counts()


OTHER             7962
AGE                825
UNKNOWN            823
COMBINATION        503
CORROSION          405
PRESSURE            92
SOILS               72
FAULTY INSTALL      39
Name: BREAK_APPARENT_CAUSE, dtype: int64

In [37]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      7936
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
OTHER: WATER SERVICE                           36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [38]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER: WATER SERVICE': 'WATER SERVICE'})

In [39]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      7936
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [40]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER', nan,
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [41]:
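# assign the result back; an in-place fillna on a selected column may not update the frame in newer pandas
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].fillna('UNKNOWN')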

In [42]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      8443
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
OTHER                                          46
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [43]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [44]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  542
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


In [45]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    10114
CATEGORY 2       65
Name: BREAK_CATEGORIZATION, dtype: int64

In [46]:
break_data.BREAK_CATEGORIZATION.unique()

array(['CATEGORY 2', 'CATEGORY 1', nan], dtype=object)

In [47]:
def fill_null_values(df, column, value):
    # fill null values in the specified column with the specified value
    # assign back to the column so the frame is updated regardless of pandas version
    df[column] = df[column].fillna(value)

In [48]:
# break_data.BREAK_CATEGORIZATION.fillna('UNKNOWN', inplace=True)
fill_null_values(break_data, 'BREAK_CATEGORIZATION', 'UNKNOWN')

In [49]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    10114
UNKNOWN         542
CATEGORY 2       65
Name: BREAK_CATEGORIZATION, dtype: int64

In [50]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER': 'UNKNOWN'})

In [51]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [52]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      8489
CIRCUMFERENTIAL                              1295
CORROSION                                     357
FITTING/JOINT                                 185
LONGITUDINAL                                  102
CIRCUMFERENTIAL AND FITTING/JOINT              89
CORROSION AND CIRCUMFERENTIAL                  82
CORROSION AND FITTING/JOINT                    43
WATER SERVICE                                  36
CORROSION AND LONGITUDINAL                     32
FITTING/JOINT AND LONGITUDINAL                  9
CORROSION - ROBAR SADDLE CORRODED AT SEAM       2
Name: BREAK_NATURE, dtype: int64

In [53]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
CIVIC_NUMBER -  525
STREET -  20
ASSETID -  0
ASSET_SIZE -  720
ASSET_YEAR_INSTALLED -  739
ASSET_MATERIAL -  720
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  76
MAP_LABEL -  76
CATEGORY -  76
PIPE_SIZE -  76
MATERIAL -  76
LINED -  76
LINED_MATERIAL -  76
INSTALLATION_DATE -  77
BRIDGE_MAIN -  76
CRITICALITY -  76
UNDERSIZED -  76
SHALLOW_MAIN -  76
CONDITION_SCORE -  76
OVERSIZED -  76
CLEANED -  76
GlobalID -  76
Shape__Length -  76


In [54]:
break_data['CIVIC_NUMBER'].sample(5)

6473       22
2447      167
8862    257.0
8112      150
8866     68.0
Name: CIVIC_NUMBER, dtype: object

In [55]:
break_data.drop(['CIVIC_NUMBER'], axis=1, inplace=True)

We can see there are some `STREET` names missing. It would be best to explore these rows and see how we might impute them. They still contain `latitude` and `longitude` values, so we may be able to identify them that way. Let's see what we can do...

In [56]:
no_street = break_data.loc[break_data.STREET.isna()]
no_street

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
548,8140,1970-01-01 00:20:56.169600,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,NaT,,,,,,,,,
1066,8278,1970-01-01 00:16:24.528000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:02:27.225600,N,6.0,N,N,4.9,N,N,f9f9bd80-ceff-4ce0-951c-03c02a4cc6e7,31.843715
1067,8278,1970-01-01 00:16:24.528000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:02:27.225600,N,0.0,N,N,8.5,N,N,9e411899-71a2-441f-bd0d-ccec7a529549,1.63387
1068,8278,1970-01-01 00:16:24.528000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:02:27.225600,N,6.0,N,N,8.5,N,N,0e30f363-a96b-49c5-a8a0-e6ac38693bc6,5.867424
1900,8478,1970-01-01 00:23:09.312000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:04:07.190400,N,6.0,N,N,8.5,N,N,a8bc6f06-948d-49b9-9fa6-b3ff49af237f,6.594864
1901,8478,1970-01-01 00:23:09.312000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:04:07.190400,N,7.0,N,N,6.9,N,N,ee2352cf-7fdf-4b1c-ab21-fcf4a3c636bf,117.352822
1902,8478,1970-01-01 00:23:09.312000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:04:07.190400,N,7.0,N,N,8.5,N,N,e0706f51-cc0b-4d07-9535-93597255e037,7.917657
2660,8678,1970-01-01 00:22:03.963507,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,NaT,,,,,,,,,
5659,9428,1970-01-01 00:22:03.963508,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,NaT,,,,,,,,,
6294,9592,1970-01-01 00:22:03.963508,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,NaT,,,,,,,,,


Looking at the latitude and longitude of the missing street addresses, there are actually only 5 distinct missing addresses. We can plot them on a map and see where they are located.
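
One way to count distinct locations is to drop duplicate coordinate pairs. The frame below uses invented coordinates as a stand-in for `no_street`, so the count it prints is illustrative only.

```python
import pandas as pd

# Invented coordinates standing in for the rows with a missing STREET.
no_street_demo = pd.DataFrame({
    "OBJECTID": [8140, 8278, 8278, 8278, 8478],
    "latitude": [43.45, 43.44, 43.44, 43.44, 43.42],
    "longitude": [-80.49, -80.47, -80.47, -80.47, -80.51],
})

# Each unique (latitude, longitude) pair is one distinct location.
distinct_locations = no_street_demo[["latitude", "longitude"]].drop_duplicates()
print(len(distinct_locations))
```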

In [57]:
import plotly
import plotly.express as px

In [59]:
# the Mapbox access token lives in a local config module that is not checked in
from config import token
px.set_mapbox_access_token(token)
fig = px.scatter_mapbox(data_frame=no_street, lat='latitude', lon='longitude')
fig.update_layout(mapbox_style="carto-positron", mapbox_accesstoken=token)
fig.show()

In [60]:
# drop the rows with no street
break_data.dropna(subset=['STREET'], inplace=True)

In [61]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  707
ASSET_YEAR_INSTALLED -  726
ASSET_MATERIAL -  707
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  62
MAP_LABEL -  62
CATEGORY -  62
PIPE_SIZE -  62
MATERIAL -  62
LINED -  62
LINED_MATERIAL -  62
INSTALLATION_DATE -  63
BRIDGE_MAIN -  62
CRITICALITY -  62
UNDERSIZED -  62
SHALLOW_MAIN -  62
CONDITION_SCORE -  62
OVERSIZED -  62
CLEANED -  62
GlobalID -  62
Shape__Length -  62


In [62]:
break_data.ASSET_SIZE.value_counts()

150.0     6429
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

So there are 707 missing values for `ASSET_SIZE`. We could simply fill the nulls with the mean, with the most frequent value, or with 0. Another approach, which I feel might be the most appropriate, is to visualize on a map where each asset size occurs, group by street name, check whether the null values fall on the same streets as known asset sizes, and then fill in each missing size based on its nearest neighbours.

In [63]:
# pd.set_option("display.max_rows", None)
break_data.loc[break_data.ASSET_SIZE.isna()]

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
29,7881,1970-01-01 00:23:46.118400,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:22:34.060800,N,-1.0,N,N,9.35,N,N,6610f5df-daf4-4740-a895-6998427c4db1,17.641098
30,7881,1970-01-01 00:23:46.118400,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,067a573a-2de1-4b0c-a9dd-e1c47a4c215a,95.003715
31,7881,1970-01-01 00:23:46.118400,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,91af793e-7013-4ec2-81cd-5124d597c909,89.270670
32,7881,1970-01-01 00:23:46.118400,SERVICE,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,e7350b7b-8bb5-4abb-86ef-a07c09f6978e,84.284014
76,8007,1970-01-01 00:17:31.488000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,2edaca94-c56e-44ea-b779-47ba2a9c9d43,155.740656
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9722,19841,1970-01-01 00:26:19.256677,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,067a573a-2de1-4b0c-a9dd-e1c47a4c215a,95.003715
9723,19841,1970-01-01 00:26:19.256677,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,91af793e-7013-4ec2-81cd-5124d597c909,89.270670
9724,19841,1970-01-01 00:26:19.256677,SERVICE,20-24 hours,REPAIR COMPLETED,UNKNOWN,UNKNOWN,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,e7350b7b-8bb5-4abb-86ef-a07c09f6978e,84.284014
9969,33925,1970-01-01 00:26:50.785832,MAIN,4-8 hours,CANCELLED,UNKNOWN,UNKNOWN,Y,Y,Y,...,1970-01-01 00:27:02.505600,N,-1.0,N,N,-1.00,N,N,e55dd405-e462-4dd7-b2fb-b2a2e59ea19e,15.511798


Next I'll look at a breakdown of the asset sizes, listed with their asset IDs and street names, to see whether any null asset sizes share a street with non-null ones. If so, it may be possible to impute the nulls with the average asset size for that street. For now I am just hypothesizing. I'm not entirely sure it'll work, but why not try, right?
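
Street-level imputation could look roughly like the sketch below. It uses the per-street mode rather than the mean, on the assumption that pipe sizes are nominal diameters and an averaged value might not correspond to a real pipe. The toy data, the helper name `street_mode` and the column `ASSET_SIZE_FILLED` are all invented for illustration.

```python
import numpy as np
import pandas as pd

def street_mode(sizes):
    """Most frequent non-null size on a street, or NaN if none is known."""
    modes = sizes.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

toy = pd.DataFrame({
    "STREET": ["MILL ST", "MILL ST", "MILL ST", "ANN ST", "ANN ST"],
    "ASSET_SIZE": [150.0, 150.0, np.nan, np.nan, np.nan],
})

# Broadcast each street's modal size back onto its rows, then use it
# only where ASSET_SIZE is missing.
per_street = toy.groupby("STREET")["ASSET_SIZE"].transform(street_mode)
toy["ASSET_SIZE_FILLED"] = toy["ASSET_SIZE"].fillna(per_street)
print(toy)
```

Streets with no known size at all stay null, so a global fallback would still be needed for those rows.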

In [64]:
asset_size = break_data[['STREET', 'ASSETID', 'ASSET_SIZE']]
# cherry picking null asset sizes to see which streets we could look at
asset_size.loc[asset_size.ASSET_SIZE.isna()].sample(10)

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
5180,MILL ST,26700,
8864,MILL ST,64030,
3230,MILL ST,26750,
5272,OTTAWA ST S,64330,
7164,FERGUS AVE,88992,
3571,WEBER ST E,40980,
3229,MILL ST,26750,
5324,STIRLING AVE S,36250,
5693,FLORENCE AVE,14690,
6992,WALKER ST,40340,


Next I'll check whether any asset ID appears with both null and non-null asset sizes. If so, it would be reasonably safe to impute the nulls with the known size for that ID. If not, I'll have to explore other options.
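
Rather than inspecting streets one at a time, the check can be run over every asset ID at once. This is a sketch on invented rows: the asset IDs are borrowed from the outputs in this notebook, but the sizes paired with them here are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ASSETID": [14690, 14690, 41030, 41030, 79238],
    "ASSET_SIZE": [np.nan, np.nan, 150.0, 150.0, np.nan],
})

# For every ASSETID, count rows with a known size and rows with a missing size.
per_id = toy.groupby("ASSETID")["ASSET_SIZE"].agg(
    known="count",  # count() skips NaN
    missing=lambda s: s.isna().sum(),
)

# IDs with both kinds of rows are the only ones where a null could be
# filled from another row that has the same ID.
recoverable = per_id[(per_id["known"] > 0) & (per_id["missing"] > 0)]
print(len(recoverable))
```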

In [65]:
asset_size[asset_size['STREET'] == 'FLORENCE AVE']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
97,FLORENCE AVE,14690,
98,FLORENCE AVE,14690,
99,FLORENCE AVE,14690,
100,FLORENCE AVE,14690,
101,FLORENCE AVE,14690,
...,...,...,...
9648,FLORENCE AVE,14700,
9649,FLORENCE AVE,14700,
9650,FLORENCE AVE,14700,
9651,FLORENCE AVE,14700,


Unfortunately, on this street no asset ID has both null and non-null sizes. Let's check a few more streets to see whether that pattern holds...

In [66]:
asset_size[asset_size['STREET'] == 'OTTAWA ST S']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
149,OTTAWA ST S,95460,
150,OTTAWA ST S,95460,
151,OTTAWA ST S,95460,
152,OTTAWA ST S,95460,
153,OTTAWA ST S,95460,
...,...,...,...
10560,OTTAWA ST S,58380,450.0
10561,OTTAWA ST S,58380,450.0
10562,OTTAWA ST S,58380,450.0
10563,OTTAWA ST S,58380,450.0


In [67]:
# display street names with no asset size
streets_with_na = asset_size.STREET[asset_size['ASSET_SIZE'].isna()].unique()
print(streets_with_na)

['MILL ST' 'FLORENCE AVE' 'OTTAWA ST S' 'WEBER ST E' 'OTTAWA ST N'
 'ST CLAIR AVE' 'BRIDGE ST E' 'FERGUS AVE' 'CORAL CRES' 'THALER AVE'
 'BOEHMER ST' 'MAUSSER AVE' 'VALEWOOD PL' 'WINDOM RD' 'REX DR'
 'NORFOLK CRES' 'KEHL ST' 'HEIMAN ST' 'HEBEL PL' 'GUERIN AVE' 'BECKER ST'
 'STIRLING AVE S' 'MAURICE ST' 'SOUTHILL DR' 'WALKER ST' 'EIGHTH AVE'
 'FAIRMOUNT RD' 'HUBER ST' 'GOLFVIEW PL' 'PATTANDON AVE' 'HOFFMAN ST'
 'SYDNEY ST S' 'ANN ST' 'EDWIN ST' 'SCHWEITZER ST']


In [68]:
asset_size[asset_size['STREET'] == 'WEBER ST E']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
107,WEBER ST E,41030,150.0
108,WEBER ST E,41030,150.0
347,WEBER ST E,79238,
348,WEBER ST E,79238,
349,WEBER ST E,79238,
...,...,...,...
9687,WEBER ST E,41030,150.0
10020,WEBER ST E,41030,150.0
10021,WEBER ST E,41030,150.0
10311,WEBER ST E,41030,150.0


Well, there we have it: each asset ID corresponds to a single asset size, whether that size is missing or not, so a missing size can't be copied from another row with the same ID.

I will fill in the `NaN` values with the most frequent number (mode) for now.

In [69]:
asset_size.ASSET_SIZE.value_counts()

150.0     6429
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

In [70]:
# fill asset size with the mode of the column
break_data['ASSET_SIZE'] = break_data['ASSET_SIZE'].fillna(break_data['ASSET_SIZE'].mode()[0])

In [71]:
break_data.ASSET_SIZE.value_counts()

150.0     7136
300.0     1635
200.0     1059
450.0      292
100.0      265
600.0      102
250.0       76
25.0        41
750.0       26
0.0         20
13.0        18
50.0        17
1200.0      14
Name: ASSET_SIZE, dtype: int64

In [72]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  726
ASSET_MATERIAL -  707
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  62
MAP_LABEL -  62
CATEGORY -  62
PIPE_SIZE -  62
MATERIAL -  62
LINED -  62
LINED_MATERIAL -  62
INSTALLATION_DATE -  63
BRIDGE_MAIN -  62
CRITICALITY -  62
UNDERSIZED -  62
SHALLOW_MAIN -  62
CONDITION_SCORE -  62
OVERSIZED -  62
CLEANED -  62
GlobalID -  62
Shape__Length -  62


In [73]:
break_data.ASSET_YEAR_INSTALLED.value_counts()

1958.0    526
1967.0    476
1966.0    467
1953.0    413
1960.0    404
         ... 
1990.0      2
1920.0      2
1945.0      1
1904.0      1
1997.0      1
Name: ASSET_YEAR_INSTALLED, Length: 94, dtype: int64

In [74]:
# compare the asset year installed with the installation date
break_data[['ASSET_YEAR_INSTALLED', 'INSTALLATION_DATE']].sample(10)

Unnamed: 0,ASSET_YEAR_INSTALLED,INSTALLATION_DATE
10208,1961.0,1970-01-01 00:15:59.817600
1044,1960.0,1969-12-31 23:54:44.380800
4309,1947.0,1970-01-01 00:26:37.104000
3508,1968.0,1969-12-31 23:59:17.923200
10144,1953.0,1969-12-31 23:51:03.542400
6926,1958.0,1969-12-31 23:53:41.308800
3957,1971.0,1970-01-01 00:08:56.457600
5044,1955.0,1970-01-01 00:07:21.763200
9358,1973.0,1970-01-01 00:01:34.694400
2027,1966.0,1969-12-31 23:57:53.769600
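
Every `INSTALLATION_DATE` above falls within minutes of 1970-01-01, which would be consistent with epoch values in milliseconds having been read as nanoseconds. That is an assumption, not something verified against the source database. Under that assumption, the sketch below takes two of the timestamps shown above, recovers the underlying integers, and re-reads them as milliseconds.

```python
import pandas as pd

# Two timestamps copied from the sample output above
mis_parsed = pd.Series(pd.to_datetime([
    "1969-12-31 23:54:44.380800",
    "1969-12-31 23:59:17.923200",
]))

# Recover the integers behind the timestamps (nanoseconds since the epoch)
raw = mis_parsed.astype("datetime64[ns]").astype("int64")

# Assumption: those integers were really milliseconds since the epoch
recovered = pd.to_datetime(raw, unit="ms")
print(recovered.dt.year.tolist())
```

For these two rows the recovered years agree with the `ASSET_YEAR_INSTALLED` values listed beside them (1960 and 1968). Not every row in the sample lines up this way, so the two columns may not always describe the same event.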


In [77]:
# show the rows with a missing asset year installed next to their installation date
break_data.loc[break_data['ASSET_YEAR_INSTALLED'].isna(), ['ASSET_YEAR_INSTALLED', 'INSTALLATION_DATE']].sample(30)

Unnamed: 0,ASSET_YEAR_INSTALLED,INSTALLATION_DATE
9614,,1970-01-01 00:26:39.782400
6969,,1970-01-01 00:26:00.729600
9093,,1970-01-01 00:26:07.296000
6507,,1970-01-01 00:27:31.363200
3222,,1970-01-01 00:26:01.075200
7165,,1969-12-31 23:52:38.150400
9058,,1970-01-01 00:26:31.315200
1459,,1970-01-01 00:27:10.454400
9547,,1970-01-01 00:26:00.729600
1410,,1970-01-01 00:27:02.505600


From the studies I have read, the year a pipe was installed (or rather, the age of the pipe) appears to be a critical factor in predicting breaks and time to failure. For that reason, I believe I will have to drop the observations where no installation year is recorded.
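
To make the link between installation year and pipe age concrete, here is a minimal sketch of how an age-at-break feature could be derived. It assumes `INCIDENT_DATE` holds epoch milliseconds, as the raw sample suggests; the column name `PIPE_AGE_AT_BREAK` and the pairing of dates with installation years are hypothetical.

```python
import pandas as pd

# Toy rows: incident dates taken from the raw sample, years chosen for illustration
toy = pd.DataFrame({
    "INCIDENT_DATE": [1236643200000, 1415232000000],  # epoch milliseconds
    "ASSET_YEAR_INSTALLED": [1958.0, 1967.0],
})

# Year in which each break occurred
incident_year = pd.to_datetime(toy["INCIDENT_DATE"], unit="ms").dt.year

# Age of the pipe, in years, at the time of the break (hypothetical feature name)
toy["PIPE_AGE_AT_BREAK"] = incident_year - toy["ASSET_YEAR_INSTALLED"]
print(toy)
```

A row without `ASSET_YEAR_INSTALLED` would produce `NaN` here, which is the practical reason those rows cannot contribute an age feature.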

In [78]:
break_data.dropna(axis=0, subset=['ASSET_YEAR_INSTALLED'], inplace=True)

In [79]:
print(break_data.shape)
print_null_values(break_data)

(9975, 42)
OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  0
ASSET_MATERIAL -  0
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  58
MAP_LABEL -  58
CATEGORY -  58
PIPE_SIZE -  58
MATERIAL -  58
LINED -  58
LINED_MATERIAL -  58
INSTALLATION_DATE -  59
BRIDGE_MAIN -  58
CRITICALITY -  58
UNDERSIZED -  58
SHALLOW_MAIN -  58
CONDITION_SCORE -  58
OVERSIZED -  58
CLEANED -  58
GlobalID -  58
Shape__Length -  58


In [80]:
for col in break_data.columns:
    if break_data[col].isnull().sum() > 0:
        print(col, break_data[col].unique())

PRESSURE_ZONE ['KIT 4' 'KIT 5' 'KIT 2E' 'KIT 6' 'BRIDGEPORT' nan 'KIT 2W' 'RAW NO ZONE'
 'WAT 4' 'MANNHEIM']
MAP_LABEL ['138.2m 150mm  CI' '4.7m 25mm  PVC' '1.4m 25mm  PVC' ...
 '88.1m 150mm  DI' '9.3m 150mm  DI' '75.5m 150mm  DI']
CATEGORY ['TREATED' 'UNTREATED' nan]
PIPE_SIZE [ 150.   25.  450.  200.  300.  750.  250.   nan  600.  100.    0.  900.
   50.  400. 1200.   38.]
MATERIAL ['CI' 'PVC' 'DI' 'HDPE' 'CPP' 'PVCB' 'ST' 'PVCO' nan 'AC' 'PVCF' 'COP']
LINED ['NO' nan 'YES']
LINED_MATERIAL ['NONE' nan 'EPS' 'CEMENT']
INSTALLATION_DATE ['1969-12-31T23:35:16.771200000' '1970-01-01T00:24:44.092800000'
 '1969-12-31T23:57:22.233600000' '1969-12-31T23:57:53.769600000'
 '1970-01-01T00:21:16.473600000' '1969-12-31T23:54:12.844800000'
 '1970-01-01T00:26:37.622400000' '1970-01-01T00:26:39.004800000'
 '1970-01-01T00:01:34.694400000' '1970-01-01T00:02:37.766400000'
 '1970-01-01T00:00:41.904000000' '1970-01-01T00:15:59.817600000'
 '1970-01-01T00:08:56.457600000' '1970-01-01T00:20:34.310400000'
 '

I am fairly sure that dropping the rows with a missing installation date, instead of imputing them, will also get rid of the rest of the missing values, since the remaining gaps (58 per column) appear to sit in the same rows.
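
That hunch can be tested before anything is dropped. The sketch below uses a hypothetical toy frame in place of `break_data`: it keeps only the rows that have an `INSTALLATION_DATE` and counts whatever missing values would remain.

```python
import numpy as np
import pandas as pd

# Toy stand-in where the gaps line up row by row; the values are illustrative only
toy = pd.DataFrame({
    "INSTALLATION_DATE": ["1960-01-01", None, "1968-09-01", None],
    "MATERIAL": ["CI", None, "DI", None],
    "PIPE_SIZE": [150.0, np.nan, 200.0, np.nan],
})

# Rows that would survive the drop
has_date = toy["INSTALLATION_DATE"].notna()

# Missing values left in those rows, per column
remaining_nulls = toy.loc[has_date].isna().sum()
print(remaining_nulls[remaining_nulls > 0])
```

An empty result means the drop clears every remaining gap; anything listed would still need imputing or a second drop.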

In [81]:
break_data.dropna(axis=0, subset=['INSTALLATION_DATE'], inplace=True)

In [82]:
print_null_values(break_data)

OBJECTID -  0
INCIDENT_DATE -  0
BREAK_TYPE -  0
HOUR_IMPACTED -  0
STATUS -  0
BREAK_NATURE -  0
BREAK_APPARENT_CAUSE -  0
POSITIVE_PRESSURE_MAINTANED -  0
AIR_GAP_MAINTANED -  0
DISINFECTED -  0
MECHANICAL_REMOVAL -  0
FLUSHING_EXCAVATION -  0
HIGHER_VELOCITY_FLUSHING -  0
ANODE_INSTALLED -  0
BREAK_CATEGORIZATION -  0
ROADSEGMENTID -  0
STREET -  0
ASSETID -  0
ASSET_SIZE -  0
ASSET_YEAR_INSTALLED -  0
ASSET_MATERIAL -  0
ASSET_EXISTS -  0
GLOBALID -  0
longitude -  0
latitude -  0
PRESSURE_ZONE -  0
MAP_LABEL -  0
CATEGORY -  0
PIPE_SIZE -  0
MATERIAL -  0
LINED -  0
LINED_MATERIAL -  0
INSTALLATION_DATE -  0
BRIDGE_MAIN -  0
CRITICALITY -  0
UNDERSIZED -  0
SHALLOW_MAIN -  0
CONDITION_SCORE -  0
OVERSIZED -  0
CLEANED -  0
GlobalID -  0
Shape__Length -  0


In [83]:
break_data.sample(15)

Unnamed: 0,OBJECTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,STATUS,BREAK_NATURE,BREAK_APPARENT_CAUSE,POSITIVE_PRESSURE_MAINTANED,AIR_GAP_MAINTANED,DISINFECTED,...,INSTALLATION_DATE,BRIDGE_MAIN,CRITICALITY,UNDERSIZED,SHALLOW_MAIN,CONDITION_SCORE,OVERSIZED,CLEANED,GlobalID,Shape__Length
3315,8850,1970-01-01 00:23:15.446400,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1969-12-31 23:46:51.081600,N,6.0,N,N,3.83,N,N,c6e70d57-8fd7-47e5-b3c4-9cf525cf9eb3,4.985097
8245,10099,1970-01-01 00:19:29.942400,MAIN,8-12 hours,REPAIR COMPLETED,CIRCUMFERENTIAL,CORROSION,Y,Y,Y,...,1969-12-31 23:58:25.305600,N,5.0,N,N,5.45,N,N,e2141449-5241-41a5-a223-494cea9a2cbd,143.744248
418,8105,1970-01-01 00:16:29.971200,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:20:01.824000,N,8.0,N,N,9.35,N,N,4946cabd-d51b-423e-8b56-1c48d18f487e,0.668591
5835,9479,1970-01-01 00:17:49.632000,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:05:47.155200,N,5.0,N,N,6.9,N,N,e2a46eec-187f-4876-a585-ccc2dc3b0367,131.176354
7872,9989,1970-01-01 00:21:06.192000,MAIN,8-12 hours,REPAIR COMPLETED,FITTING/JOINT,COMBINATION,Y,Y,Y,...,1970-01-01 00:25:56.064000,N,-1.0,N,N,9.85,N,N,a1d04f24-2540-45bb-9567-cf4f6b0a53dc,1.576866
10345,50241,1970-01-01 00:27:24.138442,MAIN,4-8 hours,CANCELLED,CIRCUMFERENTIAL,AGE,Y,Y,Y,...,1969-12-31 23:53:41.308800,N,5.0,N,N,7.75,N,N,7ef55bda-631c-49b8-be40-7cd5203b003f,0.995367
8928,10269,1970-01-01 00:25:45.350400,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1969-12-31 23:56:19.075200,N,7.0,N,N,7.85,N,N,63f50500-2d20-4e9e-909d-08400e99e2c6,0.60941
2024,8515,1970-01-01 00:16:54.422400,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1969-12-31 23:53:41.308800,N,4.0,N,N,4.55,N,N,d12d3c9e-510b-4e67-b28a-79184dc8889e,301.953636
7501,9886,1970-01-01 00:23:48.537600,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1970-01-01 00:25:35.760000,N,-1.0,N,N,9.85,N,Y,e3bf299b-975d-40e3-a4f5-c2e5a23355d2,442.68666
1831,8463,1970-01-01 00:15:49.881600,MAIN,8-12 hours,REPAIR COMPLETED,UNKNOWN,OTHER,Y,Y,Y,...,1969-12-31 23:58:25.305600,N,7.0,N,N,7.85,N,N,0a2f7b01-123c-4833-95a9-4861a74c8308,1.460324


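Lucky for us, dropping the rows without an installation year had already taken care of the missing values in the `ASSET_MATERIAL` column, and dropping the rows without an `INSTALLATION_DATE` cleared the rest. We finally have a clean dataset for some EDA and feature engineering.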

In [215]:
break_data.to_csv('../data/processed/cleaned_break_data.csv', index=False)