## Initial Pre-processing

This notebook demonstrates my pre-processing steps for water main break data. I download and import the most recently available data from the [regional database](https://open-kitchenergis.opendata.arcgis.com/datasets/KitchenerGIS::water-main-breaks/about) and get rid of the unnecessary columns so we can focus on the features that are going to be essential for our model later on.

Let's dive in!

In [1]:
# Basic imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
%matplotlib inline

In [5]:
# Import our most recent data and check out a sample

break_data = pd.read_csv("../data/Nov_10_22_Water_Main_Breaks.csv")
break_data.sample(10)

Unnamed: 0,X,Y,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,...,CIVIC_NUMBER,STREET,ASSETID,ASSET_DEPTH,FROST_DEPTH,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
320,-80.453139,43.444913,8305,683,2003/12/08 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,59,DALEWOOD DR,8890,,,150.0,1966.0,CI,Y,7df688cc-05ac-42fc-9763-1a74b668d279
416,-80.520601,43.418163,8401,899,2001/05/31 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,8,GLEN LAKE CRES,16550,,,150.0,1969.0,CI,Y,3cb15c2c-7fae-4c61-8ce4-78565bb35254
1858,-80.492316,43.452447,9843,2125,2015/04/23 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,120,DUKE ST W,10520,,,150.0,1974.0,CI,Y,10718ff2-9419-4e84-b48e-7abc5ba6972f
1941,-80.47096,43.452082,9926,1496,2010/08/04 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,38,GLENDALE RD,133118,,,150.0,2016.0,PVC,N,f4e6ab53-f688-4f8c-bed5-3bf59a1cb1f1
1382,-80.502692,43.429417,9367,396,2001/11/23 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,30,FOREST HILL DR,14950,,,150.0,1958.0,CI,Y,446cd53b-aa3e-49d1-9ab9-2d7cc64109d1
1547,-80.482027,43.43216,9532,363,2000/12/22 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,55,HOFFMAN ST,19630,,,150.0,1957.0,CI,Y,13e23d09-7ca1-47ad-b9a7-60f8b805601f
2328,-80.523013,43.438272,10313,2382,2018/04/18 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,95,VICMOUNT DR,92756,,,150.0,1971.0,DI,Y,ed9f4e6a-cfdf-4890-84cd-0f0342e4b54e
1760,-80.526047,43.441314,9745,869,2001/06/18 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,123,CHOPIN DR,7160,,,150.0,1968.0,DI,Y,19579309-33e3-4dbb-8552-c34f9f76fd8f
1504,-80.484829,43.443714,9489,1398,2009/07/31 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,114,MADISON AVE S,69383,,,150.0,2008.0,PVC,Y,5ff1f653-0cef-48ab-b58f-f0692ce0b646
1077,-80.514033,43.4257,9062,2070,2015/02/20 00:00:00+00,MAIN,Open,Open,8-12 hours,,...,156,RIPPLEWOOD CRES,88076,,,150.0,1960.0,CI,Y,95a8fe6e-042f-4371-b938-538140df60fa


In [6]:
break_data.shape

(2750, 52)

It seems like we have lots of columns that don't give us any useful information. Let's have a look and determine which columns we'd like to keep.

In [4]:
break_data.columns

Index(['X', 'Y', 'OBJECTID', 'WATBREAKINCIDENTID', 'INCIDENT_DATE',
       'BREAK_TYPE', 'ROAD_CLOSED', 'SIDEWALK_CLOSED', 'HOUR_IMPACTED',
       'UNITS_IMPACTED', 'CW_SERVICE_REQUEST', 'STATUS', 'STATUS_DATE',
       'WORKORDER', 'RETURN_TO_NORMAL', 'BREAK_NATURE', 'BREAK_APPARENT_CAUSE',
       'REPAIR_TYPE', 'NEW_SECTION_LENGTH', 'MAINTENANCE_DESC',
       'VALVES_CLOSED', 'VALVES_OPENED', 'HYDRANTS_CALLED_OUT',
       'HYDRANTS_CALLED_BACK_IN', 'POSITIVE_PRESSURE_MAINTANED',
       'AIR_GAP_MAINTANED', 'DISINFECTED', 'MECHANICAL_REMOVAL',
       'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING', 'ANODE_INSTALLED',
       'BREAK_CATEGORIZATION', 'BACTERIA_TESTING_DATE',
       'HEALTH_DEPT_NOTIFICATION', 'MOECC_SAC_NOTIFICATION',
       'SAC_REFERENCE_NO', 'LOCAL_MOE_OFFICE', 'BWA_DWA', 'BWA_DWA_DECLARED',
       'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'ROADSEGMENTID',
       'CIVIC_NUMBER', 'STREET', 'ASSETID', 'ASSET_DEPTH', 'FROST_DEPTH',
       'ASSET_SIZE', 'ASSET_YEAR_I

### Column description breakdown:

- Wat Break Incident ID
- Incident date
- Type of Asset Broken
- Does the road need to be closed?
- Does the sidewalk need to be closed?
- Estimated Hours for Repair
- Estimated Number of Units Impacted
- CW Service Request Number
- Current status of the break
- Status last updated date
- CW Workorder #
- Date operations was returned to normal service
- Nature of Break
- Apparent cause of break
- Repair Type
- New Section Length (m)
- Type of Planned Maintenance
- List Valves Closed
- List Hydrants Called Out
- List Hydrants Called Back In
- Positive Pressure Maintained?
- Air Gap Maintained through Repair Process?
- Pipe and Press Parts Disinfected?
- Mechanical removal of contaminants?
- Flushing into the excavation?
- Higher velocity flushing after repairs?
- Anode Installed?
- Categorization of the Break
- Bacteria Testing Date Taken
- Health Dept Notification
- MOECC/SAC Notification
- Local MOE Office Notification
- BWA/DWA declared
- IF BWA / DWA are issued, Date/Time declared
- Were procedures followed?
- Was a record change required?
- Road Segment ID
- Closest Civic Number
- Street
- Related Asset ID
- Related Asset Depth (m)
- Depth of Frost (m)
- Asset Size (cm)
- Year Asset Installed
- Asset Material
- Asset Exists

We should check for null values in the features to see whether it's worth imputing them or better to drop those features altogether.

In [6]:
break_data.isna().sum()

X                                 0
Y                                 0
OBJECTID                          0
WATBREAKINCIDENTID                0
INCIDENT_DATE                     0
BREAK_TYPE                        0
ROAD_CLOSED                       0
SIDEWALK_CLOSED                   0
HOUR_IMPACTED                     0
UNITS_IMPACTED                 2472
CW_SERVICE_REQUEST             2651
STATUS                            0
STATUS_DATE                     167
WORKORDER                      1174
RETURN_TO_NORMAL               2621
BREAK_NATURE                     89
BREAK_APPARENT_CAUSE            127
REPAIR_TYPE                    2567
NEW_SECTION_LENGTH             2662
MAINTENANCE_DESC               2670
VALVES_CLOSED                  2669
VALVES_OPENED                  2670
HYDRANTS_CALLED_OUT            2669
HYDRANTS_CALLED_BACK_IN        2669
POSITIVE_PRESSURE_MAINTANED       0
AIR_GAP_MAINTANED                 0
DISINFECTED                       0
MECHANICAL_REMOVAL          

In [7]:
break_data['UNITS_IMPACTED'] = break_data['UNITS_IMPACTED'].fillna(0)

In [8]:
break_data = break_data.rename(columns={'X': 'LONGITUDE', 'Y': 'LATITUDE'})

In [9]:
break_data['INCIDENT_DATE'] = pd.to_datetime(break_data['INCIDENT_DATE'])

In [10]:
break_data.head()

Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,...,CIVIC_NUMBER,STREET,ASSETID,ASSET_DEPTH,FROST_DEPTH,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
0,-80.484005,43.462939,1,2252,2017-12-01 15:15:00+00:00,MAIN,Partially Closed,Open,12-16 hours,47,...,125,LANCASTER ST W,134292,1.6,0.3,450.0,1937.0,CI,Y,3521d297-1a2e-4e7b-a071-fc53ed87e965
1,-80.515075,43.422742,7874,1311,2001-03-26 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,76,CLOVERDALE CRES,4101323,,,13.0,1965.0,XXX,Y,72445d62-16a8-43c1-9733-56b06015b077
2,-80.439811,43.445067,7875,1325,2006-09-06 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,47,WREN CRES,4099987,,,25.0,1967.0,XXX,Y,3bdc8931-31c0-4090-a07a-a6847781dd97
3,-80.510859,43.426478,7876,1328,2006-09-11 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,382,GREENBROOK DR,4642530,,,25.0,1964.0,PVC,Y,f75ad0b1-5b2a-4125-8ad5-2b9a037debd7
4,-80.45752,43.443201,7877,1308,2000-01-27 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,224,MONTGOMERY RD,4100648,,,25.0,1967.0,XXX,Y,5a3c5d03-0899-4899-95e7-278bc5cbb682


After seeing how many null values some columns have and the type of information they hold, it's safe to say that a lot can be dropped. Columns like `CW_SERVICE_REQUEST`, `WORKORDER`, `HEALTH_DEPT_NOTIFICATION`, `SAC_REFERENCE_NO`, `LOCAL_MOE_OFFICE`, `BWA_DWA`, and `BWA_DWA_DECLARED` all seem like they won't contribute much information, so we'll exclude those, and possibly more as we go through the data.

In [11]:
break_data = break_data.drop(['CW_SERVICE_REQUEST', 'WORKORDER', 'HEALTH_DEPT_NOTIFICATION',
                              'SAC_REFERENCE_NO', 'LOCAL_MOE_OFFICE', 'BWA_DWA', 'BWA_DWA_DECLARED',
                              'PROCEEDURES_FOLLOWED', 'RECORD_CHANGE_REQD', 'MOECC_SAC_NOTIFICATION',
                              'BACTERIA_TESTING_DATE'], axis=1)
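
As a cross-check on the hand-picked list, the share of missing values per column can also be used to flag candidates for removal. The sketch below is a minimal illustration on a made-up toy frame, not the method used in this notebook; `toy`, `MAX_MISSING`, and the 0.5 cutoff are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for break_data; the values are made up for illustration.
toy = pd.DataFrame({
    "BREAK_TYPE": ["MAIN", "SERVICE", "MAIN", "MAIN"],
    "WORKORDER": [np.nan, np.nan, np.nan, 1234.0],
    "STATUS_DATE": ["2019-07-24", None, "2019-07-30", "2019-08-03"],
})

# Fraction of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns where more than half of the rows are empty.
MAX_MISSING = 0.5
mostly_empty = missing_share[missing_share > MAX_MISSING].index.tolist()

# Drop the flagged columns and keep the rest.
trimmed = toy.drop(columns=mostly_empty)

print(missing_share)
print(trimmed.columns.tolist())
```

A threshold like this only flags sparsity; whether a column is actually informative for the model is still a judgment call.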

In [12]:
print(break_data.shape)
break_data.sample(5)

(2750, 41)


Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,...,CIVIC_NUMBER,STREET,ASSETID,ASSET_DEPTH,FROST_DEPTH,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
1758,-80.457321,43.458436,9743,1335,2009-01-10 00:00:00+00:00,MAIN,Open,Open,8-12 hours,0,...,,,22910,,,150.0,1955.0,CI,Y,58888261-25da-4173-b479-a0a6f5ae2fdc
1050,-80.469302,43.442237,9035,164,2005-03-08 00:00:00+00:00,MAIN,Open,Open,8-12 hours,0,...,39.0,BRICK ST,4430,,,150.0,1952.0,DI,N,b9d05109-067f-4057-ab3f-8f348f4bed62
1400,-80.468688,43.457844,9385,153,2003-02-21 00:00:00+00:00,MAIN,Open,Open,8-12 hours,0,...,164.0,ANN ST,660,,,150.0,1952.0,CI,Y,69d8f3ed-aaa0-4a06-88ce-cc650a656dc7
2498,-80.432683,43.435581,17921,192776,2019-12-31 05:55:11+00:00,MAIN,Open,Open,4-8 hours,0-50,...,101.0,THALER AVE,85012,,,150.0,1967.0,CI,Y,62782c43-a375-4e0d-8063-e5790dbc9ba3
1631,-80.456813,43.43424,9616,344,2000-12-18 00:00:00+00:00,MAIN,Open,Open,8-12 hours,0,...,55.0,FIRST AVE,14390,,,150.0,1957.0,CI,Y,05c0f3e3-b7b0-4a0f-b402-f6bafce414ef


How many null values do we still need to take care of?

In [14]:
break_data.isnull().sum()

LONGITUDE                         0
LATITUDE                          0
OBJECTID                          0
WATBREAKINCIDENTID                0
INCIDENT_DATE                     0
BREAK_TYPE                        0
ROAD_CLOSED                       0
SIDEWALK_CLOSED                   0
HOUR_IMPACTED                     0
UNITS_IMPACTED                    0
STATUS                            0
STATUS_DATE                     167
RETURN_TO_NORMAL               2660
BREAK_NATURE                    117
BREAK_APPARENT_CAUSE            168
REPAIR_TYPE                    2601
NEW_SECTION_LENGTH             2733
MAINTENANCE_DESC               2750
VALVES_CLOSED                  2749
VALVES_OPENED                  2750
HYDRANTS_CALLED_OUT            2748
HYDRANTS_CALLED_BACK_IN        2748
POSITIVE_PRESSURE_MAINTANED       0
AIR_GAP_MAINTANED                 0
DISINFECTED                       0
MECHANICAL_REMOVAL                0
FLUSHING_EXCAVATION               0
HIGHER_VELOCITY_FLUSHING    

We can still drop more columns that aren't going to help us when it comes to building a model and making predictions. The ones we're going to drop don't describe the breaks themselves; they mostly just record a status. We can go ahead and drop the following:

In [16]:
break_data = break_data.drop(['STATUS_DATE', 'MAINTENANCE_DESC', 'VALVES_CLOSED', 'VALVES_OPENED',
                              'HYDRANTS_CALLED_OUT', 'HYDRANTS_CALLED_BACK_IN'], axis=1)

In [17]:
break_data.head()

Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,ROAD_CLOSED,SIDEWALK_CLOSED,HOUR_IMPACTED,UNITS_IMPACTED,...,CIVIC_NUMBER,STREET,ASSETID,ASSET_DEPTH,FROST_DEPTH,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
0,-80.484005,43.462939,1,2252,2017-12-01 15:15:00+00:00,MAIN,Partially Closed,Open,12-16 hours,47,...,125,LANCASTER ST W,134292,1.6,0.3,450.0,1937.0,CI,Y,3521d297-1a2e-4e7b-a071-fc53ed87e965
1,-80.515075,43.422742,7874,1311,2001-03-26 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,76,CLOVERDALE CRES,4101323,,,13.0,1965.0,XXX,Y,72445d62-16a8-43c1-9733-56b06015b077
2,-80.439811,43.445067,7875,1325,2006-09-06 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,47,WREN CRES,4099987,,,25.0,1967.0,XXX,Y,3bdc8931-31c0-4090-a07a-a6847781dd97
3,-80.510859,43.426478,7876,1328,2006-09-11 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,382,GREENBROOK DR,4642530,,,25.0,1964.0,PVC,Y,f75ad0b1-5b2a-4125-8ad5-2b9a037debd7
4,-80.45752,43.443201,7877,1308,2000-01-27 00:00:00+00:00,SERVICE,Open,Open,8-12 hours,0,...,224,MONTGOMERY RD,4100648,,,25.0,1967.0,XXX,Y,5a3c5d03-0899-4899-95e7-278bc5cbb682


In [18]:
break_data = break_data.drop(['RETURN_TO_NORMAL', 'REPAIR_TYPE', 'NEW_SECTION_LENGTH', 'CIVIC_NUMBER'], axis=1)

In [19]:
break_data.shape

(2750, 31)

In [None]:
break_data.isna().sum()

We've cut down our features from 52 to 31 so far. I wish there were fewer values missing from `ASSET_DEPTH` and `FROST_DEPTH` because I think they would be valuable for our analysis, but unfortunately I will have to get rid of them as well since almost all of the values are missing.

None of the studies I've read used features related to asset depth or frost depth, so my hypothesis that they would provide valuable information could simply be wrong anyway.

In [20]:
break_data = break_data.drop(['ASSET_DEPTH', 'FROST_DEPTH'], axis=1)

In [21]:
break_data.isna().sum()

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
ROAD_CLOSED                      0
SIDEWALK_CLOSED                  0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                   117
BREAK_APPARENT_CAUSE           168
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
DISINFECTED                      0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION           133
ROADSEGMENTID                    0
STREET                          16
ASSETID                          0
ASSET_SIZE                     161
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID            

Let's investigate the types of values there are in the `BREAK_APPARENT_CAUSE` variable and `BREAK_NATURE` variable. There aren't a lot of missing values so I might be able to easily impute them.

In [23]:
break_data.BREAK_APPARENT_CAUSE.unique()

array(['AGE', 'OTHER', 'COMBINATION', 'CORROSION', 'SOILS', 'UNKNOWN',
       'PRESSURE', 'FAULTY INSTALL', nan], dtype=object)

In [None]:
break_data.BREAK_APPARENT_CAUSE.value_counts()

I'll fill the `NaN` values with 'UNKNOWN'.

In [24]:
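# Assign the result back instead of using inplace=True on a column accessed from the frame
break_data['BREAK_APPARENT_CAUSE'] = break_data['BREAK_APPARENT_CAUSE'].fillna('UNKNOWN')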
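
When several categorical columns get the same placeholder, the fill can be done in one assignment. This is a minimal sketch on a toy frame with made-up values; `toy` and `categorical_cols` are names introduced only for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made-up values).
toy = pd.DataFrame({
    "BREAK_NATURE": ["CORROSION", np.nan, "LONGITUDINAL"],
    "BREAK_APPARENT_CAUSE": [np.nan, "AGE", np.nan],
    "ASSET_SIZE": [150.0, np.nan, 300.0],
})

# Only the categorical columns get the 'UNKNOWN' placeholder;
# the numeric column is left alone so it can be imputed separately.
categorical_cols = ["BREAK_NATURE", "BREAK_APPARENT_CAUSE"]
toy[categorical_cols] = toy[categorical_cols].fillna("UNKNOWN")

print(toy.isna().sum())
```

Assigning the filled columns back to the frame avoids relying on `inplace=True` through a column accessor, which newer pandas versions warn about.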

In [26]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      2005
CIRCUMFERENTIAL                               370
CORROSION                                      87
FITTING/JOINT                                  55
LONGITUDINAL                                   29
CIRCUMFERENTIAL AND FITTING/JOINT              26
CORROSION AND CIRCUMFERENTIAL                  18
OTHER                                          14
OTHER: WATER SERVICE                            9
CORROSION AND LONGITUDINAL                      9
CORROSION AND FITTING/JOINT                     6
FITTING/JOINT AND LONGITUDINAL                  4
CORROSION - ROBAR SADDLE CORRODED AT SEAM       1
Name: BREAK_NATURE, dtype: int64

In [27]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER: WATER SERVICE': 'WATER SERVICE'})

In [28]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      2005
CIRCUMFERENTIAL                               370
CORROSION                                      87
FITTING/JOINT                                  55
LONGITUDINAL                                   29
CIRCUMFERENTIAL AND FITTING/JOINT              26
CORROSION AND CIRCUMFERENTIAL                  18
OTHER                                          14
WATER SERVICE                                   9
CORROSION AND LONGITUDINAL                      9
CORROSION AND FITTING/JOINT                     6
FITTING/JOINT AND LONGITUDINAL                  4
CORROSION - ROBAR SADDLE CORRODED AT SEAM       1
Name: BREAK_NATURE, dtype: int64

In [29]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER', nan,
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [30]:
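# Assign the result back instead of using inplace=True on a column accessed from the frame
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].fillna('UNKNOWN')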

In [31]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      2122
CIRCUMFERENTIAL                               370
CORROSION                                      87
FITTING/JOINT                                  55
LONGITUDINAL                                   29
CIRCUMFERENTIAL AND FITTING/JOINT              26
CORROSION AND CIRCUMFERENTIAL                  18
OTHER                                          14
WATER SERVICE                                   9
CORROSION AND LONGITUDINAL                      9
CORROSION AND FITTING/JOINT                     6
FITTING/JOINT AND LONGITUDINAL                  4
CORROSION - ROBAR SADDLE CORRODED AT SEAM       1
Name: BREAK_NATURE, dtype: int64

In [32]:
def print_null(df):
    return df.isna().sum()

In [33]:
print_null(break_data)

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
ROAD_CLOSED                      0
SIDEWALK_CLOSED                  0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                     0
BREAK_APPARENT_CAUSE             0
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
DISINFECTED                      0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION           133
ROADSEGMENTID                    0
STREET                          16
ASSETID                          0
ASSET_SIZE                     161
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID            

Most research papers that I've read so far really only include pipe attributes (length, width, diameter, etc.), last break incident, type of break, number of previous breaks, material of pipe, as well as other data not directly related to the pipe. This includes soil moisture, soil resistivity, soil corrosivity, average temperature, and other attributes like this.

Disregarding the data about soil, there are still a decent number of features here that don't seem like they would contribute much information to the model.

In [34]:
break_data.columns

Index(['LONGITUDE', 'LATITUDE', 'OBJECTID', 'WATBREAKINCIDENTID',
       'INCIDENT_DATE', 'BREAK_TYPE', 'ROAD_CLOSED', 'SIDEWALK_CLOSED',
       'HOUR_IMPACTED', 'UNITS_IMPACTED', 'STATUS', 'BREAK_NATURE',
       'BREAK_APPARENT_CAUSE', 'POSITIVE_PRESSURE_MAINTANED',
       'AIR_GAP_MAINTANED', 'DISINFECTED', 'MECHANICAL_REMOVAL',
       'FLUSHING_EXCAVATION', 'HIGHER_VELOCITY_FLUSHING', 'ANODE_INSTALLED',
       'BREAK_CATEGORIZATION', 'ROADSEGMENTID', 'STREET', 'ASSETID',
       'ASSET_SIZE', 'ASSET_YEAR_INSTALLED', 'ASSET_MATERIAL', 'ASSET_EXISTS',
       'GLOBALID'],
      dtype='object')

`ROAD_CLOSED` and `SIDEWALK_CLOSED` can be removed. The same goes for `DISINFECTED`, since whether the pipe was disinfected during the repair describes the response to a break rather than anything that would contribute to one.

In [35]:
break_data.drop(['ROAD_CLOSED', 'SIDEWALK_CLOSED', 'DISINFECTED'], inplace=True, axis=1)

In [36]:
print_null(break_data)

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                     0
BREAK_APPARENT_CAUSE             0
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION           133
ROADSEGMENTID                    0
STREET                          16
ASSETID                          0
ASSET_SIZE                     161
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID                         0
dtype: int64

In [37]:
break_data[break_data['BREAK_CATEGORIZATION'].isna()]


Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,UNITS_IMPACTED,STATUS,BREAK_NATURE,...,ANODE_INSTALLED,BREAK_CATEGORIZATION,ROADSEGMENTID,STREET,ASSETID,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
2453,-80.460827,43.429301,10561,136193,2019-07-24 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,OTHER,...,Y,,13313,BONIFACE AVE,90214,150.0,1958.0,CI,Y,8fc8a656-31c0-4774-890b-2ff3aa443c2e
2454,-80.505538,43.439726,10562,136194,2019-07-26 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,OTHER,...,N,,1699,MARLBOROUGH AVE,25510,25.0,1950.0,COP,Y,fbaab3fb-8994-4aa6-b141-054b05e344c0
2455,-80.474402,43.459717,10563,136195,2019-07-30 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,OTHER,...,Y,,12694,EDNA ST,59990,150.0,1950.0,CI,Y,d1b36a22-add1-4205-8fa1-69001fba6950
2456,-80.468793,43.461157,10564,136196,2019-08-03 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,LONGITUDINAL,...,Y,,12675,BRUCE ST,4920,150.0,1952.0,CI,Y,cbd203f2-642b-4fd1-ad1a-2e274b45ad55
2457,-80.486056,43.451335,10565,136197,2019-08-14 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,OTHER,...,Y,,6759,WEBER ST E,95308,,,,N,3f9a68c2-a9d9-4884-84c3-ac09ee036fff
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2738,-80.515022,43.432726,62081,147074,2022-08-25 19:54:14+00:00,MAIN,4-8 hours,0-50,REPAIR COMPLETED,UNKNOWN,...,Y,,20727,CECILE DR,5830,150.0,1966.0,DI,Y,4bd70200-6338-4498-b693-d44f366fb061
2739,-80.482765,43.481979,62401,147094,2022-08-29 10:06:14+00:00,MAIN,4-8 hours,0-50,REPAIR COMPLETED,UNKNOWN,...,Y,,13115,LANCASTER ST W,23240,450.0,1968.0,DI,Y,bf6846f3-5484-457b-9694-37c8a73f5517
2740,-80.524039,43.417825,62402,147096,2022-08-29 17:01:37+00:00,MAIN,4-8 hours,0-50,REPAIR COMPLETED,UNKNOWN,...,Y,,5437,FORESTWOOD DR,15000,150.0,1969.0,DI,Y,4df3a96a-1c13-48aa-8638-98d3bd34e710
2741,-80.539546,43.411365,62721,147134,2022-09-14 11:12:46+00:00,MAIN,4-8 hours,0-50,REPAIR COMPLETED,UNKNOWN,...,Y,,21084,STONEHENGE PL,87420,150.0,1986.0,DI,Y,ad381529-00af-45fd-b396-5508bcb62f09


In [38]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    2603
CATEGORY 2      14
Name: BREAK_CATEGORIZATION, dtype: int64

In [39]:
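# Assign the result back instead of using inplace=True on a column accessed from the frame
break_data['BREAK_CATEGORIZATION'] = break_data['BREAK_CATEGORIZATION'].fillna('UNKNOWN')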

In [40]:
break_data.BREAK_CATEGORIZATION.value_counts()

CATEGORY 1    2603
UNKNOWN        133
CATEGORY 2      14
Name: BREAK_CATEGORIZATION, dtype: int64
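
By default `value_counts()` leaves `NaN` out of the tally, which is why the missing rows only show up in the counts once they have been filled. A minimal sketch on a made-up series (the values are illustrative, not taken from the data) shows how `dropna=False` surfaces them directly:

```python
import numpy as np
import pandas as pd

# Made-up categorization values with two missing entries.
categorization = pd.Series(
    ["CATEGORY 1", "CATEGORY 1", np.nan, "CATEGORY 2", np.nan, "CATEGORY 1"],
    name="BREAK_CATEGORIZATION",
)

# Default behaviour: NaN rows are silently excluded.
counts_default = categorization.value_counts()

# dropna=False adds a NaN row to the tally.
counts_with_nan = categorization.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```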

We looked at `BREAK_NATURE` earlier in this notebook, but I'm just noticing now that the feature has both 'UNKNOWN' and 'OTHER' as values. We need to convert these to one or the other so everything stays consistent. Let's first see all of the unique values again, and then fix this issue. Since 'UNKNOWN' is by far the most common value, we'll convert 'OTHER' to that.

In [41]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL', 'OTHER',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [42]:
break_data['BREAK_NATURE'] = break_data['BREAK_NATURE'].replace({'OTHER': 'UNKNOWN'})
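
Both label fixes for `BREAK_NATURE` can be expressed as one mapping. The sketch below uses a made-up series; `label_map` is a name introduced for the example. With a plain dictionary, `Series.replace` matches whole values rather than substrings, so mapping 'OTHER' does not touch 'OTHER: WATER SERVICE'.

```python
import pandas as pd

# Made-up sample of BREAK_NATURE labels.
nature = pd.Series(
    ["OTHER", "UNKNOWN", "OTHER: WATER SERVICE", "CORROSION", "OTHER"],
    name="BREAK_NATURE",
)

# Each key is matched against the full cell value, not as a substring.
label_map = {
    "OTHER: WATER SERVICE": "WATER SERVICE",
    "OTHER": "UNKNOWN",
}
cleaned = nature.replace(label_map)

print(cleaned.value_counts())
```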

In [43]:
break_data.BREAK_NATURE.unique()

array(['CORROSION AND FITTING/JOINT', 'UNKNOWN', 'CORROSION',
       'CIRCUMFERENTIAL', 'CORROSION AND CIRCUMFERENTIAL',
       'FITTING/JOINT', 'CIRCUMFERENTIAL AND FITTING/JOINT',
       'LONGITUDINAL', 'WATER SERVICE', 'CORROSION AND LONGITUDINAL',
       'FITTING/JOINT AND LONGITUDINAL',
       'CORROSION - ROBAR SADDLE CORRODED AT SEAM'], dtype=object)

In [44]:
break_data.BREAK_NATURE.value_counts()

UNKNOWN                                      2136
CIRCUMFERENTIAL                               370
CORROSION                                      87
FITTING/JOINT                                  55
LONGITUDINAL                                   29
CIRCUMFERENTIAL AND FITTING/JOINT              26
CORROSION AND CIRCUMFERENTIAL                  18
WATER SERVICE                                   9
CORROSION AND LONGITUDINAL                      9
CORROSION AND FITTING/JOINT                     6
FITTING/JOINT AND LONGITUDINAL                  4
CORROSION - ROBAR SADDLE CORRODED AT SEAM       1
Name: BREAK_NATURE, dtype: int64

In [45]:
print_null(break_data)

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                     0
BREAK_APPARENT_CAUSE             0
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION             0
ROADSEGMENTID                    0
STREET                          16
ASSETID                          0
ASSET_SIZE                     161
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID                         0
dtype: int64

In [46]:
no_street = break_data.loc[break_data.STREET.isna()]
no_street

Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,UNITS_IMPACTED,STATUS,BREAK_NATURE,...,ANODE_INSTALLED,BREAK_CATEGORIZATION,ROADSEGMENTID,STREET,ASSETID,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
155,-80.573004,43.443999,8140,1405,2009-10-22 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,6b0439bb-b9db-4144-b228-43b612d85586
293,-80.47416,43.471444,8278,976,2001-03-14 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,604217,,32860,200.0,1974.0,DI,Y,b3b77f1f-124e-4653-bb23-32aa195500b9
493,-80.427697,43.39008,8478,1978,2014-01-10 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,50156,,91992,200.0,1977.0,DI,Y,6b29d708-e81b-4808-877c-31c686c9fd8c
693,-80.573004,43.443999,8678,505,2011-12-15 15:38:27+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,7ce0a441-0385-45b6-b915-6c6d1a8c0876
1443,-80.573004,43.443999,9428,1330,2011-12-15 15:38:28+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,54d89f5a-f3c7-4a9f-9a64-80a6696ff96b
1607,-80.573004,43.443999,9592,1329,2011-12-15 15:38:28+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,96d2114a-1e34-439c-8409-a42c298fe957
1616,-80.573004,43.443999,9601,1403,2009-10-14 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,f79cd49f-967d-478a-97fc-b4ac000d5d24
1758,-80.457321,43.458436,9743,1335,2009-01-10 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,603891,,22910,150.0,1955.0,CI,Y,58888261-25da-4173-b479-a0a6f5ae2fdc
2039,-80.573004,43.443999,10024,725,2011-12-15 15:38:27+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,73973bb7-ceba-45aa-b61f-f4900024a35d
2052,-80.573004,43.443999,10037,7,2011-12-15 15:38:26+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,0,,0,,,,N,0ddf6145-505e-4b7b-bd1b-200fafd74b6c


Looking at the latitude and longitude of the missing street addresses, there are actually only 5 distinct missing addresses. We can plot them on a map and see where they are located.

In [47]:
import plotly
import plotly.express as px

In [48]:
# The Mapbox token is kept in a local config module so it stays out of the notebook
from config import token
px.set_mapbox_access_token(token)
fig = px.scatter_mapbox(data_frame=no_street, lat='LATITUDE', lon='LONGITUDE')
fig.update_layout(mapbox_style="carto-positron", mapbox_accesstoken=token)
fig.show()
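
The number of distinct locations can also be counted directly instead of read off the table. This is a minimal sketch using only the coordinates of a few of the missing-street rows shown above; `no_street_sample` and `n_breaks` are names introduced for the example.

```python
import pandas as pd

# Coordinates copied from a handful of the missing-street rows shown above.
no_street_sample = pd.DataFrame({
    "LONGITUDE": [-80.573004, -80.47416, -80.427697, -80.573004, -80.457321],
    "LATITUDE": [43.443999, 43.471444, 43.39008, 43.443999, 43.458436],
})

# One row per distinct coordinate pair, with the number of breaks recorded there.
distinct_locations = (
    no_street_sample
    .groupby(["LONGITUDE", "LATITUDE"])
    .size()
    .reset_index(name="n_breaks")
    .sort_values("n_breaks", ascending=False)
)

print(distinct_locations)
```

Running the same grouping on the full `no_street` frame would give the count of distinct locations for all 16 rows.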

Plugging the latitudes and longitudes of the missing streets into a reverse geocoding website gives us the following addresses, which we can write back into the data.

In [None]:
break_data.loc[(break_data['LONGITUDE'] == -80.573004) & (break_data['LATITUDE'] == 43.443999), 'STREET'] = 'ERB ST'
break_data.loc[(break_data['LONGITUDE'] == -80.427697) & (break_data['LATITUDE'] == 43.390080), 'STREET'] = 'OLD CARRIAGE DR'
break_data.loc[(break_data['LONGITUDE'] == -80.457321) & (break_data['LATITUDE'] == 43.458436), 'STREET'] = 'KRUG ST'
break_data.loc[(break_data['LONGITUDE'] == -80.294712) & (break_data['LATITUDE'] == 43.538106), 'STREET'] = 'SPEEDVALE AVE W'
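
Exact `==` comparisons on floating-point coordinates only match when the stored value is bit-for-bit identical to the literal. A more forgiving variant, sketched here on a toy frame, matches within a small tolerance using `np.isclose`. The street names come from the reverse-geocoding step above; `toy`, `known_streets`, `TOLERANCE`, and the 'KING ST' row are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: three rows without a street plus one made-up row that already has one.
toy = pd.DataFrame({
    "LONGITUDE": [-80.573004, -80.427697, -80.457321, -80.500000],
    "LATITUDE": [43.443999, 43.390080, 43.458436, 43.400000],
    "STREET": [None, None, None, "KING ST"],
})

# (longitude, latitude) -> street name from the reverse-geocoding lookup.
known_streets = {
    (-80.573004, 43.443999): "ERB ST",
    (-80.427697, 43.390080): "OLD CARRIAGE DR",
    (-80.457321, 43.458436): "KRUG ST",
}

TOLERANCE = 1e-6  # hypothetical absolute tolerance, in degrees

for (lon, lat), street in known_streets.items():
    # Boolean array: rows whose coordinates fall within the tolerance of this location.
    at_location = (
        np.isclose(toy["LONGITUDE"], lon, rtol=0, atol=TOLERANCE)
        & np.isclose(toy["LATITUDE"], lat, rtol=0, atol=TOLERANCE)
    )
    # Only fill rows that are still missing a street name.
    toy.loc[toy["STREET"].isna() & at_location, "STREET"] = street

print(toy)
```

Keeping the lookup in a dictionary also makes it easy to add further locations later without repeating the selection logic.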

Let's fill any streets that are still `NaN` with 'UNKNOWN'.

In [49]:
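# Assign the result back instead of using inplace=True on a column accessed from the frame
break_data['STREET'] = break_data['STREET'].fillna('UNKNOWN')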

In [50]:
print_null(break_data)

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                     0
BREAK_APPARENT_CAUSE             0
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION             0
ROADSEGMENTID                    0
STREET                           0
ASSETID                          0
ASSET_SIZE                     161
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID                         0
dtype: int64

In [51]:
break_data.ASSET_SIZE.value_counts()

150.0     1808
300.0      319
200.0      289
100.0       58
450.0       51
600.0       17
25.0        13
250.0       12
50.0         7
13.0         5
1200.0       5
0.0          3
750.0        2
Name: ASSET_SIZE, dtype: int64

So there are 161 missing values for `ASSET_SIZE`. We could simply fill the null values with the mean value, the most frequent value, or just set them to 0. Another approach, which I feel might be the most appropriate, is to visualize on a map where each asset size occurs, grouped by street name, check whether the null values fall on the same streets as known asset sizes, and then fill in each missing size from the neighbours it is near.

In [52]:
# pd.set_option("display.max_rows", None)
break_data.loc[break_data.ASSET_SIZE.isna()]

Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,UNITS_IMPACTED,STATUS,BREAK_NATURE,...,ANODE_INSTALLED,BREAK_CATEGORIZATION,ROADSEGMENTID,STREET,ASSETID,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
8,-80.486628,43.436634,7881,2047,2015-03-12 00:00:00+00:00,SERVICE,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,11399,MILL ST,4101475,,,,N,536f67ed-4ec6-470e-b0d2-7e916e639d6a
22,-80.483756,43.435221,8007,67,2003-04-28 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,11413,MILL ST,26730,,,,N,2b7d9979-3508-4727-b20d-54a50b6a4e9c
27,-80.435665,43.431450,8012,1113,2007-03-07 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,12095,FLORENCE AVE,14690,,,,N,d9a5c6eb-e57a-4e2d-84b0-0962f6faa35e
43,-80.479748,43.431306,8028,1384,2009-04-30 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,11743,OTTAWA ST S,95460,,,,N,e9f999c9-9b13-4951-aee7-baea12cd9d9e
83,-80.480002,43.431070,8068,266,1999-01-27 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,11743,OTTAWA ST S,29660,,,,N,1c8c083c-e54d-44db-8b6c-91973db0604c
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2577,-80.489597,43.467957,33925,140294,2021-01-16 08:30:32+00:00,MAIN,4-8 hours,0-50,CANCELLED,UNKNOWN,...,Y,UNKNOWN,10427,BOEHMER ST,3680,,,,Y,4ab4f0f3-960e-4a8d-882e-57dbec087fbe
2578,-80.489597,43.467885,33926,140296,2021-01-16 08:31:03+00:00,MAIN,4-8 hours,0-50,REPAIR COMPLETED,UNKNOWN,...,Y,UNKNOWN,10430,BOEHMER ST,3680,,,,N,72a3e429-c580-4d7b-a8a5-ef6e81f20639
2595,-80.294712,43.538106,36482,140716,2021-02-17 09:53:29+00:00,MAIN,20-24 hours,0-50,CANCELLED,UNKNOWN,...,Y,UNKNOWN,0,UNKNOWN,0,,,,Y,c3e03c5c-1489-4eb2-88e4-645ea92bdf8c
2596,-80.294712,43.538106,36483,140718,2021-02-17 10:03:55+00:00,MAIN,20-24 hours,0-50,CANCELLED,UNKNOWN,...,Y,UNKNOWN,0,UNKNOWN,0,,,,Y,1ceaeaa1-570c-4a84-ab3a-8c0b1568d6af


Unfortunately, 13 of the previous 16 null values for street names occur in this set. That's okay for those ones: it just means that we might end up with 13 null values instead of the current 161. It's better to figure out how to fill in the majority of the values and worry about those few afterwards.

What I will try next is a breakdown of the asset sizes, listed with their asset IDs and corresponding street names, to see if we can match any non-null asset sizes to null ones on the same street. One possibility is to impute the average asset size from the same street for the null values. For now I am just hypothesizing; I'm not entirely sure it'll work, but why not try, right?
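As a rough sketch of what street-level imputation could look like, here is a toy example. The data is made up (only the column names match `break_data`), `street_mode` and `ASSET_SIZE_FILLED` are hypothetical names, and I've swapped the average for the per-street mode, on the assumption that pipe diameters are nominal sizes and an average could produce a size that doesn't exist.

```python
import pandas as pd

# Toy frame with the same column names as break_data; the values are made up.
toy = pd.DataFrame({
    "STREET": ["A ST", "A ST", "A ST", "B AVE", "B AVE", "C RD"],
    "ASSET_SIZE": [150.0, 150.0, None, 300.0, None, None],
})


def street_mode(sizes: pd.Series) -> float:
    """Most frequent known size on a street, or NaN if no size is known."""
    modes = sizes.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else float("nan")


# Broadcast each street's mode back onto its rows, then fill only the gaps.
per_street = toy.groupby("STREET")["ASSET_SIZE"].transform(street_mode)
toy["ASSET_SIZE_FILLED"] = toy["ASSET_SIZE"].fillna(per_street)

print(toy)
```

Streets with no known size at all (like `C RD` here) stay null, so a fallback would still be needed for those.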

In [53]:
asset_size = break_data[['STREET', 'ASSETID', 'ASSET_SIZE']]
# sample some null asset sizes to see which streets we could look at
asset_size.loc[asset_size.ASSET_SIZE.isna()].sample(10)

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
1518,FLORENCE AVE,14700,
1852,REX DR,32260,
420,REX DR,32260,
2417,SCHWEITZER ST,441,
435,OTTAWA ST N,29220,
2102,UNKNOWN,0,
1292,VALEWOOD PL,38960,
1789,WALKER ST,40340,
2597,UNKNOWN,0,
1056,OTTAWA ST S,29680,


Next I'll see whether any asset ID appears with both null and non-null asset sizes. If there are any matches, it should be safe to impute the null values with the known size for the same ID. If not, I'll have to explore other options.

In [54]:
asset_size[asset_size['STREET'] == 'FLORENCE AVE']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
27,FLORENCE AVE,14690,
147,FLORENCE AVE,14700,
244,FLORENCE AVE,14680,
288,FLORENCE AVE,14690,
738,FLORENCE AVE,14680,
1313,FLORENCE AVE,14680,
1410,FLORENCE AVE,14700,
1454,FLORENCE AVE,14690,
1518,FLORENCE AVE,14700,
1734,FLORENCE AVE,14690,


Unfortunately, every row for this street is missing its asset size, so there are no matching IDs to borrow from. Let's check a few more streets to see whether this trend holds...

In [55]:
asset_size[asset_size['STREET'] == 'OTTAWA ST S']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
43,OTTAWA ST S,95460,
83,OTTAWA ST S,29660,
395,OTTAWA ST S,64330,
490,OTTAWA ST S,95460,
523,OTTAWA ST S,29630,150.0
596,OTTAWA ST S,29620,150.0
699,OTTAWA ST S,29620,150.0
748,OTTAWA ST S,95460,
753,OTTAWA ST S,29660,
805,OTTAWA ST S,29630,150.0


In [56]:
# display street names with no asset size
streets_with_na = asset_size.STREET[asset_size['ASSET_SIZE'].isna()].unique()

In [57]:
print(streets_with_na)

['MILL ST' 'FLORENCE AVE' 'OTTAWA ST S' 'WEBER ST E' 'OTTAWA ST N'
 'UNKNOWN' 'ST CLAIR AVE' 'BRIDGE ST E' 'CORAL CRES' 'BOEHMER ST'
 'MAUSSER AVE' 'VALEWOOD PL' 'WINDOM RD' 'REX DR' 'NORFOLK CRES' 'KEHL ST'
 'HEIMAN ST' 'HEBEL PL' 'GUERIN AVE' 'BECKER ST' 'STIRLING AVE S'
 'MAURICE ST' 'SOUTHILL DR' 'WALKER ST' 'EIGHTH AVE' 'FAIRMOUNT RD'
 'FERGUS AVE' 'HUBER ST' 'GOLFVIEW PL' 'PATTANDON AVE' 'HOFFMAN ST'
 'SYDNEY ST S' 'ANN ST' 'EDWIN ST' 'SCHWEITZER ST']


In [58]:
asset_size[asset_size['STREET'] == 'WEBER ST E']

Unnamed: 0,STREET,ASSETID,ASSET_SIZE
29,WEBER ST E,41030,150.0
101,WEBER ST E,79238,
110,WEBER ST E,40960,
263,WEBER ST E,40820,
310,WEBER ST E,41030,150.0
499,WEBER ST E,40760,
528,WEBER ST E,41030,150.0
565,WEBER ST E,40960,
664,WEBER ST E,40960,
849,WEBER ST E,40960,


Well, there we have it: in the streets I checked, each asset ID either has its size recorded on every row or missing on every row, so matching on ID won't recover any values.
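Rather than spot-checking street by street, the same question could be asked of every asset ID at once. Here is a minimal sketch on toy data: the first three IDs mirror the `WEBER ST E` output above, while ID `99999` is a made-up "mixed" case added only to show what a match would look like. The names `counts` and `mixed_ids` are my own.

```python
import pandas as pd

# Toy data: 41030 is always known, 40960 and 79238 are always missing,
# and 99999 is a hypothetical ID with one known and one missing size.
toy = pd.DataFrame({
    "ASSETID": [41030, 41030, 40960, 40960, 79238, 99999, 99999],
    "ASSET_SIZE": [150.0, 150.0, None, None, None, 200.0, None],
})

# For each ID: total rows vs. rows where the size is recorded
# ("count" skips NaN, "size" does not).
counts = toy.groupby("ASSETID")["ASSET_SIZE"].agg(n_rows="size", n_known="count")

# An ID is "mixed" when some, but not all, of its rows have a size.
mixed = counts[(counts["n_known"] > 0) & (counts["n_known"] < counts["n_rows"])]
mixed_ids = mixed.index.tolist()

print(mixed_ids)
```

An empty `mixed_ids` on the real data would confirm that ID matching can't fill any of the gaps.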

I will fill in the `NaN` values with the most frequent number (mode) for now.

In [59]:
asset_size.ASSET_SIZE.value_counts()

150.0     1808
300.0      319
200.0      289
100.0       58
450.0       51
600.0       17
25.0        13
250.0       12
50.0         7
13.0         5
1200.0       5
0.0          3
750.0        2
Name: ASSET_SIZE, dtype: int64

In [60]:
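# fill missing asset sizes with the mode of the column
# (assigning the result back avoids the chained-assignment warning that
# calling fillna(..., inplace=True) on a selected column raises in newer pandas)
break_data['ASSET_SIZE'] = break_data['ASSET_SIZE'].fillna(break_data['ASSET_SIZE'].mode()[0])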

In [61]:
break_data.ASSET_SIZE.value_counts()

150.0     1969
300.0      319
200.0      289
100.0       58
450.0       51
600.0       17
25.0        13
250.0       12
50.0         7
13.0         5
1200.0       5
0.0          3
750.0        2
Name: ASSET_SIZE, dtype: int64

In [62]:
print_null(break_data)

LONGITUDE                        0
LATITUDE                         0
OBJECTID                         0
WATBREAKINCIDENTID               0
INCIDENT_DATE                    0
BREAK_TYPE                       0
HOUR_IMPACTED                    0
UNITS_IMPACTED                   0
STATUS                           0
BREAK_NATURE                     0
BREAK_APPARENT_CAUSE             0
POSITIVE_PRESSURE_MAINTANED      0
AIR_GAP_MAINTANED                0
MECHANICAL_REMOVAL               0
FLUSHING_EXCAVATION              0
HIGHER_VELOCITY_FLUSHING         0
ANODE_INSTALLED                  0
BREAK_CATEGORIZATION             0
ROADSEGMENTID                    0
STREET                           0
ASSETID                          0
ASSET_SIZE                       0
ASSET_YEAR_INSTALLED           165
ASSET_MATERIAL                 161
ASSET_EXISTS                     0
GLOBALID                         0
dtype: int64

In [63]:
break_data.ASSET_YEAR_INSTALLED.value_counts()

1966.0    159
1958.0    158
1967.0    156
1962.0    110
1953.0    104
         ... 
1904.0      1
1990.0      1
1936.0      1
1997.0      1
1920.0      1
Name: ASSET_YEAR_INSTALLED, Length: 94, dtype: int64

In [64]:
break_data[break_data['ASSET_YEAR_INSTALLED'].isna()].sample(10)

Unnamed: 0,LONGITUDE,LATITUDE,OBJECTID,WATBREAKINCIDENTID,INCIDENT_DATE,BREAK_TYPE,HOUR_IMPACTED,UNITS_IMPACTED,STATUS,BREAK_NATURE,...,ANODE_INSTALLED,BREAK_CATEGORIZATION,ROADSEGMENTID,STREET,ASSETID,ASSET_SIZE,ASSET_YEAR_INSTALLED,ASSET_MATERIAL,ASSET_EXISTS,GLOBALID
2338,-80.497329,43.434857,10323,2344,2018-01-16 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,6444,CORAL CRES,8000,150.0,,,N,4580e55e-da79-4c44-b94d-8c2e1126e14f
603,-80.477786,43.480303,8588,1008,2002-12-13 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,13090,BRIDGE ST E,489,150.0,,,N,fd851860-bd51-4183-ad59-2ea6f34537ca
838,-80.482813,43.43508,8823,1729,2012-01-21 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,10899,MILL ST,26750,150.0,,,N,c62309c7-5fb8-4e4d-962b-36e4516f3d6a
2069,-80.493849,43.434275,10054,1129,2007-04-13 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,20920,MAUSSER AVE,25730,150.0,,,N,24a744cd-762e-4ab7-aa5a-5e08cc0c02b9
2411,-80.472779,43.479773,10396,647,1999-01-24 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,12575,SCHWEITZER ST,442,150.0,,,N,0850c1a5-cfc4-4c31-b18e-52b58f15cd18
1410,-80.434504,43.432122,9395,516,1997-01-25 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,12094,FLORENCE AVE,14700,150.0,,,N,d3dfc28b-57d7-442a-9691-e90c6f6f6f12
1784,-80.48565,43.451114,9769,1888,2014-04-10 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,6759,WEBER ST E,40810,150.0,,,N,4bcc7907-2434-4b1b-ac7f-26d7d0d40839
288,-80.435611,43.431482,8273,1798,2013-12-03 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,CORROSION AND CIRCUMFERENTIAL,...,Y,CATEGORY 1,12095,FLORENCE AVE,14690,150.0,,,N,7811391e-e607-4ee4-bb0b-cc7ce3f737d4
1316,-80.497159,43.434917,9301,529,2002-12-03 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,6444,CORAL CRES,8000,150.0,,,N,d9e7a81e-4ffc-41bc-b017-72c1d5968159
1227,-80.479152,43.432663,9212,126,1997-02-09 00:00:00+00:00,MAIN,8-12 hours,0,REPAIR COMPLETED,UNKNOWN,...,Y,CATEGORY 1,10877,OTTAWA ST S,29680,150.0,,,N,03ca1e23-e8d0-4ca9-bebe-d5571da1c7a5


From reading different studies, it seems that the year the pipe was installed, or rather the age of the pipe, is a critical factor in predicting breaks and time to failure. In this case, I believe I have to drop the observations where no year is indicated.
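To make "age of the pipe" concrete, here is a small sketch of how it could be derived from the two columns we already have. The two rows reuse dates and install years from the sample at the top of the notebook; `PIPE_AGE_AT_BREAK` is a hypothetical column name, and I'm assuming `INCIDENT_DATE` has already been parsed to a datetime.

```python
import pandas as pd

# Two toy rows; INCIDENT_DATE is assumed to be a timezone-aware datetime.
toy = pd.DataFrame({
    "INCIDENT_DATE": pd.to_datetime(["2003-12-08", "2015-04-23"], utc=True),
    "ASSET_YEAR_INSTALLED": [1966.0, 1974.0],
})

# Age in whole years at the time of the break (hypothetical feature name).
toy["PIPE_AGE_AT_BREAK"] = toy["INCIDENT_DATE"].dt.year - toy["ASSET_YEAR_INSTALLED"]

print(toy)
```

A row with no install year would give a null age here, which is why those observations can't be used for this feature.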

In [65]:
break_data.dropna(axis=0, subset=['ASSET_YEAR_INSTALLED'], inplace=True)

In [66]:
print(break_data.shape)
print_null(break_data)

(2585, 26)


LONGITUDE                      0
LATITUDE                       0
OBJECTID                       0
WATBREAKINCIDENTID             0
INCIDENT_DATE                  0
BREAK_TYPE                     0
HOUR_IMPACTED                  0
UNITS_IMPACTED                 0
STATUS                         0
BREAK_NATURE                   0
BREAK_APPARENT_CAUSE           0
POSITIVE_PRESSURE_MAINTANED    0
AIR_GAP_MAINTANED              0
MECHANICAL_REMOVAL             0
FLUSHING_EXCAVATION            0
HIGHER_VELOCITY_FLUSHING       0
ANODE_INSTALLED                0
BREAK_CATEGORIZATION           0
ROADSEGMENTID                  0
STREET                         0
ASSETID                        0
ASSET_SIZE                     0
ASSET_YEAR_INSTALLED           0
ASSET_MATERIAL                 0
ASSET_EXISTS                   0
GLOBALID                       0
dtype: int64

Luckily, that also took care of the missing values in the `ASSET_MATERIAL` column. Now we finally have a clean dataset for some EDA and feature engineering.
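This works out when every row missing a material is also missing an install year. A quick way to check that before dropping anything might look like the sketch below, shown on made-up data; `material_covered` is a name I've introduced for illustration.

```python
import pandas as pd

# Toy data: row 1 is missing both fields, row 2 is missing only the year.
toy = pd.DataFrame({
    "ASSET_YEAR_INSTALLED": [1966.0, None, None, 1974.0],
    "ASSET_MATERIAL": ["CI", None, "DI", "PVC"],
})

material_missing = toy["ASSET_MATERIAL"].isna()
year_missing = toy["ASSET_YEAR_INSTALLED"].isna()

# True when every row lacking a material also lacks an install year,
# i.e. dropping the null-year rows removes all null materials too.
material_covered = bool((~material_missing | year_missing).all())

print(material_covered)
```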

In [68]:
break_data.to_csv('../data/processed/cleaned_break_data.csv', index=False)