An overview of the data cleaning steps implemented in lovelyrita.

In [8]:
from lovelyrita.clean import clean

In [9]:
clean??

Signature: clean(dataframe)
Source:   
def clean(dataframe):
    """Apply a series of data cleaning steps to a dataframe of raw data

    Parameters
    ----------
    dataframe : pandas.DataFrame

    Returns
    -------
    A cleaned DataFrame
    """

    drop_void(dataframe)

    replace(dataframe.street)

    addresses = parse_addresses(dataframe.street)
    dataframe.update

Data cleaning involves several stages:

# Drop voided citations
Some citations are voided, as indicated by the word "VOID" or "ZVOIDZ" at the start of the address field (the `street` column). Others have no ticket number. Both kinds of rows are dropped from the dataframe.

In [3]:
from lovelyrita.clean import drop_void

In [4]:
drop_void??

Signature: drop_void(dataframe, inplace=True)
Source:   
def drop_void(dataframe, inplace=True):
    """Drop voided citations

    Parameters
    ----------
    dataframe : pandas.DataFrame
    inplace : bool

    Returns
    -------
    If `inplace` is False, return the dataframe with voided citations dropped
    """
    void_indices = dataframe.street.str.contains(r'^Z?VOIDZ?')
    null_indices = dataframe.ticket_number.

# Parse addresses
We parse the address to produce a street number and street name. For example, "123 MAIN STREET" will produce "123" and "MAIN STREET". This can simplify geocoding queries later on, as certain APIs can take street name and number as separate fields.

Several address patterns are detected.

In [10]:
from lovelyrita.addresses import parse_addresses, REPLACEMENTS

In [15]:
parse_addresses??

Signature: parse_addresses(addresses)
Source:   
def parse_addresses(addresses):
    """Parse addresses into street name and number according to several rules.

    Parameter
    ---------
    addresses : pandas.Series

    Returns
    -------
    A DataFrame containing street name and street column for those rows that were successfully 
    parsed
    """
    # Many addresses are in parking lots. Those will not have street numbers, so we should treat them separately. We will only concern ourselves with potential street addresses.

    lot_indices = addresses.str.contains

# Replace text
Fix recurring errors in the street text by supplying a pattern (as a [regular expression](https://regexr.com/)) and the replacement text for that pattern. This is the most fine-grained way to fix problems and should be used sparingly, since each replacement takes some time.

Here are the default patterns and replacements.

In [11]:
REPLACEMENTS

[('^ONE ', '1 '),
 ('^TWO ', '2 '),
 (' -', '-'),
 (' TERM$', ' TERMINAL'),
 ('^#', '')]
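
As a rough sketch of how such a list of (pattern, replacement) pairs might be applied, the loop below runs each pair over a Series of street strings in order. The helper name `apply_replacements` and the sample streets are hypothetical; the library's own `replace` function is not shown here and may work differently.

```python
import pandas as pd

# The default patterns listed above, copied here so the example is self-contained.
REPLACEMENTS = [
    ("^ONE ", "1 "),
    ("^TWO ", "2 "),
    (" -", "-"),
    (" TERM$", " TERMINAL"),
    ("^#", ""),
]


def apply_replacements(streets, replacements=REPLACEMENTS):
    """Apply each (pattern, replacement) pair, in order, to a Series of strings."""
    for pattern, replacement in replacements:
        # Each pass scans the whole Series, which is why long lists get slow.
        streets = streets.str.replace(pattern, replacement, regex=True)
    return streets


streets = pd.Series(["ONE MARKET ST", "#12 PIER -A", "INTL TERM"])
fixed = apply_replacements(streets)
print(fixed.tolist())
```

Because the pairs are applied sequentially, order can matter when one replacement produces text that a later pattern matches.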

# Parse ticket issue times
Ticket issue times are provided as strings in the raw data. They can be parsed into `datetime` objects that support many useful operations, e.g., finding citations in a date range.

In [12]:
from lovelyrita.clean import get_datetime, impute_missing_times

In [13]:
get_datetime??

Signature: get_datetime(dataframe)
Source:   
def get_datetime(dataframe):
    """Get a datatime for each row in a DataFrame

    Parameters
    ----------
    dataframe : pandas.DataFrame
        A dataframe with `ticket_issue_date` and `ticket_issue_time` columns

    Returns
    -------
    A Series of datetime values
    """
    dt = dataframe['ticket_issue_date'] + ' ' + dataframe['ticket_issue_time']
    datetime_format = infer_datetime_format(dt)

## Inferring the times for missing data

Once we have the times as `datetime` objects, we can guess the ticket issue times where they are missing by interpolating between known times. Caveat: this assumes the tickets are listed in chronological order.

In [14]:
impute_missing_times??

Signature: impute_missing_times(datetimes)
Source:   
def impute_missing_times(datetimes):
    """Fill in missing times by interpolating surrounding times

    Parameters
    ----------
    datetimes : pandas.Series

    Returns
    -------
    The original Series with missing times replaced by interpolated times
    """

    # get valid start and stop indices for null ranges
    n_rows = len(datetimes)

    null_indices = datetimes.isnull().nonzero(

# Convert dollar fields to numeric

In [15]:
from lovelyrita.clean import convert_dollar_to_float

In [16]:
convert_dollar_to_float??

Signature: convert_dollar_to_float(dollars)
Source:   
def convert_dollar_to_float(dollars):
    return dollars.replace('\$', '', regex=True).astype('float32')
File:      /projects/lovely-rita/lovelyrita/clean.py
Type:      function
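
A short usage sketch of the same conversion on made-up values follows. The sample amounts and the variable names are assumptions; the only change from the source above is a raw string for the pattern, which avoids an invalid-escape warning in recent Python versions.

```python
import pandas as pd

# Hypothetical fine amounts as they might appear in the raw data.
fines = pd.Series(["$25.00", "$110.50"])

# Strip the dollar sign, then cast to a numeric dtype, as in the source above.
amounts = fines.replace(r"\$", "", regex=True).astype("float32")
print(amounts)
```

Values containing other characters, such as thousands separators, would fail the cast with this pattern and would need additional handling.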
