In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame



In [2]:
filename = '../data/nyc-parking-violations-2020.csv'

df = pd.read_csv(filename,
                 usecols=['Plate ID',  'Registration State',
                        'Vehicle Make', 'Vehicle Color', 'Street Name'])

df.head()

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,43 ST,BK
1,KRE6058,PA,ME/BE,UNION ST,BLK
2,444326R,NJ,LEXUS,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,DIVISION AVE,
4,FMY9090,NY,JEEP,GRAND ST,GREY


# Beyond 1

Run `value_counts` on the `Vehicle Make` column, and look at some of the vehicle names. (There are more than 5,200 distinct makes, which almost certainly indicates that there is a lot of inconsistency in this data.) What problems do you see? Write a function that, given a value, cleans it up -- putting the name in all caps, removing punctuation, and standardizing whatever names you can, and then use the `apply` method to fix up the column. How many distinct vehicle makes are there when you're done?

In [3]:
# I could have used regular expressions, but decided to make it a bit easier to follow

import string

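def clean_name(one_string):
    """Return one_string uppercased, keeping only the letters A-Z."""

    # NaN and other non-string values are passed through unchanged
    if not isinstance(one_string, str):
        return one_string

    output = ''

    # Keep letters only; this drops punctuation, and also digits and spaces
    for one_character in one_string.strip().upper():
        if one_character in string.ascii_uppercase:
            output += one_character

    return output

# Number of distinct makes before and after cleaning (NaN is not counted)
print(len(df['Vehicle Make'].value_counts()))
df['Vehicle Make'] = df['Vehicle Make'].apply(clean_name)
print(len(df['Vehicle Make'].value_counts()))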

5210
4915
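
The function above normalizes case and strips non-letters, but it does not yet standardize abbreviated names. A minimal sketch of that extra step follows, using a small lookup table. `MAKE_ALIASES` and `standardize_make` are hypothetical names, and apart from `CHEVR` and `ME/BE` (which appear in the `head()` output above) the abbreviations and their expansions are assumptions, not values confirmed from the data.

```python
import string

import pandas as pd

# Hypothetical alias table. The expansions are assumptions about what each
# abbreviation means; extend it after inspecting value_counts() yourself.
MAKE_ALIASES = {
    'CHEVR': 'CHEVROLET',
    'CHEVY': 'CHEVROLET',
    'MEBE': 'MERCEDESBENZ',   # 'ME/BE' becomes 'MEBE' once punctuation is gone
    'TOYOT': 'TOYOTA',
}

def standardize_make(value):
    # Leave NaN and other non-string values alone
    if not isinstance(value, str):
        return value
    # Same rule as clean_name: uppercase, keep only A-Z
    cleaned = ''.join(c for c in value.strip().upper()
                      if c in string.ascii_uppercase)
    # Map known variants to one canonical spelling; otherwise keep as-is
    return MAKE_ALIASES.get(cleaned, cleaned)

sample = pd.Series(['Chevr', 'ME/BE', 'toyot', 'HONDA', None])
result = sample.apply(standardize_make)
print(result)
```

A lookup table keeps each decision explicit and easy to review, at the cost of having to build it by hand from the most common values.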


# Beyond 2

How standardized are the street names in the system? What changes could you apply to improve things?

In [4]:
# Let's do some experiments to see how standardized things are

# For example, it sometimes says E 110th St and sometimes says E 110 ST
s = df['Street Name'].dropna()
s[s.str.contains('110')].value_counts()

W 110th St              2970
110th St                2388
E 110th St              2048
WB 110TH AVE/BRINKER     922
110th Ave                704
                        ... 
O/F 77 EAST 110 ST         1
C/O 110 RD                 1
S/E C/O E 110 ST           1
E/B 110 W 48 ST            1
E 110  ST                  1
Name: Street Name, Length: 73, dtype: int64

In [5]:
# Sometimes it says BWAY and sometimes BROADWAY ...

# So to clean things up, we would need to standardize whether we use st/nd/rd/th, and if/when
# we abbreviate street names, and HOW we do that. Also, there is a separate column for the
# cross street, so it shouldn't be in the "Street Name" column.  A mess!  (Or an opportunity...)

s[s.str.contains('BWAY') | s.str.contains('BROADWAY')].value_counts()

SB BROADWAY @ 252ND     21939
NB BROADWAY @ W 228T    13367
BROADWAY                10771
SB BROADWAY @ W 196T     6623
NB BROADWAY @ W 120T     5691
                        ...  
S/B BWAY                    1
BROADWAY PL                 1
S/S BWAY                    1
S/O 1350 BROADWAY           1
N/E 220 BROADWAY            1
Name: Street Name, Length: 181, dtype: int64
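
As a rough sketch of the cleanup described in the comments above, the function below collapses repeated whitespace, drops ordinal suffixes (st/nd/rd/th) after numbers, and expands `BWAY`. `normalize_street` is a hypothetical helper, and the choice to drop ordinals rather than add them is an assumption. It does not handle truncated values such as `W 228T` or cross streets embedded in the name.

```python
import re

import pandas as pd

def normalize_street(name):
    # Leave NaN and other non-string values alone
    if not isinstance(name, str):
        return name
    # Uppercase and collapse runs of whitespace into single spaces
    out = ' '.join(name.upper().split())
    # Drop ordinal suffixes directly attached to a number: 110TH -> 110
    out = re.sub(r'\b(\d+)(ST|ND|RD|TH)\b', r'\1', out)
    # Expand one known abbreviation
    out = re.sub(r'\bBWAY\b', 'BROADWAY', out)
    return out

samples = pd.Series(['E 110th St', 'E 110 ST', 'E 110  ST',
                     'S/B BWAY', 'BROADWAY'])
normalized = samples.apply(normalize_street)
print(normalized.value_counts())
```

The three spellings of East 110th Street collapse into one value, while `S/B BROADWAY` and `BROADWAY` stay distinct because the direction prefix is left alone.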

# Beyond 3

Would you need to clean up the `Registration State` column? Why or why not?

In [6]:
# We have 68 "states," which include Canadian provinces and some other countries.
# So this seems pretty reasonable, although perhaps some additional cleanup is needed.
df['Registration State'].value_counts()

NY    9753643
NJ    1096110
PA     338779
FL     174056
CT     165205
       ...   
PE         18
SK          8
MX          7
NT          3
YT          2
Name: Registration State, Length: 68, dtype: int64
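
One way to decide whether more cleanup is needed is to separate the codes into US codes, other well-formed two-letter codes, and anything malformed. The sketch below does that on a small sample. `US_CODES` is the 50 states plus DC, and the value `'99'` is a hypothetical placeholder added to show what a malformed code would look like; it is not taken from the output above.

```python
import pandas as pd

# The 50 US states plus DC
US_CODES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

# Small sample; '99' is a hypothetical malformed value
states = pd.Series(['NY', 'NJ', 'PA', 'PE', 'SK', 'MX', '99'])

# Codes that are not US states (provinces, countries, or junk)
non_us = states[~states.isin(US_CODES)]

# Codes that are not exactly two uppercase letters
malformed = states[~states.str.fullmatch(r'[A-Z]{2}')]

print(non_us.tolist())
print(malformed.tolist())
```

On the real column, the same two filters applied to `df['Registration State'].dropna()` would show which of the 68 values deserve a closer look.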