# Exploration of Car Make and Violation Description Data from The City of Los Angeles Parking Citation Open Dataset

## Data cleanliness

Building on previous explorations of the Los Angeles Parking Citation Open Dataset, these analyses further explore the connections between car make and parking violation type. Before going much further, data completeness and consistency have to be checked.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import os
from pathlib import Path
import random
import seaborn as sns

# Load project directory
PROJECT_DIR = Path(os.path.abspath('../..'))

In [67]:
# Keep the header row and a random sample of roughly 1% of the data rows
df = pd.read_csv(PROJECT_DIR / 'data/raw/2021-01-02_raw.csv',
                 skiprows=lambda i: i > 0 and random.random() > .01)
df.head()

Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,...,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude,Agency Description,Color Description,Body Style Description
0,1111967194,12/22/2015,310.0,,,CA,201603.0,,TOYO,PA,...,3L20,2.0,007,22500B,68.0,6439781.9,1802687.3,,,
1,1113011594,12/16/2015,1635.0,,,CA,201510.0,,DODG,PA,...,13FB4,1.0,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,,,
2,1113011616,12/16/2015,1645.0,,,CA,201602.0,,FORD,SU,...,13FB4,1.0,8056E4,RED ZONE,93.0,99999.0,99999.0,,,
3,1113965473,12/27/2015,1156.0,45.0,,CA,201612.0,,MERC,PA,...,00203,51.0,8058L,PREF PARKING,68.0,6427271.2,1834319.2,,,
4,1113965576,12/24/2015,1014.0,,,CA,201611.0,,HOND,PA,...,00141,51.0,8069BS,NO PARK/STREET CLEAN,73.0,6437369.7,1832322.3,,,
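
The load cell above keeps each data row with probability 0.01, and because the random generator is not seeded, every run draws a different sample. The following is a minimal sketch of a seeded variant, run against a small in-memory CSV rather than the raw file. The helper name `sample_csv`, the seed value and the toy columns `a` and `b` are assumptions made for illustration and are not part of the original notebook.

```python
import io
import random

import pandas as pd


def sample_csv(source, keep_fraction, seed):
    """Read a CSV, keeping the header and a seeded random subset of rows."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return pd.read_csv(
        source,
        # Row 0 is the header and is always kept; other rows are skipped
        # whenever the draw exceeds keep_fraction.
        skiprows=lambda i: i > 0 and rng.random() > keep_fraction,
    )


# Toy stand-in for the raw file (hypothetical data)
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

first = sample_csv(io.StringIO(csv_text), keep_fraction=0.1, seed=0)
second = sample_csv(io.StringIO(csv_text), keep_fraction=0.1, seed=0)
print(first.equals(second))
```

With a fixed seed the same rows are selected on every run, which makes the missing-data ratios and the later cell outputs comparable between sessions.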


In [3]:
# Missing data
df[['Violation code', 'Violation Description']].isna().sum()/len(df)

Violation code           0.003769
Violation Description    0.009044
dtype: float64

In [4]:
# Unique code/description pairs where at least one of the two values is missing
missing_either = df['Violation code'].isna() | df['Violation Description'].isna()
df.loc[missing_either, ['Violation code', 'Violation Description']].drop_duplicates()[:10]

Unnamed: 0,Violation code,Violation Description
8942,000,
99156,,
99738,8069AP,
99740,80714,
99742,22502A,
99743,5204A,
99744,5204,
99820,8069BS,
99822,22500I,
99889,4000A1,


In [68]:
# Remove entries with both violation code and violation description missing
df = df[~(df['Violation code'].isna() & df['Violation Description'].isna())]

In [8]:
# Values that occur in both columns, i.e. codes that were entered as descriptions
same_codes = set(df['Violation code']).intersection(set(df['Violation Description']))

df[['Violation code', 'Violation Description']][df['Violation Description'].isin(same_codes)]

Unnamed: 0,Violation code,Violation Description
90,013,22500H
656,002,4000A
962,099,5204
1079,013,22500H
1250,570,2251157B
...,...,...
94855,098,5200
94863,024,22514
98485,013,22500H
105125,011,22500F


It would seem that a few violation codes have been entered as violation descriptions. In those rows the value in the description column should be moved into the code column and the description left empty. The three-digit violation codes also do not seem to be very meaningful on their own, and 000 in particular is used for several different types of violations.
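
The move described above can also be written with boolean masks instead of a row-wise `apply`. The sketch below runs on a four-row toy frame; the rows and the names `toy`, `same_codes_toy` and `mask` are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one code stored as a description, one '000' placeholder
toy = pd.DataFrame({
    'Violation code': ['013', '22500H', '8056E4', '000'],
    'Violation Description': ['22500H', np.nan, 'RED ZONE', np.nan],
})

# Values that occur in both columns are codes that were entered as descriptions
same_codes_toy = set(toy['Violation code']) & set(toy['Violation Description'])

# Rows to fix: description is really a code, or the code is the '000' placeholder
mask = toy['Violation Description'].isin(same_codes_toy) | (toy['Violation code'] == '000')

# Move the description into the code column, then clear the description
toy.loc[mask, 'Violation code'] = toy.loc[mask, 'Violation Description']
toy.loc[mask, 'Violation Description'] = np.nan

# Drop rows that now have neither a code nor a description
toy = toy[~(toy['Violation code'].isna() & toy['Violation Description'].isna())]
print(toy)
```

Column-wise assignment avoids calling a Python function once per row, which may matter on the full dataset, although on the 1% sample either approach should be fast enough.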

In [9]:
# Move the description into the code column and clear the description.
# Applied row-wise, so the argument is a single row (a Series).
def code_swap(row):
    row['Violation code'] = row['Violation Description']
    row['Violation Description'] = np.nan
    return row

In [70]:
code_swap_filter = (df['Violation Description'].isin(same_codes) | (df['Violation code'] == '000'))

df.loc[code_swap_filter,['Violation code', 'Violation Description']] = df[['Violation code', 'Violation Description']][code_swap_filter].apply(code_swap, axis=1)

# Remove new entries with both violation code and violation description missing
df = df[~(df['Violation code'].isna() & df['Violation Description'].isna())]

In [71]:
df[['Violation code', 'Violation Description']].drop_duplicates().sort_values('Violation code')

Unnamed: 0,Violation code,Violation Description
21736,017,22502
21717,029,22521
42473,030,22522
4578,031,22523A
42586,045,4000
...,...,...
37003,89391C,EXCEED TIME LMT
3141,8940,PARKING AREA
109302,8940A,
21864,8943,PARK IN XWALK


In [89]:
# Map each violation code to its longest recorded description
code_dict = {}
for code in set(df['Violation code']):
    desc_aliases = (df.loc[df['Violation code'] == code, 'Violation Description']
                    .drop_duplicates().dropna().to_list())
    if desc_aliases:
        # max() also covers the case of a single description
        code_dict[code] = max(desc_aliases, key=len)
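
The loop above picks the longest description recorded for each code. As a sketch of the same idea in a single pass, a `groupby` can do the selection. The toy frame, its codes `A1`, `B2`, `C3` and the name `code_dict_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): code A1 has two spellings, C3 has no description
toy = pd.DataFrame({
    'Violation code': ['A1', 'A1', 'B2', 'C3'],
    'Violation Description': ['NO PARK', 'NO PARKING ZONE', 'RED ZONE', np.nan],
})

longest = (
    toy.dropna(subset=['Violation Description'])     # ignore missing descriptions
       .groupby('Violation code')['Violation Description']
       .agg(lambda s: max(s, key=len))                # longest alias per code
)
code_dict_toy = longest.to_dict()
print(code_dict_toy)
```

Codes without any description simply do not appear in the result, which matches the `if desc_aliases:` guard in the loop.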

In [90]:
code_dict

{'22500B': 'PARKED IN CROSSWALK',
 '22507.8B': 'DISABLED PARKING/OBS',
 '80.69A+': 'STOP/STAND PROHIBIT',
 '8056E2': 'YELLOW ZONE',
 '22507.8B-': 'DISABLED PARKING/OBSTRUCT ACCESS',
 '80.61': 'STANDNG IN ALLEY',
 '80.66.1D': 'RESTRICTED TAXI ZONE',
 '80.54': 'OVERNIGHT PARKING',
 '8061#': 'STANDING IN ALLEY',
 '80.69.1C': 'PK TRAILER',
 '225001': 'PARK FIRE LANE',
 '80.69D': "VEH/LOAD OVR 6' HIGH",
 '22500.1+': 'PARKED IN FIRE LANE',
 '22507.8A-': 'DISABLED PARKING/NO DP ID',
 '557': '8755*',
 '80.69BS': 'NO PARK/STREET CLEAN',
 '22502E': '18 IN. CURB/1 WAY',
 '5200A': 'DSPLYPLATE A',
 '22507.8A': 'DISABLED PARKING/NO',
 '80692*': 'COMVEH RES/OV TM B-2',
 '553': '80581',
 '22511.57B': 'DP- RO NOT PRESENT',
 '80.69.4': 'PK OVERSIZ',
 '80.75.1': 'AUDIBLE ALARM',
 '5204A-': 'DISPLAY OF TABS',
 '80.69C': 'PARKED OVER TIME LIMIT',
 '80713': 'PARKING/FRONT YARD 1',
 '22500K': 'PARKED ON BRIDGE',
 '569': '2251157A',
 '88.03A': 'OUTSIDE LINES/METER',
 '22522-': '3 FT. SIDEWALK RAMP',
 '22507.8

In [88]:
max(desc_aliases, key=len)

'SIGN POSTED - NO PARKING'