# Exploration of Car Make and Violation Description Data from The City of Los Angeles Parking Citation Open Dataset

## Data cleanliness

Building on previous explorations of the Los Angeles Parking Citation Open Dataset, these analyses further explore the connections between car make and parking violation type. Before going further, data completeness and consistency have to be checked.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import os
from pathlib import Path
import random
import seaborn as sns

# Load project directory
PROJECT_DIR = Path(os.path.abspath('../..'))

In [2]:
df = pd.read_csv(PROJECT_DIR / 'data/raw/2021-01-02_raw.csv', skiprows=lambda i: i > 0 and random.random() > .01)
df.head()

Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,...,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude,Agency Description,Color Description,Body Style Description
0,1107179581,12/27/2015,1055.0,,,CA,201605.0,,TOYO,PA,...,,54.0,8058L,PREF PARKING,68.0,99999.0,99999.0,,,
1,1110265251,12/16/2015,1340.0,,,CA,,,,TR,...,22MP4,1.0,5204A,EXPIRED TAGS,25.0,99999.0,99999.0,,,
2,1112716673,12/28/2015,1020.0,,,CA,201601.0,,KIA,PA,...,00461,54.0,8069BS,NO PARK/STREET CLEAN,73.0,99999.0,99999.0,,,
3,1112718025,12/28/2015,1222.0,,,CA,201510.0,,SATU,PA,...,00461,54.0,5204A,EXPIRED TAGS,25.0,99999.0,99999.0,,,
4,1113965031,12/24/2015,1108.0,,,CA,201701.0,,FORD,PA,...,00141,51.0,8069BS,NO PARK/STREET CLEAN,73.0,6436025.9,1833425.9,,,
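The cell above keeps each data row with probability 0.01, so every run draws a different sample. If a repeatable sample is wanted, one option is to drive the same `skiprows` callable from a seeded generator. This is only a sketch under assumptions: `sample_csv`, `csv_text`, and the seed value are hypothetical names and values, and the in-memory CSV stands in for the raw file.

```python
import io
import random

import pandas as pd

# Small in-memory CSV standing in for the raw citation file (illustrative only).
csv_text = "Ticket number,Make\n" + "\n".join(f"{i},TOYO" for i in range(1000))


def sample_csv(source, fraction, seed):
    """Read roughly `fraction` of the data rows, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    # Row 0 is the header and is always kept; other rows are skipped at random.
    return pd.read_csv(source, skiprows=lambda i: i > 0 and rng.random() > fraction)


a = sample_csv(io.StringIO(csv_text), 0.1, seed=42)
b = sample_csv(io.StringIO(csv_text), 0.1, seed=42)
print(a.equals(b))
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the sample independent of any other code that touches the global generator.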


In [3]:
# Missing data
df[['Violation code', 'Violation Description']].isna().sum()/len(df)

Violation code           0.004298
Violation Description    0.009355
dtype: float64

In [4]:
# Unique code/description pairs where either field is missing
df[['Violation code', 'Violation Description']][df['Violation code'].isna()|df['Violation Description'].isna()].drop_duplicates()[:10]

Unnamed: 0,Violation code,Violation Description
4714,000,
98690,,
99242,8069AP,
99244,8069A,
99248,4000A1,
99249,22514,
99255,8058L,
99335,8056E4,
99338,8813B,
99419,225078A,


In [5]:
# Remove entries with both violation code and violation description missing
df = df[~(df['Violation code'].isna() & df['Violation Description'].isna())]

In [6]:
same_codes = set(df['Violation code']).intersection(set(df['Violation Description']))

df[['Violation code', 'Violation Description']][df['Violation Description'].isin(same_codes)]

Unnamed: 0,Violation code,Violation Description
194,024,22514
332,010,22500E
1080,024,22514
1191,098,5200
1370,013,22500H
...,...,...
103026,013,22500H
104846,011,22500F
104908,010,22500E
104909,098,5200


It would seem that a few violation codes have been entered as violation descriptions. For those rows, the code should be moved over into the code column and the description should be cleared. It would also seem that the three-digit violation codes are sometimes not very meaningful, and that 000 is often used for different types of violations.
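The cleanup described above can also be written with boolean masks instead of a row-wise `apply`. The following is a minimal sketch on a toy frame, not the notebook's own method: the values in `toy` are made up for illustration, and `toy`, `mask`, and `cleaned` are hypothetical names. Missing values are dropped before the set intersection so that NaN cannot end up among the shared values, which is an added precaution rather than something the dataset is known to need.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the citation table (illustrative values only).
toy = pd.DataFrame({
    'Violation code': ['024', '8058L', '22514', '000'],
    'Violation Description': ['22514', 'PREF PARKING', np.nan, np.nan],
})

# Values that occur both as a code and as a description.
shared = set(toy['Violation code'].dropna()) & set(toy['Violation Description'].dropna())

# Rows whose description is really a code, plus rows using the placeholder code 000.
mask = toy['Violation Description'].isin(shared) | (toy['Violation code'] == '000')

# Move the description into the code column, then clear the description.
toy.loc[mask, 'Violation code'] = toy.loc[mask, 'Violation Description']
toy.loc[mask, 'Violation Description'] = np.nan

# Drop rows that now have neither a code nor a description.
cleaned = toy[~(toy['Violation code'].isna() & toy['Violation Description'].isna())]
print(cleaned)
```

In this toy example the `024` row ends up with code `22514` and no description, and the `000` row is dropped because it carries no information after the move.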

In [7]:
# Move the description into the code column and clear the description (applied row-wise)
def code_swap(df):
    df['Violation code'] = df['Violation Description']
    df['Violation Description'] = np.nan
    return df

In [8]:
code_swap_filter = (df['Violation Description'].isin(same_codes) | (df['Violation code'] == '000'))

df.loc[code_swap_filter,['Violation code', 'Violation Description']] = df[['Violation code', 'Violation Description']][code_swap_filter].apply(code_swap, axis=1)

# Remove new entries with both violation code and violation description missing
df = df[~(df['Violation code'].isna() & df['Violation Description'].isna())]

In [9]:
df[['Violation code', 'Violation Description']].drop_duplicates().sort_values('Violation code')

Unnamed: 0,Violation code,Violation Description
4648,022,225078
22083,030,22522
13751,031,22523A
8908,032,22523B
274,099,5204
...,...,...
42623,8939B,
104682,8940,
622,8940,PARKING AREA
42089,8940B,PK OVR 2 SPACES


In [10]:
code_dict = {}
for code in set(df['Violation code']):
    desc_aliases = (
        df.loc[df['Violation code'] == code, 'Violation Description']
        .drop_duplicates()
        .dropna()
        .to_list()
    )
    if desc_aliases:
        if len(desc_aliases) > 1:
            code_dict[code] = max(desc_aliases, key=len)
        else:
            code_dict[code] = desc_aliases[0]
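The loop above keeps, for each code, the longest description seen for it. The same idea can be sketched with a `groupby`, shown here on a toy frame. The rows in `toy` are illustrative (the alias `PREF PARK` in particular is made up), and `described` and `toy_code_dict` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy rows: one code with two aliases, one code with a missing description.
toy = pd.DataFrame({
    'Violation code': ['8058L', '8058L', '80.58L', '8940', '8940'],
    'Violation Description': [
        'PREF PARKING', 'PREF PARK', 'PREFERENTIAL PARKING', np.nan, 'PARKING AREA',
    ],
})

# Only rows with both a code and a description can contribute an alias.
described = toy.dropna(subset=['Violation code', 'Violation Description'])

# For each code, keep the longest description; ties go to the first one seen.
toy_code_dict = (
    described.groupby('Violation code')['Violation Description']
    .agg(lambda s: max(s.unique(), key=len))
    .to_dict()
)
print(toy_code_dict)
```

Codes that never appear with a description are simply absent from the resulting dictionary, as in the loop version.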

In [11]:
code_dict

{'80731': 'STORING VEH/ON STR',
 '8939': 'WHITE CURB',
 '80.7': 'NO STOPPING/ANTI-GRIDLOCK ZONE',
 '8056': 'YELLOW ZONE',
 '8051A': 'LEFT SIDE OF ROADWAY',
 '8942': 'MORE 18-CURB',
 '8049': 'WRG SD/NOT PRL',
 '80.69.4': 'PK OVERSIZ',
 '22507.8C2': 'DISABLED PARKING/CROSS HATCH',
 '225001': 'PARK FIRE LANE',
 '22511.57B': 'DP- RO NOT PRESENT',
 '22502E': '18 IN. CURB/1 WAY',
 '80694**': 'PK OVERSIZED 3RD',
 '22523B-': 'ABAND VEH/PUB/PRIV',
 '80.61': 'STANDNG IN ALLEY',
 '8936': 'RED CURB',
 '22500L-': 'DP-BLKNG ACCESS RAMP',
 '80.69A+': 'STOP/STAND PROHIBIT',
 '22507.8C1': 'DISABLED PARKING/BOUNDARIES',
 '80.71.3': 'PARKING/FRONT YARD',
 '85.01': 'REPAIRING VEH/STREET',
 '22523A-': 'ABAND VEH/HIGHWAY',
 '80.56E4+': 'RED ZONE',
 '22507.8B': 'DISABLED PARKING/OBS',
 '8603': 'PK IN PROH AREA',
 '8058L': 'PREF PARKING',
 '21113A+': 'PUBLIC GROUNDS',
 '8073': 'ANGLE PKD',
 '80.58L': 'PREFERENTIAL PARKING',
 '8069BS': 'NO PARK/STREET CLEAN',
 '8056E4': 'RED ZONE',
 '22507.8A': 'DISABLED PARKI

In [21]:
df.loc[df['Violation Description'].isin(['NO STOP/STANDING', 'STOP/STAND PROHIBIT']), ['Violation code', 'Violation Description']].drop_duplicates()

Unnamed: 0,Violation code,Violation Description
25,80.69AP+,NO STOP/STANDING
80,80.69A+,STOP/STAND PROHIBIT
