# Data Workout - Parking Tickets

## Parking Data

This activity uses a sample of the New York City Parking Violations Dataset. Imagine this data was collected by police officers, parking inspectors, or other individuals. This means the data might have some missing or incorrect information.

## Importing the Tools

* Import the `pandas` library with the `pd` alias.


In [1]:
## Begin Solution
import pandas as pd

## End Solution

### Exercise 1 - Loading the Data

In this exercise, you'll start by loading the parking violation data into a DataFrame and selecting the columns we'll be working with. 

__Your Task__

1. __Create a DataFrame__:

    * Create a DataFrame named `parking_df` from the file located at:
       
        * `../data/nyc-parking-violation-sample.csv`

2. __Select Specific Columns__:

    * From the loaded data, we only need a few specific pieces of information.  Create a new DataFrame (you can name it `parking_df` again, overwriting the previous one, or use a new name) that includes only the following columns:

        * `Plate ID`
        * `Registration State`
        * `Vehicle Make`
        * `Vehicle Color`
        * `Violation Time`
        * `Street Name`

3. __Get to Know Your Data__:

    * Now, let's take a look at the structure of your DataFrame. Output the information about the `parking_df` DataFrame. This will help you understand what you're working with. Make sure to include:

        * The name of each column
        * The number of non-null entries in each column
        * The data type of each column (e.g., text, numbers, dates)

In [3]:
## Begin Solution
data = "../data/nyc-parking-violation-sample.csv"

# Import Data
parking_df = pd.read_csv(
    data,
    low_memory=False
)

# Print Info (Before removing columns)
# info() prints its report directly and returns None, so no print() is needed
parking_df.info()

# Specify Columns to Keep
columns = [
    "Plate ID", "Registration State", "Vehicle Make",
    "Vehicle Color", "Violation Time", "Street Name",
]


# Filter columns
parking_df = parking_df[columns]

# Output the Resulting Info
print("* *"*20)
parking_df.info()

## End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 44 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         250000 non-null  int64  
 1   Summons Number                     250000 non-null  int64  
 2   Plate ID                           249992 non-null  object 
 3   Registration State                 250000 non-null  object 
 4   Plate Type                         250000 non-null  object 
 5   Issue Date                         250000 non-null  object 
 6   Violation Code                     250000 non-null  int64  
 7   Vehicle Body Type                  247457 non-null  object 
 8   Vehicle Make                       247146 non-null  object 
 9   Issuing Agency                     250000 non-null  object 
 10  Street Code1                       250000 non-null  int64  
 11  Street Code2                       2500
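
As an aside, the column selection in this exercise can also happen at load time. The sketch below is a minimal illustration on a made-up, two-row CSV held in memory (the rows and the `small_df` name are hypothetical, not taken from the real file); it relies on the `usecols` parameter of `pd.read_csv`, which may be worth considering when a file has many columns you never need.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real parking file
csv_text = """Summons Number,Plate ID,Registration State,Vehicle Make,Vehicle Color,Violation Time,Street Name
1,ABC123,NY,HONDA,BLACK,0830A,BROADWAY
2,,NJ,FORD,WH,1145P,5TH AVE
"""

columns = [
    "Plate ID", "Registration State", "Vehicle Make",
    "Vehicle Color", "Violation Time", "Street Name",
]

# usecols keeps only the listed columns while the file is being parsed
small_df = pd.read_csv(io.StringIO(csv_text), usecols=columns)

# Same structural summary as in the exercise: names, non-null counts, dtypes
small_df.info()
```

The result should match loading everything and then filtering with `parking_df[columns]`; the difference is only in how much data is read into memory first.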

### Exercise 2 - Removing `NaN`

In this exercise, you'll learn how to handle missing data, a common issue in real-world datasets. We'll remove rows with missing values and then analyze the impact of this data cleaning step.

__Your Task__

1. __Clean the Data:__

    * Create a new DataFrame named `cleaned_parking_df`.

        * Remove any rows from the `parking_df` DataFrame (from Exercise 1) that have missing data (represented as `NaN` values).
  


2. __Analyze the Cleaned Data:__

    * Determine the number of rows in `cleaned_parking_df`. In other words, how many rows are left after removing the rows with missing data?

3. __Calculate Avoided Fines (Hypothetical):__

    * For the sake of this exercise, let's imagine that each parking ticket carries a $100 fine.
    * Also, imagine that if a ticket has any missing information, it can be successfully contested, and the fine is waived.
    * Based on the rows you removed in step 1, calculate the total amount of fines that New York City citizens hypothetically avoided due to missing data.

4. __Important Notes:__

    * The idea that missing data automatically voids a ticket is a simplified scenario created for this exercise to make it more engaging. It is not based on actual legal information.
    * The purpose of this exercise is to illustrate the impact of data cleaning.
    * This exercise is for educational purposes only. I am not a lawyer, and this should not be taken as legal advice. For legal advice, please consult a qualified professional.

In [5]:
## Begin Solution

# Drop Null Rows (dropna returns a new DataFrame, so parking_df is left untouched)
cleaned_parking_df = parking_df.dropna()

# Count Rows
parking_df_len = len(parking_df)
cleaned_parking_df_len = len(cleaned_parking_df)

# Calculate Fine
fine = 100

total_fines_avoided = (parking_df_len - cleaned_parking_df_len) * fine

print(f"New Yorkers avoided ${total_fines_avoided} in parking fees due to missing data")

## End Solution

New Yorkers avoided $1191000 in parking fees due to missing data
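
To see where the removed rows come from, it can help to count missing values per column before dropping anything. The following is a small sketch on a made-up DataFrame (`toy_df` and its values are hypothetical, not the real data); the same calls should work on `parking_df`.

```python
import pandas as pd

# Hypothetical toy data with a few gaps
toy_df = pd.DataFrame({
    "Plate ID": ["ABC123", None, "XYZ789", "LMN456"],
    "Vehicle Color": ["BLACK", "WH", None, None],
    "Street Name": ["BROADWAY", "5TH AVE", "MAIN ST", "ELM ST"],
})

# Missing values in each column
missing_per_column = toy_df.isna().sum()

# Rows that have at least one missing value (these are the rows dropna() removes)
rows_with_any_missing = int(toy_df.isna().any(axis=1).sum())

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_any_missing}")
```

A column with many gaps, such as `Vehicle Color` in this toy example, can account for most of the removed rows even when the other columns are nearly complete.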


### Exercise 3 - Missing Data

Let's switch up the removal criteria. A ticket can be dismissed only if the license plate, registration state, or street name is missing.

__Your Task:__

1. Clean the Data:

    * Create a new DataFrame named `improved_parking_df`.

    * Remove rows from the `parking_df` DataFrame (from Exercise 1) that have missing data (represented as NaN values) in any of the following columns:

        * `Plate ID`
        * `Registration State`
        * `Street Name`

2. Analyze the Cleaned Data:
    * Determine the number of rows in `improved_parking_df`. In other words, how many rows are left after removing the rows with missing data?

3. Calculate Avoided Fines (Hypothetical):
    * For the sake of this exercise, let's imagine that each parking ticket carries a $100 fine.
    * Also, imagine that if a ticket has missing information in the `Plate ID`, `Registration State`, or `Street Name` columns, it can be successfully contested, and the fine is waived.
    * Based on the rows you removed in step 1, calculate the total amount of fines that New York City citizens hypothetically avoided due to missing data.
    * The result should be a more realistic value than in the previous exercise.



In [7]:
## Begin Solution
# Drop Rows containing null values in certain columns
columns = ["Plate ID", "Registration State", "Street Name"]

improved_parking_df = parking_df.dropna(subset=columns)

# Get Row Count
improved_parking_df_len = len(improved_parking_df)

# Calculate Avoided Fees
total_fines_avoided = (parking_df_len - improved_parking_df_len) * fine

print(f"New Yorkers avoided ${total_fines_avoided} in parking fees due to missing data")

## End Solution

New Yorkers avoided $8600 in parking fees due to missing data
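
The gap between $1191000 and $8600 comes down to which columns `dropna` inspects. Here is a minimal side-by-side sketch on made-up data (the `toy_df` frame is hypothetical) contrasting the default behavior with the `subset` parameter:

```python
import pandas as pd

# Hypothetical toy data: one missing plate, two missing colors
toy_df = pd.DataFrame({
    "Plate ID": ["ABC123", None, "XYZ789", "LMN456"],
    "Registration State": ["NY", "NJ", "NY", "CT"],
    "Vehicle Color": ["BLACK", "WH", None, None],
})

# Default: a row is dropped if ANY column is missing
dropped_any = len(toy_df) - len(toy_df.dropna())

# subset: only the listed columns are checked for missing values
key_columns = ["Plate ID", "Registration State"]
dropped_subset = len(toy_df) - len(toy_df.dropna(subset=key_columns))

print(f"Dropped with dropna(): {dropped_any}")
print(f"Dropped with dropna(subset=...): {dropped_subset}")
```

In this toy frame the missing colors no longer count against a row once `subset` is used, which mirrors why the second estimate in the exercise is so much smaller.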


### Exercise 4 - Missing License Plates

In data cleaning, we often deal with not just missing data (like `NaN` values), but also data that, while present, is invalid. This exercise focuses on identifying and removing invalid data.

__Your Task:__

Consider a new scenario where a parking ticket can be contested and dismissed if the Plate ID is recorded as `BLANKPLATE`.


1. __Clean the Data:__

    * Create a new DataFrame, `blank_plates_df`.
    * Start with the original DataFrame, `parking_df` (from Exercise 1).
    * Remove all rows where the `Plate ID` column contains the value `BLANKPLATE`.

2. __Analyze the Cleaned Data:__

    * Determine how many rows were removed from the original DataFrame (`parking_df`) in the previous step.

3. __Calculate Avoided Fines (Hypothetical):__

    * Based on the scenario where a `BLANKPLATE` entry allows a ticket to be successfully contested, calculate the total amount in fines that NYC citizens could have potentially avoided. Assume each fine is $100.

In [8]:
## Begin Solution

# Create Mask to Isolate BLANKPLATE (not null values!)
mask = parking_df["Plate ID"] == "BLANKPLATE"

# Remove the BLANKPLATE rows by inverting the mask
blank_plates_df = parking_df[~mask]

# Count the Removed Rows
removed_rows_len = len(parking_df) - len(blank_plates_df)

# Calculate Fees Avoided
total_fines_avoided = removed_rows_len * fine

print(f"New Yorkers avoided ${total_fines_avoided} in parking fees due to missing data")
## End Solution

New Yorkers avoided $32500 in parking fees due to missing data
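
One detail worth checking with this kind of filter is how missing values behave. The sketch below uses a made-up Series (the plate values are hypothetical) to show that comparing against a string yields `False` for missing entries, so inverting the mask with `~` keeps those rows rather than removing them.

```python
import pandas as pd

# Hypothetical plates: two placeholders and one genuinely missing value
plates = pd.Series(
    ["ABC123", "BLANKPLATE", None, "BLANKPLATE", "XYZ789"],
    name="Plate ID",
)

# True only where the plate is literally the placeholder string
mask = plates == "BLANKPLATE"

# Inverting the mask keeps everything that is NOT a placeholder,
# including the missing value
kept = plates[~mask]

removed_count = int(mask.sum())
print(f"Removed: {removed_count}, kept: {len(kept)}, missing among kept: {int(kept.isna().sum())}")
```

If missing plates should also be excluded, that needs a separate step, for example the `dropna(subset=...)` approach from the previous exercise.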


## Bonus - Vehicle Colors

Inspect and clean the `Vehicle Color` column from the `parking_df` DataFrame.

What do you notice?

How will you go about cleaning this data?

In [26]:
## Vehicle Colors
parking_df["Vehicle Color"].value_counts().sample(50).sort_values(ascending=False)

Vehicle Color
BLACK     67027
RED       13368
BLUE       5406
Silver       96
PURPL        86
GLD          50
GY/          38
MAR          32
GY.          29
DKB          28
WHTE         19
WT.          16
WHGR         10
BKBL          9
RDWH          6
RED.          4
ORNG          4
DKWH          3
BIEGE         3
M             3
GRBL          3
GRY.          3
YWBL          2
OG            2
TNGR          2
GRW           2
GREY.         2
BUC           1
SIVLE         1
BLBR          1
GK            1
ORAG          1
MACOO         1
BURGN         1
RDRD          1
BL/WH         1
DKM           1
BRBR          1
PLUM          1
WH RD         1
CM            1
Silve         1
//            1
BLXK          1
SMART         1
WHIE          1
MN.           1
QBK           1
TQ            1
ROWN          1
Name: count, dtype: int64
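
Many of the rare values above differ from a common abbreviation only by case, whitespace, or stray punctuation (`GY.`, `GY/`, `RED.`, `Silver`). One possible first pass, sketched here on a made-up Series (the sample values are picked by hand and the approach is an assumption, not the only way to do it), is to normalize the text before mapping anything:

```python
import pandas as pd

# Hypothetical sample of messy color codes
raw = pd.Series(["GY.", "GY/", " red.", "Silver", "WH RD", "//"])

# Uppercase first, then drop every character that is not a letter A-Z
normalized = raw.str.upper().str.replace(r"[^A-Z]", "", regex=True)

# Entries that were only punctuation are now empty strings; treat them as missing
normalized = normalized.mask(normalized == "")

print(normalized.tolist())
```

Note that this also collapses two-tone entries such as `WH RD` into `WHRD`, so it reduces noise but does not decide what such a value should mean.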

In [19]:
## Your Solution
color_map = {
    'WH': 'WHITE', 'WHT': 'WHITE', 'WHI': 'WHITE', 'WT': 'WHITE', 'W': 'WHITE',
    'BK': 'BLACK', 'BLK': 'BLACK', 'B': 'BLACK',
    'GY': 'GREY', 'GRY': 'GREY', 'GRAY': 'GREY', 'GREY': 'GREY',
    'BL': 'BLUE', 'BLU': 'BLUE',
    'RD': 'RED',
    'GR': 'GREEN', 'GRN': 'GREEN', 'GN': 'GREEN',
    'SL': 'SILVER', 'SIL': 'SILVER', 'SILV': 'SILVER', 'SILVE': 'SILVER',
    'YW': 'YELLOW', 'YELL': 'YELLOW', 'YELLO': 'YELLOW', 'YL': 'YELLOW', 'YEL': 'YELLOW',
    'BR': 'BROWN', 'BRN': 'BROWN', 'BRO': 'BROWN', 'BRW': 'BROWN', 'BN': 'BROWN',
    'OR': 'ORANGE', 'ORANG': 'ORANGE',
    'TN': 'TAN',
    'GL': 'GOLD',
    'MR': 'MAROON', 'MAROO': 'MAROON',
    'PR': 'PURPLE', 'PURP': 'PURPLE',
    'UNKNO': 'UNKNOWN',
}

# Strip whitespace and uppercase, map known abbreviations to full names,
# and fall back to the original value when no mapping exists
parking_df['Clean_Color'] = (
    parking_df['Vehicle Color']
    .str.strip()
    .str.upper()
    .map(color_map)
    .fillna(parking_df['Vehicle Color'])
)
parking_df["Vehicle Color"].value_counts().head(50)

Vehicle Color
WHITE     70863
BLACK     67027
GREY      50675
RED       13368
BROWN      8046
SILVER     5704
GREEN      5481
BLUE       5406
YELLOW     3299
TAN        2955
GOLD       1625
OTHER      1117
ORANGE      704
MAROON      689
LTGY        234
LTG         177
PURPLE      142
DK/         125
LT/         125
DKGY        102
Silver       96
PURPL        86
GYGY         82
YEL          72
G            71
WHBL         69
BW           62
DKG          59
GD           55
DKBL         53
GLD          50
SLV          50
MAROO        46
WH/          44
BKGY         43
BEIGE        41
UNKNO        39
GY/          38
WHGY         37
WHIT         35
BG           33
BURG         32
MAR          32
SLVR         32
RDW          31
BK.          31
GY.          29
DKGR         29
DKB          28
YL           24
Name: count, dtype: int64
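
A mapping like this rarely covers everything on the first try, so it can be useful to list what is still unmapped and extend the dictionary from there. The following is a minimal sketch with a deliberately tiny, hypothetical `color_map` and made-up values; on the real data the same idea would apply to the cleaned column.

```python
import pandas as pd

# Hypothetical subset of a color mapping
color_map = {"WH": "WHITE", "BK": "BLACK", "GY": "GREY"}
canonical = set(color_map.values())

# Hypothetical raw values, some already clean, some not covered by the map
colors = pd.Series(["WH", "BK", "BLACK", "LTGY", "DK/", "GY", "LTGY"])

# Map known abbreviations and keep the original value otherwise
cleaned = colors.map(color_map).fillna(colors)

# Anything that is not a canonical color name still needs attention
leftovers = cleaned[~cleaned.isin(canonical)].value_counts()
print(leftovers)
```

Sorting the leftovers by frequency, as `value_counts` does, points to the abbreviations that would clean up the most rows if added to the map next.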