# Preparing Data For Driver Behavior
    In the 5th year of the Vision Zero program, the New York City government is focusing on pedestrian safety (especially for senior citizens), bicycle lanes, a driver education program for people under 25 years old, and new traffic signals. This notebook cleans the data for the driver education program.
       
    Three things will be accomplished in this notebook, namely:
        > Checking for any NaN values in the dataset
        > Filtering out any unnecessary data for exploratory analysis, and
        > Feature Engineering

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("data/Traffic_Tickets_Issued__Four_Year_Window.csv")

In [3]:
df.columns

Index(['Violation Charged Code', 'Violation Description', 'Violation Year',
       'Violation Month', 'Violation Day of Week', 'Age at Violation',
       'Gender', 'State of License', 'Police Agency', 'Court', 'Source'],
      dtype='object')

# Checking NaN Values and Filtering Out Unnecessary Data
    In this section, each parameter is checked for NaN values to decide whether the amount of missing data is acceptable. After that check, all licenses issued outside of New York are filtered out, since the study focuses on the driving behavior of New York drivers.

In [4]:
# number of observations (rows) by number of parameters (columns)
df.shape

(14275009, 11)

In [9]:
# percentage of missing values relative to the total number of observations

round(df.isnull().sum()/df.shape[0] *100, 5)

Violation Charged Code    0.00000
Violation Description     0.00000
Violation Year            0.00000
Violation Month           0.00000
Violation Day of Week     0.00000
Age at Violation          1.10810
Gender                    0.00000
State of License          0.00035
Police Agency             0.00000
Court                     0.00000
Source                    0.00000
dtype: float64

The dataframe has only 2 parameters that contain NaN values: Age at Violation (about 1.1% of observations) and State of License (about 0.00035%). Since both shares are small, the observations with missing values are dropped.
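As a side note, the same per-column percentage can be written with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the ticket data).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the ticket data; only the column names match the real file.
toy = pd.DataFrame({
    "Age at Violation": [34.0, np.nan, 51.0, 22.0],
    "Gender": ["M", "F", "F", "M"],
})

# Mean of the boolean mask = share of missing rows per column, scaled to percent.
missing_pct = (toy.isna().mean() * 100).round(5)
print(missing_pct)
```

On the real dataframe this should give the same numbers as dividing `isnull().sum()` by the row count.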

In [10]:
# keep only New York licenses, then drop every row that still contains a NaN value
df = df[df["State of License"] == "NEW YORK"].dropna()

In [22]:
print("After removing all NaN values and filtering out licenses from outside New York, we have a total of {0} rows from 14M+ observations".format(df.shape[0]))

After removing all NaN values and filtering out licenses from outside New York, we have a total of 11520746 rows from 14M+ observations
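To see how much each step contributes to the reduction, the filter and the `dropna()` call can be applied separately and the row counts compared. This is a minimal sketch on a made-up frame (`toy`, `ny_only`, and `cleaned` are hypothetical names, and the values are invented), not the actual run.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the ticket data (values are for illustration only).
toy = pd.DataFrame({
    "State of License": ["NEW YORK", "NEW JERSEY", "NEW YORK", None, "NEW YORK"],
    "Age at Violation": [34.0, 51.0, np.nan, 22.0, 67.0],
})

# Step 1: keep New York licenses only. A missing state fails the comparison,
# so those rows are removed here as well.
ny_only = toy[toy["State of License"] == "NEW YORK"]

# Step 2: drop rows that still have a missing value in any column.
cleaned = ny_only.dropna()

removed_by_filter = len(toy) - len(ny_only)
removed_by_dropna = len(ny_only) - len(cleaned)
print(removed_by_filter, removed_by_dropna, len(cleaned))
```

Splitting the steps this way makes it clear whether the state filter or the missing ages account for most of the dropped rows.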


# Feature Engineering
    Lastly, two features are added to the data set: an age group and a season. The age group parameter makes it possible to check whether people of different generations behave differently, and the season column supports seasonal analysis during data exploration.

In [23]:
# Age Group: each row gets the label of the first condition it satisfies
age = df["Age at Violation"]
condlist = [age >= 54,
            (age >= 37) & (age < 54),
            (age >= 23) & (age < 37),
            age < 23]
choicelist = ["Baby Boomers", "Generation X", "Millennials", "Centennials"]

# a string default avoids mixing the implicit default of 0 with string labels,
# which recent NumPy versions can reject with a TypeError
df["Age Group"] = np.select(condlist, choicelist, default="Unknown")
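An alternative way to express the same cutoffs is `pd.cut`, which takes the bin edges directly. The sketch below is only an illustration on a few made-up ages; the `ages` series and the exact label spellings are assumptions, while the edges (23, 37, 54) mirror the conditions above.

```python
import numpy as np
import pandas as pd

# A few made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([16, 22, 23, 36, 37, 53, 54, 80], name="Age at Violation")

# right=False makes each bin closed on the left: [0, 23), [23, 37), [37, 54), [54, inf).
bins = [0, 23, 37, 54, np.inf]
labels = ["Centennials", "Millennials", "Generation X", "Baby Boomers"]

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

`pd.cut` returns an ordered categorical, which can be convenient later when sorting or plotting the groups from youngest to oldest.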

In [24]:
# Season: grouped by month number (December to February is Winter, and so on)
month = df["Violation Month"]
condlist = [month.isin([12, 1, 2]),
            (month >= 3) & (month <= 5),
            (month >= 6) & (month <= 8),
            (month >= 9) & (month <= 11)]
choicelist = ["Winter", "Spring", "Summer", "Fall"]

# a string default keeps the result a pure string column
df["Season"] = np.select(condlist, choicelist, default="Unknown")
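Since there are only twelve possible months, the season assignment could also be sketched as a dictionary lookup with `Series.map`. The example assumes that `Violation Month` holds integers from 1 to 12; the `months` series and the `season_by_month` name are made up for illustration.

```python
import pandas as pd

# Made-up month numbers covering the start and end of each season.
months = pd.Series([12, 1, 2, 3, 5, 6, 8, 9, 11], name="Violation Month")

# Explicit month -> season lookup; any value missing from the dict would map to NaN.
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

season = months.map(season_by_month)
print(season.tolist())
```

One side effect of this form is that an unexpected month value shows up as NaN rather than being silently labeled, which makes bad inputs easier to spot.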

## Save a CSV file

In [26]:
df.to_csv("driver_behavior.csv", index = False)