# Project: Can we predict when, where, and which car is going to receive a parking ticket?

## Table of Contents:
* [Introduction](#1)
* [Wrangling](#2)
* [Exploratory Visuals](#3)
* [Explanatory Visuals](#4)
* [Conclusion](#5)

## Introduction:<a class="anchor" id="1"></a>
Have you ever browsed IMDb for good movies to watch, sorted by rating? Or looked up a movie you just watched, only to find that it has a shockingly low or high viewer rating? What can we say about the high or low ratings of a movie on IMDb?

We dive into the dataset provided by Kaggle (since replaced with TMDB ratings due to a DMCA takedown: https://www.kaggle.com/tmdb/tmdb-movie-metadata/home) to see what's going on behind these scores!

In [18]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from patsy import dmatrices
import statsmodels.api as sm
from datetime import datetime, timedelta
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

In [19]:
# Large dataset, we want to see all columns
pd.set_option('display.max_columns', None)

In [20]:
# read in .csv
df_og = pd.read_csv('parking-violations-issued-fiscal-year-2018.csv', low_memory = False)

## Wrangling:<a class="anchor" id="2"></a>

### Data Issues
* column names: replace spaces with _ and convert to lower case
* violation code is type int -> cast as string
* issue date and violation time are stored in separate columns -> combine into a single datetime
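
The first two items can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: `tidy_columns` and the toy rows are hypothetical, and a purely mechanical rename would turn `Sub Division` into `sub_division`, whereas the cells below name that column `subdivision` by hand.

```python
import pandas as pd

def tidy_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with snake_case column names and violation codes as strings."""
    out = frame.copy()
    # lower-case every column name and replace spaces with underscores
    out.columns = out.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
    # violation codes are labels rather than quantities, so store them as strings
    if 'violation_code' in out.columns:
        out['violation_code'] = out['violation_code'].astype(str)
    return out

# toy rows standing in for the real CSV (values are illustrative only)
sample = pd.DataFrame({'Registration State': ['NY', 'NC'], 'Violation Code': [14, 24]})
tidy = tidy_columns(sample)
print(tidy.columns.tolist())
```

Casting the code to a string keeps it from being treated as a number in later summaries or models.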

In [269]:
df = df_og.copy()

In [270]:
# Taking the columns we want to work with
df = df[['Registration State', 'Issue Date', 'Violation Time', 'Violation Code', 'Street Name', 'Sub Division',
        'Vehicle Body Type', 'Vehicle Make', 'Vehicle Color', 'Vehicle Year']]

In [271]:
# change column names
df.columns = ['registration_state', 'issue_date', 'violation_time', 'violation_code', 'street_name', 
              'subdivision', 'vehicle_body_type', 'vehicle_make', 'vehicle_color', 'vehicle_year']

In [272]:
# drop duplicate rows and rows with missing values
df.drop_duplicates(inplace = True)
df.dropna(inplace = True)

In [273]:
# Playing around with datetime objects to learn them

x = datetime(2020, 5, 17)
x = x + timedelta(hours=9, minutes=59)

print(x)

2020-05-17 09:59:00


In [275]:
# Convert Issue Date to datetime object
df.issue_date = pd.to_datetime(df.issue_date)

In [277]:
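# Keep only well-formed violation times: exactly 5 characters, with no periods or spaces
mask = (df.violation_time.str.len() == 5) & (df.violation_time.str.count(r'\.') == 0) & (df.violation_time.str.count(' ') == 0)

# take an explicit copy so later column assignments act on an independent DataFrame
df = df.loc[mask].copy()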
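
As an alternative to the three separate checks, the expected shape of a violation time (four digits followed by `A` or `P`) could be tested with one pattern. This is a sketch on made-up values, assuming pandas 1.1 or later for `Series.str.fullmatch`; it is stricter than the length-and-character test because it also rejects letters in the digit positions.

```python
import pandas as pd

# illustrative values only: two valid times and several malformed ones
times = pd.Series(['0959A', '1130P', '09.5A', '12 0P', '959A', '6831P', None])

# exactly four digits followed by A or P; missing values count as invalid
valid = times.str.fullmatch(r'\d{4}[AP]', na=False)
print(times[valid].tolist())
```

A value such as `6831P` still passes, since the pattern checks shape only, so the hour and minute range checks remain necessary.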

In [280]:
# Add flag column for additional filtering later on
df['flag'] = False
df = df.reset_index(drop=True)



In [287]:
for i in df.index:
    # grab violation time at row i (df.at replaces get_value/set_value, which pandas 1.0 removed)
    v_time = df.at[i, 'violation_time']

    # split into hour, minute, and the AM/PM character, casting to int as necessary
    h = int(v_time[0:2])
    m = int(v_time[2:4])
    p = v_time[4:5]

    # flag rows with erroneous violation time values (ex. 6831P)
    if h >= 24 or m >= 60:
        df.at[i, 'flag'] = True

    # convert 12-hour values to 24-hour: PM adds 12 except at noon, and 12 AM is midnight
    if p == 'P' and h < 12:
        h += 12
    elif p == 'A' and h == 12:
        h = 0

    # add time data to issue date data using timedelta
    df.at[i, 'issue_date'] = df.at[i, 'issue_date'] + timedelta(hours = h, minutes = m)



In [293]:
# remove flagged rows
df = df.query('flag == False')
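
On a dataset this large, a row-by-row loop is slow. The parse, flag, and filter steps above could also be written with vectorized string and timedelta operations. The following is a sketch on a toy frame, not the notebook's code: `demo`, `stamped`, and the sample values are hypothetical. It keeps the same validity rule (hour below 24, minute below 60) and assumes the times use a 12-hour clock, so `1215A` maps to 00:15 and a 12:xx PM value stays at noon.

```python
import pandas as pd

# toy frame shaped like the data at this point (values are illustrative only)
demo = pd.DataFrame({
    'issue_date': pd.to_datetime(['2018-07-04', '2018-06-28', '2018-06-09', '2018-06-08']),
    'violation_time': ['0422P', '1130P', '1215A', '6831P'],
})

hours = demo['violation_time'].str[0:2].astype(int)
minutes = demo['violation_time'].str[2:4].astype(int)
is_pm = (demo['violation_time'].str[4] == 'P').astype(int)

# same validity rule as the loop: hour below 24 and minute below 60
ok = (hours < 24) & (minutes < 60)

# 12-hour to 24-hour: hours up to 12 wrap 12 -> 0 and add 12 for PM;
# hours above 12 are left untouched (assumed to be 24-hour values already)
hours_24 = hours.where(hours > 12, hours % 12 + 12 * is_pm)

stamped = demo.loc[ok].copy()
stamped['issue_date'] = (
    stamped['issue_date']
    + pd.to_timedelta(hours_24[ok], unit='h')
    + pd.to_timedelta(minutes[ok], unit='m')
)
stamped = stamped.drop(columns='violation_time')
print(stamped)
```

Because invalid rows are filtered with a boolean mask, no separate `flag` column is needed in this variant.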

In [297]:
# drop violation time as it's no longer needed
df.drop('violation_time', axis = 1, inplace = True)

In [298]:
df.head()

Unnamed: 0,registration_state,issue_date,violation_code,street_name,subdivision,vehicle_body_type,vehicle_make,vehicle_color,vehicle_year,flag
0,NY,2018-07-04 16:22:00,14,HANSON PLACE,D1,SDN,HONDA,BLUE,2006,False
1,NY,2018-06-28 23:30:00,46,AUSTIN ST,C,SDN,NISSA,GRY,2017,False
2,NY,2018-06-09 07:50:00,24,GREAT KILLS BOAT LAU,D5,SUBN,JEEP,GREEN,0,False
3,NC,2018-06-08 02:46:00,24,GREAT KILLS PARK BOA,D5,P-U,FORD,WHITE,0,False
4,NY,2018-06-30 10:28:00,17,HANSON PLACE,C4,SUBN,HYUND,GREEN,2007,False


In [302]:
# save to csv for now so we do not need to repeat lengthy cleaning steps
df.to_csv('temp_data', index = False)
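
A CSV file stores only text, so when the saved data is read back, the datetime column comes back as plain strings unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer in place of the `temp_data` file and a hypothetical one-row frame:

```python
import io
import pandas as pd

# one illustrative row standing in for the cleaned data
cleaned = pd.DataFrame({
    'issue_date': pd.to_datetime(['2018-07-04 16:22:00']),
    'violation_code': ['14'],
})

# an in-memory buffer stands in for the 'temp_data' file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; dtype keeps the code as a string
restored = pd.read_csv(buffer, parse_dates=['issue_date'], dtype={'violation_code': str})
print(restored.dtypes)
```

With the real file, the same `parse_dates` and `dtype` arguments would be passed to `pd.read_csv('temp_data', ...)`.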