
# COGS 108 - Final Project

## Important

- ONE, and only one, member of your group should upload this notebook to TritonED. 
- Each member of the group will receive the same grade on this assignment. 
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Only upload the .ipynb file to TED; do not upload any associated data. Make sure that any cells whose output you want graders to see have been executed.

## Group Members: Fill in the Student IDs of each group member here

Replace the lines below to list each person's full student ID, UCSD email, and full name.

- A11888496 - L4truong@ucsd.edu - Loc Truong
- A15352670 - t8wei@ucsd.edu - Timothy Wei
- A14732783 - jpt017@ucsd.edu - Jonathan Tran
- A11962666 - nnowain@ucsd.edu - Nathan Nowain
- A14493674 - cvshanno@ucsd.edu - Collin Shannon



# Introduction and Background

## Research Question

> Is a vehicle's make, color, or body style correlated with the type of violation it receives?
How does visual appearance affect the likelihood of receiving a parking violation? Do certain makes/models or colors receive more tickets than expected?
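
One way to make "more tickets than expected" concrete is a chi-square goodness-of-fit test on color counts. The sketch below is only illustrative: the observed counts are the six most common colors for agency 1 reported later in this notebook, while the equal-share baseline is a stand-in assumption, since the dataset does not say how common each color is on the road. The function name `color_goodness_of_fit` is hypothetical.

```python
import numpy as np
from scipy.stats import chisquare


def color_goodness_of_fit(observed, baseline_shares):
    """Chi-square goodness-of-fit test of ticket counts against baseline shares."""
    observed = np.asarray(observed, dtype=float)
    shares = np.asarray(baseline_shares, dtype=float)
    shares = shares / shares.sum()      # normalize so the shares sum to 1
    expected = shares * observed.sum()  # expected counts under the baseline
    return chisquare(f_obs=observed, f_exp=expected)


# Observed counts for agency 1 (WH, BK, GY, SI, BL, RE), taken from the
# value_counts output later in this notebook.
observed_counts = [42215, 38205, 31378, 20220, 16624, 12831]

# ASSUMPTION: equal shares are a placeholder baseline, not real fleet data.
placeholder_shares = [1, 1, 1, 1, 1, 1]

result = color_goodness_of_fit(observed_counts, placeholder_shares)
print(result.statistic, result.pvalue)
```

A meaningful test would replace the placeholder shares with an estimate of how common each color is among vehicles parked in the area.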


## Hypothesis

> We predict that loud colors and higher-end models have a higher chance of receiving violations.
We hypothesize that visual appearance does affect the likelihood of receiving a parking violation. We expect that parking officials are more likely to notice cars with loud colors and higher-end models, and are thus more likely to ticket them.


## Background

> We believe that people bring biases to their jobs, and parking enforcement is no exception. It is commonly claimed that police and parking enforcement officers must meet monthly ticket quotas. Beyond this commonly heard claim, is there also data showing that some cars get more tickets than others? With further research we found many online articles that summarize reasons why one might be more likely to get a ticket; some of their data show that certain vehicle makes and models, most notably expensive and luxurious ones, receive more speeding tickets [1]. It is also widely claimed that red cars get pulled over more than others, most likely because they are more visible than cars of other colors [2].

> With this in mind, we expect that a correlation may exist. As hypothesized, we believe that visible attributes such as color or make can increase a car's likelihood of receiving a ticket compared with vehicles that attract less attention. We believe this is important because we want to see whether implicit biases affect how parking enforcement issues violations. The results could help inform people who are choosing a car to buy, or simply reveal an interesting relationship between human cognition and how perception alters thinking.
  
> ### References

>> 1) https://www.more.com/lifestyle/6-things-almost-guarantee-speeding-ticket

>> 2) http://www.brettrics.com/9-million-parking-tickets-la/


# Data Description

Dataset Name: Los Angeles Parking Citations

>The dataset consists of 19 variables based on the information on the ticket slip. This includes ticket date, issue time, meter id, make, body style, color, location, route, agency, violation code, violation description, fine amount, and location data.

> Link to the dataset: https://data.lacity.org/A-Well-Run-City/Parking-Citations/wjz9-h9np

> Kaggle link: https://www.kaggle.com/cityofLA/los-angeles-parking-citations

# Data Cleaning / Pre Processing

In [1]:
# Imports -  These are all you need for the assignment: do not import additional packages
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

# tabula-py reads tables from PDFs; install it before importing
!pip install tabula-py
import tabula



In [2]:
# read parking citations(PC) file
PC_df = pd.read_csv('parking-citations.csv')
PC_df.head()




Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,Color,Location,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude
0,1103341116,2015-12-21T00:00:00,1251.0,,,CA,200304.0,,HOND,PA,GY,13147 WELBY WAY,01521,1.0,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0
1,1103700150,2015-12-21T00:00:00,1435.0,,,CA,201512.0,,GMC,VN,WH,525 S MAIN ST,1C51,1.0,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0
2,1104803000,2015-12-21T00:00:00,2055.0,,,CA,201503.0,,NISS,PA,BK,200 WORLD WAY,2R2,2.0,8939,WHITE CURB,58.0,6439997.9,1802686.4
3,1104820732,2015-12-26T00:00:00,1515.0,,,CA,,,ACUR,PA,WH,100 WORLD WAY,2F11,2.0,000,17104h,,6440041.1,1802686.2
4,1105461453,2015-09-15T00:00:00,115.0,,,CA,200316.0,,CHEV,PA,BK,GEORGIA ST/OLYMPIC,1FB70,1.0,8069A,NO STOPPING/STANDING,93.0,99999.0,99999.0


In [3]:
# read agency codes(AC) file
from tabula import read_pdf
AC_df = read_pdf('LADOT-Xerox Crib Sheet Agency Codes 12-31-2015.pdf')

# set header to top row
new_header = AC_df.iloc[0]
AC_df = AC_df[1:]
AC_df.columns = new_header

# print head to check
AC_df.head()


Unnamed: 0,CODE,AGENCY NAME,NAME
1,1,WESTERN,WESTERN
2,2,LAX CURRENT,LAX CUR
3,3,VALLEY,VALLEY
4,4,HOLLYWOOD,HOLLYWOOD
5,5,SOUTHERN,SOUTHERN


In [6]:
# remove unnecessary columns
PC_df = PC_df.drop(['Ticket number', 'Issue Date', 'Issue time', 'Meter Id', 'Marked Time', 'RP State Plate', 'Plate Expiry Date', 'VIN', 'Location', 'Route', 'Fine amount', 'Latitude', 'Longitude'], axis=1)

PC_df.head()


Unnamed: 0,Make,Body Style,Color,Agency,Violation code,Violation Description
0,HOND,PA,GY,1.0,4000A1,NO EVIDENCE OF REG
1,GMC,VN,WH,1.0,4000A1,NO EVIDENCE OF REG
2,NISS,PA,BK,2.0,8939,WHITE CURB
3,ACUR,PA,WH,2.0,000,17104h
4,CHEV,PA,BK,1.0,8069A,NO STOPPING/STANDING


In [72]:
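# color counts for agency 1; value_counts() skips missing (NaN) colors
agency_01 = PC_df[PC_df['Agency'] == 1.0]['Color'].value_counts()
print(agency_01)

a1_color = PC_df[PC_df['Agency'] == 1.0]['Color']
a1_color_count = a1_color.value_counts()
# unique() keeps NaN as its own value, so this count can be one higher than above
a1_color.unique().size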

#print(a1_color)
#a1_color.plot(kind='bar')

WH    42215
BK    38205
GY    31378
SI    20220
BL    16624
RE    12831
GR     9280
TA     6004
GO     3292
BR     3154
BU     1496
YE     1213
BE      719
OR      548
MA      391
PU      366
SL      185
PI       93
TE       73
CR       61
CH       48
BZ       36
WT       34
GL       28
RD       22
RU       21
TU       18
MU       15
YL        9
SA        9
      ...  
AP        1
CP        1
O         1
GA        1
GN        1
EF        1
TL        1
R         1
AU        1
MI        1
MY        1
GE        1
TR        1
SV        1
SM        1
CU        1
SU        1
UN        1
PP        1
AD        1
LE        1
MP        1
RA        1
WA        1
MO        1
SO        1
AM        1
VU        1
ES        1
BN        1
Name: Color, Length: 82, dtype: int64


83

In [32]:
# count the distinct agency codes in the data (unique() also counts NaN if present)

unique_agency_count = PC_df['Agency'].unique()
agency_count = unique_agency_count.size

print("We are looking at citations from ", agency_count,
      " different agencies in LA.")

#agency_01 = PC_df[PC_df['Agency'] == 1.0]
#agency_02 = PC_df[PC_df['Agency'] == 2.0]
#agency_03 = PC_df[PC_df['Agency'] == 3.0]
#agency_04 = PC_df[PC_df['Agency'] == 4.0]
#agency_05 = PC_df[PC_df['Agency'] == 5.0]
#agency_06 = PC_df[PC_df['Agency'] == 6.0]
#agency_07 = PC_df[PC_df['Agency'] == 7.0]
#agency_08 = PC_df[PC_df['Agency'] == 8.0]
#agency_09 = PC_df[PC_df['Agency'] == 9.0]
#agency_10 = PC_df[PC_df['Agency'] == 10.0]


We are looking at citations from  45  different agencies in LA.


GY = grey
WH = white | WT = white
BK = black
BL = blue | BE = blue | CO = cobalt
BR = brown
SI = silver | SL = silver
GO = gold
RE = red | RD = red

MA = magenta
TA = tan | TN = tan

BU = burgundy | BG = burgundy
GR = gray
YE = yellow
OR = orange
BN = brown
OT = ???
GN = green
MR = ??
PR = purple | PU = purple
UN = UNKNOWN???
PK = pink | PI = pink 
TU = turquoise

RU = ???
PL = purple ????
CR = ???
SN = ???
PE = ???
BZ = ???
ME = ???
CH = ???
TE = teal
LI = ???
MU = ???

In [None]:
#standardize colors function
def standardize_colors(string):

    string = string.lower()
    string = string.strip()
    
    # TODO: placeholder rules, not yet adapted to colors; edit below for colors
    if "cog" in string:
        output = "COGSCI"
    elif "computer" in string:
        output = "COMPSCI"
    elif "cs" in string:
        output = "COMPSCI"
    elif "math" in string:
        output = "MATH"
    elif "electrical" in string:
        output = "ECE"
    elif "bio" in string:
        output = "BIO"
    elif "chem" in string:
        output = "CHEM"
    
    #otherwise, keep as is
    else:
        output = string
    
    return output
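
A dictionary lookup is one possible way to adapt this function to colors. The sketch below is a minimal illustration, not the final cleaning step: `COLOR_MAP` and `standardize_color` are hypothetical names, the mapping covers only a subset of the codes listed in the notes above (the ones without question marks that seem least ambiguous), and every other code is lumped into `"other"`.

```python
import pandas as pd

# ASSUMPTION: a subset of the color notes above; codes not listed here
# (including all of the "???" entries) are treated as "other".
COLOR_MAP = {
    "GY": "grey",
    "WH": "white", "WT": "white",
    "BK": "black",
    "BL": "blue",
    "SI": "silver", "SL": "silver",
    "RE": "red", "RD": "red",
    "GO": "gold",
    "YE": "yellow",
    "OR": "orange",
    "TA": "tan", "TN": "tan",
    "BR": "brown", "BN": "brown",
    "GN": "green",
    "PU": "purple", "PR": "purple",
    "PI": "pink", "PK": "pink",
}


def standardize_color(code):
    """Map a raw color code to a readable name; missing values become 'unknown'."""
    if pd.isna(code):
        return "unknown"
    code = str(code).strip().upper()
    return COLOR_MAP.get(code, "other")


# Small illustrative input, including a missing value and an unmapped code.
sample = pd.Series(["wh", " WT", "RD", None, "ZZ"])
standardized = sample.apply(standardize_color)
print(standardized.tolist())
```

On the full data this could be applied as `PC_df['Color'].apply(standardize_color)`, after which synonymous codes such as `WH` and `WT` are counted together.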

# Data Visualization

## Ideas

> ### Graphs:

> Color vs. # of tix

> Model vs. # of tix

> Color + model vs. # of tix

> Color vs. type of tix

>Model vs. type of tix

>Color + model vs. type of tix
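
The comparisons listed above all start from count tables. The following is a minimal sketch, assuming the column names `Color` and `Violation Description` from the cleaned DataFrame and using the five rows shown in the `PC_df.head()` output as stand-in data; `sample_df` is a hypothetical name.

```python
import pandas as pd

# Stand-in data: the five rows shown by PC_df.head() earlier in this notebook.
sample_df = pd.DataFrame({
    "Make": ["HOND", "GMC", "NISS", "ACUR", "CHEV"],
    "Color": ["GY", "WH", "BK", "WH", "BK"],
    "Violation Description": [
        "NO EVIDENCE OF REG", "NO EVIDENCE OF REG", "WHITE CURB",
        "17104h", "NO STOPPING/STANDING",
    ],
})

# Color vs. number of tickets
tickets_per_color = sample_df["Color"].value_counts()

# Color vs. type of ticket
color_by_violation = pd.crosstab(
    sample_df["Color"], sample_df["Violation Description"]
)

print(tickets_per_color)
print(color_by_violation)
```

On the full data the same calls would run on `PC_df`, and either table could then be drawn with `.plot(kind='bar')`.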






# Data Analysis and Results

# Privacy/Ethics Considerations

The data was posted on Kaggle’s dataset page, but the information is released to the public via DataLA, a public site maintained by the Los Angeles government. We have permission to use this data under the Open Database License. There are no privacy concerns in regard to the data. The dataset is potentially biased in terms of whom it represents, for two reasons. The first is regional factors. For example, if parking officials concentrate on a neighborhood where parking restrictions are stricter, we would expect the car models common in that neighborhood to be overrepresented in the dataset. The second is that some violations may fall more heavily on lower-income individuals. For example, we would expect lower-income individuals to receive more expired-registration violations because they are less able to afford registration. If we identify any of these issues, we can narrow our research topic so that it excludes the biased data. For example, we will not analyze expired-registration violations if we find that they are heavily skewed toward models typically owned by lower-income individuals.
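
Excluding registration violations, as described above, could be sketched as a simple filter. This is an illustration under assumptions: `drop_registration_violations` and `REGISTRATION_DESCRIPTIONS` are hypothetical names, the stand-in rows are the five shown in the `PC_df.head()` output, and `NO EVIDENCE OF REG` is the only registration-related description known from those rows; the full data may contain others.

```python
import pandas as pd

# Stand-in data: the five rows shown by PC_df.head() earlier in this notebook.
sample_df = pd.DataFrame({
    "Make": ["HOND", "GMC", "NISS", "ACUR", "CHEV"],
    "Color": ["GY", "WH", "BK", "WH", "BK"],
    "Violation Description": [
        "NO EVIDENCE OF REG", "NO EVIDENCE OF REG", "WHITE CURB",
        "17104h", "NO STOPPING/STANDING",
    ],
})

# ASSUMPTION: the registration-related descriptions to exclude; only
# "NO EVIDENCE OF REG" is known from the rows shown in this notebook.
REGISTRATION_DESCRIPTIONS = {"NO EVIDENCE OF REG"}


def drop_registration_violations(df):
    """Return a copy of df without registration-related citations."""
    is_registration = df["Violation Description"].isin(REGISTRATION_DESCRIPTIONS)
    return df[~is_registration].copy()


filtered_df = drop_registration_violations(sample_df)
print(len(sample_df), "->", len(filtered_df))
```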

In addition, the dataset already contained few personal identifiers of the individuals whose vehicles received citations. We followed the Safe Harbor method by removing the VIN column from our dataset to further anonymize it. Thus, our project respects the privacy of the individuals whose vehicles received citations, as we have altered the dataset to focus solely on the vehicles.

# Conclusion and Discussion