In [134]:
import numpy as np
import pandas as pd
import time
import statsmodels.api as sm
pd.set_option('display.max_columns', 500)

In [145]:
start_time = time.time()
df = pd.read_csv('CT_cleaned.csv')
df['stop_date'] = pd.to_datetime(df['stop_date'])
print("--- %s seconds to load in data---" % (time.time() - start_time))

--- 1.5069735050201416 seconds to load in data---


In [147]:
df.head(5)

Unnamed: 0,id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,officer_id,stop_duration
0,CT-2013-00001,CT,2013-10-01,00:01,westport,Fairfield County,9001.0,"00000 N I 95 (WESTPORT, T158) X 18 LL",State Police,F,69,69.0,Black,Black,Speed Related,Speeding,False,,,False,Ticket,False,1000002754,1-15 min
1,CT-2013-00002,CT,2013-10-01,00:02,mansfield,Tolland County,9013.0,rte 195 storrs,State Police,M,20,20.0,White,White,Moving Violation,Moving violation,False,,,False,Verbal Warning,False,1000001903,1-15 min
2,CT-2013-00003,CT,2013-10-01,00:07,franklin,New London County,9011.0,Rt 32/whippoorwill,State Police,M,34,34.0,Hispanic,Hispanic,Speed Related,Speeding,False,,,False,Ticket,False,1000002711,1-15 min
3,CT-2013-00004,CT,2013-10-01,00:10,danbury,Fairfield County,9001.0,I-84,State Police,M,46,46.0,Black,Black,Speed Related,Speeding,False,,,False,Written Warning,False,113658284,1-15 min
4,CT-2013-00005,CT,2013-10-01,00:10,east hartford,Hartford County,9003.0,"00000 W I 84 (EAST HARTFORD, T043)E.OF XT.56",State Police,M,30,30.0,White,White,Speed Related,Speeding,False,,,False,Ticket,False,830814942,1-15 min


In [186]:
#remove duplicated columns and constant columns
#all stops were conducted by the State Police, so there is no need for the column
columns = ['location_raw', 'driver_age_raw', 'driver_race_raw', 'violation_raw', 'search_type_raw', 'police_department']

In [149]:
df = df.drop(columns = columns)
del(columns)

The following bit of code computes some statistics for the data set. It first gives the number of traffic stops, the number of searches conducted, and the number of hits. A hit is defined as 'contraband' being found during a search. 

From there, the search rate is computed as the number of searches over the number of stops.

Similarly, the hit rate is given as the number of hits over the number of searches. Wilson score intervals are computed for each rate because some of these proportions (the search rate in particular) are fairly close to 0, and the Wilson interval has been shown to give better coverage in this situation than the regular normal-theory confidence interval. The interval is computed with the following formula:

$\frac{ \hat{p} + z^2/2n}{1 + z^2/n} \pm \frac{z}{1 + z^2/n} \cdot \sqrt{\frac{\hat{p}(1- \hat{p})}{n} + \frac{z^2}{4n^2}}$

where $z$ is the $1 - \alpha/2$ quantile of the standard normal distribution, $\hat{p}$ is the estimated proportion, and $n$ is the sample size.

It should be noted that the Wilson score interval is still an approximation: its actual coverage probability can differ somewhat from the nominal level.
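As a sanity check, the formula above can be evaluated by hand. The following is a minimal sketch using only the standard library; `wilson_interval` is a hypothetical helper name (it is not part of statsmodels), and the counts are the white-driver search counts reported later in this notebook.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(count, nobs, alpha=0.01):
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    p_hat = count / nobs
    # two-sided critical value: the 1 - alpha/2 quantile of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / nobs
    center = (p_hat + z**2 / (2 * nobs)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / nobs + z**2 / (4 * nobs**2))
    return center - half_width, center + half_width

# 3108 searches out of 242349 stops of white drivers
low, high = wilson_interval(3108, 242349)
print(round(low, 5), round(high, 5))
```

With these counts the result should agree with the search-rate interval that `proportion_confint(..., method='wilson')` reports for white drivers.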

In [196]:

from statsmodels.stats.proportion import proportion_confint

def search_stats(df, alpha = 0.01):
    #number of stops, searches, and hits (searches in which contraband was found)
    n_stops = len(df.search_conducted)
    n_searches = np.sum(df.search_conducted)
    n_hits = np.sum(df.contraband_found)
    search_rate = n_searches / n_stops
    hit_rate = n_hits / n_searches
    
    #wilson confidence intervals at level 1 - alpha (alpha was previously hard-coded)
    ci_search = proportion_confint(n_searches, n_stops, alpha=alpha, method='wilson')
    ci_hit = proportion_confint(n_hits, n_searches, alpha=alpha, method='wilson')
    #label for the printed intervals, e.g. '99%' when alpha = 0.01
    level = '%g%%' % (100 * (1 - alpha))
    
    print(' %s traffic stops,' % n_stops, '%s searches, ' % n_searches,'%s hits for contraband,' % n_hits)
    
    print('Search rate: %s ,' % round(search_rate, 3))
    print('Search Rate %s confidence interval:' % level, (round(ci_search[0], 5), round(ci_search[1], 5)))
    print('Hit rate: %s ,' % round(hit_rate, 3))
    print('Hit Rate %s confidence interval:' % level, (round(ci_hit[0], 5), round(ci_hit[1], 5)))

Now we compute these statistics for various subsets of the data, starting with race: first for white drivers, then for non-white drivers. Non-white drivers are searched at more than twice the rate of white drivers (0.029 versus 0.013), and their hit rate is about $0.09$ lower (0.287 versus 0.379), indicating that when a search was performed in this sample, the probability of contraband actually being found was lower for non-white drivers.

According to the Open Policing Project's results, one possible interpretation of the search and hit rates is that, if the police are doing appropriate police work, then the hit rates given a search should be approximately the same across groups. If the hit rate is significantly lower for a particular group, that could indicate that searches of that group were conducted out of bias.

NB: The white and non-white confidence intervals are disjoint for both the search rate and the hit rate, which suggests statistical significance, but we can't read too much into that just yet.
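Comparing intervals by eye is only an informal check. A more direct comparison of the two hit rates is a pooled two-proportion z-test, sketched below with the standard library only. This is an illustrative addition rather than part of the original analysis: `two_proportion_ztest` is a hypothetical helper, the counts are the hits and searches reported in the cells that follow, and the test treats every search as independent and ignores any confounding variables.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Pooled two-sided z-test for equality of two proportions (hypothetical helper)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    # pooled proportion under the null hypothesis that both rates are equal
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# hits / searches: 1179 of 3108 for white drivers, 638 of 2224 for non-white drivers
z_stat, p_value = two_proportion_ztest(1179, 3108, 638, 2224)
print(z_stat, p_value)
```

A small p-value here says only that the two hit rates differ by more than sampling noise would explain under these assumptions; it does not by itself explain why they differ.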

In [151]:
search_stats(df[df['driver_race'] == 'White'])

 242349 traffic stops, 3108 searches,  1179 hits for contraband,
Search rate: 0.013 ,
Search Rate 99% confidence interval: (0.01225, 0.01343)
Hit rate: 0.379 ,
Hit Rate 99% confidence interval: (0.3572, 0.402)


In [152]:
search_stats(df[df.driver_race != 'White'])

 76320 traffic stops, 2224 searches,  638 hits for contraband,
Search rate: 0.029 ,
Search Rate 99% confidence interval: (0.02761, 0.03075)
Hit rate: 0.287 ,
Hit Rate 99% confidence interval: (0.26283, 0.31218)


Doing the same for gender, we see that women are searched at a much lower rate than men in this data set (0.008 versus 0.021), but the hit rates are approximately the same (0.333 versus 0.342). This may indicate a *lack* of bias when it comes to dealing with women.

In [153]:
search_stats(df[df.driver_gender == 'M'])

 211885 traffic stops, 4506 searches,  1542 hits for contraband,
Search rate: 0.021 ,
Search Rate 99% confidence interval: (0.02047, 0.02209)
Hit rate: 0.342 ,
Hit Rate 99% confidence interval: (0.32425, 0.36064)


In [154]:
search_stats(df[df.driver_gender == 'F'])

 106784 traffic stops, 826 searches,  275 hits for contraband,
Search rate: 0.008 ,
Search Rate 99% confidence interval: (0.00707, 0.00846)
Hit rate: 0.333 ,
Hit Rate 99% confidence interval: (0.29217, 0.37635)


Now let's look at those who were stopped for common violations: speeding, moving violations, stop sign/light violations, and cell phone violations.

Running our basic search stats tool shows that in the case of "common violations", non-white drivers are searched at more than twice the rate of white drivers (0.021 versus 0.009), but their hit rate is still lower than that of white drivers (0.255 versus 0.335).

In [193]:
#keep only the stops made for one of the common violations
common_violations = ['Speeding', 'Moving violation', 'Stop sign/light', 'Cell phone']
bad_driving = df[df['violation'].isin(common_violations)]
bad_driving.shape

(156160, 19)

In [197]:
search_stats(bad_driving[bad_driving['driver_race'] == 'White'])

 119286 traffic stops, 1097 searches,  368 hits for contraband,
Search rate: 0.009 ,
Search Rate 99% confidence interval: (0.00851, 0.00994)
Hit rate: 0.335 ,
Hit Rate 99% confidence interval: (0.29983, 0.37307)


In [198]:
search_stats(bad_driving[bad_driving['driver_race'] != 'White'])

 36874 traffic stops, 781 searches,  199 hits for contraband,
Search rate: 0.021 ,
Search Rate 99% confidence interval: (0.01933, 0.0232)
Hit rate: 0.255 ,
Hit Rate 99% confidence interval: (0.21682, 0.29691)
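To put a number on "how many times the rate", the ratio of the two search rates can be computed directly from the counts printed above. The sketch below is an illustrative addition: `rate_ratio_ci` is a hypothetical helper, and the interval uses the standard log-scale normal approximation for a ratio of two independent proportions rather than the Wilson method used elsewhere in this notebook.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def rate_ratio_ci(count1, nobs1, count2, nobs2, alpha=0.01):
    """Ratio of two proportions with a log-scale confidence interval (hypothetical helper)."""
    ratio = (count1 / nobs1) / (count2 / nobs2)
    # standard error of log(ratio) for two independent binomial samples
    se_log = sqrt(1 / count1 - 1 / nobs1 + 1 / count2 - 1 / nobs2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ratio, exp(log(ratio) - z * se_log), exp(log(ratio) + z * se_log)

# searches / stops for common violations:
# non-white drivers 781 of 36874, white drivers 1097 of 119286
ratio, low, high = rate_ratio_ci(781, 36874, 1097, 119286)
print(round(ratio, 2), round(low, 2), round(high, 2))
```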


In [203]:
#subset dataframe with stops that did not result in a search
no_search = bad_driving[bad_driving['search_conducted'] == False]

In [200]:
#computes some statistics on ticket rates
def ticket_stats(df, alpha = 0.01):
    #take the number of stops in the data
    n_stops = len(df.search_conducted)
    #compute the number of tickets given in data
    n_tickets = len(df[df['stop_outcome'] == 'Ticket'])
    #gives the proportion of tickets given
    ticket_rate = n_tickets / n_stops
    
    #wilson confidence interval for the ticket proportion at level 1 - alpha
    ci_ticket = proportion_confint(n_tickets, n_stops, alpha=alpha, method='wilson')
    #label for the printed interval, e.g. '99%' when alpha = 0.01
    level = '%g%%' % (100 * (1 - alpha))
    
    print(' %s traffic stops,' % n_stops, '%s tickets given, ' % n_tickets)
    
    print('Ticket Rate: %s ,' % round(ticket_rate, 3))
    print('Ticket Rate %s confidence interval:' % level, (round(ci_ticket[0], 5), round(ci_ticket[1], 5)))

Removing the stops that resulted in a search from these common violations removed only 1,878 data points out of 156,160 (about 1.2%). The probability of a search is fairly low given a stop for a "common" violation.

As a result, the ticket rates and corresponding confidence intervals with and without the searched stops are nearly identical. The ticket rate is about 69% for common violations.

In [204]:
ticket_stats(bad_driving)

 156160 traffic stops, 108053 tickets given, 
Ticket Rate: 0.692 ,
Ticket Rate 99% confidence interval: (0.68892, 0.69494)


In [205]:
ticket_stats(no_search)

 154282 traffic stops, 107088 tickets given, 
Ticket Rate: 0.694 ,
Ticket Rate 99% confidence interval: (0.69108, 0.69712)


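Let's take a look at the ticket rate as a function of gender.

Men are ticketed slightly more often than women, with a ticket rate of almost 70% (compared to about 68.5%). The 99% confidence intervals are disjoint, but only narrowly. Note that the second cell selects `driver_gender != 'M'`, so any rows with a missing gender would be counted together with the women.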

In [206]:
ticket_stats(no_search[no_search['driver_gender'] == 'M'])

 101391 traffic stops, 70858 tickets given, 
Ticket Rate: 0.699 ,
Ticket Rate 99% confidence interval: (0.69513, 0.70256)


In [207]:
ticket_stats(no_search[no_search['driver_gender'] != 'M'])

 52891 traffic stops, 36230 tickets given, 
Ticket Rate: 0.685 ,
Ticket Rate 99% confidence interval: (0.67977, 0.69017)


We can now take a look at the ticket rate as a function of race. To keep the more frequent searches of non-white drivers in this data set from influencing the comparison, we look only at those data points in which a search *was not* conducted.

For white drivers the ticket rate is about 67.6%, and for non-white drivers it is about 75.3%. The corresponding 99% (Wilson) confidence intervals are widely disjoint.

In [210]:
ticket_stats(no_search[no_search['driver_race']  == 'White'])

 118189 traffic stops, 79918 tickets given, 
Ticket Rate: 0.676 ,
Ticket Rate 99% confidence interval: (0.67267, 0.67968)


In [211]:
ticket_stats(no_search[no_search['driver_race']  != 'White'])

 36093 traffic stops, 27170 tickets given, 
Ticket Rate: 0.753 ,
Ticket Rate 99% confidence interval: (0.74688, 0.75858)
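The size of the gap can also be summarized with a confidence interval for the difference between the two ticket rates. The sketch below is an illustrative addition, not part of the original analysis: `rate_difference_ci` is a hypothetical helper, and it uses the plain normal-approximation (Wald) interval for a difference of two independent proportions, which seems reasonable here because both samples are large and both rates are far from 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

def rate_difference_ci(count1, nobs1, count2, nobs2, alpha=0.01):
    """Difference of two proportions with a Wald confidence interval (hypothetical helper)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    diff = p1 - p2
    # unpooled standard error of the difference
    se = sqrt(p1 * (1 - p1) / nobs1 + p2 * (1 - p2) / nobs2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se

# tickets / stops without a search:
# non-white drivers 27170 of 36093, white drivers 79918 of 118189
diff, low, high = rate_difference_ci(27170, 36093, 79918, 118189)
print(round(diff, 4), round(low, 4), round(high, 4))
```

An interval that excludes 0 is consistent with the disjoint Wilson intervals above, but like them it does not adjust for where, when, or why the stops were made.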
