# 🚘 Analyzing SF traffic stops with `R`: Part 1

<img src="img/sf-traffic.jpg" alt="traffic" width="600" align="left"/>

## Introduction

In this series of tutorials, we'll use `R` to explore traffic stops in San Francisco (SF). In particular, we'll investigate whether there is evidence of racial discrimination in SF's policing practices. 

> **Important note**: Policing can be a sensitive subject. It's important to remember that each row in our data represents a real interaction between a police officer and driver. Please keep this in mind as you work through the tutorial, and be sure to engage with the material to the extent you're comfortable. 

By the end of the tutorials, you'll have a foundational understanding of the following:
1. 📊 How to use `R` to explore tabular data and calculate descriptive statistics.
2. 📈 How to make an informative plot with `R`.
3. ⚖️ How to approach questions about social policy with data.

Let's get started!

## ✅ Set up

While the core `R` language contains many useful functions (e.g., `sum` and `sample`), there is vast functionality built on top of `R` by community members.

Make sure to run the cell below. It imports additional useful functions, adjusts `R` settings, and loads in data. 

In [69]:
# Load in additional functions
library(tidyverse)
library(lubridate)

# Use three digits past the decimal point
options(digits = 3)

# This makes our plots look nice!
theme_set(theme_bw())

# Read in the data
stops <- read_rds("data/sf_stop_data.rds")
pop_2015 <- read_rds("data/sf_pop_2015.rds")

### 🖼️ The data frame

Data frames are like spreadsheets in Microsoft Excel or Google Sheets: they have rows and columns, and each cell in the spreadsheet contains data.

Run the cell below to preview the `stops` data. What do you notice?

> 🔎 The `head` function shows the first few rows of a data frame (six by default).

In [70]:
head(stops)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation


⬆️ From the preview above, we might guess that each row in the `stops` data frame represents a stop, and each column records one piece of information about that stop.

> This guess is correct!

### 💭 Asking questions about the data

As an analyst, you might start with some basic questions:

1. How many stops (i.e., rows) are in the `stops` data?
2. What do we know about each stop?
3. When was the earliest stop?
4. What were the most common reasons for stops?
5. Who is most likely to get stopped?

Let's start with the first question: how many rows are in the `stops` data?

In [71]:
nrow(stops)

Looks like we have information on approximately 640,000 stops.

What do we know about each stop?

In [72]:
colnames(stops)

It looks like we have the basics of each stop: time, location, demographics, and outcomes.

## 🚀 Exercise: Stop dates

When did the traffic stops in the `stops` data occur? 

Use the `date` column in the `stops` data to get a sense of when stops typically occur. Write a comment explaining your results. 

A few pointers:

> 💵 To extract a column from a data frame, use the `$` symbol. To retrieve column `age` from data frame `df`, we write `df$age`.

> You may find the following functions helpful: `sample`, `min`, `max`, `range`, and `print`. You can learn more about a function `f` by running `?f`.
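
Before working on the real data, it may help to see these pointers on a tiny example. The sketch below uses a made-up data frame called `toy` (the name and the values are hypothetical, not part of the SF data) to show `$` together with `min` and `range`.

```r
# Toy data frame with hypothetical values (not the SF data)
toy <- data.frame(
  date = as.Date(c("2012-03-04", "2009-01-01", "2016-06-30")),
  age = c(31, 22, 47)
)

# `$` pulls a single column out of the data frame as a vector
ages <- toy$age
print(ages)

# `min` and `range` work on dates as well as numbers
earliest <- min(toy$date)
date_span <- range(toy$date)
print(earliest)
print(date_span)
```

The same pattern should carry over to `stops`: extract the column with `$`, then pass it to whichever function you want to try.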

In [73]:
## Your code here!

# START

date_col = stops$date

# The earliest stop took place on Jan 1, 2009.
print(min(date_col))

# The last stop took place on Jun 30, 2016.
print(max(date_col))

# `range` gives us the same information as above in a more compact form.
print(range(date_col))

# A random sample shows stops spread throughout 2009-2016.
# `min` and `max` alone only tell us the endpoints!
print(sample(date_col, 50))

# END

[1] "2009-01-01"
[1] "2016-06-30"
[1] "2009-01-01" "2016-06-30"
 [1] "2009-08-11" "2009-10-08" "2011-12-23" "2010-07-23" "2015-02-20"
 [6] "2012-01-06" "2009-06-06" "2015-03-12" "2010-01-08" "2009-05-12"
[11] "2011-11-11" "2016-03-13" "2016-04-23" "2011-09-03" "2014-02-16"
[16] "2010-10-28" "2014-04-07" "2012-01-06" "2012-01-06" "2009-09-22"
[21] "2010-03-12" "2009-07-30" "2010-02-19" "2009-08-21" "2012-10-05"
[26] "2011-03-24" "2016-03-12" "2010-05-22" "2016-06-27" "2009-07-20"
[31] "2015-01-25" "2010-11-09" "2009-08-17" "2009-11-03" "2015-03-30"
[36] "2013-03-04" "2016-01-09" "2014-04-17" "2009-02-27" "2013-06-08"
[41] "2012-05-20" "2011-06-20" "2015-02-08" "2015-05-15" "2012-02-29"
[46] "2010-06-03" "2010-06-09" "2009-10-01" "2010-12-24" "2009-08-11"


## 🚰 The pipe: `%>%`

Both of these lines of code do exactly the same thing:

In [74]:
# Method 1
print(nrow(stops))

# Method 2
stops %>% 
    nrow() %>%
    print()

[1] 636161
[1] 636161


Why should we care? Read on to find out!

### The math of the pipe `%>%`

To process a dataset, we may have to use several functions. For example, we may want to use function `a`, then function `b`, and finally function `c`:

```
c(b(a(data)))
```

To understand what this code is doing, we have to read the code ⏪inside out⏩: we start with `a`, then apply `b`, then apply `c`. 

🙀 If we start adding more functions, things get messy:

```
f(e(d(c(b(a(data))))))
```


The pipe `%>%` allows us to turn our code inside out. This makes our code read more like a sentence:

```
# do a(), then b(), then c(), then d(), then e(), then f()

data %>% a() %>% b() %>% c() %>% d() %>% e() %>% f()
```

More readably:

```
data %>%
    a() %>%
    b() %>%
    c() %>%
    d() %>%
    e() %>%
    f()
```

The pipe pushes (i.e., pipes!) what's on the left of the pipe `%>%` into the first argument of the function on the right:

```
x %>% f() == f(x)
x %>% f(y) == f(x, y)
x %>% f(y, z) == f(x, y, z)
```
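
The three equivalences above can be checked directly. Here is a minimal sketch using base functions (`sqrt`, `round`, and `seq`) as stand-ins for `f`; the variable names are only for illustration.

```r
library(magrittr)  # provides %>%; library(tidyverse) makes it available too

x <- c(4, 9, 16)

# x %>% f() is the same as f(x)
a_piped <- x %>% sqrt()
a_plain <- sqrt(x)

# x %>% f(y) is the same as f(x, y)
b_piped <- 3.14159 %>% round(2)
b_plain <- round(3.14159, 2)

# x %>% f(y, z) is the same as f(x, y, z)
c_piped <- 1 %>% seq(10, 3)
c_plain <- seq(1, 10, 3)

print(identical(a_piped, a_plain))
print(identical(b_piped, b_plain))
print(identical(c_piped, c_plain))
```

In each pair, the value on the left of the pipe ends up as the first argument of the function on the right.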

The pipe `%>%` really ☀️shines☀️ when you have a lot of steps! 

## 📝 Adding new columns with `mutate`

Our data extends from 2009 to the first half of 2016. Suppose we want to examine the most recent full year of data: 2015.

Problem: We don't have a `year` column. To add new columns, we use `mutate`.

🖥️ Usage: `mutate(data, new_col = f(existing_col))`
* `data`: the data frame
* `new_col`: name of the new column to add
* `f`: function to apply to existing column(s) to generate the new column
* `existing_col`: name of existing column

For example, here's how we could add a column to `stops` containing the first digit of the driver's age.

In [75]:
stops %>%
    mutate(age_first_digit = floor(age / 10)) %>%
    head()

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,age_first_digit
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation,2
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation,4
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation,4
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation,2
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation,2
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation,3


❗❗❗Important note❗❗❗: Most `R` functions are "copy on modify". In other words, when we apply a function to data, `R` creates a copy of the data and then modifies the copy. The original data is unchanged.

So, `mutate` alone will not change the original data. 
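
To see this behavior concretely, here is a small sketch on a made-up tibble (`toy` and `toy_w_next` are hypothetical names, and the ages are invented). The first `mutate` call prints a result but leaves `toy` untouched; the second keeps the result by assigning it.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(age = c(22, 44, 27))

# Without assignment, the result is printed and then discarded
toy %>% mutate(age_next_year = age + 1)
print(colnames(toy))  # still only "age"

# Assign the result to a variable to keep the new column
toy_w_next <- toy %>% mutate(age_next_year = age + 1)
print(colnames(toy_w_next))
```

This is why the exercises ask you to assign results to new variables.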

### 🚀 Exercise

1. Use `year()` and `mutate()` to add a new column called `yr` to our `stops` data. 

> You can read about the `year()` function by running `?year`.

2. Assign the resulting data frame to a new variable called `stops_w_yr`. 
 
3. Finally, run `count(stops_w_yr, yr)`. 

> What do you think `count` does? Do you notice any patterns?

In [76]:
# Your code here!

# START

stops_w_yr = stops %>% 
    mutate(yr = year(date))

# Count the number of stops in each year.
# In general, there are fewer stops in later years.
# There are far fewer stops in 2014 and 2016 than in neighboring years.
# We saw earlier that we only have the first half of 2016.
# But what's going on with 2014?
count(stops_w_yr, yr)

# END

yr,n
<dbl>,<int>
2009,110269
2010,104254
2011,99476
2012,82362
2013,74144
2014,39752
2015,85689
2016,40215


## 📝 Selecting rows with `filter`

Now that we have a `yr` column, we want to limit our data to just the stops in 2015.

Problem: We have data from 2009 to 2016. To limit to specific rows, we use `filter`.

🖥️ Usage: `filter(data, condition)`
* `data`: the data frame
* `condition`: a boolean vector where TRUE indicates the rows in `data` to keep.

For example, here's how we could limit `stops` to drivers under 30 years old:

In [77]:
stops %>%
    filter(age < 30) %>%
    head()

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation
2009-01-01,10:44:00,19TH/SANTIAGO,37.7,-122,I,29,white,female,False,,False,Equipment violation
2009-01-01,10:55:00,LA SALLE @ NEWCOMB,37.7,-122,C,26,black,male,False,False,True,Moving violation
2009-01-01,01:10:00,CORDELIA AL & BROADWAY,37.8,-122,A,19,black,male,False,,False,Moving violation
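
One detail worth knowing: `filter` keeps only the rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` (for example, because a value is missing) are dropped. The sketch below illustrates this on a made-up tibble; the names `toy`, `under_30`, and `under_30_or_missing` and all values are hypothetical.

```r
library(dplyr)

# Toy data with one missing age (hypothetical values)
toy <- tibble(
  id = 1:4,
  age = c(22, 45, NA, 29)
)

# NA < 30 is NA, not TRUE, so row 3 is not kept
under_30 <- toy %>% filter(age < 30)
print(under_30$id)

# To keep rows with a missing age as well, ask for them explicitly
under_30_or_missing <- toy %>% filter(age < 30 | is.na(age))
print(under_30_or_missing$id)
```

Whether dropping those rows is appropriate depends on the question being asked, so it is worth checking how many rows a filter removes.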


### 🚀 Exercise

1. Use `filter()` to filter the `stops` data to just 2015. Assign the result to a variable called `stops_2015`.

2. In the previous exercise, we saw that there were a lot fewer stops in 2014 than expected. Figure out why.

3. For practice, filter to stops occurring in 2013 or 2014 among female drivers less than 30 years old or more than 60 years old. 

In [91]:
# Your code here!

# START

# 1.
stops_2015 = stops_w_yr %>%
    filter(yr == 2015)

# 2.
stops_2014 = stops_w_yr %>%
    filter(yr == 2014)

# The 2014 data only covers stops through May 30.
print(range(stops_2014$date))

# 2., but using only one pipe. 
# `pull(x, y)` has the same effect as `x$y`, but is more easily pipe-able.
# Similar to how `+`(x, y) is the same as `x + y`.
# `$` and `+` are called infix functions (or infix operators).
stops_w_yr %>%
    filter(yr == 2014) %>%
    pull(date) %>%
    range() %>%
    print()

# 3. 
stops_w_yr %>%
    filter(
        (yr == 2013 | yr == 2014) &
        (gender == 'female') &
        (age < 30 | age > 60)
    ) %>%
    head()

# 3., but more compact
# Comma-separated conditions in `filter()` are combined with `&`.
stops_w_yr %>%
    filter(
        yr %in% c(2013, 2014),
        gender == 'female',
        age < 30 | age > 60
    ) %>%
    head()
# END

[1] "2014-01-01" "2014-05-30"
[1] "2014-01-01" "2014-05-30"


date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2013-01-01,01:02:00,17 ST / SAN BRUNO AV,37.8,-122,C,19,hispanic,female,False,,False,Equipment violation,2013
2013-01-01,10:26:00,3RD ST & BANCROFT,37.7,-122,C,22,black,female,False,,False,Equipment violation,2013
2013-01-01,10:35:00,6TH ST & HOWARD,37.8,-122,B,28,black,female,False,,False,Equipment violation,2013
2013-01-01,11:45:00,SUTTER & STOCKTON,37.8,-122,A,22,white,female,False,,False,Equipment violation,2013
2013-01-01,12:45:00,SOUTH VANNESS & MARKET,37.8,-122,B,24,hispanic,female,False,,False,Equipment violation,2013
2013-01-01,12:50:00,868 MISSION,37.8,-122,B,28,white,female,False,,False,Moving violation,2013


date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2013-01-01,01:02:00,17 ST / SAN BRUNO AV,37.8,-122,C,19,hispanic,female,False,,False,Equipment violation,2013
2013-01-01,10:26:00,3RD ST & BANCROFT,37.7,-122,C,22,black,female,False,,False,Equipment violation,2013
2013-01-01,10:35:00,6TH ST & HOWARD,37.8,-122,B,28,black,female,False,,False,Equipment violation,2013
2013-01-01,11:45:00,SUTTER & STOCKTON,37.8,-122,A,22,white,female,False,,False,Equipment violation,2013
2013-01-01,12:45:00,SOUTH VANNESS & MARKET,37.8,-122,B,24,hispanic,female,False,,False,Equipment violation,2013
2013-01-01,12:50:00,868 MISSION,37.8,-122,B,28,white,female,False,,False,Moving violation,2013


## 📝 Aggregating data with `summarize()`

What was the average, median, maximum, and minimum age of drivers in 2015?

Problem: We want to aggregate the values in a column. To do this, we use `summarize()`.

In [92]:
# Old method.
mean(stops_2015$age)
median(stops_2015$age)
max(stops_2015$age)
min(stops_2015$age)

# New method!
stops_2015 %>%
    summarize(
        mean_age = mean(age),
        median_age = median(age),
        max_age = max(age),
        min_age = min(age)
    )

mean_age,median_age,max_age,min_age
<dbl>,<int>,<int>,<int>
,,,


😱 Uh oh. By default, `R` will return `NA` for aggregating functions if at least one element is `NA` (i.e., missing).

> The `na.rm=TRUE` argument will remove (`rm`) all `NA` values.

In [80]:
mean(c(1, 2, 3, 4, NA))
mean(c(1, 2, 3, 4, NA), na.rm=TRUE)

🔄 Let's try things one more time:

In [93]:
# Old method.
mean(stops_2015$age, na.rm=TRUE)
median(stops_2015$age, na.rm=TRUE)
max(stops_2015$age, na.rm=TRUE)
min(stops_2015$age, na.rm=TRUE)

# New method!
stops_2015 %>%
    summarize(
        mean_age = mean(age, na.rm=TRUE),
        median_age = median(age, na.rm=TRUE),
        max_age = max(age, na.rm=TRUE),
        min_age = min(age, na.rm=TRUE)
    )

mean_age,median_age,max_age,min_age
<dbl>,<int>,<int>,<int>
38.8,36,99,10


Neat! But, it's not groundbreaking. `summarize()` really ☀️ shines ☀️ when used with `group_by()`.

## 📝 Getting powerful with `group_by()` and `summarize()`

Here's where things get really interesting. The techniques in this section account for a **huge** chunk of most data science workflows. 

Suppose we're interested in the average age of drivers in each district.

> `unique(v)` returns the set of unique values in a vector `v`

> `sort(v)` sorts a vector `v` in numeric or alphabetical order.

In [82]:
sort(unique(stops_2015$district))

# Alternatively
stops_2015$district %>% unique %>% sort

You already have the tools to find the average age of drivers by district! 

Looks a little scary though...

In [83]:
stops_2015 %>% filter(district=='A') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='B') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='C') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='D') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='E') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='F') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='G') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='H') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='I') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='J') %>% pull(age) %>% mean(na.rm=TRUE)

# 😓

We now know the average age in each district, but there are some issues:
- We had to write a lot of repeated code.
- What if there were 100 districts? Or 1,000,000 districts?
- The results aren't labeled. We'd have to write even more code to label the output.

Here's another way to answer the question, but with less code:

In [84]:
stops_2015 %>%
    group_by(district) %>%
    summarize(avg_age = mean(age, na.rm=TRUE))

district,avg_age
<chr>,<dbl>
A,39.3
B,38.1
C,36.6
D,37.2
E,38.8
F,39.8
G,40.8
H,38.2
I,39.8
J,38.8


# 😮

The next section will explain the magic of grouping.

### 📝 The mechanics of `group_by()`

It's **very** common to calculate an aggregate statistic (e.g., `sum` or `mean`) for different groups (e.g., district or class year).

The *split-apply-combine* paradigm handles these situations:
- **Split** the data by group into mini-datasets
- **Apply** a function to each mini-dataset
- **Combine** the results from each mini-dataset back into a single table
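
To make the three steps concrete, here is a sketch that spells them out by hand in base `R` on a made-up data frame (the name `toy` and its values are hypothetical). This is only an illustration of the paradigm, not how `group_by()` and `summarize()` are implemented internally.

```r
# Toy data with hypothetical values
toy <- data.frame(
  district = c("A", "A", "B", "B", "B"),
  age = c(20, 30, 40, 50, 60)
)

# Split: one vector of ages per district
ages_by_district <- split(toy$age, toy$district)

# Apply: compute the mean of each piece
means <- sapply(ages_by_district, mean)

# Combine: collect the per-district results into a single table
result <- data.frame(district = names(means), avg_age = unname(means))
print(result)
```

With `group_by()` and `summarize()`, the same three steps collapse into a short pipeline.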

🖼️ A visual:

<img src="img/split-apply-combine.drawio.png" alt="splitapplycombine" width="600" align="left"/>

#### 📝 Splitting with `group_by`

`group_by` handles the *splitting* step.

Problem: The data isn't grouped. To split the data, we use `group_by`.

🖥️ Usage: `group_by(data, column)`
* `data`: the data frame
* `column`: the name of the column to group by.

Let's try grouping the `stops_2015` data by district.

In [85]:
stops_2015_grouped = stops_2015 %>%
    group_by(district)

head(stops_2015_grouped)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2015-01-01,01:00:00,3RD ST. & MCKNINNON AVE.,37.7,-122,C,23,black,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,MISSION/EUGENIA,37.7,-122,H,30,hispanic,female,False,,False,Moving violation,2015
2015-01-01,01:00:00,MISSION ST & VFALENCIA ST,37.7,-122,H,35,white,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,EDDY / GOUGH,37.8,-122,E,44,white,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,24TH/TARAVAL,37.7,-122,I,60,white,male,False,,False,MPC violation,2015
2015-01-01,10:46:00,307 CORTLAND AVE.,37.7,-122,H,45,white,female,False,,False,Moving violation,2015


Wait a second. This looks exactly the same as the regular data:

In [86]:
head(stops_2015)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2015-01-01,01:00:00,3RD ST. & MCKNINNON AVE.,37.7,-122,C,23,black,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,MISSION/EUGENIA,37.7,-122,H,30,hispanic,female,False,,False,Moving violation,2015
2015-01-01,01:00:00,MISSION ST & VFALENCIA ST,37.7,-122,H,35,white,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,EDDY / GOUGH,37.8,-122,E,44,white,male,False,,False,Moving violation,2015
2015-01-01,01:00:00,24TH/TARAVAL,37.7,-122,I,60,white,male,False,,False,MPC violation,2015
2015-01-01,10:46:00,307 CORTLAND AVE.,37.7,-122,H,45,white,female,False,,False,Moving violation,2015


❗❗Important note❗❗: `group_by` doesn't actually change the underlying data. It invisibly groups the data in the background.

> There is a subtle indication that the data is grouped. If you look at the top of the grouped data frame, you'll see `A grouped_df`. At the top of the ungrouped data, you'll see `A tibble`.

> A *tibble* is a data frame with some extra features.
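
If you'd rather check for grouping in code than by eye, `dplyr` has a few helpers for that. The sketch below uses a made-up tibble (`toy` and `toy_grouped` are hypothetical names) to show that grouping leaves the shape of the data alone and only adds grouping information.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(
  district = c("A", "A", "B"),
  age = c(20, 30, 40)
)
toy_grouped <- toy %>% group_by(district)

# Same number of rows and columns either way
print(dim(toy))
print(dim(toy_grouped))

# The grouping is stored alongside the data
print(is_grouped_df(toy))
print(is_grouped_df(toy_grouped))
print(group_vars(toy_grouped))
print(n_groups(toy_grouped))

# ungroup() removes the grouping again
toy_ungrouped <- ungroup(toy_grouped)
print(is_grouped_df(toy_ungrouped))
```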

#### 📝 Applying and combining with `summarize()`

`summarize()` *applies* an aggregating function to each mini-dataset. It then *combines* the mini-datasets.

We've already seen `summarize()` in action:

In [87]:
stops_2015 %>%
    summarize(
        avg_age = mean(age, na.rm=TRUE)
    )

avg_age
<dbl>
38.8


Let's try `summarize()` with grouped data.

> As a bonus, we can also calculate the size of each group with the `n()` function.

In [88]:
stops_2015 %>%
    group_by(district) %>%
    summarize(
        avg_age = mean(age, na.rm=TRUE),
        num_stops_in_district = n()
    )

district,avg_age,num_stops_in_district
<chr>,<dbl>,<int>
A,39.3,8098
B,38.1,7988
C,36.6,9199
D,37.2,8760
E,38.8,5862
F,39.8,7049
G,40.8,10345
H,38.2,8702
I,39.8,13983
J,38.8,5703
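
Counting rows per group is so common that `dplyr` offers `count()` as a shorthand for `group_by()` followed by `summarize(n = n())`. Here is a small sketch on made-up data (the names `toy`, `by_hand`, and `shorthand` are hypothetical) comparing the two.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(district = c("A", "B", "A", "C", "A"))

# Group, then count the rows in each group
by_hand <- toy %>%
  group_by(district) %>%
  summarize(n = n())

# The same counts in one step
shorthand <- toy %>% count(district)

print(by_hand)
print(identical(by_hand$n, shorthand$n))
```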


That's all there is to it!

### 🚀 Exercise

1. Use `group_by()` and `summarize()` to calculate, by district, (1) the number of stops, (2) the proportion of stops that resulted in a search, and (3) the proportion of **searches** (not stops) that resulted in an arrest. What can you conclude from the results?

2. Redo part 1, but group by race instead of district. What do you conclude from the result?

3. Redo part 1, but group by district **and** race. What is your interpretation of the results?

In [89]:
# Your code here!

# START

# 1.
# Search and arrest rates differ by district.
# Districts with higher search rates tend to have lower arrest rates,
# and vice-versa. 
stops_2015 %>%
    group_by(district) %>%
    summarize(
        n_stops = n(),
        
        n_searches = sum(searched),
        search_rate = n_searches / n_stops,
        
        n_arrests = sum(arrested),
        arrest_rate = n_arrests / n_searches,
    )

# 1., but more compact
stops_2015 %>%
    group_by(district) %>%
    summarize(
        n_stops = n(),
        
        # the mean of a boolean vector is the proportion of
        # elements in the vector that are TRUE
        search_rate = mean(searched),
        
        # mean(arrested) is the proportion of stops that resulted in arrests
        # We only want to consider stops that resulted in searches.
        arrest_rate = sum(arrested)/sum(searched)
    )

# 2.
# Search rates for Black and Hispanic drivers are higher than those for drivers
# of other race/ethnicity groups. However, the arrest rates for Black and
# Hispanic drivers are lower than for other race/ethnicity groups. Perhaps the
# search threshold for Black and Hispanic drivers is lower, so the Black and
# Hispanic drivers who are searched are less "risky"?
stops_2015 %>%
    group_by(race) %>%
    summarize(
        n_stops = n(),
        
        search_rate = mean(searched),
        
        arrest_rate = mean(arrested)/mean(searched)
    )

# 3.
# The output is quite long, but it looks like the gaps in search and arrest
# rates for Black and Hispanic drivers persist across districts, though
# the gaps differ in size across districts.
# A plot might help us digest these results more easily.
stops_2015 %>%
    group_by(district, race) %>%
    summarize(
        n_stops = n(),
        
        search_rate = mean(searched),
        
        arrest_rate = mean(arrested)/mean(searched)
    )

# END

district,n_stops,n_searches,search_rate,n_arrests,arrest_rate
<chr>,<int>,<int>,<dbl>,<int>,<dbl>
A,8098,185,0.0228,65,0.351
B,7988,343,0.0429,107,0.312
C,9199,1163,0.1264,183,0.157
D,8760,584,0.0667,149,0.255
E,5862,393,0.067,85,0.216
F,7049,102,0.0145,48,0.471
G,10345,119,0.0115,45,0.378
H,8702,569,0.0654,105,0.185
I,13983,214,0.0153,66,0.308
J,5703,515,0.0903,84,0.163


district,n_stops,search_rate,arrest_rate
<chr>,<int>,<dbl>,<dbl>
A,8098,0.0228,0.351
B,7988,0.0429,0.312
C,9199,0.1264,0.157
D,8760,0.0667,0.255
E,5862,0.067,0.216
F,7049,0.0145,0.471
G,10345,0.0115,0.378
H,8702,0.0654,0.185
I,13983,0.0153,0.308
J,5703,0.0903,0.163


race,n_stops,search_rate,arrest_rate
<chr>,<int>,<dbl>,<dbl>
asian/pacific islander,15498,0.0125,0.32
black,14955,0.1565,0.155
hispanic,11911,0.0638,0.264
other,13560,0.0172,0.33
white,29765,0.0221,0.354


[1m[22m`summarise()` has grouped output by 'district'. You can override using
the `.groups` argument.


district,race,n_stops,search_rate,arrest_rate
<chr>,<chr>,<int>,<dbl>,<dbl>
A,asian/pacific islander,1479,0.01149,0.412
A,black,1088,0.07261,0.253
A,hispanic,888,0.03604,0.25
A,other,1821,0.01318,0.542
A,white,2822,0.01169,0.515
B,asian/pacific islander,1100,0.01273,0.714
B,black,1315,0.11635,0.176
B,hispanic,1064,0.04887,0.385
B,other,1432,0.02165,0.323
B,white,3077,0.03022,0.43


## Concluding remarks

The method used in the final exercise is called an **outcome test**. The idea is usually credited to the economist Gary Becker, who won the 1992 Nobel Prize in economics for work that included the economics of discrimination!

Here's what we'll do in Part 2:
- Use 📊plots📈 to reduce the cognitive burden of reading long tables.
- Learn how to combine data from multiple sources.
- Dig deeper into our results. Can we say anything about racial/ethnic discrimination based on our results? What additional tests can we conduct? How can we clearly present our findings?
