# 🤖 Inframarginality via simulation

❗❗❗ **Make sure to save a copy of this notebook to your Google Drive so your work isn't lost.**

## Introduction

In this tutorial, we'll use `R` to build a simulation of pretrial incarceration and investigate the concept of inframarginality. 

By the end of the tutorial, you'll have a foundational understanding of the following:
1. ⛔ Why error rate comparisons suffer from the problem of inframarginality
2. 📈 How to write and plot the results of a simulation using `R`

## ✅ Set up

Make sure to run the cell below. It imports additional useful functions and adjusts `R` settings.

In [0]:
# Load in additional functions
library(tidyverse)

# Use three digits past the decimal point
options(digits = 3)

# This makes our plots look nice!
theme_set(theme_bw())


## 🏛️ Primer on pretrial incarceration

In the lecture, we talked about how judges have to decide whether to release or detain defendants who plead "not guilty" to a crime. The actual trial date may be days to years in the future.

> Pretrial detention is a controversial practice. This lab is not intended to condone or condemn the practice.

2️⃣ There are two typical reasons why a judge might choose to detain a defendant: 
1. The judge suspects the defendant will commit a crime if released.
2. The judge thinks the defendant will fail to appear (FTA) at a future required court date.

⬆️ Pretrial decisions impose high costs on the community, whether a defendant is detained or released. For example,
- Detained defendants may lose their jobs, and their families may suffer as a result. 
- Pretrial detention is generally much more expensive than community monitoring or check-ins.
- There are large costs associated with severe crimes and having to track down defendants who flee the jurisdiction. Of course, certain types of crime and forgotten court dates tend to be far less costly.

Judges have to weigh the costs of detention against the costs of potential violations.

## 🎯 Error rates in pretrial detention decisions

For the purposes of this lab, we define a "pretrial violation" as either committing a crime while released or intentionally failing to appear at a hearing.

> Pretrial violations often include things like forgetting a court date or failing a drug test. For this lab, assume we only consider severe violations.

It's generally expected that some released defendants will violate the terms of their pretrial release. 

⬇️ Judges try to minimize the rate of violations among released defendants.

> Of course, one way to minimize violations is to detain every defendant. The violation rate is an imperfect metric without more context about the jurisdiction.

## 🚀 Exercise: Constructing risk distributions

For this exercise, consider the following scenario:

- There are two groups, Group 1 and Group 2.
- If everyone in **Group 1** were released, we would observe a **50\% violation rate** for Group 1 defendants.
- For **Group 2**, the corresponding violation rate is **40\%**.

A few additional details about violation probabilities:

- We assume that everyone is either of **high risk** or **low risk**.
- **High risk** defendants, regardless of group membership, violate with a **60\% probability**.
- **Low risk** defendants, regardless of group membership, violate with a **30\% probability**.

Using `R`, determine the proportion of defendants in each group who are low risk.

> Helpful hint: You may find it easier to conceptualize this problem if you assume there are 1,000 individuals in each group, but the answer is the same regardless of the group size.

In [0]:
# Your code here!



## 🎶 Interlude: Functions in `R`

To define a function in `R`, use the following notation:

In [0]:
# Function to add two numbers
add = function(x, y) {
  total = x + y
  return(total)
}

add(100, 150)

🐍 For those more familiar with `Python`, the code above is equivalent to the following:

```
# Function to add two numbers
def add(x, y):
    total = x + y
    return total
```

## 🚀 Exercise: Write a function

Write a function `calc_low_risk_prop` to compute the proportion of defendants in an arbitrary group that are low risk.

Then, use your function to replicate the results of the previous exercise.

Your function should take the following inputs:
- The overall violation rate of the group, `vg`
- The probability of violation for low risk defendants if released, `vl`
- The probability of violation for high risk defendants if released, `vh`

In [0]:
# Your code here!



## 🤥 False negative rates

Let's consider pretrial detention as a binary decision, where 1 means detain and 0 means release.

By construction: 
- A **false negative (FN)** is the event that a released defendant violates.
- The **false negative rate (FNR)** is the release rate of those who would violate if released.

Intuition for the FNR:

1. The FNR is $\frac{\text{FN}}{\text{FN}+\text{TP}}$ (see slides).
2. $\text{FN}$ is the number of released defendants who violate.
3. $\text{TP}$ is the number of detained defendants who would have violated if released.
4. Therefore, $\text{FN}+\text{TP}$ is the total number of people who would violate if released.
5. So, the FNR is the release rate of those who would violate if released.

> The FNR can also be calculated with proportions instead of counts, as long as both use the same denominator: take $\text{FN}$ as the *proportion* of all defendants in the group who are released and violate, and $\text{TP}$ as the *proportion* of all defendants in the group who are detained and would have violated if released. This will come in handy later!
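
To make the FNR formula concrete, here is a minimal sketch with made-up numbers. The counts `fn` and `tp` and the population size `n` are hypothetical values chosen only for illustration; they are not taken from the exercises.

```r
# Hypothetical counts, chosen only for illustration
fn = 30  # released defendants who violate
tp = 70  # detained defendants who would have violated if released

# FNR from counts: FN / (FN + TP)
fnr_from_counts = fn / (fn + tp)

# Same calculation using proportions of one hypothetical population
n = 1000
fn_prop = fn / n
tp_prop = tp / n
fnr_from_props = fn_prop / (fn_prop + tp_prop)

fnr_from_counts
fnr_from_props
```

Both versions give the same value because the population size cancels out of the ratio.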

## 🧑‍⚖️ Judicial decisions

Let's expand our pretrial detention scenario. Consider the following:
- Suppose there is a judge who can perfectly perceive whether a defendant is low risk or high risk.
- The judge uses a simple decision rule: release low risk defendants, and detain high risk defendants.

In the scenario above, **all defendants are treated identically** (i.e., irrespective of group membership).

## 🚀 Exercise: Comparing false negative rates

For the judge described above, write a function to calculate the false negative rate for an arbitrary group with an overall violation rate of `vg`, where low risk defendants violate with probability `vl` and high risk defendants violate with probability `vh`.

> Your `calc_low_risk_prop` function may come in handy.

Then, calculate the expected false negative rate for each group.

> Important aside: It's impossible to calculate the true FNR from real data, since we do not observe violations for defendants who are detained. We can only estimate the true FNR under strong assumptions. 
>
> For this problem, we're taking the perspective of a statistical **oracle** who knows everything about the populations of interest.

What do you take away from your results? 

In [0]:
# Your code here!



## 🎶 Interlude: Vectors and `for` loops in `R`

A **vector** in `R` is an ordered sequence of values of a single type, such as numbers, strings, or booleans.

> Not to be confused with a **list**, which is not covered in this tutorial. Lists can contain elements of any type.

Here's a shortcut for making a vector of arbitrary length with a constant value:

In [0]:
# NA is similar to None in Python
rep(NA, 10)

We can also use `c()` to create vectors.

> 🔎 The "c" in `c()` stands for **concatenate**.


In [0]:
c(10, 100, 1000)

We can extract **elements** from vectors using their **index**, or their place in line.

> 🔎 Unlike most other programming languages, `R` is 1-indexed, not 0-indexed. So, the first element in a vector `v` is `v[1]`, not `v[0]`.

In [0]:
my_vector = c(10, 100, 1000)
my_vector[1]

#### 🔎 Printing vectors 

If you explicitly `print` a vector, the output looks a little different:

In [0]:
print(my_vector)

Why is there a `[1]` on the left of the printed results? 

Printing a longer vector gives us a hint:

In [0]:
print(25:75)

The bracketed numbers on the left side of the printed results indicate the index of the element immediately to the right.

For example, `[26]` indicates that `50` is the 26th element in the vector.

### `for` loops

For most data science projects, `for` loops should be avoided in favor of more efficient tools.

> For example, the `map` function and its variants from the `purrr` package. Run `?map` for details.

However, for exposition, here's the `R` syntax for a `for` loop:

In [0]:
# Print the odd numbers between 1 and 20
for (i in seq(1, 20, 2)) {
  print(i)
}

🐍 The equivalent in `Python`:

```
# Print the odd numbers between 1 and 20
for i in range(1, 20, 2):
    print(i)
```
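
For comparison with the `for` loop above, here is a small sketch of loop-free iteration using `map_dbl` from `purrr`. The task (squaring each odd number) and the variable names `odds` and `squares` are illustrative choices, not part of the exercises.

```r
library(purrr)

# The odd numbers between 1 and 20
odds = seq(1, 20, 2)

# Apply a function to each element and collect the results in a numeric vector
squares = map_dbl(odds, function(x) x^2)

squares
```

`map_dbl` returns a plain numeric vector, so there is no need to create a results vector ahead of time.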

## 🚀 Exercise: Simulating different scenarios

For this exercise, we'll investigate how our results change as a function of the violation probabilities of low and high risk defendants.

As above, assume that the overall violation rate of Group 1 is 50\%, and 40\% for Group 2.

Use two nested `for` loops to iterate over the following two vectors:

1. `vl_vals = seq(0, 0.3, by=0.001)`

> These are possible values of the probability that a low risk defendant violates.

2. `vh_vals = seq(0.6, 1, by=0.001)`

> These are possible values of the probability that a high risk defendant violates.

At each iteration, calculate the difference in false negative rates between Groups 1 and 2. Call this value `diff`.

Make sure to create three vectors to store the values of `vl`, `vh`, and `diff` at each iteration. 

> You should initialize each of these vectors with `rep(NA, N)`, where `N` is the total number of iterations. This is faster than concatenating to the end of the results vector at each iteration.
>
> You can use the `summary` function on the results vectors to check if the values make sense. 

In the next exercise, we'll plot the results from this exercise.
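
The pre-allocation tip above can be sketched on a toy problem. The vectors `a_vals` and `b_vals`, the counter `k`, and the product calculation are all hypothetical stand-ins; they only show the pattern of filling pre-allocated vectors inside nested loops.

```r
# Toy inputs, unrelated to the exercise values
a_vals = c(1, 2, 3)
b_vals = c(10, 20)

# Total number of iterations across both loops
N = length(a_vals) * length(b_vals)

# Pre-allocate one results vector per quantity we want to store
a_out = rep(NA, N)
b_out = rep(NA, N)
prod_out = rep(NA, N)

# k tracks the position to fill at each iteration
k = 1
for (a in a_vals) {
  for (b in b_vals) {
    a_out[k] = a
    b_out[k] = b
    prod_out[k] = a * b
    k = k + 1
  }
}

# Quick sanity check on the stored values
summary(prod_out)
```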

In [0]:
# Your code here!



## 🚀 Final exercise: Plotting the results

Create a tibble (i.e., dataframe with extra features) with your three results vectors from the previous exercise. 

Using your tibble, make a plot using `geom_raster`.

> `?geom_raster` for more details!

Using `geom_raster`, put the `vl` results on the x-axis, the `vh` results on the y-axis, and map `fill` (not `color`!) to `diff`. 

What patterns do you notice? How do these relate to the FNR formula?

🖼️ Here's how to make a tibble:

In [0]:
# colon `:` is a shortcut for vectors of consecutive integers
x = 1:5
y = 6:10

# df is often used to denote a dataframe
df = tibble(
  x_vals = x,
  y_vals = y
)

df
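
As a companion to the tibble example, here is a minimal sketch of `geom_raster` on a toy grid. The column names and the `z_vals` calculation are made up for illustration, and the sketch uses base `R`'s `expand.grid` to build every combination of the two inputs. A tibble would work the same way.

```r
library(ggplot2)

# Toy grid with every combination of x and y, unrelated to the exercise values
grid = expand.grid(x_vals = 1:5, y_vals = 6:10)
grid$z_vals = grid$x_vals * grid$y_vals

# One tile per row of the grid, shaded by z_vals
p = ggplot(grid, aes(x = x_vals, y = y_vals, fill = z_vals)) +
  geom_raster()

p
```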

In [0]:
# Your code here!

