## Supervised Learning

In this lecture, we'll make our first acquaintance with the *classification* task. Classification is a form of *supervised* machine learning. Here's the big-picture version of the supervised ML task. 

We are given: 

- A set of observations of **predictor variables**. We'll call the $i$th such observation $\mathbf{x}_i$. We write it this way because $\mathbf{x}_i$ is usually a *vector* of multiple variables, often called *features* or *covariates*. We often collect these observations into a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the total number of features. 
- A set of observations of a single **target variable**. We'll call the $i$th such observation $y_i$. We write it this way because (at least in this course) $y_i$ will always be a scalar number, rather than a vector. We can collect these observations into a (column) vector $\mathbf{y} \in \mathbb{R}^n$. 
- We can refer to a single observation as a pair $(\mathbf{x}_i, y_i)$. 

Big picture, the supervised machine learning task is to use $\mathbf{X}$ and $\mathbf{y}$ to find a function $f:\mathbb{R}^p \rightarrow \mathbb{R}$ with the property that 

$$
f(\mathbf{x}) \approx y
$$ 

[What does it mean for $f(\mathbf{x}) \approx y$? This requires mathematical fleshing-out that we'll do very soon.]{.aside}

for *new* observations $(\mathbf{x}, y)$. We can think of the function $f$ as an expression of the (unknown) *relationship* between the features $\mathbf{x}$ and the target $y$. If we can find an approximation of that relationship, then we'll have the ability to make predictions. 

We often use $\hat{y} = f(\mathbf{x})$ to denote the predicted value for $y$ based on $\mathbf{x}$. So, we want to choose $f$ so that $\hat{y} \approx y$. 
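
As a rough illustration of this notation, the sketch below builds a small synthetic $\mathbf{X}$ and $\mathbf{y}$ with `numpy` and applies a hand-written function `f`. The data, the weights `w`, and the sizes `n` and `p` are all made up for illustration; nothing here is learned from data, and the only point is to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 5, 3                       # n observations, p features (made-up sizes)
X = rng.normal(size=(n, p))       # feature matrix: row i is the observation x_i
w = np.array([1.0, -2.0, 0.5])    # hypothetical weights, NOT learned from data
y = X @ w + 0.1 * rng.normal(size=n)   # synthetic targets: one scalar per observation

def f(x):
    """A hand-picked function from R^p to R (an assumption for illustration)."""
    return x @ w

# predictions y_hat_i = f(x_i), one per row of X
y_hat = np.array([f(x) for x in X])

print(X.shape, y.shape, y_hat.shape)   # (5, 3) (5,) (5,)
```

Because `y` was generated as `f(x)` plus a little noise, `y_hat` is close to `y` here by construction. In a real problem we don't know a good `f` in advance, and finding one is the learning task.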


## Classification

We use the vector $\mathbf{y}$ to hold our observations of the target variable. We have assumed that each observation of the target variable is a real number (i.e. an element of $\mathbb{R}$). This looks reasonable when the thing we want to predict is a real number (like a stock price or the probability that a user likes a post), but what about when we want to predict a categorical label? In this case, we can simply encode labels using integers: $0$ for one category, $1$ for the next category, $2$ for the one after that, and so on. 
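
One possible way to carry out this integer encoding in Python is `pandas.factorize`, which assigns codes in order of first appearance. The labels below are hypothetical, and this is only a sketch of the idea rather than the approach used later in these notes.

```python
import pandas as pd

# hypothetical categorical labels (made up for illustration)
labels = pd.Series(["cat", "dog", "bird", "dog", "cat"])

# factorize assigns 0, 1, 2, ... to categories in order of first appearance
codes, categories = pd.factorize(labels)

print(codes)              # [0 1 2 1 0]
print(list(categories))   # ['cat', 'dog', 'bird']
```

The array `codes` can play the role of $\mathbf{y}$, while `categories` lets us translate an integer prediction back into a label.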

## The COMPAS Recidivism Prediction Algorithm

Criminal *recidivism* occurs when a person is convicted of a crime, completes the legal terms of their punishment, and is then convicted of *another* crime after release. In the American penal system, predictions of recidivism play a role in determining whether or not a defendant will be released on bail before trial or granted parole after serving a portion of a prison sentence. In other words, the belief of the court about whether a person is *likely to commit a future crime* can have concrete consequences for that person's current and future freedom. Of course, it's difficult for a human to predict whether a defendant is likely to commit a future crime. Furthermore, humans are subject to bias. Wouldn't it be nice if we could use a machine learning algorithm to make this prediction for us? 

In 2016, the journalism website [ProPublica](https://www.propublica.org/) published an [investigative story](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) on COMPAS, a machine learning algorithm used to predict recidivism in Broward County, Florida. They obtained data for criminal defendants in Broward County in the years 2013 and 2014. These data include the COMPAS predictions, as well as demographic information (like age, gender, and race) and legal information (e.g. the crime with which the defendant was charged). The data also include an indicator of whether or not the defendant went on to be arrested for a crime within the two years following their initial trial. 

[The COMPAS algorithm actually uses information about the defendant beyond what is shown in the table below; here is [an example](https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html) of the survey that COMPAS uses to form its prediction.]{.aside}


::: {.callout-tip}

## Activity

Here are three concepts: 

- Demographic data and legal information related to a defendant. 
- Whether or not the defendant proceeds to be arrested for a crime within the two years following their initial trial. 
- The COMPAS algorithm. 

Match these three concepts to the three mathematical symbols in the relationship 

$$f(\mathbf{x}) \approx y.$$

:::

Let's look at an excerpt of the data that ProPublica obtained. I have chosen only a subset of the columns and I have filtered out some of the rows as well. The hidden code saves the data in a `pandas.DataFrame` called `df`, and then views it. 

[Click the little arrow to the right to view the code I used to display this table.]{.aside}

In [29]:
import pandas as pd
import seaborn as sns

df = pd.read_csv("https://github.com/middlebury-csci-0451/CSCI-0451/raw/main/data/compas-scores-two-years.csv")

# filtering as in the original analysis by ProPublica
# https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb

df = df[df.days_b_screening_arrest <= 30]
df = df[df.days_b_screening_arrest >= -30]
df = df[df.is_recid != -1]
df = df[df.c_charge_degree != "O"]
df = df[df.score_text != "NA"]
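# copy so that adding a column below acts on an independent DataFrame
df = df[(df.race == "African-American") | (df.race == "Caucasian")].copy()

col_list = df.columns

# binarize the COMPAS score: 1 for any score other than "Low", 0 for "Low"
df["compas_prediction"] = 1*(df.score_text != "Low")
df = df.reset_index(drop=True)
cols = ["race", "age", "compas_prediction", "two_year_recid"]
df = df[cols]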

df

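|      | race             | age | compas_prediction | two_year_recid |
|-----:|------------------|----:|------------------:|---------------:|
| 0    | African-American | 34  | 0                 | 1              |
| 1    | African-American | 24  | 0                 | 1              |
| 2    | Caucasian        | 41  | 1                 | 1              |
| 3    | Caucasian        | 39  | 0                 | 0              |
| 4    | Caucasian        | 27  | 0                 | 0              |
| ...  | ...              | ... | ...               | ...            |
| 5273 | African-American | 30  | 0                 | 1              |
| 5274 | African-American | 20  | 1                 | 0              |
| 5275 | African-American | 23  | 1                 | 0              |
| 5276 | African-American | 23  | 0                 | 0              |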


The column `compas_prediction` is the COMPAS algorithm's prediction of whether the individual will be arrested again. 

- `0` means "no": according to COMPAS, the individual does not have an elevated risk of being arrested for a crime within the next two years. 
- `1` means "yes": according to COMPAS, the individual does have an elevated risk of being arrested for a crime within the next two years. 

The column `two_year_recid` records the actual outcome: `0` means "no," the individual was not arrested within the next two years, while `1` means "yes," the individual was arrested within the next two years. 

There are a number of other columns that I have omitted, including the defendant name, the severity of the criminal charge, whether or not the charge is for a violent crime, presence of a prior record, sex, and other information. 

::: {.callout-tip}

## Discussion

Take some time to look at the excerpted data set and my description of it. What *questions* do you have when you look at the data? Try to find at least two questions about: 

- How the data was collected/gathered/presented by me. 
- What patterns might be present in the data? What concerns might you have that you would want to check? 

:::


## Evaluating Classification Algorithms

Was COMPAS *successful* at making its predictions? There are lots of ways to assess this. 

### Overall Accuracy

One way is the *overall accuracy* of the predictions: how often was it the case that the predictions were correct? The code below computes the proportion of the time that the COMPAS prediction matched reality: 

In [31]:
#| code-fold: false

df["accurate"] = df["compas_prediction"] == df["two_year_recid"]
df["accurate"].mean()

0.6582038651004168

Is this a good result? We can compare it to the performance of a hypothetical algorithm that simply always predicted that the individual would *not* reoffend. 

In [33]:
(1-df["two_year_recid"]).mean()

0.5295566502463054

This is an example of comparing against a *base rate*. There's no formal definition of a base rate, but you can think of it as the performance of the best approach to the problem that doesn't involve anything fancy. Here, the base rate is 53% and the accuracy of COMPAS is 66%, so the COMPAS algorithm outperforms the base rate by roughly 13 percentage points. 
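
Since there is no formal definition, the following is only one reasonable reading of "base rate": the accuracy of the better of the two constant predictions (always `0` or always `1`). The sketch uses a small hypothetical label vector, not the COMPAS data.

```python
import numpy as np

# hypothetical binary outcomes (NOT the COMPAS data)
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# accuracy of always predicting 0, and of always predicting 1
acc_always_0 = np.mean(y == 0)
acc_always_1 = np.mean(y == 1)

# under this reading, the base rate is the better of the two constant predictions
base_rate = max(acc_always_0, acc_always_1)

print(acc_always_0, acc_always_1, base_rate)   # 0.625 0.375 0.625
```

Any classifier we build should beat this number before we consider it useful.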

### Classification Rates

While accuracy is a useful metric for classification problems, it helps to break it down in more detail. In the case of binary classification, each prediction falls into one of four cases: 

[Recall that $\hat{y}$ is just another name for $f(\mathbf{x})$, the predicted value of $y$ based on $\mathbf{x}$.]{.aside}

- If $y = 1$ and $\hat{y} = 1$, we have a *true positive*. 
- If $y = 0$ and $\hat{y} = 1$, we have a *false positive*. 
- If $y = 1$ and $\hat{y} = 0$, we have a *false negative*. 
- If $y = 0$ and $\hat{y} = 0$, we have a *true negative*. 
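
To make the four cases concrete, here is a minimal sketch that counts each of them with `numpy`. The arrays `y` and `y_hat` are hypothetical and chosen only so that every case occurs at least once.

```python
import numpy as np

# hypothetical true labels and predictions (made up for illustration)
y     = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 1, 1, 1])

TP = np.sum((y == 1) & (y_hat == 1))   # true positives
FP = np.sum((y == 0) & (y_hat == 1))   # false positives
FN = np.sum((y == 1) & (y_hat == 0))   # false negatives
TN = np.sum((y == 0) & (y_hat == 0))   # true negatives

print(TP, FP, FN, TN)   # 3 2 1 2
```

Every observation falls into exactly one of the four cases, so the four counts always add up to $n$ (here, 8).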

The *false positive rate* is the fraction of all negative events for which the prediction is positive: 

[Here $\mathbb{1}$ is the *indicator function* that is 1 if its arguments all evaluate to true and 0 otherwise.]{.aside}

$$\mathrm{FPR}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\sum_{i=1}^n \mathbb{1}(\hat{y}_i = 1, y_i = 0)}{\sum_{i = 1}^n \mathbb{1}(y_i = 0)}$$

We can calculate the FPR like this: [`numpy` boolean arrays and `pandas` boolean columns can be multiplied to do entrywise Boolean `and`.]{.aside}

In [41]:
#| code-fold: show
def FPR(y, y_hat):
    # among actual negatives (y == 0), the fraction predicted positive (y_hat == 1)
    return ((y_hat == 1)*(y == 0)).sum()/(y == 0).sum()

One can also define the False Negative Rate, True Positive Rate, and True Negative Rate. Let's also do the False Negative Rate: 

In [43]:
#| code-fold: show
def FNR(y, y_hat):
    # among actual positives (y == 1), the fraction predicted negative (y_hat == 0)
    return ((y_hat == 0)*(y == 1)).sum()/(y == 1).sum()

In [44]:
#| code-fold: show
y = df["two_year_recid"]
y_hat = df["compas_prediction"]
FPR(y, y_hat), FNR(y, y_hat)

(0.3302325581395349, 0.35481272654047524)

In other words: 

- Of people who were not arrested within two years, the COMPAS algorithm wrongly predicted that 33% of them would be arrested within two years (but correctly predicted that 67% of them would not be). 
- Of people who were arrested within two years, the COMPAS algorithm wrongly predicted that 35% of them would not be arrested within two years (but correctly predicted that 65% of them would be). 
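
The percentages in parentheses are the True Negative Rate and the True Positive Rate. They are the complements of the FPR and FNR, because each rate is computed only among observations that share the same true label. The sketch below checks this on hypothetical arrays; the helper `rate` is an illustrative name that is not used elsewhere in these notes.

```python
import numpy as np

def rate(y, y_hat, predicted, actual):
    """Among observations whose true label is `actual`,
    return the fraction whose prediction is `predicted`."""
    return np.mean(y_hat[y == actual] == predicted)

# hypothetical true labels and predictions (NOT the COMPAS data)
y     = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 1, 1, 1])

fpr = rate(y, y_hat, predicted=1, actual=0)   # false positive rate
tnr = rate(y, y_hat, predicted=0, actual=0)   # true negative rate
fnr = rate(y, y_hat, predicted=0, actual=1)   # false negative rate
tpr = rate(y, y_hat, predicted=1, actual=1)   # true positive rate

print(fpr, tnr)   # 0.5 0.5
print(fnr, tpr)   # 0.25 0.75
```

In both rows the two numbers sum to 1, which is why reporting the FPR and FNR is enough to recover all four rates.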

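We can also compute both rates separately for each racial group in the data: 

In [57]:
#| code-fold: show

# apply each rate function to the rows of each racial group separately
for label, fun in {"False positive rates": FPR, "False negative rates" : FNR}.items():
    print(label)
    print(df.groupby("race").apply(lambda g: fun(g["two_year_recid"], g["compas_prediction"])))
    print("")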

False positive rates
race
African-American    0.423382
Caucasian           0.220141
dtype: float64

False negative rates
race
African-American    0.284768
Caucasian           0.496350
dtype: float64

