# STA130 Tutorial 9: Classification (and Ethics)


This week we'll do something a little different: we'll focus on ethical topics in anticipation of our upcoming embedded ethics guest lecture.
Remember! It's always a good time for questions and discussions. If you don't understand something, ask.

# This Week's Vocab (10 minutes):
If you are not familiar with any of these words, now is the time to ask!

- Classification / Classifier
- Prediction / Predictor(s)
- Covariate(s)
- Input(s) / Output(s)
- Training set/sample
- Validation
- Testing set/sample (or test set)
- Fitting a model
- Confusion matrix
- Category
- Tree / Node
- Terminal node (or leaf node)
- True positive / True positive rate (sensitivity)
- True negative / True negative rate (specificity)
- False positive / False negative
- Accuracy

# Ethics Primer (15 minutes)

- When we do statistics in the real world, our actions have consequences.
- Classification models (and indeed other statistical methods we have learned, such as linear regression and hypothesis tests) are often used to guide decisions.
- Our actions can affect our employers or the public.
- So when we make decisions, we must consider all the stakeholders who may be affected.

## Example:
Let's say we're building a classification model to decide whether to accept mortgage applications for a mortgage company.
1. The first stakeholder is the mortgage company.
    - We don't want to instruct them to accept bad mortgages which lose them money / cause instability.
2. The second stakeholder group is the applicants.
    - We have a responsibility to ensure that people aren't being rejected for unfair reasons (e.g. race, gender, sexuality).

# Case Study and Discussion (30 minutes):

Take ~10 minutes to skim this article:
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

This article is an interesting read, and it is worthwhile to finish it on your own time after tutorial. But here is the SparkNotes version:

- A company, Northpointe, made and sold a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) to help judges, probation officers, and parole officers assess the likelihood that a defendant will reoffend.
    - These risk assessments were used to determine criminal sentences, decide whether prisoners should be eligible for early release, etc.
- COMPAS was fundamentally a classification model: based on some input data, it classified defendants into risk classes according to their likelihood of reoffending (committing more crimes after being released).

- Among defendants who did not reoffend, COMPAS predicted 45% of black defendants and 23% of white defendants were at higher risk to reoffend.
- Among defendants who did reoffend, COMPAS predicted 28% of black defendants and 48% of white defendants were at lower risk to reoffend.
- Controlling for prior crimes, future recidivism, age, and gender, black defendants were 77% more likely than white defendants to be assigned higher risk scores for violent recidivism by COMPAS.


This case touches on the idea of *algorithmic bias*: systematic and repeatable errors that create unfair outcomes.

Discuss the following questions for a few minutes in small groups, then share answers with the class:
- List as many relevant stakeholders in this situation as you can.
- COMPAS is similarly *accurate* on both racial groups. Why does similar accuracy not guarantee fairness?
- Why might COMPAS have generated these outcomes? (e.g. What biases might be in the data it built its model from? Might they have fit a poor model?)
    - What could the model-designer have potentially done to mitigate these problems?
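
To make the accuracy question concrete, here is a minimal sketch in Python. The counts are hypothetical: they assume 100 reoffenders and 100 non-reoffenders in each group and are filled in only so that the error rates match the percentages quoted above. They are not the real COMPAS data. Here "positive" means "predicted higher risk".

```python
def rates(tp, fn, fp, tn):
    """Return accuracy, false positive rate, and false negative rate."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # share of non-reoffenders labelled higher risk
        "false_negative_rate": fn / (fn + tp),  # share of reoffenders labelled lower risk
    }

# HYPOTHETICAL counts (assumption): 100 reoffenders and 100 non-reoffenders
# per group, chosen to reproduce the percentages quoted in the summary above.
groups = {
    "black defendants": rates(tp=72, fn=28, fp=45, tn=55),
    "white defendants": rates(tp=52, fn=48, fp=23, tn=77),
}

for name, r in groups.items():
    print(name, {metric: round(value, 3) for metric, value in r.items()})
```

Under these assumed counts the two accuracies come out close to each other (0.635 and 0.645), while the false positive and false negative rates differ sharply between the groups. Accuracy alone hides *which kind* of error each group experiences.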

# Confusion Matrices (15 minutes)
Let's recall how to read a simple confusion matrix. Suppose that we have a classification model designed to judge whether patients have a certain disease, called disease D.
On a testing dataset, it has the following confusion matrix (columns are predicted labels, rows are real labels; P = positive, N = negative):

| Actual (rows) / Predicted (columns) | P  | N   |
|-------------------------------------|----|-----|
| P                                   | 17 | 2   |
| N                                   | 17 | 168 |

- What do the numbers in the different cells represent?
- What are the values of the metrics (accuracy, sensitivity, specificity)?
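
Once you have tried the calculation by hand, a short sketch like the following can be used to check your answers. It is written in Python purely for illustration (the variable names are our own), and it reads the four counts straight from the confusion matrix above.

```python
# Counts from the confusion matrix above (rows = real labels, columns = predicted labels).
tp = 17   # real P, predicted P (true positives)
fn = 2    # real P, predicted N (false negatives)
fp = 17   # real N, predicted P (false positives)
tn = 168  # real N, predicted N (true negatives)

total = tp + fn + fp + tn

accuracy = (tp + tn) / total      # share of all patients classified correctly
sensitivity = tp / (tp + fn)      # share of patients with D that the model catches
specificity = tn / (tn + fp)      # share of patients without D that the model clears

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

Note that each metric uses a different denominator: accuracy divides by all patients, sensitivity only by patients who really have D, and specificity only by patients who really do not.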

But which metrics do we care about?
- Inaccuracy is obviously bad.
- A false negative could potentially be very bad for a patient.
- Excessive false positives could drain resources (such as medicines, hospital beds, doctor time etc.) from the system.
- A false positive might cause a patient to undergo unnecessary (potentially dangerous) medical interventions.

# Discussion (30 minutes)
These types of considerations are situation dependent. So let's practice thinking about them: for each scenario below, identify the stakeholders and consequences, then reason about which metrics matter the most and what trade-offs we can accept.

- The above case about medical testing where the disease D being tested for is the common cold.
- The above case about medical testing where the disease D being tested for is a sexually transmitted infection like HPV or HIV.
- The above case about medical testing where the disease D being tested for is a serious life-threatening condition like cancer.
- The Northpointe COMPAS case.
    - What extra considerations should we be making in this case?

# Tutorial Assignment (get started... )
Suppose we are designing a classification model for reviewing applications to a prestigious medical school. Discuss how much we should value the different metrics for judging the performance of our model. (Hint: in doing so, you should identify the stakeholders and what effects false positives and false negatives would have on them.)

### Notes on approaching the writing prompt

- Hand in the assignment on Quercus
- Use full sentences
- Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner (without slang or emojis) 
- Aim for 200 - 500 words
- Do not spend more than 90 minutes on the prompt (unless you really need to...)