# STA130 Tutorial 9: Classification (and Ethics)


This week we'll be doing something a little different, and we'll be focusing on ethical topics in anticipation of our upcoming embedded ethics guest lecture.
Remember! It's always a good time for questions and discussions. If you don't understand something, ask.

## This Week's Vocab (10 minutes):
If you are not familiar with any of these words, now is the time to ask!

- Classification / Classifier
- Prediction / Predictor(s)
- Covariate(s)
- Input(s) / Output(s)
- Training set/sample
- Validation
- Testing set/sample (or test set)
- Fitting a model
- Confusion matrix
- Tree / Node
- Terminal node (or leaf node)
- Accuracy
- True positive rate (sensitivity)
- True negative rate (specificity)
- False positive / False negative 

# Ethics Primer (20 minutes)
Doing statistics in the real world is different from doing textbook exercises: it can have consequences on the real world and the people in it. 

- Often classification models (and indeed other statistical methods we have learned, such as linear regression and hypothesis tests) are used to guide decisions.
- Our actions can affect our employers or the public.
- So when we make decisions, we must consider all the stakeholders who may be affected.

### Discussion
Brainstorm some examples, from the course or otherwise, of how statistical methods might be used to make decisions (or otherwise have an impact on people).

Since statistical analyses can have significant impacts on people, and we often have choices about how to perform those analyses, it is important to talk about the ethics of those choices. 
- Warning: there won’t be much statistics in the next 15 minutes, but we will be coming back to it!

One simple framework for thinking about the ethics of our actions (including statistical analyses): 
1. Consider the impacts of your action: does it help or harm people overall (e.g. their life, health, money, other things)? [If the action would harm people overall, you usually shouldn’t do it]
2. Does your action violate anyone’s rights? [If it violates someone’s rights, you usually shouldn’t do it]

### What's the difference between impacts and rights?

Consider the following case (which we’ll label *Organ Transplant*):
    Imagine you are a doctor at a small-town clinic. You have five patients who need organ transplants. A healthy patient walks into the clinic for a routine checkup. You determine that no one would notice if the patient went missing: they don’t have any family or friends.

Consider the two options.
1. Kill the healthy patient, use his organs to save the lives of the five patients.
2. Allow the five patients to die. 

### Discussion question (to the whole group): 
Should the doctor kill the healthy patient, take his organs, and use them to save the five other patients? You may want to do this by show of hands. 

Most of you will probably agree that we shouldn't kill the healthy patient.

### Discussion question (to the whole group): 
What are the benefits and harms associated with options 1 and 2?


The simplest and most important benefits and harms you could have come up with are in terms of lives saved:
- In terms of benefits: option 1: five lives saved, option 2: one life saved
- In terms of harms: option 1: one life lost, option 2: five lives lost

You may have also come up with additional benefits or harms, e.g. other people will be sad if lives are lost. That’s fine, but it shouldn’t change the overall calculus: option 1 seems to have a better impact than option 2.

Why do most students think that we shouldn’t kill the healthy patient, even though it produces the best impacts? 

We can describe their reasoning in terms of rights. 

The problem in Organ Transplant is that it seems like the healthy patient has a right not to be killed, even if killing him would result in net benefit to people overall (since four extra lives would be saved). 

### Discussion question (to the whole group): What are some other rights that you think people have?
Some examples of answers that students might give:
- Right to privacy
- Right to vote
- Right to freedom of speech
- Right not to be harmed (in certain ways) 

### Follow-up discussion question: which of these rights are most likely to be relevant in statistics? 
Note (which may have come up earlier): Students are going to disagree about how to compare different harms and benefits, and what rights we have! That’s fine. The goal isn’t to find the single correct answer, just to get clearer on how to think about these problems and justify your answers to them. The guest lecturer will say something about this next week. 

One particularly important answer is the right to privacy, because it affects data collection.

Next, we’re going to use this framework (impacts + rights) to help us think about two case studies, mixing together statistics review with ethical analysis!

# Case Study 1: Medical Testing and False Positives and Negatives (30 minutes)
Let's begin by practicing with confusion matrices.

Suppose that we have a classification model designed to judge whether patients have a common cold. On a testing dataset, it has the following confusion matrix.


|            | Predict disease | Predict no disease |
|------------|-----------------|--------------------|
| Disease    | 17              | 2                  |
| No Disease | 61              | 168                |

- What do the numbers in the different cells represent? Interpret the confusion matrix.
- What are the values of the metrics (accuracy, sensitivity, specificity)? Compute them all from this confusion matrix.
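As a check on the hand computation, the three metrics can be worked out directly from the four cells of the table. This is a minimal sketch in plain Python; the variable names (`tp`, `fn`, `fp`, `tn`) are chosen here for illustration, and it treats "disease" as the positive class.

```python
# Cell counts from the confusion matrix above (rows: actual, columns: predicted).
tp = 17   # actual disease, predicted disease (true positives)
fn = 2    # actual disease, predicted no disease (false negatives)
fp = 61   # actual no disease, predicted disease (false positives)
tn = 168  # actual no disease, predicted no disease (true negatives)

total = tp + fn + fp + tn

accuracy = (tp + tn) / total   # share of all predictions that are correct
sensitivity = tp / (tp + fn)   # share of actual disease cases that are detected
specificity = tn / (tn + fp)   # share of actual no-disease cases that are cleared

print(f"accuracy    = {accuracy:.3f}")     # 0.746
print(f"sensitivity = {sensitivity:.3f}")  # 0.895
print(f"specificity = {specificity:.3f}")  # 0.734
```

Note that the sensitivity is high while the specificity is noticeably lower: this model rarely misses a sick patient, but it flags many healthy ones.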

### Small Group Discussion
We generally try to design our classification models to minimize the number of errors.


In the previous example, consider the following stakeholders: (a) tested patients who have the illness, (b) tested patients who do not have the illness. Try to come to a consensus about your answers, but if you cannot, write down your disagreement: 
1. What are the impacts (benefits and harms) of a false negative in this case? Consider both groups of stakeholders.
2. What are the impacts (benefits and harms) of a false positive in this case? Consider both groups of stakeholders. [Fill this out as a table]
3. What do your answers for (1) and (2) imply for which metric the designer of the classification model should prioritize (accuracy, sensitivity, specificity)? In your answer, you should take into account: (a) the relative number of people in each stakeholder group, (b) the extent of the benefits or harms that would happen to them, and (c) whether any of their rights are infringed by the error.
4. Would your answer to (3) change if the disease being tested for is a serious life-threatening condition? 

# Case study 2: Mortgage applications (30 minutes)


Many mortgage companies use classification models to predict whether someone will be likely to pay back their mortgage loan. (A mortgage is a loan offered by a bank to help someone buy something, often a residential property like a house or condo.)

Imagine we are a mortgage company, and we have a dataset stored in a dataframe ```mortgage_applications``` with the following fields:
- ```"Accept or Deny", "Annual Income" (in thousands), "Requested Loan Amount" (in thousands), "Applicant Age","Credit Score" ,"Application ID", "Dummy Variable"```

We want to make a classification model to decide, based on the available information, whether we should accept or deny a new application, which contains all relevant information from the dataset except for ```"Accept or Deny"```.

Figure out as many problems with the following code snippet as you can. (They may or may not be coding errors.)

```
clf = tree.DecisionTreeClassifier()
mortgage_applications_nonans = mortgage_applications[["Annual Income", "Requested Loan Amount", "Applicant Age","Credit Score", "Application ID"]] # We don't care about "Dummy Variable"
X = mortgage_applications_nonans.iloc[:,1:].dropna()
Y = mortgage_applications["Accept or Deny"].dropna()

clf = clf.fit(Y ~ X) # We fit our model with the available data

clf.predict(60,500, 30, 680) # We want to know whether to accept the application of someone who makes 60k, wants 500k, is 30, and has a credit score of 680
```

Here are some problems you may or may not have noticed:
- Application ID should not be included in the training data. It only gives the decision tree an opportunity to find spurious correlations.
- ```.iloc[:,1:]``` selects every column except the first, so it silently drops ```"Annual Income"``` from ```X```.
- We should call ```.dropna()``` once, on a dataframe that contains both the predictors and ```"Accept or Deny"```, instead of separately on ```X``` and ```Y```. The way it is done above, different rows can be dropped from ```X``` and ```Y```, so the predictors and the labels no longer line up.
- ```Y ~ X``` is the format for giving a formula in linear regression. We need different syntax when fitting a decision tree: it should be ```clf.fit(X,Y)```.
- When using ```clf.predict()```, we need to pass a two-dimensional array or a dataframe containing the predictor values (one row per application) instead of passing them directly as function arguments.
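One way the snippet could be repaired is sketched below. It is only an illustration under some assumptions: the small ```mortgage_applications``` dataframe is made-up toy data standing in for the real one, the ```features``` and ```target``` names are introduced here for convenience, and ```random_state=0``` is added only so the fitted tree is reproducible.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data standing in for the real mortgage_applications dataframe.
mortgage_applications = pd.DataFrame({
    "Accept or Deny": ["Accept", "Deny", "Accept", "Deny", None, "Accept"],
    "Annual Income": [95, 40, 120, 35, 80, 70],
    "Requested Loan Amount": [300, 450, 500, 400, 350, None],
    "Applicant Age": [45, 23, 38, 29, 51, 33],
    "Credit Score": [720, 580, 760, 600, 690, 710],
    "Application ID": [1, 2, 3, 4, 5, 6],
    "Dummy Variable": [0, 1, 0, 1, 0, 1],
})

# Select predictors by name: no "Application ID", and "Annual Income" is kept.
features = ["Annual Income", "Requested Loan Amount", "Applicant Age", "Credit Score"]
target = "Accept or Deny"

# Drop incomplete rows once, on predictors and target together, so rows stay aligned.
mortgage_applications_nonans = mortgage_applications[features + [target]].dropna()
X = mortgage_applications_nonans[features]
Y = mortgage_applications_nonans[target]

# Fit the decision tree with (X, Y) rather than a formula.
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, Y)

# A new applicant: makes 60k, wants 500k, is 30, has a credit score of 680.
# predict() expects one row per application, with the same columns as X.
new_application = pd.DataFrame([[60, 500, 30, 680]], columns=features)
prediction = clf.predict(new_application)
print(prediction)
```

This sketch fits on all of the available rows. To judge how well the model performs, we would also hold out a testing set and compute a confusion matrix on it.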

### Discussion
Imagine that you work for a mortgage company, and you are building a classification model to decide whether to accept mortgage applications. Please answer the following questions. Try to come to a consensus about your answers, but if you cannot, write down your disagreement:

- Why would it usually be useful for the algorithm to have more information about the applicant? 


- Consider the following information: {Applicant’s income, net worth, undergraduate GPA, profession, criminal history, marriage status (including common law), age}. 
    - In small groups, group the information into two categories:
        1. Information which is ethically appropriate for the bank to take into consideration in deciding whether to give someone a mortgage.
        2. Information which is ethically inappropriate for the bank to take into consideration in deciding whether to give someone a mortgage.
    - For each bit of information that you think is ethically inappropriate to take into consideration, explain why you think it is inappropriate, by picking one of the following options and expanding on it in a couple sentences:  
        1. Taking it into account would produce more harm than benefit overall. 
        2. Taking it into account would violate someone’s rights. 
        3. Taking it into account would be ethically wrong for some other reason. 

## Tutorial Assignment (get started... )
Suppose we are designing a classification model for reviewing applications to a prestigious medical school, with the goal of predicting likelihood of successful graduation, which would then be used to determine who should be admitted. Using the framework discussed, please answer the following two questions:
1. Consider the following information: {undergraduate GPA, MCAT score, volunteer activity, criminal history, marriage status (including common law), age}. 
    - Sort the information into two categories: information which is ethically appropriate, and information which is ethically inappropriate, to take into consideration in deciding whether to admit someone.
    - For each bit of information that you think is ethically inappropriate to take into consideration, explain why you think it is inappropriate, by picking one of the following options and expanding on it in a couple sentences:  
        1. Taking it into account would produce more harm than benefit overall. 
        2. Taking it into account would violate someone’s rights. 
        3. Taking it into account would be ethically wrong for some other reason. 
2. Which metric or metrics (accuracy, sensitivity, specificity) should you prioritize in judging the performance of your model? How is your answer different from the mortgage company case discussed in tutorial?

### Notes on approaching the writing prompt

- Hand in the assignment on Quercus
- Use full sentences
- Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner (without slang or emojis) 
- Aim for 200 - 500 words
- Do not spend more than 90 minutes on the prompt (unless you really need to...)