# STA130 Tutorial 9: Classification (and Ethics)


This week we'll be doing something a little different, and we'll be focusing on ethical topics in anticipation of our upcoming embedded ethics guest lecture.
Remember! It's always a good time for questions and discussions. If you don't understand something, ask.

# This Week's Vocab (10 minutes):
If you are not familiar with any of these words, now is the time to ask!

- Classification / Classifier
- Prediction / Predictor(s)
- Covariate(s)
- Input(s) / Output(s)
- Training set/sample
- Validation
- Testing set/sample (or test set)
- Fitting a model
- Confusion matrix
- Tree / Node
- Terminal node (or leaf node)
- Accuracy
- True positive rate (sensitivity)
- True negative rate (specificity)
- False positive / False negative 

# Ethics Primer (20 minutes)

- When we do statistics in the real world, our actions have consequences.
- Classification models (and other statistical methods we have learned, such as linear regression and hypothesis tests) are often used to guide decisions.
- Our actions can affect our employers or the public.
- So when we make decisions, we must consider all the stakeholders who may be affected.

### Discussion
Brainstorm some examples (from the course or elsewhere) of how statistical methods might be used to make decisions or otherwise have an impact on people.

Since statistical analyses can have significant impacts on people, it is important to talk about the ethics of those analyses.

One simple framework for what we should ethically do:
1. Consider the impacts of your action: does it help or harm people? [If it harms people overall, you shouldn’t do it]
2. Does your action violate anyone’s rights? [If it does, you shouldn’t do it]

### What's the difference between impacts and rights?
Imagine you are a doctor at a small-town clinic. You have five patients who need organ transplants. A hitchhiker walks into the clinic for a routine checkup. You determine that no one would notice if the hitchhiker went missing – they don’t have any family or friends.

Should you kill them, take their organs, and save the five patients?
[Answer: No!!!]

The problem in the hitchhiker case is that the hitchhiker seems to have a right to life, even if killing them would result in a net benefit to other people.
- What are some other rights that people have?

Some example answers:
- Right to privacy
- Right to vote
- Right to freedom of X (speech, religion, movement etc.)

From here, we’re going to zoom in on two case studies, mixing together statistics review with ethical analysis!

# Case Study 1: Medical Testing and False Positives and Negatives (30 minutes)
Let's begin by practicing with confusion matrices.

Suppose that we have a classification model designed to judge whether patients have a certain disease. On a testing dataset, it has the following confusion matrix.


|            | Predict disease | Predict no disease |
|------------|-----------------|--------------------|
| Disease    | 17              | 2                  |
| No Disease | 61              | 168                |

- What do the numbers in the different cells represent? Interpret the confusion matrix.
- What are the metrics (accuracy, sensitivity, specificity)? Compute them all from this confusion matrix.
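To check your hand calculations, here is a minimal sketch that computes the three metrics from the table above. It assumes that "disease" is the positive class and that rows are the actual condition while columns are the prediction; the variable names are just illustrative.

```python
# Counts copied from the confusion matrix above
# (assumed layout: rows = actual condition, columns = prediction)
true_positive = 17    # has the disease, predicted disease
false_negative = 2    # has the disease, predicted no disease
false_positive = 61   # no disease, predicted disease
true_negative = 168   # no disease, predicted no disease

total = true_positive + false_negative + false_positive + true_negative

# Accuracy: share of all patients classified correctly
accuracy = (true_positive + true_negative) / total

# Sensitivity (true positive rate): share of sick patients who are caught
sensitivity = true_positive / (true_positive + false_negative)

# Specificity (true negative rate): share of healthy patients who are cleared
specificity = true_negative / (true_negative + false_positive)

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

Under these assumptions the model is right for about 75% of patients overall, catches about 89% of the people who have the disease, and correctly clears about 73% of the people who do not.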

### Discussion
Consider the following stakeholders: (a) people with the illness, (b) people without the illness, and (c) the medical system overall.
   1. What are the impacts of a false negative in this case? Consider all three groups of stakeholders.
   2. What are the impacts of a false positive in this case? Consider all three groups of stakeholders.
   3. What do your answers for (1) and (2) imply for what we should do about false positives and false negatives? In your answer, you should take into account, if applicable: (a) the relative number of people in each stakeholder group, (b) the extent to which they are affected, and (c) whether any of their rights are infringed.

### More discussion
Now consider the same medical testing case, but suppose the disease being tested for is a serious, life-threatening condition such as cancer.
   1. What are the impacts of a false negative in this case? Consider all three groups of stakeholders.
   2. What are the impacts of a false positive in this case? Consider all three groups of stakeholders.
   3. What do your answers for (1) and (2) imply for what we should do about false positives and false negatives? In your answer, you should take into account, if applicable: (a) the relative number of people in each stakeholder group, (b) the extent to which they are affected, and (c) whether any of their rights are infringed.

# Case Study 2: Mortgage Applications (30 minutes)


Imagine we are a mortgage company, and we have a dataset stored in a dataframe ```mortgage_applications``` with the following fields:
- ```"Accept or Deny", "Annual Income" (in thousands), "Requested Loan Amount" (in thousands), "Applicant Age","Credit Score" ,"Application ID", "Dummy Variable"```

We want to build a classification model that decides, based on the available information, whether we should accept or deny a new application. A new application contains all the fields from the dataset except "Accept or Deny".

Figure out as many problems with the following code snippet as you can. (They may or may not be coding errors.)


```
clf = tree.DecisionTreeClassifier()
mortgage_applications_nonans = mortgage_applications[["Annual Income", "Requested Loan Amount", "Applicant Age","Credit Score", "Application ID"]] # We don't care about "Dummy Variable"
X = mortgage_applications_nonans.iloc[:,1:].dropna()
Y = mortgage_applications["Accept or Deny"].dropna()

clf = clf.fit(Y ~ X) # We fit our model with the available data

clf.predict(60,500, 30, 680) # We want to know whether to accept the application of someone who makes 60k, wants 500k, is 30, and has a credit score of 680
```

Here are some problems you may or may not have noticed:
- "Application ID" should not be included in the training data. It carries no real information about the applicant and only gives the decision tree an opportunity to learn spurious correlations.
- ```.iloc[:,1:]``` drops the first selected column, which is "Annual Income" (a useful predictor), and keeps "Application ID".
- Calling ```.dropna()``` separately on ```X``` and ```Y``` can drop different rows from each, so the features and labels no longer line up, which ruins the data. Instead, drop incomplete rows once, from a dataframe that contains both the features and "Accept or Deny", before splitting it into ```X``` and ```Y```.
- ```Y ~ X``` is the format for giving a formula in linear regression. We need different syntax when fitting a decision tree: it should be ```clf.fit(X,Y)```.
- When using ```clf.predict()```, we need to pass an array or dataframe containing the feature values instead of passing them directly as function arguments.
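One possible way to address the problems listed above is sketched below. It is not the only fix. The small ```mortgage_applications``` dataframe is made-up toy data so that the snippet runs on its own, and names such as ```features```, ```complete```, and ```new_application``` are illustrative assumptions rather than part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Hypothetical toy data standing in for the real `mortgage_applications`
mortgage_applications = pd.DataFrame({
    "Accept or Deny": ["Accept", "Deny", "Accept", None, "Deny", "Accept"],
    "Annual Income": [85, 40, 120, 60, np.nan, 95],
    "Requested Loan Amount": [300, 450, 500, 250, 400, 350],
    "Applicant Age": [34, 29, 45, 38, 52, 41],
    "Credit Score": [720, 610, 780, 690, 580, 740],
    "Application ID": [1, 2, 3, 4, 5, 6],
    "Dummy Variable": [0, 1, 0, 1, 0, 1],
})

# Select predictors by name; "Application ID" and "Dummy Variable" are left out
features = ["Annual Income", "Requested Loan Amount", "Applicant Age", "Credit Score"]
target = "Accept or Deny"

# Drop incomplete rows once, with features and target together,
# so that X and Y always refer to the same applications
complete = mortgage_applications[features + [target]].dropna()
X = complete[features]
Y = complete[target]

# scikit-learn takes the features and labels as two separate arguments
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, Y)

# predict() expects a 2D input: one row per application, one column per feature
new_application = pd.DataFrame([[60, 500, 30, 680]], columns=features)
prediction = clf.predict(new_application)
print(prediction[0])
```

Fixing the code does not settle the ethical questions: whether a feature such as "Applicant Age" should be used at all is a separate matter from whether the code runs.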

### Discussion
Imagine that you work for a bank, and you are building a classification model to decide whether to accept mortgage applications.

1. Why would it usually be useful for you to have more information about the person who makes the application?
2. Consider the following information: {Income, net worth, profession, criminal history, marital status (including common law), value of property, location of property}. In your groups, sort the information into two categories:
    - Information which is appropriate for the bank to take into consideration in deciding whether to give someone a mortgage.
    - Information which is inappropriate for the bank to take into consideration in deciding whether to give someone a mortgage.

# Tutorial Assignment (get started... )
Suppose we are designing a classification model for reviewing applications to a prestigious medical school. Discuss how much weight we should give to the different metrics when judging the performance of our model. (Hint: identify the stakeholders and the effects that false positives and false negatives would have on them.)

### Notes on approaching the writing prompt

- Hand in the assignment on Quercus
- Use full sentences
- Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner (without slang or emojis) 
- Aim for 200 - 500 words
- Do not spend more than 90 minutes on the prompt (unless you really need to...)