# Risk Ratios for Journalists

Welcome! This notebook will take you through the basics of understanding and working with risk ratios. A risk ratio is a formula that is useful across a wide variety of analyses, including vaccine effectiveness, pay-to-play meetings with politicians, employment discrimination, and TSA security screening. Although the formula is simple, it connects to deep topics like false positive rates, causal inference, and the important notion of conditional probability.

In [1]:
import pandas as pd

## Example

A risk ratio, also called a relative risk, is the ratio of two probabilities. Each of these probabilities represents something happening to one of two groups, which are sometimes called the "exposed" and "unexposed" groups. For example, let's consider the probability of lung cancer for female smokers as compared to non-smokers. One [recent study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3296911/) reported the following numbers:



In [2]:
smokers = pd.DataFrame({'no cancer':[657,1902],'cancer':[485,609]}, index=['smoker','nonsmoker'])
smokers

Unnamed: 0,no cancer,cancer
smoker,657,485
nonsmoker,1902,609


We might naturally be interested in how much more likely a smoker is to get cancer than a non-smoker. There are various ways to express this mathematically, but one natural way is to examine the probability of developing cancer for each of the two groups.


In [3]:
p_cancer_given_smoker = smokers['cancer']['smoker'] / (smokers['cancer']['smoker'] + smokers['no cancer']['smoker'])
p_cancer_given_smoker

0.4246935201401051

In [4]:
p_cancer_given_nonsmoker = smokers['cancer']['nonsmoker'] / (smokers['cancer']['nonsmoker'] + smokers['no cancer']['nonsmoker'])
p_cancer_given_nonsmoker

0.24253285543608125

All we are doing here is calculating the percentage of smokers and non-smokers who got cancer, which is why we divide by the total number of smokers/non-smokers in the formulas above. We're using the notation `p_cancer_given_smoker` to mean "the probability of cancer given that someone is a smoker." This is the language of *conditional probability*, a key concept which we'll come back to.
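The same two conditional probabilities can also be computed for both groups at once by dividing the cancer column by the row totals. This is only a sketch of an alternative; it rebuilds the table from the counts above so that it runs on its own, and the names `group_totals` and `p_cancer_given_group` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Same counts as the table above
smokers = pd.DataFrame(
    {'no cancer': [657, 1902], 'cancer': [485, 609]},
    index=['smoker', 'nonsmoker'],
)

# Total number of people in each group (sum across each row)
group_totals = smokers.sum(axis=1)

# P(cancer | group): cancer cases divided by group size, one value per row
p_cancer_given_group = smokers['cancer'] / group_totals
print(p_cancer_given_group)
```

The result is a pandas Series indexed by group, so `p_cancer_given_group['smoker']` matches `p_cancer_given_smoker` above.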

From this, we can see that smokers have about a 42% chance of developing lung cancer, whereas non-smokers have a 24% chance. Clearly smoking is associated more strongly with cancer, but how should we quantify this? We might want to have a single number for a variety of reasons, such as to compare the effect of different risk factors -- is smoking more or less dangerous than never exercising? But it's also useful to have one number to summarize the risk, or rather, the *change in risk*, since in this sample there's a 24% chance of developing cancer even if you're a non-smoker.

The simplest way to do this would be to subtract the two probabilities:

In [5]:
p_cancer_given_smoker - p_cancer_given_nonsmoker

0.18216066470402384

This is a difference of probabilities, a number known as a *risk difference*. It's less commonly used, in part because a small risk difference can hide a big effect when an outcome is rare. If only 1% of the non-smokers in our example developed cancer while 5% of the smokers did, then the risk difference would be only 4 percentage points even though smokers are "five times as likely" to develop cancer. The phrase "times as likely" suggests a multiplication, and at last we come to the risk ratio:

In [6]:
p_cancer_given_smoker / p_cancer_given_nonsmoker

1.7510762382131426

We would say that smokers are "1.75 times as likely" to develop cancer as non-smokers.
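To see how differently the two measures behave when an outcome is rare, here is a small sketch using the hypothetical rates mentioned earlier (1% of non-smokers and 5% of smokers). These rates are illustrative only and do not come from the study.

```python
# HYPOTHETICAL rates from the rare-outcome example, not from the study
p_exposed = 0.05     # 5% of smokers develop cancer
p_unexposed = 0.01   # 1% of non-smokers develop cancer

# Risk difference: subtract the probabilities (about 4 percentage points)
risk_difference = p_exposed - p_unexposed

# Risk ratio: divide the probabilities (about "five times as likely")
risk_ratio = p_exposed / p_unexposed

print(f"risk difference: {risk_difference:.2f}")
print(f"risk ratio: {risk_ratio:.1f}")
```

The risk difference looks small while the risk ratio looks large, even though both describe the same two probabilities.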

## Definition

Given two groups and two outcomes, we can place these four numbers in a table like this:

| Group | Positive | Negative |
| - | - | - |
| Exposed | a | b |
| Unexposed | c | d |

Then the *risk ratio* is defined as `(a/(a+b)) / (c/(c+d))`. This is also sometimes called the *relative risk*.

Note that there is a closely related quantity called the *odds ratio*, which is calculated as a ratio of odds instead of a ratio of probabilities, that is `(a/b) / (c/d)`. Generally this is less used than the risk ratio but has advantages in [certain circumstances](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640017/).
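As a minimal sketch, both definitions can be wrapped in small helper functions. The function names `risk_ratio` and `odds_ratio` are our own, and the example plugs in the smoking counts from earlier, with "positive" meaning cancer.

```python
def risk_ratio(a, b, c, d):
    """Ratio of probabilities: (a/(a+b)) / (c/(c+d))."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Ratio of odds: (a/b) / (c/d)."""
    return (a / b) / (c / d)


# Smoking example: a, b = smokers with / without cancer,
#                  c, d = non-smokers with / without cancer
rr = risk_ratio(485, 657, 609, 1902)
odds = odds_ratio(485, 657, 609, 1902)
print(f"risk ratio: {rr:.2f}, odds ratio: {odds:.2f}")
```

Note that the two numbers differ for the same table, so it matters which one a study is reporting.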

# Exercise: COVID vaccine risk ratio

Here's a [paper](https://www.nejm.org/doi/full/10.1056/nejmoa2035389) which reports on the phase 3 clinical trials of the Moderna vaccine. What is the risk ratio that describes the effectiveness of this vaccine? You will need to read the paper to find the four values a,b,c,d as above.

In [7]:
a = 11 # number of people who DID get the vaccine and DID get COVID
b = 15210-11 # number of people who DID get the vaccine and DID NOT get COVID
c = 185 # number of people who DID NOT get the vaccine and DID get COVID
d = 15210-185 # number of people who DID NOT get the vaccine and DID NOT get COVID

In [8]:
# Then calculate the risk ratio itself
(a/(a+b)) / (c/(c+d))

0.05945945945945945

What does it mean that this number is smaller than 1?



A much lower percentage of vaccinated people got COVID than unvaccinated people.

This paper says that the vaccine "showed 94.1% efficacy at preventing Covid-19 illness." Where does this number come from?

It's one minus the risk ratio, expressed as a percentage (times 100).
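A quick check of that arithmetic, as a sketch that reuses the four counts entered in the exercise above (the variable name `efficacy_percent` is our own):

```python
# Counts entered in the exercise above
a = 11             # DID get the vaccine and DID get COVID
b = 15210 - 11     # DID get the vaccine and DID NOT get COVID
c = 185            # DID NOT get the vaccine and DID get COVID
d = 15210 - 185    # DID NOT get the vaccine and DID NOT get COVID

risk_ratio = (a / (a + b)) / (c / (c + d))

# Efficacy as one minus the risk ratio, expressed as a percentage
efficacy_percent = (1 - risk_ratio) * 100
print(f"{efficacy_percent:.1f}%")
```

With these counts the result rounds to the 94.1% quoted in the paper.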

# Risk ratio vs. risk difference

In general, there are several different ways to report vaccine efficacy, which carry similar but not identical information. Risk ratios are the most common, but you may be interested in [arguments](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7996517/) for using other quantitative measures of risk.

Picture

less likely / more likely = risk difference / absolute risk reduction
times as likely = risk ratio / relative risk

# Writing about risk

# Conditional Probability
Pictures

# VAERS example

The same logic of missing one half of the 2x2 table explains why the Vaccine Adverse Event Reporting System (VAERS) database cannot be used to calculate the deaths caused by the COVID vaccine. Actually, there are [all sorts of problems](https://www.nebraskamed.com/COVID/does-vaers-list-deaths-caused-by-covid-19-vaccines) with attempting to count vaccine-related deaths from this database. It's an open database and only collects reports of "adverse reactions" *after* vaccinations -- no attempt is made to determine causality at this stage. But if you understand risk ratios, you'll also understand immediately that it's not possible to determine from this sort of data whether vaccines increase the risk of death.


# Exercise: Clinton Foundation Meetings

One great advantage of getting comfortable with risk ratios is being able to recognize when something should really be a risk ratio, but isn't. This is often the case when trying to detect quid-pro-quo corruption from data only.

In summer 2016 the [AP reported](https://apnews.com/article/82df550e1ec646098b434f7d5771f625) that "at least 85 of 154 people from private interests who met or had phone conversations scheduled with Clinton while she led the State Department donated to her family charity or pledged commitments to its international programs." Is this evidence that donating to the Clinton Foundation would get you a meeting with the Secretary of State?

In risk ratio terms, this would mean that the "risk" (probability) of getting a meeting for people who donated is higher than for those who did not donate, i.e. the risk ratio is greater than one. Let's try to calculate it.

In [9]:
a = 85 # number of people who DID donate and DID get a meeting
b = ??? # number of people who DID donate and DID NOT get a meeting
c = 154-85 # number of people who DID NOT donate and DID get a meeting
d = ??? # number of people who DID NOT donate and DID NOT get a meeting

SyntaxError: invalid syntax (<ipython-input-9-637cf19d54e2>, line 2)

It turns out we cannot calculate the risk ratio. The basic problem is that we only have information on people who met with Clinton, not people who wanted to meet with her but did not, regardless of whether they also donated to the Clinton Foundation. The fact that only two numbers are reported in the story should be a clue that we can't calculate the advantage of donating -- we need four numbers to do that. (We could also calculate the advantage with two probabilities: the probability of getting a meeting if you donated, and the probability of getting a meeting if you did not.)

But surely there's still something suspicious here if so many of the meetings she took were with donors, right? It depends! In particular, it depends on what you think the missing numbers might be. What if there were a thousand people who wanted a meeting and didn't get one, and 600 of them donated?

In [None]:
a = 85 # number of people who DID donate and DID get a meeting
b = 600 # number of people who DID donate and DID NOT get a meeting
c = 154-85 # number of people who DID NOT donate and DID get a meeting
d = 400 # number of people who DID NOT donate and DID NOT get a meeting

In [None]:
# risk ratio calculation
(a/(a+b)) / (c/(c+d))

In this case the risk ratio is smaller than one, meaning that donation actually made your chances worse -- or more precisely, correlated with a lower probability of getting a meeting. We'll talk about causation a little later.
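To see how strongly the conclusion depends on the missing numbers, here is a sketch that tries several guesses. Everything about the people who did not get a meeting is hypothetical: we assume 1,000 of them, as in the scenario above, and vary how many were donors. Only `a` and `c` come from the AP story.

```python
def risk_ratio(a, b, c, d):
    """Ratio of probabilities: (a/(a+b)) / (c/(c+d))."""
    return (a / (a + b)) / (c / (c + d))


a = 85          # DID donate and DID get a meeting (from the AP story)
c = 154 - 85    # DID NOT donate and DID get a meeting (from the AP story)

# HYPOTHETICAL: 1,000 people wanted a meeting and did not get one.
# Try different guesses for how many of them were donors.
for donors_turned_down in (100, 300, 600, 900):
    b = donors_turned_down           # DID donate and DID NOT get a meeting
    d = 1000 - donors_turned_down    # DID NOT donate and DID NOT get a meeting
    print(donors_turned_down, round(risk_ratio(a, b, c, d), 2))
```

Depending on the guess, the risk ratio can land well above one or well below one, which is why the two reported numbers alone cannot settle the question.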

# Provincetown Example

# Relationship to Causality

# Coaches

Direct analysis

Suggest Logistic Regression

# Odds Ratio
Probability vs. Odds

Probability to odds conversion

Odds ratio

Probability to Odds ratio conversion


# TSA Screening Example
Or cancer screening example?

If the test is positive, what is the probability that this person is actually a terrorist?

Pictures to explain the problem, with conditional probability notation

# Bayes Theorem and The Base Rate Fallacy

Bayes Theorem
Use it to solve the TSA example
Graph of P(terrorist|positive) vs P(positive|terrorist)
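
As a placeholder until real figures are filled in, here is a minimal sketch of how Bayes' theorem applies to the screening question. All three input numbers are made up purely for illustration and are not real TSA or screening statistics.

```python
# ALL numbers below are MADE UP for illustration, not real screening data
p_terrorist = 1e-6              # base rate: P(terrorist)
p_pos_given_terrorist = 0.99    # P(positive | terrorist)
p_pos_given_innocent = 0.01     # P(positive | not terrorist), the false positive rate

# Total probability of a positive test, over both kinds of traveler
p_positive = (p_pos_given_terrorist * p_terrorist
              + p_pos_given_innocent * (1 - p_terrorist))

# Bayes' theorem: P(terrorist | positive)
p_terrorist_given_pos = p_pos_given_terrorist * p_terrorist / p_positive
print(p_terrorist_given_pos)
```

Even with a test that is right 99% of the time in this made-up scenario, P(terrorist|positive) stays far below P(positive|terrorist) because the base rate is so low.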


# What percentage of people with COVID are vaccinated?

real data from somewhere