<a href="https://colab.research.google.com/github/jason-howald/Math-125/blob/master/COVID_19_Fatalities_by_Race_and_Ethnicity_2020_04_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from scipy import stats
from math import sqrt
import pandas as pd

# Read race/ethnicity and fatality data

In [0]:
deaths = pd.read_csv("https://docs.google.com/spreadsheets/d/1-Ef2e_pyOOrkFMUf7bAAuGSC0M-K1hhKx6lKBbv-UFw/export?format=csv")
deaths

Unnamed: 0,race/ethnicity,prop_observed,prop_predicted,count_observed,count_predicted
0,hispanic,0.34,0.29,3301,2815
1,black,0.28,0.22,2718,2136
2,white,0.27,0.32,2621,3107
3,asian,0.07,0.14,680,1359
4,other,0.04,0.03,388,291


# Proportion of fatalities which are Hispanic -- a test

1. Identify hypotheses:
  - $H_A$: $p > 0.29$. More than 29% of (all potential) fatalities are Hispanic.
  - $H_0$: $p = 0.29$. 29% of (all potential) fatalities are Hispanic.
2. Know your Null Story:
  - Hispanic people in NY are as likely as other New Yorkers to die of COVID-19.
  - We should expect 29% of fatalities to be Hispanic.
  - We should expect some variability due to random chance (SE)
  - Deviations from 29% are explained by coincidence.
3. Calculate measurements from data
4. Calculate SE and Z:
$$SE = \sqrt{\frac{p(1-p)}{n}} $$
$$Z = \frac{\text{measured} - \text{predicted}}{\text{SE}}$$
5. Deduce the p-value from Z using a CDF tool.


In [0]:
#Copy p and p_hat from dataframe above.
p = 0.29
p_hat = 0.34

#Sample size n is sum of fatality counts.
n = deaths.count_observed.sum()

#Calculate SE and Z
SE = sqrt(p*(1-p)/n)
Z = (p_hat-p)/SE

#Calculate pvalue from Z. We could use a "CDF calculator" such as 
#https://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
#...but we'll just use special function stats.norm.cdf 
#It gives a left tail so we use -Z instead of Z.
pvalue = stats.norm.cdf(-Z)
print(f" n={n} \n SE={SE:.4f} \n Z={Z:.2f} \n pvalue={pvalue:.35f}")

 n=9708 
 SE=0.0046 
 Z=10.86 
 pvalue=0.00000000000000000000000000092438036


The test statistic Z is enormous. The p-value is microscopic.

**Conclusion:** Reject the Null Hypothesis. $p>0.29$. 

What does this really mean in practice?
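
The steps above can also be wrapped in a small reusable function. The sketch below is only an illustration: `one_proportion_ztest` is a hypothetical helper, not part of the original notebook. It uses the raw count (3301 of 9708) instead of the rounded 0.34, so later decimals of Z may differ slightly from the cell above. The right-tail area is computed with the standard library's `math.erfc`; `stats.norm.sf(Z)` should give the same tail without the `-Z` trick.

```python
from math import sqrt, erfc

def one_proportion_ztest(count, n, p0):
    """Right-tailed z-test of H_0: p = p0 against H_A: p > p0.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    p_hat = count / n                  # observed proportion
    se = sqrt(p0 * (1 - p0) / n)       # SE computed under the null
    z = (p_hat - p0) / se
    # Right-tail area of the standard normal: 0.5 * erfc(z / sqrt(2)).
    pvalue = 0.5 * erfc(z / sqrt(2))
    return se, z, pvalue

# Hispanic fatalities: 3301 of 9708, tested against p0 = 0.29.
se, z, pvalue = one_proportion_ztest(3301, 9708, 0.29)
print(f"SE={se:.4f}  Z={z:.2f}  pvalue={pvalue:.3g}")
```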

# Comparing Black and White fatalities

## Preliminary Attempt:



1. Identify hypotheses:
  - $H_A$: $p_1 > p_2$. $p_1-p_2>0$. Among fatalities, more are black than white.
  - $H_0$: $p_1 = p_2$. $p_1-p_2=0$. Fatalities are equally likely to be black vs. white.

This is a **poorly considered hypothesis:** 
- White New Yorkers outnumber black New Yorkers. 
- It's not natural to expect **equal** white and black fatalities. 
- It's natural to expect equal **probability of death by COVID-19** between two populations. 
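
A quick calculation makes the problem concrete. This is a rough sketch using only figures quoted in this notebook (population shares of 22% and 32%, and the reported fatality counts); the variable names are illustrative. If per-person risk were equal, fatalities should split roughly in proportion to population, not evenly.

```python
# Population shares and fatality counts quoted in this notebook.
pop_share_b, pop_share_w = 0.22, 0.32
fatal_b, fatal_w = 2718, 2621

# If per-person risk were equal, fatalities would split like the populations.
expected_ratio = pop_share_b / pop_share_w   # black : white under equal risk
observed_ratio = fatal_b / fatal_w           # black : white in the data

print(f"expected black:white ratio under equal risk = {expected_ratio:.2f}")
print(f"observed black:white ratio                  = {observed_ratio:.2f}")
```

Under equal risk the expected ratio is below 1, so near-equal counts are already a sign of unequal rates.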


## Correctly compared:

$p_b =$ proportion of black people in NYC who have died of COVID-19.

$p_w =$ [same for white people]

These will be rather small numbers -- thousands out of millions.

1. Identify hypotheses:
  - $H_A$: $p_b-p_w>0$. In NY, blacks more likely than whites to die of COVID-19.
  - $H_0$: $p_b-p_w=0$. In NY, blacks and whites equally likely.
2. Know your Null Story:
  - In NY, blacks and whites are in fact equally likely.
  - We should expect $p_b-p_w=0$
  - We should expect some variability due to random chance (SE of the difference of two proportions!)
  - Deviations from 0% are explained by coincidence.
3. Calculate measurements from data
4. Calculate SE and Z:
\begin{align}
SE &= \sqrt{\left(\sqrt{\frac{\hat{p}_b(1-\hat{p}_b)}{n_b}}\right)^2 + \left(\sqrt{\frac{\hat{p}_w(1-\hat{p}_w)}{n_w}}\right)^2}\\
&= \sqrt{\frac{\hat{p}_b(1-\hat{p}_b)}{n_b} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}}
\end{align}

$$Z = \frac{\text{measured} - \text{predicted}}{\text{SE}} = \frac{(\hat{p}_b - \hat{p}_w)-0}{SE}$$
5. Deduce the p-value from Z using a CDF tool.



In [0]:
popNYC = 8399000  #Estimated total population.
n_b = popNYC*0.22 #From demographic data.
n_w = popNYC*0.32 #From demographic data.
fatal_b = 2718 #Reported fatalities, black
fatal_w = 2621 #Reported fatalities, white
p_hat_b = fatal_b / n_b #Fatality rate, black
p_hat_w = fatal_w / n_w #Fatality rate, white

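#Unpooled SE, kept for comparison only (not used for Z below):
SE_unpooled = sqrt(p_hat_b*(1-p_hat_b)/n_b + p_hat_w*(1-p_hat_w)/n_w)
#Pooled estimate of the common proportion under H_0:
pooled = (fatal_b + fatal_w)/(n_b+n_w)
#SE using the pooled estimate (this is the SE used for Z):
SE = sqrt(pooled*(1-pooled)/n_b + pooled*(1-pooled)/n_w)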

Z = ((p_hat_b-p_hat_w) - 0)/SE

pvalue = stats.norm.cdf(-Z)
print(f" p_b = {p_hat_b:.6f} \n p_w = {p_hat_w:.6f} \n SE={SE:.8f} \n Z={Z:.2f} \n pvalue={pvalue:.55f}")

 p_b = 0.001471 
 p_w = 0.000975 
 SE=0.00003277 
 Z=15.13 
 pvalue=0.0000000000000000000000000000000000000000000000000005201


## Pooled estimates:

...but if $H_0$ says that $p_b=p_w$, why are we using two different proportions in the SE formula? 

**Pooled Estimate for p**:
If $H_0$ is true, the best estimator for the proportion $p = p_b = p_w$ is the pooled estimate:

$$p = \frac{\text{deaths}}{\text{sample}} = \frac{\text{fatal}_b + \text{fatal}_w}{n_b + n_w}$$

**Using the pooled estimate**:

Use the pooled $p$ in place of both $\hat{p}_b$ and $\hat{p}_w$ in the combined formula for SE:

$$SE = \sqrt{SE_b^2 + SE_w^2} = \sqrt{\frac{p(1-p)}{n_b} + \frac{p(1-p)}{n_w}}$$
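
To see how much the pooling choice matters, here is a minimal sketch that computes the test both ways. `two_proportion_ztest` and its `pooled` flag are hypothetical names introduced for illustration, not part of the original notebook. The population figures are the same estimates used in the cell above.

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2, pooled=True):
    """Right-tailed z-test of H_0: p1 = p2 against H_A: p1 > p2.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    p1, p2 = x1 / n1, x2 / n2
    if pooled:
        p = (x1 + x2) / (n1 + n2)              # common proportion under H_0
        se = sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    else:
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    pvalue = 0.5 * erfc(z / sqrt(2))           # right-tail normal area
    return se, z, pvalue

pop_nyc = 8399000                              # estimated total population
n_b, n_w = pop_nyc * 0.22, pop_nyc * 0.32      # from demographic data

se_pooled, z_pooled, _ = two_proportion_ztest(2718, n_b, 2621, n_w, pooled=True)
se_unpooled, z_unpooled, _ = two_proportion_ztest(2718, n_b, 2621, n_w, pooled=False)
print(f"pooled:   SE={se_pooled:.8f}  Z={z_pooled:.2f}")
print(f"unpooled: SE={se_unpooled:.8f}  Z={z_unpooled:.2f}")
```

On these numbers the two choices shift Z only modestly, and either way Z stays far out in the tail.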



**Scientific Conclusion:**
  - Reject Null Hypothesis.
  - $p_b > p_w$.
  - Fatality rate among blacks **significantly higher** than fatality rate among whites.

What does **significant** mean here?
  - Significant doesn't mean **big**, though it is big.
  - Significant doesn't mean **important**, though it is important.
  - Significant doesn't tell us whether it's **blameworthy**, or whom to blame. 
  - Significant means **very unlikely to be coincidence**: chance alone does not plausibly explain the effect, so we treat it as **real**.
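
Since significance says nothing about size, it helps to look at the size separately. The sketch below restates the two fatality rates per 100,000 people and as a ratio, using the same population estimates as the code above. The variable names are illustrative, and the risk ratio is an added summary, not something computed in the original notebook.

```python
pop_nyc = 8399000                  # estimated total population
rate_b = 2718 / (pop_nyc * 0.22)   # black fatality rate
rate_w = 2621 / (pop_nyc * 0.32)   # white fatality rate

per_100k_b = rate_b * 100_000
per_100k_w = rate_w * 100_000
risk_ratio = rate_b / rate_w       # how many times higher the black rate is

print(f"black: {per_100k_b:.1f} deaths per 100,000")
print(f"white: {per_100k_w:.1f} deaths per 100,000")
print(f"risk ratio (black / white): {risk_ratio:.2f}")
```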

What these numbers can't tell us about higher mortality rates:
  - Is it due to biological difference?
  - Is it due to differences in COVID-19 treatments?
  - Is it due to different exposure rates?
  - Is it somebody's fault?
  - Is it due to differences in prior comorbidities?
  - Is it another symptom of long term systemic inequalities?

**Moral Conclusion:**

- We don't know why black fatalities are higher, but they are and it matters.
- When it comes to a pandemic, the goal is survival, not equality in death. 
  - We must fight the virus with isolation, manufacturing, treatment, and vaccination, guided by science. 
  - We need honest, intelligent, compassionate leadership. Trump fails. Listen to your governor. You can check your voter registration at [vote.org](https://www.vote.org/)

# The $\chi^2$ Test for proportionality

- We tested a proportion (Hispanic fatalities) against a fixed value. 
- We tested a difference between two proportions. 
- These tests are (arguably) **overspecific**.

**The $\chi^2$ test**: Tests proportionality of several proportions, compared to some expectation.

In [0]:
deaths

Unnamed: 0,race/ethnicity,prop_observed,prop_predicted,count_observed,count_predicted
0,hispanic,0.34,0.29,3301,2815
1,black,0.28,0.22,2718,2136
2,white,0.27,0.32,2621,3107
3,asian,0.07,0.14,680,1359
4,other,0.04,0.03,388,291


**count_observed** is based on actual fatalities. 

**count_predicted** is based on overall racial/ethnic proportionality, using the same number of fatalities.
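
As a sanity check, the predicted counts can be rebuilt from the table. This sketch assumes each predicted count is the population share times the total number of fatalities, rounded to the nearest whole number. The rounding step is an assumption, though it does reproduce the `count_predicted` column.

```python
# Values copied from the deaths table above.
prop_predicted = {"hispanic": 0.29, "black": 0.22, "white": 0.32,
                  "asian": 0.14, "other": 0.03}
count_observed = {"hispanic": 3301, "black": 2718, "white": 2621,
                  "asian": 680, "other": 388}

total = sum(count_observed.values())   # total fatalities across all groups

# Assumed rule: predicted count = population share * total, rounded.
count_predicted = {group: round(share * total)
                   for group, share in prop_predicted.items()}
print(total, count_predicted)
```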

With the $\chi^2$ test we can test proportionality all at once:

**The $\chi^2$ hypotheses**

- $H_0$: Fatality proportions across all races/ethnicities are consistent with overall population proportions.

- $H_A$: Fatality proportions are different from overall population proportions.

The $\chi^2$ test is an all-at-once approach. It will reject $H_0$ if any inconsistencies are detected.

`scipy.stats` has a $\chi^2$ test built in:

```
scipy.stats.chisquare(f_obs,f_exp)
```

This is how we use it:


In [0]:
observed = deaths['count_observed']
expected = deaths['count_predicted']
stats.chisquare(observed,expected)

Power_divergenceResult(statistic=690.0889843220035, pvalue=4.8776764001883355e-148)

The $\chi^2$ test detects a difference. This data shows multiple strong trends. Notice that the p-value is absurdly small ($4.88 \cdot 10^{-148}$).

**Conclusion:**

Statistically significant difference.

...But why? What to do?
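
One way to start on the "why" is to see which categories drive the statistic. The $\chi^2$ statistic is the sum of $(O - E)^2 / E$ over categories, with degrees of freedom equal to the number of categories minus one. The sketch below recomputes it by hand from the table and lists each category's contribution. It is an illustrative follow-up, not part of the original analysis.

```python
# Observed and predicted counts copied from the deaths table.
groups = ["hispanic", "black", "white", "asian", "other"]
observed = [3301, 2718, 2621, 680, 388]
expected = [2815, 2136, 3107, 1359, 291]

# Each category contributes (O - E)^2 / E to the chi-square statistic.
contributions = {g: (o - e) ** 2 / e
                 for g, o, e in zip(groups, observed, expected)}
chi2_stat = sum(contributions.values())
dof = len(groups) - 1   # degrees of freedom: categories minus one

for g, c in contributions.items():
    print(f"{g:>8}: {c:8.2f}")
print(f"chi2 = {chi2_stat:.4f} with {dof} degrees of freedom")
```

The total should agree with the `statistic` reported by `stats.chisquare` above. Large contributions flag the categories whose observed counts sit furthest from prediction, in either direction.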
