# The Astra Zeneca Covid Vaccine, innocent until proven guilty
> A visual analysis of the EMA report.

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]
- image: images/chart-preview.png

In [1]:
#hide
import pandas as pd
from fastdata.integrations import *
from fastdata.core import *
import plotly.express as px
from IPython.display import HTML

## Goal

This analysis aims to help you understand the potential risks of the Astra Zeneca Covid-19 vaccine using data. Specifically, we look at some of the risks reported by the European Medicines Agency (EMA) on March 25th (the report can be downloaded [here](https://www.ema.europa.eu/en/documents/prac-recommendation/signal-assessment-report-embolic-thrombotic-events-smq-covid-19-vaccine-chadox1-s-recombinant-covid_en.pdf)).


We do not aim to generate new analysis but rather make part of the EMA report more straightforward to understand without going through a lengthy document.

> Warning: The author is not an expert in the field and is applying some general statistical thinking to the problem. Therefore, this analysis may contain errors, omissions, or otherwise inaccurate information.

## Methodology

### Introduction

The EMA report performs an observed to expected (OE) analysis to understand the vaccine's potential risks. We will focus on the analysis of EudraVigilance data (section 3.1.5 in the report), which looks at three categories of potential side effects present in the database:
- Disseminated intravascular coagulation
- Cerebral Venous Sinus Thrombosis
- Embolic and thrombotic events

>Note: EudraVigilance is a database with information about suspected adverse reactions to medicines which have been authorised or are being studied in clinical trials in the European Economic Area (EEA). [Source](https://www.ema.europa.eu/en/human-regulatory/research-development/pharmacovigilance/eudravigilance)

### Observed to expected analysis

The logic of this analysis is to compare how many cases of a condition you have observed among vaccinated people (observed) with how many would usually occur in a population of the same size over the same period (expected). With this, you can calculate an observed to expected (OE) ratio, defined as `# of observed cases / # of expected cases`. If it is larger than 1, you are getting more cases than you "theoretically should." However, the observed number of cases is often small (these are rare events), so the ratio carries substantial statistical uncertainty. To account for this, a 95% confidence interval (95% CI) around the ratio is used (more on this later).
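As a rough illustration of the ratio and its interval, here is a minimal sketch. It assumes the observed count follows a Poisson distribution and uses an exact Poisson interval on that count; the EMA report may use a different method, and all function names here are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """Return P(X <= k) for X ~ Poisson(mu)."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i  # P(X = i) derived from P(X = i - 1)
        total += term
    return total


def _bisect(g, lo, hi, iterations=200):
    """Find a root of g between lo and hi, assuming g changes sign once."""
    g_lo = g(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        g_mid = g(mid)
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid
        else:
            hi = mid
    return (lo + hi) / 2


def oe_ratio_with_ci(observed, expected, alpha=0.05):
    """Observed/expected ratio with an exact Poisson interval on the observed count."""
    if observed == 0:
        lower = 0.0
    else:
        # Mean for which seeing at least `observed` cases has probability alpha / 2.
        lower = _bisect(
            lambda mu: 1 - poisson_cdf(observed - 1, mu) - alpha / 2,
            0.0,
            float(observed),
        )
    # Mean for which seeing at most `observed` cases has probability alpha / 2.
    search_limit = observed + 10 * math.sqrt(observed + 1) + 10
    upper = _bisect(
        lambda mu: poisson_cdf(observed, mu) - alpha / 2,
        float(observed),
        search_limit,
    )
    return observed / expected, lower / expected, upper / expected


# Assumed inputs: 4 observed cases against 2 expected cases.
ratio, ci_low, ci_high = oe_ratio_with_ci(observed=4, expected=2.0)
print(f"OE = {ratio:.2f}, 95% CI = {ci_low:.2f} - {ci_high:.2f}")
```

With 4 observed cases against 2 expected, the interval comes out at roughly 0.54 to 5.1. Small differences from figures in the report could come from rounding of the expected count or from a different interval method.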

### Data sources

A key input for the analysis is the incidence rate of the specific condition, which is needed to determine the expected cases. It is also important to have data stratified by groups, so that we can analyze not just the general population as a whole but also individual subgroups. That is because the OE ratio can be fine for the overall population but not for individual subgroups who are more at risk (e.g., young people, people with certain pre-existing conditions, etc.).
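To make the link between incidence rate and expected cases concrete, here is a minimal sketch. It assumes every vaccinated person contributes the full 14-day risk window, and the inputs are hypothetical numbers chosen for illustration, not figures from the EMA report.

```python
def expected_cases(incidence_per_100k_person_years, people_at_risk, risk_window_days=14):
    """Expected number of background cases within the risk window."""
    # Total observation time, expressed in person-years.
    person_years = people_at_risk * risk_window_days / 365.25
    return incidence_per_100k_person_years / 100_000 * person_years


# Hypothetical inputs, for illustration only (not taken from the EMA report).
expected = expected_cases(incidence_per_100k_person_years=1.0, people_at_risk=5_000_000)
print(f"Expected cases in 14 days: {expected:.2f}")
```

The same function can be applied per age group, using that group's incidence rate and number of vaccinated people, to obtain the stratified expected counts.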

The databases used for the main analysis for the three events investigated are:
- Coagulation disorder (this was used to compare with the SMQ Embolic and thrombotic events): ARS from Italy
- Disseminated intravascular coagulation: FISABIO from Spain
- Cerebral venous sinus thrombosis: ARS from Italy 

## OE Analysis of potential side-effects

### Disseminated intravascular coagulation (DIC)

>Note: Disseminated intravascular coagulation (DIC) is a rare but serious condition that causes abnormal blood clotting throughout the body’s blood vessels. It is caused by another disease or condition, such as an infection or injury, that makes the body’s normal blood clotting process become overactive. [Source: US NIH](https://www.nhlbi.nih.gov/health-topics/disseminated-intravascular-coagulation)

For those of us who are not medical experts, this diagram helps us understand the condition: it shows a thrombus (blood clot) that has blocked a blood vessel valve.

![](https://upload.wikimedia.org/wikipedia/commons/c/c5/Blood_clot_diagram.png)

In [2]:
#hide
dic = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="DIC")

In [3]:
#hide
dic = dic.drop(
    columns=["EEA Expected 14d","EEA Observed 14d From EV","EEA OE 14d with 95% c.i."])

In [4]:
#hide
dic["oe_ci_interval_min"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [5]:
#hide
dic["oe_ci_interval_max"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [6]:
#hide
dic["oe"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="before_character", 
    keep_unmatched=False, 
    character="(")

In [7]:
#hide
dic = dic.drop(
    columns=["EEA+UK  OE 14d with 95% c.i."])

In [8]:
#hide
dic = dic.astype(
    dtype={"IR per 100,000 Person years From FISABIO" : "float64", "EEA+UK Expected 14d" : "float64", "EEA+UK  Observed 14d From EV" : "float64", "oe_ci_interval_min" : "float64", "oe_ci_interval_max" : "float64", "oe" : "float64"})

When performing the OE analysis, we see that for age groups of 50 and above, there are more expected cases than observed cases, **and therefore there is no signal of increased risk in these groups**.

In [9]:
#hide_input
HTML(px.bar(dic,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA+UK Expected 14d","EEA+UK  Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [10]:
#hide
dic = dic.query("(`Age group`=='20-29' or `Age group`=='30-49')", engine="python").copy()

In [11]:
#hide
dic["error_above_oe"] = dic["oe_ci_interval_max"].subtract(
    other=dic["oe"])

In [12]:
#hide
dic["error_below_oe"] = dic["oe"].subtract(
    other=dic["oe_ci_interval_min"])

But for ages below 50, there are more observed cases than expected.

Let's run through one example to understand this better:
- In the case of 30-49, we expect about 2 cases but observe 4. This means the ratio of observed to expected is about 2.
- But the 95% confidence interval of that ratio runs from 0.54 to 5.16. Roughly speaking, given how few cases there are, any true ratio in that range is compatible with what was observed.
- Cases appear with a certain probability, so sometimes you get fewer than the average and sometimes more. You can think of flipping a coin: on average you get 50% heads, but within ten flips you could easily get more heads or more tails.
- To conclude statistically that there is something unusual that could not likely be attributed to chance, the whole interval needs to lie above 1 (a ratio of 1 means observed equals expected). That is not the case here, because the lower bound is below 1 (0.54 < 1).
- Intuitively, the numbers are so small that one case more or less could easily skew the result, so it is hard to know for sure with so few cases.

**The conclusion is that for these groups (of 30-49 and especially for 20-29) the fact that the observed is greater than the expected is not statistically significant.**
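To put a number on the intuition that small counts swing easily, here is a minimal sketch of how likely it is to see 4 or more cases by chance alone when 2 are expected. It assumes case counts follow a Poisson distribution, which is a common modelling choice for rare events rather than something stated in the report; the function name is hypothetical.

```python
import math


def prob_at_least(k, mu):
    """P(X >= k) for X ~ Poisson(mu)."""
    # Sum the probabilities of 0, 1, ..., k - 1 cases and take the complement.
    below = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1 - below


p = prob_at_least(4, 2.0)
print(f"P(at least 4 cases | 2 expected) = {p:.3f}")
```

Under this assumption the probability is about 14%, or roughly one in seven, so observing 4 cases when 2 are expected is not unusual on its own.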

In [13]:
#hide_input
HTML(px.scatter(dic,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

Something worth discussing is that many of the patients who are getting the vaccinations are probably not the most healthy (if we assume that some rational prioritization is taking place). Given this, it is fair to assume that diseases' incidence may be higher in this group than in the general population.

Unfortunately, it seems that with the given data, we can't control for that.
And this brings us to one of the main conclusions of this analysis: **We don't have good health data to answer these questions.** For example, it is not easy to get the incidence rate broken down by male/female or by pre-existing conditions and compare it with the observed cases.

### Embolic and thrombotic events

In [14]:
#hide
et = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="ET")

In [15]:
#hide
et["oe_ci_interval_min"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [16]:
#hide
et["oe_ci_interval_max"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [17]:
#hide
et["oe"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="before_character", 
    keep_unmatched=False, 
    character="(")

In [18]:
#hide
et = et.drop(
    columns=["EEA OE 14d with 95% c.i."])

In [19]:
#hide
et = et.astype(
    dtype={
           "EEA Expected 14d" : "float64", 
           "EEA Observed 14d From EV" : "float64", 
           "oe_ci_interval_min" : "float64", 
           "oe_ci_interval_max" : "float64", 
           "oe" : "float64"})

When looking at all embolic and thrombotic events, expected cases are above observed cases for every age group except 30-49.

In [20]:
#hide_input
HTML(px.bar(et,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA Expected 14d","EEA Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [21]:
#hide
et = et.query("(`Age group`=='30-49')", engine="python").copy()

In [22]:
#hide
et["error_above_oe"] = et["oe_ci_interval_max"].subtract(
    other=et["oe"])

In [23]:
#hide
et["error_below_oe"] = et["oe"].subtract(
    other=et["oe_ci_interval_min"])

But, as seen before, the point estimate alone is not enough: the excess is statistically significant only if the entire 95% confidence interval of the observed to expected ratio lies above 1.

In [24]:
#hide_input
HTML(px.scatter(et,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

### Cerebral Venous Sinus Thrombosis

In [25]:
#hide
cvst = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="CVST")

In [26]:
#hide
cvst = cvst.drop(
    columns=["EEA Expected 14d","EEA Observed 14d From EV","EEA OE 14d with 95% c.i."])

In [27]:
#hide
cvst["oe_ci_interval_min"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [28]:
#hide
cvst["oe_ci_interval_max"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [29]:
#hide
cvst["oe"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="before_character", 
    keep_unmatched=False, 
    character="(")

In [30]:
#hide
cvst = cvst.drop(
    columns=["EEA+UK OE 14d with 95% c.i."])

In [31]:
#hide
cvst = cvst.astype(
    dtype={"IR per 100,000 Person years From ARS" : "float64", 
           "EEA+UK Expected 14d" : "float64", 
           "EEA+UK Observed 14d From EV" : "float64", 
           "oe_ci_interval_min" : "float64", 
           "oe_ci_interval_max" : "float64", 
           "oe" : "float64"})

For Cerebral Venous Sinus Thrombosis, we see a similar pattern as before, with OE > 1 for ages below 60.

In [32]:
#hide_input
HTML(px.bar(cvst,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA+UK Expected 14d","EEA+UK Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [33]:
#hide
cvst = cvst.query("(`Age group`=='20-29' or `Age group`=='30-49' or `Age group`=='50-59')", engine="python").copy()

In [34]:
#hide
cvst["error_above_oe"] = cvst["oe_ci_interval_max"].subtract(
    other=cvst["oe"])

In [35]:
#hide
cvst["error_below_oe"] = cvst["oe"].subtract(
    other=cvst["oe_ci_interval_min"])

In [36]:
#hide_input
HTML(px.scatter(cvst,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

## Conclusions

We can conclude that there is no evidence that the Astra Zeneca vaccine leads to adverse conditions. But it's important not to confuse absence of evidence with evidence of absence. What this means is that while we can't prove there is a link, we can't disprove it either, and the link may still exist. From a risk perspective, the risk is not zero.

But not taking a vaccine also carries a risk. The vaccine is objectively saving hundreds of thousands of lives at the moment, and that is why, once clinical trials have been passed, the burden of proof falls on proving it is not safe and not the other way around.

You can think of this logic as analogous to the presumption of innocence. While we may risk having guilty people running around in our society, the risk of a totalitarian state where this is not required is greater.
So for the moment, the Astra Zeneca Covid-19 vaccine is innocent until proven otherwise.
