# The Astra Zeneca Covid Vaccine, innocent until proven guilty
> A visual analysis of the EMA report.

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]
- image: images/chart-preview.png

In [1]:
#hide
import pandas as pd
from fastdata.integrations import *
from fastdata.core import *
import plotly.express as px
from IPython.display import HTML

## Goal

This analysis aims to make it easier for people from a non-medical background to understand the risks of the Astra Zeneca vaccine using data & visualizations. 

It does not aim to provide new analysis or insights, but rather to communicate the conclusions of the latest report of the European Medicines Agency (EMA), dated April 7th (the report can be downloaded [here](https://www.ema.europa.eu/en/documents/prac-recommendation/signal-assessment-report-embolic-thrombotic-events-smq-covid-19-vaccine-chadox1-s-recombinant_en.pdf)).

> Warning: The author is not an expert in the field and applies some general statistical thinking to the problem. Therefore, it may contain errors, omissions, or otherwise not accurate information.

## Methodology

### Introduction

The EMA report analyzes various aspects and risks of the vaccine. In this article, we focus on the analysis performed using EudraVigilance, the main European database that tracks suspected adverse reactions to medicines, for the following three conditions:
- Disseminated intravascular coagulation (DIC)
- Cerebral Venous Sinus Thrombosis (CVST)
- Coagulation disorders (Embolic and thrombotic events)

### Expected to observed analysis

This type of analysis, used in the report, compares how many cases of a given condition have been observed (# observed) with the number of cases one would expect based on the incidence, i.e., the historical rate of cases (# expected). It is defined as a ratio in the following way: `# of observed cases / # of expected cases`. If it is larger than 1, you are getting more cases than you "theoretically should."

**But the statistical uncertainty is driven by the observed number of cases, which is usually small** (rare events). To deal with this uncertainty around the total number of cases observed over the risk period of interest, a 95% confidence interval (often written as `95%CI`) is reported alongside the ratio (more on this later).
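
As a rough sketch of how such a ratio and its interval can be computed, the snippet below treats the observed count as a Poisson variable and derives an exact interval by bisection. This is an assumption about the method, not a statement of what the EMA report does, and the function names are made up for illustration. The inputs 4 and 1.99 are the observed and expected DIC cases for ages 30-49 shown later in this article.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def mean_with_cdf(j, p, hi):
    """Find mu with P(X <= j | mu) = p by bisection (the CDF falls as mu grows)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(j, mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def oe_with_ci(observed, expected, alpha=0.05):
    """Return (ratio, lower, upper), with an exact Poisson interval on the observed count."""
    search_limit = 2 * observed + 50  # generous upper bracket, fine for small counts
    if observed == 0:
        lower = 0.0
    else:
        lower = mean_with_cdf(observed - 1, 1 - alpha / 2, search_limit)
    upper = mean_with_cdf(observed, alpha / 2, search_limit)
    return observed / expected, lower / expected, upper / expected


ratio, low, high = oe_with_ci(observed=4, expected=1.99)
print(f"OE = {ratio:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

The result lands close to the interval quoted in the report for that age group; small differences may come from rounding of the expected count or from a different interval method.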

## OE Analysis of potential side-effects

### Disseminated intravascular coagulation (DIC)

**Definition:** Disseminated intravascular coagulation (DIC) is a rare but serious condition that causes abnormal blood clotting throughout the body’s blood vessels. It is caused by another disease or condition, such as an infection or injury, that makes the body’s normal blood clotting process become overactive. [Source](https://www.nhlbi.nih.gov/health-topics/disseminated-intravascular-coagulation)

For those of us that are not medical experts, this diagram helps us understand the condition: it shows a thrombus (blood clot) that has blocked a blood vessel valve.

![](https://upload.wikimedia.org/wikipedia/commons/c/c5/Blood_clot_diagram.png)

In [2]:
#hide
dic = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="DIC")

In [3]:
#hide
dic = dic.drop(
    columns=["EEA Expected 14d","EEA Observed 14d From EV","EEA OE 14d with 95% c.i."])

In [4]:
#hide
dic["oe_ci_interval_min"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [5]:
#hide
dic["oe_ci_interval_max"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [6]:
#hide
dic["oe"] = dic["EEA+UK  OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="keep_before_character", 
    keep_unmatched=False, 
    character="(")

In [7]:
#hide
dic = dic.drop(
    columns=["EEA+UK  OE 14d with 95% c.i."])

In [8]:
#hide
dic = dic.astype(
    dtype={"IR per 100,000 Person years From FISABIO" : "float64", "EEA+UK Expected 14d" : "float64", "EEA+UK  Observed 14d From EV" : "float64", "oe_ci_interval_min" : "float64", "oe_ci_interval_max" : "float64", "oe" : "float64"})

To perform the OE analysis, we need to compute how many people are "expected" to get the condition based on the incidence rate (in the DIC case, using FISABIO data from Spain) and the number of people who have taken the vaccine.
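
A minimal sketch of that calculation, assuming every vaccinated person is followed for the full 14-day risk window used in the report's tables. The function name and the figure of one million vaccinated people are hypothetical; only the incidence rate of 1.09 per 100,000 person-years (FISABIO, ages 30-49) comes from the data in this article.

```python
def expected_cases(ir_per_100k_person_years, persons, risk_window_days=14):
    """Expected background cases among `persons` people followed for the risk window."""
    # Convert the follow-up time of the whole group into person-years.
    person_years = persons * risk_window_days / 365.25
    return ir_per_100k_person_years * person_years / 100_000


# 1.09 is the FISABIO incidence rate for ages 30-49 quoted in this article;
# one million vaccinated people is a made-up figure for illustration only.
print(round(expected_cases(1.09, persons=1_000_000), 2))
```

Even with a million people, a rare condition yields less than one expected case in two weeks, which is why a handful of observed cases can move the ratio so much.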

When performing the OE analysis, we see that for age groups above 50, there are more expected cases than observed cases (blue bar is larger than red bar), **so the data show no sign of an increased risk in those groups**.

In [9]:
#hide_input
HTML(px.bar(dic,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA+UK Expected 14d","EEA+UK  Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [10]:
#hide
dic = dic.query("(`Age group`=='20-29' or `Age group`=='30-49')", engine="python").copy()

In [11]:
#hide
dic["error_above_oe"] = dic["oe_ci_interval_max"].subtract(
    other=dic["oe"])

In [12]:
#hide
dic["error_below_oe"] = dic["oe"].subtract(
    other=dic["oe_ci_interval_min"])

But for ages below 50, there are more observed cases than expected. Let's run through one example to understand this better:

In the case of 30-49, we expect 2 cases but get 4. This means the OE ratio is about 2, i.e., we get roughly twice as many instances of DIC as we expect. But the 95% confidence interval (c.i.) runs from 0.54 to 5.16: with so few cases, the data are compatible with a true ratio anywhere in that range, including 1 (no increase).

In [55]:
#hide_input
def custom_style(row):
    color = 'transparent'
    if row['Age group'] == '30-49':
        color = 'lightyellow'

    return ['background-color: %s' % color]*len(row.values)

gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="DIC")[['Age group','EEA+UK Expected 14d','EEA+UK  Observed 14d From EV','EEA+UK  OE 14d with 95% c.i.']].style.apply(custom_style, axis=1)

| Age group | EEA+UK Expected 14d | EEA+UK Observed 14d From EV | EEA+UK OE 14d with 95% c.i. |
|---|---|---|---|
| 20-29 | 0.04 | 1 | 23.26 (0.30 - 129.41) |
| 30-49 | 1.99 | 4 | 2.02 (0.54 - 5.16) |
| 50-59 | 4.38 | 1 | 0.23 (0.00 - 1.27) |
| 60-69 | 9.24 | 1 | 0.11 (0.00 - 0.60) |
| 70-79 | 11.3 | 0 | 0.00 (0.00 - 0.32) |
| 80+ | 5.37 | 0 | 0.00 (0.00 - 0.68) |
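
Each cell in the last column packs three numbers into one string. The hidden cells of this notebook split them with `fastdata` helpers; the sketch below shows the same idea with only the standard library, as an illustration rather than the notebook's actual code. The function name and the dictionary keys are chosen here for the example.

```python
import re

# Pattern for cells such as "2.02 (0.54 - 5.16)": ratio, then lower and upper bound.
OE_CELL = re.compile(
    r"\s*(\d+[.,]\d+)\s*\(\s*(\d+[.,]\d+)\s*-\s*(\d+[.,]\d+)\s*\)\s*"
)


def parse_oe_cell(text):
    """Split an 'OE (low - high)' cell into floats plus the two error-bar lengths."""
    match = OE_CELL.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected cell format: {text!r}")
    # Accept both decimal points and decimal commas.
    oe, low, high = (float(part.replace(",", ".")) for part in match.groups())
    return {
        "oe": oe,
        "ci_min": low,
        "ci_max": high,
        "error_below_oe": oe - low,   # length of the bar to the left of the point
        "error_above_oe": high - oe,  # length of the bar to the right of the point
    }


print(parse_oe_cell("2.02 (0.54 - 5.16)"))
```

The two error lengths differ because the interval is not symmetric around the ratio, which is why the plot needs separate values for each side.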


We can also visualize the confidence interval to see where our OE ratio falls:

In [13]:
#hide_input
HTML(px.scatter(dic,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

Although the OE ratio is 2.02, its confidence interval (0.54 to 5.16) includes 1, so we can't confidently conclude that the vaccine leads to more DIC cases. If the lower bound of the interval were above 1, it would be very unlikely to see that many cases by chance alone, and thus worrisome.

To understand this, you can think of flipping a coin. On average, you get 50% heads, but within 100 coin flips, you could get more heads or more tails. The question we ask is similar to trying to understand if a coin has been tampered with based on the number of heads you obtained in 100 coin flips. If the number of heads falls outside the range that a fair coin produces 95% of the time, you would know it is very unlikely to get that result by chance, and thus there is a strong signal that the coin has been tampered with.
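
To put numbers on the coin analogy, here is a small sketch that computes which head counts a fair coin produces about 95% of the time in 100 flips, using the exact binomial distribution. The function name and the equal-tail convention (2.5% cut from each side) are choices made for this example.

```python
from math import comb


def central_range(n=100, p=0.5, coverage=0.95):
    """Head counts bounding the central range of a binomial(n, p) distribution.

    `low` is the first count at which the cumulative probability reaches the
    lower tail, `high` the first count at which it reaches 1 minus that tail.
    """
    tail = (1 - coverage) / 2
    cumulative = 0.0
    low = high = None
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        if low is None and cumulative >= tail:
            low = k
        if high is None and cumulative >= 1 - tail:
            high = k
    return low, high


print(central_range())  # (40, 60)
```

So with a fair coin, seeing 61 or more heads (or 39 or fewer) in 100 flips would be unusual enough to raise suspicion, while 55 heads would not.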

If we look at the age range of 20-29, we also see that the OE ratio is >1, but again the confidence interval includes 1. Here it is also worth noticing how large the confidence interval is. The reason is that the incidence is so low that just one or two cases can skew the results, so you need many more cases to be confident.

Something worth discussing is that many of the patients who are getting the vaccinations are probably not the most healthy (if we assume that some rational prioritization is taking place). Given this, it is fair to assume the incidence may be higher in this group than in the general population.

Unfortunately, it seems that with the given data, we can't control for that.
And this brings us to one of the main conclusions of this analysis: **We don't have enough good health data to answer these questions.** For example, it is not easy to get the incidence rate broken down by male/female or by pre-existing conditions and compare it with the observed cases.

### Cerebral Venous Sinus Thrombosis

**Definition:** Cerebral venous sinus thrombosis (CVST) occurs when a blood clot forms in the brain’s venous sinuses. This prevents blood from draining out of the brain. As a result, blood cells may break and leak blood into the brain tissues, forming a hemorrhage. [Source](https://www.hopkinsmedicine.org/health/conditions-and-diseases/cerebral-venous-sinus-thrombosis)

In [14]:
#hide
cvst = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="CVST")

In [15]:
#hide
cvst = cvst.drop(
    columns=["EEA Expected 14d","EEA Observed 14d From EV","EEA OE 14d with 95% c.i."])

In [16]:
#hide
cvst["oe_ci_interval_min"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [17]:
#hide
cvst["oe_ci_interval_max"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [18]:
#hide
cvst["oe"] = cvst["EEA+UK OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="keep_before_character", 
    keep_unmatched=False, 
    character="(")

In [19]:
#hide
cvst = cvst.drop(
    columns=["EEA+UK OE 14d with 95% c.i."])

In [20]:
#hide
cvst = cvst.astype(
    dtype={"IR per 100,000 Person years From ARS" : "float64", 
           "EEA+UK Expected 14d" : "float64", 
           "EEA+UK Observed 14d From EV" : "float64", 
           "oe_ci_interval_min" : "float64", 
           "oe_ci_interval_max" : "float64", 
           "oe" : "float64"})

For Cerebral Venous Sinus Thrombosis, we see a similar pattern as before, with OE > 1 for ages below 50.

In [21]:
#hide_input
HTML(px.bar(cvst,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA+UK Expected 14d","EEA+UK Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [22]:
#hide
cvst = cvst.query("(`Age group`=='20-29' or `Age group`=='30-49' or `Age group`=='50-59')", engine="python").copy()

In [23]:
#hide
cvst["error_above_oe"] = cvst["oe_ci_interval_max"].subtract(
    other=cvst["oe"])

In [24]:
#hide
cvst["error_below_oe"] = cvst["oe"].subtract(
    other=cvst["oe_ci_interval_min"])

And as was the case before, we see that the OE ratio for ages under 50 lies well within the 95% confidence interval.

In [25]:
#hide_input
HTML(px.scatter(cvst,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

### Coagulation disorders (Embolic and thrombotic events)

Coagulation disorders are disruptions in the body’s ability to control blood clotting. Coagulation disorders can result in either a hemorrhage (too little clotting that causes an increased risk of bleeding) or thrombosis (too much clotting that causes blood clots to obstruct blood flow). [Source](https://www.rileychildrens.org/health-info/coagulation-disorders)

In [26]:
#hide
et = gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=2, 
    sheet="ET")

In [27]:
#hide
et["oe_ci_interval_min"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"(\d+?[,.]\d+) - \d+?[,.]\d+")

In [28]:
#hide
et["oe_ci_interval_max"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="custom", 
    keep_unmatched=False, 
    regex=r"\d+?[,.]\d+ - (\d+?[,.]\d+)")

In [29]:
#hide
et["oe"] = et["EEA OE 14d with 95% c.i."].fdt.clean_text_column(
    mode="keep_before_character", 
    keep_unmatched=False, 
    character="(")

In [30]:
#hide
et = et.drop(
    columns=["EEA OE 14d with 95% c.i."])

In [31]:
#hide
et = et.astype(
    dtype={
           "EEA Expected 14d" : "float64", 
           "EEA Observed 14d From EV" : "float64", 
           "oe_ci_interval_min" : "float64", 
           "oe_ci_interval_max" : "float64", 
           "oe" : "float64"})

When looking at all embolic and thrombotic events, we have the same situation as before where OE > 1 for age < 50.

In [32]:
#hide_input
HTML(px.bar(et,
    title="Expected vs. observed cases", 
    barmode="group", 
    template="seaborn", 
    x=["EEA Expected 14d","EEA Observed 14d From EV"], 
    y="Age group").to_html(include_plotlyjs='cdn'))

In [33]:
#hide
et = et.query("(`Age group`=='20-29' or `Age group`=='30-49')", engine="python").copy()

In [34]:
#hide
et["error_above_oe"] = et["oe_ci_interval_max"].subtract(
    other=et["oe"])

In [35]:
#hide
et["error_below_oe"] = et["oe"].subtract(
    other=et["oe_ci_interval_min"])

And as was the case before, we see that the OE ratio for ages under 50 lies within the 95% confidence interval.

In [36]:
#hide_input
HTML(px.scatter(et,
    title="Observed to expected ratio vs. confidence interval",
    y="Age group", 
    x="oe", 
    labels={"oe":"observed/expected"},
    template="seaborn", 
    error_x="error_above_oe", 
    error_x_minus="error_below_oe").to_html(include_plotlyjs='cdn'))

# Conclusions

We can conclude that there is no strong evidence that the Astra Zeneca vaccine leads to adverse conditions, despite the fact that we are getting OE > 1 for populations under 50. But it's important not to confuse absence of evidence with evidence of absence. What this means is that while we can't prove there is a link, we can't disprove it either, and the link may still exist. From a risk perspective, the risk is not zero.

But not taking a vaccine also has a risk: a quarter of people who end up in intensive care with Covid have some form of clot resulting from the virus ([Source](https://www.bbc.com/news/explainers-56665396)). The vaccine is objectively saving hundreds of thousands of lives at the moment, and that is why, once clinical trials have been passed, the burden of proof falls on showing that it is not safe, and not the other way around.

You can think of this logic as analogous to the presumption of innocence. While we may risk having guilty people running around in our society, the risk of a totalitarian state where this is not required is greater.
So for the moment, the Astra Zeneca Covid-19 vaccine is innocent until proven otherwise.


# Data sources

A key input for the analysis is the **incidence rate** of the side effect. It is required to determine the expected cases. It is also important to have data stratified by groups, to understand the effects for individual subgroups and not just the overall population. That is because the OE ratio can be <1 for the overall population but well above 1 for individual subgroups who are more at risk (e.g., young people, people with certain pre-existing conditions, etc.).
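
The DIC figures from this article illustrate the point. The sketch below pools the expected and observed cases across all age groups and compares the overall ratio with the per-group ratios. The numbers are copied from the DIC table above; the function and variable names are chosen for this example.

```python
# Expected and observed DIC cases (EEA+UK, 14 days) per age group,
# copied from the DIC table in this article.
DIC_CASES = {
    "20-29": (0.04, 1),
    "30-49": (1.99, 4),
    "50-59": (4.38, 1),
    "60-69": (9.24, 1),
    "70-79": (11.3, 0),
    "80+": (5.37, 0),
}


def overall_and_flagged(cases):
    """Return the pooled OE ratio and the groups whose own ratio exceeds 1."""
    total_expected = sum(expected for expected, _ in cases.values())
    total_observed = sum(observed for _, observed in cases.values())
    flagged = [
        group
        for group, (expected, observed) in cases.items()
        if observed / expected > 1
    ]
    return total_observed / total_expected, flagged


overall, flagged = overall_and_flagged(DIC_CASES)
print(round(overall, 2), flagged)  # 0.22 ['20-29', '30-49']
```

Pooled over all ages, the ratio is far below 1, yet the two youngest groups each have a ratio above 1, which is exactly what an unstratified analysis would hide.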

The report uses incidence data from 3 different data sources:
- Disseminated intravascular coagulation (DIC): FISABIO from Spain
- Cerebral venous sinus thrombosis (CVST): ARS from Italy 
- Coagulation disorder (Embolic and thrombotic events): ARS from Italy

>Note: The incidence rate is a measure of the frequency with which a disease or other incident occurs over a specified time period. [Source](https://en.wikipedia.org/wiki/Incidence_(epidemiology))

## Incidence rate for Cerebral venous sinus thrombosis

In [37]:
#hide_input
gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=1, 
    sheet="ARS_IT_CVST")

| Age category | IR (per 100k person years) |
|---|---|
| 20-29 | 0.64 |
| 30-49 | 1.8 |
| 50-59 | 1.0 |
| 60-69 | 1.29 |
| 70-79 | 1.91 |
| 80+ | 1.55 |


## Incidence rate for Disseminated intravascular coagulation

In [38]:
#hide_input
gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=1, 
    sheet="FISABIO_SP_DIC")

| Age category | IR (per 100k person years) |
|---|---|
| 20-29 | 0.6 |
| 30-49 | 1.09 |
| 50-59 | 3.07 |
| 60-69 | 4.67 |
| 70-79 | 8.37 |
| 80+ | 11.66 |


## Incidence rate for Coagulation disorder

In [39]:
#hide_input
gsheet_to_df(
    url="https://docs.google.com/spreadsheets/d/11yJ8GbArmcazWG8UdsSWD2gY2VIVD7l_zaPOdiF2ePY", 
    start_row=1, 
    sheet="ARS_IT_CT")

| Age category | IR (per 100k person years) |
|---|---|
| 20-29 | 40.14 |
| 30-49 | 85.08 |
| 50-59 | 200.73 |
| 60-69 | 427.56 |
| 70-79 | 912.0 |
| 80+ | 2055.95 |
