<a href="https://colab.research.google.com/github/oughtinc/ergo/blob/notebooks-readme/notebooks/covid-19-metaculus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Install [Ergo](https://github.com/oughtinc/ergo) (our forecasting library) and a few tools we'll use in this colab:

In [None]:
!pip install --quiet poetry  # Fixes https://github.com/python-poetry/poetry/issues/532
!pip install --quiet git+https://github.com/oughtinc/ergo.git@6396f5ec4a73a18d36faa1651b1cd9ad852f916e
!pip install --quiet pendulum seaborn

In [None]:
%load_ext google.colab.data_table

In [None]:
import re
import ergo
import pendulum
import pandas
import seaborn

from ergo import logistic

from types import SimpleNamespace
from typing import List
from pendulum import DateTime
from matplotlib import pyplot

# Questions

Here are the Metaculus IDs for the questions we'll load, along with short names that let us associate each question with a variable in our model:

In [None]:
question_ids = [3704, 3712, 3713, 3711, 3722, 3761, 3705, 3706]
question_names = [
  "WHO Eastern Mediterranean Region on 2020/03/27",
  "WHO Region of the Americas on 2020/03/27",
  "WHO Western Pacific Region on 2020/03/27",
  "WHO South-East Asia Region on 2020/03/27",
  "South Korea on 2020/03/27",
  "United Kingdom on 2020/03/27",
  "WHO African Region on 2020/03/27",
  "WHO European Region on 2020/03/27"
]

We load the question data from Metaculus:

In [None]:
metaculus = ergo.Metaculus(username="ought", password="")
questions = [
  metaculus.get_question(question_id, name=name)
  for question_id, name in zip(question_ids, question_names)
]
ergo.MetaculusQuestion.to_dataframe(questions)

# Data

Our most important input is the number of confirmed cases (from Johns Hopkins):

In [None]:
confirmed_infections = ergo.data.covid19.ConfirmedInfections()

# Assumptions

Assumptions are quantities that should be inferred from data but are currently set by hand:

In [None]:
assumptions = SimpleNamespace()

We'll manually add some data about [doubling times](https://ourworldindata.org/coronavirus#the-growth-rate-of-covid-19-deaths) (in days):

In [None]:
assumptions.doubling_time = {
  "World": 8,
  "China": 33,
  "Italy": 4,
  "Iran": 5,
  "Spain": 3,
  "France": 4,
  "United States": 3,
  "United Kingdom": 3,
  "South Korea": 12,
  "Netherlands": 1,
  "Japan": 8,
  "Switzerland": 5,
  "Philippines": 4,
  "Belgium": 1,
  "San Marino": 3,
  "Germany": 5,
  "Iraq": 8,
  "Sweden": 3,
  "Canada": 2,
  "Algeria": 4,
  "Australia": 4,
  "Egypt": 2,
  "Greece": 5,
  "Indonesia": 7,
  "Poland": 5
}
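
A doubling time is easier to reason about once it is converted to a daily growth factor, which is what the model later in this notebook does with `2**(1 / doubling_time)`. The following is a minimal, self-contained sketch of that arithmetic. The helper names `daily_growth_factor` and `project` are illustrative and not part of Ergo, and the starting count of 1000 is a made-up input.

```python
def daily_growth_factor(doubling_time_days: float) -> float:
    """Factor by which cases multiply each day for a given doubling time."""
    return 2 ** (1 / doubling_time_days)


def project(cases: float, doubling_time_days: float, days: int) -> float:
    """Project a case count forward under constant exponential growth."""
    return cases * 2 ** (days / doubling_time_days)


# With a doubling time of 4 days (the value listed for Italy above),
# cases grow by roughly 19% per day.
factor = daily_growth_factor(4)
print(f"{factor:.3f}")  # 1.189

# Eight days is two doublings, so a hypothetical 1000 cases become 4000.
print(project(1000, 4, 8))  # 4000.0
```

Applying the daily factor once per day for `days` days gives the same result as the closed form in `project`, up to floating-point error.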

To estimate doubling times for places where we don't have data, we'll specify which places are similar to which other places:

In [None]:
assumptions.similar_areas = {
  "Bay Area": ["United States", "Italy"],
  "San Francisco": ["United States", "Italy"],
  "WHO Eastern Mediterranean Region": ["Iraq", "Iran"],
  "WHO Region of the Americas": ["United States"],
  "WHO Western Pacific Region": ["United States", "Italy", "Spain", "South Korea"],
  "WHO South-East Asia Region": ["South Korea", "China", "Japan", "Philippines", "Indonesia"],
  "South Korea": ["South Korea"],
  "United Kingdom": ["United Kingdom"],
  "WHO African Region": ["Algeria"],
  "WHO European Region": ["Italy", "Spain", "France", "Germany", "Greece"],  # "Belgium",
}
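
To make the proxy idea concrete, here is a small sketch of how the two tables combine: each area maps to a list of proxies, and each proxy maps to a doubling time. It uses only the standard library and a two-area excerpt of the values above. The function names are hypothetical, and `random.Random.choice` stands in for whatever sampling the model uses.

```python
import random

# Excerpt of the assumptions above, copied by hand for illustration
doubling_time = {"Iraq": 8, "Iran": 5, "Algeria": 4}
similar_areas = {
    "WHO Eastern Mediterranean Region": ["Iraq", "Iran"],
    "WHO African Region": ["Algeria"],
}


def candidate_doubling_times(area: str) -> list:
    """Doubling times of every proxy listed for an area."""
    return [doubling_time[proxy] for proxy in similar_areas[area]]


def missing_proxies() -> set:
    """Proxies named in similar_areas that have no doubling time entry."""
    named = {proxy for proxies in similar_areas.values() for proxy in proxies}
    return named - set(doubling_time)


rng = random.Random(0)  # seeded so repeated runs give the same draw
candidates = candidate_doubling_times("WHO Eastern Mediterranean Region")
print(candidates)  # [8, 5]
print(rng.choice(candidates) in candidates)  # True
print(missing_proxies())  # set()
```

A check like `missing_proxies` is worth running whenever either table is edited, since a proxy without a doubling time would only fail later, inside the model.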

# Model

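The main model samples a doubling time for each area, then projects confirmed cases forward one day at a time, tagging each day's value with a name that matches the question names: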

In [None]:
Area = str

def get_doubling_time(area: Area):
  # Pick at random among the doubling times of the area's proxies
  similar_areas = assumptions.similar_areas[area]
  doubling_times = [assumptions.doubling_time[proxy] for proxy in similar_areas]
  proxy_doubling_time = ergo.random_choice(doubling_times)
  # Add uncertainty around the proxy value (half a day on either side)
  doubling_time = ergo.lognormal_from_interval(proxy_doubling_time - 0.5, proxy_doubling_time + 0.5)
  return doubling_time

def model(start: DateTime, end: DateTime, areas: List[Area]):
  for area in areas:
    doubling_time = get_doubling_time(area)
    confirmed = confirmed_infections(area, start)
    for i in range((end - start).days):
      # Grow by one day, then tag the value with the day it refers to
      date = start.add(days=i + 1)
      confirmed = confirmed * 2**(1 / doubling_time)
      ergo.tag(confirmed, f"{area} on {date.format('YYYY/MM/DD')}")
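
`ergo.lognormal_from_interval` turns the proxy doubling time into a distribution over doubling times. As a rough illustration only, the sketch below assumes the two endpoints are read as a central 90% interval of a lognormal. That convention is an assumption made here, not something confirmed from Ergo's source, and `lognormal_from_interval_sketch` is a hypothetical stand-in built on the standard library.

```python
import math
import random

Z_95 = 1.6448536269514722  # standard normal quantile at 0.95


def lognormal_from_interval_sketch(low: float, high: float, rng: random.Random) -> float:
    """Draw from a lognormal whose 5% and 95% quantiles are low and high.

    Assumption: the interval is a central 90% interval. Ergo may use a
    different convention.
    """
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
    return rng.lognormvariate(mu, sigma)


rng = random.Random(0)
# A proxy doubling time of 4 days, widened by half a day on each side
draws = [lognormal_from_interval_sketch(3.5, 4.5, rng) for _ in range(10_000)]
inside = sum(3.5 <= d <= 4.5 for d in draws) / len(draws)
print(f"share of draws inside the interval: {inside:.2f}")
```

Under this assumption about 90% of draws should land inside the interval, and every draw is positive, which is the property that makes a lognormal a natural choice for a doubling time.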

Run the model:

In [None]:
# Model parameters
start_date = pendulum.now(tz="US/Pacific").subtract(days=10)
end_date = max(question.resolve_time for question in questions).add(days=3)
areas = [re.match(r"(.*) on", name).group(1) for name in question_names]

# Get samples from model for all variables
samples = ergo.run(lambda: model(start_date, end_date, areas), num_samples=5000)
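
The list of areas is recovered from the question names by keeping everything before the trailing ` on <date>`. Here is a standalone sketch of that step with an explicit failure for names that don't fit the pattern. The helper name `area_from_question_name` is illustrative.

```python
import re


def area_from_question_name(name: str) -> str:
    """Return the text before the final ' on' in a question name."""
    match = re.match(r"(.*) on", name)
    if match is None:
        raise ValueError(f"Unexpected question name: {name!r}")
    return match.group(1)


print(area_from_question_name("WHO Region of the Americas on 2020/03/27"))
# WHO Region of the Americas
print(area_from_question_name("South Korea on 2020/03/27"))
# South Korea
```

Because `.*` is greedy, the match extends to the last ` on` in the string, so an area name that itself contained ` on` would still be captured whole.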

# Analysis

Look at raw samples from the model:

In [None]:
samples

Summary stats:

In [None]:
samples.describe()

Plot some marginals:

In [None]:
for question in questions:
  pyplot.figure()
  seaborn.distplot(samples[question.name])

# Submit predictions

Convert samples to Metaculus distributions and submit:

In [None]:
for question in questions:
  if question.name in samples:
    params = question.submit_from_samples(samples[question.name])
    print(f"Submitted Logistic{params} for {question.name}")
  else:
    print(f"No predictions for {question.name}")