# Data Literacy
#### University of Tübingen, Winter Term 2021/22
## Exercise Sheet 2
&copy; 2021 Prof. Dr. Philipp Hennig & Jonathan Wenger

This sheet is **due on Monday, November 8, 2021 at 10am sharp (i.e. before the start of the lecture).**

---

## Randomized Testing

This week we will take a shallow dive into experimental design. We will again work with the RKI data on COVID-19 infections in Germany. Our aim is to design a randomized study to determine the rate of COVID-19 cases in Germany.

In [None]:
# Make inline plots vector graphics
%matplotlib inline
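# `IPython.display.set_matplotlib_formats` is deprecated; use the `matplotlib_inline` package instead
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("pdf", "svg")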

# Plotting setup
import matplotlib.pyplot as plt

# Package imports
import numpy as np
import pandas as pd
import ssl
# Note: this disables HTTPS certificate verification for all downloads in this session
ssl._create_default_https_context = ssl._create_unverified_context
import geopandas

### COVID-19: Relative Incidence in Germany

We will begin by computing the relative incidence (new cases normalized by population size) on a county (Landkreis) level for Germany.

**Task:** Load the most recent data from the RKI and find the cumulative cases per county (Landkreis) over time.

> #### Data Description of the RKI Covid-19-Dashboard (https://corona.rki.de)
>
> The data has the following features:
> - ...
> - Landkreis: Name of the county
> - ...
> - AnzahlFall: Number of cases in the respective population group.
> - ...
> - NeuerFall:
>    - 0: Case is contained in the data of today and the previous day
>    - 1: Case is only contained in today's data
>    - -1: Case is only contained in the previous day's data

Source (in German): https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74
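
As a rough illustration of the data description above, the following sketch computes cumulative case numbers per county on a tiny made-up table. It is not the reference solution. The date column name `Meldedatum` and the output column names `cases` and `cases_cumulative` are assumptions made for this example, and excluding rows with `NeuerFall == -1` is one plausible reading of the flag (those cases are only contained in the previous day's data).

```python
import pandas as pd

# Toy stand-in for the RKI table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "IdLandkreis": [1001, 1001, 1001, 1002, 1002],
        "Landkreis": ["SK A", "SK A", "SK A", "LK B", "LK B"],
        "Meldedatum": ["2021-11-01", "2021-11-01", "2021-11-02", "2021-11-01", "2021-11-03"],  # assumed column name
        "AnzahlFall": [3, 2, 4, 1, 5],
        "NeuerFall": [0, 0, 1, 0, -1],
    }
)

# Keep only cases that are part of today's data (NeuerFall is 0 or 1).
valid = toy[toy["NeuerFall"].isin([0, 1])]

# Sum the cases per county and day, then sort by county and date.
daily = (
    valid.rename(
        columns={
            "IdLandkreis": "id_county",
            "Landkreis": "name_county",
            "Meldedatum": "date",
            "AnzahlFall": "cases",
        }
    )
    .assign(date=lambda d: pd.to_datetime(d["date"]))
    .groupby(["id_county", "name_county", "date"], as_index=False)["cases"]
    .sum()
    .sort_values(["id_county", "date"])
)

# Running total of cases within each county.
daily["cases_cumulative"] = daily.groupby("id_county")["cases"].cumsum()
print(daily)
```

Grouping before taking the cumulative sum matters here, because the raw table contains several rows per county and day (one per population group).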

In [None]:
# Link to current data of the RKI
url = "https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data"

# Read CSV data from URL
data_rki = pd.read_csv(url)

In [None]:
# Create new dataframe and sort by date

In [None]:
# Cumulative case numbers over time

Our aim is to visualize the relative incidence as a colored map of Germany. For this we will use the package `geopandas`.

**Task:** Load the provided shapefile `data/Kreisgrenzen_2017_mit_Einwohnerzahl.shp`. Geopandas will return a dataframe that contains population numbers ("EWZ") and a column called "geometry" which defines the polygons making up the map of counties.

In [None]:
import geopandas

# Geometric data and population numbers
germany_geo_df = geopandas.read_file("data/Kreisgrenzen_2017_mit_Einwohnerzahl.shp")
germany_geo_df.head()

In [None]:
# County IDs not in geometric data
county_ids_rki = data_rki.IdLandkreis.unique()
county_ids_geo = germany_geo_df.Kennziffer.unique()

# Find IDs only in one of the two county ID sets
unmatched_ids = np.setxor1d(county_ids_rki, county_ids_geo)
print(f"County IDs without a match in the other data set: \n{unmatched_ids}")
print(
    f"Counties in the RKI data with unmatched IDs: \n{data_rki[data_rki.IdLandkreis.isin(unmatched_ids)].Landkreis.unique()}"
)

In [None]:
# Aggregate the rows with unmatched IDs (the Berlin districts) into a single county "Berlin" in a temporary data frame
data_rki_berlin = (
    data_rki_cases[data_rki_cases.id_county.isin(unmatched_ids)].groupby(["date"]).sum()
).reset_index()
data_rki_berlin.loc[:, "id_county"] = 11000
data_rki_berlin.loc[:, "name_county"] = "Berlin"

In [None]:
data_rki_berlin

In [None]:
# Drop Berlin rows from RKI data and append merged case numbers
data_rki_cases.drop(
    data_rki_cases.index[np.where(data_rki_cases.id_county.isin(unmatched_ids))[0]],
    inplace=True,
)
# `DataFrame.append` was removed in pandas 2.0; `pd.concat` is the replacement
data_rki_cases = pd.concat([data_rki_cases, data_rki_berlin])

**Task:** Create a joint dataframe with an additional column that contains the relative incidence (new cases of COVID-19 divided by county population). Which are the top five and bottom five counties in terms of relative incidence for the current day?
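
To make the task concrete, here is a minimal sketch on made-up numbers, assuming the case data has an `id_county` column and the population data has `Kennziffer` and `EWZ` columns as in the cells above. The column names `new_cases` and `rel_incidence` are hypothetical and only used for this example.

```python
import pandas as pd

# Made-up new case counts for three fictitious counties.
cases = pd.DataFrame(
    {
        "id_county": [1001, 1002, 1003],
        "new_cases": [50, 5, 20],  # hypothetical column name
    }
)

# Made-up population numbers, keyed like the shapefile ("Kennziffer", "EWZ").
population = pd.DataFrame(
    {
        "Kennziffer": [1001, 1002, 1003],
        "EWZ": [100_000, 50_000, 20_000],
    }
)

# Join the two tables on the county ID.
merged = cases.merge(population, left_on="id_county", right_on="Kennziffer", how="inner")

# Relative incidence: new cases divided by county population.
merged["rel_incidence"] = merged["new_cases"] / merged["EWZ"]

# Counties with the highest and lowest relative incidence.
top = merged.nlargest(2, "rel_incidence")
bottom = merged.nsmallest(2, "rel_incidence")
print(top[["id_county", "rel_incidence"]])
print(bottom[["id_county", "rel_incidence"]])
```

Note that the county with the most new cases (1001) is not the one with the highest relative incidence (1003), which is the reason for normalizing by population.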

In [None]:
# Merge into single data frame


In [None]:
# Compute relative incidence

# Compute relative cumulative case numbers


In [None]:
# Case numbers for most recent date with >0 new cases


# Top and bottom 5 counties in terms of relative cumulative incidence for today


**Task:** Using `geopandas` and the dataframe you created, plot Germany's counties, color-coded by their current relative incidence. Where is the relative incidence currently highest? What might be the causes for this result? What type of colormap is appropriate for this visualization and why?

*Hint:* To use the native plotting functionality of `geopandas` convert the data frame you just created into a `GeoDataFrame`.

In [None]:
# Plot map


### Designing a Testing Strategy

Suppose you are in charge of estimating the relative incidence in Germany on a national level. Each day you have a budget of tests to distribute, and this budget varies from day to day. However, you do _not_ know the total number of tests available at the start of the day. Instead, as the day progresses, you are informed about new test capacity in batches. You have to distribute each batch immediately as it becomes available. To do so, after receiving a new batch of tests, you can ask a designated contact in any county to test a certain number of randomly selected people in that county.

How would you distribute the tests arriving in batches to estimate the relative incidence in Germany without introducing (sampling) bias?
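
To see what sampling bias means in this setting, consider the following toy calculation with two fictitious counties. All numbers are made up, and the helper `expected_pooled_rate` is a hypothetical name introduced for this example. It computes the positive rate we would expect when pooling all tests, given how the tests are split across counties.

```python
import numpy as np

# Two fictitious counties: a large one with low incidence, a small one with high incidence.
population = np.array([900_000, 100_000])
incidence = np.array([0.01, 0.05])  # made-up fraction of infected people per county

# True national rate: infected people divided by total population.
true_rate = (population * incidence).sum() / population.sum()


def expected_pooled_rate(allocation):
    """Expected positive rate over all tests for a given split of tests across counties."""
    allocation = np.asarray(allocation, dtype=float)
    share = allocation / allocation.sum()  # fraction of tests going to each county
    return float((share * incidence).sum())


equal_split = expected_pooled_rate([1, 1])  # same number of tests per county
proportional_split = expected_pooled_rate(population)  # tests proportional to population

print(f"true national rate:  {true_rate:.3f}")
print(f"equal split:         {equal_split:.3f}")
print(f"proportional split:  {proportional_split:.3f}")
```

In this example the equal split overweights the small county and the pooled estimate is more than twice the true rate, whereas the population-proportional split recovers it in expectation.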

**Task:** Implement an algorithm to sample from a categorical distribution over arbitrary categories given a vector of probability weights and a function returning uniform random samples on the unit interval. That is, an algorithm which draws with replacement from a fixed number of categories according to a set of weights.

*Note:* Any other sampling functionality from `numpy` or `scipy` beyond `np.random.uniform` should not be used!
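
One possible starting point (not necessarily the intended solution) is inverse transform sampling: normalize the weights, accumulate them into a cumulative distribution, and look up in which interval a uniform draw falls. The sketch below shows only this lookup step for three fixed numbers that stand in for draws from `np.random.uniform`. It does not handle the `size` argument, and the variable names are chosen for this example.

```python
import numpy as np

categ = np.array(["a", "b", "c"])
weights = np.array([1.0, 4.0, 6.0])

# Cumulative distribution over the categories: approximately [1/11, 5/11, 1].
cdf = np.cumsum(weights / weights.sum())

# Fixed stand-ins for uniform draws on [0, 1); in the exercise these would
# come from np.random.uniform.
u = np.array([0.05, 0.30, 0.90])

# Index of the first interval whose upper edge exceeds u.
idx = np.searchsorted(cdf, u, side="right")

# Guard against rounding error in the last entry of the cumulative sum.
idx = np.minimum(idx, len(categ) - 1)

picked = categ[idx]
print(picked)
```

A category with weight $w_i$ owns an interval of length $w_i / \sum_j w_j$ on the unit interval, so a uniform draw lands in it with exactly that probability.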

In [None]:
def sample_categorical(categ, p, size=()):
    """
    Sample from a categorical distribution.

    Parameters
    ----------
    categ : array-like, shape=(n,)
        Categories to sample from.
    p : array-like, shape=(n,)
        Probability weights of drawing from the different categories.
    size : tuple
        Size of the sample.
    """
    raise NotImplementedError # TODO

In [None]:
sample_categorical(categ=["a", "b", "c"], p=[1, 4, 6], size=(4, 5))

**Task:** Using the above sampling algorithm design a testing strategy which allocates a newly received batch of tests across the different counties at any time of the day. 

In [None]:
def testing_strategy(n_tests, counties, population):
    """
    Testing strategy for COVID-19 on a county level.

    Parameters
    ----------
    n_tests : int
        Number of available tests.
    counties : array-like
        Counties where tests can be distributed.
    population : array-like
        Population of each county.
    """
    raise NotImplementedError # TODO

**Task:** How would you argue that your sampling strategy is *unbiased*, meaning that it constitutes a representative sample of the German population?