# Load Denmark COVID-19 cases by age group

### Andrei Paleyes
### May 6, 2020

In Denmark, official COVID-19 data is published daily by Statens Serum Institut (SSI). The reference web page is: https://www.ssi.dk/aktuelt/sygdomsudbrud/coronavirus/covid-19-i-danmark-epidemiologisk-overvaagningsrapport

Age-stratified data is available in the daily reports, which are published as PDF files. Each report (with one exception so far) contains a single table of interest to us, which gives the cumulative number of COVID-19 cases for males and females in each age group (age groups are in 10-year steps).

Unfortunately, at the time of writing this notebook, there is no way to consume this data in a machine-friendly format. This notebook therefore parses the daily report PDF files and compiles the dataset from them.

In [1]:
!java -version

java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)


In [2]:
%pip install -q tabula-py

You should consider upgrading via the '/Users/ines_admin/anaconda3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import tabula

tabula.environment_info()

Python version:
    3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
    java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)
tabula-py version: 2.1.0
platform: Darwin-19.2.0-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='Admins-MacBook-Pro.local', release='19.2.0', version='Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.2.0', '')
mac_ver: ('10.15.2', ('', '', ''), 'x86_64')
    


In [4]:
import pandas as pd

In [5]:
# path: URL of the report
# page: page of the document that contains the table
# top: distance from the top of the page to the top of the table, in pt. Defaults to 110.
# bottom: distance from the top of the page to the bottom of the table, in pt. Defaults to 350 + (top - 110).
#
# More details on how to measure top and bottom: https://stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates

all_reports = {
    'May 5, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-05052020-s0l0", 'page': 11},
    'May 4, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-04052020-hu28", 'page': 11},
    'May 3, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-03052020-am43", 'page': 11},
    'May 2, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-02052020-l9i8", 'page': 11},
    'May 1, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-01052020-prst",  'page': 11},
    'April 30, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-30042020-2h7d",   'page': 11},
    'April 29, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-29042020-wl02",   'page': 11},
    'April 28, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-28042020-gg64",   'page': 11},
    'April 27, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-27042020-ce23",   'page': 11},
    'April 26, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-26042020-y34f",   'page': 11},
    'April 25, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-25042020-sr21",   'page': 11},
    'April 24, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-24042020-ds65",   'page': 11},
    'April 23, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-23042020-gl5b",  'page': 11},
    'April 22, 2020': {'path': "https://files.ssi.dk/covid19-overvaagningsrapport-22042020-lj45",  'page': 11},
    'April 21, 2020': {'path': "https://files.ssi.dk/covid19-overvaagningsrapport-21032020-hj78",   'page': 11},
    'April 20, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-20042020-2dd09",  'page': 11},
    'April 19, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-19042020-hba7",  'page': 11},
    'April 18, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-18042020-a8sg",   'page': 11},
    'April 17, 2020': {'path': "https://www.ssi.dk/-/media/arkiv/dk/aktuelt/sygdomsudbrud/covid19-rapport/17042020/covid19-overvaagningsrapport-17042020-gt90.pdf",   'page': 11},
    'April 16, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-16042020-hzz5",   'page': 11, 'top':170},
    'April 15, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-15042020-ht7b",   'page': 11, 'top':170},
    'April 14, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-14042020-wgkv",   'page': 10, 'top':170},
    'April 13, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-13042020-gy70",   'page': 10, 'top':170},
    'April 12, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-12042020-hh8b",   'page': 10, 'top':170},
    'April 11, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-11042020-ednk",   'page': 10, 'top':170},
    'April 10, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-10042020-21bn",   'page': 10, 'top':170},
    'April 9, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-09042020-31us",  'page': 10, 'top':125},
    'April 8, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-08042020-zm92",  'page': 10, 'top':125},
    'April 7, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-07042020-wvp1",  'page': 10, 'top':150},
    'April 6, 2020': {'path': "https://files.ssi.dk/covid19-overvaagningsrapport-06042020-hu4v", 'page': 10, 'top':150},
    'April 5, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-05042020-dd29", 'page': 10, 'top':150}, 
    'April 4, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-04042020-wdcs",  'page': 10, 'top':150},
    'April 3, 2020': {'path': "https://files.ssi.dk/covid-19-overvaagningsrapport-03042020-2",    'page': 10, 'top':150},
    'April 2, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-02042020-kl45",  'page': 9, 'top':150},
    'April 1, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-01042020-apl4",  'page': 9, 'top':170},
    'March 31, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-31032020-2us61",   'page': 9, 'top':150},
    'March 30, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-30032020-hb2a",   'page': 9, 'top':150},
    'March 29, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-29032020-f67s",   'page': 9, 'top':150},
    'March 28, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-28032020-wl35",   'page': 9, 'top':150},
    'March 27, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-27032020-gk38",  'page': 9, 'top':150},
    'March 26, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-26032020",   'page': 9, 'top':150},
    'March 25, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-25032020",   'page': 9, 'top':150},
    'March 24, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-24032020",  'page': 9, 'top':150},
    'March 23, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-23032020",   'page': 5, 'top':135, 'bottom': 285},
    'March 22, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-22032020",   'page': 5, 'top':135, 'bottom': 285},
    'March 21, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-21032020-2",   'page': 5, 'top':150, 'bottom': 300},
    'March 20, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-20032020",   'page': 5, 'top':160, 'bottom': 310},
    'March 19, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-19032020",   'page': 5, 'top':135, 'bottom': 285},
    'March 18, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-18032020",   'page': 4, 'top':515, 'bottom': 670},
    'March 17, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-17032020",   'page': 4, 'top':515, 'bottom': 670},
    'March 16, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-16032020",   'page': 3, 'top':545, 'bottom': 700},
    'March 13, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-13032020",   'page': 3, 'top':135, 'bottom': 285},
    # this one does not contain necessary info at all
    #'March 12, 2020': {'path': "https://files.ssi.dk/COVID19-overvaagningsrapport-12032020", 'page': 3, 'top':160, 'bottom': 310}
}
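
The defaults described in the comments of this cell can be sketched as a small helper. This is only an illustration: `table_area` is a hypothetical name that does not appear in the notebook, and the left and right bounds (0.0 and 550) are simply the values passed to `tabula.read_pdf` in the following cell.

```python
def table_area(entry, left=0.0, right=550):
    """Return [top, left, bottom, right] in pt for one report entry.

    Hypothetical helper: the name and signature are not part of the notebook.
    """
    top = entry.get('top', 110)
    # When 'bottom' is absent, the table is assumed to keep its default
    # height (350 - 110 = 240 pt) and simply shift down together with 'top'.
    bottom = entry.get('bottom', 350 + (top - 110))
    return [top, left, bottom, right]


print(table_area({'page': 11}))                            # no overrides
print(table_area({'page': 10, 'top': 170}))                # only 'top' given
print(table_area({'page': 5, 'top': 135, 'bottom': 285}))  # both given
```

An entry that only overrides `top` therefore gets a window of the same height as the default one, which is why most entries above do not need an explicit `bottom`.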

In [6]:
# One column per report date, ten rows (one per age group)
df = pd.DataFrame(index=range(10))
for date, report in all_reports.items():
    top = report.get('top', 110)
    bottom = report.get('bottom', 350 + (top - 110))
    # area is [top, left, bottom, right], measured in pt from the top-left corner of the page
    report_df = tabula.read_pdf(
        report['path'], pages=report['page'], guess=False,
        area=[top, 0.0, bottom, 550],
        pandas_options={'dtype': 'str'}, multiple_tables=False,
    )[0]
    # keep the first ten rows of the last column of the extracted table
    df[date] = report_df.head(10).iloc[:, -1]
    print('Processed ' + str(date))

Processed May 5, 2020
Processed May 4, 2020
Processed May 3, 2020
Processed May 2, 2020
Processed May 1, 2020
Processed April 30, 2020
Processed April 29, 2020
Processed April 28, 2020
Processed April 27, 2020
Processed April 26, 2020
Processed April 25, 2020
Processed April 24, 2020
Processed April 23, 2020
Processed April 22, 2020
Processed April 21, 2020
Processed April 20, 2020
Processed April 19, 2020
Processed April 18, 2020
Processed April 17, 2020
Processed April 16, 2020
Processed April 15, 2020
Processed April 14, 2020
Processed April 13, 2020
Processed April 12, 2020
Processed April 11, 2020
Processed April 10, 2020
Processed April 9, 2020
Processed April 8, 2020
Processed April 7, 2020
Processed April 6, 2020
Processed April 5, 2020
Processed April 4, 2020
Processed April 3, 2020
Processed April 2, 2020
Processed April 1, 2020
Processed March 31, 2020
Processed March 30, 2020
Processed March 29, 2020
Processed March 28, 2020
Processed March 27, 2020
Processed March 26, 2020

Got stderr: May 12, 2020 8:20:38 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
May 12, 2020 8:20:38 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
May 12, 2020 8:20:39 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>



Processed March 23, 2020
Processed March 22, 2020
Processed March 21, 2020
Processed March 20, 2020
Processed March 19, 2020
Processed March 18, 2020
Processed March 17, 2020
Processed March 16, 2020
Processed March 13, 2020


In [7]:
df

Unnamed: 0,"May 5, 2020","May 4, 2020","May 3, 2020","May 2, 2020","May 1, 2020","April 30, 2020","April 29, 2020","April 28, 2020","April 27, 2020","April 26, 2020",...,"March 24, 2020","March 23, 2020","March 22, 2020","March 21, 2020","March 20, 2020","March 19, 2020","March 18, 2020","March 17, 2020","March 16, 2020","March 13, 2020"
0,168.0,155.0,143.0,139.0,135.0,126.0,120.0,116.0,111.0,108.0,...,13,13,13,13,13,13,12,13,12,10
1,378.0,363.0,347.0,345.0,336.0,326.0,313.0,298.0,284.0,273.0,...,37,35,35,35,35,35,33,33,33,30
2,1.308,1.287,1.265,1.244,1.227,1.188,1.16,1.139,1.112,1.093,...,172,167,165,161,153,148,146,142,140,134
3,1.368,1.339,1.315,1.302,1.284,1.262,1.23,1.199,1.181,1.162,...,203,193,191,184,170,162,156,147,143,135
4,1.778,1.757,1.734,1.719,1.706,1.687,1.673,1.644,1.616,1.6,...,402,386,375,362,345,325,305,294,284,253
5,1.833,1.805,1.78,1.755,1.733,1.714,1.694,1.682,1.656,1.637,...,279,254,248,239,224,209,200,186,176,159
6,1.178,1.164,1.155,1.141,1.137,1.122,1.106,1.09,1.074,1.068,...,180,156,148,140,127,107,90,70,62,50
7,876.0,871.0,864.0,851.0,848.0,842.0,833.0,825.0,819.0,806.0,...,167,132,118,100,82,70,55,39,26,5
8,706.0,701.0,693.0,685.0,682.0,673.0,666.0,649.0,642.0,628.0,...,106,98,87,77,65,51,38,28,18,7
9,228.0,228.0,227.0,226.0,223.0,218.0,213.0,209.0,203.0,200.0,...,18,16,15,15,12,12,9,8,4,2
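
The larger values in the raw table use '.' as a thousands separator (for example 1.308 stands for 1308). A minimal sketch of the conversion for a single cell follows, assuming every cell is a string such as '168' or '1.308'. `parse_count` is a hypothetical helper and not part of the notebook. Note that a cell containing a true decimal point (such as '168.0') would be misread by this approach.

```python
def parse_count(text: str) -> int:
    """Convert a count such as '1.308' (dot as thousands separator) to an int.

    Hypothetical helper: assumes the input never contains a decimal part.
    """
    # str.replace treats '.' literally, so only the separator is removed
    return int(text.replace('.', ''))


print(parse_count('168'))
print(parse_count('1.308'))
```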


In [8]:
import numpy as np

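# the reports use '.' as a thousands separator (e.g. 1.308); strip it before converting to int
# regex=False makes '.' a literal dot regardless of the pandas version's default
df = df.astype('string').apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)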

# set index to age groups
df = df.set_index(pd.Index([f'{i}-{i+9}' for i in range(0, 90, 10)] + ['90+'], dtype='str'))
df.columns = pd.to_datetime(df.columns.astype(str))
df = df[np.sort(df.columns)]
df

Unnamed: 0,2020-03-13,2020-03-16,2020-03-17,2020-03-18,2020-03-19,2020-03-20,2020-03-21,2020-03-22,2020-03-23,2020-03-24,...,2020-04-26,2020-04-27,2020-04-28,2020-04-29,2020-04-30,2020-05-01,2020-05-02,2020-05-03,2020-05-04,2020-05-05
0-9,10,12,13,12,13,13,13,13,13,13,...,108,111,116,120,126,135,139,143,155,168
10-19,30,33,33,33,35,35,35,35,35,37,...,273,284,298,313,326,336,345,347,363,378
20-29,134,140,142,146,148,153,161,165,167,172,...,1093,1112,1139,1160,1188,1227,1244,1265,1287,1308
30-39,135,143,147,156,162,170,184,191,193,203,...,1162,1181,1199,1230,1262,1284,1302,1315,1339,1368
40-49,253,284,294,305,325,345,362,375,386,402,...,1600,1616,1644,1673,1687,1706,1719,1734,1757,1778
50-59,159,176,186,200,209,224,239,248,254,279,...,1637,1656,1682,1694,1714,1733,1755,1780,1805,1833
60-69,50,62,70,90,107,127,140,148,156,180,...,1068,1074,1090,1106,1122,1137,1141,1155,1164,1178
70-79,5,26,39,55,70,82,100,118,132,167,...,806,819,825,833,842,848,851,864,871,876
80-89,7,18,28,38,51,65,77,87,98,106,...,628,642,649,666,673,682,685,693,701,706
90+,2,4,8,9,12,12,15,15,16,18,...,200,203,209,213,218,223,226,227,228,228
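
Because the counts are cumulative, each row would normally be expected to be non-decreasing over time, which gives a cheap sanity check on the parsed values. The sketch below is one possible way to list day-over-day drops. `find_decreases` is a hypothetical helper, and the sample frame only reuses the first five values of the 0-9 and 10-19 rows shown above.

```python
import pandas as pd


def find_decreases(cumulative: pd.DataFrame) -> list:
    """Return (row_label, column_label, change) for every drop between columns.

    Hypothetical helper: assumes the columns are already sorted by date.
    """
    changes = cumulative.diff(axis=1)  # difference to the previous column
    drops = []
    for row_label, row in changes.iterrows():
        for col_label, change in row.items():
            # NaN (first column) compares as False and is skipped
            if change < 0:
                drops.append((row_label, col_label, int(change)))
    return drops


sample = pd.DataFrame(
    [[10, 12, 13, 12, 13], [30, 33, 33, 33, 35]],
    index=['0-9', '10-19'],
    columns=pd.to_datetime(['2020-03-13', '2020-03-16', '2020-03-17',
                            '2020-03-18', '2020-03-19']),
)
drops = find_decreases(sample)
print(drops)
```

A reported drop does not necessarily indicate a parsing error, since a source may revise its own figures. It only marks cells worth comparing against the original PDF.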


In [9]:
df.to_csv('denmark_cases_by_age.csv') 

In [0]:
# only needed when running in Google Colab
from google.colab import files
files.download('denmark_cases_by_age.csv')