# Get and calculate the data
Read the file ["Readme.ipynb"](Readme.ipynb) for more information.

- Altersgruppe: age group (0-4, 5-14, 15-34, 35-59, 60-79, 80+, or unknown)
- Geschlecht: sex (M, W, or unknown)
- AnzahlFall: number of cases in the respective group
- AnzahlTodesfall: number of deaths in the respective group
- Referenzdatum: date of onset of illness or, if unknown, the reporting date (field `Refdatum` in the API)
- AnzahlGenesen: number of recovered persons in the respective group
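
As a rough illustration of how these fields fit together, the following sketch sums `AnzahlFall` per `Altersgruppe` over records shaped like the `attributes` objects returned by the API. The sample records and the helper name `cases_per_age_group` are invented for this example and are not part of the notebook.

```python
from collections import defaultdict


def cases_per_age_group(records):
    """Sum AnzahlFall per Altersgruppe (hypothetical helper, not part of the notebook)."""
    totals = defaultdict(int)
    for record in records:
        totals[record["Altersgruppe"]] += record["AnzahlFall"]
    return dict(totals)


# invented sample records, shaped like the 'attributes' of an API feature
sample_records = [
    {"Altersgruppe": "A15-A34", "Geschlecht": "W", "AnzahlFall": 1},
    {"Altersgruppe": "A35-A59", "Geschlecht": "M", "AnzahlFall": 2},
    {"Altersgruppe": "A15-A34", "Geschlecht": "M", "AnzahlFall": 3},
]
print(cases_per_age_group(sample_records))
```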

## Modules
Needed to use functionality that is not built into the core language and has already been implemented by someone else.

In [1]:
# Used to convert the API data from json-format into a Python list
from json2xml.utils import readfromurl
import json    # to save the data in "json"-format in a file
# Used to check if there is a local file with the data or if a new API pull is inevitable
import os.path
import datetime    # to convert Unix timestamps into readable dates

## Control
Set variables to "True" to trigger the action described by the comment and the variable's name.<br/><br/>
If multiple of the three variables "covid19_use_api", "covid19_use_api_backup" and "covid19_use_polished_data" are set to "True", the last one overwrites all data collected by the others. It is best practice to only set one variable to "True".<br/><br/>
If one data source seems to provide faulty data or the necessary files do not exist, try out the other options.
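
The recommendation to enable only one source can be checked mechanically. The sketch below is only an illustration: the helper `active_sources` is hypothetical, and it takes the three flags as plain arguments instead of reading the notebook's global variables.

```python
def active_sources(use_api, use_api_backup, use_polished_data):
    """Return the names of all data sources that are switched on (hypothetical helper)."""
    flags = {
        "covid19_use_api": use_api,
        "covid19_use_api_backup": use_api_backup,
        "covid19_use_polished_data": use_polished_data,
    }
    return [name for name, enabled in flags.items() if enabled]


chosen = active_sources(False, False, True)
if len(chosen) != 1:
    print("Warning: expected exactly one data source, got", chosen)
```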

In [2]:
covid19_use_api = False    # pulls current COVID-19 case numbers from the API
covid19_use_api_backup = False    # polishes backup of old API pull
covid19_use_polished_data = True    # takes old, already polished data
counties_geography_use_polished_data = True    # takes old, already polished county geography (assumed default)

In [3]:
reports_from_freiburg = readfromurl(
    "https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/RKI_COVID19/FeatureServer/0/"+
    "query?where=IdLandkreis%3D8311&orderByFields=Refdatum&f=pjson&outFields="+
    "Altersgruppe%2C+Geschlecht%2C+AnzahlFall%2C+AnzahlTodesfall%2C+AnzahlGenesen%2C+Refdatum")['features']

In [4]:
reports_from_freiburg[0]['attributes']

{'Altersgruppe': 'A15-A34',
 'Geschlecht': 'W',
 'AnzahlFall': 1,
 'AnzahlTodesfall': 0,
 'AnzahlGenesen': 1,
 'Refdatum': 1582416000000}

In [5]:
reports_from_freiburg[1]['attributes']

{'Altersgruppe': 'A35-A59',
 'Geschlecht': 'M',
 'AnzahlFall': 1,
 'AnzahlTodesfall': 0,
 'AnzahlGenesen': 1,
 'Refdatum': 1582588800000}

In [6]:
# sum the recovered cases over all reports
recovered_freiburg = 0
for report in reports_from_freiburg:
    recovered_freiburg += report['attributes']['AnzahlGenesen']

In [7]:
recovered_freiburg

7247

In [8]:
g=list()
g[1]

IndexError: list index out of range

### Check the Controls
Check if the necessary files to run the choices made by the controls above exist. Otherwise the data must be taken from somewhere else.<br/>
Pulling from the API takes a lot of time and resources. If the user therefore chooses in the controls above not to pull from the API, this choice should only be overridden if it is unavoidable.
<br/><br/>
There are three ways in which data can be missing:
- Neither polished nor unpolished data about the German COVID-19 cases is saved on the machine. In this case a new pull from the API is inevitable.
- No polished version of the data exists on the machine, but a backup of an old API pull does. In this case the program either pulls from the API or "pulls" from the backup file with the unpolished data.
- No backup of an old API pull exists, but a polished version of the data does. In this case the program either pulls from the API or uses the polished data. The file with the polished data must exist here because the two previous cases have been ruled out.

In each respective case the global control variables are changed accordingly.
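
The three cases can be condensed into a small decision function. This is a sketch under assumptions, not the notebook's code: `resolve_sources` is a hypothetical name, and the file checks are passed in as booleans so that the logic can be tried without touching the disk.

```python
def resolve_sources(polished_exists, backup_exists, use_api, use_backup, use_polished):
    """Return (use_api, use_backup, use_polished) adjusted to the files that exist."""
    if not polished_exists and not backup_exists:
        # nothing is stored locally: a new pull from the API is inevitable
        return True, False, False
    if not polished_exists:
        # only the backup exists: use the API if requested, otherwise the backup
        return use_api, not use_api, False
    if not backup_exists:
        # only the polished data exists: use the API if requested, otherwise the polished data
        return use_api, False, not use_api
    # both files exist: keep the user's choice
    return use_api, use_backup, use_polished


# the user asked for polished data, but only the backup is available
print(resolve_sources(False, True, False, False, True))
```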

In [None]:
if (not(os.path.isfile("polished_data/german_covid19.txt")) and 
    not(os.path.isfile("unpolished_data/covid19/dates.txt"))):    # no files
    covid19_use_polished_data = False
    covid19_use_api_backup = False
    covid19_use_api = True
elif not(os.path.isfile("polished_data/german_covid19.txt")):    # no polished version
    # and os.path.isfile("unpolished_data/covid19/dates.txt") due to first condition
    covid19_use_polished_data = False
    # ensuring that one of the other two data sources is used
    covid19_use_api_backup = not(covid19_use_api)
elif not(os.path.isfile("unpolished_data/covid19/dates.txt")):    # no backup
    # and os.path.isfile("polished_data/german_covid19.txt") due to first condition
    covid19_use_api_backup = False
    # ensuring that one of the others is used
    covid19_use_polished_data = not(covid19_use_api)

The "number_of_counties" is also set here: It determines how many counties must be present in the data. If there are fewer or more, the current data source is considered faulty and (if possible) another one is used.

In [None]:
number_of_counties = 412

## Get the Geographical Data of Every German County
If "counties_geography_use_polished_data" is set to "True" and the required file exists, the polished data from that file is used. <br/>
If "counties_geography_use_polished_data" is set to "False" by the user or if the required file does not exist, the file "get_geographical_data_of_german_counties.ipynb" is called to provide new polished data.<br/>
For more information on where the data comes from and how it is polished check out the file "get_geographical_data_of_german_counties.ipynb". 

In [None]:
if not(os.path.isfile("polished_data/german_counties_geography.txt")):
    counties_geography_use_polished_data = False

In [None]:
if counties_geography_use_polished_data:
    with open("polished_data/german_counties_geography.txt", "r") as file:
        counties_geography = json.loads(file.read())
    print("Polished county data from file is ready to go!")
else:
    no_outputs_from_file_get_shapes_of_german_counties = True
    %run get_geographical_data_of_german_counties.ipynb

## Get the COVID-19 Cases of Every German County
Saves the COVID-19 cases of every German county since the start of the pandemic in the dictionary "covid19" (reachable by the county's AdmUnitID) and the corresponding dates in the dictionary "non_county_specific_data".

### Helper Functions
**url_county(AdmUnitID, True_for_dates_False_for_covid19_cases = False)**: returns url<br/>
Used to get the url for the COVID-19 cases of the German county determined by the AdmUnitID.<br/>
*AdmUnitID*<br/>
-> identifier of the county whose COVID-19 cases should be requested<br/>
*True_for_dates_False_for_covid19_cases* (default: False)<br/>
-> determines whether the dates in Unix time format (True) or the actual COVID-19 cases (False) should be requested

In [None]:
def url_county(AdmUnitID, True_for_dates_False_for_covid19_cases = False):
    url = ("https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/" +
           "rki_history_hubv/FeatureServer/0/query?where=AdmUnitId%3D" +
           str(AdmUnitID) + "&outFields=")
    if True_for_dates_False_for_covid19_cases:
        return url + "Datum&orderByFields=Datum&f=pjson"
    return url + "KumFall&orderByFields=Datum&f=pjson"

**find_alternative_source_of_data_and_activate_it()**: returns None (modifies multiple global variables)<br/>
Gets called when the data from a data source is faulty. Deletes faulty data to prevent use of faulty data. Checks which other data source could be used and modifies the global variables accordingly.

In [None]:
def find_alternative_source_of_data_and_activate_it():
    global covid19_use_api
    global covid19_use_api_backup
    global covid19_use_polished_data
    global copy_of_covid19_for_debugging_purposes
    global copy_of_non_county_specific_data_for_debugging_purposes
    global covid19
    global non_county_specific_data
    copy_of_non_county_specific_data_for_debugging_purposes = non_county_specific_data.copy()
    copy_of_covid19_for_debugging_purposes = covid19.copy()
    del non_county_specific_data    # to prevent accidental use of faulty data
    del covid19    # to prevent accidental use of faulty data
    # the backup itself is the faulty source if it was in use without the API
    backup_failed = covid19_use_api_backup and not(covid19_use_api)
    # use a local pull of the API if it exists and did not just fail, otherwise the polished data
    if not(backup_failed) and os.path.isfile("unpolished_data/covid19/dates.txt"):
        covid19_use_api_backup = True
    elif os.path.isfile("polished_data/german_covid19.txt"):
        covid19_use_polished_data = True
    else:    # neither local backup nor polished data found
        raise Exception("No usable data found!")

### Pull from API
If "covid19_use_api" is set to "True", the program pulls from the ["COVID-19 Datenhub"](https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/6d78eb3b86ad4466a8e264aa2e32a2e4_0). The data of each county must be pulled separately because the API only allows for 1,000 datapoints at a time and the number of counties multiplied by the number of days is well over 100,000. The identifiers of the counties originate from the keys of the dictionary "counties_geography".<br/><br/>
First, the received data is checked: If any county has fewer timestamps than the dates stored in "non_county_specific_data['unixtime']", all data gets deleted to prevent the use of faulty data and an alternative data source is chosen.<br/><br/>
If the unpolished data passes this rudimentary test, it is stored unchanged in a ".txt" file named after its AdmUnitID in the folder "covid19" inside the folder "unpolished_data". If any of the folders or any of the files do not yet exist, they are created.<br/>
This file can be used in further executions as local backup of the API-pull.
<br/><br/>
At the end of this chapter the polished version of the data is stored in the dictionary "covid19".
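
The extraction and the length check described above can be sketched without any network access. The helper `extract_series` and the two miniature responses are assumptions made for this example; only the `features`/`attributes` layout and the field names `Datum` and `KumFall` come from the API responses used in this notebook.

```python
def extract_series(raw_response, field, expected_length=None):
    """Pull one attribute out of every feature and optionally check the length."""
    series = [feature["attributes"][field] for feature in raw_response["features"]]
    if expected_length is not None and len(series) != expected_length:
        raise ValueError("expected {} timestamps, got {}"
                         .format(expected_length, len(series)))
    return series


# invented miniature responses in the shape returned by the API
sample_raw_dates = {"features": [{"attributes": {"Datum": 1582416000000}},
                                 {"attributes": {"Datum": 1582502400000}}]}
sample_raw_cases = {"features": [{"attributes": {"KumFall": 0}},
                                 {"attributes": {"KumFall": 3}}]}

sample_dates = extract_series(sample_raw_dates, "Datum")
sample_cases = extract_series(sample_raw_cases, "KumFall",
                              expected_length=len(sample_dates))
print(sample_dates, sample_cases)
```

Raising an exception instead of printing is a design choice of this sketch; the notebook itself prints a message and switches to another data source.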

In [None]:
if covid19_use_api:
    print("Pulling from API...")
    covid19 = dict()
    non_county_specific_data = dict()
    # check if the needed directory is available - otherwise create it
    if not(os.path.isdir("unpolished_data/covid19")): os.makedirs("unpolished_data/covid19")
    number_of_timestamps = -1
    
    for AdmUnitID in list(counties_geography.keys()):
        # get dates of first county
        if number_of_timestamps == -1:
            raw_dates = readfromurl(url_county(AdmUnitID, True))
            if len(raw_dates['features']) < 200:
                print("The request for the dates of {} returned too few timestamps ({}) - check the url"
                      .format(AdmUnitID, len(raw_dates['features'])))
                find_alternative_source_of_data_and_activate_it()
                covid19_use_api = False
                break
            number_of_timestamps = len(raw_dates['features'])
            non_county_specific_data['unixtime'] = [e['attributes']['Datum'] for e in raw_dates['features']]
            # save raw data
            with open("unpolished_data/covid19/dates.txt", "w") as file:
                file.write(json.dumps(raw_dates))

        # get the county's covid19 data
        raw_covid19_data = readfromurl(url_county(AdmUnitID))
        if number_of_timestamps != len(raw_covid19_data['features']):
            print("The data provided by the API for {} has {} timestamps instead of {}."
                  .format(AdmUnitID, len(raw_covid19_data['features']), number_of_timestamps))
            find_alternative_source_of_data_and_activate_it()
            covid19_use_api = False
            break
        with open("unpolished_data/covid19/" + AdmUnitID + ".txt", "w") as file:
            file.write(json.dumps(raw_covid19_data))
        covid19[AdmUnitID] = dict()
        covid19[AdmUnitID]['cases'] = [e['attributes']['KumFall'] for e in raw_covid19_data['features']]
        
    if covid19_use_api:
        covid19_use_polished_data = False
        covid19_use_api_backup = False
        print("Covid19 Data directly from API is ready to go!")

### "Pull" from Local API Backup
If the use of the data from a local backup of the API-pull is requested and possible, the data is read from the files in the folder "covid19" inside the folder "unpolished_data". The name of the files should represent the "AdmUnitID" of the county.<br/>
The data is polished and stored in the dictionary "covid19" while it is being read.
<br/><br/>
The received data is checked: If any county has fewer timestamps than the dates stored in "non_county_specific_data['unixtime']", all data gets deleted to prevent the use of faulty data and an alternative data source is chosen.
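
A compact way to read such a backup folder and to find counties without a backup file is sketched below. It is an illustration only: `read_backup` is a hypothetical helper, the folder is a temporary directory, and the second identifier in the example is invented.

```python
import json
import os
import tempfile


def read_backup(folder, expected_ids):
    """Read every '<AdmUnitID>.txt' file in folder and report counties without a file."""
    cases = {}
    for filename in os.listdir(folder):
        adm_unit_id, extension = os.path.splitext(filename)
        if extension != ".txt" or adm_unit_id == "dates":
            continue    # skip the dates file and anything that is not a backup
        with open(os.path.join(folder, filename), "r") as file:
            raw = json.load(file)
        cases[adm_unit_id] = [f["attributes"]["KumFall"] for f in raw["features"]]
    missing = sorted(set(expected_ids) - set(cases))
    return cases, missing


# write one invented backup file and read it back
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "8311.txt"), "w") as file:
        json.dump({"features": [{"attributes": {"KumFall": 5}}]}, file)
    backup_cases, missing_counties = read_backup(folder, ["8311", "8315"])

print(backup_cases, missing_counties)
```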

In [None]:
if not(covid19_use_api) and covid19_use_api_backup:
    print("Reading backup of old API pull...")
    covid19 = dict()
    non_county_specific_data = dict()
    list_of_countys = list(counties_geography.keys())
    # get the dates
    with open("unpolished_data/covid19/dates.txt", "r") as file:
        raw_dates = json.loads(file.read())
    non_county_specific_data['unixtime'] = [e['attributes']['Datum'] for e in raw_dates['features']]
    number_of_timestamps = len(non_county_specific_data['unixtime'])

    for root, dirs, files in os.walk('unpolished_data/covid19'):
        # too few dates - something is wrong. Checking here to skip the for-loop
        if len(raw_dates['features']) < 200:
            print("There are only {} dates - check your backup or make a new pull from the API."
                  .format(len(raw_dates['features'])))
            find_alternative_source_of_data_and_activate_it()
            covid19_use_api_backup = False    # the backup is the faulty source here
            break
        for filename in files:
            AdmUnitID = filename[:-4]
            if AdmUnitID == 'dates':    # already done
                continue

            list_of_countys.remove(AdmUnitID)
            covid19[AdmUnitID] = dict()
            with open(os.path.join(root, filename), "r") as file:
                covid19[AdmUnitID]['cases'] = [e['attributes']['KumFall'] for e in
                                               json.loads(file.read())['features']]

            if number_of_timestamps != len(covid19[AdmUnitID]['cases']):
                print("The data from file {} does not have {} timestamps, it has {}."
                      .format(filename, number_of_timestamps, len(covid19[AdmUnitID]['cases'])))
                find_alternative_source_of_data_and_activate_it()
                covid19_use_api_backup = False
                break

    if len(list_of_countys) > 0 and covid19_use_api_backup:
        print("No backup found for {}".format(list_of_countys))
        find_alternative_source_of_data_and_activate_it()
        covid19_use_api_backup = False

    if covid19_use_api_backup:
        covid19_use_polished_data = False
        covid19_use_api = False
        print("Covid19 Data from (maybe old) API-pull-backup is ready to go!")

### Calculate the Seven Days Incidence and Get the Cases, Incidences and Inhabitants of Germany
The calculation of the incidence needs the number of cases seven days prior (set to zero if not defined), the cases of the current day (both from "covid19[AdmUnitID]['cases']") and the number of inhabitants of the county ("counties_geography[AdmUnitID]['population']").
<br/><br/>
To get all new cases in that county within the last seven days the program subtracts the accumulated cases seven days earlier from the accumulated cases of the current day. Afterwards this number of cases is divided by the county's population. In order to scale it to 100,000 inhabitants, the result is multiplied by 100,000.<br/>
This is done for every case number of every county.
<br/><br/>
The highest and lowest seven days incidence and the highest and lowest case number are stored in the dictionary "non_county_specific_data" as a reference.
<br/><br/><br/>
The number of inhabitants of Germany is calculated by adding up the number of inhabitants of all counties. The same applies to the accumulated number of COVID-19 cases for every day. The seven days incidence for Germany is then calculated as described above.
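
Written as a formula, the incidence on day t is (cases[t] - cases[t-7]) / population * 100,000, with cases[t-7] taken as zero for the first seven days. A minimal sketch of this calculation for a single series follows; the function name and the numbers are invented for the example.

```python
def seven_day_incidence(cumulative_cases, population):
    """Return the seven days incidence per 100,000 inhabitants for every day."""
    incidences = []
    for day, cases_on_day in enumerate(cumulative_cases):
        # cases seven days earlier count as zero while they are not defined
        cases_7_days_prior = cumulative_cases[day - 7] if day >= 7 else 0
        incidences.append((cases_on_day - cases_7_days_prior) * 100000 / population)
    return incidences


# invented cumulative case numbers for nine days in a county of 200,000 inhabitants
sample_incidences = seven_day_incidence([0, 10, 20, 30, 40, 50, 60, 70, 90], 200000)
print(sample_incidences)
```

On the eighth day (index 7) 70 new cases fall into the window, which gives 70 / 200,000 * 100,000 = 35.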

In [None]:
if not(covid19_use_polished_data):
    non_county_specific_data['population_germany'] = 0
    for county in counties_geography.values():
        non_county_specific_data['population_germany'] += county['population']

    ncsd = non_county_specific_data
    ncsd['cases_germany'] = len(ncsd['unixtime'])*[0]

    non_county_specific_data['highest_case_number'] = 0
    non_county_specific_data['lowest_case_number'] = 100000000000000
    non_county_specific_data['highest_incidence'] = 0
    non_county_specific_data['lowest_incidence'] = 100000000000000
    for AdmUnitID in covid19.keys():
        covid19[AdmUnitID]['incidences'] = list()
        for timestamp in range(len(covid19[AdmUnitID]['cases'])):
            cases_7_days_prior = 0
            cases_on_day = covid19[AdmUnitID]['cases'][timestamp]
            non_county_specific_data['cases_germany'][timestamp] = (cases_on_day +
            non_county_specific_data['cases_germany'][timestamp])

            if timestamp >= 7:
                cases_7_days_prior = covid19[AdmUnitID]['cases'][timestamp - 7]
            incidence = (((cases_on_day - cases_7_days_prior) * 100000) /
                         counties_geography[AdmUnitID]['population'])
            covid19[AdmUnitID]['incidences'].append(incidence)
            if non_county_specific_data['highest_case_number'] < cases_on_day:
                non_county_specific_data['highest_case_number'] = cases_on_day
            if non_county_specific_data['lowest_case_number'] > cases_on_day:
                non_county_specific_data['lowest_case_number'] = cases_on_day
            if non_county_specific_data['highest_incidence'] < incidence:
                non_county_specific_data['highest_incidence'] = incidence
            if non_county_specific_data['lowest_incidence'] > incidence:
                non_county_specific_data['lowest_incidence'] = incidence

    non_county_specific_data['incidences_germany'] = list()
    for timestamp in range(len(non_county_specific_data['cases_germany'])):
        cases_7_days_prior = 0
        cases_on_day = non_county_specific_data['cases_germany'][timestamp]
        if timestamp >= 7:
            cases_7_days_prior = non_county_specific_data['cases_germany'][timestamp - 7]
        incidence = (((cases_on_day - cases_7_days_prior) * 100000) /
                     non_county_specific_data['population_germany'])
        non_county_specific_data['incidences_germany'].append(incidence)

## Get the Names of the German Federal States
This data is hardcoded because it is unlikely to change. Even if the names of the federal states become outdated and do not fit the current official name, the functionality of this project will not be affected.
The names are taken from the ["COVID-19 Datenhub"](https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/rki_admunit_hubv/FeatureServer/0/query?where=AdmUnitId%3C20&resultType=none&outFields=*&f=pjson).

In [None]:
if not(covid19_use_polished_data):
    non_county_specific_data['states'] = {
        "1" : "Schleswig-Holstein",
        "2" : "Hamburg",
        "3" : "Niedersachsen",
        "4" : "Bremen",
        "5" : "Nordrhein-Westfalen",
        "6" : "Hessen",
        "7" : "Rheinland-Pfalz",
        "8" : "Baden-Württemberg",
        "9" : "Bayern",
        "10" : "Saarland",
        "11" : "Berlin",
        "12" : "Brandenburg",
        "13" : "Mecklenburg-Vorpommern",
        "14" : "Sachsen",
        "15" : "Sachsen-Anhalt",
        "16" : "Thüringen"}

## Check and Save the Polished COVID-19 Data
Before the COVID-19 cases are saved in the file "german_covid19.txt" inside the folder "polished_data", they are checked once again to ensure that during the polishing nothing gets lost or is changed.
<br/>
The check covers whether the number of counties differs from the variable number_of_counties and whether every list of cases is as long as the list of dates.
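
Both checks can be collected in one function that returns every problem it finds. The sketch below assumes the dictionary layout used in this notebook (`covid19[AdmUnitID]['cases']`); the function name `find_problems` and the sample data are made up.

```python
def find_problems(covid19_data, unixtime, number_of_counties):
    """Return a list of readable problems; an empty list means the data looks consistent."""
    problems = []
    if len(covid19_data) != number_of_counties:
        problems.append("expected {} counties, found {}"
                        .format(number_of_counties, len(covid19_data)))
    for adm_unit_id, data in covid19_data.items():
        if len(data["cases"]) != len(unixtime):
            problems.append("county {} has {} case entries for {} dates"
                            .format(adm_unit_id, len(data["cases"]), len(unixtime)))
    return problems


# invented miniature data set with two counties and three dates
sample_unixtime = [1582416000000, 1582502400000, 1582588800000]
sample_covid19 = {"8311": {"cases": [0, 1, 2]}, "8315": {"cases": [0, 1]}}
sample_problems = find_problems(sample_covid19, sample_unixtime, number_of_counties=2)
print(sample_problems)
```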

In [None]:
if not(covid19_use_polished_data):
    covid19_data_seems_to_be_flawless = True    # Assume everything is correct
    if len(covid19) != number_of_counties:
        print("covid19 does not have the right number of counties: {} instead of {}."
                .format(len(covid19), number_of_counties))
        covid19_data_seems_to_be_flawless = False
    for AdmUnitID in covid19.keys():
        if len(covid19[AdmUnitID]['cases']) != len(non_county_specific_data['unixtime']):
            print("The county {} does not have the right number of dates: {} instead of {}."
                    .format(AdmUnitID, len(covid19[AdmUnitID]['cases']),
                            len(non_county_specific_data['unixtime'])))
            covid19_data_seems_to_be_flawless = False

In [None]:
if not(covid19_use_polished_data) and covid19_data_seems_to_be_flawless:
    # check if the needed directory is available - otherwise create it
    if not(os.path.isdir("polished_data")): os.makedirs("polished_data")
    with open("polished_data/german_covid19.txt", "w") as file:
        file.write(json.dumps((covid19, non_county_specific_data)))
    print("Saved seemingly flawless covid19 data.")

## Get the Polished Data
If the pull from the API and/or the "pull" from the local backup failed or the user chose to use the polished data, the file "german_covid19.txt" inside the folder "polished_data" is opened and the data is stored in the variables "covid19" and "non_county_specific_data".
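
The file stores both dictionaries as one JSON document. A short round trip shows why the unpacking on loading works: the tuple passed to `json.dumps` is written as a JSON array and comes back as a list with two elements. The example dictionaries are invented stand-ins for the real data.

```python
import json

# invented stand-ins for "covid19" and "non_county_specific_data"
covid19_example = {"8311": {"cases": [0, 3]}}
metadata_example = {"unixtime": [1582416000000, 1582502400000]}

text = json.dumps((covid19_example, metadata_example))
restored = json.loads(text)
restored_covid19, restored_metadata = restored

print(type(restored).__name__)    # the tuple comes back as a list
print(restored_covid19 == covid19_example)
```

Note that JSON object keys are always strings, so identifiers should be handled as strings throughout if the data is to survive the round trip unchanged.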

In [None]:
if covid19_use_polished_data:
    covid19_use_api_backup = False
    covid19_use_api = False
    with open("polished_data/german_covid19.txt", "r") as file:
        covid19, non_county_specific_data = json.loads(file.read())
    print("Polished covid19 data from file is ready to go!")

## Add UTC Time and Additional Dates
Humans are generally not used to Unix time, which is why the more accessible UTC date format is used. The exact hour in Germany and the time shift are not taken into account because the data is only compared to other data with the same time shift.
<br/><br/>
The UTC time is added after saving the data because the UTC time format cannot be saved in json format. Therefore it must always be generated anew. Calculating it inside the file "get_data.ipynb" keeps the plotting of the data strictly separated from the pulling and polishing of the data.
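
The conversion itself is small. The sketch below turns a Unix timestamp in milliseconds into a date string and passes the UTC timezone explicitly, so the result does not depend on the timezone of the machine. The helper name `unix_ms_to_date_string` is made up; the timestamp is the `Refdatum` value shown at the top of this notebook.

```python
import datetime


def unix_ms_to_date_string(unix_ms):
    """Convert milliseconds since the Unix epoch into a 'dd.mm.yyyy' string in UTC."""
    moment = datetime.datetime.fromtimestamp(unix_ms // 1000, tz=datetime.timezone.utc)
    return moment.strftime('%d.%m.%Y')


one_day_in_ms = 86400000
print(unix_ms_to_date_string(1582416000000))
print(unix_ms_to_date_string(1582416000000 - 7 * one_day_in_ms))
```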

In [None]:
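# interpret the timestamps (milliseconds since the epoch) as UTC,
# so that the result does not depend on the local timezone
non_county_specific_data['UTC'] = [
    datetime.datetime.fromtimestamp(date//1000, tz=datetime.timezone.utc).strftime('%d.%m.%Y')
    for date in non_county_specific_data['unixtime']]

In [None]:
# prepend the seven days before the first date
non_county_specific_data['UTC+7days'] = non_county_specific_data['UTC'].copy()
for e in range(1,8):
    non_county_specific_data['UTC+7days'].insert(0,
    datetime.datetime.fromtimestamp((non_county_specific_data['unixtime'][0]
                                    - (e*86400000))//1000,
                                    tz=datetime.timezone.utc).strftime('%d.%m.%Y'))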