# Get and calculate the data
This program is used to plot COVID-19 incidences in Germany by county from March 1, 2020 to the current date (for as long as the API provides the data).
<br/><br/>
This project contains three files:
   - "get_geographical_data_of_german_counties.ipynb"
   - "plot_data.ipynb"
   - "get_data.ipynb"

The file "get_geographical_data_of_german_counties.ipynb" saves the following information reachable through the "AdmUnitID" (Gemeindeschlüssel) in the dictionary "counties_geography":
   - name: Name of the county
   - population: Number of inhabitants according to the most recent official estimate, also used in the official incidence calculations
   - geometry: The shape of the county stored in one or more polygons - also determining the position of the county
   - raw_geometry: The original version of the geometry used to draw the surroundings of the counties
   - area_in_m2: Area of the county in square meters; it cannot be calculated from the polygons stored in geometry








The program is called from "plot_data.ipynb" or it is run by itself.
<br/><br/><br/>
The main file is called "plot_data.ipynb".














The program provides the dictionary "counties_geography" with different kinds of geographical information about every German county. For more information check out the file "get_geographical_data_of_german_counties.ipynb".
<br/><br/>
The program saves the accumulated COVID-19 case numbers of every German county for every day of the pandemic in the dictionary "covid19" (reachable by the county's AdmUnitID).
<br/><br/>
Finally, the program calculates the seven-day incidence for every county and every day of the pandemic. Additionally, it calculates each county's population density and converts the Unix time into UTC.
<br/><br/>
The data from the API is called "unpolished" or "unmodified". "Polishing the data" means checking the data and cutting away unnecessary data (e.g. object IDs) as well as dictionary shells that are useless for this project.
In the end, this program therefore provides the slimmest version of the COVID-19 data and the geographical data, called the "polished version" or "modified data".<br/><br/>
There are three different ways the COVID-19 data and the geographical data are stored:
- Stored on the server of the RKI, reachable through the API.
- Stored unpolished and unmodified on this machine as backup if the API doesn't work anymore.
- Stored polished and ready to go on this machine.
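
To illustrate what "polishing" means, the following minimal sketch strips a raw response down to a plain list of accumulated case numbers. The nesting under `features`, `attributes` and `KumFall` follows the field names used in the code cells of this notebook; the function name `polish_cases`, the `OBJECTID` field and the sample values are assumptions made only for this illustration.

```python
def polish_cases(raw_response):
    """Return only the accumulated case numbers from a raw API response."""
    return [feature["attributes"]["KumFall"] for feature in raw_response["features"]]


# Hypothetical raw response; a real response carries many more entries.
raw_example = {
    "features": [
        {"attributes": {"OBJECTID": 1, "KumFall": 0}},
        {"attributes": {"OBJECTID": 2, "KumFall": 3}},
        {"attributes": {"OBJECTID": 3, "KumFall": 7}},
    ]
}

polished_example = polish_cases(raw_example)
print(polished_example)
```

The outer dictionary shells and the object IDs are dropped; only the values needed for the incidence calculation remain.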

## Modules
Needed to use functionality that has already been programmed by someone else.

In [1]:
# Used to read the API data in JSON format from a URL into a Python dictionary
from json2xml.utils import readfromurl
import json    # to save the data in "json"-format in a file
# Used to check if there is a local file with the data or if a new API pull is inevitable
import os.path
import datetime    # to convert Unix time to UTC

## Control
Set variables to "True" to trigger the action described by the comment and the variable's name.<br/><br/>
If more than one of the three variables "covid19_use_api", "covid19_use_api_backup" and "covid19_use_polished_data" is set to "True", the last one overwrites all data collected by the others. It is best practice to set only one of them to "True".<br/><br/>
If one data source seems to provide faulty data or the necessary files do not exist, the other options are tried.

In [2]:
# The program uses polished geographical data about the counties (True)
# or calls get_geographical_data_of_german_counties.ipynb to produce that data (False)
counties_geography_use_polished_data = True

covid19_use_api = False    # pulls current COVID-19 case numbers from the API
covid19_use_api_backup = False    # polishes backup of old API pull
covid19_use_polished_data = True    # takes old, already polished data

### Check the controls
Check if the necessary files to run the choices made by the controls above exist. Otherwise the data must be taken from somewhere else.<br/>
Pulling from the API takes a lot of time and resources. If the user therefore chooses in the controls above not to pull from the API, this choice should only be overridden if it is unavoidable.
<br/>
There are three ways how data could be missing:
- Neither polished nor unpolished data about the German COVID-19 cases are saved on the machine. In this case a new pull from the API is inevitable.
- No polished version of the data exists on the machine, but a backup of an old API pull does. In this case the program initiates a pull from the API if the user requested one; otherwise it "pulls" from the backup file with the unpolished data.
- No backup of an old API pull exists, but a polished version of the data does (guaranteed by the first condition). In this case the program initiates a pull from the API if the user requested one; otherwise it uses the polished data.


<br/>
In any case the global control variables are changed accordingly.
<br/><br/>
The "number_of_counties" is also set here: It determines how many counties must be present in the data. If there are fewer or more, the current data source is declared a failure and (if possible) another one is used.

In [3]:
if (not(os.path.isfile("modified_data/german_covid19.txt")) and 
    not(os.path.isfile("unmodified_data/covid19/dates.txt"))):    # no files
    covid19_use_polished_data = False
    covid19_use_api_backup = False
    covid19_use_api = True
elif not(os.path.isfile("modified_data/german_covid19.txt")):    # no polished version
    # and os.path.isfile("unmodified_data/covid19/dates.txt") due to first condition
    covid19_use_polished_data = False
    # ensuring that one of the other two data sources is used
    covid19_use_api_backup = not(covid19_use_api)
elif not(os.path.isfile("unmodified_data/covid19/dates.txt")):    # no backup
    # and os.path.isfile("modified_data/german_covid19.txt") due to first condition
    covid19_use_api_backup = False
    # ensuring that one of the others is used
    covid19_use_polished_data = not(covid19_use_api)

In [4]:
number_of_counties = 412
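
The fallback rules of this chapter can also be sketched as a small pure function. This is only an illustration: the name `resolve_covid19_source` and its parameters do not exist in the notebook, which changes the global control variables directly.

```python
def resolve_covid19_source(polished_exists, backup_exists,
                           use_api, use_backup, use_polished):
    """Return the adjusted flags (use_api, use_backup, use_polished)."""
    if not polished_exists and not backup_exists:
        # nothing is stored locally, so only the API is left
        return True, False, False
    if not polished_exists:
        # only the backup exists: use it unless an API pull was requested
        return use_api, not use_api, False
    if not backup_exists:
        # only the polished file exists: use it unless an API pull was requested
        return use_api, False, not use_api
    # both files exist: keep the user's choice
    return use_api, use_backup, use_polished


# The user asked for the backup, but only the polished file exists.
flags = resolve_covid19_source(polished_exists=True, backup_exists=False,
                               use_api=False, use_backup=True, use_polished=False)
print(flags)
```

In this example the request for the backup cannot be fulfilled, so the polished data is selected instead.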

## Get geographical data of every German county
If "counties_geography_use_polished_data" is set to True and the needed file exists, the polished data from that file is used. <br/>
If "counties_geography_use_polished_data" is set to False by the user or if the needed file does not exist, the file "get_geographical_data_of_german_counties.ipynb" is run to provide new polished data.<br/>
For more information on where the data comes from and how it gets polished, check out the file "get_geographical_data_of_german_counties.ipynb".

In [5]:
if not(os.path.isfile("modified_data/german_counties_geography.txt")):
    counties_geography_use_polished_data = False

In [6]:
if counties_geography_use_polished_data:
    with open("modified_data/german_counties_geography.txt", "r") as file:
        counties_geography = json.loads(file.read())
    print("Polished county data from file is ready to go!")
else:
    no_outputs_from_file_get_shapes_of_german_counties = True
    %run get_geographical_data_of_german_counties.ipynb

Polished county data from file is ready to go!


## Get the COVID-19 cases of every German county
Saves the COVID-19 cases of every German county since the start of the pandemic in the dictionary "covid19" (reachable by the county's AdmUnitID) and the corresponding dates in the dictionary "non_county_specific_data".
<br/><br/>
In the control chapter the user presets what shall be done and the program checks whether the actions are possible or not.<br/>
If "covid19_use_api" is set to True, the program pulls from the [COVID-19 Datenhub provided by the Robert-Koch-Institut](https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/6d78eb3b86ad4466a8e264aa2e32a2e4_0). The data of each county must be pulled separately because the API only allows 1000 datapoints at a time and the number of counties times the number of days is well over 100,000. The identifiers of the counties originate from the keys of the dictionary "counties_geography".<br/><br/>
The raw data gets checked: If any county has a different number of timestamps than the dates stored in "non_county_specific_data['unixtime']", all data gets deleted to prevent the use of faulty data and an alternative data source is chosen.<br/><br/>
If the unpolished data passes this rudimentary test, it is stored as it is in a txt-file with its AdmUnitID as name in the folder "covid19" inside the folder "unmodified_data". If any of the folders or files do not yet exist, they get created.<br/>
These files can be used in later executions as a local backup of the API pull.
<br/>
<br/>
If the use of the data from a local backup of the API-pull is requested and possible, the data is provided without further tests.
<br/><br/>
The polished version of the data gets stored in the variable "covid19", a dictionary keyed by the county's AdmUnitID that contains, for each county, a list of the accumulated cases from every day of the pandemic.
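
The reason for pulling every county separately can be checked with a short calculation. The limit of 1000 datapoints is the one stated in this chapter and 412 is the value of "number_of_counties"; the number of days is a hypothetical example value.

```python
import math

MAX_RECORDS_PER_REQUEST = 1000    # limit of the API as stated in this chapter
NUMBER_OF_COUNTIES = 412          # value of number_of_counties in this notebook
days = 300                        # hypothetical number of days covered so far

# all counties times all days clearly exceeds the limit of a single request
total_datapoints = NUMBER_OF_COUNTIES * days
requests_needed_at_least = math.ceil(total_datapoints / MAX_RECORDS_PER_REQUEST)

# one request per county only returns everything while this condition holds
one_request_per_county_is_enough = days <= MAX_RECORDS_PER_REQUEST

print(total_datapoints, requests_needed_at_least, one_request_per_county_is_enough)
```

Under these assumptions, a single request per county is sufficient only as long as the number of days does not exceed the record limit.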

### url_county(AdmUnitID, True_for_dates_False_for_covid19_cases = False): returns url
Used to get the URL for the COVID-19 cases of the German county determined by the AdmUnitID.<br/>
**AdmUnitID**<br/>
-> identifier of the county whose COVID-19 cases shall be requested<br/>
**True_for_dates_False_for_covid19_cases** (default: False)<br/>
-> Determines whether the dates in Unix time or the actual COVID-19 cases shall be requested
### find_alternative_source_of_data_and_activate_it(): returns void (modifies multiple global variables)
Gets called when the data from a data source is faulty. Deletes the faulty data to prevent its use, checks which other data source could be used and modifies the global variables accordingly.

In [7]:
def url_county(AdmUnitID, True_for_dates_False_for_covid19_cases = False):
    url = ("https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/" +
           "rki_history_hubv/FeatureServer/0/query?where=AdmUnitId%3D" +
           str(AdmUnitID) + "&outFields=")
    if True_for_dates_False_for_covid19_cases:
        return url + "Datum&orderByFields=Datum&f=pjson"
    return url + "KumFall&orderByFields=Datum&f=pjson"
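
To see what the two variants of the URL request, the query string can be taken apart with the standard library. The function is repeated here only to keep the sketch self-contained, and "1001" is a placeholder identifier chosen for the example.

```python
from urllib.parse import urlsplit, parse_qs


def url_county(AdmUnitID, True_for_dates_False_for_covid19_cases=False):
    # same construction as in the cell above
    url = ("https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/" +
           "rki_history_hubv/FeatureServer/0/query?where=AdmUnitId%3D" +
           str(AdmUnitID) + "&outFields=")
    if True_for_dates_False_for_covid19_cases:
        return url + "Datum&orderByFields=Datum&f=pjson"
    return url + "KumFall&orderByFields=Datum&f=pjson"


# split the URLs into their query parameters
cases_query = parse_qs(urlsplit(url_county("1001")).query)
dates_query = parse_qs(urlsplit(url_county("1001", True)).query)
print(cases_query)
print(dates_query)
```

Both variants filter by the same county and order the result by date; they differ only in the requested field ("KumFall" or "Datum").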

In [8]:
def find_alternative_source_of_data_and_activate_it():
    global covid19_use_api
    global covid19_use_api_backup
    global covid19_use_polished_data
    global copy_of_covid19_for_debugging_purposes
    global copy_of_non_county_specific_data_for_debugging_purposes
    global covid19
    global non_county_specific_data
    # keep copies of the faulty data so that they can be inspected afterwards
    copy_of_non_county_specific_data_for_debugging_purposes = non_county_specific_data.copy()
    copy_of_covid19_for_debugging_purposes = covid19.copy()
    del non_county_specific_data    # to prevent accidental use of faulty data
    del covid19    # to prevent accidental use of faulty data
    # check if a local backup of an API pull exists and if polished data exists
    if os.path.isfile("unmodified_data/covid19/dates.txt"):
        covid19_use_api_backup = True
    if os.path.isfile("modified_data/german_covid19.txt"):
        covid19_use_polished_data = True
    # neither local backup nor polished data found
    if not(covid19_use_api_backup) and not(covid19_use_polished_data):
        raise Exception("No usable data found!")

### Pull from API

In [9]:
# check if new pull from the API is necessary or wished and
# if it is even possible otherwise "pull" from local backup
if covid19_use_api:
    print("Pulling from API...")
    covid19 = dict()
    non_county_specific_data = dict()
    # check if the needed directory is available - otherwise create it
    if not(os.path.isdir("unmodified_data/covid19")): os.makedirs("unmodified_data/covid19")
    number_of_timestamps = -1
    
    # get data - every county must be called individually because of the Max Record Count of the API
    for AdmUnitID in list(counties_geography.keys()):
        # get dates of first county
        if number_of_timestamps == -1:
            raw_dates = readfromurl(url_county(AdmUnitID, True))
            if len(raw_dates['features']) < 200:
                print("The dates of {} contain too few timestamps ({}) - check the url"
                      .format(AdmUnitID, len(raw_dates['features'])))
                find_alternative_source_of_data_and_activate_it()
                covid19_use_api = False
                break
            number_of_timestamps = len(raw_dates['features'])
            non_county_specific_data['unixtime'] = [e['attributes']['Datum'] for e in raw_dates['features']]
            # save raw data
            with open("unmodified_data/covid19/dates.txt", "w") as file:
                file.write(json.dumps(raw_dates))

        # get the county's covid19 data
        raw_covid19_data = readfromurl(url_county(AdmUnitID))
        if number_of_timestamps != len(raw_covid19_data['features']):
            print("The data provided by the API does not have the expected number of timestamps " +
                  "({}), it has {}.".format(number_of_timestamps, len(raw_covid19_data['features'])))
            find_alternative_source_of_data_and_activate_it()
            covid19_use_api = False
            break
        covid19[AdmUnitID] = dict()
        covid19[AdmUnitID]['cases'] = [e['attributes']['KumFall'] for e in raw_covid19_data['features']]
        with open("unmodified_data/covid19/" + AdmUnitID + ".txt", "w") as file:
            file.write(json.dumps(raw_covid19_data))
        
    if covid19_use_api:
        covid19_use_polished_data = False
        covid19_use_api_backup = False
        print("Covid19 Data directly from API is ready to go!")

### "Pull" from local API backup

In [10]:
# Use data from local backup originating from old API pull
# covid19_use_api could be modified in the if-statement - therefore no else-statement here
if not(covid19_use_api) and covid19_use_api_backup:
    print("Reading backup of old API pull...")
    covid19 = dict()
    non_county_specific_data = dict()
    list_of_countys = list(counties_geography.keys())
    # get the dates
    with open("unmodified_data/covid19/dates.txt", "r") as file:
        raw_dates = json.loads(file.read())
    non_county_specific_data['unixtime'] = [e['attributes']['Datum'] for e in raw_dates['features']]
    number_of_timestamps = len(non_county_specific_data['unixtime'])

    for root, dirs, files in os.walk('unmodified_data/covid19'):
        # too few dates - something is wrong. Checking here to skip for-loop
        if len(raw_dates['features']) < 200:
            print("There are only {} dates - check your backup or make a new pull from the api."
                  .format(len(raw_dates['features'])))
            find_alternative_source_of_data_and_activate_it()
            covid19_use_api_backup = False    # the backup is the source that failed here
            break
        for filename in files:
            AdmUnitID = filename[:-4]
            if AdmUnitID == 'dates':    # already done
                continue

            list_of_countys.remove(AdmUnitID)
            covid19[AdmUnitID] = dict()
            with open(os.path.join(root, filename), "r") as file:
                covid19[AdmUnitID]['cases'] = [e['attributes']['KumFall'] for e in json.loads(file.read())['features']]

            if number_of_timestamps != len(covid19[AdmUnitID]['cases']):
                print("The data from file {} does not have {} timestamps, it has {}."
                      .format(filename, number_of_timestamps, len(covid19[AdmUnitID]['cases'])))
                find_alternative_source_of_data_and_activate_it()
                covid19_use_api_backup = False
                break

    if len(list_of_countys) > 0 and covid19_use_api_backup:
        print("No backup found for {}".format(list_of_countys))
        find_alternative_source_of_data_and_activate_it()
        covid19_use_api_backup = False

    if covid19_use_api_backup:
        covid19_use_polished_data = False
        covid19_use_api = False
        print("Covid19 Data from (maybe old) API-pull-backup is ready to go!")

## Get names of the german federal states
This data is hardcoded because it is unlikely to change. Even if the names of the federal states become outdated and no longer match the current official names, the functionality of this project will not be affected.
The names originate from the [COVID-19 Datenhub provided by the Robert-Koch-Institut](https://services7.arcgis.com/mOBPykOjAyBO2ZKk/arcgis/rest/services/rki_admunit_hubv/FeatureServer/0/query?where=AdmUnitId%3C20&resultType=none&outFields=*&f=pjson).

In [11]:
if not(covid19_use_polished_data):
    non_county_specific_data['states'] = {
        "1" : "Schleswig-Holstein",
        "2" : "Hamburg",
        "3" : "Niedersachsen",
        "4" : "Bremen",
        "5" : "Nordrhein-Westfalen",
        "6" : "Hessen",
        "7" : "Rheinland-Pfalz",
        "8" : "Baden-Württemberg",
        "9" : "Bayern",
        "10" : "Saarland",
        "11" : "Berlin",
        "12" : "Brandenburg",
        "13" : "Mecklenburg-Vorpommern",
        "14" : "Sachsen",
        "15" : "Sachsen-Anhalt",
        "16" : "Thüringen"}

### Calculate seven-day incidence and population density
The incidence is calculated from the accumulated number of cases seven days prior (set to zero if not defined), the accumulated cases of the current day (both from "covid19[AdmUnitID]['cases']") and the number of inhabitants ("counties_geography[AdmUnitID]['population']").<br/>
To get all new cases in that county within the last seven days, the program subtracts the accumulated cases seven days earlier from the accumulated cases of the current day. Afterwards this number of cases is divided by the county's population. To scale it to 100,000 inhabitants, the result is multiplied by 100,000.<br/>
This is done for every case number of every county.<br/><br/>
The population density is calculated by dividing the population by the area in square meters. To convert it to inhabitants per square kilometer, the result is multiplied by 1,000,000.
<br/><br/>
The highest and lowest seven-day incidence, the highest and lowest case number and the highest and lowest population density are stored in the dictionary "non_county_specific_data" to have a reference.
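
The two calculations described above can be tried out on a small example. The function names and the numbers of the example county are assumptions made for this sketch; the formulas are the ones described in this chapter.

```python
def seven_day_incidences(accumulated_cases, population):
    """Seven-day incidence per 100,000 inhabitants for every day."""
    incidences = []
    for day, cases_on_day in enumerate(accumulated_cases):
        # during the first seven days the earlier value is not defined and counts as zero
        cases_7_days_prior = accumulated_cases[day - 7] if day >= 7 else 0
        incidences.append((cases_on_day - cases_7_days_prior) * 100000 / population)
    return incidences


def population_density_per_km2(population, area_in_m2):
    """Inhabitants per square kilometer from an area given in square meters."""
    return population * 1000000 / area_in_m2


# Hypothetical county: 200,000 inhabitants on 400 square kilometers
cases = [0, 10, 20, 30, 40, 50, 60, 70, 100]
incidences = seven_day_incidences(cases, 200000)
density = population_density_per_km2(200000, 400000000)
print(incidences[-1], density)
```

On the last day of the example, 90 new cases fall into the seven-day window (100 minus 10), which corresponds to an incidence of 45 per 100,000 inhabitants.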

In [12]:
if not(covid19_use_polished_data):
    non_county_specific_data['highest_case_number'] = 0
    non_county_specific_data['lowest_case_number'] = 100000000000000
    non_county_specific_data['highest_incidence'] = 0
    non_county_specific_data['lowest_incidence'] = 100000000000000
    for AdmUnitID in covid19.keys():
        covid19[AdmUnitID]['incidences'] = list()
        for timestamp in range(len(covid19[AdmUnitID]['cases'])):
            cases_7_days_prior = 0
            cases_on_day = covid19[AdmUnitID]['cases'][timestamp]
            if timestamp >= 7:
                cases_7_days_prior = covid19[AdmUnitID]['cases'][timestamp - 7]
            incidence = (((cases_on_day - cases_7_days_prior) * 100000) /
                         counties_geography[AdmUnitID]['population'])
            covid19[AdmUnitID]['incidences'].append(incidence)
            if non_county_specific_data['highest_case_number'] < cases_on_day:
                non_county_specific_data['highest_case_number'] = cases_on_day
            if non_county_specific_data['lowest_case_number'] > cases_on_day:
                non_county_specific_data['lowest_case_number'] = cases_on_day
            if non_county_specific_data['highest_incidence'] < incidence:
                non_county_specific_data['highest_incidence'] = incidence
            if non_county_specific_data['lowest_incidence'] > incidence:
                non_county_specific_data['lowest_incidence'] = incidence

In [13]:
if not(covid19_use_polished_data):
    # is calculated here instead of inside the get_shapes_of_german_counties.ipynb-file
    # to be able to put it together in one dictionary non_county_specific_data
    non_county_specific_data['highest_population_density'] = 0
    non_county_specific_data['lowest_population_density'] = 100000000000000
    for county in counties_geography.values():
        county["population_density"] = (county['population'] * 1000000)/county['area_in_m2']
        if non_county_specific_data['highest_population_density'] < county["population_density"]:
            non_county_specific_data['highest_population_density'] = county["population_density"]
        if non_county_specific_data['lowest_population_density'] > county["population_density"]:
            non_county_specific_data['lowest_population_density'] = county["population_density"]

## Check and save the polished covid19 data
Before the COVID-19 cases get saved into the file "german_covid19.txt" inside the folder "modified_data", they get checked once again to ensure that nothing got lost or changed during the handling.
<br/>
It is checked whether there are fewer or more counties than defined in the variable number_of_counties and whether every list of cases is as long as the list of dates.
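
A minimal sketch of these two checks, written as a function that collects all problems instead of printing them. The function name `find_problems` and the sample data are assumptions made for this illustration and are not part of the notebook.

```python
def find_problems(covid19, unixtime, number_of_counties):
    """Return a list of messages, one for every inconsistency found."""
    problems = []
    if len(covid19) != number_of_counties:
        problems.append("{} counties instead of {}".format(len(covid19), number_of_counties))
    for AdmUnitID, county_data in covid19.items():
        if len(county_data['cases']) != len(unixtime):
            problems.append("county {} has {} entries instead of {}".format(
                AdmUnitID, len(county_data['cases']), len(unixtime)))
    return problems


# Hypothetical data: one county is missing and one list of cases is too short
sample = {"1001": {"cases": [0, 1, 2]}, "1002": {"cases": [0, 1]}}
problems = find_problems(sample, unixtime=[1, 2, 3], number_of_counties=3)
print(problems)
```

An empty list would mean that the data seems to be flawless and can be saved.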

In [14]:
if not(covid19_use_polished_data):
    covid19_data_seems_to_be_flawless = True    # Assume everything is correct
    if len(covid19) != number_of_counties:
        print("covid19 has not the right amount of counties: {} instead of {}."
                .format(len(covid19), number_of_counties))
        covid19_data_seems_to_be_flawless = False
    for AdmUnitID in covid19.keys():
        if len(covid19[AdmUnitID]['cases']) != len(non_county_specific_data['unixtime']):
            print("The county {} has not the right amount of dates: {} instead of {}."
                    .format(AdmUnitID, len(covid19[AdmUnitID]['cases']),
                            len(non_county_specific_data['unixtime'])))
            covid19_data_seems_to_be_flawless = False

In [15]:
if not(covid19_use_polished_data) and covid19_data_seems_to_be_flawless:
    # check if the needed directory is available - otherwise create it
    if not(os.path.isdir("modified_data")): os.makedirs("modified_data")
    with open("modified_data/german_covid19.txt", "w") as file:
        file.write(json.dumps((covid19, non_county_specific_data)))
    print("Saved seemingly flawless covid19 data.")

## Get the polished data
If the pull from the API and/or the "pull" from the local backup failed or the user chose to use the polished data, the file "german_covid19.txt" inside the folder "modified_data" gets opened and the data is stored in the variables "covid19" and "non_county_specific_data".
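
The file stores both dictionaries together as one JSON array with two elements, which is why they can be unpacked into two variables when loading. The following sketch shows this round trip in memory with small hypothetical values instead of the real file.

```python
import json

# Hypothetical miniature versions of the two dictionaries
covid19_example = {"1001": {"cases": [0, 3, 7]}}
non_county_specific_example = {"unixtime": [1583020800000, 1583107200000, 1583193600000]}

# the tuple is written as a JSON array with two elements
serialized = json.dumps((covid19_example, non_county_specific_example))

# loading returns a list of two elements that can be unpacked again
loaded_covid19, loaded_non_county_specific = json.loads(serialized)
print(loaded_covid19 == covid19_example)
```

Keys of JSON objects are always strings, so identifiers stored as strings come back unchanged.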

In [16]:
if covid19_use_polished_data:
    covid19_use_api_backup = False
    covid19_use_api = False
    with open("modified_data/german_covid19.txt", "r") as file:
        covid19, non_county_specific_data = json.loads(file.read())
    print("Polished covid19 data from file is ready to go!")

Polished covid19 data from file is ready to go!


## Add UTC time
Humans are generally not used to Unix time, which is why the more familiar UTC date and time format is used. The exact hour in Germany and the time shift are not taken into account, because the data is only compared to other data with the same time shift.
<br/><br/>
The UTC time is added after saving the data because datetime objects cannot be saved in JSON format. Therefore the conversion must be done on every run. Calculating it inside the file "get_data.ipynb" keeps the plotting of the data strictly separated from the pulling and modifying of the data.

In [17]:
non_county_specific_data['UTC'] = [datetime.datetime.utcfromtimestamp(date/1000)
                           for date in non_county_specific_data['unixtime']]
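
As a possible alternative, the conversion can be sketched with timezone-aware objects, since "datetime.datetime.utcfromtimestamp" is deprecated as of Python 3.12. The function name `unix_ms_to_utc` is an assumption made for this sketch, and aware datetime objects are not interchangeable with the naive ones created above, so code that uses the result may need to be adapted.

```python
import datetime


def unix_ms_to_utc(unix_ms):
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return datetime.datetime.fromtimestamp(unix_ms / 1000, tz=datetime.timezone.utc)


# 1583020800000 milliseconds correspond to March 1, 2020, 00:00 UTC
example = unix_ms_to_utc(1583020800000)
print(example.isoformat())
```

The division by 1000 is needed because the timestamps are given in milliseconds, while the conversion expects seconds.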