# Geospatial Map of Germany

This Jupyter Notebook demonstrates a possible visualization (a geospatial choropleth map) of standardized inpatient treatment costs for a selected ICD-10 diagnosis across the German federal states ("Bundesländer").

A similar map at other administrative levels (e.g. German districts) can be made by changing a few lines of the code.

I identified the following necessary tasks:

1. find and load an adequate GeoFile (preferably a shapefile, `.shp`) --> I used an official shapefile from the Federal Agency for Cartography and Geodesy (BKG) of Germany

2. create random but reproducible data to visualize standardized inpatient treatment costs (e.g. costs per inhabitant)
--> data on inpatient treatment costs in Germany can be requested from Destatis, the Federal Statistical Office of Germany ("DRG-Statistik"). However, these data are not free of charge. Therefore, I decided to create my own random data.

3. create a pleasant and informative graph with Matplotlib (other good options would have been Plotly or Seaborn)


## Setup:

In [None]:
# import necessary modules
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

## __1.) find and load an adequate GeoFile__

First, an adequate GeoFile had to be found and downloaded. I decided to look for official GeoFiles and found the ones provided by the German Federal Agency for Cartography and Geodesy (BKG). GeoFiles for different administrative levels and scales are available at the following link:

https://gdz.bkg.bund.de/index.php/default/digitale-geodaten/verwaltungsgebiete.html

I used the data at the 1:2 500 000 scale, which can be found at this link:

https://gdz.bkg.bund.de/index.php/default/digitale-geodaten/verwaltungsgebiete/verwaltungsgebiete-1-2-500-000-stand-31-12-vg2500-12-31.html

Furthermore, I chose the GeoFile with the coordinate reference system (CRS) "3-degree Gauss-Krüger zone 3", which corresponds to EPSG:31467.
(For an introduction to coordinate reference systems and the use of geospatial data in Python, see the corresponding course at https://kaggle.com/learn/geospatial-analysis)
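
If a later step expects a different CRS, GeoPandas can reproject the data. The following is a minimal sketch on a made-up square rather than the real VG2500 data; the polygon coordinates and the name `toy` are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical 10 km square in Gauss-Krüger zone 3 coordinates (metres);
# not taken from the VG2500 file.
square = Polygon([
    (3500000, 5700000),
    (3510000, 5700000),
    (3510000, 5710000),
    (3500000, 5710000),
])
toy = gpd.GeoDataFrame({"GEN": ["toy area"]}, geometry=[square], crs="EPSG:31467")

# Reproject to WGS 84 longitude/latitude, e.g. for use with web maps.
toy_wgs84 = toy.to_crs("EPSG:4326")

print(toy.crs.to_epsg())
print(toy_wgs84.crs.to_epsg())
print(toy_wgs84.total_bounds)  # (min lon, min lat, max lon, max lat)
```

For the choropleth map in this notebook no reprojection is required; the sketch only shows how the CRS could be changed if needed.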

Last but not least, I decided to use the file for the administrative level of the German federal states ("Bundesländer"), which is marked by the abbreviation "LAN".

In [None]:
# read in GeoFile
inpatient_costs_germany = gpd.read_file("Path-to-GeoFile.shp")

# check CRS
print(inpatient_costs_germany.crs)

# check the data
print(inpatient_costs_germany.head())
print(inpatient_costs_germany.tail())
print(inpatient_costs_germany.columns.values)
inpatient_costs_germany.plot()
plt.show()

After checking the data, I noticed that the plot showed more boundaries than expected (Germany has only 16 federal states). Therefore, I read the accompanying product documentation provided by the Federal Agency for Cartography and Geodesy ("Produktdokumentation VG2500"). It states that the file contains different boundaries at the level of the federal states, depending on whether one wants to include the areas that belong to Germany but are located in water (the North Sea, the Baltic Sea and Lake Constance). I was only interested in the boundaries on the land mass of Germany. That is why I used the variable "GF" (which stands for "Geofactor") to restrict the data to the land mass of Germany (= value 9):

In [None]:
# restrict the data
inpatient_costs_germany = inpatient_costs_germany.loc[inpatient_costs_germany['GF'] == 9]

# check the data again
inpatient_costs_germany.plot()
plt.show()
# this looks good now!
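
One way to see where extra boundaries come from is to count the rows per `GF` value before filtering. The sketch below uses a small made-up table, not the VG2500 attribute table; apart from the value 9 for the land mass, the `GF` value in it is a placeholder and not a documented code.

```python
import pandas as pd

# Made-up attribute table: two states, each with a land row (GF == 9)
# and one extra water row. The GF value 0 is a placeholder, not a documented code.
toy = pd.DataFrame({
    "GEN": ["Bremen", "Bremen", "Hamburg", "Hamburg"],
    "GF": [9, 0, 9, 0],
})

# Rows per GF value: more than one distinct value explains the extra shapes.
print(toy["GF"].value_counts())

# Keep only the land-mass rows and confirm that each state now appears once.
land = toy.loc[toy["GF"] == 9]
print(land["GEN"].is_unique)
print(len(land))
```

With the real file, the same check should leave 16 rows, one per federal state.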

Then I restricted the DataFrame to the two columns needed for the choropleth map: "GEN" (the names of the 16 German federal states) and "geometry" (the polygons, i.e. the geospatial data).

In [None]:
# keep only the state names and the geometries
inpatient_costs_germany = inpatient_costs_germany[['GEN', 'geometry']]
print(inpatient_costs_germany.shape)
print(inpatient_costs_germany.head())

## __2.) create random, but reproducible data__

With real data, one would have to extract, clean, format, and normalize the "DRG-Statistik" data before creating the choropleth map. In my case (with random data), none of these steps is necessary. I just had to create a random array of plausible values and merge it into the DataFrame.
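
A minimal sketch of that step, assuming NumPy's seeded `default_rng` for reproducibility and a merge on the `GEN` column. The seed, the mean, the spread, and the column name `cost_per_inhabitant` are illustrative assumptions, and a plain pandas DataFrame with three state names stands in for the GeoDataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in for the GeoDataFrame (geometry column omitted for brevity).
states = pd.DataFrame({"GEN": ["Bayern", "Hessen", "Sachsen"]})

# A fixed seed makes the "random" numbers identical on every run.
rng = np.random.default_rng(seed=42)
costs = pd.DataFrame({
    "GEN": states["GEN"],
    # Assumed distribution: normal with placeholder mean 30 and spread 5.
    "cost_per_inhabitant": rng.normal(loc=30.0, scale=5.0, size=len(states)).round(2),
})

# Merge on the state name; validate guards against duplicated names.
merged = states.merge(costs, on="GEN", how="left", validate="one_to_one")
print(merged)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(seed=42)
again = rng_again.normal(loc=30.0, scale=5.0, size=len(states)).round(2)
print(np.array_equal(again, merged["cost_per_inhabitant"].to_numpy()))
```

Merging on the name column rather than relying on row order keeps the values attached to the right state even if the GeoDataFrame is later sorted or filtered.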

I chose an arbitrary ICD-10 diagnosis for which I simulated inpatient treatment costs per inhabitant (a normalization that accounts for the varying number of inhabitants in the German federal states): ICD-10 code C34, "malignant neoplasm of bronchus and lung".

First, I had to come up with plausible data. Therefore, I looked for a study that estimated the average economic burden of lung cancer in the European Union. I found such estimates in the following paper by Luengo-Fernandez et al. (2013): _"Economic burden of cancer across the European Union: a population-based cost analysis"_

The paper was published in "The Lancet Oncology" and can be found at the following link:
https://doi.org/10.1016/S1470-2045(13)70442-X

Luengo-Fernandez et al. (2013) calculated that cancer cost the European Union €126 billion in 2009.
Lung cancer had the highest economic burden of all cancers, with costs of €18.8 billion, which is 15% of the €126 billion total.
They also calculated that, on average, Germany spent €182 per inhabitant on health care services related to cancer.

Based on these figures, I extrapolated the average cost of health care services related to lung cancer as follows:

total health care costs per inhabitant for all cancers in Germany * share of lung cancer in the total cancer costs in the EU
= €182 * 0.15 = €27.3 per inhabitant.
So, I assumed that the average health care costs of lung cancer in Germany were €27.3 per inhabitant.
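
The extrapolation can be checked in a few lines. The sketch below uses only the three figures quoted above; the variable names are illustrative. It also shows how much rounding the lung cancer share to 15% changes the result.

```python
# Figures quoted from Luengo-Fernandez et al. (2013), as cited above.
total_cancer_cost_eu = 126.0            # billion euros, all cancers, EU, 2009
lung_cancer_cost_eu = 18.8              # billion euros, lung cancer, EU, 2009
cancer_cost_per_inhabitant_de = 182.0   # euros per inhabitant, Germany

# Share of lung cancer in total EU cancer costs.
share_exact = lung_cancer_cost_eu / total_cancer_cost_eu
share_rounded = round(share_exact, 2)

# Extrapolated lung cancer cost per inhabitant in Germany.
estimate_rounded = cancer_cost_per_inhabitant_de * share_rounded
estimate_exact = cancer_cost_per_inhabitant_de * share_exact

print(f"share: {share_exact:.4f} (rounded: {share_rounded})")
print(f"estimate with rounded share: {estimate_rounded:.2f} EUR")
print(f"estimate with exact share:   {estimate_exact:.2f} EUR")
```

The rounded share gives €27.30, while the unrounded share gives about €27.16, so the rounding has little effect on a value that only serves as a plausible centre for simulated data.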
