# Project 2: Corona/Weather Data Science - FYP 2021
## How weather influences the spread of the pandemic 
---
### Group 9: Aidan Stocks, Christian Margo Hansen, Jonas-Mika Senghaas, Malthe Pabst, Rasmus Bondo Hansen
Submission: 19.03.2021 / Last Modified: 09.03.2021

---

This notebook contains the step-by-step data science process performed on the `IBM Weather Data` from 2020 and the official `Corona Statistics` in 2020. 
The goal was to inform the authorities of **Germany** (*Group 9 Focus*) about the development of the pandemic in 2020 and to investigate possible relations between environmental conditions and the spread of the pandemic.

The raw datasets were given by the course manager. *Link source*

> *Contact for technical difficulties or questions related to code: Jonas-Mika Senghaas (jsen@itu.dk)*

### Introduction
The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern in January 2020 and a pandemic in March 2020. As of 5 March 2021, more than 115 million cases have been confirmed, with more than 2.56 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history. (Source: [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic)).

The goal of this project is to analyse the development of the pandemic in Germany from February 2020, while paying special attention to environmental factors that may relate to higher chances of infection and therefore a faster spread of the pandemic.

## Running this Notebook
---
This notebook contains all code needed to reproduce the findings of the project, as can be seen on the [GitHub page](https://github.com/jonas-mika/fyp2021p02g09) of this project. To read in the data correctly, the global paths configured in the section `Constants` need to be correct. The following file structure, as prepared in `submission.zip`, was followed throughout the project and is recommended (alternatively, the paths in the section `Constants` can be adjusted):

```
submission
|   github_link.webloc
│   project_report.pdf
│   gitlog.txt    
│
└───data
│   │
│   | raw
│   | │   
│   | │   
│   | │   
|   |
|   | 
|  
│   
└───notebooks
    |   project2.ipynb (current location)
    |   project2
        |   __init__.py
        |   visualisations.py
        |   ...
    
```

# Required Libraries
---
Throughout the project, we will use a range of both built-in and external Python libraries. This notebook will only run if all libraries and modules are correctly installed on your local machine. 
To install missing packages, use `pip install <package_name>` (`pip` is the standard package installer for Python and installs packages from the Python Package Index (PyPI); read more [here](https://pypi.org/project/pip/)). 

If you would like further information about the packages used, the following links lead to their detailed documentation:
- [Pandas Homepage](https://pandas.pydata.org/)
- [Numpy Homepage](https://numpy.org/)
- [Matplotlib Homepage](https://matplotlib.org/stable/index.html)
- [Folium Documentation](https://python-visualization.github.io/folium/)
- [Scipy Homepage](https://www.scipy.org/)
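
Before running the notebook, it can help to check which of the external packages are actually available. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the list `REQUIRED` are illustrative and not part of the project code.

```python
import importlib.util

# Packages imported by this notebook (names as used in `import` statements).
REQUIRED = ["pandas", "numpy", "matplotlib", "folium", "scipy"]

def find_missing(packages):
    """Return the packages from `packages` that cannot be found on this machine."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

The check only tests whether a package can be located; it does not verify that the installed version is compatible with this notebook.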

In [None]:
import pandas as pd                                    # provides major datastructure pd.DataFrame() to store the datasets
import numpy as np                                     # used for numerical calculations and fast array manipulations
import matplotlib.pyplot as plt                        # visualisation of data
import matplotlib.dates as mdates
import datetime as dt
import re
import folium                                          # spatial visualisation
import json                                            # data transfer to json format
import os                                              # automates saving of export files (figures, summaries, ...)
import random                                          # randomness in coloring of plots

## Own Package
---
Since this project makes heavy use of functions to achieve maximal efficiency, all functions are stored externally in the package `project2`. The following imports are necessary for this notebook to run properly.

In [None]:
from project2.processing import check_columns_for_missing_values
from project2.numerical_summary import get_uniques_and_counts, get_fivenumsummary, compute_numerical_summary
from project2.visualisations import initialise_summary, barplot, histogram, boxplot, categorical_scatterplot, categorical_association_test
from project2.spatial_visualisation import plot_marker, random_color, map_accidents
from project2.save import save_csv, save_json, save_figure, save_dict, save_map, save_all_single_variable_analysis, save_all_categorical_scatters, save_all_categorical_associations

**REMARK**: All functions used in this project are documented in their `Docstring`. To display the docstring and get a short summary of a function, the specification of its input arguments (including data type and a short explanation) and its return value, type **`?<function_name>`** in Jupyter (*see example*)
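
As an illustration of what such a docstring looks like and how it can be read outside of Jupyter, here is a small sketch. The function `to_celsius` is a hypothetical example and is not taken from the `project2` package.

```python
def to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius.

    Parameters
    ----------
    kelvin : float
        Temperature in Kelvin.

    Returns
    -------
    float
        Temperature in degrees Celsius.
    """
    return kelvin - 273.15

# In Jupyter, `?to_celsius` shows the docstring. In plain Python, the same
# text is available through help() or the __doc__ attribute.
help(to_celsius)
first_line = to_celsius.__doc__.splitlines()[0]
print(first_line)
```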

# Constants
---
To enhance readability and to decrease the maintenance effort, it is useful in bigger projects to define, in advance, the constants that need to be accessed globally throughout the whole notebook. 
The following cell contains all of those global constants. By convention, we write them in caps (https://www.python.org/dev/peps/pep-0008/#constants).

In [None]:
# path lookup dictionary to store the relative paths from the directory containing the jupyter notebooks to important directories in the project
PATH = {}

PATH['data'] = {}
PATH['data']['raw'] = "../data/raw/"
PATH['data']['interim'] = "../data/interim/"
PATH['data']['processed'] = "../data/processed/"
PATH['data']['external'] = "../data/external/"

PATH['corona'] = 'corona/'
PATH['metadata'] = 'metadata/'
PATH['shapefiles'] = 'shapefiles/'
PATH['weather'] = 'weather/'

PATH['reports'] = "../reports/"

# filename lookup dictionary storing the most relevant filenames
FILENAME = {}
FILENAME['weather'] = 'weather.csv'
FILENAME['corona'] = 'de_corona.csv'
FILENAME['metadata'] = 'de_metadata.json'
FILENAME['shapefiles'] = 'de.geojson'

# defining three dictionaries to store data. each dictionary will reference several pandas dataframes
DATA_RAW = {}
DATA_GERMANY = {}
DATA_EXTERNAL = {}

REGION_LOOKUP = {}

# automating all plots requires a lot of additional information (e.g. what plot-type to use on the different variables, whether or not we need a five-number summary, etc.). this information is stored in the summary dictionary
SUMMARY = {}
    
NAMES = {}
NAMES['datasets'] = ['weather', 'corona']
NAMES['jsons'] = ['metadata', 'shapefiles']

In [None]:
# load in metadata using json library
for JSON in NAMES['jsons']:
    with open(PATH['data']['raw'] + PATH[JSON] + FILENAME[JSON]) as infile:
        DATA_GERMANY[JSON] = json.load(infile)

In [None]:
REGION_LOOKUP['iso'] = {
    DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_code']: 
    DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_name_en'] 
        for i in range(len(DATA_GERMANY['metadata']['country_metadata']))} # key: iso, value: region name

REGION_LOOKUP['region'] = {
    DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_name_en']: 
    DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_code']                            
        for i in range(len(DATA_GERMANY['metadata']['country_metadata']))} # key: region name, value: iso

# Loading, Inspection and Processing of Datasets (TASK 0)
---

## Loading in the Datasets
---
The data analysis revolves around a handful of datasets from different sources: 
> *CSV*: Corona (DE) -  Contains the `Number of new infections (per day)` and `Number of new casualties (per day)` filtered by day and region in Germany.

> *CSV*: Weather - Contains information about several indicators of weather conditions for each region in Germany, Denmark, Sweden and the Netherlands for each day during the period 13.02.2020 - 21.02.2021 (if `weather.csv` and `weather2.csv` are combined)

> *JSON*: Metadata (DE) - Contains more information about the different regions in Germany

> *GEOJSON*: Geojson (DE) - Holds the `geojson` data for the different regions in Germany

We conveniently load in the `csv` datasets into individual Pandas `DataFrames` using the built-in pandas method `pd.read_csv()` ([Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). We store those in our `DATA_RAW` dictionary in the corresponding keys.

For the `json` and `geojson` files, we use Python's built-in library `json` in order to store their content in Python `dicts`.
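
The two loading mechanisms can be tried out on in-memory toy data. This is a sketch only: the strings below are made-up stand-ins for the raw files, and the real files contain more columns and entries.

```python
import io
import json
import pandas as pd

# Toy stand-ins for the raw files; the real contents differ.
tsv_text = "date\tiso3166-2\tUVIndex\n2020-02-13\tDE-HH\t1.5\n2020-02-14\tDE-HH\t1.7\n"
json_text = '{"country_metadata": [{"iso3166-2_code": "DE-HH", "iso3166-2_name_en": "Hamburg"}]}'

# sep='\t' tells pandas that the fields are separated by tabs rather than commas.
weather = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# json.load reads a file-like object into nested Python dicts and lists.
metadata = json.load(io.StringIO(json_text))

print(weather.shape)
print(metadata["country_metadata"][0]["iso3166-2_name_en"])
```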

In [None]:
# load in weather and corona data using pandas into the predefined dictionary 'DATA_RAW'
for dataset in NAMES['datasets']:
    DATA_RAW[dataset] = pd.read_csv(PATH['data']['raw'] + PATH[dataset] + FILENAME[dataset], sep = '\t')

Unfortunately, the weather data was given in two separate dataframes, where `weather.csv` contains data in the period of 13.02.2020 - 14.11.2020 and `weather2.csv` contains data in the period 15.11.2020 - 21.02.2021.

For our analysis, we want to consider all weather records in the combined time period. We therefore read `weather2.csv` into another `pd.DataFrame` and then use `pd.concat` to combine our initially loaded weather data with the additional weather data along the vertical axis (`axis=0`).
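
How `pd.concat` stacks two dataframes can be seen on a toy example. The values below are invented, and `ignore_index=True` is an optional addition in this sketch that the notebook cell itself does not use.

```python
import pandas as pd

first = pd.DataFrame({"date": ["2020-11-13", "2020-11-14"], "UVIndex": [0.8, 0.7]})
second = pd.DataFrame({"date": ["2020-11-15", "2020-11-16"], "UVIndex": [0.6, 0.9]})

# axis=0 stacks rows; ignore_index=True renumbers the row labels 0..n-1
# instead of keeping the labels 0, 1, 0, 1 from the two inputs.
combined = pd.concat([first, second], axis=0, ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

Without `ignore_index=True`, the combined frame keeps duplicate row labels, which is harmless for positional access with `iloc` but can be surprising for label-based access with `loc`.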

In [None]:
additional_weather = pd.read_csv(PATH['data']['raw'] + PATH['weather'] + 'weather2.csv', sep= '\t')
DATA_RAW['weather'] = pd.concat([DATA_RAW['weather'], additional_weather])

## Inspection of Datasets
---
We can now have a look at our two datasets to get a good first impression of what kind of data we are dealing with. We start by reporting the number of records and fields/variables in each of the datasets by using the `shape` property of the Pandas `DataFrame`. 
We then continue with an actual look into the data. Similar to the `head` command in the terminal, we can call the method `head()` on our `DataFrames`, which outputs a nice inline representation of the first five data records of the dataset.

### Time Period

In [None]:
for dataset in NAMES['datasets']:
    print(f"{dataset.capitalize()}: {DATA_RAW[dataset]['date'].iloc[0].replace('-','.')} - {DATA_RAW[dataset]['date'].iloc[-1].replace('-','.')}")

### Size (Number of Records and Fields)

In [None]:
for dataset in NAMES['datasets']:
    print(f"{dataset.capitalize()}: {DATA_RAW[dataset].shape}")

### Peek into Datasets (Describing Attributes)

In [None]:
DATA_RAW['weather'].head()

We see that the main dataset `weather` stores weather information for the four countries Germany (DE), Denmark (DK), Sweden (SE) and the Netherlands (NL). It consists of 16,953 rows, each corresponding to a report of the weather conditions on a specified day in a specified region of one of the countries. The dataset holds the weather data recorded for each of the regions from the 13th of February 2020 to the 21st of February 2021. The following variables appear in the dataset:

> `DATE` (YYYY-MM-DD): The day of the weather reports (*Time Attribute*)

> `ISO 3166-2`: ISO 3166-2 code of the region to which the weather report belongs (*Geographic Attribute*)

> `RELATIVE HUMIDITY SURFACE`: (*Numerical Attribute*)

> `SOLAR RADIATION`: (*Numerical Attribute*)

> `SURFACE PRESSURE`: (*Numerical Attribute*)

> `TEMPERATURE ABOVE GROUND`: (*Numerical Attribute*)

> `TOTAL PRECIPITATION`: (*Numerical Attribute*)

> `UV INDEX`: (*Numerical Attribute*)

> `WIND SPEED`: (*Numerical Attribute*)

In [None]:
DATA_RAW['corona'].head()

We see that the dataset `corona` holds information about newly infected cases and new deaths per day per region in Germany for most of the days in 2020. It consists of 5,602 rows, corresponding to 5,602 reports of new infections and deaths per day in the different regions of Germany. 

> `DATE` (YYYY-MM-DD): The day of the reports of new infections and deaths per region

> `REGION CODE`: Region in Germany (*Geographic Attribute*)

> `CONFIRMED ADDITION`: The number of newly confirmed infections in the region on the specified day (*Numerical Attributes*)

> `DECEASED ADDITION`: The number of newly confirmed deaths in the region on the specified day (*Numerical Attributes*)

### Summary
...

## Initial Sanity Check
---
Before continuing with the data analysis, we want to make sure that the datasets are clean. There are plenty of methods to check this. In the following, we will stick to these three:
>(a) **Date Representation** (*Are the dates entered in a consistent syntax?*)

>(b) **Missing Values in Columns** (*Are there missing values? If yes, how many and in which columns?*)

>(c) **Missing Values in Rows** (*Are there days on which there were no reports of the weather or corona development?*)
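
Check (c) can be approached by comparing the observed dates with a complete daily calendar between the first and the last observed day. The helper `find_missing_days` below is a hypothetical sketch on toy data, not a function from the `project2` package; for the real data it would be applied per region.

```python
import pandas as pd

def find_missing_days(dates):
    """Return the days between the first and last date that do not occur in `dates`."""
    observed = pd.to_datetime(pd.Series(dates), format="%Y-%m-%d")
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(pd.DatetimeIndex(observed.unique()))

# Toy input: 2020-02-15 is absent and 2020-02-16 occurs twice.
toy_dates = ["2020-02-13", "2020-02-14", "2020-02-16", "2020-02-16"]
missing_days = find_missing_days(toy_dates)
print([day.strftime("%Y-%m-%d") for day in missing_days])
```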

### Date Representation

In [None]:
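def check_date_representation(dates):
    """Return True if every entry in `dates` is written as YYYY-MM-DD."""
    # raw string avoids invalid escape sequences; fullmatch rejects trailing characters
    pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
    return all(pattern.fullmatch(date) for date in dates)

In [None]:
for dataset in NAMES['datasets']:
    print(f"{dataset.capitalize()}: {check_date_representation(DATA_RAW[dataset]['date'])}")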

### Missing Values
In every column of every dataset, we count the missing values (encoded as `' '`, i.e. *empty strings*) to get a feeling for which columns need further processing.
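
The idea behind such a check can be sketched as follows. The helper `count_missing` is hypothetical and only illustrates the principle; the actual implementation of `check_columns_for_missing_values` in the `project2` package may differ. The toy data is invented.

```python
import pandas as pd

def count_missing(df):
    """Count, per column, the cells that are NaN or contain only whitespace."""
    as_text = df.astype(str).apply(lambda column: column.str.strip())
    blank = as_text.eq("")
    return (blank | df.isna()).sum()

toy = pd.DataFrame({
    "date": ["2020-02-13", "2020-02-14", "2020-02-15"],
    "region_code": ["Hamburg", " ", "Berlin"],
    "confirmed_addition": [3, None, 7],
})
print(count_missing(toy))
```

Counting both whitespace-only strings and `NaN` is a deliberate choice in this sketch, since `pd.read_csv` converts truly empty fields to `NaN` by default.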

In [None]:
for dataset in NAMES['datasets']:
    print(dataset.capitalize())
    check_columns_for_missing_values(DATA_RAW[dataset])

## Filtering Datasets 
---
We got a good first feeling for both datasets and have proven them to be clean. The next step is to filter both datasets to only hold data related to Germany. Our goal is to not only have a dataframe that holds the `weather` and `corona` data for the whole of Germany, but also for each of the 16 regions in Germany.



### Entire Germany
We therefore start by defining another empty dictionary in `DATA_GERMANY['corona']` and `DATA_GERMANY['weather']`. These will hold the `corona` and `weather` data for the whole of Germany at key `all` and each region's at the corresponding `region_name` as a key.

In [None]:
for dataset in NAMES['datasets']:
    DATA_GERMANY[dataset] = {}

The `corona` and `weather` data for the whole of Germany is easy to obtain. Since the `corona` dataset is already filtered for Germany, we only need to copy the raw corona data into the key `all`.
The `weather` data, however, contains information for four countries, namely Denmark, Sweden, the Netherlands and Germany. Here we need to filter. We do this using the unique region codes that map the weather data to the different secondary-class regions of each country. We can obtain these codes from the metadata provided for Germany, which is saved at `DATA_GERMANY['metadata']`.

In [None]:
region_codes = [DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_code'] 
                for i in range(len(DATA_GERMANY['metadata']['country_metadata']))]

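# copy, so that later in-place changes (renaming, new columns) do not alter the raw data
DATA_GERMANY['corona']['all'] = DATA_RAW['corona'].copy()
DATA_GERMANY['weather']['all'] = DATA_RAW['weather'][DATA_RAW['weather']['iso3166-2'].isin(region_codes)].copy()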

### Regional Filtering
We now aim to filter the obtained datasets that hold the `corona` and `weather` data for all of Germany for each of the 16 regions and save them at the corresponding key in `DATA_GERMANY`. We do this by iterating over the region codes (ISO 3166-2 Standard) that we obtained from the metadata. 

In the `weather` dataset, we can use these ISO codes to apply a simple mask to the German weather data in order to filter for each region. 
In the `corona` dataset, however, the regions are not identified by their ISO code, but by their regular name. In order to filter correctly, we use our `REGION_LOOKUP['iso']` mapping, which takes an ISO code as key and returns the regular region name. 
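
The two filtering strategies can be illustrated on toy data. All values below are invented, and the dictionary `iso_to_name` is a stand-in for `REGION_LOOKUP['iso']`.

```python
import pandas as pd

# Toy data that mirrors the structure described above.
iso_to_name = {"DE-HH": "Hamburg", "DE-BE": "Berlin"}
weather = pd.DataFrame({"iso3166-2": ["DE-HH", "DE-BE", "DK-84"], "UVIndex": [1.5, 1.2, 0.9]})
corona = pd.DataFrame({"region_code": ["Hamburg", "Berlin", "Hamburg"], "confirmed_addition": [3, 5, 4]})

weather_by_region, corona_by_region = {}, {}
for iso, name in iso_to_name.items():
    # weather rows are matched on the ISO code, corona rows on the region name
    weather_by_region[name.lower()] = weather[weather["iso3166-2"] == iso].reset_index(drop=True)
    corona_by_region[name.lower()] = corona[corona["region_code"] == name].reset_index(drop=True)

print(corona_by_region["hamburg"]["confirmed_addition"].tolist())
```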

In [None]:
for region in region_codes:
    mask = DATA_GERMANY['weather']['all']['iso3166-2'] == region
    # drop=True discards the old row labels instead of adding them as an extra 'index' column
    DATA_GERMANY['weather'][REGION_LOOKUP['iso'][region].lower()] = DATA_GERMANY['weather']['all'][mask].reset_index(drop=True)

In [None]:
for region in region_codes:
    mask = DATA_GERMANY['corona']['all']['region_code'] == REGION_LOOKUP['iso'][region]
    # drop=True discards the old row labels instead of adding them as an extra 'index' column
    DATA_GERMANY['corona'][REGION_LOOKUP['iso'][region].lower()] = DATA_GERMANY['corona']['all'][mask].reset_index(drop=True)

### Inspection
---
With our newly filtered datasets, we have reduced the `weather` and `corona` data to the whole of Germany and to each of its regions, and we can conveniently access them by looking them up in `DATA_GERMANY[dataset]` using the region name as a key. To see what kind of data we are dealing with now, it makes sense to recompute the size of each of the datasets. However, our analysis of the column attributes still holds, since we have only filtered along `axis=0` (the rows), while we left the attributes untouched. 

Again, we report the number of records and fields/ variables in each of the datasets by using the `shape` property of the Pandas `DataFrame`. 

In [None]:
for dataset in NAMES['datasets']:
    print(f"{'-'*5}{dataset.capitalize()}{'-'*5}")
    for key in DATA_GERMANY[dataset].keys():
        print(f"{key.title()}: {DATA_GERMANY[dataset][key].shape}")

### Saving Filtered Data
---
We can now save our filtered data to `../data/interim/` in `csv` format for later inspection.

In [None]:
for dataset in NAMES['datasets']:
    for key in DATA_GERMANY[dataset].keys():
        save_csv(DATA_GERMANY[dataset][key], path=PATH['data']['interim'] + PATH[dataset], filename=f"{dataset}_{key}", index=False) 

## Process Data
---
As our sanity checks have shown, the provided data is already quite clean. In all datasets, there are neither missing values nor inconsistencies in the recording of data (e.g. the representation of dates). 

However, to make our further analysis more pleasant, there are some minor changes that we will perform on our datasets:
> (a) **Renaming of Columns** (The naming of the columns in the original dataframes is inconsistent and, in parts, poorly descriptive (what, e.g., is a `Confirmed Addition`?))

> (b) **Relative Data** (In our later analysis, we do not only want to look at absolute values, but also want to interpret them in relation to the population of the region we are looking at)

> (c) **Metric Change** (Here we convert the `Temperature`, which was recorded in Kelvin, into Celsius for easier readability)
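
Steps (b) and (c) boil down to simple column arithmetic, which can be sketched on toy data. The population figure below is purely illustrative and is not the real value from the metadata.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["Hamburg", "Hamburg"],
    "absolute_infections": [100, 250],
    "temperature_above_ground": [273.15, 283.15],  # Kelvin
})
population_map = {"Hamburg": 1_000_000}  # illustrative population, not the real figure

toy["population"] = toy["region"].map(population_map)
# share of the population infected on that day, expressed in percent
toy["relative_infections"] = toy["absolute_infections"] / toy["population"] * 100
# Kelvin to degrees Celsius
toy["temperature_above_ground"] = toy["temperature_above_ground"] - 273.15

print(toy[["relative_infections", "temperature_above_ground"]].round(3))
```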

### a. Renaming of Columns

In [None]:
for dataframe in DATA_GERMANY['weather'].keys():
    DATA_GERMANY['weather'][dataframe].rename(columns={
        'date': 'date', 
        'iso3166-2': 'iso3166-2',
        'RelativeHumiditySurface': 'relative_humidity_surface', 
        'SolarRadiation': 'solar_radiation', 
        'Surfacepressure': 'surface_pressure', 
        'TemperatureAboveGround': 'temperature_above_ground', 
        'Totalprecipitation': 'total_precipitation', 
        'UVIndex': 'uv_index', 
        'WindSpeed': 'wind_speed'}, inplace=True)

In [None]:
for dataframe in DATA_GERMANY['corona'].keys():
    DATA_GERMANY['corona'][dataframe].rename(columns={
        'date': 'date', 
        'region_code': 'region', 
        'confirmed_addition': 'absolute_infections', 
        'deceased_addition': 'absolute_deaths'}, inplace=True)

### b. Relative Data

In [None]:
population_map = {DATA_GERMANY['metadata']['country_metadata'][i]['iso3166-2_name_en']:
                  DATA_GERMANY['metadata']['country_metadata'][i]['population'] 
                  for i in range(len(DATA_GERMANY['metadata']["country_metadata"]))
                 }
for region in DATA_GERMANY['corona'].keys():
    DATA_GERMANY['corona'][region]['population'] = DATA_GERMANY['corona'][region]['region'].map(population_map)
    DATA_GERMANY['corona'][region]['relative_infections'] = DATA_GERMANY['corona'][region]['absolute_infections']/DATA_GERMANY['corona'][region]['population']*100
    DATA_GERMANY['corona'][region]['relative_deaths'] = DATA_GERMANY['corona'][region]['absolute_deaths']/DATA_GERMANY['corona'][region]['population']*100

### c. Metric Change

In [None]:
for region in DATA_GERMANY['weather'].keys():
    DATA_GERMANY['weather'][region]['temperature_above_ground'] = DATA_GERMANY['weather'][region]['temperature_above_ground'] - 273.15

### Export Processed Datasets
--- 
Finally, we export the processed datasets into new subfolders in `../data/processed`. From now on, the notebook will solely work with these processed datasets.

In [None]:
for dataset in NAMES['datasets']:
    for key in DATA_GERMANY[dataset].keys():
        save_csv(DATA_GERMANY[dataset][key], path=PATH['data']['processed'] + PATH[dataset], filename=f"{dataset}_{key}", index=False) 

# Single Variable Analysis (TASK 1)
---
We have obtained filtered and processed datasets from TASK 0. We will now turn our focus to analysing each attribute in each of the datasets, both in a numerical summary (e.g. through the number of uniques, counts of uniques or a five-number summary, where it makes sense) and in a visual representation. Depending on the type of the variable, different ways of visual representation are preferred to get a visual feeling for the data we are dealing with. 
> `Barplot` (*Categorical Variables*)
>> We usually want to plot categorical variables in a bar plot, where the x-axis is labelled with the unique values measured in the specific column and the y-axis represents the number of occurrences of each unique value. An example of a bar plot in this project is e.g. the Number of Accidents per Week Day.

> `Histogram` (*Numerical Variables*)
>> Most numerical variables (especially those with a wide range) are plotted in a histogram (which accounts for the fact that numerical variables are continuous). An example of a histogram is e.g. the Age Distribution of Casualties involved in an Accident.

> `None` (*Others*)
>> For some variables it doesn't make sense to plot them, e.g. geographical (`Longitude`, ...) or relational attributes (`Accident_Index`, ...).
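
The five-number summary mentioned above consists of the minimum, the first quartile, the median, the third quartile and the maximum. A minimal sketch with NumPy follows; the helper `five_number_summary` is hypothetical, and the project's own `get_fivenumsummary` may be implemented differently.

```python
import numpy as np

def five_number_summary(values):
    """Return minimum, first quartile, median, third quartile and maximum."""
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return {"min": minimum, "q1": q1, "median": median, "q3": q3, "max": maximum}

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Note that `np.percentile` interpolates linearly between data points by default, so quartiles may differ slightly from those of other conventions on small samples.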

## Numerical Analysis
---


In [None]:
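# assumes that SUMMARY[dataset] has been initialised beforehand (e.g. via initialise_summary)
for dataset in NAMES['datasets']:
    compute_numerical_summary(SUMMARY[dataset], DATA_GERMANY[dataset]['all'])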

### Saving Numerical Reports


## Visualisation
---
An essential step of each data exploration is to visualise our data. Only this makes the abstract numbers and characters meaningful.

We are therefore visualising the results of the single variable analysis in this section using the data structure `SUMMARY` we have developed step by step throughout the project. Depending on the properties of the attribute to plot, there are three types of visualisation:
> `Barplot`: The standard way of representing `categorical variables` using rectangular bars with heights and widths ([More information](https://en.wikipedia.org/wiki/Bar_chart))

> `Histogram`: A histogram is an approximate representation of the distribution of numerical data ([More information](https://en.wikipedia.org/wiki/Histogram))

> `Boxplot`: Boxplots are an important statistical tool to represent five-number summaries of numerical data visually ([More information](https://en.wikipedia.org/wiki/Box_plot))

To efficiently plot everything, we defined three functions `barplot`, `histogram` and `boxplot` that all take in the `SUMMARY` data structure as their main argument and 'intelligently' combine all information we have previously gathered to give a nice visualisation of the data.
We use the function `save_all_single_variable_analysis`, which depends on `save_figure`, to iterate over all columns; for each attribute, it computes the correct plot and saves it in the correct directory. 

In [None]:
pd.to_datetime(DATA_GERMANY['weather']['hamburg']['date'], format='%Y-%m-%d')

In [None]:
def plot_weather(data, region, condition, aggregation='daily', dimensions=(16,9)):
    fig = plt.figure(figsize=dimensions)
    ax = fig.add_axes([.15,.15,.7,.7]) # [left, bottom, width, height]

    # work on a copy so that the caller's DataFrame is not modified
    region_data = data[region].copy()
    region_data['date'] = pd.to_datetime(region_data['date'], format='%Y-%m-%d')

    # data to plot for each level of aggregation
    if region == 'all':
        region_data = pd.DataFrame(region_data.groupby('date')[condition].mean()).reset_index()

    if aggregation == 'daily':
        to_plot = region_data[['date', condition]]
    elif aggregation == 'weekly':
        to_plot = pd.DataFrame(region_data.groupby(pd.Grouper(key='date', freq='W-MON'))[condition].mean()).reset_index()
    elif aggregation == 'monthly':
        to_plot = pd.DataFrame(region_data.groupby(pd.Grouper(key='date', freq='M'))[condition].mean()).reset_index()

    if region == 'all': title = f"{condition.title().replace('_',' ')} in Germany"        
    else: title = f"{condition.title().replace('_',' ')} in {region.title()}"

    # plot
    ax.plot(to_plot['date'], to_plot[condition], '-o', color='darkblue', linewidth=2)

    # labeling
    ax.set_title(title, fontweight='bold', fontsize=14)
    ax.set_xlabel("Date")
    ax.set_ylabel(f"{condition.title().replace('_', ' ')}")

    # labelling of x-axis (https://stackoverflow.com/questions/9627686/plotting-dates-on-the-x-axis-with-pythons-matplotlib)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    plt.gcf().autofmt_xdate()

    ax.set_ylim(min(data['all'][condition]), max(data['all'][condition]))

    ax.text(.76,.89, f"{aggregation.capitalize()} Records", verticalalignment='center', transform = ax.transAxes, fontweight='bold')
    ax.text(.76,.84, "Period: Feb 2020-Feb 2021", verticalalignment='center', transform = ax.transAxes)

    # activate grid
    ax.grid(True)

    return fig

In [None]:
fig = plot_weather(
    data = DATA_GERMANY['weather'],
    region = 'hamburg',
    condition = 'temperature_above_ground',
    aggregation= 'daily'
)
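
The weekly aggregation used by the plotting function relies on `pd.Grouper`. The following sketch shows its behaviour on invented data; the column name `uv_index` is only used for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.date_range("2020-03-02", periods=14, freq="D"),  # starts on a Monday
    "uv_index": [float(i) for i in range(14)],
})

# freq='W-MON' builds weekly bins that end on (and include) a Monday;
# each bin is labelled with that closing Monday.
weekly = toy.groupby(pd.Grouper(key="date", freq="W-MON"))["uv_index"].mean().reset_index()
print(weekly)
```

Because the bins close on Mondays, the first Monday in the toy data forms a bin of its own, so the first and last weekly values can be based on fewer than seven days.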

In [None]:
def plot_weather_all(data, condition, dimensions=(16,9)):
    fig = plt.figure(figsize=dimensions)
    ax = fig.add_axes([.15,.15,.7,.7]) # [left, bottom, width, height]

    # iterate over the passed-in data rather than the global; [1:] skips the first key 'all'
    for region in list(data.keys())[1:]:
        x = pd.to_datetime(data[region]['date'], format='%Y-%m-%d')
        y = data[region][condition]
        # plot
        ax.plot(x,y, label=region.title())

    # labeling
    ax.set_title(f"{condition.title().replace('_', ' ')} in all Regions (Feb 2020-Feb 2021)", fontweight='bold', fontsize=14)
    ax.set_xlabel("date")
    ax.set_ylabel(f"{condition.title().replace('_', ' ')}")

    # labelling of x-axis (https://stackoverflow.com/questions/9627686/plotting-dates-on-the-x-axis-with-pythons-matplotlib)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    plt.gcf().autofmt_xdate()

    ax.set_ylim(min(data['all'][condition]), max(data['all'][condition]))

    # activate grid and legend
    ax.grid(True)
    ax.legend(loc='upper right')

    return fig

In [None]:
fig = plot_weather_all(
    data = DATA_GERMANY['weather'],
    condition = 'uv_index'
)

In [None]:
def plot_covid_cases(data, region, focus='infections', scale='absolute', aggregation='daily', dimensions=(16,9)):
    fig = plt.figure(figsize=dimensions)
    ax = fig.add_axes([.15,.15,.7,.7]) # [left, bottom, width, height]

    condition = f"{scale}_{focus}"

    # work on a copy so that the caller's DataFrame is not modified
    region_data = data[region].copy()
    region_data['date'] = pd.to_datetime(region_data['date'], format='%Y-%m-%d')

    # data to plot for each level of aggregation
    if region == 'all':
        region_data = pd.DataFrame(region_data.groupby('date')[condition].mean()).reset_index()

    if aggregation == 'daily':
        to_plot = region_data[['date', condition]]
    elif aggregation == 'weekly':
        to_plot = pd.DataFrame(region_data.groupby(pd.Grouper(key='date', freq='W-MON'))[condition].mean()).reset_index()
    elif aggregation == 'monthly':
        to_plot = pd.DataFrame(region_data.groupby(pd.Grouper(key='date', freq='M'))[condition].mean()).reset_index()

    if region == 'all': title = f"{scale.title()} Number of {focus.title()} in Germany"        
    else: 
        title = f"{scale.title()} Number of {focus.title()} in {region.title()}"
        ax.set_ylim(0, max(data['all'][condition]))
                    
    # plot
    ax.plot(to_plot['date'], to_plot[condition], '-o', color='darkblue', linewidth=2)

    # labeling
    ax.set_title(title, fontweight='bold', fontsize=14)
    ax.set_xlabel("Date")
    ax.set_ylabel(f"{scale.capitalize()} Number of {focus.capitalize()}")

    # labelling of x-axis (https://stackoverflow.com/questions/9627686/plotting-dates-on-the-x-axis-with-pythons-matplotlib)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    plt.gcf().autofmt_xdate()

    # activate grid
    ax.grid(True)

    return fig

In [None]:
fig = plot_covid_cases(data = DATA_GERMANY['corona'], region='hamburg', focus='infections', scale='absolute', aggregation='weekly')

## Saving Figures
---
...

**REMARK**: Be aware that the cell takes some time to compute, since it plots and saves all figures at once.

In [None]:
%%capture
for region in DATA_GERMANY['weather'].keys():
    for condition in list(DATA_GERMANY['weather']['all'])[2:]:
        save_figure(figure = plot_weather(data = DATA_GERMANY['weather'], region = region, condition = condition),
                    path = PATH['reports'] + f"{region}/weather", filename = f"{condition.lower().replace(' ', '_')}", save_to = 'pdf')

In [None]:
%%capture
for region in DATA_GERMANY['corona'].keys():
    for scale in ['absolute', 'relative']:
        # plot_covid_cases selects the column via `focus` ('infections' or 'deaths'); it has no `condition` argument
        save_figure(figure = plot_covid_cases(data = DATA_GERMANY['corona'], region = region, focus = 'infections', scale = scale), path = PATH['reports'] + f"{region}/infections", filename = f"{scale}_infections_per_day", save_to = 'pdf')
        
        save_figure(figure = plot_covid_cases(data = DATA_GERMANY['corona'], region = region, focus = 'deaths', scale = scale), path = PATH['reports'] + f"{region}/deceased", filename = f"{scale}_deceased_per_day", save_to = 'pdf')

## Reports of Single Variable Analysis
---
`TASK 1` specifically asks us to report []...
> a. ...

> b. ...

The following reports (that can also be found in the automatically generated plots in `reports/`) answer these questions.

### Something

In [None]:
#

# Associations (TASK 2)
---
An equally important, and perhaps more challenging, part of any data analysis is to find associations in a dataset. Especially when analysing *Road Safety* this is a key aspect of the analysis, as we can investigate which accident, casualty or vehicle attribute lead to more frequent or severe accidents. These insights, in turn, allow us to proactively counteract through political measures. 

Since our dataset is dominated by `categorical variables` (as seen in the `Initial Inspection of Datasets`), we mostly want to relate `categorical variables` to each other and the few `numerical variables` that exist there. The following methods of associating are used in the following section to investigate these associations:
> **`Categorical/ Numerical`**
>> For relating a categorical to a numerical variable, we use so-called `categorical scatterplots`, a special kind of scatterplot that displays the distribution of a numerical variable for the different values of the categorical variable. To plot this kind of plot, we use our own function `categorical_scatterplot`, which at its core depends on `sns.catplot` ([Documentation](https://seaborn.pydata.org/generated/seaborn.catplot.html))

> **`Categorical/ Categorical`**
>> Relating two categorical attributes to one another is slightly more difficult. To do this, we count the number of occurrences in all possible combinations of the two categorical variables and plot those in individual plots. To investigate the associativity between the two categorical variables, we use the [Pearson Chi Squared](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) test. It is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It follows the general assumption, that we expect no change in the relative observed values if there is no association at all. 
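
The statistic behind the Pearson chi-squared test can be computed by hand from a contingency table. The following is a sketch with NumPy on invented counts; the helper `chi_squared_statistic` is hypothetical and not part of the `project2` package. For a p-value, `scipy.stats.chi2_contingency` is a ready-made alternative (note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected one below).

```python
import numpy as np

def chi_squared_statistic(observed):
    """Pearson's chi-squared statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    column_totals = observed.sum(axis=0, keepdims=True)
    # expected counts if the two variables were independent
    expected = row_totals @ column_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof

# Toy tables: rows and columns are the levels of two categorical variables.
independent = [[10, 20], [30, 60]]   # identical proportions in both rows
associated = [[30, 10], [10, 30]]    # proportions differ strongly between rows
print(chi_squared_statistic(independent))
print(chi_squared_statistic(associated))
```

A statistic close to zero means that the observed counts match the counts expected under independence, while larger values indicate a stronger deviation from independence.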

## Picking Focus
---
...

## Associations
---
...

## Saving all Associations
---

## Report: Association between `Accident` and `Vehicle/ Casualty` Attribute 
---
`TASK 2` ...

# Spatial Visualisation (TASK 3)
---
So far, we have completely ignored the `geographic attributes` that were recorded for each accident in `accidents`. We change this now by plotting the accidents in Leeds on a map to get a good visual intuition of where they happen. To do this, we use a combination of functions, namely `map_accidents`, which under the hood calls `plot_point`. The functions are based on `folium`, an external Python package introduced earlier that builds on leaflet.js, a JavaScript library for creating interactive maps.
After export, the map can be explored as an `html` file in the browser or inline within Jupyter. The map contains a heat map if the function argument `heat_map` is `True`, and passing an arbitrary column of `accidents` as the `focus` argument visually distinguishes the different values of that column.
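
The behaviour described above can be sketched roughly as follows. This is not the project's `map_accidents` or `plot_point`; the names `assign_colours`, `sketch_map` and `PALETTE` are hypothetical, and the inputs are assumed to be plain lists of `(lat, lon)` pairs with one label per point. `folium` is imported inside the function so that the colour helper can be used without it.

```python
from itertools import cycle

# Hypothetical palette; folium markers accept CSS colour names.
PALETTE = ["red", "blue", "green", "purple", "orange"]


def assign_colours(values):
    """Map each distinct value, in order of first appearance, to a colour."""
    colours = cycle(PALETTE)
    lookup = {}
    for value in values:
        if value not in lookup:
            lookup[value] = next(colours)
    return lookup


def sketch_map(points, focus_values, heat_map=True, zoom_start=11):
    """Draw points on a folium map, coloured by their focus value.

    points: list of (lat, lon) tuples; focus_values: one label per point.
    """
    import folium                       # lazy import: only needed for drawing
    from folium.plugins import HeatMap

    # centre the map on the mean coordinate of all points
    centre = [sum(p[0] for p in points) / len(points),
              sum(p[1] for p in points) / len(points)]
    m = folium.Map(location=centre, zoom_start=zoom_start)

    if heat_map:
        HeatMap([list(p) for p in points]).add_to(m)

    lookup = assign_colours(focus_values)
    for (lat, lon), value in zip(points, focus_values):
        folium.CircleMarker(location=[lat, lon], radius=3,
                            color=lookup[value], fill=True).add_to(m)
    return m
```

Under these assumptions, `sketch_map(points, labels).save("map.html")` would write a map that can be opened in the browser.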

## Plot all Accidents in Leeds onto Map + Save Map Visualisation
---
The cell below saves all maps generated with a focus on different attributes into the directory `../reports/leeds/maps`. We can study them in further detail by opening the `html` files in the browser.

**REMARK**: Note that the execution of the below cell may take some time, since two maps (each ~3MB) need to be rendered.

In [None]:
# Aggregate the corona data per region: sum each column over the whole time period.
labels = np.array(list(DATA_GERMANY['corona']['all']))[[2,3,5,6]]
rows = []
for region in list(DATA_GERMANY['corona'].keys())[1:]:  # skip the first key (assumed to be 'all')
    row = DATA_GERMANY['corona'][region].sum().drop(['date', 'population','region'])
    _dict = {labels[i]: row.iloc[i] for i in range(len(row))}
    _dict['region_code'] = REGION_LOOKUP['region'][region.title()]
    rows.append(pd.Series(_dict, name=region.title()))
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
data_corona_region = pd.DataFrame(rows)

In [None]:
DATA_GERMANY['corona']['hamburg'].mean()

In [None]:
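# Aggregate the weather data per region: average each column over the whole time period.
labels = list(DATA_GERMANY['weather']['all'])[2:]
rows = []
for region in list(DATA_GERMANY['weather'].keys())[1:]:  # skip the first key (assumed to be 'all')
    row = DATA_GERMANY['weather'][region].mean()
    _dict = {labels[i]: row.iloc[i] for i in range(len(row))}
    _dict['region_code'] = REGION_LOOKUP['region'][region.title()]
    rows.append(pd.Series(_dict, name=region.title()))
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
data_weather_region = pd.DataFrame(rows)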

In [None]:
data_weather_region

In [None]:
def chloropleth(data, condition):
    """Return a folium map of Germany with each region shaded by `condition`.

    `data` needs a `region_code` column whose values match the
    `iso_3166_2` property of the region shapes.
    """
    # create the base map, centred on Germany
    m = folium.Map(location=[51.3, 10.3], zoom_start=6)

    # add the choropleth layer on top of the base map
    folium.Choropleth(
        geo_data = DATA_GERMANY['shapefiles'],
        data = data,
        columns = ["region_code", condition],
        key_on = "properties.iso_3166_2",
        fill_color = "OrRd",
        fill_opacity = .6,
        line_opacity = .5,
        legend_name = f"{condition.title().replace('_', ' ')} (Time Period: February 2020 - February 2021)",
    ).add_to(m)

    return m

In [None]:
chloropleth(data=data_weather_region, condition='wind_speed')

In [None]:
chloropleth(data=data_corona_region, condition='relative_infections')
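
A choropleth only shades a region when the value in `region_code` matches the GeoJSON property named in `key_on`. The sketch below shows one way to check this before plotting. It assumes the shapes are available as a parsed GeoJSON dictionary with a `features` list; the function name `unmatched_region_codes` and the toy data are hypothetical.

```python
def unmatched_region_codes(geojson, codes, key="iso_3166_2"):
    """Compare data codes against the GeoJSON feature property `key`.

    Returns (codes missing from the GeoJSON, GeoJSON keys missing from the data).
    """
    feature_keys = {
        feature["properties"][key]
        for feature in geojson["features"]
        if key in feature["properties"]
    }
    codes = set(codes)
    return sorted(codes - feature_keys), sorted(feature_keys - codes)


# Toy GeoJSON with two regions (geometry omitted for brevity).
toy_shapes = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"iso_3166_2": "DE-HH"}, "geometry": None},
        {"type": "Feature", "properties": {"iso_3166_2": "DE-BY"}, "geometry": None},
    ],
}

missing_in_shapes, missing_in_data = unmatched_region_codes(toy_shapes, ["DE-HH", "DE-BE"])
print(missing_in_shapes, missing_in_data)
```

If both lists are empty, every region in the data has a shape and every shape has a value. Any code that shows up in either list points to a lookup or spelling mismatch worth fixing before interpreting the map.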

## Report: Relevant Map
---
TASK 3 specifically asks us to visualise ...

In [None]:
# reported map

# Open Question (TASK 4)
---
The final `TASK 4` asks us to investigate a self-chosen research question. 

## 1. Introduction 
---
Relevant Literature