## 0. Introduction

This notebook contains the description and the code of an analysis of the topic currently dominating the world: the Corona crisis. It was developed as part of the Udacity "Data Scientist" education program.

**Disclaimer: The entire project and all analysis results MUST NOT be considered scientifically proven insights about the Corona epidemic. It is the project of a student with little experience in Data Science, and its primary purpose is 
a) to finalize the Udacity Data Science training program, and 
b) to generate some insights about the potential of publicly available data in a world-wide crisis.**

The analysis follows the "Cross-industry standard process for data mining" (CRISP-DM) process model. The first two chapters of the notebook therefore summarize "Business Understanding" and "Data Understanding". The further steps of the CRISP-DM model are integrated into each of the 5 questions (see the chapter below).

## 1. Business Understanding

Over the first few months of 2020, the Covid-19/Corona virus pandemic has put the entire world into a serious health, social and economic crisis. The pandemic has spread rapidly, and is still spreading rapidly, across all countries. Most countries have closed their borders and shut down their normal societal and economic life.

The Corona pandemic is omnipresent, and many people are working hard to better understand 

1. how quickly the virus spreads across countries and societies,
2. what measures might have an impact on the infection and/or death growth rate,
3. and what the major implications of the virus pandemic will be. 

The insights resulting from such analyses could then be used by politicians to decide whether and how to loosen or intensify societal and economic restrictions.

The analysis in this project focusses on the following questions:

1. Which trustworthy sources of data about the Corona pandemic exist?
2. What kind of data is provided by the various sources, and how can it be used to help understand how fast the virus spreads per country?
3. Are there significant differences between countries in how the pandemic develops? And what might be the reasons for these differences?
4. How do the numbers of Corona infections and Corona deaths correlate with each other? Are there significant differences between countries?
5. How can the implications of the Corona pandemic be measured by data?

## 2. Data Understanding

The first major challenge of such a project is to identify appropriate and trustworthy data sources. 

After investigating various publicly available sources, I selected the "European Centre for Disease Prevention and Control" (ECDC) as the data provider. The agency provides downloadable data files, which are updated daily and contain the latest available public data on COVID-19.

A detailed discussion of potential data sources is included in the COVID-19 data exploration blog.

### Import libraries and gain some first insights about the provided data

In [1]:
# Import required libraries
import pandas as pd
import numpy as np


The ECDC dataset can be downloaded as a CSV file. For this project, the data from 14 April 2020 was used and stored in the subfolder "data". From there it can be read with the standard pandas function read_csv.

In [2]:
# Read the ECDC dataset into a Pandas data frame
filename = "data/ECDC_COVID19_20200414.csv"

df_cases = pd.read_csv(filename)

# Show the first five rows of the data
df_cases.head()

| | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2018 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14/04/2020 | 14 | 4 | 2020 | 58 | 3 | Afghanistan | AF | AFG | 37172386.0 |
| 1 | 13/04/2020 | 13 | 4 | 2020 | 52 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 2 | 12/04/2020 | 12 | 4 | 2020 | 34 | 3 | Afghanistan | AF | AFG | 37172386.0 |
| 3 | 11/04/2020 | 11 | 4 | 2020 | 37 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 4 | 10/04/2020 | 10 | 4 | 2020 | 61 | 1 | Afghanistan | AF | AFG | 37172386.0 |


The ECDC dataset contains 10 columns:

+ dateRep: Reporting date including day, month and year
+ day: Reporting day
+ month: Reporting month
+ year: Reporting year
+ cases: Number of new infections over the last 24h
+ deaths: Number of new deaths over the last 24h
+ countriesAndTerritories: Name of the country or territory which reported the number
+ geoId: ID of the reporting country
+ countryterritoryCode: Official country code of the reporting country
+ popData2018: Size of the population of the reporting country

Check the data types and convert as appropriate

In [3]:
# Check imported data types
print("Data Types before conversion:")
print(df_cases.dtypes)

df_cases.dateRep = pd.to_datetime(df_cases.dateRep, format='%d/%m/%Y')


print("\nData Types after conversion:")
print(df_cases.dtypes)

Data Types before conversion:
dateRep                     object
day                          int64
month                        int64
year                         int64
cases                        int64
deaths                       int64
countriesAndTerritories     object
geoId                       object
countryterritoryCode        object
popData2018                float64
dtype: object

Data Types after conversion:
dateRep                    datetime64[ns]
day                                 int64
month                               int64
year                                int64
cases                               int64
deaths                              int64
countriesAndTerritories            object
geoId                              object
countryterritoryCode               object
popData2018                       float64
dtype: object
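
The explicit format string matters because the ECDC file writes dates day-first. The following is a minimal sketch, using two date strings taken from the sample rows above, of how the explicit format compares with the default parsing of pandas. The variable names are chosen for this illustration only.

```python
import pandas as pd

# Two reporting dates in the ECDC day/month/year layout (from the sample rows above)
raw_dates = pd.Series(["10/04/2020", "14/04/2020"])

# Explicit day-first format: both values are read as dates in April 2020
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())

# Without a format, pandas assumes month-first for ambiguous strings,
# so "10/04/2020" on its own is read as 4 October 2020
ambiguous = pd.to_datetime("10/04/2020")
print(ambiguous.month)
```

Passing the format explicitly avoids this ambiguity for all days up to the 12th of a month.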


As a next step, we collect some statistics to verify the data further.

In [4]:
# Number of rows
print("Overall number of rows in the data set: {}\n\n".format(df_cases.shape[0]))

# What are the first and latest reporting days?
print("First reporting day in the data set: {}".format(min(df_cases['dateRep'])))
print("Latest reporting day in the data set: {}\n\n".format(max(df_cases['dateRep'])))

# Number of current total infections and total deaths
print("Number of current total infections: {}".format(sum(df_cases['cases'])))
print("Number of current total deaths: {}\n\n".format(sum(df_cases['deaths'])))

# Number of unique countries or territories
print("Number of unique countries or territories within the data set: {}\n\n".format(df_cases.countriesAndTerritories.nunique()))

# Are there any NaN values in the data set?
print("Number of NaN values per column in the data set: {}\n\n".format(df_cases.isna().sum()))


Overall number of rows in the data set: 10742


First reporting day in the data set: 2019-12-31 00:00:00
Latest reporting day in the data set: 2020-04-14 00:00:00


Number of current total infections: 1873265
Number of current total deaths: 118854


Number of unique countries or territories within the data set: 206


Number of NaN values per column in the data set: dateRep                      0
day                          0
month                        0
year                         0
cases                        0
deaths                       0
countriesAndTerritories      0
geoId                       31
countryterritoryCode       107
popData2018                 67
dtype: int64




Some rows have no geoId, countryterritoryCode or popData2018 value. Let's analyse these rows in more detail and adjust as appropriate.
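
A minimal sketch of how the affected rows could be inspected is shown here. It uses a small invented stand-in (`df_sample`) instead of the real `df_cases`, so the country names and values are placeholders, not ECDC data.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_cases; the rows and values are invented for illustration
df_sample = pd.DataFrame({
    "countriesAndTerritories": ["Country_A", "Country_B", "Country_C", "Country_C"],
    "geoId": ["AA", np.nan, "CC", "CC"],
    "countryterritoryCode": ["AAA", "BBB", np.nan, np.nan],
    "popData2018": [1000.0, 2000.0, np.nan, np.nan],
})

id_columns = ["geoId", "countryterritoryCode", "popData2018"]

# Keep only the rows where at least one of the three columns is missing
rows_with_gaps = df_sample[df_sample[id_columns].isna().any(axis=1)]

# Count the missing values per country to see which territories are affected
gaps_per_country = (
    rows_with_gaps.groupby("countriesAndTerritories")[id_columns]
    .agg(lambda col: col.isna().sum())
)
print(gaps_per_country)
```

One thing that may be worth checking (an assumption, not verified against this file): read_csv treats the string "NA" as missing by default, so a two-letter code spelled "NA", such as the ISO code of Namibia, would show up as NaN unless `keep_default_na=False` is passed.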

#### Data Understanding Results

This first analysis of the data and of its quality shows some issues that we need to consider in the subsequent data mining activities. To name just a few: the dateRep column is read as dtype object and has to be converted to a datetime type, the column names are inconsistent and unwieldy and might cause problems, and the geoId, countryterritoryCode and popData2018 columns contain missing values.
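
One possible way to deal with the column-name issue is to rename the columns once, right after loading. The sketch below shows this on an empty frame that only carries the ten ECDC column names. The new names in `column_map` are hypothetical and chosen for this illustration; the remaining cells of this notebook keep the original ECDC names.

```python
import pandas as pd

# Empty frame with the ten ECDC column names listed above
df_sample = pd.DataFrame(columns=[
    "dateRep", "day", "month", "year", "cases", "deaths",
    "countriesAndTerritories", "geoId", "countryterritoryCode", "popData2018",
])

# Hypothetical new names; chosen for this sketch, not taken from the project
column_map = {
    "dateRep": "date",
    "countriesAndTerritories": "country",
    "geoId": "geo_id",
    "countryterritoryCode": "country_code",
    "popData2018": "population_2018",
}

# rename returns a new frame; columns not listed in the map are kept as they are
df_renamed = df_sample.rename(columns=column_map)
print(list(df_renamed.columns))
```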

## 3. General Data Preparation

ECDC's daily data files contain the number of newly reported cases and deaths. As we want to analyse the growth rate per country, we will convert the data into a pivoted table with cumulative cases and deaths.

In [None]:
# Step 1: Make sure the data is properly sorted by country and date
df_sorted = df_cases.sort_values(['countriesAndTerritories','dateRep'])

# Step 2: Calculate two additional columns with cumulative cases and deaths.
# Work on a copy so that df_sorted stays unchanged. A grouped cumsum keeps the
# original row index, so the new columns stay aligned with the sorted rows.
df_cum = df_sorted.copy()
df_cum["cum_cases"] = df_cum.groupby('countriesAndTerritories')["cases"].cumsum()
df_cum["cum_deaths"] = df_cum.groupby('countriesAndTerritories')["deaths"].cumsum()
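
The pivoting step mentioned in the text is not part of the cell, so here is a minimal sketch of how it could look. It runs on a small invented data set with two placeholder countries; `df_sample` and `cases_wide` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Invented daily reports for two placeholder countries, deliberately unsorted
df_sample = pd.DataFrame({
    "dateRep": pd.to_datetime(
        ["2020-04-02", "2020-04-01", "2020-04-03",
         "2020-04-01", "2020-04-02", "2020-04-03"]
    ),
    "countriesAndTerritories": ["A", "A", "A", "B", "B", "B"],
    "cases": [5, 2, 3, 10, 0, 7],
    "deaths": [1, 0, 0, 2, 1, 0],
})

# Sort first so that the running totals follow the calendar order
df_sample = df_sample.sort_values(["countriesAndTerritories", "dateRep"])

# Running totals per country; the result stays aligned with the row index
df_sample["cum_cases"] = df_sample.groupby("countriesAndTerritories")["cases"].cumsum()
df_sample["cum_deaths"] = df_sample.groupby("countriesAndTerritories")["deaths"].cumsum()

# One row per reporting day, one column per country
cases_wide = df_sample.pivot(
    index="dateRep", columns="countriesAndTerritories", values="cum_cases"
)
print(cases_wide)
```

The wide layout makes it easy to compare countries day by day, for example by plotting each column as one line.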


#### Data Preparation Next Steps

Further data adjustments are needed to answer the questions of interest stated above. These further data preparation activities will be performed "by question" to make it easier for the reader to follow the original train of thought.

## 4. Data Understanding, Preparation, Modeling, Evaluation and Deployment by Question

The further steps of the CRISP-DM model are integrated into each of the 5 questions (see chapter Business Understanding).