# European past floods

In this notebook, we analyze a dataset on past floods in Europe: <https://www.eea.europa.eu/data-and-maps/data/european-past-floods>.

## Read the documentation
This dataset is well documented, which is not always the case.
I strongly encourage you to take a few minutes to read the documentation on the landing page carefully, in particular:
- description,
- table definition,
- metadata.

This reading should enable you to answer a set of common, basic questions that will help guide your analysis, such as:

- Who created this dataset, and for what purpose?

- How was the dataset created?

- What do the instances that make up the dataset represent (e.g. people, companies, events, photos...)?

- What data does each instance consist of? Are they "raw" data or computed features?

- Are the instances related in some way? If so, are there specific fields that enable cross-referencing?

The documentation for a dataset is always written with some purpose, for an intended type of reader, in a certain context. It is therefore very likely that you will not find all the answers in the documentation.

Equipped with this knowledge, you can start the exploratory analysis of the data to gather the missing information to complete your answers.

## Exploratory analysis
First, you need to retrieve the dataset:
1. Download the dataset, which is exported as a zip file containing a **set of CSV files**.
2. Extract the CSV files to a folder. This folder can be anywhere you like, provided you can easily retrieve its path. The simplest and least confusing solution is probably to store your data next to your notebook (e.g. create a `data` folder in the same directory as this notebook).
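The extraction step can also be done from Python with the standard `zipfile` module. The sketch below is one possible approach, not the only one; the function name `extract_csv_files` and the archive name in the usage comment are hypothetical, so adapt them to the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, target_dir="data"):
    """Extract only the CSV members of a zip archive into target_dir.

    Returns the sorted list of extracted CSV paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the members whose name ends with .csv (case-insensitive).
        csv_members = [
            name for name in archive.namelist() if name.lower().endswith(".csv")
        ]
        archive.extractall(target, members=csv_members)
    # The archive may contain sub-folders, hence the recursive search.
    return sorted(target.rglob("*.csv"))


# Hypothetical usage, with a made-up archive name:
# extract_csv_files("downloaded_archive.zip", "data")
```

The function returns the paths of the extracted files, which makes it easy to copy the right path into `read_csv` afterwards.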

Now we are ready to start exploring the dataset with [pandas](https://pandas.pydata.org/).

In [2]:
import pandas as pd

### Tabular information
Read the CSV file `floodphenomena.csv` with the `read_csv` function.
The data will be read into a pandas DataFrame, see the [pandas intro tutorial 01](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).

In [3]:
# NB: you will most likely need to change the path to the file
df = pd.read_csv('../data/FloodPhenomena_2015_public_csv/floodphenomena.csv')

We can display the DataFrame by typing its name.
By default, pandas displays the column headers, the first and last five rows with their row index (at the left end of the row in **bold**), and the total number of rows and columns (below the last line).

In [9]:
df

| | cc | FloodPhenomenaID | Year | StartDate | EndDate | Number_FE | Number_FL | EUUOMCODE | FP_Severity | FP_Duration | FP_Extension | SourceOfFlooding | CharacteristicsOfFlooding | MechanismOfFlooding | FrequencyCategory | Area | Source | OtherSources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | AL-1992-11-17 | 1992 | 17/11/1992 | 19/11/1992 | | | | Very High | 2.0 | | | | | | | EM-DAT | |
| 1 | AL | AL-1995-08-19 | 1995 | 19/08/1995 | 26/08/1995 | | | | Very High | 8.0 | | | | | | | DFO | |
| 2 | AL | AL-1995-09-20 | 1995 | 20/09/1995 | 20/09/1995 | | | | Very High | | | | | | | | EM-DAT | |
| 3 | AL | AL-1995-09-21 | 1995 | 21/09/1995 | 24/09/1995 | | | | Very High | 4.0 | | | | | | | DFO | |
| 4 | AL | AL-1995-12-27 | 1995 | 27/12/1995 | 27/12/1995 | | | | Moderate | | | | | | | | EM-DAT | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3690 | XK | XK-2008-04-10 | 2008 | 10/04/2008 | 15/04/2008 | | | | Moderate | 5.0 | Local | | | | Frequent | 20.0 | National Authorities | |
| 3691 | XK | XK-2010-01-07 | 2010 | 07/01/2010 | 12/01/2010 | | | | Very High | 5.0 | Regional | | | | Rare | 8.0 | National Authorities | |
| 3692 | XK | XK-2013-03-14 | 2013 | 14/03/2013 | 20/03/2013 | | | | High | 6.0 | Local | | | | Frequent | 10.0 | National Authorities | |
| 3693 | XK | XK-2014-04-19 | 2014 | 19/04/2014 | 23/04/2014 | | | | High | 4.0 | Regional | | | | Rare | 30.0 | National Authorities | |
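
When the default preview shows too much or too little, `head`, `tail` and `shape` give finer control over what is displayed. Here is a minimal sketch on a toy DataFrame that copies a few values from the rows shown above; the name `sample` is only used for illustration.

```python
import pandas as pd

# Toy DataFrame holding a few values copied from the rows displayed above.
sample = pd.DataFrame({
    "cc": ["AL", "AL", "AL", "XK", "XK"],
    "FloodPhenomenaID": ["AL-1992-11-17", "AL-1995-08-19", "AL-1995-09-20",
                         "XK-2013-03-14", "XK-2014-04-19"],
    "Year": [1992, 1995, 1995, 2013, 2014],
    "FP_Severity": ["Very High", "Very High", "Very High", "High", "High"],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = sample.shape

# head(n) and tail(n) return the first and last n rows (n defaults to 5).
first_rows = sample.head(3)
last_rows = sample.tail(2)

print(n_rows, n_cols)
print(first_rows)
print(last_rows)
```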


### About the data table
We can display a summary of the DataFrame with `info()`, including for each column its index, name, number of non-null values, and data type (`dtype`).
For more information, you can read the [pandas intro tutorial 02](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3695 entries, 0 to 3694
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   cc                         3695 non-null   object 
 1   FloodPhenomenaID           3695 non-null   object 
 2   Year                       3695 non-null   int64  
 3   StartDate                  2268 non-null   object 
 4   EndDate                    2246 non-null   object 
 5   Number_FE                  3271 non-null   float64
 6   Number_FL                  2823 non-null   float64
 7   EUUOMCODE                  2798 non-null   object 
 8   FP_Severity                3695 non-null   object 
 9   FP_Duration                2244 non-null   float64
 10  FP_Extension               132 non-null    object 
 11  SourceOfFlooding           2636 non-null   object 
 12  CharacteristicsOfFlooding  1050 non-null   object 
 13  MechanismOfFlooding        1431 non-null   object 

`info` also displays the memory usage of the DataFrame.

How does it compare to the file size on disk?
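
One way to approach this question is to measure both sizes in Python. The sketch below works on a toy DataFrame written to a temporary CSV file, so the numbers it prints say nothing about the real dataset; the names `sample`, `size_on_disk` and `size_in_memory` are illustrative. For the real data, you could apply `os.path.getsize` to the path you passed to `read_csv`, and call `df.info(memory_usage="deep")` to get an accurate figure for text columns.

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame holding a few values copied from the rows displayed above.
sample = pd.DataFrame({
    "cc": ["AL", "AL", "XK", "XK"],
    "FloodPhenomenaID": ["AL-1992-11-17", "AL-1995-08-19",
                         "XK-2008-04-10", "XK-2010-01-07"],
    "Year": [1992, 1995, 2008, 2010],
    "FP_Duration": [2.0, 8.0, 5.0, 5.0],
})

# Write the toy table to a temporary CSV file so there is a file to measure.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(csv_path, index=False)
    size_on_disk = os.path.getsize(csv_path)

# deep=True also counts the Python strings stored in text columns,
# which the default (shallow) estimate ignores.
size_in_memory = int(sample.memory_usage(deep=True).sum())

print(f"on disk: {size_on_disk} bytes, in memory: {size_in_memory} bytes")
```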

### Selecting subsets

One of the fundamental operations on DataFrames is filtering: keeping only the rows or columns that meet a certain condition.

The basic operators for selection are square brackets `[]`, `loc` and `iloc`. You can select rows or columns by their position or label, or with a conditional expression on values, see the [pandas intro tutorial 03](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html).

Choose a European country present in the dataset and

* filter rows corresponding to the flood events that happened in that country,
* filter columns to keep only the country, phenomena id, year, start and end date, severity and duration.
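
The three selection operators can be sketched as follows on a toy DataFrame that copies a few values from the rows displayed earlier. The variable names are illustrative, and the country code used here is just one of the codes visible in the preview.

```python
import pandas as pd

# Toy DataFrame holding a few values copied from the preview of the dataset.
floods = pd.DataFrame({
    "cc": ["AL", "AL", "XK", "XK"],
    "FloodPhenomenaID": ["AL-1992-11-17", "AL-1995-08-19",
                         "XK-2008-04-10", "XK-2010-01-07"],
    "Year": [1992, 1995, 2008, 2010],
    "FP_Severity": ["Very High", "Very High", "Moderate", "Very High"],
    "FP_Duration": [2.0, 8.0, 5.0, 5.0],
})

# Square brackets with a list of labels select columns.
id_and_year = floods[["FloodPhenomenaID", "Year"]]

# Square brackets with a boolean Series select rows.
xk_rows = floods[floods["cc"] == "XK"]

# loc selects rows and columns at once, by condition and by label.
xk_small = floods.loc[floods["cc"] == "XK",
                      ["FloodPhenomenaID", "Year", "FP_Severity"]]

# iloc selects by integer position: first two rows, first three columns.
first_block = floods.iloc[:2, :3]

print(xk_small)
```

Using `loc` for the combined row and column filter keeps the selection in a single step, which tends to be easier to read than chaining two pairs of square brackets.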

### Plotting data

You can create plots from a DataFrame easily using pandas' integration of matplotlib, see the [pandas intro tutorial 04](https://pandas.pydata.org/docs/getting_started/intro_tutorials/04_plotting.html).

Create plots to visualize different columns (try to come up with different types of plots).
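
As a starting point, here is a minimal sketch of two plot types drawn from a toy DataFrame (values copied from the rows displayed earlier). It assumes matplotlib is installed; the choice of a bar plot and a histogram, and all variable names, are only illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy DataFrame holding a few values copied from the preview of the dataset.
floods = pd.DataFrame({
    "Year": [1992, 1995, 1995, 1995, 1995, 2008, 2010, 2013, 2014],
    "FP_Duration": [2.0, 8.0, float("nan"), 4.0, float("nan"),
                    5.0, 5.0, 6.0, 4.0],
})

# One figure with two side-by-side axes, passed explicitly to pandas
# so that each plot lands on its own axes.
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: number of flood phenomena per year.
events_per_year = floods["Year"].value_counts().sort_index()
events_per_year.plot(kind="bar", ax=ax_bar, title="Flood phenomena per year")

# Histogram: distribution of durations (missing values are ignored).
floods["FP_Duration"].plot(kind="hist", bins=4, ax=ax_hist,
                           title="Duration of flood phenomena")

fig.tight_layout()
# In a notebook the figure is displayed automatically;
# in a plain script, call plt.show() to open it.
```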

### Summary statistics
You can compute various summary statistics that depend on the type of variable in each column, see the [pandas intro tutorial 06](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html).

Compute summary statistics for several columns of different types, and for combinations of columns that could provide interesting insights.
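
Which statistic makes sense depends on the column type: `describe` suits numeric columns, `value_counts` suits categorical ones, and `groupby` combines the two. A minimal sketch on a toy DataFrame (values copied from the rows displayed earlier, variable names illustrative):

```python
import pandas as pd

# Toy DataFrame holding a few values copied from the preview of the dataset.
floods = pd.DataFrame({
    "cc": ["AL", "AL", "AL", "AL", "AL", "XK", "XK", "XK", "XK"],
    "FP_Severity": ["Very High", "Very High", "Very High", "Very High",
                    "Moderate", "Moderate", "Very High", "High", "High"],
    "FP_Duration": [2.0, 8.0, float("nan"), 4.0, float("nan"),
                    5.0, 5.0, 6.0, 4.0],
})

# Numeric column: count, mean, std, min, quartiles, max (NaN are skipped).
duration_stats = floods["FP_Duration"].describe()

# Categorical column: number of rows per category.
severity_counts = floods["FP_Severity"].value_counts()

# Combination of columns: mean duration for each country code.
mean_duration_by_country = floods.groupby("cc")["FP_Duration"].mean()

print(duration_stats)
print(severity_counts)
print(mean_duration_by_country)
```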

### Sorting data
The entries are sorted by country, but it would make sense to try a different order and sort by date, see the [pandas intro tutorial 07](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html).

Sort entries by date.
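
A word of caution before sorting: the dates in this file are stored as day-first text (e.g. `17/11/1992`), and sorting text compares characters, not dates. The sketch below, on a toy DataFrame with values copied from the rows displayed earlier, contrasts a plain text sort with a sort that parses the dates through the `key` argument of `sort_values`. Variable names are illustrative.

```python
import pandas as pd

# Toy DataFrame holding a few values copied from the preview of the dataset.
floods = pd.DataFrame({
    "FloodPhenomenaID": ["XK-2008-04-10", "AL-1995-08-19",
                         "AL-1992-11-17", "XK-2014-04-19"],
    "Year": [2008, 1995, 1992, 2014],
    "StartDate": ["10/04/2008", "19/08/1995", "17/11/1992", "19/04/2014"],
})

# Sorting the raw text orders rows by day, then month, then year.
by_text = floods.sort_values("StartDate")

# Parsing the text as dates inside `key` gives the chronological order.
by_date = floods.sort_values(
    "StartDate",
    key=lambda column: pd.to_datetime(column, format="%d/%m/%Y"),
)

# Sorting on several columns, most recent year first.
by_year_desc = floods.sort_values(["Year", "FloodPhenomenaID"],
                                  ascending=[False, True])

print(by_text["Year"].tolist())
print(by_date["Year"].tolist())
```

If the date columns have already been converted to the pandas date type, the `key` argument is no longer needed and a plain `sort_values("StartDate")` is chronological.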

### Creating new columns and combining data from different tables

The DataFrame contains country codes, which are hard to read for anyone who does not know them by heart.

Luckily, another CSV file in the dataset contains a list of country codes and names.

Find the relevant file and read it into another DataFrame, using `read_csv`.

This second table enables you to map country codes to country names, see the [pandas intro tutorial 08](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html).

If the column names in the two tables do not match, you can rename columns, see the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html).

Add a column with country names to the first DataFrame, and name it `country_name`.
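
The renaming and merging steps can be sketched as follows. Both tables are toy data: the column names `Code` and `Name` in the country table are hypothetical (check the real file for the actual names), and only two country codes are included.

```python
import pandas as pd

# Toy flood table holding a few values copied from the preview of the dataset.
floods = pd.DataFrame({
    "cc": ["AL", "AL", "XK"],
    "FloodPhenomenaID": ["AL-1992-11-17", "AL-1995-08-19", "XK-2008-04-10"],
})

# Toy country table; the column names "Code" and "Name" are hypothetical.
countries = pd.DataFrame({
    "Code": ["AL", "XK"],
    "Name": ["Albania", "Kosovo"],
})

# Rename the columns so that the key matches and the new column
# gets the requested name.
countries = countries.rename(columns={"Code": "cc", "Name": "country_name"})

# A left merge keeps every flood row, even if a code has no match.
floods_named = floods.merge(countries, on="cc", how="left")

print(floods_named)
```

A left merge is a cautious choice here: if a country code were missing from the second table, the flood row would be kept with a missing `country_name` instead of being silently dropped.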

### Working with dates
pandas has a specific data type for dates. You can explicitly ask pandas to use this type for specific columns, either during `read_csv` or after, see the [pandas intro tutorial 09](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html). Note that the dates in this file are written day first (e.g. `17/11/1992`), which pandas needs to know to parse them correctly.

Convert the `StartDate` and `EndDate` columns to this data type.
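
Both routes can be sketched on a small in-memory CSV (three rows copied from the preview of the dataset, read through `io.StringIO` in place of a real file). Since the dates are written day first, the sketch passes an explicit `format` to `to_datetime`, and `dayfirst=True` to `read_csv`; variable names are illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file, with rows copied from the preview.
csv_text = """cc,FloodPhenomenaID,StartDate,EndDate
AL,AL-1992-11-17,17/11/1992,19/11/1992
XK,XK-2008-04-10,10/04/2008,15/04/2008
XK,XK-2010-01-07,07/01/2010,12/01/2010
"""

# Option 1: read the dates as text, then convert with an explicit format.
floods = pd.read_csv(io.StringIO(csv_text))
for column in ["StartDate", "EndDate"]:
    floods[column] = pd.to_datetime(floods[column], format="%d/%m/%Y")

# Option 2: ask read_csv to parse the date columns, day first.
floods_parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["StartDate", "EndDate"],
    dayfirst=True,
)

print(floods.dtypes)
print(floods_parsed.dtypes)
```

The explicit `format` in option 1 has the advantage of failing loudly on a date that does not follow the expected pattern, whereas an ambiguous date such as `10/04/2008` could otherwise be read as October 4th.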

This specific data type makes it easy to filter floods by month, to find out which day of the week they started on, or to compute the time between any two floods in Europe or within a given country.

Draw a plot based on one of these ideas (or any other interesting one that relies on processing dates).
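
Before plotting, the date-based quantities themselves can be computed with the `dt` accessor. A minimal sketch on a toy DataFrame (dates copied from the rows displayed earlier, new column names illustrative):

```python
import pandas as pd

# Toy DataFrame with dates copied from the preview of the dataset.
floods = pd.DataFrame({
    "cc": ["AL", "AL", "XK", "XK"],
    "StartDate": pd.to_datetime(
        ["17/11/1992", "19/08/1995", "10/04/2008", "07/01/2010"],
        format="%d/%m/%Y"),
    "EndDate": pd.to_datetime(
        ["19/11/1992", "26/08/1995", "15/04/2008", "12/01/2010"],
        format="%d/%m/%Y"),
})

# Month number and weekday name of the start date.
floods["start_month"] = floods["StartDate"].dt.month
floods["start_weekday"] = floods["StartDate"].dt.day_name()

# Subtracting two date columns gives a duration; keep the number of days.
floods["elapsed_days"] = (floods["EndDate"] - floods["StartDate"]).dt.days

# Days since the previous flood in the same country
# (missing for the first flood of each country).
floods = floods.sort_values("StartDate")
floods["days_since_previous"] = (
    floods.groupby("cc")["StartDate"].diff().dt.days
)

# Filter on the month: floods that started in April.
april_floods = floods[floods["StartDate"].dt.month == 4]

print(floods)
```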

### Working with textual data
Another CSV file in the dataset contains the event locations of the floods.

pandas provides a number of functions to process text strings, see the [pandas intro tutorial 10](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html).

Read the event locations into another DataFrame and search for all event locations that mention the Danube river, either as "Danube" in English or as "Donau" in German.
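
A search of this kind can be sketched with `str.contains` and a regular expression. The table below is entirely made up for illustration: the column name `Location`, the identifiers and the location texts are hypothetical, so check the real file for the actual column to search.

```python
import pandas as pd

# Made-up table: column name, identifiers and texts are all hypothetical.
locations = pd.DataFrame({
    "FloodPhenomenaID": ["ID-1", "ID-2", "ID-3", "ID-4"],
    "Location": ["Danube river basin", "Obere Donau", None, "Coastal area"],
})

# "danube|donau" matches either spelling; case=False ignores capitalisation,
# and na=False treats missing locations as non-matching.
mentions_danube = locations["Location"].str.contains(
    "danube|donau", case=False, na=False, regex=True
)

danube_events = locations[mentions_danube]
print(danube_events)
```

Setting `na=False` matters here: without it, rows with a missing location would produce missing values in the mask, which cannot be used directly to filter rows.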

## Concluding remarks
This dataset is informative and useful for answering certain questions, but it only scratches the surface of the current state of data collection on floods.
Much more detailed and precise data are collected to enable fine-grained analysis and prediction, see for example:
<https://www.eea.europa.eu/data-and-maps/indicators/river-floods-3/assessment>.