<img src="../../img/data_preparation_summary.png" alt="Data preparation summary" style="width: 100%; border-radius: 20px;"/>

## Objective
To help new users prepare data for our modeling and to give a concise overview of the preprocessing carried out in *01_data_preparation*, this notebook summarizes the steps from the following notebooks:
- 01_Dataset_Merging.ipynb
- 02_Dataset_Reduction_27_species.ipynb
- 03_EEA_Grid_Assignment.ipynb 

The following steps are executed:
- **Standardize raw data:** The raw data received from the ornithologists is transformed into a uniform schema for use in all our modeling notebooks. This involves standardizing column names, data types, date formats, precisions, as well as species IDs and names. These steps are further explained in the notebook *01_Dataset_Merging.ipynb*.
- **Filter for selected 27 species:** The data is filtered for the 27 species of interest, as chosen by the ornithologists. The detailed procedure is described in the notebook *02_Dataset_Reduction_27_species.ipynb*.
- **Assign to EEA grids:** Each bird sighting is assigned to an EEA grid. The detailed procedure is explained in the notebook *03_EEA_Grid_Assignment.ipynb*.

All functions used in this notebook can be found in the *utils* directory under `data_preparation.py`.

In [1]:
%%HTML
<style>
    body {
        --vscode-font-family: "Itim"
    }
</style>

In [2]:
import sys
sys.path.append('../')

import pandas as pd
from utils.data_preparation import *

#
<p style="background-color:#4A3228;color:white;font-size:240%;border-radius:10px 10px 10px 10px;"> &nbsp; 0️⃣ Specify your paths </p>

In order to run the notebook, the following datasets are required:
- Swiss dataset: *[birds_ch_2018-2022.csv](https://drive.google.com/drive/folders/1eznk8GyIKt8fPJCb4TVqEIkrNcwonn9m)*
- German dataset: *[birds_de_2018-2022.csv](https://drive.google.com/drive/folders/1eznk8GyIKt8fPJCb4TVqEIkrNcwonn9m)*
- ID translator file that translates German species IDs into ornitho species IDs: *[translation_species_id_germany_vs_ornitho.csv](https://drive.google.com/drive/folders/1VN87gPc_XA212rpyaq2xpJcOSDu8hN5v)*
- Name translator file that translates Swiss species names into ornitho species names: *[translation_species_names_de_vs_ch.csv](https://drive.google.com/drive/folders/1VN87gPc_XA212rpyaq2xpJcOSDu8hN5v)*
- Table containing the species list that the ornithologists decided on: *[selected_species_of_interest.csv](https://drive.google.com/drive/folders/1SbXMiMweOrHgfGJZ0cOtPJzQo6bbvyJJ)*
- Shapefile of the 50 × 50 km EEA grids of Europe: *[eea_50_km_ref-grid-europe/inspire_compatible_grid_50km.shp](https://drive.google.com/drive/folders/1atS5eomHYxX-q_5b8WGqFVDtqP3-d8qP)*

If you wish to store the resulting dataset, please specify a target path where it should be stored.

In [3]:
data_path_ch = '../../../01_Data/datasets/birds_ch_2018-2022.csv'  # Path to the Swiss dataset
data_path_de = '../../../01_Data/datasets/birds_de_2018-2022.csv'  # Path to the German dataset

path_translator_ids = '../../../01_Data/translators/translation_species_id_de_vs_ornitho.csv'  # Provide path to translator file for species ids
path_translator_names = '../../../01_Data/translators/translation_species_names_de_vs_ch.csv'  # Provide path to translator file for species names

data_path_selected_species = '../../../01_Data/datasets/selected_species_of_interest.csv'  # Provide path to file with selected species of interest

path_eea_grids = '../../../01_Data/shp_files/grids/eea_europe_grids_50km/inspire_compatible_grid_50km.shp'  # Provide path to EEA shp file

target = '../../../01_Data/datasets/master_bird_data_quick.csv'  # Provide data path where merged dataset shall be saved
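
Since every later step depends on these files, it can save time to confirm that they exist before running the rest of the notebook. The following is a minimal sketch; `find_missing_inputs` is a hypothetical helper and not part of `utils/data_preparation.py`.

```python
from pathlib import Path


def find_missing_inputs(paths):
    """Return the labels of all entries whose path is not an existing file."""
    return [label for label, path in paths.items() if not Path(path).is_file()]


# Usage in this notebook (the variables come from the path cell):
# missing = find_missing_inputs({
#     "data_path_ch": data_path_ch,
#     "data_path_de": data_path_de,
#     "path_translator_ids": path_translator_ids,
#     "path_translator_names": path_translator_names,
#     "data_path_selected_species": data_path_selected_species,
#     "path_eea_grids": path_eea_grids,
# })
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```

An empty list means all inputs were found; otherwise the returned labels point to the variables that need correcting.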

#
<p style="background-color:#4A3228;color:white;font-size:240%;border-radius:10px 10px 10px 10px;"> &nbsp; 1️⃣ Step by step: from raw data to the dataset used for training and prediction </p>

## 1. Standardize German dataset

In [4]:
data_de = pd.read_csv(data_path_de, delimiter=get_delimiter(data_path_de), low_memory=False)

data_de = standardize_data(data_de, 
                           path_translator_species_names=path_translator_names,
                           adjust_ids=True,
                           path_translator_species_ids=path_translator_ids,
                           date_format='%d.%m.%Y')
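
`get_delimiter` is imported from `utils/data_preparation.py` and its implementation is not shown in this notebook. As an illustration only, one simple way to detect a CSV separator is to count candidate characters in the header line. `guess_delimiter` below is a hypothetical stand-in under that assumption and may behave differently from the project's function.

```python
def guess_delimiter(path, candidates=(",", ";", "\t", "|")):
    """Return the candidate separator that occurs most often in the header line."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        header = handle.readline()
    # max() keeps the first candidate on ties, so "," wins if nothing is found
    return max(candidates, key=header.count)
```

This heuristic assumes the header line contains no quoted fields with embedded separators.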

## 2. Standardize Swiss dataset

In [5]:
data_ch = pd.read_csv(data_path_ch, delimiter=get_delimiter(data_path_ch), low_memory=False)

data_ch = standardize_data(data_ch,
                           path_translator_species_names=path_translator_names,
                           adjust_ids=False,
                           date_format='%Y-%m-%d')
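
The two calls differ mainly in `date_format`: `'%d.%m.%Y'` for the German file and `'%Y-%m-%d'` for the Swiss file. The sketch below shows how these two format strings map differently written dates to the same value. The raw strings are made-up examples, and the snippet uses the standard library rather than `standardize_data`, whose internals are not shown here.

```python
from datetime import datetime

german_raw = "01.01.2018"  # day.month.year, matches '%d.%m.%Y'
swiss_raw = "2018-01-01"   # year-month-day, matches '%Y-%m-%d'

german_date = datetime.strptime(german_raw, "%d.%m.%Y").date()
swiss_date = datetime.strptime(swiss_raw, "%Y-%m-%d").date()

print(german_date == swiss_date)  # True
print(german_date.isoformat())    # 2018-01-01
```

The same format directives are accepted by the `format` argument of `pd.to_datetime`.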

## 3. Merge datasets

In [6]:
data_de['country'] = 'de'
data_ch['country'] = 'ch'

master_data = pd.concat([data_de, data_ch])
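
Note that `pd.concat` keeps each frame's original row labels by default, so the merged frame can contain duplicate index values. The toy example below (invented data, not the real datasets) shows the difference that `ignore_index=True` makes, in case unique row labels are needed downstream.

```python
import pandas as pd

# Two tiny stand-ins for the standardized German and Swiss frames
toy_de = pd.DataFrame({"id_sighting": [1, 2], "country": "de"})
toy_ch = pd.DataFrame({"id_sighting": [3, 4], "country": "ch"})

merged = pd.concat([toy_de, toy_ch])
merged_reindexed = pd.concat([toy_de, toy_ch], ignore_index=True)

print(merged.index.tolist())            # [0, 1, 0, 1]
print(merged_reindexed.index.tolist())  # [0, 1, 2, 3]
```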

## 4. Filter for selected 27 species

In [7]:
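# Load the species list and keep only the rows flagged as final selection
species = pd.read_csv(data_path_selected_species, usecols=['ornithoid', 'namedt', 'finale Auswahl'])
selected_species = species[species['finale Auswahl'] == 1]
# Copy so that later column assignments do not act on a view of master_data
master_selected_species = master_data[master_data.id_species.isin(selected_species.ornithoid)].copy()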
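
A quick sanity check after filtering is to see whether every selected species actually occurs in the data. The following is a minimal sketch with invented IDs; `species_without_sightings` is a hypothetical helper, not part of the project's utilities.

```python
import pandas as pd


def species_without_sightings(selected_ids, sightings, id_column="id_species"):
    """Return the selected species IDs that never occur in the sightings frame."""
    observed = set(sightings[id_column].tolist())
    return sorted(set(pd.Series(selected_ids).tolist()) - observed)


toy_sightings = pd.DataFrame({"id_species": [8, 397, 8, 463]})
toy_selected = pd.Series([8, 397, 999])

print(species_without_sightings(toy_selected, toy_sightings))  # [999]
```

In this notebook the call would take `selected_species.ornithoid` and `master_selected_species` as arguments; an empty list means every selected species has at least one sighting.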

## 5. Assign EEA grids

In [8]:
master_selected_species = assign_eea_grids(master_selected_species, path_eea_grids)
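
After the assignment it can be useful to check that every sighting received a grid ID. The sketch below assumes that valid IDs follow the pattern seen in the final dataset (for example `50kmE4200N3300`). Both the pattern and the helper `invalid_grid_ids` are assumptions for illustration, not part of `assign_eea_grids`.

```python
import re

# Assumed pattern, inferred from IDs such as "50kmE4200N3300"
GRID_ID_PATTERN = re.compile(r"^50kmE\d+N\d+$")


def invalid_grid_ids(grid_ids):
    """Return all entries that are missing or do not match the assumed pattern."""
    return [
        grid_id
        for grid_id in grid_ids
        if not isinstance(grid_id, str) or not GRID_ID_PATTERN.match(grid_id)
    ]


sample_ids = ["50kmE4200N3300", "50kmE4250N3300", None, "E4200N3300"]
print(invalid_grid_ids(sample_ids))  # [None, 'E4200N3300']
```

Passing `master_selected_species['eea_grid_id']` would list any sightings that fell outside the grid or were left unassigned.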

## 6. Store as csv

In [9]:
master_selected_species.to_csv(target)
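
By default `to_csv` also writes the row index, which shows up as a column named `Unnamed: 0` when the file is read back without `index_col`. The toy round trip below illustrates this and the effect of `index=False`; whether the index should be kept depends on how downstream notebooks read the file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"id_sighting": [29666972, 29654244]}, index=[14, 17])

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'id_sighting']

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['id_sighting']
```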

## The final dataset structure:

In [10]:
master_selected_species.head()

| Unnamed: 0 | id_sighting | id_species | name_species | date | timing | coord_lat | coord_lon | precision | altitude | total_count | atlas_code | id_observer | country | eea_grid_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 29666972 | 8 | Haubentaucher | 2018-01-01 | | 53.15776 | 8.676993 | place | 0 | 0 | | 37718 | de | 50kmE4200N3300 |
| 17 | 29654244 | 397 | Braunkehlchen | 2018-01-01 | | 53.127639 | 8.957263 | square | 0 | 2 | | 37803 | de | 50kmE4250N3300 |
| 30 | 29654521 | 463 | Wiesenpieper | 2018-01-01 | | 50.850941 | 12.146953 | place | 0 | 2 | | 39627 | de | 50kmE4450N3050 |
| 49 | 29666414 | 8 | Haubentaucher | 2018-01-01 | | 51.076006 | 11.038316 | place | 0 | 8 | | 38301 | de | 50kmE4350N3100 |
| 77 | 29656211 | 8 | Haubentaucher | 2018-01-01 | | 51.38938 | 7.067282 | place | 0 | 10 | | 108167 | de | 50kmE4100N3100 |
