# Evaluation of implemented approaches 🏅
This notebook evaluates the approaches we have implemented for assessing the validity of bird sightings. <br>
The ornithologists have provided us with a dataset containing both correct and falsified data. The falsified data was generated by duplicating correct data and making slight modifications to render them invalid, such as altering the date or bird species information.

The dataset does not reveal which datapoints are correct and which were falsified by the ornithologists. Therefore, in this notebook, we add a column named 'ERROR_DETECTED' that indicates whether our model predicts a datapoint to be invalid. The ornithologists then compare this column with the ground truth and give us feedback.
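
To make the comparison step concrete, here is a minimal sketch of how a prediction column could be scored against ground-truth labels. The labels, the variable names `truth` and `prediction`, and the choice of precision and recall are assumptions for illustration only; the ornithologists' actual procedure and metrics are not described in this notebook.

```python
import pandas as pd

# Made-up labels for six records (True = datapoint is invalid); purely illustrative
truth = pd.Series([True, False, False, True, False, True], name="truth")
prediction = pd.Series([True, False, True, False, False, True], name="prediction")

# Confusion-matrix counts
tp = int((truth & prediction).sum())    # invalid and flagged
fp = int((~truth & prediction).sum())   # valid but flagged
fn = int((truth & ~prediction).sum())   # invalid but missed
tn = int((~truth & ~prediction).sum())  # valid and not flagged

precision = tp / (tp + fp)  # share of flagged records that are truly invalid
recall = tp / (tp + fn)     # share of invalid records that were flagged

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```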

***

You can download the validation data [here](https://drive.google.com/drive/folders/1emvbXc5ExoEgv7Pmwy_Y5rjNc9k8hrNs).

In [102]:
%reload_ext autoreload

import sys
sys.path.append('../')

import pickle
import pandas as pd

from utils.data_preparation import *

In [103]:
# Data we want to predict on
path_validata = '../../../01_Data/datasets/validata_ornitho_ch_2023.csv'
date_format = '%d.%m.%Y'  # ch: '%d.%m.%Y'; de: '%m/%d/%Y'

# Data we need for data preparation
path_translator_names = '../../../01_Data/translators/translation_species_names_de_vs_ch.csv'
path_eea_grids = '../../../01_Data/eea_gridfiles/eea_europe_grids_50km/inspire_compatible_grid_50km.shp'

# Where to store the results
target_path = '../../../01_Data/results/results_emergent_filters_ch_5%.csv'

In [104]:
validata = pd.read_csv(path_validata, delimiter=get_delimiter(path_validata), low_memory=False)
validata = standardize_data(validata, 
                            date_format=date_format,
                            path_translator_species_names=path_translator_names, 
                            eea_shapefile_path=path_eea_grids)

## 1️⃣  Emergent Filters
The evaluation requires the Emergent Filters themselves, which can be generated with the notebook 02_Emergent_Filters.ipynb. The Emergent Filters for the dataset *selected_bird_species_with_grids_50km.csv* can be downloaded from here as a file named *emergent_filters_selected_species_grids_50km.pkl*.

In [105]:
threshold = 0.05

In [106]:
filters_path = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/models/emergent_filters_selected_species_grids_50km.pkl'  # path to emergent filters

In [107]:
with open(filters_path, 'rb') as file:
    emergent_filters = pickle.load(file)

In [108]:
validata['day_of_year'] = pd.to_datetime(validata.date).dt.dayofyear
# Copy the column subset so that adding a column later does not write to a view
validata = validata[['name_species', 'eea_grid_id', 'day_of_year']].copy()
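
The `day_of_year` column is the ordinal position of the date within its year (1 for 1 January). The short sketch below shows the conversion for two made-up dates written in the Swiss format; it assumes the dates are still strings, whereas in this notebook `standardize_data` may already have parsed them.

```python
import pandas as pd

date_format = "%d.%m.%Y"  # Swiss format, as configured above

# Two made-up example dates; not taken from the validation data
dates = pd.Series(["15.06.2023", "07.04.2023"])

# Parse the strings and extract the ordinal day within the year
day_of_year = pd.to_datetime(dates, format=date_format).dt.dayofyear
print(day_of_year.tolist())  # [166, 97]
```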

To apply the emergent filters to the validation data provided by ornitho, we use the `is_unlikely` function from *02_Emergent_Filters.ipynb*. <br>
For each datapoint in the validation dataframe, this function looks up the plausibility value that the emergent filters hold for the given species, grid, and day of the year. If this value falls below the predetermined threshold, `flagged_for_review` is set to True. Datapoints without a matching entry in the emergent filters are not flagged.

In [109]:
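def is_unlikely(sighting, emergent_filters_lookup, threshold=0.05):
    """Return True if the sighting's plausibility is known and below the threshold."""
    key = (sighting.name_species, sighting.eea_grid_id, sighting.day_of_year)
    plausibility = emergent_filters_lookup.get(key)
    # Sightings without an entry in the lookup are not flagged
    return plausibility is not None and plausibility < threshold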
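
The three possible outcomes of the lookup can be illustrated with a toy example. The plausibility values and the `Sighting` tuple below are made up for illustration; the real lookup comes from the pickle file and is assumed to be a dictionary keyed by (species, grid, day of year), as the function implies. The function is repeated so that the sketch runs on its own.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of the validation dataframe
Sighting = namedtuple("Sighting", ["name_species", "eea_grid_id", "day_of_year"])

# Toy lookup with made-up plausibility values
toy_filters = {
    ("Bergente", "50kmE4250N2700", 97): 0.01,
    ("Haubentaucher", "50kmE4250N2700", 49): 0.40,
}

def is_unlikely(sighting, emergent_filters_lookup, threshold=0.05):
    key = (sighting.name_species, sighting.eea_grid_id, sighting.day_of_year)
    plausibility = emergent_filters_lookup.get(key)
    return plausibility is not None and plausibility < threshold

rare = Sighting("Bergente", "50kmE4250N2700", 97)          # below the threshold
common = Sighting("Haubentaucher", "50kmE4250N2700", 49)   # above the threshold
unknown = Sighting("Gelbspötter", "50kmE4300N2700", 133)   # no entry in the lookup

results = {
    "rare": is_unlikely(rare, toy_filters),
    "common": is_unlikely(common, toy_filters),
    "unknown": is_unlikely(unknown, toy_filters),
}
print(results)  # {'rare': True, 'common': False, 'unknown': False}
```

Note that a missing entry yields False, so combinations the filters have never seen are not flagged under this rule.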

In [110]:
validata['flagged_for_review'] = validata.apply(is_unlikely, args=(emergent_filters,threshold,), axis=1)
validata

| | name_species | eea_grid_id | day_of_year | flagged_for_review |
|---|---|---|---|---|
| 0 | Alpenschneehuhn | 50kmE4300N2650 | 166 | False |
| 1 | Bergente | 50kmE4250N2700 | 97 | True |
| 2 | Bergente | 50kmE4250N2700 | 87 | True |
| 3 | Gelbspötter | 50kmE4300N2700 | 133 | False |
| 4 | Haubentaucher | 50kmE4250N2700 | 49 | False |
| ... | ... | ... | ... | ... |
| 87054 | Haubentaucher | 50kmE4150N2550 | 3 | True |
| 87055 | Schwarzkehlchen | 50kmE4100N2550 | 71 | False |
| 87056 | Braunkehlchen | 50kmE4150N2500 | 138 | True |
| 87057 | Knäkente | 50kmE4150N2500 | 76 | True |


In [111]:
validata.flagged_for_review.value_counts()

flagged_for_review
False    82709
True      4350
Name: count, dtype: int64
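
For orientation, the counts above can be turned into the share of flagged sightings. This is plain arithmetic on the two numbers printed by `value_counts()`; it says nothing about how many of the flagged sightings are actually invalid.

```python
# Counts copied from the value_counts() output above
n_not_flagged = 82709
n_flagged = 4350

total = n_not_flagged + n_flagged
flagged_share = n_flagged / total

print(f"{n_flagged} of {total} sightings flagged ({flagged_share:.2%})")
# 4350 of 87059 sightings flagged (5.00%)
```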

### Append review flags to original dataframe

In [112]:
original_data = pd.read_csv(path_validata, delimiter=get_delimiter(path_validata), low_memory=False)

In [113]:
# The assignment aligns on the index, so rows absent from validata receive NaN
original_data['ERROR_DETECTED'] = validata.flagged_for_review
original_data.to_csv(target_path)