# Emergent Filters
Our first approach to detecting unusual sightings is adapted directly from eBird. Part of their process is what they call Emergent Filters, which only factor in the location and date of the sighting. They describe them as follows:
> [...] [An emergent filter] is a measure of how frequently a species is reported, which is calculated using the number of checklists that reported the species divided by the total number of checklists submitted for a specific region. The result is a measure of the “likelihood” of observing a specific bird species [...] at any spatial level and for any date. [...] We set the emergent data filters at 5% of total frequency as a threshold level to identify all outlier observations.
In this notebook, we prototype this idea and compute emergent filters for the 27 species selected by ornitho.de and ornitho.ch.

In [17]:
import pandas as pd
import plotly.express as px

In [18]:
data = ''  # please provide path to selected_bird_species_with_grids_50km.csv

In [19]:
df = pd.read_csv(data, index_col=0, low_memory=False).reset_index(drop=True)
df.head()

Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,altitude,total_count,atlas_code,id_observer,country,eea_grid_id
0,29666972,8.0,Haubentaucher,2018-01-01,,53.15776,8.676993,place,0,0.0,,37718.0,de,50kmE4200N3300
1,29654244,397.0,Schwarzkehlchen,2018-01-01,,53.127639,8.957263,square,0,2.0,,37803.0,de,50kmE4250N3300
2,29654521,463.0,Wiesenpieper,2018-01-01,,50.850941,12.146953,place,0,2.0,,39627.0,de,50kmE4450N3050
3,29666414,8.0,Haubentaucher,2018-01-01,,51.076006,11.038316,place,0,8.0,,38301.0,de,50kmE4350N3100
4,29656211,8.0,Haubentaucher,2018-01-01,,51.38938,7.067282,place,0,10.0,,108167.0,de,50kmE4100N3100


## Creating Emergent Filters
Using historical data, the filters calculate the relative proportions of bird sightings for each species in the past year. From this, they derive the probability of a bird sighting for each species, grid, and date in the following year. This process results in probability curves throughout the year for sighting a particular bird species within a specific grid.

In this notebook, we compute the emergent filters for the year 2023 for the 27 species of interest. This computation is based on the data from the year 2022. To achieve this, we narrow down the available data to sightings in the year 2022. Similar to the approach used in eBird, each sighting is assigned a 'day of year' value, ranging from 1 to 365 [1].
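
As a small sanity check on the 'day of year' mapping, the sketch below converts a few hand-picked dates (chosen for illustration only) with pandas. It also shows that a leap year such as 2024 has a day 366, which a 1 to 365 range does not cover; how to handle that day is left open here.

```python
import pandas as pd

# Illustrative dates only; the first three fall in 2022, the last in a leap year.
dates = pd.to_datetime(["2022-01-30", "2022-04-10", "2022-12-31", "2024-12-31"])
days = dates.dayofyear

for date, day in zip(dates, days):
    print(date.date(), "-> day", day)
```

Day 30 and day 100 are the two values used for the example sightings at the end of this notebook.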

In [20]:
bird_sightings = df[df.date.str.startswith('2022')].copy()  # dates are ISO strings (YYYY-MM-DD)
bird_sightings['day_of_year'] = pd.to_datetime(bird_sightings.date).dt.dayofyear
bird_sightings.drop(columns=['id_sighting', 'id_species', 'timing', 'date', 'coord_lat', 'coord_lon', 'precision', 'altitude', 'total_count', 'atlas_code', 'id_observer', 'country'], inplace=True)

In [21]:
bird_sightings

Unnamed: 0,name_species,eea_grid_id,day_of_year
3760,Rohrammer,50kmE4150N2750,1
3761,Bergpieper,50kmE4050N3100,1
3762,Wiesenpieper,50kmE4050N3050,1
3763,Wiesenpieper,50kmE4150N2800,1
3764,Haubentaucher,50kmE4100N3000,1
...,...,...,...
2660035,Steinschmätzer,50kmE4200N2600,213
2660036,Steinschmätzer,50kmE4200N2700,288
2660037,Schwarzkehlchen,50kmE4200N2700,288
2660038,Bergpieper,50kmE4200N2600,213


Next, eBird computes a metric that reflects the frequency of species reporting. This metric is obtained by dividing the number of checklists reporting a specific species by the total number of checklists submitted for a particular region. The outcome provides an indication of the 'probability' of encountering a particular bird species within that specific region. Since each observation includes information about the bird's location and time of detection, it becomes feasible to calculate the occurrence frequencies of bird species at various spatial scales and for any given date [1].
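
The ratio itself can be illustrated on a toy table. This is a minimal sketch with invented data: the `checklist_id` column and all values are assumptions for illustration and do not exist in the ornitho export, which is organized by individual sightings rather than checklists.

```python
import pandas as pd

# Hypothetical toy data: one row per (checklist, species) report.
reports = pd.DataFrame({
    "checklist_id": [1, 1, 2, 3, 3, 4],
    "name_species": ["Knäkente", "Seeadler", "Seeadler",
                     "Knäkente", "Seeadler", "Seeadler"],
})

# Denominator: all checklists submitted for the region.
total_checklists = reports["checklist_id"].nunique()

# Numerator: checklists that reported each species at least once.
checklists_per_species = reports.groupby("name_species")["checklist_id"].nunique()

frequency = checklists_per_species / total_checklists
print(frequency)
```

Here the Knäkente appears on 2 of 4 checklists and the Seeadler on all 4, so their frequencies are 0.5 and 1.0.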

We implement this concept through a multi-step process.

First, we generate a dataframe containing every possible combination of bird species, grid, and day of year. This lets us calculate the likelihood of a sighting for all combinations, including those for which no sighting was recorded.

In [22]:
# build a df with all possible species/grid/day combinations

grid_list = bird_sightings.eea_grid_id.unique()
day_list = range(1, 366)
species_list = bird_sightings.name_species.unique()
all_combinations = pd.MultiIndex.from_product([species_list, grid_list, day_list], names=['name_species', 'eea_grid_id', 'day_of_year'])
all_combinations = pd.DataFrame(index=all_combinations).reset_index()
all_combinations

Unnamed: 0,name_species,eea_grid_id,day_of_year
0,Rohrammer,50kmE4150N2750,1
1,Rohrammer,50kmE4150N2750,2
2,Rohrammer,50kmE4150N2750,3
3,Rohrammer,50kmE4150N2750,4
4,Rohrammer,50kmE4150N2750,5
...,...,...,...
2207515,Gänsegeier,50kmE4150N2500,361
2207516,Gänsegeier,50kmE4150N2500,362
2207517,Gänsegeier,50kmE4150N2500,363
2207518,Gänsegeier,50kmE4150N2500,364


Subsequently, for each combination, the total number of submitted sightings on that specific day and grid, as well as the count of the corresponding bird species, is determined:

In [23]:
# number of all sightings per grid per day
by_days = bird_sightings.groupby(['day_of_year', 'eea_grid_id']).count().reset_index()
by_days.rename(columns={'name_species':'total_sightings'}, inplace=True)
by_days = all_combinations.merge(by_days, on=['eea_grid_id', 'day_of_year'], how='left')
by_days['total_sightings'] = by_days['total_sightings'].fillna(0).astype(int)
by_days

Unnamed: 0,name_species,eea_grid_id,day_of_year,total_sightings
0,Rohrammer,50kmE4150N2750,1,32
1,Rohrammer,50kmE4150N2750,2,14
2,Rohrammer,50kmE4150N2750,3,12
3,Rohrammer,50kmE4150N2750,4,1
4,Rohrammer,50kmE4150N2750,5,16
...,...,...,...,...
2207515,Gänsegeier,50kmE4150N2500,361,0
2207516,Gänsegeier,50kmE4150N2500,362,0
2207517,Gänsegeier,50kmE4150N2500,363,0
2207518,Gänsegeier,50kmE4150N2500,364,0


In [24]:
# number of sightings for specific species per day per grid
by_days_and_species = bird_sightings.groupby(['name_species', 'day_of_year', 'eea_grid_id']).size().reset_index(name='n_sightings')

total_df = by_days.merge(by_days_and_species, on=['name_species', 'eea_grid_id', 'day_of_year'], how='left')
total_df['n_sightings'] = total_df['n_sightings'].fillna(0).astype(int)
total_df

Unnamed: 0,name_species,eea_grid_id,day_of_year,total_sightings,n_sightings
0,Rohrammer,50kmE4150N2750,1,32,7
1,Rohrammer,50kmE4150N2750,2,14,1
2,Rohrammer,50kmE4150N2750,3,12,3
3,Rohrammer,50kmE4150N2750,4,1,0
4,Rohrammer,50kmE4150N2750,5,16,2
...,...,...,...,...,...
2207515,Gänsegeier,50kmE4150N2500,361,0,0
2207516,Gänsegeier,50kmE4150N2500,362,0,0
2207517,Gänsegeier,50kmE4150N2500,363,0,0
2207518,Gänsegeier,50kmE4150N2500,364,0,0


Finally, we calculate the likelihood or frequency of bird sightings, following the methodology employed by eBird. This is expressed as:
$$\text{frequency} = \frac{\text{n\_sightings}}{\text{total\_sightings}}$$
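
As a worked instance, the first row of the table above (Rohrammer, grid 50kmE4150N2750, day 1) has 7 Rohrammer sightings among 32 sightings in total:
$$\text{frequency} = \frac{7}{32} = 0.21875$$
On day 4 in the same grid there is a single sighting, and it is not a Rohrammer, so the frequency is $0/1 = 0$.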

In [25]:
total_df['frequency'] = total_df.n_sightings / total_df.total_sightings
# days without any sightings in a grid give 0/0 = NaN; treat them as frequency 0
total_df['frequency'] = total_df['frequency'].fillna(0)
total_df

Unnamed: 0,name_species,eea_grid_id,day_of_year,total_sightings,n_sightings,frequency
0,Rohrammer,50kmE4150N2750,1,32,7,0.218750
1,Rohrammer,50kmE4150N2750,2,14,1,0.071429
2,Rohrammer,50kmE4150N2750,3,12,3,0.250000
3,Rohrammer,50kmE4150N2750,4,1,0,0.000000
4,Rohrammer,50kmE4150N2750,5,16,2,0.125000
...,...,...,...,...,...,...
2207515,Gänsegeier,50kmE4150N2500,361,0,0,0.000000
2207516,Gänsegeier,50kmE4150N2500,362,0,0,0.000000
2207517,Gänsegeier,50kmE4150N2500,363,0,0,0.000000
2207518,Gänsegeier,50kmE4150N2500,364,0,0,0.000000


Additionally, eBird introduces one more step:
> "To account for this variation in the number of checklists per day the frequencies were calculated based on a sliding 7-day window. The frequency for day X was calculated using a total number of checklists from 3 days prior through 3 days after day X. We then assigned the highest initial frequency within that same sliding 7-day window to day X. The resulting frequency is an estimate of the likelihood of observing a species on each of the 365 days of the year." [1]

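Thus, we determine the rolling frequency by taking the maximum of the previously calculated daily frequencies within a centered sliding 7-day window. The year is treated as circular, so windows at the beginning of January also draw on the last days of December, and vice versa.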

As the only modification to the eBird methodology, we then apply a second centered rolling window that averages the values over 30 days in order to smooth them. Because the dataset from ornitho.de and ornitho.ch is considerably smaller than eBird's, the values can still fluctuate strongly after the 7-day window, simply because few sightings are available. The averaging is meant to mitigate this effect and compensate for the reduced data volume.
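
To see what the two windows do in isolation, here is a minimal sketch on a toy series of ten "days". The helper `wrapped_rolling`, the toy values, and the short windows (3 and 5 instead of 7 and 30) are assumptions chosen to keep the example readable; they are not the notebook's actual implementation.

```python
import pandas as pd

def wrapped_rolling(values, window, how):
    """Centered rolling statistic on a series whose ends wrap around."""
    s = pd.Series(values, dtype=float).reset_index(drop=True)
    pad = window  # generous padding; at least window // 2 entries are needed per side
    padded = pd.concat([s.iloc[-pad:], s, s.iloc[:pad]], ignore_index=True)
    rolled = getattr(padded.rolling(window=window, center=True), how)()
    # cut the padding off again so the result has the original length
    return rolled.iloc[pad:pad + len(s)].reset_index(drop=True)

# Toy daily frequencies: one spike early on, one on the very last day.
toy = [0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3]

peak = wrapped_rolling(toy, window=3, how="max")      # spreads each spike to its neighbours
smooth = wrapped_rolling(peak, window=5, how="mean")  # averages the plateaus into a curve

print(pd.DataFrame({"frequency": toy, "peak": peak, "smooth": smooth}))
```

Because of the wrap-around, the spike on the last day also raises the value of the first day, which is the behaviour one would want for a species present across the turn of the year.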

By utilizing the calculated plausibility values, the sightings can now be assigned a plausibility rating ranging from 0 to 100%.

In [26]:
groups = total_df.groupby(['name_species', 'eea_grid_id'])

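def circular_rolling(group):
    # Treat the year as circular: pad with the last 19 and the first 17 days so that
    # both centered windows are fully populated at the start and end of the year.
    df = group.copy()
    df = pd.concat([df.iloc[-19:], df, df.iloc[:17]])
    # highest daily frequency within a centered 7-day window
    df['frequency_rolling'] = df.frequency.rolling(window=7, center=True).max()
    # centered 30-day mean to smooth the curve
    df['plausibility'] = df.frequency_rolling.rolling(window=30, center=True).mean()
    # remove the padding again
    return df.iloc[19:].iloc[:-17]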

result_df = pd.concat([circular_rolling(group) for _, group in groups])
result_df.reset_index(drop=True, inplace=True)
result_df

Unnamed: 0,name_species,eea_grid_id,day_of_year,total_sightings,n_sightings,frequency,frequency_rolling,plausibility
0,Alpenschneehuhn,50kmE4000N2500,1,0,0,0.0,0.0,0.0
1,Alpenschneehuhn,50kmE4000N2500,2,0,0,0.0,0.0,0.0
2,Alpenschneehuhn,50kmE4000N2500,3,0,0,0.0,0.0,0.0
3,Alpenschneehuhn,50kmE4000N2500,4,0,0,0.0,0.0,0.0
4,Alpenschneehuhn,50kmE4000N2500,5,0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
2207515,Zwergohreule,50kmE4650N3150,361,2,0,0.0,0.0,0.0
2207516,Zwergohreule,50kmE4650N3150,362,0,0,0.0,0.0,0.0
2207517,Zwergohreule,50kmE4650N3150,363,0,0,0.0,0.0,0.0
2207518,Zwergohreule,50kmE4650N3150,364,0,0,0.0,0.0,0.0


Lastly, we check that every plausibility value was computed, i.e. that no NaN values remain. This is the case:

In [27]:
result_df.plausibility.isna().any()

False

## Visualization of Emergent Filters
The calculated values now yield probability curves throughout the year for sighting a particular species within a specific grid.

As an illustrative example in this notebook, we visualize the emergent filters for the 'Knäkente' and the 'Seeadler' in a grid near Hannover for the year 2023.

If the probability of a sighting on a specific day falls below a certain threshold (for instance 5%, as at eBird), all sightings for that day are flagged. The curve for the Knäkente shows that sightings of this migratory species outside its usual stay in Germany would consistently be flagged. The graph also aligns well with the species' migration phenology. In the chosen grid and with a 5% threshold, a 'Seeadler' sighting would never be flagged.
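
The flagged periods can also be listed as day ranges instead of being read off a plot. The following is a minimal sketch under the assumption that the plausibility values of one species and grid are available as a Series indexed by consecutive days; the helper `flagged_day_ranges` and the toy values are hypothetical.

```python
import pandas as pd

def flagged_day_ranges(plausibility, threshold=0.05):
    """Return (first_day, last_day) tuples for runs of days below the threshold."""
    below = plausibility < threshold
    # a new run starts wherever the below/above state changes
    run_id = (below != below.shift()).cumsum()
    ranges = []
    for _, run in below.groupby(run_id):
        if run.iloc[0]:  # keep only the runs that lie below the threshold
            ranges.append((int(run.index[0]), int(run.index[-1])))
    return ranges

# Toy plausibility curve for a 7-day "year"
toy = pd.Series([0.0, 0.01, 0.08, 0.20, 0.06, 0.02, 0.0], index=range(1, 8))
print(flagged_day_ranges(toy))  # [(1, 2), (6, 7)]
```

This sketch does not merge a run at the end of the year with one at the beginning, so a winter gap would be reported as two ranges.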

In [28]:
def plot_plausibility_over_year(df, species, grid, sign='🦆', threshold=0.05):
    data = df[(df.name_species == species) & (df.eea_grid_id == grid)]
    fig = px.line(data, x='day_of_year', y='plausibility', color_discrete_sequence=['#0074D9'])
    fig.add_shape(type="rect", x0=0, x1=365, y0=0, y1=threshold, fillcolor="red", opacity=0.1)
    fig.add_shape(type="rect", x0=0, x1=365, y0=threshold, y1=data.plausibility.max(), fillcolor="green", opacity=0.1)
    fig.add_annotation(x=365/2, y=threshold/2, text='flagged for review', showarrow=False, font=dict(color="red"))
    fig.add_annotation(x=365/2, y=threshold+0.02, text='OK', showarrow=False, font=dict(color="green"))
    fig.update_layout(title={'text': "{} Plausibility for seeing a {} in '{}' {}".format(sign, species, grid, sign),
                             'x': 0.5,'xanchor': 'center','yanchor': 'top'},
                      xaxis_title='Day of Year', yaxis_title='Likelihood',
                      font=dict(family="Gill Sans", size=15, color="#333333"))
    fig.show()

In [29]:
plot_plausibility_over_year(result_df, species='Knäkente', grid='50kmE4350N3250', sign='🦆')

In [30]:
plot_plausibility_over_year(result_df, species='Seeadler', grid='50kmE4350N3250', sign='🦅')

## Applying emergent filters for flagging unusual sightings
The example code below shows how the emergent filters could be used to assess newly incoming sightings. We look up the entry for the relevant species, grid, and day of year and retrieve its plausibility value. If this value falls below the chosen threshold (here 5%), the `flagged_for_review` flag is set to True. An expert can then filter by this flag and review the corresponding sightings. Alternatively, a system with continuous values could attach the plausibility value directly to each new sighting, so that an expert can sort the sightings by plausibility and start with the least plausible ones.
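
The continuous variant could look roughly like the sketch below. All values are invented for illustration, `filters` merely mimics the relevant columns of `result_df`, and placing sightings without a matching filter entry at the top of the queue is an assumption, not something the notebook prescribes.

```python
import pandas as pd

keys = ["name_species", "eea_grid_id", "day_of_year"]

# Toy stand-in for the emergent filter table (hypothetical plausibility values)
filters = pd.DataFrame({
    "name_species": ["Knäkente", "Knäkente", "Seeadler"],
    "eea_grid_id": ["50kmE4350N3250"] * 3,
    "day_of_year": [30, 100, 30],
    "plausibility": [0.01, 0.12, 0.20],
})

# Toy batch of incoming sightings; the last one has no filter entry
new_sightings = pd.DataFrame({
    "name_species": ["Seeadler", "Knäkente", "Knäkente", "Knäkente"],
    "eea_grid_id": ["50kmE4350N3250"] * 4,
    "day_of_year": [30, 100, 30, 200],
})

# Attach the plausibility to every sighting, then sort least plausible first.
scored = new_sightings.merge(filters, on=keys, how="left")
review_queue = scored.sort_values("plausibility", na_position="first").reset_index(drop=True)
print(review_queue)
```

A left merge keeps every incoming sighting, so combinations unknown to the filters show up with a missing plausibility rather than silently disappearing.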

In [31]:
def create_emergent_filters_lookup(emergent_filters):
    lookup = emergent_filters.groupby(['name_species', 'eea_grid_id', 'day_of_year'])['plausibility'].first().to_dict()
    return lookup

def is_unlikely(sighting, emergent_filters_lookup, threshold=0.05):
    key = (sighting['name_species'], sighting['eea_grid_id'], sighting['day_of_year'])
    plausibility = emergent_filters_lookup.get(key, None)
    return plausibility is not None and plausibility < threshold

In [32]:
emergent_filters_lookup = create_emergent_filters_lookup(result_df)  # convert pandas df to dict as it is more efficient to compute with

In [33]:
# flag single datapoints

# let's set threshold to <5% (as at eBird)
threshold = 0.05

# let's artificially create a usual datapoint and an unusual datapoint
# sighting on Jan 30 (day 30, unusual)
unusual_knaekente = {'name_species': 'Knäkente', 'eea_grid_id': '50kmE4350N3250', 'day_of_year': 30}

# sighting on Apr 10 (day 100, usual)
usual_knaekente = {'name_species': 'Knäkente', 'eea_grid_id': '50kmE4350N3250', 'day_of_year': 100}

print('Should this datapoint be flagged for review?', is_unlikely(usual_knaekente, emergent_filters_lookup, threshold))
print('Should this datapoint be flagged for review?', is_unlikely(unusual_knaekente, emergent_filters_lookup, threshold))

Should this datapoint be flagged for review? False
Should this datapoint be flagged for review? True


In [34]:
# flag a whole dataframe
result_df['flagged_for_review'] = result_df.apply(is_unlikely, args=(emergent_filters_lookup, threshold,), axis=1)
result_df

Unnamed: 0,name_species,eea_grid_id,day_of_year,total_sightings,n_sightings,frequency,frequency_rolling,plausibility,flagged_for_review
0,Alpenschneehuhn,50kmE4000N2500,1,0,0,0.0,0.0,0.0,True
1,Alpenschneehuhn,50kmE4000N2500,2,0,0,0.0,0.0,0.0,True
2,Alpenschneehuhn,50kmE4000N2500,3,0,0,0.0,0.0,0.0,True
3,Alpenschneehuhn,50kmE4000N2500,4,0,0,0.0,0.0,0.0,True
4,Alpenschneehuhn,50kmE4000N2500,5,0,0,0.0,0.0,0.0,True
...,...,...,...,...,...,...,...,...,...
2207515,Zwergohreule,50kmE4650N3150,361,2,0,0.0,0.0,0.0,True
2207516,Zwergohreule,50kmE4650N3150,362,0,0,0.0,0.0,0.0,True
2207517,Zwergohreule,50kmE4650N3150,363,0,0,0.0,0.0,0.0,True
2207518,Zwergohreule,50kmE4650N3150,364,0,0,0.0,0.0,0.0,True


## Initial Assessment: Possible Opportunities and Limitations
The Emergent Filters are a method that is already established and used in practice at eBird. We therefore intend to use this approach as a benchmark for our outlier analysis. While they are relatively cheap to compute, they assess sightings solely by location and date and disregard other features that can significantly influence plausibility. For instance, a Knäkente with Atlas code 0 might be highly plausible, while one with a high Atlas code (indicating breeding) could be quite unusual.

We aim to compare the Emergent Filters with our outlier analysis, which considers all available features and employs common outlier detection algorithms, to test whether it performs better, equally well, or worse. This comparison will show whether the straightforward, established Emergent Filters are already effective or whether the more comprehensive outlier analysis offers superior performance.