# Bird Analysis
This notebook runs initial descriptive analyses on the German and Swiss bird-sighting data to get a sense of both datasets.

***

You can download the data needed to run this code [here](https://drive.google.com/drive/folders/1eznk8GyIKt8fPJCb4TVqEIkrNcwonn9m).

In [3]:
import pandas as pd
import plotly.express as px

In [1]:
data_path_ch = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/datasets/birds_ch_2018-2022.csv'  # Path to the Swiss dataset
data_path_de = '/Users/marinasiebold/Downloads/ornitho_de_lu_2018_2022_KI_Trials_20230512_1041.csv'  # Path to the German dataset
data_path_master = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/master_bird_data.csv'  # Path where the merged dataset will be saved

In [4]:
ch_data = pd.read_csv(data_path_ch, delimiter=';')

In [19]:
de_data = pd.read_csv(data_path_de)

# Basic Stats

## NaN values 💬

#### Switzerland

In [8]:
ch_data.isna().sum() * 100 / len(ch_data)

ID_SIGHTING       0.000000
ID_SPECIES        0.000645
NAME_SPECIES      0.000000
DATE              0.000000
TIMING           63.002890
COORD_LAT         0.000000
COORD_LON         0.000000
PRECISION         0.000000
ALTITUDE          0.000000
TOTAL_COUNT      12.565139
ATLAS_CODE_CH     0.000000
ID_OBSERVER       0.000010
dtype: float64

In the Swiss dataset, an `ATLAS_CODE_CH` of zero means that no atlas code was given for the data point. Below, we therefore compute the percentage of missing values by counting zeroes rather than NaN:

In [20]:
print('Missing Atlas Codes in %:', (ch_data.ATLAS_CODE_CH == 0).sum() / len(ch_data) *100)

Missing Atlas Codes in %: 70.40417217792223


The same applies to the total count of sighted birds. Here a missing value can appear either as a zero or as NaN, so both are counted:

In [31]:
print('Missing Total Counts in %:', ((ch_data.TOTAL_COUNT == 0).sum()+ (ch_data.TOTAL_COUNT.isna()).sum()) / len(ch_data) *100)

Missing Total Counts in %: 12.753057959512743
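
The checks above share one pattern: a value counts as missing if it is NaN or equals a sentinel (here, zero). A minimal sketch of that pattern as a reusable helper follows. The function name `missing_pct` and the toy series are hypothetical and not part of the original analysis.

```python
import pandas as pd


def missing_pct(series: pd.Series, sentinel=0) -> float:
    """Percentage of entries that are NaN or equal to the sentinel value."""
    is_missing = series.isna() | (series == sentinel)
    return is_missing.mean() * 100


# Toy data standing in for a column such as TOTAL_COUNT (values are made up)
toy_counts = pd.Series([3, 0, None, 12, 0, 1, 7, None])
print('Missing in %:', missing_pct(toy_counts))
```

On the real data, this could be called as `missing_pct(ch_data.TOTAL_COUNT)`, assuming zero is the only sentinel in use.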


#### Germany

In [10]:
de_data.isna().sum() * 100 / len(de_data)

id_sighting         0.000000
id_species          0.000000
name_species        0.000000
date                0.000000
timing             58.133934
coord_lat           0.000000
coord_lon           0.000000
precision           0.000000
estimation_code     0.000000
altitude            0.000000
total_count         0.000000
altas_code         73.208080
beobachter          0.000000
dtype: float64

In the German dataset, a missing `total_count` is filled with a zero. Below, we therefore compute the percentage of missing values by counting zeroes rather than NaN:

In [26]:
print('Missing Total Counts in %:', ((de_data.total_count == 0).sum()) / len(de_data) *100)

Missing Total Counts in %: 5.179704443933757


## Birdos 🦜 

### How many species?

In [11]:
n_species = ch_data['NAME_SPECIES'].nunique()
print('Number of species in ch dataset:', n_species)
n_species = de_data.name_species.nunique()
print('Number of species in de dataset:', n_species)

Number of species in ch dataset: 497
Number of species in de dataset: 708


### How many of each species? 

#### Switzerland

In [12]:
n_per_species = ch_data.groupby('NAME_SPECIES').size()

In [13]:
fig = px.bar(n_per_species)
fig.show()

#### Germany

In [16]:
n_per_species_de = de_data.groupby('name_species').size()
fig = px.bar(n_per_species_de, height=600)
fig.show()

### Top 10 birdos

#### Switzerland

In [1]:
fig = px.bar(n_per_species.sort_values(ascending=False)[0:10], title='Top 10 bird sightings Switzerland')  # keep only the 10 most frequent species
fig.show()

#### Germany

In [25]:
fig = px.bar(n_per_species_de.sort_values(ascending=False)[0:10], title='Top 10 bird sightings Germany')
fig.show()
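
A ranking like the ones above can also be obtained with `value_counts()`, which counts and sorts in one step. The sketch below uses a small made-up frame; the species names and counts are illustrative only.

```python
import pandas as pd

# Toy sightings (made-up rows, one row per sighting)
toy = pd.DataFrame({
    'name_species': ['Amsel', 'Kohlmeise', 'Amsel', 'Buchfink', 'Amsel', 'Kohlmeise']
})

# value_counts() sorts by frequency in descending order by default,
# so head(n) returns the n most frequently sighted species
top_species = toy['name_species'].value_counts().head(2)
print(top_species)
```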

## Observers  👀

### How many observers?

In [27]:
n_observers = ch_data['ID_OBSERVER'].nunique()
print('Number of observers in Switzerland:', n_observers)
n_observers_de = de_data.beobachter.nunique()
print('Number of observers in Germany:', n_observers_de)

Number of observers in Switzerland: 8885
Number of observers in Germany: 22218


### Top 10 observers

#### Switzerland

In [37]:
ch_data['ID_OBSERVER'] = ch_data['ID_OBSERVER'].astype(str)
n_per_observer = ch_data.groupby('ID_OBSERVER').size().sort_values(ascending=False)
fig = px.bar(n_per_observer[0:10], title='Top 10 observers in Switzerland')
fig.show()

In [35]:
n_per_observer.iloc[0] / len(ch_data) * 100  # share of all sightings from the most active observer, in %

2.2882279581270546

#### Germany

In [38]:
de_data['beobachter'] = de_data['beobachter'].astype(str)
n_per_observer_de = de_data.groupby('beobachter').size().sort_values(ascending=False)
fig = px.bar(n_per_observer_de[0:10], title='Top 10 observers in Germany')
fig.show()

In [36]:
n_per_observer_de.iloc[0] / len(de_data) * 100  # share of all sightings from the most active observer, in %

0.691516971562718
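
The single-observer share above generalises to the top k observers. The following is a rough sketch with a hypothetical helper `top_k_share` and made-up counts; it assumes the per-observer counts sum to the total number of sightings.

```python
import pandas as pd


def top_k_share(counts: pd.Series, k: int) -> float:
    """Percentage of all sightings contributed by the k most active observers."""
    ordered = counts.sort_values(ascending=False)
    return ordered.iloc[:k].sum() / ordered.sum() * 100


# Made-up sightings per observer
toy_counts = pd.Series({'a': 50, 'b': 30, 'c': 15, 'd': 5})
print(top_k_share(toy_counts, 1))
print(top_k_share(toy_counts, 2))
```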

## Weekday and monthly distributions

#### Switzerland

In [55]:
ch_data['weekday'] = pd.to_datetime(ch_data['DATE']).dt.day_name()

In [58]:
weekdays = ch_data.groupby('weekday').size()
fig = px.bar(weekdays, title='Weekday distribution of observations in Switzerland', category_orders={'weekday':['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']})
fig.show()

In [68]:
ch_data['month'] = pd.to_datetime(ch_data['DATE']).dt.month_name()
months = ch_data.groupby('month').size()
fig = px.bar(months, title='Monthly distribution of observations in Switzerland', category_orders={'month':['January', 'February', 'March',
                                        'April', 'May', 'June', 'July', 
                                        'August', 'September', 'October', 'November', 'December']})
fig.show()

In [76]:
ch_data['year'] = pd.to_datetime(ch_data['DATE']).dt.year
years = ch_data.groupby('year').size()
fig = px.bar(years, title='Yearly distribution of observations in Switzerland')
fig.show()

#### Germany

In [63]:
de_data['weekday'] = pd.to_datetime(de_data['date'], format='mixed', dayfirst=True).dt.day_name()

In [64]:
weekdays = de_data.groupby('weekday').size()
fig = px.bar(weekdays, title='Weekday distribution of observations in Germany', category_orders={'weekday':['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']})
fig.show()

In [72]:
de_data['month'] = pd.to_datetime(de_data['date'], format='mixed', dayfirst=True).dt.month_name()
months = de_data.groupby('month').size()
fig = px.bar(months, title='Monthly distribution of observations in Germany', category_orders={'month':['January', 'February', 'March',
                                        'April', 'May', 'June', 'July', 
                                        'August', 'September', 'October', 'November', 'December']})
fig.show()

In [77]:
de_data['year'] = pd.to_datetime(de_data['date'], format='mixed', dayfirst=True).dt.year
years = de_data.groupby('year').size()
fig = px.bar(years, title='Yearly distribution of observations in Germany')
fig.show()
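
The cells in this section parse the date column separately for weekday, month, and year. One possible alternative, sketched here on made-up ISO dates, parses once and stores the weekday as an ordered categorical so that grouped results come out in calendar order. The names `WEEKDAYS`, `toy`, and `parsed` are illustrative.

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Toy frame with made-up dates in ISO format
toy = pd.DataFrame({'date': ['2022-01-03', '2022-01-08', '2022-01-09', '2022-01-10']})

# Parse the date column once and reuse the result for every derived column
parsed = pd.to_datetime(toy['date'])
toy['weekday'] = pd.Categorical(parsed.dt.day_name(), categories=WEEKDAYS, ordered=True)
toy['year'] = parsed.dt.year

# observed=False keeps weekdays without any sightings, in calendar order
weekday_counts = toy.groupby('weekday', observed=False).size()
print(weekday_counts)
```

For the German data, the parsing call would still need the `format='mixed', dayfirst=True` arguments used above.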

## Spatial distributions

In [5]:
map_data = pd.DataFrame()
map_data['lat'] = ch_data['COORD_LAT']
map_data['lon'] = ch_data['COORD_LON']
map_data['count'] = 1

In [11]:
y2022 = ch_data[ch_data.DATE.str.contains('2022')]  # sightings from 2022 only
map_data_2022 = pd.DataFrame()
map_data_2022['lat'] = y2022['COORD_LAT']
map_data_2022['lon'] = y2022['COORD_LON']
map_data_2022['count'] = 1

In [None]:
taucher = ch_data[ch_data.NAME_SPECIES == 'Haubentaucher']  # sightings of one species only
map_data_taucher = pd.DataFrame()
map_data_taucher['lat'] = taucher['COORD_LAT']
map_data_taucher['lon'] = taucher['COORD_LON']
map_data_taucher['count'] = 1

In [None]:
stausee = ch_data[(ch_data.COORD_LAT >= 47.1) & (ch_data.COORD_LAT <= 47.5) & (ch_data.COORD_LON <= 8.3) & (ch_data.COORD_LON >= 8)]  # latitude bounds were swapped, which selected no rows
map_data_stausee = pd.DataFrame()
map_data_stausee['lat'] = stausee['COORD_LAT']
map_data_stausee['lon'] = stausee['COORD_LON']
map_data_stausee['count'] = 1

In [None]:
sempach = ch_data[(ch_data.COORD_LAT <= 47.2) & (ch_data.COORD_LAT >= 47) & (ch_data.COORD_LON <= 8.5) & (ch_data.COORD_LON >= 8)]
map_data_sempach = pd.DataFrame()
map_data_sempach['lat'] = sempach['COORD_LAT']
map_data_sempach['lon'] = sempach['COORD_LON']
map_data_sempach['count'] = 1

In [None]:
fig = px.density_mapbox(map_data_sempach, lat='lat', lon='lon', z='count',  # choose the map to show: 'map_data', 'map_data_2022', 'map_data_taucher', 'map_data_stausee', or 'map_data_sempach'
                        mapbox_style='open-street-map', radius=4, height=1600)

fig.show()
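
The region filters above all repeat the same steps: select rows inside a latitude/longitude box, then build a `lat`/`lon`/`count` frame. A minimal sketch of a helper for this is shown below. The function `bbox_map_data` and the toy coordinates are hypothetical; only the column names `COORD_LAT` and `COORD_LON` come from the Swiss dataset.

```python
import pandas as pd


def bbox_map_data(df, lat_min, lat_max, lon_min, lon_max):
    """Return lat/lon/count columns for sightings inside a bounding box (bounds inclusive)."""
    inside = (df['COORD_LAT'].between(lat_min, lat_max)
              & df['COORD_LON'].between(lon_min, lon_max))
    return (df.loc[inside, ['COORD_LAT', 'COORD_LON']]
              .rename(columns={'COORD_LAT': 'lat', 'COORD_LON': 'lon'})
              .assign(count=1))


# Made-up coordinates: two points inside the box, two outside
toy = pd.DataFrame({
    'COORD_LAT': [47.1, 46.9, 47.15, 47.2],
    'COORD_LON': [8.2, 8.2, 8.6, 8.0],
})
result = bbox_map_data(toy, lat_min=47, lat_max=47.2, lon_min=8, lon_max=8.5)
print(result)
```

Passing the minimum and maximum as named arguments makes it harder to swap the bounds by accident.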