# Merge Swiss and German datasets
We aligned both datasets to create a consistent master dataset. <br><br>
Important decisions we made: <br>
- Remove the `estimation_code` feature from the German dataset, as it is not available in the Swiss dataset and is not considered crucial for our use case.
- Keep `altitude` even though all German entries are zero. This may need further discussion.
- Align feature names
- Align dtypes
- Align date formats
- Align precisions; drop the precision values in the German dataset that are not listed in the datafield description and occur only rarely.

You can download the master dataset, the Swiss dataset and the German dataset [here](https://drive.google.com/drive/folders/1R9VHEs6nq8ogPYSSp8IfSbkFWFAoyhm8?usp=sharing).<br>
Alternatively, run the code yourself to create the master dataset; please provide your data paths in section 1.

In [201]:
import pandas as pd

## 1 - Load data

In [202]:
data_path_ch = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/birds_ch_2018-2022.csv'  # Provide data path of Swiss dataset
data_path_de = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/birds_de_2018-2022.csv'  # Provide data path of German dataset
data_path_master = '/Users/marinasiebold/Library/Mobile Documents/com~apple~CloudDocs/Studium/Bird_Research/01_Data/master_bird_data.csv'  # Provide data path where the merged dataset shall be saved

In [203]:
ch_data = pd.read_csv(data_path_ch, delimiter=';')
ch_data.head()

Unnamed: 0,ID_SIGHTING,ID_SPECIES,NAME_SPECIES,DATE,TIMING,COORD_LAT,COORD_LON,PRECISION,ALTITUDE,TOTAL_COUNT,ATLAS_CODE_CH,ID_OBSERVER
0,14731644,371.0,Blaumeise,2018-01-21,,46.217211,7.582658,Exakte Lokalisierung,1150,1.0,0,11750.0
1,15360340,361.0,Saatkrähe,2018-03-24,10:41:00,46.923721,7.481304,Exakte Lokalisierung,510,,0,2246.0
2,15360731,358.0,Rabenkrähe,2018-03-24,,46.887983,7.545741,Ort,520,,0,3539.0
3,15360732,495.0,Feldsperling,2018-03-24,,46.887983,7.545741,Ort,520,,0,3539.0
4,15360733,518.0,Buchfink,2018-03-24,,46.887983,7.545741,Ort,520,,0,3539.0


In [204]:
de_data = pd.read_csv(data_path_de)
de_data.head()


Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,estimation_code,altitude,total_count,altas_code,beobachter
0,29666944,119,Reiherente,01.01.2018,,53.15776,8.676993,place,EXACT_VALUE,0,24,,37718
1,29666945,141,Gänsesäger,01.01.2018,,53.15776,8.676993,place,EXACT_VALUE,0,1,,37718
2,29666946,24,Kormoran,01.01.2018,04:00,53.15776,8.676993,place,ESTIMATION,0,240,,37718
3,29666947,205,Blässhuhn,01.01.2018,,53.15776,8.676993,place,EXACT_VALUE,0,13,,37718
4,29666948,309,Ringeltaube,01.01.2018,,53.15776,8.676993,place,EXACT_VALUE,0,2,,37718


## 2 - Merge datasets

### Align features
`estimation_code` indicates whether the bird count is an exact value or an estimation. As it is only present in the German dataset, it is dropped.

In [205]:
de_data.drop(columns='estimation_code', inplace=True)

### Align feature names
Some columns represent the same features but have different names or typos.

In [206]:
ch_data.columns = ch_data.columns.str.lower()
ch_data.rename({'atlas_code_ch': 'atlas_code'}, axis='columns', inplace=True)
de_data.rename({'beobachter':'id_observer', 'altas_code': 'atlas_code'}, axis='columns', inplace=True)
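
After renaming, it can help to verify that both frames really expose the same column names. The following is a minimal sketch on empty toy frames (`ch_toy` and `de_toy` are hypothetical stand-ins for `ch_data` and `de_data`, with only three columns each); it repeats the renaming from above and then compares the column sets.

```python
import pandas as pd

# Toy frames standing in for ch_data and de_data (hypothetical, reduced column sets)
ch_toy = pd.DataFrame(columns=['ID_SIGHTING', 'ATLAS_CODE_CH', 'ID_OBSERVER'])
de_toy = pd.DataFrame(columns=['id_sighting', 'altas_code', 'beobachter'])

# Same renaming steps as in the cell above
ch_toy.columns = ch_toy.columns.str.lower()
ch_toy = ch_toy.rename(columns={'atlas_code_ch': 'atlas_code'})
de_toy = de_toy.rename(columns={'beobachter': 'id_observer', 'altas_code': 'atlas_code'})

# Columns that exist in only one of the two frames; empty sets mean the names are aligned
only_ch = set(ch_toy.columns) - set(de_toy.columns)
only_de = set(de_toy.columns) - set(ch_toy.columns)
print(only_ch, only_de)
```

If either set is non-empty, the listed columns still need to be renamed or dropped before merging.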

### Align dtypes
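Swiss data uses *float* for `id_species`, `total_count` and `id_observer`. <br>
German data uses *int*. <br><br>
The German scheme is used as no decimals are necessary for these features. The nullable `Int64` dtype is chosen because the Swiss data contains missing values (e.g. in `total_count`).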

In [207]:
df = pd.DataFrame(columns=['ch dtype', 'de dtype'])
for col in ch_data.columns:
    df.loc[col] = [ch_data[col].dtype, de_data[col].dtype]
df

Unnamed: 0,ch dtype,de dtype
id_sighting,int64,int64
id_species,float64,int64
name_species,object,object
date,object,object
timing,object,object
coord_lat,float64,float64
coord_lon,float64,float64
precision,object,object
altitude,int64,int64
total_count,float64,int64


In [208]:
ch_data.id_species = ch_data.id_species.astype('Int64')
ch_data.total_count = ch_data.total_count.astype('Int64')
ch_data.id_observer = ch_data.id_observer.astype('Int64')

de_data.id_species = de_data.id_species.astype('Int64')
de_data.total_count = de_data.total_count.astype('Int64')
de_data.id_observer = de_data.id_observer.astype('Int64')
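
The reason for the capitalized `Int64` can be illustrated with a small sketch. It assumes a toy series `s` (hypothetical values) with one missing entry: casting to the plain NumPy `int64` fails, while the nullable pandas `Int64` dtype keeps the missing value.

```python
import pandas as pd

# A missing value forces pandas to store the column as float64
s = pd.Series([371.0, None, 495.0])
print(s.dtype)

# Plain int64 cannot represent missing values, so the cast raises
try:
    s.astype('int64')
    cast_failed = False
except (ValueError, TypeError) as err:
    cast_failed = True
    print('int64 cast failed:', err)

# The nullable Int64 dtype keeps the missing entry as <NA>
nullable = s.astype('Int64')
print(nullable)
```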

### Align date format
Swiss data uses *yyyy-mm-dd* <br>
German data uses *dd.mm.yyyy* <br><br>
The Swiss scheme (ISO 8601) is used as it is more common.

In [209]:
def change_dateformat(date):
    d_m_y = date.split('.')
    y_m_d = '{}-{}-{}'.format(d_m_y[2], d_m_y[1], d_m_y[0])
    return y_m_d

de_data.date = de_data.date.apply(change_dateformat)
de_data.head()

Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,altitude,total_count,atlas_code,id_observer
0,29666944,119,Reiherente,2018-01-01,,53.15776,8.676993,place,0,24,,37718
1,29666945,141,Gänsesäger,2018-01-01,,53.15776,8.676993,place,0,1,,37718
2,29666946,24,Kormoran,2018-01-01,04:00,53.15776,8.676993,place,0,240,,37718
3,29666947,205,Blässhuhn,2018-01-01,,53.15776,8.676993,place,0,13,,37718
4,29666948,309,Ringeltaube,2018-01-01,,53.15776,8.676993,place,0,2,,37718
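
As an alternative to splitting the string manually, the conversion could also be done with `pd.to_datetime`. This is only a sketch on a toy series (`dates_de` is a hypothetical name), not the method used above; one difference is that it validates the dates and raises on malformed values unless `errors='coerce'` is passed.

```python
import pandas as pd

# Toy dates in the German dd.mm.yyyy format
dates_de = pd.Series(['01.01.2018', '24.03.2018'])

# Parse with an explicit format, then write back as yyyy-mm-dd strings
parsed = pd.to_datetime(dates_de, format='%d.%m.%Y')
iso = parsed.dt.strftime('%Y-%m-%d')
print(iso.tolist())  # ['2018-01-01', '2018-03-24']
```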


### Align precisions
According to the datafield description by @Johannes, the following mapping holds:
- *precise* = *Exakte Lokalisierung*
- *square* = *Kilometerquadrat*
- *place* = *Ort*<br><br>

The Swiss descriptions are replaced by their English counterparts.<br>
All other values in the German dataset are dropped as they have very few occurrences and are not present in the Swiss dataset or in the datafield description (see below).

In [210]:
# Before: Occurrences of all precisions in both datasets
print('\033[1m'+'German precision occurrences:\n', '\033[0m', de_data.groupby('precision').size())
print('\033[1m'+'\nSwiss precision occurrences:\n', '\033[0m', ch_data.groupby('precision').size())

German precision occurrences:
precision
municipality               6
place               10012712
polygone                   4
polygone_precise           8
precise             22903811
square               7186252
subplace                   2
transect_precise          53
dtype: int64

Swiss precision occurrences:
precision
Exakte Lokalisierung    4753055
Kilometerquadrat        3176584
Ort                     1993236
dtype: int64


In [211]:
# Replace Swiss precisions with English counterparts
precisions = {'Exakte Lokalisierung': 'precise', 
              'Kilometerquadrat': 'square', 
              'Ort': 'place'}
ch_data.precision = ch_data.precision.map(precisions)
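
Note that `Series.map` with a dict turns every value that is not a key into `NaN`. A minimal sketch of how unmapped values could be detected, assuming a toy series in which `'Unbekannt'` is a hypothetical value that does not occur in the real data:

```python
import pandas as pd

precisions = {'Exakte Lokalisierung': 'precise',
              'Kilometerquadrat': 'square',
              'Ort': 'place'}

# 'Unbekannt' is a made-up value to show what happens to unmapped entries
toy = pd.Series(['Ort', 'Kilometerquadrat', 'Unbekannt'])
mapped = toy.map(precisions)

# Values without a dict entry become NaN; list the originals to review them
unmapped = toy[mapped.isna()].unique()
print(mapped.tolist())
print(unmapped)
```

For the Swiss dataset this check would come back empty, since all three values shown in the counts above have an entry in `precisions`.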

In [212]:
# Drop all minority precisions in the German dataset (exact match instead of substring/regex match)
precisions_to_drop = ['municipality', 'polygone', 'polygone_precise', 'subplace', 'transect_precise']
de_data.drop(de_data[de_data.precision.isin(precisions_to_drop)].index, inplace=True)

In [213]:
# After: Aligned and cleaned precision occurrences
print('\033[1m'+'German precision occurrences:\n', '\033[0m', de_data.groupby('precision').size())
print('\033[1m'+'\nSwiss precision occurrences:\n', '\033[0m', ch_data.groupby('precision').size())

German precision occurrences:
precision
place      10012712
precise    22903811
square      7186252
dtype: int64

Swiss precision occurrences:
precision
place      1993236
precise    4753055
square     3176584
dtype: int64


### Align bird names
Many bird species have different names in the two datasets, while the species ID is usually the same.<br>
Based on the species ID, all Swiss bird names are replaced by their German counterparts. If a bird species is only present in the Swiss dataset, its name stays as-is.

#### Comparison of bird names in Germany and bird names in Switzerland

In [214]:
def highlight_rows(row):
    """Color a row red if both names exist but differ."""
    german = row.loc['German name']
    swiss = row.loc['Swiss name']
    differs = german != swiss and swiss != '-' and german != '-'
    style = 'color: red' if differs else ''
    return [style for _ in row]

In [215]:
# Create dicts with species_id as keys and species_names as values {<species_id>: <species_name>}
german_species_map = dict(zip(de_data.id_species, de_data.name_species))
swiss_species_map = dict(zip(ch_data.id_species, ch_data.name_species))

# Create side-by-side comparison view
species_name_comparison = pd.DataFrame({'German name': pd.Series(german_species_map).sort_index(),
                                        'Swiss name': pd.Series(swiss_species_map).sort_index()}).fillna('-')
species_name_comparison.style.apply(highlight_rows, axis=1)

Unnamed: 0,German name,Swiss name
1.0,Sterntaucher,Sterntaucher
2.0,Prachttaucher,Prachttaucher
3.0,Eistaucher,Eistaucher
4.0,Gelbschnabeltaucher,Gelbschnabeltaucher
5.0,Zwergtaucher,Zwergtaucher
6.0,Ohrentaucher,Ohrentaucher
7.0,Schwarzhalstaucher,Schwarzhalstaucher
8.0,Haubentaucher,Haubentaucher
9.0,Rothalstaucher,Rothalstaucher
10.0,Eissturmvogel,-


#### Problem: Not all German IDs match the Swiss IDs
For example, ID 314 is *Schleiereule* in the German dataset but *Kuckuck* in the Swiss dataset.

In [216]:
display(de_data[de_data.id_species==314].head())
display(ch_data[ch_data.id_species==314].head())

Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,altitude,total_count,atlas_code,id_observer
5061,29652006,314,Schleiereule,2018-01-01,,52.111929,6.900061,precise,0,1,,94465
8623,29774039,314,Schleiereule,2018-01-01,,54.051124,9.844217,precise,0,1,,119330
23253,29648892,314,Schleiereule,2018-01-01,,51.676494,6.042677,precise,0,1,,66634
28557,36335001,314,Schleiereule,2019-01-01,,49.928015,10.240481,square,0,1,,38875
29163,36465019,314,Schleiereule,2019-01-01,,48.11989,10.507153,square,0,0,,43461


Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,altitude,total_count,atlas_code,id_observer
738,15563770,314,Kuckuck,2018-04-16,,46.396284,6.90206,place,370,1,1,725
747,15568002,314,Kuckuck,2018-04-17,,46.981906,7.050892,square,430,1,3,9066
819,15585844,314,Kuckuck,2018-04-19,,47.58241,8.242828,square,320,2,3,2778
901,15596605,314,Kuckuck,2018-04-21,,47.630094,7.565012,square,270,1,1,14827
1043,15776702,314,Kuckuck,2018-05-14,,46.504846,8.943449,square,830,1,3,11944
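
To find all such cases rather than a single ID, the unique (ID, name) pairs of both datasets could be joined on the ID. The following is a sketch on toy frames; only the ID 314 pair is taken from the output above, the other rows and the names `de_toy`, `ch_toy` and `conflicts` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for de_data and ch_data (only ID 314 reflects the real data)
de_toy = pd.DataFrame({'id_species': [314, 314, 119],
                       'name_species': ['Schleiereule', 'Schleiereule', 'Reiherente']})
ch_toy = pd.DataFrame({'id_species': [314, 119, 371],
                       'name_species': ['Kuckuck', 'Reiherente', 'Blaumeise']})

# One row per (id, name) pair in each dataset
de_names = de_toy.drop_duplicates(['id_species', 'name_species'])
ch_names = ch_toy.drop_duplicates(['id_species', 'name_species'])

# Inner join on the ID keeps only species present in both datasets
both = de_names.merge(ch_names, on='id_species', suffixes=('_de', '_ch'))
conflicts = both[both.name_species_de != both.name_species_ch]
print(conflicts)
```

A differing name does not necessarily mean a different species, as the two countries may simply use different names, so the resulting list would still need a manual review.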


#### Replace Swiss bird names with German bird names

In [217]:
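# In Swiss dataset: if a different bird name is used for the same species ID, replace it with the respective German bird name
# Caveat: IDs that refer to different species in both datasets (e.g. 314) are renamed as well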
german_species_map = dict(zip(de_data.id_species, de_data.name_species))
ch_data.name_species = ch_data.id_species.map(german_species_map).fillna(ch_data.name_species)
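
One possible way to handle the mismatching IDs is to exclude them from the mapping, so that the Swiss name is kept for those species. This is only a sketch under assumptions: `ids_to_skip` is a hypothetical, manually reviewed set, and `'Placeholder name'` is a made-up Swiss name for ID 119.

```python
import pandas as pd

german_species_map = {119: 'Reiherente', 314: 'Schleiereule'}

# Toy stand-in for ch_data; 'Placeholder name' is invented for illustration
ch_toy = pd.DataFrame({'id_species': [119, 314, 371],
                       'name_species': ['Placeholder name', 'Kuckuck', 'Blaumeise']})

# Hypothetical set of IDs known to refer to different species in both datasets
ids_to_skip = {314}
safe_map = {k: v for k, v in german_species_map.items() if k not in ids_to_skip}

# IDs without an entry in safe_map become NaN and fall back to the Swiss name
ch_toy['name_species'] = ch_toy.id_species.map(safe_map).fillna(ch_toy.name_species)
print(ch_toy.name_species.tolist())  # ['Reiherente', 'Kuckuck', 'Blaumeise']
```

This keeps the names correct, but the shared `id_species` value would still be ambiguous in the merged dataset.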

### Merge datasets

In [223]:
ch_data['country'] = 'ch'
de_data['country'] = 'de'

In [231]:
master_data = pd.concat([de_data, ch_data])
master_data.to_csv(data_path_master)
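
Both source frames carry their own row index, so a plain `pd.concat` yields duplicate index labels, and `to_csv` writes that index as an extra unnamed column. A minimal sketch of how this could be avoided, using toy frames and an in-memory buffer instead of a real file path (all names here are hypothetical):

```python
import io
import pandas as pd

# Toy stand-ins for de_data and ch_data
de_toy = pd.DataFrame({'id_sighting': [29666944, 29666945], 'country': 'de'})
ch_toy = pd.DataFrame({'id_sighting': [14731644], 'country': 'ch'})

# Plain concat keeps the original labels, so 0 appears twice
stacked = pd.concat([de_toy, ch_toy])
print(stacked.index.tolist())  # [0, 1, 0]

# ignore_index=True creates a fresh, unique index
merged = pd.concat([de_toy, ch_toy], ignore_index=True)
print(merged.index.tolist())  # [0, 1, 2]

# index=False keeps the index out of the CSV file
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)  # id_sighting,country
```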

In [232]:
master_data.head()

Unnamed: 0,id_sighting,id_species,name_species,date,timing,coord_lat,coord_lon,precision,altitude,total_count,atlas_code,id_observer,country
0,29666944,119,Reiherente,2018-01-01,,53.15776,8.676993,place,0,24,,37718,de
1,29666945,141,Gänsesäger,2018-01-01,,53.15776,8.676993,place,0,1,,37718,de
2,29666946,24,Kormoran,2018-01-01,04:00,53.15776,8.676993,place,0,240,,37718,de
3,29666947,205,Blässhuhn,2018-01-01,,53.15776,8.676993,place,0,13,,37718,de
4,29666948,309,Ringeltaube,2018-01-01,,53.15776,8.676993,place,0,2,,37718,de


In [220]:
master_data.dtypes

id_sighting       int64
id_species        Int64
name_species     object
date             object
timing           object
coord_lat       float64
coord_lon       float64
precision        object
altitude          int64
total_count       Int64
atlas_code       object
id_observer       Int64
dtype: object