# BirdCLEF 2023 - Data Inspection

Here we do a first inspection of the competition data.

**All comments welcome!**

## Table of Contents
- [Config](#Config)
- [Submission](#Submission)
- [Training data](#Training-data)
- [Taxonomy](#Taxonomy)

In [None]:
import os
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
print("librosa:", librosa.__version__)

# Config

In [None]:
base_dir = "/kaggle/input/birdclef-2023/"
train_sound_dir = base_dir + "train_audio/"
sample_rate = 32_000

In [None]:
os.listdir(base_dir)

# Submission
- Task: predict the probability that each bird is present in every 5-second interval of a soundscape (the row suffix is the interval's end time).
- This is a multilabel task.
- There are 264 birds in total.
- Test soundscapes are 10 minutes long (only 3 sample submission rows are provided here).
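To make the interval convention concrete, here is a minimal sketch that enumerates one row identifier per 5-second window of a 10-minute soundscape. The helper `make_row_ids`, the `<stem>_<end_time>` pattern, and the file stem are assumptions for illustration; the actual identifiers should be checked against `sample_submission.csv`.

```python
def make_row_ids(stem, duration_s=600, window_s=5):
    """Build one row id per window; the suffix is the window's end time in seconds.

    The '<stem>_<end_time>' pattern is an assumption, not taken from the data.
    """
    end_times = range(window_s, duration_s + 1, window_s)
    return [f"{stem}_{end}" for end in end_times]


# "soundscape_12345" is a hypothetical file stem
row_ids = make_row_ids("soundscape_12345")
print(len(row_ids), row_ids[:3], row_ids[-1])
```

Under these assumptions a 10-minute file yields 120 rows, each of which needs 264 probabilities.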

In [None]:
path_submission = base_dir + "sample_submission.csv"
submission = pd.read_csv(path_submission)
submission

In [None]:
test_sound_dir = base_dir + "test_soundscapes/"
os.listdir(test_sound_dir)

In [None]:
path_test_file = test_sound_dir + os.listdir(test_sound_dir)[0]
print(path_test_file)
sig, sr = librosa.load(path_test_file, sr=sample_rate)
sig, sr

In [None]:
len(sig) / sample_rate / 60

# Training data
- 264 birds in total, the same as in the submission file.
- Very unbalanced training set: at most 500 files per bird, but some birds have only 1 file (possibly very rare species).
- Each training file has a single primary label, so the training set is not multilabel; this differs from the final prediction task.
- Scientific name and common name are 1-to-1 mappings of the primary label.
- Author, license, URL, and filename are self-explanatory.
- Latitude and longitude are self-explanatory, but some values are missing.
- Ratings range from 0.0 to 5.0; can they be used as a confidence measure?
- 13% of records have secondary labels; this might be another source of uncertainty to consider.
    - Unclear whether they contain duplicates.
- The stringified list columns can be cast directly using ``eval``.
- Type gives extra information (available for 90% of records), but there are too many different labels (370) and it is unclear how to utilize them.
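As a possible alternative to ``eval`` for the list columns, `ast.literal_eval` from the standard library only accepts Python literals and will not execute arbitrary expressions. The sketch below uses a made-up toy frame and a hypothetical helper `parse_list_column`; it is not the parsing used in this notebook.

```python
import ast

import pandas as pd


def parse_list_column(value):
    """Parse a stringified list and drop empty-string entries."""
    return [x for x in ast.literal_eval(value) if x]


# Toy frame mimicking stringified list columns (values are made up)
toy = pd.DataFrame({
    "type": ["['song', '']", "['call', 'song']", "['']"],
    "secondary_labels": ["[]", "['abc1']", "[]"],
})
for col in ["type", "secondary_labels"]:
    toy[col] = toy[col].apply(parse_list_column)
print(toy)
```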

In [None]:
train = pd.read_csv(base_dir + "train_metadata.csv")
for col in ["type", "secondary_labels"]:
    train[col] = train[col].apply(eval)
    # remove '' elements in list (for type)
    train[col] = train[col].apply(lambda l: [x for x in l if x])
#train["path_ogg"] = train_sound_dir + train["filename"]
train

In [None]:
labels = train["primary_label"].unique()
len(labels), labels[:10], set(submission.columns[1:]) == set(labels)

In [None]:
counts_labels = train["primary_label"].value_counts().to_frame("counts")
counts_labels.index.name = "primary_label"
counts_labels

In [None]:
counts_labels.plot(kind="line", marker="x", figsize=(12, 4))
plt.show()

In [None]:
display(counts_labels.head(20).T.style.set_caption("Head"))
display(counts_labels.tail(20).T.style.set_caption("Tail"))

In [None]:
print("1-to-1 primary label vs scientific name:", (train.groupby("primary_label")["scientific_name"].nunique() == 1).all())
print("1-to-1 primary label vs common name:", (train.groupby("primary_label")["common_name"].nunique() == 1).all())
print("duplicated URL:", train["url"].duplicated().any())
print("duplicated filename:", train["filename"].duplicated().any())

In [None]:
train.isna().sum()

In [None]:
sec_labels = train["secondary_labels"]
sum(sec_labels.str.len() > 0), sum(sec_labels.str.len() > 0) / len(sec_labels)

In [None]:
# duplicates?
sec_labels[sec_labels.apply(lambda x: len(set(x)) != len(x))]

In [None]:
print("all known secondary labels:", sec_labels.apply(lambda x: set(x).issubset(labels)).all())

In [None]:
sec_labels.apply(lambda x: set(x)).str.len().value_counts().sort_index()
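Since the prediction task is multilabel while each training file has one primary label plus optional secondary labels, one option is to encode both into a single target vector. This is only a sketch: the helper `multi_hot`, the `secondary_weight` of 0.5, and the class names are assumptions, not something derived from the data.

```python
import numpy as np


def multi_hot(primary, secondary, classes, secondary_weight=0.5):
    """Encode one primary and several secondary labels as a soft multi-hot vector."""
    index = {c: i for i, c in enumerate(classes)}
    target = np.zeros(len(classes), dtype=np.float32)
    for label in secondary:
        # Skip secondary labels that are not among the target classes
        if label in index:
            target[index[label]] = secondary_weight
    # The primary label always gets full weight
    target[index[primary]] = 1.0
    return target


classes = ["aaa", "bbb", "ccc"]  # made-up class names
target = multi_hot("bbb", ["ccc", "zzz"], classes)
print(target)
```

Whether secondary labels deserve a lower weight than the primary label is a modelling choice that would need validation.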

In [None]:
song_types = train["type"]
sum(song_types.str.len() > 0), sum(song_types.str.len() > 0) / len(song_types)

In [None]:
# duplicates?
song_types[song_types.apply(lambda x: len(set(x)) != len(x))]

In [None]:
c = collections.Counter()
for v in song_types:
    c.update([x.strip().lower() for x in v])
len(c), c.most_common(20)

# Taxonomy
- Taxonomy includes many entries.
- Of the 264 relevant birds, only one is missing, ``gnbcam2`` (if the lookup is done on the scientific name).
- Still unclear how to use this in modelling; predict at a higher aggregation level?
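One way to explore prediction at a higher aggregation level is to collapse species-level scores into family-level scores. The sketch below takes the maximum over the species of each family; the probabilities, species codes, and family names are made up, and using the maximum is just one possible aggregation rule.

```python
import pandas as pd

# Made-up probabilities: rows are 5-second windows, columns are species labels
probs = pd.DataFrame({
    "sp_a": [0.1, 0.8],
    "sp_b": [0.3, 0.2],
    "sp_c": [0.5, 0.0],
})

# Hypothetical species-to-family lookup (could be built from the taxonomy table)
species_to_family = {"sp_a": "fam_1", "sp_b": "fam_1", "sp_c": "fam_2"}

# Group the species (as rows after transposing) by family and take the max score
family_probs = probs.T.groupby(species_to_family).max().T
print(family_probs)
```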

In [None]:
path_taxonomy = base_dir + "eBird_Taxonomy_v2021.csv"
taxonomy = pd.read_csv(path_taxonomy)
taxonomy

In [None]:
taxonomy["label"] = taxonomy["SCI_NAME"].map(train.groupby("scientific_name")["primary_label"].first())
taxonomy = taxonomy.dropna(subset=["label"])
taxonomy

In [None]:
print("missing:", set(labels) - set(taxonomy["label"]))

In [None]:
counts_family = taxonomy["FAMILY"].value_counts().to_frame("counts")
counts_family.index.name = "family_short"
counts_family.index = counts_family.index.str.split("(", n=0).str[0]
# `~` on a Python bool is always truthy (-1 or -2), so use `not` instead
assert not counts_family.index.has_duplicates, "duplicates in shortened index"
counts_family

In [None]:
counts_family.plot(kind='bar', rot=90, figsize=(14, 4))
plt.show()
