# Does the Common Swift's tune behave in a Zipf-like pattern? 

## A vague concept and analysis

## The Common Swift

The common swift (_Apus apus_), is a medium-sized bird resembling the barn swallow and house martin but larger, belonging to the order Apodiformes. These resemblances result from convergent evolution, as their closest relatives are New World hummingbirds and Southeast Asian treeswifts. Its scientific name "Apus" is derived from Latin, meaning swift, reflecting the ancient belief that swifts had no feet. Swifts have short legs used for clinging to vertical surfaces, and they avoid settling on the ground to minimize vulnerability, with non-breeding individuals often in continuous flight for up to ten months.

![Swift Pretty](images/swift-flying.webp)

### Taxonomy and Physiology

The common swift, was classified by Carl Linnaeus in 1758 as _Hirundo apus_ and later categorized under the _genus Apus_ by Giovanni Antonio Scopoli in 1777. Its name "apus" is derived from the Latin word for swift, reflecting the belief that these birds were footless swallows. These birds measure 16–17 cm in length with a wingspan of 38–40 cm. They are primarily blackish-brown, with a small white or pale grey chin patch that is not easily visible from a distance.

![Swift Anatomy](images/commonswift-topography.jpg)


### Language Behavior
Common swifts are known for their distinctive loud screaming calls, often forming groups during summer evenings. These gatherings may serve various purposes, including ascending to sleep in flight, with radar tracking revealing flocking behavior during specific times, possibly for social interaction or information exchange. 

### Final Motivations
As the name would led you to believe, the Common Swift is a fairly popular recorded species, specifically in the Xenocanto bird dataset there are thousands of high quality recordings that can be obtained through a wrapped API (see _example_download.py_) therefore this is a good starter specie to test if Zipf's law is present after some preprocessing.

The main focus of this work will be to explore a novel way to synthetize bird sounds into a codified language and thus allowing possible languange analysis in the future.

### Xenocanto Dataset 

_To download the dataset using the wrapped API (pypi xenocanto) run example_download.py_


#### Check Metadata

In [1]:
# check metadata files in the dataset
import os 
import json


bird = "CommonSwift"

# check a metadata file's content
path = f"./dataset/metadata/{bird}/"
metalist = os.listdir(path)

# open and read the JSON file
chosen_meta = metalist[0]
with open(f"./dataset/metadata/CommonSwift/{chosen_meta}", 'r') as json_file:
    metadata = json.load(json_file)

#print(metadata)


So each metadata file has a ton of recordings with fields that _could_ be interesting if data's rich enough, let's visualize
how rich is our data by merging both metadata files to a pandas dataframe and getting some statistics.

In [2]:
# uses: metalist from cell 9 and imports
# merge metadata files into one and then dataframe it
import pandas as pd


records = []

for meta in metalist:
    # open each metadata file and load dict
    with open(f"./dataset/metadata/CommonSwift/{meta}", 'r') as json_file:
        metadata = json.load(json_file)
        # we need only the recordings from json
        sub_records = metadata['recordings']
        for record in sub_records:
            records.append(record)

# now we create the dataframe
df = pd.DataFrame(records)
df.sample(5)

Unnamed: 0,id,gen,sp,ssp,group,en,rec,cnt,loc,lat,...,rmk,bird-seen,animal-seen,playback-used,temp,regnr,auto,dvc,mic,smp
314,663446,Apus,apus,,birds,Common Swift,Michael John O Mahony,Ireland,"Ireland (near Buttevant), Cork, County Cork",52.2341,...,,yes,yes,unknown,,,no,,,44100
187,813246,Apus,apus,,birds,Common Swift,Peter Boesman,Kazakhstan,Sharyn canyon area,43.4936,...,,unknown,unknown,unknown,,,no,Olympus LS-05,Telinga Pro-X,48000
492,736481,Apus,apus,,birds,Common Swift,Cedric Mroczko,France,"Cordes-sur-Ciel, Tarn, Occitanie, France",44.0632,...,High-pass filtered. ls-5 + stereo esm aom5024,no,no,no,,,no,,,44100
655,418453,Apus,apus,,birds,Common Swift,Stanislas Wroza,France,"Cap Leucate, Leucate, Langeudoc-Rousillon",42.913,...,,yes,yes,no,,,no,,,44100
44,723217,Apus,apus,,birds,Common Swift,Ad Hilders,Portugal,"São Nicolau (near Lisboa), Lisbon, Lisboa",38.712,...,,no,no,no,,,no,,,44100


Clearly, there are some useless columns so let's ignore those. Some columns that could be interesting may not be rich enough.

TODO: Check later if we can separate by male and female.

In [24]:
# clean a bit
uselful_cols = ['id', 'cnt', 'lat', 'lng', 'type', 'q', 'length', 'time', 'date', 'bird-seen', 'smp']

df[uselful_cols].sample(10)

Unnamed: 0,id,cnt,lat,lng,type,q,length,time,date,bird-seen,smp
108,500246,United Kingdom,53.9299,-2.9833,song,C,0:04,08:44,2016-05-07,no,22050
243,817639,Portugal,41.5176,-7.7929,flight call,A,0:04,06:30,2023-07-19,no,44100
281,733547,Poland,51.6578,19.3281,"call, flight call",A,0:19,21:00,2022-06-22,yes,44100
49,663911,Netherlands,51.5121,4.4359,call,C,0:10,07:18,2021-07-18,yes,44100
582,570979,Poland,50.1191,18.9721,flight call,B,0:07,05:37,2020-06-23,no,48000
127,380253,Switzerland,46.9377,7.4678,flight call,C,0:31,21:00,2017-07-03,yes,48000
139,376874,France,42.9922,6.1862,flight call,C,0:31,20:00,2017-06-18,yes,44100
620,486195,Portugal,37.1866,-7.4373,flight call,B,0:17,11:45,2019-06-28,yes,48000
297,724477,France,48.2016,-2.9861,flight call,A,0:38,10:30,2022-05-15,unknown,48000
326,657329,Netherlands,51.9892,4.4694,"call, flight call",A,0:04,05:51,2021-06-17,no,44100


In [6]:
# check statistics
import plotly.express as px

px.histogram(df[uselful_cols], x='cnt', template='plotly_dark', title='Recordings by Country')

In [19]:
px.histogram(df[uselful_cols], x='q', template='plotly_dark', category_orders=dict(q=['A', 'B', 'C', 'D']),
             title='Recordings by Quality', histnorm='percent')

In [28]:
# clean by seconds
def time_to_seconds(time_str):
    minutes, seconds = map(int, time_str.split(':'))
    return minutes * 60 + seconds

df['seconds'] = df['length'].apply(time_to_seconds)
px.histogram(df[df.seconds <= 120], x='seconds', cumulative=True,
             template='plotly_dark', title='Recordings by Duration [s]', histnorm='percent')

So if we use recordings that are 20 seconds or less we have 60% of data, and of at least C quality we have 80% of data. If these values are not correlated in some way we should expect to have around 48% of data as a available to process (being picky).

We also take only sample rates of 44khz because it makes our life easier.

In [37]:
# create a sub dataset with the picky data
dataset = df[uselful_cols + ['seconds']][(df.seconds <= 20) & ((df.q != 'D') | (df.q != 'E')) & (df.smp == '44100')].copy().reset_index(drop=True)
dataset

Unnamed: 0,id,cnt,lat,lng,type,q,length,time,date,bird-seen,smp,seconds
0,184143,Switzerland,46.9509,7.4339,flight call,B,0:02,11:30,2014-06-28,yes,44100,2
1,182340,Germany,,,flight call,B,0:09,?,2014-06-16,yes,44100,9
2,179334,Spain,37.3811,-5.947,flight call,B,0:18,?,2014-05-12,yes,44100,18
3,144363,Israel,31.7833,35.2167,call,B,0:07,11:00,2011-04-14,no,44100,7
4,140595,Spain,42.0029,-5.6742,flight call,B,0:12,08:30,2013-06-25,yes,44100,12
...,...,...,...,...,...,...,...,...,...,...,...,...
300,199124,Russian Federation,56.0818,47.2859,flight call,B,0:18,19:00,2014-05-29,yes,44100,18
301,198310,Russian Federation,56.1545,47.2399,call,B,0:11,09:30,2014-06-01,yes,44100,11
302,198309,Russian Federation,56.0814,47.3027,call,B,0:19,09:30,2014-06-01,yes,44100,19
303,198308,Russian Federation,56.0814,47.3027,call,B,0:12,09:30,2014-06-01,yes,44100,12


### Frequency Spectrum of recordings

We will observe some frequency spectra first to have an idea of the data