## 1.0 Explore raw data and prepare songs for analysis

> **Note:** This notebook processes the recordings of a single year at a time.

### This notebook does the following:
 - Segments raw recordings into manually defined songs
 - Applies a band-pass filter to reject frequencies outside a given range

**Previous steps required:**
 - Copy full-length recordings
 - Segment songs with AviaNZ
 - Segment into syllables with chipper

In [36]:
# Reload modules automatically
# to update edited src code
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [45]:
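import datetime as dt  # used below to timestamp the segmentation run
import glob
import os  # needed for os.path.getctime
from os import fspath
from pathlib import Path

import numpy as np
import pandas as pd
import src

from src.read.paths import DATA_DIR, RESOURCES_DIR
from src.audio.segmentation import *
from IPython.display import display, HTML, display_html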

In [46]:
# import recorded nestboxes
files_path = DATA_DIR / "raw" / "2020"
filelist = np.sort(list(files_path.glob("**/*.WAV")))
recorded_nestboxes = pd.DataFrame(set([file.parent.name for file in filelist]))

# import the latest brood data downloaded from https://ebmp.zoo.ox.ac.uk/broods
brood_data_path = RESOURCES_DIR / "brood_data" / "2020"
list_of_files = glob.glob(fspath(brood_data_path) + "/*.csv")
latest_file = max(list_of_files, key=os.path.getctime)
greti_nestboxes = pd.DataFrame(
    (pd.read_csv(latest_file).query('Species == "g"').filter(["Pnum"]))["Pnum"].str[5:]
)
# get those in both lists
recorded_gretis = [
    i
    for i in recorded_nestboxes.values.tolist()
    if i in greti_nestboxes.values.tolist()
]

print("You recorded a total of " + str(len(filelist)) + " hours of audio.")
print(
    "You recorded "
    + str(len(recorded_gretis))
    + " out of a total of "
    + str(len(greti_nestboxes))
    + " great tits that bred this year"
)

You recorded a total of 6811 hours of audio.
You recorded 240 out of a total of 260 great tits that bred this year
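
The overlap between recorded nestboxes and great tit nestboxes is, in effect, a set intersection. Here is a minimal sketch of that idea, using hypothetical nestbox names that are not taken from the real brood data:

```python
# Hypothetical nestbox names, for illustration only
recorded = ["B12", "C4", "EX7", "W90"]  # nestboxes with recordings
bred = ["C4", "W90", "SW3"]             # nestboxes where great tits bred

# Nestboxes present in both lists, sorted for stable output
recorded_and_bred = sorted(set(recorded) & set(bred))

print(f"You recorded {len(recorded_and_bred)} out of {len(bred)} nestboxes")
```

Using sets avoids comparing nested lists element by element, which matters little at this scale but keeps the intent explicit.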


### Segment raw recordings into bouts
 - Songs are manually selected in AviaNZ, for now

> `batch_segment_bouts()` uses multiprocessing. If you run into problems, use `batch_segment_bouts_single()` (much slower).

In [50]:
DATA_DIR = Path('/home/nilomr/projects/0.0_great-tit-song/test') # remove

origin = DATA_DIR / "raw" / "2020" # Folder to segment
DT_ID = dt.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
dataset = "GRETI_HQ"

batch_segment_songs(origin, DATA_DIR, DT_ID, subset=dataset)


{Reading, trimming and saving song bouts}: 0it [00:00, ?it/s]
Complete
total time (s)= 0.11518740653991699

{Reading, trimming and saving song bouts}: 100%|██████████| 31/31 [00:00<00:00, 21484.37it/s]
Complete
total time (s)= 56.98673605918884
{Reading, trimming and saving song bouts}: 100%|██████████| 41/41 [00:00<00:00, 12032.36it/s]
{Reading, trimming and saving song bouts}: 0it [00:00, ?it/s]
{Reading, trimming and saving song bouts}:   0%|          | 0/31 [00:00<?, ?it/s]
Complete
total time (s)= 51.760085344314575
         10202 function calls (10198 primitive calls) in 108.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      417  108.816    0.261  108.816    0.261 {method 'acquire' of '_thread.lock' objects}
       18    0.076    0.004    0.076    0.004 {built-in method posix.fork}
      120    0.039    0.000    0.039    0.000 socket.py:342(send)
       63    0.021    0.000    0.021    0.000 {built-in method posi

- Let's check how many songs have been exported:

In [64]:
from avgn.utils.paths import most_recent_subdirectory

all_songs_path = most_recent_subdirectory(DATA_DIR / 'processed' / dataset)

all_songs_list = np.sort(list(all_songs_path.glob('**/*.wav')))

print('There are ' + str(len(all_songs_list)) + ' song bouts')


There are 811 song bouts


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from avgn.utils.paths import most_recent_subdirectory

all_songs_path = most_recent_subdirectory(DATA_DIR / 'processed' / dataset)
all_songs_list = np.sort(list(all_songs_path.glob('**/*.wav')))

# Assumes the exported bouts live in `all_songs_path`, one subfolder per nestbox
print(
    "There are",
    len(all_songs_list),
    "song bouts in",
    len(list(all_songs_path.glob("*"))),
    "folders\n",
)

# Number of files (song bouts) in each nestbox folder
nbouts = {}
for folder in all_songs_path.glob('*'):
    nbouts[folder.name] = len(list(folder.glob("*")))

sns.set(color_codes=True)
sns.set_style("whitegrid", {'axes.grid' : False})
plt.figure(figsize=(16, 6))
# histplot needs seaborn >= 0.11; it replaces the deprecated distplot
sns.histplot(list(nbouts.values()), bins=50, color="#995c00")
sns.rugplot(list(nbouts.values()), color="#995c00")
plt.xlabel("\nNumber of song bouts")
plt.ylabel("Frequency\n")
plt.title("Song bouts per nestbox")
sns.despine(right=True)


Now, let's check the distribution of maximum amplitude per song and decide on an appropriate cutoff.
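
One possible way to inspect peak amplitude is to express it in dB relative to full scale and compare it against a threshold. The sketch below uses synthetic tones instead of the real song files. The `peak_dbfs` helper and the `cutoff_db` value are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np


def peak_dbfs(samples, full_scale=1.0):
    """Peak amplitude in dB relative to full scale (hypothetical helper)."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return -np.inf  # digital silence
    return 20 * np.log10(peak / full_scale)


# Synthetic stand-ins for song bouts: one loud tone and one quiet tone
t = np.arange(8000) / 8000
songs = {
    "loud": 0.9 * np.sin(2 * np.pi * 440 * t),
    "quiet": 0.05 * np.sin(2 * np.pi * 440 * t),
}

peaks = {name: peak_dbfs(samples) for name, samples in songs.items()}

cutoff_db = -20.0  # assumed threshold, to be tuned on the real distribution
kept = [name for name, peak in peaks.items() if peak >= cutoff_db]
print(kept)
```

With real data, a histogram of the `peaks` values would show where a sensible cutoff lies.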

In [7]:
import pydub  # required by AudioSegment below

test_audio = pydub.AudioSegment.from_wav("/home/nilomr/Music/trimmed.wav")
test_audio_bandpass = band_pass_filter(test_audio, 2500, 9000, order = 12)
test_audio_bandpass.export("/home/nilomr/Music/bandpassed.wav", format = "wav")
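
For reference, a band-pass filter of this kind can be sketched with SciPy. This is not necessarily how `band_pass_filter` is implemented: it assumes a Butterworth design, works on NumPy arrays instead of `pydub.AudioSegment` objects, and the `butter_bandpass` name is hypothetical. The test signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def butter_bandpass(samples, sample_rate, low_hz, high_hz, order=12):
    """Zero-phase Butterworth band-pass filter (hypothetical helper)."""
    # Second-order sections stay numerically stable at high filter orders
    sos = butter(
        order, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos"
    )
    return sosfiltfilt(sos, samples)


# One second of audio with a 1 kHz tone (out of band) and a 5 kHz tone (in band)
sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 5000 * t)

filtered = butter_bandpass(signal, sample_rate, 2500, 9000, order=12)


def tone_level(samples, freq_hz):
    """Spectrum magnitude at freq_hz; bins are 1 Hz apart for a 1 s signal."""
    return np.abs(np.fft.rfft(samples))[freq_hz]
```

Comparing `tone_level(signal, 1000)` with `tone_level(filtered, 1000)` shows how strongly the out-of-band tone is attenuated, while the 5 kHz tone should pass almost unchanged.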