## Regenerate Train/Val/Test Sets for `PATH_DATASET/train_soundscapes/`
The first approach to generating the data (the notebook `01-data_exploration.ipynb`) no longer meets our needs,
so we regenerate the dataset here. This time we

- do not exclude the validation set
- save the `.npy` files into `./train_npy/` and `./val_npy/`

This notebook is used together with

- `utils.py`
- `soundscape_to_npy.py`

## `train_soundscapes/`

In [1]:
from soundscape_to_npy import *

In [2]:
df_train_soundscape.head()

Unnamed: 0,row_id,site,audio_id,seconds,birds,n_birds,year,month,day,longitude,latitude
0,7019_COR_5,COR,7019,5,nocall,0,2019,9,4,-84.51,10.12
1,7019_COR_10,COR,7019,10,nocall,0,2019,9,4,-84.51,10.12
2,7019_COR_15,COR,7019,15,nocall,0,2019,9,4,-84.51,10.12
3,7019_COR_20,COR,7019,20,nocall,0,2019,9,4,-84.51,10.12
4,7019_COR_25,COR,7019,25,nocall,0,2019,9,4,-84.51,10.12


### Cyclic Data
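The helpers `cyclicize_number` and `cyclicize_series` encode a periodic quantity (longitude, month, etc.) as a point `(x, y)` on the unit circle, so that values at the two ends of the range end up next to each other.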
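The source of these helpers is not shown in this notebook. The sketch below is only a guess at their behavior, chosen because it reproduces the outputs printed in the cells of this section: the value is turned into an angle `theta = 2 * pi * number / (max_ - min_)` and returned as `(cos(theta), sin(theta))`. The argument order `(number, max_, min_)` and the use of `math` instead of NumPy are assumptions.

```python
import math


def cyclicize_number(number, max_, min_):
    """Map a periodic value onto the unit circle.

    Assumed behavior, inferred from the notebook outputs (not the real source).
    """
    period = max_ - min_
    theta = 2 * math.pi * (number / period)
    return math.cos(theta), math.sin(theta)


def cyclicize_series(values, max_, min_):
    """Apply cyclicize_number to every value; returns a list of (x, y) tuples."""
    return [cyclicize_number(v, max_, min_) for v in values]


x, y = cyclicize_number(-84.51, 180, -180)  # longitude of site COR
print(round(x, 6), round(y, 6))  # 0.095672 -0.995413
```

Under this reading, a call such as `cyclicize_number(1, 0, 24)` passes `0` as `max_` and `24` as `min_`, so the period is negative and the sine comes out negative, which would be consistent with the output shown for that cell.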

In [3]:
D_location_coordinate

{'COR': Coordinate(longitude=-84.51, latitude=10.12),
 'SSW': Coordinate(longitude=-76.45, latitude=42.47),
 'SNE': Coordinate(longitude=-119.95, latitude=38.49),
 'COL': Coordinate(longitude=-75.85, latitude=5.57)}

In [4]:
cyclicize_series(df_train_soundscape["longitude"], 180, -180)

[(0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 (0.09567202165105818, -0.9954129114458982),
 ... (output truncated)

In [5]:
[cyclicize_number(coord.longitude, 180, -180) for coord in D_location_coordinate.values()]

[(0.09567202165105818, -0.9954129114458982),
 (0.23429382769171864, -0.9721658306613966),
 (-0.4992440599749497, -0.8664614062840472),
 (0.24446129191636662, -0.9696590518087175)]

It would be more efficient to convert only these four longitudes to cyclic form once and reuse the results, instead of converting the same value again for every row of the dataframe. I'll leave that optimization for later.
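One possible shape for that optimization, as a minimal sketch: convert each site's longitude once, store the result in a dictionary, and look it up per row. The helper `to_unit_circle` and the names `SITE_LONGITUDE` and `SITE_CYCLIC` are hypothetical; the longitudes are copied from `D_location_coordinate`.

```python
import math

# Longitudes copied from D_location_coordinate.
SITE_LONGITUDE = {"COR": -84.51, "SSW": -76.45, "SNE": -119.95, "COL": -75.85}


def to_unit_circle(value, period):
    """Hypothetical stand-in for cyclicize_number: (cos, sin) of the scaled angle."""
    theta = 2 * math.pi * value / period
    return math.cos(theta), math.sin(theta)


# Convert each of the four sites exactly once ...
SITE_CYCLIC = {site: to_unit_circle(lon, 360) for site, lon in SITE_LONGITUDE.items()}

# ... then look the result up per row instead of recomputing it.
sites = ["COR", "COR", "SSW"]  # stand-in for the "site" column
rows = [SITE_CYCLIC[site] for site in sites]
print(rows[0] == rows[1])  # True
```

With pandas, the same lookup could probably be written with `Series.map` on the `site` column, which accepts a dictionary.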

In [6]:
cyclicize_number(24, 0, 24)

(1.0, 2.4492935982947064e-16)

In [7]:
cyclicize_number(1, 0, 24)

(0.9659258262890683, -0.25881904510252074)

In [8]:
cyclicize_number(0, 0, 24)

(1.0, -0.0)

In [9]:
cyclicize_number(180, 180, -180)

(-1.0, 1.2246467991473532e-16)

In [10]:
cyclicize_number(-180, 180, -180)

(-1.0, -1.2246467991473532e-16)

Note that the two points above, for longitudes `180` and `-180`, are close but not identical. According to the experiments in the following cells, this seems to come from floating-point error: `np.pi` is only an approximation of π, so `np.sin(2*np.pi)` **is not exactly equal to** `np.sin(0)`.

The two candidate formulas

- `theta = 2 * np.pi * (number / period)`
- `theta = 2 * np.pi * ((number - min_) / period)`

seem to make little difference.

In [11]:
for i in range(1, 100):
    if i / i != 1:
        print(f"{i}")

In [12]:
np.cos(0) - np.cos(2*np.pi)

0.0

In [13]:
np.sin(0) - np.sin(2*np.pi)

2.4492935982947064e-16
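Since the discrepancy is on the order of `1e-16`, two encoded points are better compared with a tolerance than with `==`. A minimal sketch using only the standard library; the helper name `same_point` and the tolerance `1e-12` are arbitrary choices for illustration.

```python
import math


def same_point(p, q, abs_tol=1e-12):
    """Compare two (x, y) points coordinate by coordinate, within abs_tol."""
    return all(math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(p, q))


east = (math.cos(math.pi), math.sin(math.pi))    # longitude 180
west = (math.cos(-math.pi), math.sin(-math.pi))  # longitude -180

print(east == west)            # False
print(same_point(east, west))  # True
```

An absolute tolerance is needed here because the coordinates being compared are near zero, where a purely relative tolerance never matches.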

In [14]:
df_train_soundscape.columns

Index(['row_id', 'site', 'audio_id', 'seconds', 'birds', 'n_birds', 'year',
       'month', 'day', 'longitude', 'latitude'],
      dtype='object')

In [15]:
df_train_soundscape[["longitude_x", "longitude_y"]] = \
    cyclicize_series(df_train_soundscape["longitude"], 180, -180)

In [16]:
df_train_soundscape.columns

Index(['row_id', 'site', 'audio_id', 'seconds', 'birds', 'n_birds', 'year',
       'month', 'day', 'longitude', 'latitude', 'longitude_x', 'longitude_y'],
      dtype='object')

In [18]:
df_train_soundscape[["longitude", "longitude_x", "longitude_y"]]

Unnamed: 0,longitude,longitude_x,longitude_y
0,-84.51,0.095672,-0.995413
1,-84.51,0.095672,-0.995413
2,-84.51,0.095672,-0.995413
3,-84.51,0.095672,-0.995413
4,-84.51,0.095672,-0.995413
...,...,...,...
2395,-76.45,0.234294,-0.972166
2396,-76.45,0.234294,-0.972166
2397,-76.45,0.234294,-0.972166
2398,-76.45,0.234294,-0.972166


In [19]:
df_train_soundscape

Unnamed: 0,row_id,site,audio_id,seconds,birds,n_birds,year,month,day,longitude,latitude,longitude_x,longitude_y
0,7019_COR_5,COR,7019,5,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413
1,7019_COR_10,COR,7019,10,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413
2,7019_COR_15,COR,7019,15,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413
3,7019_COR_20,COR,7019,20,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413
4,7019_COR_25,COR,7019,25,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2395,54955_SSW_580,SSW,54955,580,nocall,0,2017,6,17,-76.45,42.47,0.234294,-0.972166
2396,54955_SSW_585,SSW,54955,585,grycat,1,2017,6,17,-76.45,42.47,0.234294,-0.972166
2397,54955_SSW_590,SSW,54955,590,grycat,1,2017,6,17,-76.45,42.47,0.234294,-0.972166
2398,54955_SSW_595,SSW,54955,595,nocall,0,2017,6,17,-76.45,42.47,0.234294,-0.972166


In [20]:
df_train_soundscape["latitude"].max(), df_train_soundscape["latitude"].min()

(42.47, 10.12)

In [22]:
df_meta = pd.read_csv(PATH_DATASET / "train_metadata.csv")
df_meta["latitude"].max(), df_meta["latitude"].min()

(78.9281, -53.162)

In [23]:
(df_train_soundscape["latitude"] / 90)

0       0.112444
1       0.112444
2       0.112444
3       0.112444
4       0.112444
          ...   
2395    0.471889
2396    0.471889
2397    0.471889
2398    0.471889
2399    0.471889
Name: latitude, Length: 2400, dtype: float64

In [28]:
(df_train_soundscape["latitude"] / 90).unique()

array([0.11244444, 0.47188889])
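Latitude is not periodic (`90` and `-90` are opposite poles), so instead of a cyclic encoding it is simply scaled from `[-90, 90]` to `[-1, 1]`. The cell that creates the `latitude_normalized` column is not shown in this notebook; the sketch below assumes it amounts to a division by 90. The function name `normalize_latitude` is hypothetical.

```python
def normalize_latitude(latitude):
    """Scale a latitude in degrees from [-90, 90] to [-1, 1]."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    return latitude / 90


# The two latitudes present in df_train_soundscape.
site_latitude = {"COR": 10.12, "SSW": 42.47}
normalized = {site: normalize_latitude(lat) for site, lat in site_latitude.items()}
print({site: round(value, 6) for site, value in normalized.items()})
# {'COR': 0.112444, 'SSW': 0.471889}
```

On the dataframe itself this would presumably be a single column assignment of `df_train_soundscape["latitude"] / 90`.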

In [32]:
df_train_soundscape.head()

Unnamed: 0,row_id,site,audio_id,seconds,birds,n_birds,year,month,day,longitude,latitude,longitude_x,longitude_y,latitude_normalized
0,7019_COR_5,COR,7019,5,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413,0.112444
1,7019_COR_10,COR,7019,10,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413,0.112444
2,7019_COR_15,COR,7019,15,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413,0.112444
3,7019_COR_20,COR,7019,20,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413,0.112444
4,7019_COR_25,COR,7019,25,nocall,0,2019,9,4,-84.51,10.12,0.095672,-0.995413,0.112444


In [35]:
df_train_soundscape[["month_x", "month_y"]] = cyclicize_series(df_train_soundscape["month"], 12, 0)
df_train_soundscape[["month", "month_x", "month_y"]].value_counts()

month  month_x        month_y      
9      -1.836970e-16  -1.000000e+00    840
7      -8.660254e-01  -5.000000e-01    480
10      5.000000e-01  -8.660254e-01    360
4      -5.000000e-01   8.660254e-01    240
3       6.123234e-17   1.000000e+00    120
5      -8.660254e-01   5.000000e-01    120
6      -1.000000e+00   1.224647e-16    120
8      -5.000000e-01  -8.660254e-01    120
dtype: int64

### Train/Val/Test Split

<s>There are a total of `20` `.ogg` files in `train_soundscapes/`: I would like to split these into train/val/test sets.</s>

- <s>`12` files for train</s>
- <s>`4` files for val</s>
- <s>`4` files for test</s>

Unlike our first attempt, which split by file, here I would like to split by row using `StratifiedShuffleSplit` (from `sklearn`) on the column `birds` of `df_train_soundscape`.


In [None]:
df_train_soundscape["n_birds"].value_counts()
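To make the idea of stratification concrete, here is a dependency-free sketch. It is not scikit-learn's implementation and not the code used in this project; it only illustrates what a stratified shuffle split does: shuffle the rows of each label separately and send the same fraction of every label to the held-out part. The function name, the toy labels, and the 80/20 ratio are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each label's share roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy stand-in for the "birds" column.
labels = ["nocall"] * 80 + ["grycat"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                   # 80 20
print(sum(labels[i] == "grycat" for i in test_idx))    # 4
```

One caveat to keep in mind: a label value that occurs only once cannot appear in both parts. In this sketch it silently stays in the training part, whereas `StratifiedShuffleSplit` raises an error when a class has a single member, so rare `birds` values may need to be grouped or handled separately.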

### Objective 1: `.ogg` to `.npy`

#### `joblib` way

In [None]:
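# Names such as SR, MelSpecComputer, standardize_uint8, PATH_DATASET, Path, np,
# joblib, librosa, soundfile and tqdm are assumed to be provided by
# `from soundscape_to_npy import *` at the top of this notebook.
def audio_to_mels(audio,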
                  sr=SR,
                  n_mels=128,
                  fmin=0,
                  fmax=None):
    fmax = fmax or sr // 2
    mel_spec_computer = MelSpecComputer(sr=sr,
                                        n_mels=n_mels,
                                        fmin=fmin,
                                        fmax=fmax)
    mels = standardize_uint8(mel_spec_computer(audio))
    return mels

def every_5sec(id_,
               sr=SR,
               resample=True,
               res_type="kaiser_fast",
               single_process=True,
               save_to=Path("corbeille"),
               n_workers=2):
    """
    - Read the audio file with ID `id_`.
    - Cut the audio into pieces of 5 seconds.
    - Convert each piece to a mel spectrogram and save it as a `.npy` file.
    """
    path_ogg = next((PATH_DATASET / "train_soundscapes").glob(f"{id_}*.ogg"))
    location = (path_ogg.name).split("_")[1]
    whole_audio, orig_sr = soundfile.read(path_ogg, dtype="float32")
    if resample and orig_sr != sr:
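        # Keyword arguments: recent librosa releases no longer accept the sample rates positionally.
        whole_audio = librosa.resample(whole_audio, orig_sr=orig_sr, target_sr=sr, res_type=res_type)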
    n_samples = len(whole_audio)
    n_samples_5sec = sr * 5
    save_to.mkdir(exist_ok=True)

    def convert_and_save(i):
        audio_i = whole_audio[i:i + n_samples_5sec]
        mels_i = audio_to_mels(audio_i)
        path_i = save_to / f"{id_}_{location}_{((i + n_samples_5sec) // n_samples_5sec) * 5}.npy"
        np.save(str(path_i), mels_i)

    # The remainder is subtracted in range(), so every chunk is guaranteed to
    # contain exactly n_samples_5sec samples and no length check is needed.
    starts = range(0, n_samples - n_samples % n_samples_5sec, n_samples_5sec)
    if single_process:
        for i in starts:
            convert_and_save(i)
    else:
        pool = joblib.Parallel(n_jobs=n_workers)
        mapping = joblib.delayed(convert_and_save)
        pool(mapping(i) for i in starts)

def soundscapes_to_npy(is_test=False, n_processes=4):
    """Convert every train (or test) soundscape, one file per joblib task."""
    pool = joblib.Parallel(n_jobs=n_processes)
    mapping = joblib.delayed(every_5sec)
    # To parallelize within each file as well, also pass single_process=False.
    if is_test:
        tasks = [mapping(id_, save_to=testSoundScapes) for id_ in S_testSoundScapeIDs]
    else:
        tasks = [mapping(id_, save_to=trainSoundScapes) for id_ in S_trainSoundScapeIDs]
    pool(tqdm(tasks))
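The chunking arithmetic in `every_5sec` can be checked in isolation. The sketch below reproduces only the index computation: where each 5-second piece starts and ends, and which number ends up in the file name. The function name `chunk_plan` is hypothetical and the sample rate `32000` is an assumption, since the value of `SR` is not shown here.

```python
def chunk_plan(n_samples, sr, chunk_seconds=5):
    """Return (start, end, end_second) for every complete chunk of the audio."""
    chunk = sr * chunk_seconds
    # Dropping the remainder guarantees that every chunk is complete.
    starts = range(0, n_samples - n_samples % chunk, chunk)
    return [(start, start + chunk, (start // chunk + 1) * chunk_seconds)
            for start in starts]


# 12 seconds of audio plus 100 stray samples: only two complete chunks fit.
plan = chunk_plan(n_samples=32000 * 12 + 100, sr=32000)
print(plan)  # [(0, 160000, 5), (160000, 320000, 10)]

# File names follow the pattern used in every_5sec.
names = [f"7019_COR_{end_second}.npy" for _, _, end_second in plan]
print(names)  # ['7019_COR_5.npy', '7019_COR_10.npy']
```

Naming each piece by the second at which it ends means the file stems line up with the `row_id` values of `df_train_soundscape`, such as `7019_COR_5`.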

### Nota Bene
- `tasks` (i.e. the input to `joblib.Parallel`) can be either a generator or a list. A generator has no known length, though, so when it is wrapped in `tqdm` the progress bar cannot show a percentage, whereas with a list it can.
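The difference is easy to see without `tqdm` itself: a list supports `len()`, a generator does not. The helper `known_length` and the task tuples below are hypothetical stand-ins for the delayed `joblib` calls.

```python
def known_length(iterable):
    """Return len(iterable), or None when the object has no length."""
    try:
        return len(iterable)
    except TypeError:
        return None


ids = ["7019", "54955"]  # two audio IDs that appear in df_train_soundscape

tasks_gen = (("every_5sec", id_) for id_ in ids)   # generator: length unknown
tasks_list = [("every_5sec", id_) for id_ in ids]  # list: length known

print(known_length(tasks_gen))   # None
print(known_length(tasks_list))  # 2
```

If a generator is preferred, for example to avoid building all tasks up front, `tqdm` also accepts a `total` argument, so passing the number of IDs explicitly should restore the percentage display.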

In [None]:
%%time
soundscapes_to_npy()

In [None]:
soundscapes_to_npy(is_test=True)

In [None]:
S_testSoundScapeIDs

In [None]:
!ls $trainSoundScapes | wc -l

In [None]:
!ls $testSoundScapes | wc -l

In [None]:
16 * (600 // 5)

In [None]:
4 * (600 // 5)

Let's at least verify that the saved images differ from one another.<br>
Execute the next cell several times to view randomly chosen mel spectrograms.

In [None]:
rand_npy = random.choice(list(trainSoundScapes.iterdir()))
rand_image = np.load(rand_npy)
print(f"rand_npy = {rand_npy.name}")
librosa.display.specshow(rand_image);

### Objective 2: Construct `df_train_soundscape`

Recall that
> - We want to update `df_train_soundscape` to contain more information. What information?
>   - Date: Can be separated.
>   - Corresponding `.npy` path: Can be separated.
>   - Longitude, latitude: Can be separated.
>   - Bird labels to bird indices?
>   - A new column `"n_birds"`, plus some statistics on it?

Construct a dictionary with

- key: recording location, e.g. `COR`, `SSW`, etc.
- value: possibly `NamedTuple(longitude, latitude)`
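A minimal sketch of such a dictionary, assuming a `NamedTuple` with the fields `longitude` and `latitude`. The values are copied from the `D_location_coordinate` output shown earlier in this notebook; the class definition itself is a guess, since the real one is not shown.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    """Location of a recording site, in degrees."""
    longitude: float
    latitude: float


D_location_coordinate = {
    "COR": Coordinate(longitude=-84.51, latitude=10.12),
    "SSW": Coordinate(longitude=-76.45, latitude=42.47),
    "SNE": Coordinate(longitude=-119.95, latitude=38.49),
    "COL": Coordinate(longitude=-75.85, latitude=5.57),
}

# Fields can be read by name, and the tuple unpacks like any other.
longitude, latitude = D_location_coordinate["SSW"]
print(D_location_coordinate["COR"].longitude, longitude, latitude)  # -84.51 -76.45 42.47
```

A named tuple keeps the two numbers together and immutable, while still behaving like a plain tuple wherever one is expected.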

I think the year won't make much difference.