## Regenerate Train/Val/Test Sets for `PATH_DATASET/train_soundscapes/`
The first approach to generating the data (the notebook `01-data_exploration.ipynb`) no longer meets our needs,
so we regenerate the dataset, this time

- without excluding the validation set
- saving `.npy` files into `./train_npy/` and `./val_npy/`

This notebook is used together with

- `utils.py`
- `soundscape_to_npy.py`

## `train_soundscapes/`

In [1]:
from soundscape_to_npy import *

  0%|          | 0/20 [00:00<?, ?it/s]

  0%|          | 0/1600 [00:00<?, ?it/s]

  0%|          | 0/400 [00:00<?, ?it/s]

  0%|          | 0/400 [00:00<?, ?it/s]

In [None]:
!ls soundscape_npy_tmp/ | wc -l

In [4]:
df_soundscape_train.to_csv("soundscape_train.csv", index=False)
df_soundscape_val.to_csv("soundscape_val.csv", index=False)
df_soundscape_test.to_csv("soundscape_test.csv", index=False)

In [None]:
df_train_soundscape.shape

In [None]:
tmp_train_npy_paths = [soundscape_npy_tmp / f"{row_id}.npy" for row_id in df_soundscape_train["row_id"]]
tmp_train_npy_paths

In [None]:
df_soundscape_test

In [None]:
df_train_soundscape.head()

### Cyclic Data
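This section exercises `cyclicize_number` and `cyclicize_series`, which map a periodic quantity (e.g. month or longitude) to `x`/`y` coordinates on a circle.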
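As a rough illustration of what such a cyclic encoding does, a periodic value can be mapped to a point on the unit circle so that both ends of its range land on (nearly) the same spot. The helper `to_unit_circle` below is hypothetical; it is not the `cyclicize_number` from `utils.py`, whose signature and argument order may differ.

```python
import numpy as np

def to_unit_circle(number, min_, max_):
    """Hypothetical helper: map a value in [min_, max_] to (x, y) on the unit circle."""
    period = max_ - min_
    theta = 2 * np.pi * ((number - min_) / period)
    return np.cos(theta), np.sin(theta)

# Both ends of a 24-unit cycle map to (almost) the same point.
x0, y0 = to_unit_circle(0, 0, 24)
x24, y24 = to_unit_circle(24, 0, 24)

# NumPy arrays (and pandas Series) are encoded element-wise.
xs, ys = to_unit_circle(np.array([0, 6, 12, 18]), 0, 24)
```

Because the function only uses NumPy ufuncs, the same code path can serve both a single number and a whole column.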

In [None]:
D_location_coordinate

In [None]:
df_train_soundscape.columns

In [None]:
cyclicize_series(df_train_soundscape["longitude"], 180, -180)

In [None]:
[cyclicize_number(coord.longitude, 180, -180) for coord in D_location_coordinate.values()]

It would be more efficient to convert only these four longitudes to cyclic form once (instead of converting them repeatedly in the dataframe), but I'll leave that for when I have free time.
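The optimisation mentioned above could look roughly like this: encode each distinct location once, then look the result up per row. The `site` column, the toy coordinates, and `encode_longitude` are all assumptions made for illustration; the real `D_location_coordinate` and dataframe layout may differ.

```python
import numpy as np
import pandas as pd

def encode_longitude(lon):
    """Hypothetical stand-in for the longitude encoding: degrees -> (x, y)."""
    theta = 2 * np.pi * ((lon + 180.0) / 360.0)
    return np.cos(theta), np.sin(theta)

# Toy stand-ins (made-up values) for the location table and the dataframe.
site_longitude = {"A": -120.0, "B": -75.0, "C": 10.0, "D": 150.0}
df = pd.DataFrame({"site": ["A", "B", "A", "D", "C", "B"]})

# Encode each distinct site once ...
site_xy = {site: encode_longitude(lon) for site, lon in site_longitude.items()}

# ... then look the result up per row instead of recomputing it.
df["longitude_x"] = df["site"].map(lambda s: site_xy[s][0])
df["longitude_y"] = df["site"].map(lambda s: site_xy[s][1])
```

With only four distinct locations, the trigonometric functions are evaluated four times regardless of how many rows the dataframe has.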

In [None]:
cyclicize_number(24, 0, 24)

In [None]:
cyclicize_number(1, 0, 24)

In [None]:
cyclicize_number(0, 0, 24)

In [None]:
cyclicize_number(180, 180, -180)

In [None]:
cyclicize_number(-180, 180, -180)

Note that the coordinates above are close in the Euclidean plane but not identical. Judging by the experiments in the following cells, this seems to be because `np.sin(2*np.pi)` **is not equal to** `np.sin(0)` numerically, since `2*np.pi` is only a floating-point approximation of 2π.

Choosing between

- `theta = 2 * np.pi * (number / period)`
- `theta = 2 * np.pi * ((number - min_) / period)`

seems to make little difference.
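A small sketch of why the two formulas behave alike: both place the two ends of the range on the same point up to rounding noise, and they differ from each other only by a fixed rotation. The `encode` function and its `shift` flag are hypothetical and only mimic the two `theta` variants; they are not the implementation in `utils.py`.

```python
import numpy as np

def encode(number, min_, max_, shift):
    """Hypothetical encoder; shift=True subtracts min_ before scaling."""
    period = max_ - min_
    offset = min_ if shift else 0
    theta = 2 * np.pi * ((number - offset) / period)
    return np.array([np.cos(theta), np.sin(theta)])

for shift in (False, True):
    lo = encode(-180, -180, 180, shift)
    hi = encode(180, -180, 180, shift)
    # The endpoints agree up to floating-point rounding in both variants.
    print(shift, np.abs(lo - hi).max(), np.allclose(lo, hi))

# For longitude the two variants differ by half a turn, i.e. a sign flip.
rotated = np.allclose(encode(90, -180, 180, False), -encode(90, -180, 180, True))
```

Comparing with `np.allclose` (or `np.isclose`) rather than `==` is the usual way to treat such endpoints as the same point.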

In [None]:
for i in range(1, 100):
    if i / i != 1:
        print(f"{i}")

In [None]:
np.cos(0) - np.cos(2*np.pi)

In [None]:
np.sin(0) - np.sin(2*np.pi)

In [None]:
df_train_soundscape.columns

In [None]:
df_train_soundscape[["longitude", "longitude_x", "longitude_y"]]

In [None]:
df_train_soundscape["latitude"].max(), df_train_soundscape["latitude"].min()

In [None]:
df_meta = pd.read_csv(PATH_DATASET / "train_metadata.csv")
df_meta["latitude"].max(), df_meta["latitude"].min()

In [None]:
df_meta["latitude"].value_counts()

In [None]:
sorted(df_meta["latitude"].unique())

In [None]:
(df_train_soundscape["latitude"] / 90).value_counts()

In [None]:
(df_train_soundscape["latitude"] / 90).unique()

In [None]:
df_train_soundscape.columns

In [None]:
df_train_soundscape.head()

In [None]:
df_train_soundscape[["month", "month_x", "month_y"]].value_counts()

### Train/Val/Test Split

<s>There are a total of `20` `.ogg` files in `train_soundscapes/`: I would like to split these into train/val/test sets.</s>

- <s>`12` files for train</s>
- <s>`4` files for val</s>
- <s>`4` files for test</s>

Unlike our first attempt, here I would like to use `StratifiedShuffleSplit` (from `sklearn`), stratifying on the column `n_birds` of `df_train_soundscape`.
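For reference, a minimal sketch of a single stratified split on toy data. The dataframe here is made up, and `n_splits=1` is a choice for this sketch rather than what the notebook uses. `StratifiedShuffleSplit` requires every class to have at least two members, so a class with a single row has to be set aside before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the soundscape dataframe (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({"n_birds": rng.choice([0, 1, 2], size=200, p=[0.6, 0.3, 0.1])})

# n_splits=1 yields exactly one (train, test) pair; the default is 10.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=0)
train_idx, test_idx = next(splitter.split(df, df["n_birds"]))

# split() returns positional indices, so iloc is the safe accessor.
df_train = df.iloc[train_idx]
df_test = df.iloc[test_idx]

# Class proportions should be similar in both parts.
print(df_train["n_birds"].value_counts(normalize=True))
print(df_test["n_birds"].value_counts(normalize=True))
```

Using `.loc` instead of `.iloc` gives the same rows only when the index is `0..n-1`, which is why an index reset before splitting matters.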


In [None]:
df_train_soundscape["n_birds"].value_counts()

In [None]:
df_train_soundscape.shape

In [None]:
df_train_soundscape[df_train_soundscape["n_birds"] == 5]

In [None]:
df_5_birds = df_train_soundscape[df_train_soundscape["n_birds"] == 5]
df_5_birds

In [None]:
# Set aside row 1974 (the row with n_birds == 5) before the stratified split.
df_le_4_birds = df_train_soundscape.drop(index=[1974])
df_le_4_birds.shape

In [None]:
1974 in df_le_4_birds.index

In [None]:
df_le_4_birds.reset_index(drop=True, inplace=True)
df_le_4_birds

In [None]:
1974 in df_le_4_birds.index

In [None]:
list(df_le_4_birds.index) == list(range(2399))

In [None]:
# n_splits defaults to 10, so the loop below runs 10 times and keeps the last split.
soundscape_split1 = StratifiedShuffleSplit(test_size=400, random_state=SEED)
for tv_indices, test_indices in soundscape_split1.split(df_le_4_birds, df_le_4_birds["n_birds"]):
    df_soundscape_train_val = df_le_4_birds.loc[tv_indices]
    df_soundscape_test = df_le_4_birds.loc[test_indices]

In [None]:
df_soundscape_train_val.index

In [None]:
df_soundscape_test.index

In [None]:
sorted(df_soundscape_train_val.index.union(df_soundscape_test.index)) == list(range(2399))

In [None]:
df_soundscape_test["n_birds"].value_counts()

In [None]:
df_soundscape_train_val["n_birds"].value_counts()

In [None]:
df_soundscape_train_val.reset_index(drop=True, inplace=True)
# Reuse soundscape_split1: the val split uses the same test_size (400) and seed.
for train_indices, val_indices in soundscape_split1.split(df_soundscape_train_val, df_soundscape_train_val["n_birds"]):
    df_soundscape_train = df_soundscape_train_val.loc[train_indices]
    df_soundscape_val = df_soundscape_train_val.loc[val_indices]

In [None]:
df_soundscape_train["n_birds"].value_counts()

In [None]:
df_soundscape_val["n_birds"].value_counts()

In [None]:
df_soundscape_test["n_birds"].value_counts()

In [None]:
pd.concat([df_soundscape_train, df_5_birds]) 

In [None]:
pd.concat([df_soundscape_train, df_5_birds]).loc[1974]

In [None]:
df_soundscape_train.loc[1974]

In [None]:
df_soundscape_test

## Cut the Audio and Place the Clips into Train/Val/Test Folders
- Although we already know which split each clip belongs to, it seems better to first cut the audio and save the `.npy` files into a common folder, say `./soundscape_npy_tmp/`.
- Then we move each file to its corresponding folder according to `df_soundscape_train/df_soundscape_val/df_soundscape_test`.
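The second step could be sketched as follows, assuming each clip is stored as `<row_id>.npy` in the temporary folder. `move_split` is a hypothetical helper, and the demo runs in a throwaway directory with made-up row ids rather than on the real data.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

def move_split(df, src_dir, dst_dir):
    """Move `<row_id>.npy` for every row of `df` from src_dir to dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for row_id in df["row_id"]:
        src = src_dir / f"{row_id}.npy"
        if src.exists():  # skip rows whose file is missing
            shutil.move(str(src), str(dst_dir / src.name))
            moved += 1
    return moved

# Demo in a temporary directory with made-up row ids.
root = Path(tempfile.mkdtemp())
tmp_dir = root / "soundscape_npy_tmp"
tmp_dir.mkdir()
for rid in ["a_5", "a_10", "b_5"]:
    (tmp_dir / f"{rid}.npy").touch()

df_train_demo = pd.DataFrame({"row_id": ["a_5", "b_5"]})
df_val_demo = pd.DataFrame({"row_id": ["a_10"]})
n_train = move_split(df_train_demo, tmp_dir, root / "train_npy")
n_val = move_split(df_val_demo, tmp_dir, root / "val_npy")
```

Returning the number of moved files makes it easy to compare against the length of each split dataframe and spot missing clips.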

In [None]:
list((PATH_DATASET / "train_soundscapes").iterdir())