[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C_m0ntAKT9vycS82lMiSa6X6P6KFvI8z)

This notebook reads the Orcasound catalogue and filters it by time range and hydrophone. The filtering output still needs to be tested.

In [45]:
pip install s3fs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [57]:
import dask

In [58]:
import dask.dataframe as dd

In [59]:
from dask import delayed

In [60]:
import pandas as pd

In [61]:
import fsspec

In [62]:
from pathlib import Path, PurePath

In [63]:
import datetime

In [64]:
fs = fsspec.filesystem('s3', anon=True)

### Reading AWS Credentials

In [65]:
# Upload the keys file (OrcaSoundKeys.csv) to the session before running this cell
keys = pd.read_csv("OrcaSoundKeys.csv")

In [87]:
pip install fastparquet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fastparquet
  Downloading fastparquet-2023.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting cramjam>=2.3
  Downloading cramjam-2.6.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting pandas>=1.5.0
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Installing collected packages: cramjam, pandas, fastparquet
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      S

### Reading Catalogue

In [88]:
catalogue = dd.read_parquet('s3://orcasound-inventory/streaming-orcasound-net/orcasound-streaming-inventory/catalogue.parquet', storage_options={'anon':True}, engine = 'fastparquet')

In [80]:
catalogue.shape

(Delayed('int-4792b31f-5477-403b-a6a0-64e991d32dc9'), 6)

In [81]:
catalogue.columns

Index(['filename', 'duration', 'end_time', 'start_time', 'hydrophone',
       'fullpath'],
      dtype='object')

### Filtering Catalogue

In [94]:
start_range = datetime.datetime.strptime('2022-06-30', "%Y-%m-%d").timestamp()
end_range = datetime.datetime.strptime('2022-07-01', "%Y-%m-%d").timestamp()

In [95]:
start_range

1656547200.0
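`strptime(...).timestamp()` interprets a naive datetime in the machine's local timezone. The value above corresponds to midnight UTC, which is consistent with a runtime whose clock is set to UTC, but the same cell would give a different number elsewhere. A minimal sketch that pins the timezone explicitly, assuming the catalogue's `start_time` and `end_time` are Unix epoch seconds (the helper name `day_start_epoch` is hypothetical):

```python
from datetime import datetime, timezone


def day_start_epoch(day: str) -> float:
    """Return the Unix timestamp of midnight UTC for a 'YYYY-MM-DD' string."""
    naive = datetime.strptime(day, "%Y-%m-%d")
    # Attach UTC explicitly so the result does not depend on the local timezone
    return naive.replace(tzinfo=timezone.utc).timestamp()


start_range = day_start_epoch("2022-06-30")
end_range = day_start_epoch("2022-07-01")
print(start_range, end_range)
```

With this, `start_range` matches the value shown above regardless of where the notebook runs.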

In [96]:
%%time
catalogue_selected = catalogue[(catalogue['start_time']>start_range) & (catalogue['end_time']<end_range) & (catalogue["hydrophone"]=='rpi_orcasound_lab')].compute()

CPU times: user 1min 28s, sys: 5.18 s, total: 1min 33s
Wall time: 12min 37s


Selecting one day of files (about 8,600) takes roughly 30 min with the `pyarrow` engine and roughly 12 min with the `fastparquet` engine.
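Much of that wall time is probably spent reading the whole catalogue before the mask is applied. One option worth testing is the `filters` argument of `dd.read_parquet`, which lets the reader skip row groups whose statistics rule out a match. The sketch below has not been run against this bucket, and the helper names `build_filters` and `read_filtered_catalogue` are hypothetical. Because filtering may only happen at the row-group level depending on the engine and dask version, the boolean mask is applied again after reading.

```python
CATALOGUE_URL = (
    "s3://orcasound-inventory/streaming-orcasound-net/"
    "orcasound-streaming-inventory/catalogue.parquet"
)


def build_filters(start, end, hydrophone):
    """Build a list of (column, op, value) tuples; all conditions are ANDed."""
    return [
        ("start_time", ">", start),
        ("end_time", "<", end),
        ("hydrophone", "==", hydrophone),
    ]


def read_filtered_catalogue(start, end, hydrophone, engine="fastparquet"):
    """Read the catalogue with pushdown filters, then apply the exact row mask."""
    # Imported here so the helpers above can be used without dask installed
    import dask.dataframe as dd

    ddf = dd.read_parquet(
        CATALOGUE_URL,
        storage_options={"anon": True},
        engine=engine,
        filters=build_filters(start, end, hydrophone),
    )
    # Row-group filtering can leave non-matching rows, so filter row-wise as well
    mask = (
        (ddf["start_time"] > start)
        & (ddf["end_time"] < end)
        & (ddf["hydrophone"] == hydrophone)
    )
    return ddf[mask].compute()
```

Whether this helps depends on how the catalogue's row groups are laid out: if every row group spans all hydrophones and dates, nothing can be skipped.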

In [97]:
catalogue_selected.shape

(8634, 6)
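The row count alone does not confirm that the filter selected the intended rows. A minimal sketch of a check that could be run on `catalogue_selected`, shown here on a few synthetic rows made up for illustration (the function name `check_selection` is hypothetical):

```python
import pandas as pd


def check_selection(df, start, end, hydrophone):
    """Assert every row lies strictly inside (start, end) for one hydrophone."""
    assert (df["start_time"] > start).all(), "row starts before the window"
    assert (df["end_time"] < end).all(), "row ends after the window"
    assert (df["hydrophone"] == hydrophone).all(), "unexpected hydrophone"
    return len(df)


start, end = 1656547200.0, 1656633600.0

# Synthetic rows, not real catalogue entries
sample = pd.DataFrame(
    {
        "start_time": [1656547210.0, 1656547210.0, 1656633590.0],
        "end_time": [1656547220.0, 1656547220.0, 1656633600.5],
        "hydrophone": ["rpi_orcasound_lab", "rpi_bush_point", "rpi_orcasound_lab"],
    }
)

# Same mask as the notebook's filter, applied to the synthetic frame
mask = (
    (sample["start_time"] > start)
    & (sample["end_time"] < end)
    & (sample["hydrophone"] == "rpi_orcasound_lab")
)
selected = sample[mask]
n_rows = check_selection(selected, start, end, "rpi_orcasound_lab")
print(n_rows)
```

On the real data the call would be `check_selection(catalogue_selected, start_range, end_range, 'rpi_orcasound_lab')`.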

In [85]:
catalogue.head()

Unnamed: 0,filename,duration,end_time,start_time,hydrophone,fullpath
0,live000.ts,10.010044,1626482000.0,1626482000.0,rpi_bush_point,streaming-orcasound-net/rpi_bush_point/hls/162...
1,live001.ts,10.005356,1626482000.0,1626482000.0,rpi_bush_point,streaming-orcasound-net/rpi_bush_point/hls/162...
2,live002.ts,10.005333,1626482000.0,1626482000.0,rpi_bush_point,streaming-orcasound-net/rpi_bush_point/hls/162...
3,live003.ts,9.984022,1626482000.0,1626482000.0,rpi_bush_point,streaming-orcasound-net/rpi_bush_point/hls/162...
4,live004.ts,10.005344,1626482000.0,1626482000.0,rpi_bush_point,streaming-orcasound-net/rpi_bush_point/hls/162...
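`head()` prints `start_time` and `end_time` with limited precision, so consecutive segments look identical. Assuming these columns hold Unix epoch seconds, `pd.to_datetime` with `unit="s"` makes them readable. The values below are illustrative, not taken from the catalogue:

```python
import pandas as pd

# Illustrative values only; real rows come from the catalogue
sample = pd.DataFrame(
    {
        "start_time": [1626482000.0, 1626482010.0],
        "duration": [10.0, 10.0],
    }
)

# Interpret the floats as seconds since the Unix epoch, in UTC
sample["start_dt"] = pd.to_datetime(sample["start_time"], unit="s", utc=True)
# Derive the segment end from its start and duration
sample["end_dt"] = sample["start_dt"] + pd.to_timedelta(sample["duration"], unit="s")

print(sample[["start_dt", "end_dt"]])
```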


In [None]:
%%time
# checking which hydrophones are available
nodes = catalogue['hydrophone'].compute().unique()

In [98]:
nodes

array(['rpi_bush_point', 'rpi_port_townsend', 'rpi_orcasound_lab'],
      dtype=object)

In [None]:
catalogue_selected["hydrophone"].unique()

array(['rpi_bush_point'], dtype=object)

In [None]:
catalogue["hydrophone"].unique()

array(['rpi_bush_point'], dtype=object)