Move towards human-readable timestamps in audio filenames and/or directory names #7

Open
scottveirs opened this issue May 8, 2018 · 23 comments

@scottveirs
Member

In the long run, it would be valuable to stream and archive the Orcasound acoustic data with a NIST-synchronized timebase encoded in both the FLAC files and possibly also the HLS/DASH stream manifest and/or segments. If adjacent hydrophones (within earshot of each other) are synchronized with millisecond to microsecond precision, then we will be able to localize sounds with an accuracy that will help us learn more about biology: e.g. direction a soniferous animal is moving, location of a sound source, or identity of a signaler.

To this end, the shell script might be adapted (along with changes to how the player stays current) from its current syntax --

timestamp=$(date +%s)

-- to syntax such as:

timestamp=$(date +\%Y-\%m-\%d)

code source and snippet:

$ rsync -avz --delete --backup --backup-dir="backup_$(date +%Y-%m-%d)" /source/path/ /dest/path
By using $(date +%Y-%m-%d) I’m telling it to use today’s date in the folder name.
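
As a minimal sketch of the difference between the two naming schemes, the Python below formats a Unix timestamp (what `date +%s` produces) as a human-readable UTC string. The function name `readable_dirname` and the `%Y-%m-%d_%H-%M-%S` layout are illustrative assumptions, not current orcanode behavior.

```python
from datetime import datetime, timezone


def readable_dirname(epoch_seconds: int) -> str:
    """Format a Unix timestamp as a UTC, filename-safe string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d_%H-%M-%S")


# 1525737600 is midnight UTC on the day this issue was opened.
print(readable_dirname(1525737600))  # prints 2018-05-08_00-00-00
```

Formatting in UTC rather than local time keeps names comparable across nodes in different time zones.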

@scottveirs scottveirs moved this from To do to In progress in Orcanode development Aug 6, 2021
@scottveirs scottveirs changed the title Consider human-readable datetime or MJD directory names Move towards human-readable timestamps in audio filenames and/or directory names Aug 6, 2021
@scottveirs
Member Author

It would be even better, as Paul pointed out on Slack recently, to get rid of the datetime-stamped S3 objects (akin to directories) and just store all data under a nodename with each data filename incorporating a NIST-synchronized timestamp.

We could get HLS segments to match the filename format of the FLAC files, which in the archive-orcasound-net bucket currently look something like:
2020-12-09_23-22-16_rpi_orcasound_lab--2.flac

Or we could align with ONC or OOI filename formats:

OOI: OO-HYVM2--YDH-2017-08-21T00_02_42.437000.mseed
ONC: ICLISTENHF1293_20171226T145827.651Z.wav
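
To compare the three filename conventions above, here is a rough sketch that pulls the start time out of each example. The `LAYOUTS` table and `start_time` helper are hypothetical names; the regexes are inferred only from the single example of each format shown here, and the result is left timezone-naive because the filenames themselves do not all state a zone.

```python
import re
from datetime import datetime

# (regex locating the timestamp, strptime layout) per archive; inferred from
# one example filename each, so treat these as assumptions.
LAYOUTS = {
    "orcasound": (r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}", "%Y-%m-%d_%H-%M-%S"),
    "ooi": (r"\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2}\.\d{6}", "%Y-%m-%dT%H_%M_%S.%f"),
    "onc": (r"\d{8}T\d{6}\.\d{3}Z", "%Y%m%dT%H%M%S.%fZ"),
}


def start_time(filename: str, archive: str) -> datetime:
    """Return the (timezone-naive) start time embedded in a filename."""
    pattern, layout = LAYOUTS[archive]
    match = re.search(pattern, filename)
    if match is None:
        raise ValueError(f"no {archive} timestamp in {filename!r}")
    return datetime.strptime(match.group(0), layout)


print(start_time("ICLISTENHF1293_20171226T145827.651Z.wav", "onc"))
```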

Orcanode development automation moved this from In progress to Done Aug 6, 2021
@scottveirs scottveirs reopened this Aug 6, 2021
Orcanode development automation moved this from Done to In progress Aug 6, 2021
@mcshicks mcshicks self-assigned this Aug 11, 2021
@mcshicks
Contributor

I have this working now, based on Paul's suggestion to use "-strftime 1" and modifying stream.sh (for research) to use this filename: "/tmp/$NODE_NAME/hls/$timestamp/%Y-%m-%d_%H-%M-%S.ts".
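
With `-strftime 1`, ffmpeg expands strftime directives in the segment filename as each segment is written. To preview what a pattern will produce without running ffmpeg, the same directives can be expanded in Python. This is only a sketch: `preview_segment_path` is a hypothetical helper, and the node name and directory timestamp are example values.

```python
from datetime import datetime

# Same layout as the stream.sh pattern, with the shell variables as fields.
SEGMENT_PATTERN = "/tmp/{node}/hls/{timestamp}/%Y-%m-%d_%H-%M-%S.ts"


def preview_segment_path(node: str, timestamp: int, when: datetime) -> str:
    """Expand the segment pattern for one moment in time."""
    pattern = SEGMENT_PATTERN.format(node=node, timestamp=timestamp)
    return when.strftime(pattern)


print(preview_segment_path("rpi_orcasound_lab", 1628704800,
                           datetime(2021, 8, 11, 18, 1, 50)))
# prints /tmp/rpi_orcasound_lab/hls/1628704800/2021-08-11_18-01-50.ts
```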

@valentina-s
Contributor

I think the more standard format is %Y-%m-%dTH:%M:%S.ts
i.e. colons for the hours, and T instead of the _ (space is also used but bad for filenames). Also, what about milliseconds?
I agree that a timezone indication would be good, since I am never sure whether it is Greenwich time or local time.

ISO-8601 2021-08-11T18:01:50+00:00
UTC 2021-08-11T18:01:50Z

@Molkree you want to add your comments on the format?

@mcshicks
Contributor

I tried %Y-%m-%dTH:%M:%S.ts instead of %Y-%m-%d_%H-%M-%S.ts and I could not get the player to work. Not sure if it's unhappy with the : or the TH (probably the :), but ffmpeg does write the files fine. I can look into milliseconds, but I think the rpi's time is probably only accurate to maybe 10 ms? It uses NTP to sync time.

@paulcretu
Member

ISO 8601 is a good idea, the full thing with timezone is %Y-%m-%dT%H:%M:%S%z. The problem is colons won't work on some filesystems (Windows), not sure if that has anything to do with it not working for you @mcshicks. I would propose something like %Y-%m-%d_%H-%M-%S_%Z (2021-08-12_20-52-09_UTC). It's readable, portable, and easy-ish to translate into ISO 8601.

The timezone could be easier to translate with %z (e.g. 2021-08-12_20-52-09+0000) since you wouldn't have to look up the abbreviation (like PDT in 2021-08-12_20-52-09_PDT). But there might be some cases where the + is a problem, and with negative offsets, it's a bit confusing to have the - (2021-08-12_20-52-09-0700). It would be nicest to get 2021-08-12_20-52-09Z for UTC and +0000 offset notation for other timezones but that doesn't seem to be an option with strftime.
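
Since strftime alone cannot emit `Z` for UTC and a numeric offset otherwise, one option is a small wrapper around it. The sketch below is an assumption about how that could look (the names `filename_stamp` and `to_iso8601` are made up for illustration); it also shows the round trip back to ISO 8601.

```python
from datetime import datetime, timedelta, timezone

LAYOUT = "%Y-%m-%d_%H-%M-%S"


def filename_stamp(moment: datetime) -> str:
    """Colon-free stamp: 'Z' suffix for UTC, +HHMM/-HHMM otherwise."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("a timezone-aware datetime is required")
    suffix = "Z" if offset == timedelta(0) else moment.strftime("%z")
    return moment.strftime(LAYOUT) + suffix


def to_iso8601(stamp: str) -> str:
    """Translate a stamp produced above into extended ISO 8601."""
    # Since Python 3.7, %z accepts "Z" as well as offsets such as -0700.
    return datetime.strptime(stamp, LAYOUT + "%z").isoformat()


pdt = timezone(timedelta(hours=-7))
print(filename_stamp(datetime(2021, 8, 12, 20, 52, 9, tzinfo=timezone.utc)))
print(filename_stamp(datetime(2021, 8, 12, 20, 52, 9, tzinfo=pdt)))
print(to_iso8601("2021-08-12_20-52-09-0700"))
```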

@Molkree
Member

Molkree commented Aug 14, 2021

The problem is colons won't work on some filesystems (Windows)

Haha, actually you can't even upload such files using actions/upload-artifact#35 in GitHub workflows. I used colons at first but then changed it to this %Y-%m-%dT%H-%M-%S-%f

Haven't thought about timezone, I just used UTC everywhere I believe. If you do add it to the filename I'd also prefer +0000. If extra - at the end looks confusing, can always add delimiter like TZ or something (2021-08-12_20-52-09TZ-0100).

@valentina-s
Copy link
Contributor

I did not think about the colons. The OOI Archive has them but I guess this causes issues for some users.
The format without dashes and colons %Y%m%dT%H%M%S%Z is also supported by ISO 8601. I wonder if that can be run by the player? It may be less human readable but is also machine readable. I am more biased toward using something standard. The fractions are expected to be delimited with dots (or commas) to distinguish 01.05 (1 h 3 min) vs 01:05 (1h 5min). If there are no dashes before, maybe then the -/+ timezone will be more obvious. Is the local timezone preferred? It is only one but it may not be obvious to a non-local person.

@Molkree
Member

Molkree commented Aug 17, 2021

Is the local timezone preferred? It is only one but it may not be obvious to a non-local person.

Right now we use Unix time so I'd prefer to stay with UTC. Not specifying time zone implies local time so fully compliant ISO 8601 UTC time without colons would look like 20210812T205209+0000, 20210812T205209+00 or 20210812T205209Z.
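
For reference, a short sketch of producing the compact (basic-format) UTC stamp with a `Z` suffix, optionally with milliseconds after a dot. The helper name `basic_utc_stamp` is hypothetical, and whether the player accepts such names is untested here.

```python
from datetime import datetime, timedelta, timezone


def basic_utc_stamp(moment: datetime, milliseconds: bool = False) -> str:
    """Return an ISO 8601 basic-format UTC stamp such as 20210812T205209Z."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    utc = moment.astimezone(timezone.utc)
    stamp = utc.strftime("%Y%m%dT%H%M%S")
    if milliseconds:
        # Fractions of a second are separated with a dot.
        stamp += f".{utc.microsecond // 1000:03d}"
    return stamp + "Z"


local = datetime(2021, 8, 12, 13, 52, 9, tzinfo=timezone(timedelta(hours=-7)))
print(basic_utc_stamp(local))  # prints 20210812T205209Z
```

Converting to UTC before formatting means a node configured with a local time zone still produces the same name as its neighbors.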

I personally don't care that much about strict standard adherence in this case and would prefer something more readable but still in UTC.

@scottveirs scottveirs self-assigned this Nov 3, 2021
@scottveirs
Member Author

@tsuize @veirs this is the HLS timestamp issue I was seeking on today's call. I think we should tackle this formatting decision this winter, adjust the orcanode code accordingly, and then fix everything that we're going to break, including at least:

  • The orcasite player code
  • The ingestion of live HLS data by aifororcas-livesystem (within Azure)
  • Scripts and packages that retrieve HLS data for particular time ranges
  • Likely the mseed transcoding tools built by @karan2704 and @mcshicks?

@scottveirs
Member Author

After looking at MBARI's Pacific Sound open data registry a bit, they seem to be using something like this:

2017-06-13T16:00:00

and John Ryan confirms via Slack that this is relying on the convention of scientific timestamps being assumed to be in the UTC time zone.

Personally, I find the ambiguity unnerving enough that I think it's worth resolving with the extra 3 characters +00...

So, I'd propose one of the following options:

  1. 20170613T160000+00
  2. 20170613-160000+00 which I find just barely human-readable enough
  3. 2017-06-13T16-00-00+00
  4. 2017-06-13_16-00-00+00 which I feel is the most human-readable while avoiding colons :

Or just use Modified Julian Date (MJD) for the filenames and utilize existing packages to decode into human-readable formats if/when necessary.

Opinions?
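
For the MJD option, a conversion sketch is below. It relies on the fact that the Unix epoch (1970-01-01T00:00:00Z) is MJD 40587; like Unix time itself, it ignores leap seconds. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

MJD_OF_UNIX_EPOCH = 40587  # MJD of 1970-01-01T00:00:00Z
SECONDS_PER_DAY = 86400
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_mjd(moment: datetime) -> float:
    """Convert a timezone-aware datetime to Modified Julian Date."""
    return moment.timestamp() / SECONDS_PER_DAY + MJD_OF_UNIX_EPOCH


def from_mjd(mjd: float) -> datetime:
    """Convert a Modified Julian Date back to a UTC datetime."""
    return UNIX_EPOCH + timedelta(days=mjd - MJD_OF_UNIX_EPOCH)


mjd = to_mjd(datetime(2017, 6, 13, 16, 0, 0, tzinfo=timezone.utc))
print(mjd, from_mjd(mjd).isoformat())
```

Note that a double-precision MJD near 58000 resolves only to roughly a microsecond, which may matter if filenames are ever expected to carry localization-grade precision.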

@scottveirs
Member Author

Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).
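
To put those precision targets in context: a timing offset between two hydrophones shows up as an error in the estimated path-length difference of roughly sound speed times offset. The arithmetic below is a back-of-the-envelope sketch that assumes a nominal seawater sound speed of 1500 m/s; it is not a localization error model.

```python
SOUND_SPEED_M_PER_S = 1500.0  # nominal seawater value (assumption)


def path_error_m(timing_error_s: float,
                 sound_speed: float = SOUND_SPEED_M_PER_S) -> float:
    """Path-length-difference error caused by a clock offset."""
    return timing_error_s * sound_speed


for label, dt in [("10 ms", 10e-3), ("100 us", 100e-6), ("10 us", 10e-6)]:
    print(f"{label}: {path_error_m(dt):.3f} m")
# 10 ms: 15.000 m
# 100 us: 0.150 m
# 10 us: 0.015 m
```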

@scottveirs
Member Author

scottveirs commented Jan 11, 2023

@ben-hendricks shared on a call today that the BC Hydrophone Network uses a custom driver to generate timestamps from their icListen hydrophones in this format:

ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac

Where 1281 is the instrument ID (serial number?) and the .000 suffix is precision in seconds.

The archived format for processed calibrated noise level files assumes the user knows the timestamp is in UTC time zone, so ends up as (or close to?):

1281_20190704T085500.wav

@scottveirs
Member Author

Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).

Related to this @ben-hendricks also made a good point that -- if possible -- it's ideal to have different nodes start their recordings on the minute (or they use a 5-minute interval) so that file names and time intervals end up being consistent across the network. This allows a direct request for a matching file, rather than a search through ~20k files for the desired matching time period from another location (e.g. for localization).
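
The "start on the minute" idea can be sketched as flooring the clock to a fixed interval, so that every node derives the same file start time and a matching file can be requested by name. The filename layout and node names below are hypothetical; only the 5-minute interval comes from the discussion.

```python
from datetime import datetime, timezone


def interval_start(moment: datetime, interval_s: int = 300) -> datetime:
    """Floor a timezone-aware datetime to the enclosing interval boundary."""
    epoch = int(moment.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_s, tz=timezone.utc)


def expected_filename(node: str, moment: datetime, interval_s: int = 300) -> str:
    """Name of the file on `node` that should contain `moment`."""
    start = interval_start(moment, interval_s)
    return f"{node}_{start:%Y%m%dT%H%M%S}Z.flac"


event = datetime(2019, 7, 4, 8, 57, 30, tzinfo=timezone.utc)
for node in ("node_a", "node_b"):
    print(expected_filename(node, event))
# node_a_20190704T085500Z.flac
# node_b_20190704T085500Z.flac
```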

@ben-hendricks

As a comment on @scottveirs's suggestion regarding filename convention and time synchronization: A change in filename convention is usually a small step from a coding perspective. Synchronizing recording periods gave our coding team some headaches because we also wanted to be sure that all files have a predictable length (those with a different length were renamed so that a search algorithm could filter them). However, in our experience the benefits outweigh the costs: a) it is a virtual requirement to cross-correlate and localize transient signals; b) any match between a timestamp and a corresponding audio file can be made instantaneously.

@scottveirs
Member Author

Great advice @ben-hendricks . Thanks for sharing insights from the BC Hydrophone Network!

I've created two orcanode issues based on your input:

@scottveirs
Member Author

@ben-hendricks shared on a call today that the BC Hydrophone Network uses a custom driver to generate timestamps from their icListen hydrophones in this format:

ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac

Where 1281 is the instrument ID (serial number?) and the .000 suffix is precision in seconds.

The archived format for processed calibrated noise level files assumes the user knows the timestamp is in UTC time zone, so ends up as (or close to?):

1281_20190704T085500.wav

These details ^^^ from Ben may be of interest @valentina-s @savageGrant @CaseCal @mitchhaldeman

@scottveirs
Member Author

@ben-hendricks Can you confirm/deny that the .000 part of the ICLISTEN file name is precision in seconds (rather than an indication of zero hours offset from UTC (Z) time)?

@CaseCal

CaseCal commented Feb 4, 2023

Thanks @scottveirs and @ben-hendricks, this is helpful and timely as we're just developing our file naming and access tool.

I notice in that example that the .flac file contains a start and end time, while the wav file has just a start time. Is there any standard or preference to including only start time, start time and end time, or start time and duration? Especially as we gear towards efficient storage in our own project, we may not have conveniently sized archive file durations.

My thought is that having start time and end time makes it easiest to scan files for a specific timestamp or period, but it also starts to become somewhat verbose.
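
As an illustration of scanning by start and end time, the sketch below parses the start/end form of the filename shown earlier and checks whether a given moment falls inside it. The regex is inferred from that single example, the stamps are taken to be UTC because of the `Z` suffix, and treating the end time as exclusive is an assumption.

```python
import re
from datetime import datetime, timezone

STAMP = r"\d{8}T\d{6}\.\d{3}Z"
PATTERN = re.compile(
    rf"^(?P<instrument>[^_]+)_(?P<start>{STAMP})_(?P<end>{STAMP})\.flac$"
)


def parse_stamp(text: str) -> datetime:
    """Parse a stamp like 20190704T085500.000Z as UTC."""
    parsed = datetime.strptime(text, "%Y%m%dT%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_span(filename: str):
    """Return (instrument, start, end) from a start/end filename."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    return match["instrument"], parse_stamp(match["start"]), parse_stamp(match["end"])


def covers(filename: str, moment: datetime) -> bool:
    """True if start <= moment < end (end treated as exclusive)."""
    _, start, end = parse_span(filename)
    return start <= moment < end


name = "ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac"
print(covers(name, datetime(2019, 7, 4, 8, 57, tzinfo=timezone.utc)))  # prints True
```

With only a start time in the name, the same check needs either a known fixed duration or a look at the next file's start time.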

@ben-hendricks

ben-hendricks commented Feb 4, 2023 via email

@scottveirs
Member Author

@ben-hendricks Can you confirm/deny that the .000 part of the ICLISTEN file name is precision in seconds (rather than an indication of zero hours offset from UTC (Z) time)?

Thanks to facilitation by @ben-hendricks , Tom Dakin confirms via email:

Yes the .000 are milliseconds.

@scottveirs
Member Author

scottveirs commented Apr 25, 2023

Noting that MANTA (Matlab-based noise analysis software) says this about datetime formats:

The preferred time/date format in the filename is yyyymmdd_HHMMSS (HHMMSS.FFF is also acceptable).

The date/time information can be located at any position within the filename. To aid users in renaming their acoustic data files to be compatible with MANTA software, a file renaming tool (Sox-o-matic) is available from The Cornell Lab of Ornithology Center for Conservation Bioacoustics:

Sox-o-matic Wiki: https://bitbucket.org/CLO-BRP/sox-o-matic/wiki/Home

Sox-o-matic Software download: https://www.birds.cornell.edu/ccb/sox-o-matic/
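
A quick way to check whether a candidate filename carries a timestamp in the layout quoted above (yyyymmdd_HHMMSS, optionally with .FFF) is sketched below. This is not MANTA's own parser; the regex and the `find_manta_stamp` helper are assumptions based only on the quoted description, and the result is timezone-naive.

```python
import re
from datetime import datetime

# yyyymmdd_HHMMSS with an optional .FFF, anywhere in the name.
MANTA_STAMP = re.compile(r"(?<!\d)(\d{8}_\d{6})(\.\d{3})?(?!\d)")


def find_manta_stamp(filename: str):
    """Return the embedded datetime, or None if no such stamp is present."""
    match = MANTA_STAMP.search(filename)
    if match is None:
        return None
    stamp, fraction = match.group(1), match.group(2)
    if fraction:
        return datetime.strptime(stamp + fraction, "%Y%m%d_%H%M%S.%f")
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


print(find_manta_stamp("node_20190704_092314.250Z.flac"))
print(find_manta_stamp("2020-12-09_23-22-16_rpi_orcasound_lab--2.flac"))  # prints None
```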

@scottveirs
Member Author

Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).

See Steve's thoughts in this other orcanode issue for more info about achieving high precision with ffmpeg...

@scottveirs
Member Author

Comparing readability of these two options, for fun:

20190704T085500.000Z (BCHN format)
20190704_092314.000Z (Proposed Orcasound format)

And noting that OOI added a lot of precision beyond MBARI, but neither added a Z or +00...

2017-06-13T16:00:00 (MBARI format, relying on convention of scientific timestamps defaulting to UTC time zone)
2021-08-04T00:20:00.000015 (OOI)
