New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed API for new Client & Indexer submodules based on IRIS time series index #2206

Open
wants to merge 28 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@chad-iris
Contributor

chad-iris commented Sep 14, 2018

Howdy. We will very soon begin development on a new Client (and related Indexer) submodules that will allow ObsPy users to transparently read data from a repository of miniSEED that has been indexed with IRIS' mseedindex program. At IRIS we are building a data access client that will download data and construct such an index and these modules will allow ObsPy users to efficiently access their local repository of data.

These modules are in the same spirit as obspy.db and obspy.clients.filesystem.sds, in that they abstract the access to a local repository of data.

Thoughts and feedback welcome.

We do not have strong feelings on the suggested module locations, the suggestion in the proposal is just a guess at best fit.

Are there any guidelines/requirements for connections to databases, i.e. use SQLAlchemy? By far, the most common use case is with SQLite, but in theory the same system could be used with PostgreSQL. We will probably write directly for SQLite initially (as most of the code already exists in another project) but wanted to think ahead.

Proposed APIs:

obspy.clients.filesystem.tsindex.Client

class Client(sqlitedb, datapath_replace=None)

Bases: builtins.object

Description

Time series extraction client for a database created by the IRIS mseedindex program. If datapath_replace is specified it must be a 2-value tuple, where any occurance of the first value will be replaced with the second value in filename paths from the index.

Public Methods

Method Return type Description
get_waveforms(network, station, location, channel, starttime, endtime) obspy.core.stream.Stream Query tsindex database and read miniSEED data from local indexed directory tree.
get_nslc(network, station, location, channel, starttime, endtime) list(tuple) Return a list of tuples [(net, sta, loc, cha),...] containing information on what streams are included in the tsindex database.
get_availability_extent(network, station, location, channel, starttime, endtime) list(tuple) Return a list of tuples [(net, sta, loc, cha, earliest, latest)] containing data extent info for time series included in the tsindex database.
get_availability(network, station, location, channel, starttime, endtime, include_sample_rate=False, merge_overlap=False) list(tuple) Return a list of tuples [(net, sta, loc, cha, start, end),...] containing data availability info for time series included in the tsindex database.

If include_sample_rate=True, then a tuple containing the sample rate [(net, sta, loc, cha, start, end, sample_rate),...] is returned.

If merge_overlap=True, then all time spans that overlap are merged.
get_availability_percentage(network, station, location, channel, starttime, endtime) 2-tuple of percentage of available data (0.0 to 1.0) and number of gaps/overlaps. Get percentage of available data for a specified network, station, location, channel, starttime, endtime combination.
has_data(network, station, location, channel, starttime, endtime) boolean Check if specified stream has any data using the tsindex database.

obspy.clients.filesystem.tsindex.Indexer

class Indexer(root_path, sqlitedb, index_cmd=’mseedindex’, filename_pattern=’*’, parallel=5)

Bases: builtins.object

Description

Build an index for miniSEED data using IRIS’s mseedindex program. Recursively search for files matching filename_pattern starting from root_path and run index_cmd for each target file found that is not already in the index. After all new files are indexed a summary table is generated with the extents of each time series.

Public Methods

Method Return type Description
run(build_summary=True, relative_paths=False, reindex=False) boolean Execute the file discovery and indexing.

By default, a summary table is (re)generated containing the extents for each time series in the index. This can be turned off by setting build_summary to False.

By default, the absolute path to each file is stored in the index. If relative_paths is True, the file paths will be relative to the root_path.

By default, files are not indexed that are already in the index and have not been modified. The reindex option can be set to True to force a re-indexing of all files regardless.
@jkmacc-LANL

This comment has been minimized.

Show comment
Hide comment
@jkmacc-LANL

jkmacc-LANL Aug 20, 2018

Contributor

Hi, @chad-iris. I love the idea that users can access their own data with an FDSN-compliant client! Given that it has been proposed that the other database-access client (obspy.db.client.Client) be deprecated and moved to its own package (I don't think it's actually happening, but that may be mostly due to logistics), would it also make sense for a tsindex.Client to also be its own package as well? A number of base classes were created so that any packages wanting to implement the client API could be part of the isinstance(client, obspy.clients.base.WaveformClient) family.

Contributor

jkmacc-LANL commented Aug 20, 2018

Hi, @chad-iris. I love the idea that users can access their own data with an FDSN-compliant client! Given that it has been proposed that the other database-access client (obspy.db.client.Client) be deprecated and moved to its own package (I don't think it's actually happening, but that may be mostly due to logistics), would it also make sense for a tsindex.Client to also be its own package as well? A number of base classes were created so that any packages wanting to implement the client API could be part of the isinstance(client, obspy.clients.base.WaveformClient) family.

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Aug 21, 2018

Member

@jkmacc-LANL if you're looking for something like this (indexed waveform database, FDSN interface, ..) you might also want to have a look https://github.com/krischer/jane/. It has a bit of a different scope than this project, but it is similar in the aspect of indexing local files and exposing a server to access it (in that case an FDSN compliant webserver).

Member

megies commented Aug 21, 2018

@jkmacc-LANL if you're looking for something like this (indexed waveform database, FDSN interface, ..) you might also want to have a look https://github.com/krischer/jane/. It has a bit of a different scope than this project, but it is similar in the aspect of indexing local files and exposing a server to access it (in that case an FDSN compliant webserver).

@megies megies added this to the Future release milestone Aug 21, 2018

@megies megies added the .clients label Aug 21, 2018

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Aug 21, 2018

Member

These modules are in the same spirit as obspy.db and obspy.clients.filesystem.sds, in that they abstract the access to a local repository of data.

Thoughts and feedback welcome.

Sounds good, can not hurt to have a client for this in obspy directly. Setting up proper unit tests might be the most work intensive part, I guess..

We do not have strong feelings on the suggested module locations, the suggestion in the proposal is just a guess at best fit.

I first thought obspy.db might be the better place for this, but on a second thought, I think it indeed fits better in obspy.clients.filesystem (at least unless any remote connection functionality is planned), as it will have a similar API as most other clients and it relies on locally stored files.

Proposed APIs

Looks good to me, looks like methods with common names already present on other clients have compatible API (args, return types mostly; specialized kwargs can differ of course), so that we avoid discrepancies between the different clients whenever possible. 👍

obspy.clients.filesystem.tsindex.Indexer

This could also be exposed as a CL script for convenience. I'd like to hear other opinions though, as we kinda have come to the feeling that we are already exposing a lot of CL scripts and actually were considering to switch to a more modular CLI approach at some point (e.g. $ obspy plot ... instead of obspy-plot ...)

In general this all looks well planned to me. 👍


On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex, as the client you want to add here will be completely relying on a well defined DB schema, I guess.. also the naming proposed above (tsindex) should be defined at some place, I think (maybe a minimal stub repository/project site with the DB table layout).

Member

megies commented Aug 21, 2018

These modules are in the same spirit as obspy.db and obspy.clients.filesystem.sds, in that they abstract the access to a local repository of data.

Thoughts and feedback welcome.

Sounds good, can not hurt to have a client for this in obspy directly. Setting up proper unit tests might be the most work intensive part, I guess..

We do not have strong feelings on the suggested module locations, the suggestion in the proposal is just a guess at best fit.

I first thought obspy.db might be the better place for this, but on a second thought, I think it indeed fits better in obspy.clients.filesystem (at least unless any remote connection functionality is planned), as it will have a similar API as most other clients and it relies on locally stored files.

Proposed APIs

Looks good to me, looks like methods with common names already present on other clients have compatible API (args, return types mostly; specialized kwargs can differ of course), so that we avoid discrepancies between the different clients whenever possible. 👍

obspy.clients.filesystem.tsindex.Indexer

This could also be exposed as a CL script for convenience. I'd like to hear other opinions though, as we kinda have come to the feeling that we are already exposing a lot of CL scripts and actually were considering to switch to a more modular CLI approach at some point (e.g. $ obspy plot ... instead of obspy-plot ...)

In general this all looks well planned to me. 👍


On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex, as the client you want to add here will be completely relying on a well defined DB schema, I guess.. also the naming proposed above (tsindex) should be defined at some place, I think (maybe a minimal stub repository/project site with the DB table layout).

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Aug 21, 2018

Member

Oh.. forgot one thing:

Are there any guidelines/requirements for connections to databases, i.e. use SQLAlchemy? By far, the most common use case is with SQLite, but in theory the same system could be used with PostgreSQL. We will probably write directly for SQLite initially (as most of the code already exists in another project) but wanted to think ahead.

Yes, ideally you should solely rely on sqlalchemy as it is already a hard dependency which would mean that our dependencies would not change. No DB expert, but I think it should work with any backends you might plan on (SQLite, MySQL, postgres) and it seems like the major SQL package for Python anyway?

Member

megies commented Aug 21, 2018

Oh.. forgot one thing:

Are there any guidelines/requirements for connections to databases, i.e. use SQLAlchemy? By far, the most common use case is with SQLite, but in theory the same system could be used with PostgreSQL. We will probably write directly for SQLite initially (as most of the code already exists in another project) but wanted to think ahead.

Yes, ideally you should solely rely on sqlalchemy as it is already a hard dependency which would mean that our dependencies would not change. No DB expert, but I think it should work with any backends you might plan on (SQLite, MySQL, postgres) and it seems like the major SQL package for Python anyway?

@chad-iris

This comment has been minimized.

Show comment
Hide comment
@chad-iris

chad-iris Aug 21, 2018

Contributor

Thanks for the feedback @jkmacc-LANL @megies

I think it indeed fits better in obspy.clients.filesystem (at least unless any remote connection functionality is planned)

No such remote functionality is planned, but that aspect did give me pause since there is no reason the index could not be used for remote things, e.g. data in an object store like S3. But let's cross that road if it ever happens.

This could also be exposed as a CL script for convenience

👍I have no strong feelings about which flavor of CL invocation. The "modular" CLI approach certainly helps with clutter, which is nice since most users are likely not using most of the scripts if any at all.

On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex

Examples are a good idea. The database schema documentation is a few clicks from the GitHub project landing page, I can add a more direct link. I'll also add a link to the mseedindex man page. Thanks.

also the naming proposed above (tsindex) should be defined at some place, I think (maybe a minimal stub repository/project site with the DB table layout).

Yup. It's the default for mseedindex and it's threaded through all it's documentation and the documentation/usage for portable-fdsnws-dataselect, and while I recommend not changing it unless you really must it could be different and should be settable.

Contributor

chad-iris commented Aug 21, 2018

Thanks for the feedback @jkmacc-LANL @megies

I think it indeed fits better in obspy.clients.filesystem (at least unless any remote connection functionality is planned)

No such remote functionality is planned, but that aspect did give me pause since there is no reason the index could not be used for remote things, e.g. data in an object store like S3. But let's cross that road if it ever happens.

This could also be exposed as a CL script for convenience

👍I have no strong feelings about which flavor of CL invocation. The "modular" CLI approach certainly helps with clutter, which is nice since most users are likely not using most of the scripts if any at all.

On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex

Examples are a good idea. The database schema documentation is a few clicks from the GitHub project landing page, I can add a more direct link. I'll also add a link to the mseedindex man page. Thanks.

also the naming proposed above (tsindex) should be defined at some place, I think (maybe a minimal stub repository/project site with the DB table layout).

Yup. It's the default for mseedindex and it's threaded through all it's documentation and the documentation/usage for portable-fdsnws-dataselect, and while I recommend not changing it unless you really must it could be different and should be settable.

@dsentinel

This comment has been minimized.

Show comment
Hide comment
@dsentinel

dsentinel Aug 21, 2018

Contributor

On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex,

A lot of this is implemented and laid out in the mseedindex repo: https://github.com/iris-edu/mseedindex/wiki/Database-Schema

@chad-iris, is this schema set in stone? or are suggestions encouraged?

use SQLAlchemy?

👍 Even if only using the core of SQLAlchemy, and not the ORM. This will make the business logic independent of of the RDBMS used.
Especially if sqlite and other full SQL dbs are options. SQLite is not a full SQL and requires some special attention. SQLAlchemy helps manage the discrepancies in a clean way.

Contributor

dsentinel commented Aug 21, 2018

On a side note, I think it would be good to have actual example tables (screenshots etc) created with mseedindex in its docs section, or some schema of table layout used by mseedindex,

A lot of this is implemented and laid out in the mseedindex repo: https://github.com/iris-edu/mseedindex/wiki/Database-Schema

@chad-iris, is this schema set in stone? or are suggestions encouraged?

use SQLAlchemy?

👍 Even if only using the core of SQLAlchemy, and not the ORM. This will make the business logic independent of of the RDBMS used.
Especially if sqlite and other full SQL dbs are options. SQLite is not a full SQL and requires some special attention. SQLAlchemy helps manage the discrepancies in a clean way.

@Jollyfant

This comment has been minimized.

Show comment
Hide comment
@Jollyfant

Jollyfant Aug 21, 2018

Contributor

Cool! This can be really useful. I'm wondering has anyone ever tried indexing individual mSEED records before? Just the header metadata and a 512 byte blob.

One question about the timeindex field. When you're using a hash map you would need the exact timestamp (key) to get the trace byte-offset. So can you round to the nearest key within the database somehow, or will the Client somehow handle this?

Contributor

Jollyfant commented Aug 21, 2018

Cool! This can be really useful. I'm wondering has anyone ever tried indexing individual mSEED records before? Just the header metadata and a 512 byte blob.

One question about the timeindex field. When you're using a hash map you would need the exact timestamp (key) to get the trace byte-offset. So can you round to the nearest key within the database somehow, or will the Client somehow handle this?

@chad-iris

This comment has been minimized.

Show comment
Hide comment
@chad-iris

chad-iris Aug 21, 2018

Contributor

@dsentinel

A lot of this is implemented and laid out in the mseedindex repo: https://github.com/iris-edu/mseedindex/wiki/Database-Schema
@chad-iris, is this schema set in stone? or are suggestions encouraged?

It's pretty fixed at this point. It is the production schema used for the IRIS DMC's archive (known functional design and benefits from being sure that the indexer will be maintained) and we are building tools around the SQLite version of it, so stability is highly desired.

It was designed to be flexible enough to support different formats so hopefully it's got some headroom. Suggestions, in the mseedindex issues, are welcome but I'm not sure when the next version of a schema will be created.

Contributor

chad-iris commented Aug 21, 2018

@dsentinel

A lot of this is implemented and laid out in the mseedindex repo: https://github.com/iris-edu/mseedindex/wiki/Database-Schema
@chad-iris, is this schema set in stone? or are suggestions encouraged?

It's pretty fixed at this point. It is the production schema used for the IRIS DMC's archive (known functional design and benefits from being sure that the indexer will be maintained) and we are building tools around the SQLite version of it, so stability is highly desired.

It was designed to be flexible enough to support different formats so hopefully it's got some headroom. Suggestions, in the mseedindex issues, are welcome but I'm not sure when the next version of a schema will be created.

@chad-iris

This comment has been minimized.

Show comment
Hide comment
@chad-iris

chad-iris Aug 21, 2018

Contributor

@Jollyfant

I'm wondering has anyone ever tried indexing individual mSEED records before? Just the header metadata and a 512 byte blob.

It is not something I would recommend for a data set of any non-trivial size as it's simply too much overhead (and redundancy) to be worth it in most cases. Any row-per-record scheme will not scale very far.

One question about the timeindex field. When you're using a hash map you would need the exact timestamp (key) to get the trace byte-offset. So can you round to the nearest key within the database somehow, or will the Client somehow handle this?

The timeindex field will be used transparently by the Client to narrow the read range of the file for efficiency. In general, ObsPy users should not need to know any details of the index schema beyond the name of the SQLite file that contains it. More details on the timeindex field if you're interested.

Contributor

chad-iris commented Aug 21, 2018

@Jollyfant

I'm wondering has anyone ever tried indexing individual mSEED records before? Just the header metadata and a 512 byte blob.

It is not something I would recommend for a data set of any non-trivial size as it's simply too much overhead (and redundancy) to be worth it in most cases. Any row-per-record scheme will not scale very far.

One question about the timeindex field. When you're using a hash map you would need the exact timestamp (key) to get the trace byte-offset. So can you round to the nearest key within the database somehow, or will the Client somehow handle this?

The timeindex field will be used transparently by the Client to narrow the read range of the file for efficiency. In general, ObsPy users should not need to know any details of the index schema beyond the name of the SQLite file that contains it. More details on the timeindex field if you're interested.

@nick-iris

This comment has been minimized.

Show comment
Hide comment
@nick-iris

nick-iris Sep 14, 2018

Contributor

I have this code ready for your review on my nick-iris/obspy:tsindex branch at https://github.com/nick-iris/obspy/tree/tsindex.

Can you please convert this issue into a PR from the nick-iris/obspy:tsindex branch into obspy/obspy:master?

If not I can create a new PR for this issue, but trying to avoid duplicated tickets. Thanks!

Contributor

nick-iris commented Sep 14, 2018

I have this code ready for your review on my nick-iris/obspy:tsindex branch at https://github.com/nick-iris/obspy/tree/tsindex.

Can you please convert this issue into a PR from the nick-iris/obspy:tsindex branch into obspy/obspy:master?

If not I can create a new PR for this issue, but trying to avoid duplicated tickets. Thanks!

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Sep 14, 2018

Member

Can you please convert this issue into a PR

I could, but github has added an annoying glitch in issue->PR conversion, so that the person doing the conversion suddenly is author of OP. So I would hijack Chad's issue description up top. I guess just open a separate PR, otherwise Chad should do the conversion.

Member

megies commented Sep 14, 2018

Can you please convert this issue into a PR

I could, but github has added an annoying glitch in issue->PR conversion, so that the person doing the conversion suddenly is author of OP. So I would hijack Chad's issue description up top. I guess just open a separate PR, otherwise Chad should do the conversion.

@nick-iris

This comment has been minimized.

Show comment
Hide comment
@nick-iris

nick-iris Sep 14, 2018

Contributor

@chad-iris Do you mind giving this a try? I think the command to convert this issue into a PR would be as follows:

curl --user chad-iris --request POST --data '{"issue": "2206", "head": "nick-iris:tsindex", "base": "master"}' https://api.github.com/repos/obspy/obspy/pulls

Contributor

nick-iris commented Sep 14, 2018

@chad-iris Do you mind giving this a try? I think the command to convert this issue into a PR would be as follows:

curl --user chad-iris --request POST --data '{"issue": "2206", "head": "nick-iris:tsindex", "base": "master"}' https://api.github.com/repos/obspy/obspy/pulls

@chad-iris

This comment has been minimized.

Show comment
Hide comment
@chad-iris

chad-iris Sep 14, 2018

Contributor

This is now a PR.

Contributor

chad-iris commented Sep 14, 2018

This is now a PR.

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Sep 17, 2018

Member

@nick-iris test_tsindex.py is missing the appropriate footer (e.g. like here) so tests can not be run in CI right now. https://travis-ci.org/obspy/obspy/jobs/428732071#L1322

Member

megies commented Sep 17, 2018

@nick-iris test_tsindex.py is missing the appropriate footer (e.g. like here) so tests can not be run in CI right now. https://travis-ci.org/obspy/obspy/jobs/428732071#L1322

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Sep 17, 2018

Member

+DOCS for automated API docs build

Member

megies commented Sep 17, 2018

+DOCS for automated API docs build

- Add appropriate test footer.
- Change TSIndexDatabaseHandler sqlitedb instance variable name to database.
- Remove mutable objects from kwargs.
@nick-iris

This comment has been minimized.

Show comment
Hide comment
@nick-iris

nick-iris Sep 17, 2018

Contributor

Thanks for the review @megies. The ~obspy.clients.tests.test_tsindex.TestIndexer.test_run() test fails in TravisCI because it requires mseedindex to be installed:

https://github.com/nick-iris/obspy/blob/tsindex/obspy/clients/filesystem/tests/test_tsindex.py#L745.

Do you have any suggestions for what I should do with this test? Should it just be moved to a file along the lines of ~obspy.clients.filesystem.tests.integration.test_tsindex_integration.py, where it can be run manually?

Contributor

nick-iris commented Sep 17, 2018

Thanks for the review @megies. The ~obspy.clients.tests.test_tsindex.TestIndexer.test_run() test fails in TravisCI because it requires mseedindex to be installed:

https://github.com/nick-iris/obspy/blob/tsindex/obspy/clients/filesystem/tests/test_tsindex.py#L745.

Do you have any suggestions for what I should do with this test? Should it just be moved to a file along the lines of ~obspy.clients.filesystem.tests.integration.test_tsindex_integration.py, where it can be run manually?

- Fix for failing doctest
- Fix for failing  CircleCI flake8 checks.
- Skip TestIndexer.test_run() is mseedindex is not installed.
@nick-iris

This comment has been minimized.

Show comment
Hide comment
@nick-iris

nick-iris Sep 17, 2018

Contributor

I decided to add a check to the ~obspy.clients.tests.test_tsindex.TestIndexer.test_run() method that skips this test if mseedindex is not installed.

Contributor

nick-iris commented Sep 17, 2018

I decided to add a check to the ~obspy.clients.tests.test_tsindex.TestIndexer.test_run() method that skips this test if mseedindex is not installed.

@megies

This comment has been minimized.

Show comment
Hide comment
@megies

megies Sep 18, 2018

Member

Thanks for the review @megies.

No worries, but I did not have time yet to actually give this a full review, was only glancing over the whole thing so far, unfortunately.

because it requires mseedindex to be installed

Oh.. I didn't notice there was a new dependency hidden in here. Hmm. We could handle it as a soft dependency I guess, although general we try to avoid those when possible.

Member

megies commented Sep 18, 2018

Thanks for the review @megies.

No worries, but I did not have time yet to actually give this a full review, was only glancing over the whole thing so far, unfortunately.

because it requires mseedindex to be installed

Oh.. I didn't notice there was a new dependency hidden in here. Hmm. We could handle it as a soft dependency I guess, although general we try to avoid those when possible.

@@ -1174,7 +1175,8 @@ def _is_index_cmd_installed(self):
"""
Checks if the index command (e.g. mseedindex) is installed.
:raises OSError: If the index command is not installed.
:rtype: bool
:returns: Returns True the index_cmd is installed.

This comment has been minimized.

@megies

megies Sep 21, 2018

Member

Backticks for builtins/variable names, please (for proper formatting in docs): -> Returns ``True`` if ``index_cmd`` is installed.

@megies

megies Sep 21, 2018

Member

Backticks for builtins/variable names, please (for proper formatting in docs): -> Returns ``True`` if ``index_cmd`` is installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment