implement boilerplate SubtractFromID idea to allow same IDs on different scopes #84

Open
atn38 opened this issue Oct 10, 2019 · 10 comments

@atn38
Member

atn38 commented Oct 10, 2019

The current metabase does not support using the same dataset ID under different scopes: for example, if one IM manages edi and knb-lter-ble datasets, there cannot be both edi.21 and knb-lter-ble.21 datasets in the same installation of metabase.

Proposed solution: a numeric column in the boilerplate table, called SubtractFromID. The default setting is 0. If it's not 0, downstream MetaEgress will subtract the specified number from the metabase DataSetID to get the EML packageID.

For example: an IM manages edi and knb-lter-ble datasets. To avoid ID collisions, they create two sets of boilerplate items via the boilerplate table (a sketch of the arithmetic follows the list).

  1. The default has knb-lter-ble scope and 0 in SubtractFromID. Therefore the metabase DataSetID 1 will translate to the EML packageID knb-lter-ble.1.
  2. The other has edi scope and 1,000,000 in SubtractFromID. Therefore the metabase DataSetID 1,000,001 will translate to the EML packageID edi.1.
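
A minimal sketch of that arithmetic (the SQL is purely illustrative; only the subtraction itself is part of the proposal):

```sql
-- Illustrative only: the EML package number is DataSetID - SubtractFromID,
-- and the scope string is prepended to form the packageID.
SELECT 1       - 0       AS ble_i,  -- boilerplate set 1 -> knb-lter-ble.1
       1000001 - 1000000 AS edi_i;  -- boilerplate set 2 -> edi.1
```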
@twhiteaker
Contributor

That seems complicated to me, and it requires downstream scripts to interpret the IDs. SubtractFromID is unintuitive for someone just looking at the database design.

I don't have a better solution, so for now my comment is just complaining!

@gastil
Contributor

gastil commented Oct 10, 2019

Tim, that is valid feedback. I do think it is safe, though, because the default is to have only one site (one scope) and not make use of this special feature. The default for SubtractFromID is zero. If client software is unaware of the special feature, it will merrily generate EML for all the DataSetIDs. The only way a million-plus ID can get into metabase is if the user consciously inserts it.

@atn38
Member Author

atn38 commented Oct 10, 2019

Tim, if you are interested in marinating on a better solution: we came up with roughly three other classes of solutions to this problem, none of which is practical in our opinion:

  • use scope.id a la edi.1 to serve as the metabase DataSetID.
  • treat the metabase DataSetID as different from the id in the EML packageID.
  • just use different installations for different projects. For me, the big case where this doesn't make sense is when the projects share many things, or are almost identical save for the scope (e.g., when an IM manages both the ecocomDP and the original versions of a data package).

@gastil
Contributor

gastil commented Oct 10, 2019

Clarification:
Those are just ideas being floated. DataSetID is THE CENTRAL index of the entire metabase. If we do refactor something that fundamental, it will only be after thorough vetting.

When referring to the integer DataSetID, please call it DataSetID. When referring to a string like knb-lter-xyz.789, call it dataset_id (or dataset_archive_id), to avoid confusion.

Before any major design change like this, look at the potential for redundant or conflicting rows. If DataSetID is not the i in scope.i.rev, then it is possible to enter the same package twice, and only a human could prevent that. The point of a database is to make it impossible to enter bad content.

Tools external to the database, such as ds_time_series_update.pl or writeEML.pl or DSload.pl, no matter how loosely coupled, may rely too heavily on something as fundamental as DataSetID being an integer and the unambiguous key to which dataset is which.

@twhiteaker
Contributor

What about this idea, which I shall name Daisy:

treat the metabase DataSetID as different from the id in the EML packageID.

So DataSetID would be the unique ID internal to metabase. The Dataset table would have two new attributes: Scope and PackageID, whose combination must be unique across metabase. The Scope attribute would be dropped from the Boilerplate table.
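
To make the shape concrete, here is a minimal sketch of that Dataset table; the DDL and names are hypothetical, not a worked-out proposal:

```sql
-- Hypothetical sketch of the "Daisy" Dataset table.
CREATE TABLE "DataSet" (
    "DataSetID" integer PRIMARY KEY,  -- internal surrogate key
    "Scope"     text    NOT NULL,     -- e.g. 'edi', 'knb-lter-ble'
    "PackageID" integer NOT NULL,     -- the i in scope.i.rev
    UNIQUE ("Scope", "PackageID")     -- one row per archive package
);
```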

What about Daisy is worse than SubtractFromID, other than the tremendous effort involved in the redesign?

One of the appealing aspects of Li's mini-metabase to me was its relative simplicity. If there's a database solution that is simple, then I prefer that over one requiring downstream scripts. What I don't know is what impact Daisy would have on the rest of the database with respect to views, constraints, relationships, etc. I think the views are probably already beyond my grasp, but if Daisy requires some wacky view, then my support for Daisy would wane.

Maybe my question is, if you had a magic button that would implement all schema changes for you, which would you prefer, SubtractFromID or Daisy?

@gastil
Contributor

gastil commented Oct 11, 2019

The problem is not the DDL to perform schema changes. The ALTER SQL to change the schema of lter-core-mb before migration is nowhere near the scale of work of the actual migrations. And those schema changes may make the difference between an existing site being able to use lter-core-mb or not.

If we opt for SubtractFromID or "Daisy"...

SubtractFromID should not be visible to external code, because the subtraction should happen in the VIEWs. Similarly, Daisy (DataSetID not related to the i in the scope.i.rev packageID) would change each of the VIEWs on the db side, but those VIEWs would still output the same columns to external queries, such as those from MetaEgress. At least, it SHOULD be that way.
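
For instance, a VIEW along these lines could hide the subtraction; the table, column, and join names here are hypothetical, not the actual metabase VIEW definitions:

```sql
-- Hypothetical: external callers such as MetaEgress query the VIEW
-- and never see SubtractFromID itself.
CREATE VIEW vw_eml_packageid AS
SELECT ds."DataSetID",
       bp."Scope" AS scope,
       ds."DataSetID" - bp."SubtractFromID" AS package_i  -- the i in scope.i.rev
FROM "DataSet" ds
JOIN "Boilerplate" bp ON ds."BoilerplateSet" = bp."BoilerplateSet";
```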

We should not choose until we thoroughly examine the implications.

There is another option, equally ungainly: combining 2 columns for the main key, (DataSetID, scope). I do not like this option because it means adding a scope column to every DataSetStuff table, whereas SubtractFromID encodes the scope into the id in one column.
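
A rough sketch of what the 2-column key would mean, with all names hypothetical:

```sql
-- Hypothetical 2-column main key; Scope must then be repeated in
-- every DataSetStuff table and in every foreign key.
CREATE TABLE "DataSet" (
    "DataSetID" integer NOT NULL,
    "Scope"     text    NOT NULL,
    PRIMARY KEY ("DataSetID", "Scope")
);

CREATE TABLE "DataSetStuff" (  -- stands in for each child table
    "DataSetID" integer NOT NULL,
    "Scope"     text    NOT NULL,
    FOREIGN KEY ("DataSetID", "Scope") REFERENCES "DataSet"
);
```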

Daisy would result in all queries having an additional JOIN. It would also remove the key identifier from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys.

That is one way of implementing Daisy. There is also the BON mini-version of mini_metabase, which is an unconstrained Daisy, maintained by human convention. (Duplicates are avoided by a human simply not entering them.) Of the 53 rows in pkg_state, only 17 are in DataSet. DataSetID (as used for BON) has no relation to the i in scope.i.rev except via pkg_state. There are 5 scopes: edi, knb-lter-sbc, knb-lter-cce, pisco, and x, "x" meaning scope-not-known-yet. There are also an empty string and a single space (unconstrained). If we opt for a Daisy design, we can learn some potential pitfalls from the BON mini.

SubtractFromID gives the human editor a clue: they can see the i (the i in scope.i.rev) for each row. There would need to be a JOIN in the VIEWs to look up the scope and the N*million to subtract. And the calling code might have to know what scope it is requesting... actually, I'm not clear how that would work yet! (And the same goes for Daisy; the calling code would either have to be aware of the inner workings of metabase, i.e. tightly coupled, or know what scope it is requesting and have the scope be part of the key in the view output.)

None of these options appeals.

If forced to choose from these four options:

  • SubtractFromID (N*million + the i in scope.i.rev) with look-up for (N,scope)
  • Daisy (a main key not related to the i in scope.i.rev)
  • 2-column main key (DataSetID,scope)
  • Remain a single-scope design (or require non-overlapping ids across multiple scopes)

At this point I would, tentatively, choose SubtractFromID because

  1. rows remain recognisable to human editors
  2. no extra JOINs until the last step
  3. for sites with one scope, no added complications
  4. will not break existing external tools that work directly with tables (DSload, DSexport)
  5. single-scope is limiting for some sites.

My tentative 2nd choice: 2-column (DataSetID, scope)
3rd choice: limited to single-scope
4th choice: Daisy

More importantly, I would not make the choice now, not until I look carefully at all sites seriously considering migration (i.e., NWT, HBR). I do not want another surprise like the BON site. And if core-mb does go in a new direction, so far from the metabase I've used for 6 years, I would re-weigh the benefit:cost of migrating MCR.

Also, I would not commit to that choice until we consider some other potential major design revisions (or, technically, reversions, as these are un-dos). We may have to revert to pkg_state being parent to DataSet. I'll explain in a separate issue.

@gastil gastil self-assigned this Oct 12, 2019
@twhiteaker
Contributor

@gastil I'm confused about this paragraph:

Daisy would result in all queries having an additional JOIN. It would also remove the key identifier from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys.

Daisy means having a DataSetID unique within metabase, and adding scope and packageID to the Dataset table. So I think you'd just need an additional JOIN when you need packageID, which is just one place in EML, right? Also, regarding the last sentence: a given row in the Dataset table would only have one DataSetID, which would represent one dataset, so how could that dataset use two DataSetIDs? Or did you mean something else by "keys"? Isn't DataSetID the unique key? And for existing metabase installations, it would already be unique.

Since tone doesn't come across well in written text, let me say that I am not strongly advocating for Daisy. I'm just trying to understand the options. :)

@gastil
Contributor

gastil commented Oct 12, 2019 via email

@twhiteaker
Contributor

I see. Rather than exporting out of MB, if you were trying to get into it and query for a dataset's details given a packageID, then you'd need a join to get the DataSetID. You could add a join to everything, or you could gather info in a two-step process. Step 1: do one join and get the DataSetID. Step 2: use the DataSetID for the rest of your queries.
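
A sketch of that two-step process, with hypothetical table and column names borrowed from the Daisy discussion above:

```sql
-- Step 1: resolve the packageID to the internal DataSetID
-- (under Daisy, Scope + PackageID are unique together).
SELECT "DataSetID"
FROM "DataSet"
WHERE "Scope" = 'edi' AND "PackageID" = 21;

-- Step 2: reuse that DataSetID for the rest of your queries, as today.
SELECT *
FROM "DataSetStuff"
WHERE "DataSetID" = 42;  -- value returned by step 1
```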

However, based on comments, it seems that needing multiple scopes with identical packageIDs is a rare edge case, so the SubtractFromID solution keeps the design simpler and works fine for most users.

@gastil
Contributor

gastil commented Oct 12, 2019 via email
