implement boilerplate SubtractFromID idea to allow same IDs on different scopes #84

Open
atn38 opened this issue Oct 10, 2019 · 10 comments

@atn38
Member

atn38 commented Oct 10, 2019

The current metabase does not support using the same dataset ID under different scopes: for example, if one IM manages edi and knb-lter-ble datasets, there cannot be both edi.21 and knb-lter-ble.21 datasets in the same installation of metabase.

Proposed solution: a numeric column in the boilerplate table, called SubtractFromID. The default setting is 0. If it's not 0, downstream MetaEgress will subtract the specified number from the metabase DataSetID to get the EML packageID.

For example: an IM manages edi and knb-lter-ble datasets. To avoid ID collisions, they create two sets of boilerplate items via the boilerplate table (a sketch of the arithmetic follows the list).

  1. The default has knb-lter-ble scope and 0 in SubtractFromID. Therefore the metabase DataSetID 1 will translate to the EML packageID knb-lter-ble.1.
  2. The other has edi scope and 1,000,000 in SubtractFromID. Therefore the metabase DataSetID 1,000,001 will translate to the EML packageID edi.1.
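
A minimal sketch of that arithmetic (the SQL is purely illustrative; only the subtraction itself is part of the proposal):

```sql
-- Illustrative only: the EML package number is DataSetID - SubtractFromID,
-- and the scope string is prepended to form the packageID.
SELECT 1       - 0       AS ble_i,  -- boilerplate set 1 -> knb-lter-ble.1
       1000001 - 1000000 AS edi_i;  -- boilerplate set 2 -> edi.1
```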
@twhiteaker
Contributor

That seems complicated to me, and it requires downstream scripts to interpret the IDs. SubtractFromID is unintuitive for someone just looking at the database design.

I don't have a better solution, so for now my comment is just complaining!

@gastil
Contributor

gastil commented Oct 10, 2019

Tim, that is valid feedback. I do think it is safe, though, because the default is to have only one site (one scope) and not make use of this special feature. The default for SubtractFromID is zero. If client software is unaware of the special feature, it will merrily generate EML for all the DataSetIDs. The only way a million-plus ID can get into metabase is if the user consciously inserts it.

@atn38
Member Author

atn38 commented Oct 10, 2019

Tim, if you are interested in marinating on a better solution: we came up with roughly three other classes of solutions to this problem, none of which is practical in our opinion:

  • use scope.id a la edi.1 to serve as the metabase DataSetID.
  • treat the metabase DataSetID as different from the id in the EML packageID.
  • just use different installations for different projects. For me, the big case where this doesn't make sense is when the projects share many things, or are almost identical save for the scope (e.g., when an IM manages both the ecocomDP and the original versions of a data package).

@gastil
Contributor

gastil commented Oct 10, 2019

Clarification:
Those are just ideas being floated. DataSetID is THE CENTRAL index of the entire metabase. If we do refactor something that fundamental, it will only be after thorough vetting.

When referring to the integer DataSetID, please call it DataSetID. When referring to a string like knb-lter-xyz.789, call it dataset_id (or dataset_archive_id), to avoid confusion.

Before any major design change like this, look at the potential for redundant or conflicting rows. If DataSetID is not the i in scope.i.rev, then it is possible to enter the same package twice, and only a human could prevent that. The point of a database is to make it impossible to enter bad content.

Tools external to the database, such as ds_time_series_update.pl or writeEML.pl or DSload.pl, no matter how loosely coupled, may rely too heavily on something as fundamental as DataSetID being an integer and the unambiguous key to which dataset is which.

@twhiteaker
Contributor

What about this idea, which I shall name Daisy:

treat the metabase DataSetID as different from the id in the EML packageID.

So DataSetID would be the unique ID internal to metabase. The Dataset table would have two new attributes: Scope and PackageID, whose combination must be unique across metabase. The Scope attribute would be dropped from the Boilerplate table.
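
To make the shape concrete, here is a minimal sketch of that Dataset table; the DDL and names are hypothetical, not a worked-out proposal:

```sql
-- Hypothetical sketch of the "Daisy" Dataset table.
CREATE TABLE "DataSet" (
    "DataSetID" integer PRIMARY KEY,  -- internal surrogate key
    "Scope"     text    NOT NULL,     -- e.g. 'edi', 'knb-lter-ble'
    "PackageID" integer NOT NULL,     -- the i in scope.i.rev
    UNIQUE ("Scope", "PackageID")     -- one row per archive package
);
```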

What about Daisy is worse than SubtractFromID, other than the tremendous effort involved in the redesign?

One of the appealing aspects of Li's mini-metabase to me was its relative simplicity. If there's a database solution that is simple, then I prefer that over one requiring downstream scripts. What I don't know is what impact Daisy would have on the rest of the database with respect to views, constraints, relationships, etc. I think the views are probably already beyond my grasp, but if Daisy requires some wacky view, then my support for Daisy would wane.

Maybe my question is, if you had a magic button that would implement all schema changes for you, which would you prefer, SubtractFromID or Daisy?

@gastil
Contributor

gastil commented Oct 11, 2019

The problem is not the DDL to perform schema changes. The ALTER SQL to change the schema of lter-core-mb before migration is nowhere near the scale of work of the actual migrations. And those schema changes may make the difference between an existing site being able to use lter-core-mb or not.

If we opt for SubtractFromID or "Daisy"...

SubtractFromID should not be visible to external code, because the subtraction should happen in the VIEWs. Similarly, Daisy (DataSetID not related to the i in the scope.i.rev packageID) would change each of the VIEWs on the db side, but those VIEWs would still output the same columns to external queries, such as those from MetaEgress. At least, it SHOULD be that way.
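
For instance, a VIEW along these lines could hide the subtraction; the table, column, and join names here are hypothetical, not the actual metabase VIEW definitions:

```sql
-- Hypothetical: external callers such as MetaEgress query the VIEW
-- and never see SubtractFromID itself.
CREATE VIEW vw_eml_packageid AS
SELECT ds."DataSetID",
       bp."Scope" AS scope,
       ds."DataSetID" - bp."SubtractFromID" AS package_i  -- the i in scope.i.rev
FROM "DataSet" ds
JOIN "Boilerplate" bp ON ds."BoilerplateSet" = bp."BoilerplateSet";
```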

We should not choose until we thoroughly examine the implications.

There is another option, equally ungainly: combining 2 columns for the main key, (DataSetID, scope). I do not like this option because it means adding a scope column to every DataSetStuff table, whereas SubtractFromID encodes the scope into the id in one column.
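
A rough sketch of what the 2-column key would mean, with all names hypothetical:

```sql
-- Hypothetical 2-column main key; Scope must then be repeated in
-- every DataSetStuff table and in every foreign key.
CREATE TABLE "DataSet" (
    "DataSetID" integer NOT NULL,
    "Scope"     text    NOT NULL,
    PRIMARY KEY ("DataSetID", "Scope")
);

CREATE TABLE "DataSetStuff" (  -- stands in for each child table
    "DataSetID" integer NOT NULL,
    "Scope"     text    NOT NULL,
    FOREIGN KEY ("DataSetID", "Scope") REFERENCES "DataSet"
);
```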

Daisy would result in all queries having an additional JOIN. It would also remove the key identifier from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys.

That is one way of implementing Daisy. There is also the BON mini-version of mini_metabase, which is an unconstrained Daisy, maintained by human convention. (Duplicates are avoided by a human simply not entering them.) Of the 53 rows in pkg_state, only 17 are in DataSet. DataSetID (as used for BON) has no relation to the i in scope.i.rev except via pkg_state. There are 5 scopes: edi, knb-lter-sbc, knb-lter-cce, pisco, and x, "x" meaning scope-not-known-yet. There are also an empty string and a single space (unconstrained). If we opt for a Daisy design, we can learn some potential pitfalls from the BON mini.

SubtractFromID gives the human editor a clue: they can see the i (the i in scope.i.rev) for each row. There would need to be a JOIN in the VIEWs to look up the scope and the N*million to subtract. And the calling code might have to know what scope it is requesting... actually, I'm not clear how that would work yet! (And the same goes for Daisy; the calling code would either have to be aware of the inner workings of metabase, i.e. tightly coupled, or know what scope it is requesting and have the scope be part of the key in the view output.)

None of these options appeals.

If forced to choose from these four options:

  • SubtractFromID (N*million + the i in scope.i.rev) with look-up for (N,scope)
  • Daisy (a main key not related to the i in scope.i.rev)
  • 2-column main key (DataSetID,scope)
  • Remain a single-scope design (or require non-overlapping ids across multiple scopes)

At this point I would, tentatively, choose SubtractFromID because

  1. rows remain recognisable to human editors
  2. no extra JOINs until the last step
  3. for sites with one scope, no added complications
  4. will not break existing external tools that work directly with tables (DSload, DSexport)
  5. single-scope is limiting for some sites.

My tentative 2nd choice: 2-column (DataSetID, scope)
3rd choice: limited to single-scope
4th choice: Daisy

More importantly, I would not make the choice now, not until I look carefully at all sites seriously considering migration (i.e., NWT, HBR). I do not want another surprise like the BON site. And if core-mb does go in a new direction, so far from the metabase I've used for 6 years, I would re-weigh the benefit:cost of migrating MCR.

Also, I would not commit to that choice until we consider some other potential major design revisions (or, technically, reversions, as these are un-dos). We may have to revert to pkg_state being parent to DataSet. I'll explain in a separate issue.

@gastil gastil self-assigned this Oct 12, 2019
@twhiteaker
Contributor

@gastil I'm confused about this paragraph:

Daisy would result in all queries having an additional JOIN. It would also remove the key identifier from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys.

Daisy means having a DataSetID unique within metabase, and adding scope and packageID to the Dataset table. So I think you'd just need an additional JOIN when you need packageID, which is just one place in EML, right? Also, regarding the last sentence: a given row in the Dataset table would only have one DataSetID, which would represent one dataset, so how could that dataset use two DataSetIDs? Or did you mean something else by "keys"? Isn't DataSetID the unique key? And for existing metabase installations, it would already be unique.

Since tone doesn't come across well in written text, let me say that I am not strongly advocating for Daisy. I'm just trying to understand the options. :)

@gastil
Contributor

gastil commented Oct 12, 2019 via email

@twhiteaker
Contributor

I see. Rather than exporting out of MB, if you were trying to get into it and query for a dataset's details given a packageID, then you'd need a join to get the DataSetID. You could add a join to everything, or you could gather info in a two-step process. Step 1: do one join and get the DataSetID. Step 2: use the DataSetID for the rest of your queries.
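
A sketch of that two-step process, with hypothetical table and column names borrowed from the Daisy discussion above:

```sql
-- Step 1: resolve the packageID to the internal DataSetID
-- (under Daisy, Scope + PackageID are unique together).
SELECT "DataSetID"
FROM "DataSet"
WHERE "Scope" = 'edi' AND "PackageID" = 21;

-- Step 2: reuse that DataSetID for the rest of your queries, as today.
SELECT *
FROM "DataSetStuff"
WHERE "DataSetID" = 42;  -- value returned by step 1
```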

However, based on comments, it seems that needing multiple scopes with identical packageIDs is a rare edge case, so the SubtractFromID solution keeps the design simpler and works fine for most users.

@gastil
Contributor

gastil commented Oct 12, 2019 via email
