implement boilerplate SubtractFromID
idea to allow the same IDs in different scopes
#84
Comments
That seems complicated to me and requires downstream scripts to interpret. SubtractFromID is unintuitive for someone just looking at the database design. I don't have a better solution, so for now my comment is just complaining!
Tim, that is valid feedback. I do think it is safe though, because the default is to only have one site (one scope) and not make use of this special feature. The default for SubtractFromID is zero. If client software is unaware of the special feature, it will merrily generate EML for all the DataSetIDs. The only way a million-plus ID can get into mb is if the mb user consciously inserts it.
Tim, if you are interested in marinating on a better solution: we came up with roughly three other classes of solutions to this problem, none of which is practical IMO: …
Clarification: when referring to the integer DataSetID, please call it DataSetID. When referring to the string form like knb-lter-xyz.789, call it dataset_id (or dataset_archive_id), to avoid confusion. Before any major design change like this, look at the potential for redundant or conflicting rows. If DataSetID is not the i in scope.i.rev, then it is possible to enter the same package twice. Only a human could prevent that. The idea of a database is to make it impossible to enter bad content. Tools external to the database, such as ds_time_series_update.pl or writeEML.pl or DSload.pl, no matter how loosely coupled, may be too reliant on something as fundamental as DataSetID being an integer and the unambiguous key to which dataset is which.
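To make that risk concrete, here is a small hypothetical illustration (the table and column names are assumptions, not the actual metabase schema): once DataSetID is decoupled from the i in scope.i.rev and nothing in the schema ties a row to its archive identity, the database has no way to reject a second row describing the same package.

```sql
-- Hypothetical illustration only; not real metabase DDL or data.
-- Both rows are meant to become knb-lter-xyz.789, but with DataSetID
-- decoupled from the i in scope.i.rev, no constraint can catch that:
INSERT INTO "DataSet" ("DataSetID", "Title") VALUES (12, 'Kelp forest biomass');
INSERT INTO "DataSet" ("DataSetID", "Title") VALUES (47, 'Kelp forest biomass');
-- Only a human reviewing rows 12 and 47 could notice the duplicate.
```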
What about this idea, which I shall name Daisy:
So DataSetID would be the unique ID internal to metabase. The Dataset table would have two new attributes, Scope and PackageID, whose combination must be unique across metabase. The Scope attribute would be dropped from the Boilerplate table. What about Daisy is worse than SubtractFromID, other than the tremendous effort involved in the redesign? One of the appealing aspects of Li's mini-metabase to me was its relative simplicity. If there's a database solution that is simple, then I prefer that over one requiring downstream scripts. What I don't know is what impact Daisy would have on the rest of the database wrt views, constraints, relationships, etc. I think the views are probably already beyond my grasp, but if Daisy requires some wacky view, then my support of Daisy would wane. Maybe my question is: if you had a magic button that would implement all schema changes for you, which would you prefer, SubtractFromID or Daisy?
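For concreteness, here is a minimal sketch of what Daisy's schema change might look like, assuming a PostgreSQL metabase and hypothetical table and column names; this is the shape of the idea, not an actual redesign:

```sql
-- Daisy sketch (hypothetical names, not real lter-core-metabase DDL).
-- DataSetID stays the internal surrogate key; Scope + PackageID carry
-- the external identity and must be unique as a pair.
ALTER TABLE "DataSet" ADD COLUMN "Scope" text NOT NULL DEFAULT 'knb-lter-xyz';
ALTER TABLE "DataSet" ADD COLUMN "PackageID" integer;
ALTER TABLE "DataSet"
  ADD CONSTRAINT uq_scope_packageid UNIQUE ("Scope", "PackageID");
-- and Scope would move out of the boilerplate table:
-- ALTER TABLE boilerplate DROP COLUMN "Scope";
```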
The problem is not the DDL to perform schema changes. The ALTER SQL to change the schema of lter-core-mb before migration is nowhere near the scale of work of the actual migrations. And those schema changes may make the difference between an existing site being able to use lter-core-mb or not. If we opt for …
We should not choose until we thoroughly examine the implications. There is another option, equally ungainly, of combining two columns into the main key: (DataSetID, Scope). I do not like this option because it means all queries would have an additional JOIN, and the key identifier would be removed from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys. That is one way of implementing Daisy. (A rough sketch of this composite-key option appears after this comment.)

There is also the BON mini-version of mini_metabase, which is an unconstrained Daisy, maintained by human convention. (Duplicates are avoided by a human just not entering them.) Of the 53 rows in pkg_state, only 17 are in DataSet. DataSetID (as used for BON) has no relation to the i in scope.i.rev except via pkg_state. There are 5 scopes: edi, knb-lter-sbc, knb-lter-cce, pisco, and x, "x" meaning scope-not-known-yet. There are also an empty string and a single space (unconstrained). If we opt for a Daisy design, we can learn some potential pitfalls from the BON mini.

SubtractFromID gives the human editor a clue: they can see the i (the i in scope.i.rev) for each row. There would need to be a JOIN in the VIEWs to look up the scope and the N million to subtract. And the calling code might have to know what scope it is requesting. ... actually I'm not clear how that would work yet! (And the same goes for Daisy; the calling code would either have to be aware of the inner workings of mb, i.e. tightly coupled, or know what scope it is requesting and have the scope be part of the key in the view output.) None of these options appeals.

If forced to choose from these four options:
At this point I would, tentatively, choose …

My choice is tentative. More important, I would not make the choice now, not until I look carefully at all sites seriously considering migration (i.e. NWT, HBR). I do not want another surprise like the BON site. And if core-mb does go a new direction so far away from the metabase I've used for 6 years, I would re-weigh the benefit:cost of migrating MCR. Also, I would not commit to that choice until we consider some other potential major design revisions (or, technically, reversions, as these are un-dos). We may have to revert to pkg_state being parent to DataSet. I'll explain in a separate issue.
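Since the composite-key option is easy to misread, here is the rough sketch promised above, again with assumed names; note how every child table and query picks up the second column:

```sql
-- Composite-key sketch (hypothetical names; assumes the old
-- single-column key has been dropped first). The main key becomes a
-- pair, so every child table must carry Scope and JOIN on both columns:
ALTER TABLE "DataSet"
  ADD CONSTRAINT pk_dataset PRIMARY KEY ("DataSetID", "Scope");

SELECT t."Title"
FROM "DataSet" d
JOIN "DataSetTemporal" t          -- stand-in for any DataSetStuff table
  ON t."DataSetID" = d."DataSetID"
 AND t."Scope"     = d."Scope";   -- the extra JOIN condition, everywhere
```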
@gastil I'm confused about this paragraph:

> There is another option, equally ungainly, of combining two columns into the main key: (DataSetID, Scope). I do not like this option because it means all queries would have an additional JOIN, and the key identifier would be removed from all the DataSetStuff tables humans edit, so the human would have to perform that JOIN manually each time. Awkward. This would be the most radical divergence from the Metabase design. It would also require a slew of new UNIQUE keys to prevent the same dataset from using two keys.

Daisy means having a DatasetID unique within metabase, and adding scope and packageID to the Dataset table. So I think you'd just need an additional JOIN when you need packageID, which is just one place in EML, right? Also, regarding the last sentence: a given row in the Dataset table would only have one DatasetID, which would represent one dataset, so how could that dataset use two DatasetIDs? Or did you mean something else by "keys"? Isn't DatasetID the unique key, and for existing metabase installations, wouldn't it already be unique? Since tone doesn't come across well in written text, let me say that I am not strongly advocating for Daisy. I'm just trying to understand the options. :)
Hi @tim,

Before I describe the purpose of that JOIN, I have to admit we greatly overestimated the need for multiple scopes. Revisions of ecocomdp maintained by a site can (and perhaps "should") become a new packageId with the site's scope. I did not know that before. The BON is an edge case, and perhaps can be handled in an "off label" use of MB, with fake DataSetIDs in pkg_state. That quirky situation need not drive all other installations of mb.

The JOIN I mentioned would be needed to re-index from one ID to the other, to reconnect the 5 in scope.5.rev to the different 1234 that Daisy would put in DataSetID. Otherwise any external code would have to look up the 1234 for 5. And humans would have to edit row 1234 to edit dataset 5. See how that adds an extra step to all actions and queries?
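A hedged sketch of that lookup, with assumed names: under Daisy, anything holding the 5 from scope.5.rev must first re-index to the internal DataSetID.

```sql
-- Hypothetical Daisy lookup (names assumed): re-index from the archive's
-- i (the 5 in scope.5.rev) to the internal row id before any other
-- query or edit can proceed.
SELECT d."DataSetID"                 -- e.g. returns 1234
FROM "DataSet" d
WHERE d."Scope" = 'knb-lter-sbc'     -- which scope is being requested
  AND d."PackageID" = 5;             -- the 5 in scope.5.rev
```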
But all that is moot if we simply do not overlap (real) DataSetIDs. Note that in the BON case, as long as pkg_state is parent of DataSet, there is no conflict of scope. We have to revert to that original design for other reasons anyway. BON only uses scope edi in DataSet.

Thanks for reviewing. We all need to participate to have a MB that works for all of us!

Gastil
I see. Rather than exporting out of MB, if you were trying to get into it and query for a dataset's details given a packageID, then you'd need a join to get the DatasetID. You could add a join to everything, or you could gather info in a two-step process. Step 1: do one join and get the DatasetID. Step 2: use the DatasetID for the rest of your queries. However, based on comments, it seems that it is a rare edge case when multiple scopes with identical packageIDs are needed, so the SubtractFromID solution keeps the design simpler and works fine for most users.
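A minimal sketch of that two-step pattern under Daisy, with assumed table and column names:

```sql
-- Step 1: one lookup from (scope, packageID) to the internal DataSetID.
SELECT "DataSetID"
FROM "DataSet"
WHERE "Scope" = 'edi' AND "PackageID" = 21;   -- say this returns 1000021

-- Step 2: use that DataSetID directly in all remaining queries.
SELECT *
FROM "DataSetPersonnel"                        -- hypothetical detail table
WHERE "DataSetID" = 1000021;
```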
Correct, Tim. Good summary.
One detail:
Substitute “As well as when” for “Rather than”.
The current metabase does not support the same dataset IDs for different scopes: for example, if one IM manages edi and knb-lter-ble datasets, there cannot be edi.21 and knb-lter-ble.21 datasets in the same installation of metabase.

Proposed solution: a numeric column in the boilerplate table, called SubtractFromID. The default setting is 0. If it's not 0, downstream MetaEgress will subtract the specified number from the metabase DataSetID to get the EML packageID.

For example: an IM manages edi and knb-lter-ble datasets. To avoid ID collision, they create two sets of boilerplate items via the boilerplate table.

- One set with the knb-lter-ble scope and 0 in SubtractFromID. Therefore the metabase DatasetID 1 will translate to the EML packageID knb-lter-ble.1.
- One set with the edi scope and 1,000,000 in SubtractFromID. Therefore the metabase DatasetID 1,000,001 will translate to the EML packageID edi.1.
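To make the arithmetic concrete, here is a hedged sketch of how a view or downstream query might derive the packageID; the linkage between DataSet and boilerplate, and all names, are assumptions rather than the actual MetaEgress implementation:

```sql
-- Hypothetical sketch of the SubtractFromID arithmetic (names assumed;
-- the real DataSet-to-boilerplate linkage may differ).
SELECT d."DataSetID",
       b."Scope" || '.' ||
         (d."DataSetID" - b."SubtractFromID")::text AS dataset_id
FROM "DataSet" d
JOIN boilerplate b ON b."BoilerplateID" = d."BoilerplateID";
-- DataSetID 1       with Scope knb-lter-ble, SubtractFromID 0       -> knb-lter-ble.1
-- DataSetID 1000001 with Scope edi,          SubtractFromID 1000000 -> edi.1
```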