add package maintenance tracker #41

Closed
atn38 opened this issue Apr 23, 2019 · 16 comments

@atn38
Member

atn38 commented Apr 23, 2019

Per request from @scelmendorf and @gastil

We need to add a way to track what was done in maintenance updates to data packages.

@gastil
Contributor

gastil commented Apr 23, 2019

Wade uses the maintenance element consistently. I will take on the task of checking the W3D to see whether he stores this in Metabase or whether it comes straight from GCE Toolbox. I looked at the ERD and did not find maintenance.

Regardless, I will suggest it use the updateFrequency that is already stored in pkg_mgmt to avoid redundancy.

Maybe we should compose a generic way of wording the changeHistory, something like "metadata only update for revision (N)" or "time series update for revision (N)", with a few canned examples. I have not looked at how other LTER sites do this (aside from Wade).

@twhiteaker
Contributor

Using updateFrequency would jibe with ESIP's recommendations on data citations, as in

Maslanik, J. and J. Stroeve. 1999, updated daily. Near-Real-Time DMSP SSMIS Daily Polar Gridded Sea Ice Concentrations, Version 1. NASA National Snow and Ice Data Center Distributed Active Archive Center. https://doi.org/10.5067/U8C09DWVX9LM. Accessed 2019-02-14.

@atn38
Member Author

atn38 commented Apr 24, 2019

Here's the KNB controlled vocab on what's allowed as a maintenanceUpdateFrequency. @gastil, by "updateFrequency that is already stored in pkg_mgmt" I suppose you mean column pkg_mgmt.pg_alt_id.maint_freq in full_mb. That's the only relevant field I could find. Core_mb currently doesn't have that table, so it has nowhere to store updateFrequency.

changeHistory is a repeatable element within /dataset/maintenance, so we can expose revision history via mb2eml_r views and write to EML doc. As to modeling maintenance history in core_mb, I can draft a table + view, but I'd like to know what you consider an update (i.e. what a row in that table would be). @scelmendorf? In practice, does it correspond to a revision in EDI terms?

@scelmendorf

@atn38 I have struggled a bit to fit our descriptive change histories into the EML changeHistory module, because there is not always an easy value for oldValue, changeScope, etc., which are required. So mostly I have been stuffing everything into the maintenance description. But we have a long history of changes to our datasets, which range from 'appended on 2018 data [initials, date]' to 'appended on 2018 data, removed 2014 data from plots x,y,z due to protocol deviations used in those years' to 'changed temperature units to celsius' to 'updated metadata to correct units and definitions, data remains unchanged'. Perhaps that gives you a flavor of the janitorial work that likely accompanies many long-term datasets. oldValue in those cases is not a scalar or anything that is easy to fill out without diffing the files, which anyone very concerned with exactly what changed should certainly do! Nonetheless, I think the verbal descriptions are helpful in case someone wants to know whether the change means they really ought to redo their analyses. I would possibly lean towards storing just 'date', 'bywhom', and 'description', and then using code to parse that into the /dataset/maintenance/description element?

@gastil
Contributor

gastil commented Apr 25, 2019

Precisely what @scelmendorf wrote. To answer @atn38 's question, I do not add a maintenance note with every revision if it is just routine, such as adding a year's data to a timeseries. But if it were automated, or even semi-automated, I would, as I think it would reassure the data user. On rare occasion, we find and correct errors in the data. I always note those. (And also email everyone who registered a download, which the pasta users miss out on.)

Also thank you @atn38 for looking in pkg_mgmt for updateFrequency. I thought it was there but I was wrong. Let's put it in metabase then, with no redundancy problem.

@atn38
Member Author

atn38 commented Apr 25, 2019

Too bad oldValue is required. If that wasn't the case, multiple changeHistory elements would be better!

Here's a draft CREATE TABLE DataSetMaintenance SQL:

CREATE TABLE lter_metabase."DataSetMaintenance"(
	"DataSetID" integer NOT NULL,
	"Revision" integer NOT NULL,
	"RevisionNotes" character varying(200),
	"RevisionDate" date NOT NULL,
	"NameID" character varying(20) NOT NULL
);

Suppose we have this table populated as below; the empty RevisionNotes cell could be a routine update like you said, @gastil.

| DataSetID | Revision | RevisionNotes | RevisionDate | NameID |
|---|---|---|---|---|
| 1001 | 1 | 2018 data appended | 3/4/2018 | selmendorf |
| 1001 | 2 | removed 2014 data from plots x,y,z due to protocol deviations used in those years | 5/23/2018 | gastil |
| 1001 | 3 | | 7/11/2018 | selmendorf |
| 1001 | 4 | updated metadata to correct units and definitions, data remains unchanged | 11/16/2018 | gastil |

R code (I wrote an MRE that does this) could resolve it to this EML snippet:

<maintenance>
  <description>2018-04-03, revision 1 by Sarah Elmendorf: 2018 data appended
2018-05-23, revision 2 by Gastil Gastil-Buhl: removed 2014 data from plots x y z due to protocol deviations used in those years
2018-07-11, revision 3 by Sarah Elmendorf: minor change, no description provided
2018-11-16, revision 4 by Gastil Gastil-Buhl: updated metadata to correct units and definitions, data remains unchanged</description>
  <maintenanceUpdateFrequency>annually</maintenanceUpdateFrequency>
</maintenance>

Thoughts? Note that update_frequency is coming from somewhere else in core-metabase. I am thinking of putting it in table DataSet.
Any chance of misinterpretation with update_frequency? Putting on my naive-user hat, I could see confusion, since it says annually but there could be a whole bunch of minor updates in a year. At any rate, this is more an EML spec concern than a core-mb one.
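
A minimal sketch of the table-to-description step described above, written in Python rather than the R code mentioned in the comment (which is not shown in this thread). The `PEOPLE` lookup, the function names, and the fallback wording for empty notes are assumptions for illustration only; in a real setup the display names would presumably come from the database.

```python
import xml.etree.ElementTree as ET

# Hypothetical NameID -> display name lookup (illustrative only).
PEOPLE = {"selmendorf": "Sarah Elmendorf", "gastil": "Gastil Gastil-Buhl"}

# (revision, notes, ISO date, NameID); sample rows modeled on the table above.
ROWS = [
    (1, "2018 data appended", "2018-04-03", "selmendorf"),
    (2, "removed 2014 data from plots x,y,z due to protocol deviations "
        "used in those years", "2018-05-23", "gastil"),
    (3, None, "2018-07-11", "selmendorf"),
    (4, "updated metadata to correct units and definitions, "
        "data remains unchanged", "2018-11-16", "gastil"),
]


def describe(revision, notes, changed, name_id):
    """Render one revision as a single line of the description text."""
    # Assumed fallback wording for routine updates with no notes.
    text = notes or "minor change, no description provided"
    who = PEOPLE.get(name_id, name_id)
    return f"{changed}, revision {revision} by {who}: {text}"


def build_maintenance(rows, frequency):
    """Build a <maintenance> element with one description line per revision."""
    maintenance = ET.Element("maintenance")
    description = ET.SubElement(maintenance, "description")
    ordered = sorted(rows, key=lambda row: row[0])
    description.text = "\n".join(describe(*row) for row in ordered)
    ET.SubElement(maintenance, "maintenanceUpdateFrequency").text = frequency
    return maintenance
```

Calling `ET.tostring(build_maintenance(ROWS, "annually"), encoding="unicode")` yields a snippet of the same shape as the one above. The update frequency is passed in separately because, as noted, it would be stored somewhere other than the revision table.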

@atn38 atn38 self-assigned this Apr 25, 2019
@gastil
Contributor

gastil commented Apr 25, 2019

It helps to work from this example. I have not used the description element that way. Here is an example of what I put in the description element:

<description>
        <para>This is an ongoing dataset. Data are added every year.</para>
</description>
<maintenanceUpdateFrequency>annually</maintenanceUpdateFrequency>

In your example, it concatenates each revision note. I am not sure that is ideal, although by using the para element you can separate them. Some time series datasets of mine have revisions into the 50s. The example dataset I have open today is in its 37th revision, with 9 changeHistory trees. And MCR is a relatively young site, started in 2005. I use the changeHistory element. I butcher its intended use when it does not apply. Note you can put 'na' for a value. Here are some examples:

      <changeHistory>
        <changeScope>data</changeScope>
        <oldValue>na</oldValue>
        <changeDate>2013-03-28</changeDate>
        <comment>Annual survey 2012 data appended.</comment>
      </changeHistory>
      <changeHistory>
        <changeScope>data</changeScope>
        <oldValue>na</oldValue>
        <changeDate>2014-03-04</changeDate>
        <comment>Annual survey 2013 data appended.</comment>
      </changeHistory>
      <changeHistory>
        <changeScope>metadata</changeScope>
        <oldValue>metadata</oldValue>
        <changeDate>2014-03-11</changeDate>
        <comment>Revision 31 the only change is correction of spelling of Psammocora.</comment>
      </changeHistory>
      <changeHistory>
        <changeScope>metadata</changeScope>
        <oldValue>metadata</oldValue>
        <changeDate>2015-12-09</changeDate>
        <comment>Revision 33 adds 2015 data and revises spelling of Gardineroseris. 
Caveat about use of repeated location in transect 4 added to methods.
Methods updated for new coral analysis software used starting in 2015.</comment>
      </changeHistory>

As for confusion over updateFrequency, that is the expected update frequency of the data time series, not of metadata maintenance like adding ORCIDs, checksums, fixing typos, redesigning tables, taxonomic updates, etc.

I could be ok with updateFrequency being in DataSet. It is 1-1 with dataset and does not change, or very rarely changes.

@scelmendorf

It is probably too late in the EML 2.2 release train to relax the requirement on changeHistory/oldValue and changeHistory/changeScope @mobb
If not, then you could nicely put what you have into a set of {changeDate, comment}, which I think is what we actually want. Alternatively, one could ignore the instructions in the EML specification, which certainly make it sound like this element is intended to track individual measurements (i.e. one record in a single attribute in a single entity), and put something placeholdery in changeScope and oldValue, as Gastil suggests. I thought of putting the packageIds (old, new) there instead of data, metadata, though I can see advantages to data, metadata. I think that, together with the ability to put in comments and dates, would cover most of the use cases. Technically, we'd want changePerson (or "changer/changedBy/change_provider"?) in there as an optional element under changeHistory so it's more easily parsed out. Anyhow, that's my unsolicited advice on how EML 2.2 (or, realistically, 2.3) could handle this.

@gastil
Contributor

gastil commented Apr 25, 2019

Margaret is traveling but I am 99.9% sure there cannot be any schema changes at this point. The EML dev crew assured folks they could count on the schema for 2.2 to be stable. Only documentation is left to finish. So we can submit an enhancement request for EML2.3.

Here is my earlier example in table format, with an earlier one included. I see I did not always note which revision. I should have. I will enter 'na' for those.

| DataSetID | Revision | RevisionNotes | RevisionDate | NameID |
|---|---|---|---|---|
| 4 | 25 | In Version 4.25 long format location column split into site, habitat, transect, quadrat for user convenience, while retaining whole location column as part of primary key. Only a format change; data unchanged. | 2011-09-12 | mgastilbuhl |
| 4 | na | Annual survey 2012 data appended. | 2013-03-28 | mgastilbuhl |
| 4 | na | Annual survey 2013 data appended. | 2014-03-04 | mgastilbuhl |
| 4 | 31 | Revision 31 the only change is correction of spelling of Psammocora. | 2014-03-11 | mgastilbuhl |

I suggest we always use YYYY-MM-DD, the ISO 8601 date format. And I use the LTER Network user IDs, which we used to call "lno_uid" when lno was LTER Network Office, before your time. Those take the form flast, sometimes with a number appended. I do not know if that format is still in use. You may have different user ID formats.

@atn38
Member Author

atn38 commented Apr 25, 2019

@gastil I really like using "metadata", "data", or "metadata and data" for changeScope. Maybe that should be a column in core-mb. @scelmendorf I like using the previous revision for oldScope! Great workarounds to be able to use the changeHistory tree.

So now if we have this table:

| DataSetID | Revision | RevisionScope | RevisionNotes | RevisionDate | NameID |
|---|---|---|---|---|---|
| 1001 | 1 | data and metadata | 2018 data appended | 2018-04-03 | selmendorf |
| 1001 | 2 | data | removed 2014 data from plots x,y,z due to protocol deviations used in those years | 2018-05-23 | gastil |
| 1001 | 3 | data | | 2018-07-11 | selmendorf |
| 1001 | 4 | metadata | updated metadata to correct units and definitions, data remains unchanged | 2018-11-16 | gastil |

Feed it through R code to get this EML. No place to put in names, but we can concatenate into comment if that's acceptable.

  <changeHistory>
    <changeScope>data</changeScope>
    <oldValue>first package version, oldValue not applicable</oldValue>
    <changeDate>2018-04-03</changeDate>
    <comment>2018 data appended</comment>
  </changeHistory>
  <changeHistory>
    <changeScope>data</changeScope>
    <oldValue>revision 1</oldValue>
    <changeDate>2018-05-23</changeDate>
    <comment>removed 2014 data from plots x y z due to protocol deviations used in those years</comment>
  </changeHistory>
  <changeHistory>
    <changeScope>data</changeScope>
    <oldValue>revision 2</oldValue>
    <changeDate>2018-07-11</changeDate>
  </changeHistory>
  <changeHistory>
    <changeScope>metadata</changeScope>
    <oldValue>revision 3</oldValue>
    <changeDate>2018-11-16</changeDate>
    <comment>updated metadata to correct units and definitions, data remains unchanged</comment>
  </changeHistory>
  <description><para>This is an ongoing dataset. Data are added every year.</para></description>
  <maintenanceUpdateFrequency>annually</maintenanceUpdateFrequency>

I am merely subtracting 1 from the current revision number to get oldScope. @gastil, you are right about wrappers, it's just I'm having trouble with the function in rEML that deals with text stuff. Re: name IDs, internally I'm just doing [first letter first name][last name], not really following any spec.
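
A rough Python sketch of the rule described in this comment (previous revision number as the old value, comment only when notes exist). It is not the R code referred to in the thread, which is not shown here; the function name, the row layout, and the optional `changed_by` suffix are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder used for the first revision, copied from the example above.
FIRST_VERSION = "first package version, oldValue not applicable"

# (revision, scope, ISO change date, notes); sample rows from the table above.
ROWS = [
    (1, "data and metadata", "2018-04-03", "2018 data appended"),
    (2, "data", "2018-05-23", "removed 2014 data from plots x,y,z due to "
                              "protocol deviations used in those years"),
    (3, "data", "2018-07-11", None),
    (4, "metadata", "2018-11-16", "updated metadata to correct units and "
                                  "definitions, data remains unchanged"),
]


def change_history(revision, scope, change_date, notes=None, changed_by=None):
    """Build one <changeHistory> element for a single revision."""
    node = ET.Element("changeHistory")
    ET.SubElement(node, "changeScope").text = scope
    # The old value is simply the previous revision number.
    old_value = FIRST_VERSION if revision <= 1 else f"revision {revision - 1}"
    ET.SubElement(node, "oldValue").text = old_value
    ET.SubElement(node, "changeDate").text = change_date
    if notes:
        # Assumption: with no dedicated element for names, append the
        # person to the comment when one is supplied.
        text = f"{notes} ({changed_by})" if changed_by else notes
        ET.SubElement(node, "comment").text = text
    return node


def change_histories(rows):
    """Return the <changeHistory> elements in revision order."""
    return [change_history(*row) for row in sorted(rows, key=lambda r: r[0])]
```

Each element can be serialized with `ET.tostring(node, encoding="unicode")` or appended to a parent `maintenance` element. Revisions with empty notes produce no `comment` child, matching revision 3 in the example above.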

@gastil
Contributor

gastil commented Apr 25, 2019

@atn38 there is no oldScope. I think you meant oldValue. I'm not sure the previous revision number is meaningful if we make a practice of including the current revision number in the comment. Although I am leery of using such an undefined field as comment for structured content, we do not have good options until EML enhances the maintenance section. I know many IMs do not use maintenance at all; some may not even realize it exists. It is not one of the parts of EML that receive attention. Our local catalog does not even display it; portal does, if it is included in an EML doc. And yet, if I were a data user of a time series, I know I'd rely on maintenance to give me a heads up, because time series data is not as uniform year-to-year as some assume. Real stuff happens.

Some examples of stuff that happened with one of our core time series are viewable here, under the 'maintenance' section (near bottom of page). https://portal.lternet.edu/nis/metadataviewer?packageid=knb-lter-mcr.4.37
Another good example: https://portal.lternet.edu/nis/metadataviewer?packageid=knb-lter-mcr.6.56

@atn38
Member Author

atn38 commented Apr 25, 2019

@gastil, it is oldValue. I do think we should include the current revision number in the comment; otherwise it's not entirely obvious which changeHistory element corresponds to which revision. (The LTER/EDI revisions webpage leaves much to be desired.) Thanks for the links! I wish there was a way to search for EML docs by presence/absence of random obscure EML elements.

@gastil
Contributor

gastil commented Apr 25, 2019

Yes. I wish I had consistently included the revision number in all my changeHistory entries. Let's make a practice of that. Actually, if we can define a format or template for how we compose our comment field then in a later version of EML maybe we can parse and re-enter that content with a script. Or, at minimum, just converge on how we enter the info into the EML. We can store the info in metabase by fields more specific than EML currently offers.
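
To make the template idea concrete, here is a small Python sketch of one possible comment format and a parser that recovers the fields later. The format string `Revision N by NAME: NOTES` is purely an assumption for illustration, not a convention agreed in this thread, and the names `format_comment` and `parse_comment` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ChangeComment(NamedTuple):
    revision: int
    changed_by: str
    notes: str


# Assumed template: "Revision <N> by <NameID>: <free-text notes>".
_PATTERN = re.compile(
    r"^Revision (?P<revision>\d+) by (?P<changed_by>[^:]+): (?P<notes>.*)$",
    re.DOTALL,  # notes may span several lines
)


def format_comment(revision: int, changed_by: str, notes: str) -> str:
    """Compose a comment string following the assumed template."""
    return f"Revision {revision} by {changed_by}: {notes}"


def parse_comment(text: str) -> Optional[ChangeComment]:
    """Recover the fields from a templated comment, or None if it does not match."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return ChangeComment(
        int(match["revision"]), match["changed_by"], match["notes"]
    )
```

Older free-form comments, such as "Annual survey 2012 data appended.", do not match the pattern and come back as `None`, so a migration script could flag them for manual review rather than guess.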

Back to @atn38 's draft of the table, I will add a column for changeScope and the PK and FKs:

CREATE TABLE lter_metabase."DataSetMaintenance"(
	"DataSetID" integer NOT NULL,
	"Revision" integer NOT NULL,
	"RevisionNotes" character varying(200), -- goes in this xpath: eml/dataset/maintenance/changeHistory/comment
	"ChangeDate" date NOT NULL,
	"ChangeScope" character varying(32), -- (data, metadata, data and metadata)
	"NameID" character varying(20) NOT NULL,
	CONSTRAINT "PK_DataSetMaintenance" PRIMARY KEY ("DataSetID","Revision","ChangeScope"),
	CONSTRAINT "FK_DataSetMaintenance_DataSet" FOREIGN KEY ("DataSetID") REFERENCES lter_metabase."DataSet" ("DataSetID") MATCH SIMPLE ON UPDATE CASCADE ON DELETE NO ACTION,
	CONSTRAINT "FK_DataSetMaintenance_People" FOREIGN KEY ("NameID") REFERENCES lter_metabase."People" ("NameID") MATCH SIMPLE ON UPDATE CASCADE ON DELETE NO ACTION
);

Would we want to put a CHECK on ChangeScope to be in (data, metadata, data and metadata)?

I put 3 columns in the PK to allow a separate entry for a data and a metadata entry for the same revision. Do you think that is overkill?
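
As a quick way to try out both questions, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The schema prefix and foreign keys are left out, and the helper names `make_db` and `try_insert` are hypothetical. The CHECK clause itself is standard SQL and should carry over to PostgreSQL, but that is an assumption that has not been tested here.

```python
import sqlite3

# Simplified stand-in for the draft table: no schema prefix, no foreign keys.
DDL = """
CREATE TABLE "DataSetMaintenance" (
    "DataSetID" integer NOT NULL,
    "Revision" integer NOT NULL,
    "RevisionNotes" varchar(200),
    "ChangeDate" date NOT NULL,
    "ChangeScope" varchar(32) NOT NULL
        CHECK ("ChangeScope" IN ('data', 'metadata', 'data and metadata')),
    "NameID" varchar(20) NOT NULL,
    PRIMARY KEY ("DataSetID", "Revision", "ChangeScope")
)
"""


def make_db():
    """Create an in-memory database holding the simplified table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn


def try_insert(conn, row):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'INSERT INTO "DataSetMaintenance" VALUES (?, ?, ?, ?, ?, ?)',
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this table, a revision can carry one 'data' row and one 'metadata' row, which is what the three-column key allows. A second 'data' row for the same revision, or a scope outside the three allowed values, is rejected.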

@gastil
Contributor

gastil commented May 1, 2019

What do you all think about putting the maintenance content into the pkg_mgmt schema? It occurs to me that the content feels related.

@atn38
Member Author

atn38 commented May 7, 2019

Agree that the content feels related, but to play devil's advocate: I think maintenance content fits in lter_metabase schema. We might want the difference between the two schemas to be: lter_metabase content is exposed via EML docs, while pkg_mgmt content is entirely for internal mgmt. @twhiteaker (I think) made a point in zoom that the schemas should be modular & optional, so a new user wouldn't have to learn all of them at once. E.g. by putting maintenance (& all EML elements) in lter_metabase, users only need to populate lter_metabase to make EML and don't need to open pkg_mgmt until they feel the need to.

Note that the way the views and R code are written right now, populating pkg_mgmt is required to make EML (since VIEWs still pull a small amount of info from pkg_mgmt). We can change that of course.

@atn38
Member Author

atn38 commented Jun 27, 2019

closing since #44 and subsequent edits have implemented this feature

@atn38 atn38 closed this as completed Jun 27, 2019