Make explicit list of fields required for portal to function, make missing entries invalidate the data #41

jeffbaumes · 2021-04-01T13:49:44Z

For example, principal_investigator_name was marked optional in the JSON schema but is required for the portal to function.

I believe this involves marking more things as required in the schema, hence I'm putting this issue on the metadata repo.

jbeezley · 2021-04-07T17:59:52Z

I went through the generated schema definitions and assembled a list of attributes that are generally assumed to be present by the pilot application. Several attributes are additionally required for ingest either because of referential integrity or because they represent non-nullable fields in the database. I have marked such attributes in bold text.

Study

doi
principal_investigator_name

We also need additional information related to studies not present in the source data (see study_additional.json and microbiomedata/nmdc-metadata#301):

principal_investigator_websites
proposal_title
publication_dois
scientific_objective

OmicsProcessing

has_input
has_output
instrument_name
omics_type
part_of
processing_institution

Biosample

collection_date
community
depth
ecosystem
ecosystem_category
ecosystem_subtype
ecosystem_type
env_broad_scale
env_local_scale
env_medium
geo_loc_name
habitat
lat_lon
ncbi_taxonomy_name
sample_collection_site
specific_ecosystem

DataObject

file_size_bytes
md5_checksum
name
url
was_generated_by

WorkflowExecutionActivity

ended_at_time
execution_resource
git_url
name
started_at_time
type
has_input
has_output
was_informed_by

MetaproteomicAnalysis

has_peptide_quantifications

PeptideQuantification

all_proteins
best_protein

FunctionalAnnotation

has_function
subject
was_generated_by
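
A minimal sketch of how a list like the one above could be enforced at ingest, assuming records arrive as plain dictionaries keyed by attribute name. The `REQUIRED_FIELDS` mapping copies only a few of the classes listed, and `missing_required` and `is_valid` are hypothetical helpers, not part of the existing NMDC code.

```python
# Hypothetical sketch: this mapping copies a subset of the attribute list above.
REQUIRED_FIELDS = {
    "Study": ["doi", "principal_investigator_name"],
    "DataObject": ["file_size_bytes", "md5_checksum", "name", "url", "was_generated_by"],
    "PeptideQuantification": ["all_proteins", "best_protein"],
}


def missing_required(class_name, record):
    """Return the required attributes that are absent or None in `record`."""
    required = REQUIRED_FIELDS.get(class_name, [])
    return [field for field in required if record.get(field) is None]


def is_valid(class_name, record):
    """A record is valid only when no required attribute is missing."""
    return not missing_required(class_name, record)
```

With this shape, a Study record that carries only a `doi` would be reported as missing `principal_investigator_name` and could be rejected before it reaches the database.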

jeffbaumes · 2021-04-15T19:54:40Z

@wdduncan could you review these attributes and make sure they are required in the schema?

ssarrafan · 2021-04-15T21:39:06Z

Based on the meeting today, I've moved this to "in progress".

wdduncan · 2021-04-27T19:13:46Z

@jeffbaumes Many of these schema changes require data that I don't have access to.
@dehays can we discuss this at the Wed April 28 call?

dehays · 2021-04-27T23:20:15Z

@wdduncan Yes, we can discuss this tomorrow. The fields I think you are probably referring to are the ones on study that are not available in GOLD.

@jeffbaumes There are a number of these that I don't believe can be made required in the schema. DOE, PI web sites, objective. If values for these are not available from NMDC sources (GOLD, NCBI, etc.) then requiring them in the schema does not make sense. In the cases where we do have values - we need to consider how they would be included. I do believe that should happen upstream of search portal ingest; i.e. not provided to Kitware out-of-band in various files.

dwinston · 2021-04-28T13:43:06Z

One possible course of action is an nmdc-portal LinkML file that imports nmdc.yaml, i.e.

```yaml
# in e.g. nmdc-schema/src/schema/extensions/nmdc-portal.yaml
...
imports:
  - nmdc
...
```

Then, the portal schema can use the slot_usage facility to make certain inherited slots required, add slots, etc.

This way, the nmdc-portal requirements can be verified via tests in the nmdc-schema repo to be a proper extension of the base nmdc schema (e.g. no required fields are made optional, etc.), though I think for this issue such an error is unlikely - the portal only makes optional fields required and adds new fields.
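
The "proper extension" test mentioned above could look roughly like this, assuming each schema has already been reduced to a mapping from slot name to a required flag. The function name and the dictionary shape are assumptions for illustration; the sketch does not use the LinkML API.

```python
def extension_violations(base_required, portal_required):
    """List slots where the portal schema loosens or drops a base-schema slot.

    Both arguments map slot name -> bool (True when the slot is required).
    """
    violations = []
    for slot, required in base_required.items():
        if slot not in portal_required:
            violations.append(f"{slot}: missing from portal schema")
        elif required and not portal_required[slot]:
            violations.append(f"{slot}: required in base but optional in portal")
    return violations


# Illustrative flags only; the real values would come from the schema files.
base = {"id": True, "principal_investigator_name": False}
portal = {"id": True, "principal_investigator_name": True, "proposal_title": False}
assert extension_violations(base, portal) == []
```

Making an optional slot required or adding a new slot passes the check; only loosening or dropping a base slot is reported.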

jeffbaumes · 2021-04-28T17:29:14Z

> @jeffbaumes There are a number of these that I don't believe can be made required in the schema. DOE, PI web sites, objective. If values for these are not available from NMDC sources (GOLD, NCBI, etc.) then requiring them in the schema does not make sense. In the cases where we do have values - we need to consider how they would be included. I do believe that should happen upstream of search portal ingest; i.e. not provided to Kitware out-of-band in various files.

@dehays if we know what must remain optional (i.e. we can't expect to require it on ingest or submission) then we should make that explicit and adapt our portal DB and UI to accommodate so it is resilient to those fields being absent. Basically @jbeezley is reporting what is required to drive the portal we see today. If we don't require these things some aspects of the interface will suffer (i.e. we can't expect to have PI/study pages that are all as nice as what is there today), which is fine but we need to judge with some finality what are the minimum requirements for inclusion in NMDC.

ssarrafan · 2021-04-28T17:40:08Z

Adding notes exchanged via email here for reference:

Agree with your comments here David (and yours Kjiersten). I commented in parallel on GitHub, but the gist is that we need to resolve which fields we can and can't expect to require. Many "required" things for the portal could be made optional if needed.

#41

On Wed, Apr 28, 2021 at 1:31 PM Kjiersten Fagnan kmfagnan@lbl.gov wrote:
I support the approach David laid out for what fields are required vs optional. I can add the following comments to the ticket, but we seem to be getting into this via email.

Could we create some default values for the portal to populate the page, such as an avatar and URL? DOI and scientific objectives would be harder, if not impossible.

In the future, when contributors are providing data to NMDC, could we also collect - photo, website URL, etc as part of the submission process - or perhaps give the PI the ability to add this themselves? This depends on having some level of access controls (different roles in the data portal than exist right now). Maybe this is part of working directly with the PIs to get their help on the study/data landing pages?

Kjiersten

On Wed, Apr 28, 2021 at 10:20 AM David Hays dehays@lbl.gov wrote:
Bill and Emiley said:

Make explicit list of fields required for portal to function, make missing entries invalidate the data #41 I do not have access to the data for some of the fields that Jeff is requesting. I can give an estimate until we track the data.

@emiley Eloe-Fadrosh any idea who to follow up with to get access to the data needed?

Move PI profile images to official metadata schema #19 Again, I do not have access to images of PIs.

@emiley Eloe-Fadrosh any idea who to follow up with to get access to the PI images needed?

David should be able to address #41, seems like a GOLD database dump issue?

For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents.

On #41, I believe the fields that Bill is referring to are the ones that are NOT available from GOLD; i.e. those listed in microbiomedata/nmdc-metadata#301 and #19 such as PI web site, PI image, scientific objective, publication DOIs. Basically, the ones that Emiley collected manually and provided to Kitware.

For these, Bill could add these to the schema as non required attributes. He has no way of making the GOLD ETL populate these because they do not exist in GOLD.

Jon states that the portal UI depends on these fields - but I believe the portal UI will need to treat them as optional fields as well because normally they will not be available. If we add 10K studies tomorrow from GOLD or NCBI - we will not be waiting for Emiley to collect values for these fields before they can be displayed in the UI. The portal UI needs to be flexible enough to handle cases where these values are not available.

I also believe it should not be the responsibility of search portal development to merge additional metadata for studies to extend what was made available for ingest. So that implies the need for a curate/annotate procedure that is available between GOLD or NCBI ETL and search portal ingest. And in the case of images - in addition to a way to edit the study JSON docs to add PI image URLs, there is also the need to add and manage the image files at a location associated with the metadata URL.

So for #41, there are a number of fields that are always available (We will always have a PI for a study.) that can be made required in the schema. But for those for which there is no available source except manual curation, at best these could be optional fields in the schema. Make sense?

-David

On Tue, Apr 27, 2021 at 1:51 PM Emiley Eloe-Fadrosh eaeloefadrosh@lbl.gov wrote:
For these two:

Make explicit list of fields required for portal to function, make missing entries invalidate the data #41 I do not have access to the data for some of the fields that Jeff is requesting. I can give an estimate until we track the data.

@emiley Eloe-Fadrosh any idea who to follow up with to get access to the data needed?

Move PI profile images to official metadata schema #19 Again, I do not have access to images of PIs.

@emiley Eloe-Fadrosh any idea who to follow up with to get access to the PI images needed?

David should be able to address #41, seems like a GOLD database dump issue?

For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents.

jbeezley · 2021-04-28T17:48:28Z

Yes. What @jeffbaumes said was my intent.

We need to ensure that the UI design takes into account information that may not be available. Especially if the common case is going to be that this information is missing, the current study page won't make much sense. Where this information is missing on a study, it might be better to link to a page inviting PIs to submit the data to us directly.
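
One way to sketch the fallback described above, assuming a study is a plain dictionary and that the manually curated fields may be absent. `study_page_context`, `OPTIONAL_FIELDS`, and `SUBMIT_URL` are hypothetical names, and the URL is a placeholder.

```python
# Placeholder link to a page inviting PIs to submit the missing details.
SUBMIT_URL = "https://example.org/submit-study-details"

# Fields collected by manual curation, so they may be missing for most studies.
OPTIONAL_FIELDS = [
    "principal_investigator_websites",
    "proposal_title",
    "publication_dois",
    "scientific_objective",
]


def study_page_context(study):
    """Build the values a study page needs, tolerating absent optional fields."""
    context = {"principal_investigator_name": study["principal_investigator_name"]}
    missing = []
    for field in OPTIONAL_FIELDS:
        value = study.get(field)
        if value:
            context[field] = value
        else:
            missing.append(field)
    # Only show the invitation link when something is actually missing.
    context["invite_url"] = SUBMIT_URL if missing else None
    context["missing_fields"] = missing
    return context
```

The page template could then render whatever is present and show the invitation link only when `invite_url` is set.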

ssarrafan · 2021-05-18T23:00:41Z

Based on the meeting today with @dehays, @emileyfadrosh, @dwinston, @wdduncan, and @jbeezley, @wdduncan will add the missing fields into the schema for this sprint.

ssarrafan · 2021-05-27T17:40:11Z

@wdduncan I'll move this to in progress per the last comment but let me know if you won't be able to add the missing fields this week or maybe it's already done?

ssarrafan · 2021-05-27T23:00:01Z

@dehays and @wdduncan have discussed this and @dehays will close this issue and open several new related issues for June.

dehays · 2021-05-28T15:21:59Z

Closing this issue. Replaced by #50 and #51

jeffbaumes assigned jbeezley Apr 6, 2021

jeffbaumes assigned wdduncan Apr 15, 2021

wdduncan transferred this issue from microbiomedata/nmdc-metadata Apr 27, 2021

ssarrafan mentioned this issue Apr 28, 2021

Move PI profile images to official metadata schema #19

Closed

wdduncan removed this from In progress in NMDC April 2021 Sprint Apr 29, 2021

wdduncan added this to To do in NMDC May 2021 Sprint via automation Apr 29, 2021

wdduncan added the MEDIUM 4-7 days label May 5, 2021

dehays moved this from To do to Needs discussion, scoping in NMDC May 2021 Sprint May 6, 2021

ssarrafan added this to the Sprint 2 milestone May 14, 2021

ssarrafan moved this from Needs discussion, scoping to In progress in NMDC May 2021 Sprint May 27, 2021

This was referenced May 28, 2021

existing NMDC entities attributes to make required #50

Closed

additional optional attributes to be added to study #51

Closed

dehays closed this as completed May 28, 2021

NMDC May 2021 Sprint automation moved this from In progress to Done May 28, 2021

turbomam added a commit that referenced this issue Feb 21, 2024

Merge pull request #41 from microbiomedata/nmdc-schema-1572

586aef9

mixs.yaml never autoregenrated
