Make explicit list of fields required for portal to function, make missing entries invalidate the data #41

Closed
jeffbaumes opened this issue Apr 1, 2021 · 13 comments

@jeffbaumes

For example, principal_investigator_name was marked optional in the JSON schema but is required for the portal to function.

I believe this involves marking more things as required in the schema, hence I'm putting this issue on the metadata repo.

@jbeezley

jbeezley commented Apr 7, 2021

I went through the generated schema definitions and assembled a list of attributes that are generally assumed to be present by the pilot application. Several attributes are additionally required for ingest either because of referential integrity or because they represent non-nullable fields in the database. I have marked such attributes in bold text.

Study

  • doi
  • principal_investigator_name

We also need additional information related to studies not present in the source data (see study_additional.json and microbiomedata/nmdc-metadata#301):

  • principal_investigator_websites
  • proposal_title
  • publication_dois
  • scientific_objective

OmicsProcessing

  • has_input
  • has_output
  • instrument_name
  • omics_type
  • part_of
  • processing_institution

Biosample

  • collection_date
  • community
  • depth
  • ecosystem
  • ecosystem_category
  • ecosystem_subtype
  • ecosystem_type
  • env_broad_scale
  • env_local_scale
  • env_medium
  • geo_loc_name
  • habitat
  • lat_lon
  • ncbi_taxonomy_name
  • sample_collection_site
  • specific_ecosystem

DataObject

  • file_size_bytes
  • md5_checksum
  • name
  • url
  • was_generated_by

WorkflowExecutionActivity

  • ended_at_time
  • execution_resource
  • git_url
  • name
  • started_at_time
  • type
  • has_input
  • has_output
  • was_informed_by

MetaproteomicAnalysis

  • has_peptide_quantifications

PeptideQuantification

  • all_proteins
  • best_protein

FunctionalAnnotation

  • has_function
  • subject
  • was_generated_by
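
A check along the following lines could make missing entries invalidate a record at ingest. This is a minimal sketch, not the portal's actual ingest code: `REQUIRED_FIELDS` copies only a few entries from the list above, and `missing_fields` and the sample record are hypothetical.

```python
# Hypothetical required-field table: a subset of the list above, keyed by type.
REQUIRED_FIELDS = {
    "Study": ["doi", "principal_investigator_name"],
    "DataObject": ["file_size_bytes", "md5_checksum", "name", "url", "was_generated_by"],
    "FunctionalAnnotation": ["has_function", "subject", "was_generated_by"],
}


def missing_fields(record_type, record):
    """Return the required fields that are absent or null in `record`."""
    required = REQUIRED_FIELDS.get(record_type, [])
    return [field for field in required if record.get(field) is None]


# Illustrative record only; the id and values are made up.
study = {"id": "example:study-1", "doi": None}
problems = missing_fields("Study", study)
if problems:
    print("invalid Study, missing:", ", ".join(problems))
```

Treating an explicit null the same as an absent key is a design choice here; it matches the non-nullable database columns mentioned above, but the real rule may differ per field.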

@jeffbaumes
Author

@wdduncan could you review these attributes and make sure they are required in the schema?

@ssarrafan
Collaborator

Based on the meeting today, I've moved this to "in progress".

@wdduncan wdduncan transferred this issue from microbiomedata/nmdc-metadata Apr 27, 2021
@wdduncan
Contributor

@jeffbaumes Many of these schema changes require data that I don't have access to.
@dehays can we discuss this at the Wed April 28 call?

@dehays
Contributor

dehays commented Apr 27, 2021

@wdduncan Yes, we can discuss this tomorrow. The fields I think you are probably referring to are the ones on study that are not available in GOLD.

@jeffbaumes There are a number of these that I don't believe can be made required in the schema. DOE, PI web sites, objective. If values for these are not available from NMDC sources (GOLD, NCBI, etc.) then requiring them in the schema does not make sense. In the cases where we do have values - we need to consider how they would be included. I do believe that should happen upstream of search portal ingest; i.e. not provided to Kitware out-of-band in various files.

@dwinston
Collaborator

One possible course of action is an nmdc-portal LinkML file that imports nmdc.yaml, i.e.

# in e.g. nmdc-schema/src/schema/extensions/nmdc-portal.yaml
...
imports:
  - nmdc
...

Then, the portal schema can use the slot_usage facility to make certain inherited slots required, add slots, etc.

This way, the nmdc-portal requirements can be verified via tests in the nmdc-schema repo to be a proper extension of the base nmdc schema (e.g. no required fields are made optional, etc.), though I think for this issue such an error is unlikely - the portal only makes optional fields required and adds new fields.
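
The "proper extension" test described above might look roughly like this, assuming each schema has already been reduced to a mapping from class name to the set of its required slot names. That reduction, the function `extension_violations`, and the example slot sets are all hypothetical; none of this is LinkML API.

```python
def extension_violations(base_required, portal_required):
    """List (class, slot) pairs that are required in the base schema
    but not required in the portal extension."""
    violations = []
    for class_name, slots in base_required.items():
        portal_slots = portal_required.get(class_name, set())
        # Any base requirement missing from the extension has been loosened.
        for slot in sorted(slots - portal_slots):
            violations.append((class_name, slot))
    return violations


# Assumed example data: "id" being required in the base schema is illustrative only.
base = {"Study": {"id"}}
portal = {"Study": {"id", "doi", "principal_investigator_name"}}
print(extension_violations(base, portal))
```

An empty result means the portal schema only adds requirements, which is the case this issue expects.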

@jeffbaumes
Author

> @jeffbaumes There are a number of these that I don't believe can be made required in the schema. DOE, PI web sites, objective. If values for these are not available from NMDC sources (GOLD, NCBI, etc.) then requiring them in the schema does not make sense. In the cases where we do have values - we need to consider how they would be included. I do believe that should happen upstream of search portal ingest; i.e. not provided to Kitware out-of-band in various files.

@dehays if we know what must remain optional (i.e. we can't expect to require it on ingest or submission) then we should make that explicit and adapt our portal DB and UI to accommodate so it is resilient to those fields being absent. Basically @jbeezley is reporting what is required to drive the portal we see today. If we don't require these things some aspects of the interface will suffer (i.e. we can't expect to have PI/study pages that are all as nice as what is there today), which is fine but we need to judge with some finality what are the minimum requirements for inclusion in NMDC.
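
As a rough illustration of a UI that is resilient to absent fields, the sketch below builds only the study page sections for which data exists. The field names come from the list earlier in this thread; `OPTIONAL_STUDY_FIELDS`, `study_page_sections`, and the sample record are hypothetical and do not describe the portal's code.

```python
# Assumed set of optional study fields, based on the PI web sites and
# scientific objective mentioned above as unavailable from NMDC sources.
OPTIONAL_STUDY_FIELDS = ("principal_investigator_websites", "scientific_objective")


def study_page_sections(study):
    """Return only the optional page sections for which the study has data."""
    sections = {}
    for field in OPTIONAL_STUDY_FIELDS:
        value = study.get(field)
        if value:  # skips None, empty strings, and empty lists
            sections[field] = value
    return sections


# Illustrative record only; the values are made up.
study = {"principal_investigator_name": "Example PI", "scientific_objective": ""}
print(study_page_sections(study))
```

A page built this way simply omits a section rather than failing when a value is missing.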

@ssarrafan
Collaborator

Adding notes exchanged via email here for reference:

Agree with your comments here David (and yours Kjiersten). I commented in parallel on GitHub, but the gist is that we need to resolve which fields we can and can't expect to require. Many "required" things for the portal could be made optional if needed.

#41

On Wed, Apr 28, 2021 at 1:31 PM Kjiersten Fagnan kmfagnan@lbl.gov wrote:
I support the approach David laid out for what fields are required vs optional. I can add the following comments to the ticket, but we seem to be getting into this via email.

Could we create some default values for the portal to populate the page - avatar, URL, DOI and scientific objectives would be harder if not impossible.

In the future, when contributors are providing data to NMDC, could we also collect - photo, website URL, etc as part of the submission process - or perhaps give the PI the ability to add this themselves? This depends on having some level of access controls (different roles in the data portal than exist right now). Maybe this is part of working directly with the PIs to get their help on the study/data landing pages?

Kjiersten

On Wed, Apr 28, 2021 at 10:20 AM David Hays dehays@lbl.gov wrote:
Bill and Emiley said:

@emiley Eloe-Fadrosh any idea of who to follow up on with to get access to the data needed?

@emiley Eloe-Fadrosh any idea of who to follow up on with to get access to the PI images needed?

David should be able to address #41, seems like a GOLD database dump issue?

For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents.


On #41, I believe the fields that Bill is referring to are the ones that are NOT available from GOLD; i.e. those listed in microbiomedata/nmdc-metadata#301 and #19 such as PI web site, PI image, scientific objective, publication DOIs. Basically, the ones that Emiley collected manually and provided to Kitware.

For these, Bill could add these to the schema as non required attributes. He has no way of making the GOLD ETL populate these because they do not exist in GOLD.

Jon states that the portal UI depends on these fields - but I believe the portal UI will need to treat them as optional fields as well because normally they will not be available. If we add 10K studies tomorrow from GOLD or NCBI - we will not be waiting for Emiley to collect values for these fields before they can be displayed in the UI. The portal UI needs to be flexible enough to handle cases where these values are not available.

I also believe it should not be the responsibility of search portal development to merge additional metadata for studies to extend what was made available for ingest. So that implies the need for a curate/annotate procedure that is available between GOLD or NCBI ETL and search portal ingest. And in the case of images - in addition to a way to edit the study json docs to add PI image URLs, there is also the need to add and manage the image files to a location associated with the metadata URL.

So for #41, there are a number of fields that are always available (We will always have a PI for a study.) that can be made required in the schema. But for those for which there is no available source except manual curation, at best these could be optional fields in the schema. Make sense?

-David

On Tue, Apr 27, 2021 at 1:51 PM Emiley Eloe-Fadrosh eaeloefadrosh@lbl.gov wrote:
For these two:

@emiley Eloe-Fadrosh any idea of who to follow up on with to get access to the data needed?

@emiley Eloe-Fadrosh any idea of who to follow up on with to get access to the PI images needed?

David should be able to address #41, seems like a GOLD database dump issue?

For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents.
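
The curate/annotate step between ETL and portal ingest that David describes in the emails above could be as simple as a keyed merge. The following is only a sketch under assumptions: study records are dicts with an `id` key, curated additions are keyed by that id, and `annotate_studies` is a hypothetical name.

```python
def annotate_studies(etl_studies, curated_by_id):
    """Fill gaps in ETL study records with manually curated values.

    Values from the ETL are kept; curated values are used only where the
    ETL record has no value for that field.
    """
    annotated = []
    for study in etl_studies:
        merged = dict(study)  # copy so the ETL output is not mutated
        for key, value in curated_by_id.get(study["id"], {}).items():
            if merged.get(key) is None:
                merged[key] = value
        annotated.append(merged)
    return annotated


# Illustrative data only; ids and values are made up.
etl = [{"id": "example:study-1", "principal_investigator_name": "Example PI"}]
curated = {"example:study-1": {"scientific_objective": "Example objective"}}
print(annotate_studies(etl, curated))
```

Studies without a curated entry pass through unchanged, so newly ingested studies do not have to wait for manual curation.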

@jbeezley

Yes. What @jeffbaumes said was my intent.

We need to ensure that the UI design takes into account information that may not be available. Especially if the common case is going to be that this information is missing, the current study page won't make much sense. Where this information is missing on a study, it might be better to link to a page inviting PIs to submit the data to us directly.

@wdduncan wdduncan removed this from In progress in NMDC April 2021 Sprint Apr 29, 2021
@wdduncan wdduncan added this to To do in NMDC May 2021 Sprint via automation Apr 29, 2021
@wdduncan wdduncan added the MEDIUM 4-7 days label May 5, 2021
@dehays dehays moved this from To do to Needs discussion, scoping in NMDC May 2021 Sprint May 6, 2021
@ssarrafan ssarrafan added this to the Sprint 2 milestone May 14, 2021
@ssarrafan
Collaborator

Based on the meeting today with @dehays, @emileyfadrosh, @dwinston, @wdduncan, and @jbeezley, @wdduncan will add the missing fields into the schema for this sprint.

@ssarrafan
Collaborator

@wdduncan I'll move this to in progress per the last comment, but let me know if you won't be able to add the missing fields this week, or if it's already done.

@ssarrafan ssarrafan moved this from Needs discussion, scoping to In progress in NMDC May 2021 Sprint May 27, 2021
@ssarrafan
Collaborator

@dehays and @wdduncan have discussed this and @dehays will close this issue and open several new related issues for June.

@dehays
Contributor

dehays commented May 28, 2021

Closing this issue. Replaced by #50 and #51

@dehays dehays closed this as completed May 28, 2021
NMDC May 2021 Sprint automation moved this from In progress to Done May 28, 2021
turbomam added a commit that referenced this issue Feb 21, 2024