Make explicit list of fields required for portal to function, make missing entries invalidate the data #41
I went through the generated schema definitions and assembled a list of attributes that are generally assumed to be present by the pilot application. Several attributes are additionally required for ingest either because of referential integrity or because they represent non-nullable fields in the database. I have marked such attributes in bold text.
Study
We also need additional information related to studies not present in the source data (see study_additional.json and microbiomedata/nmdc-metadata#301):
OmicsProcessing
Biosample
DataObject
WorkflowExecutionActivity
MetaproteomicAnalysis
PeptideQuantification
FunctionalAnnotation
|
@wdduncan could you review these attributes and make sure they are required in the schema? |
Based on today's meeting, I've moved this to "in progress". |
@jeffbaumes Many of these schema changes require data that I don't have access to. |
@wdduncan Yes, we can discuss this tomorrow. The fields I think you are probably referring to are the ones on study that are not available in GOLD. @jeffbaumes There are a number of these that I don't believe can be made required in the schema. DOE, PI web sites, objective. If values for these are not available from NMDC sources (GOLD, NCBI, etc.) then requiring them in the schema does not make sense. In the cases where we do have values - we need to consider how they would be included. I do believe that should happen upstream of search portal ingest; i.e. not provided to Kitware out-of-band in various files. |
One possible course of action is an nmdc-portal LinkML file that imports nmdc.yaml, i.e.
# in e.g. nmdc-schema/src/schema/extensions/nmdc-portal.yaml
...
imports:
  - nmdc
...
Then, the portal schema can use the This way, the nmdc-portal requirements can be verified via tests in the nmdc-schema repo to be a proper extension of the base nmdc schema (e.g. no required fields are made optional, etc.), though I think for this issue such an error is unlikely - the portal only makes optional fields required and adds new fields. |
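The "proper extension" check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use the LinkML tooling, the schemas are represented as plain dictionaries mapping class name to slot name to a required flag, and the function name `extension_violations` and the `objective` slot name are hypothetical.

```python
from typing import Dict, List

# Assumed representation (not a LinkML API): {class_name: {slot_name: is_required}}.
# The extension dict is the fully merged view, i.e. base slots plus portal additions.
Schema = Dict[str, Dict[str, bool]]


def extension_violations(base: Schema, extension: Schema) -> List[str]:
    """List the ways `extension` fails to be a proper extension of `base`.

    Allowed: adding new slots, making optional slots required.
    Not allowed: dropping a base slot, making a required slot optional.
    """
    problems: List[str] = []
    for cls, slots in base.items():
        ext_slots = extension.get(cls, {})
        for slot, required in slots.items():
            if slot not in ext_slots:
                problems.append(f"{cls}.{slot}: missing from extension")
            elif required and not ext_slots[slot]:
                problems.append(f"{cls}.{slot}: required in base but optional in extension")
    return problems


# Hypothetical base schema: the PI name is optional, as reported in this issue.
base = {"Study": {"id": True, "principal_investigator_name": False}}

# Hypothetical portal extension: tightens the PI name and adds a new slot.
portal = {"Study": {"id": True, "principal_investigator_name": True, "objective": True}}

# A broken extension that loosens a required base slot.
broken = {"Study": {"id": False, "principal_investigator_name": True}}
```

Here `extension_violations(base, portal)` returns an empty list, while the `broken` variant is reported because it makes `Study.id` optional.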
@dehays if we know what must remain optional (i.e. we can't expect to require it on ingest or submission) then we should make that explicit and adapt our portal DB and UI to accommodate so it is resilient to those fields being absent. Basically @jbeezley is reporting what is required to drive the portal we see today. If we don't require these things some aspects of the interface will suffer (i.e. we can't expect to have PI/study pages that are all as nice as what is there today), which is fine but we need to judge with some finality what are the minimum requirements for inclusion in NMDC. |
Adding notes exchanged via email here for reference:

Agree with your comments here David (and yours Kjiersten). I commented in parallel on GitHub, but the gist is that we need to resolve which fields we can and can't expect to require. Many "required" things for the portal could be made optional if needed.

On Wed, Apr 28, 2021 at 1:31 PM Kjiersten Fagnan kmfagnan@lbl.gov wrote:

Could we create some default values for the portal to populate the page - avatar, URL, DOI and scientific objectives would be harder if not impossible. In the future, when contributors are providing data to NMDC, could we also collect - photo, website URL, etc as part of the submission process - or perhaps give the PI the ability to add this themselves? This depends on having some level of access controls (different roles in the data portal than exist right now). Maybe this is part of working directly with the PIs to get their help on the study/data landing pages?

Kjiersten

On Wed, Apr 28, 2021 at 10:20 AM David Hays dehays@lbl.gov wrote:
David should be able to address #41, seems like a GOLD database dump issue? For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents.

On #41, I believe the fields that Bill is referring to are the ones that are NOT available from GOLD; i.e. those listed in microbiomedata/nmdc-metadata#301 and #19 such as PI web site, PI image, scientific objective, publication DOIs. Basically, the ones that Emiley collected manually and provided to Kitware.

For these, Bill could add these to the schema as non-required attributes. He has no way of making the GOLD ETL populate these because they do not exist in GOLD. Jon states that the portal UI depends on these fields - but I believe the portal UI will need to treat them as optional fields as well because normally they will not be available. If we add 10K studies tomorrow from GOLD or NCBI - we will not be waiting for Emiley to collect values for these fields before they can be displayed in the UI. The portal UI needs to be flexible enough to handle cases where these values are not available.

I also believe it should not be the responsibility of search portal development to merge additional metadata for studies to extend what was made available for ingest. So that implies the need for a curate/annotate procedure that is available between GOLD or NCBI ETL and search portal ingest. And in the case of images - in addition to a way to edit the study json docs to add PI image URLs, there is also the need to add and manage the image files to a location associated with the metadata URL.

So for #41, there are a number of fields that are always available (we will always have a PI for a study) that can be made required in the schema. But for those for which there is no available source except manual curation, at best these could be optional fields in the schema. Make sense?

-David

On Tue, Apr 27, 2021 at 1:51 PM Emiley Eloe-Fadrosh eaeloefadrosh@lbl.gov wrote:
David should be able to address #41, seems like a GOLD database dump issue? For #19, as was indicated in the github ticket, these were all manually collected by me. Not sure the best solution here, but this could tie into the more general discussion of the study pages (and some items from #41 like scientific objectives that are not part of the GOLD db dump). Not everything can be fully automated.... just my two cents. |
Yes. What @jeffbaumes said was my intent. We need to ensure that the UI design takes into account information that may not be available. Especially if the common case is going to be that this information is missing, the current study page won't make much sense. Where this information is missing on a study, it might be better to link to a page inviting PIs to submit the data to us directly. |
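To make the "treat them as optional in the UI" idea concrete, here is a minimal sketch of how a study page could be prepared when manually curated fields are absent. Everything here is an assumption for illustration: the field names (`pi_website`, `pi_image_url`, `scientific_objective`, `publication_dois`), the function name, and the invitation text are hypothetical and are not taken from the portal code.

```python
from typing import Any, Dict, Mapping

# Hypothetical names for the manually curated fields discussed in this thread.
OPTIONAL_STUDY_FIELDS = (
    "pi_website",
    "pi_image_url",
    "scientific_objective",
    "publication_dois",
)

# Hypothetical text shown in place of missing information.
INVITE = "Not yet provided. PIs are invited to submit this information."


def study_page_context(study: Mapping[str, Any]) -> Dict[str, Any]:
    """Build the data for a study page without assuming curated fields exist."""
    context: Dict[str, Any] = dict(study)
    # A field counts as missing if it is absent, None, or empty.
    missing = [name for name in OPTIONAL_STUDY_FIELDS if not study.get(name)]
    for name in missing:
        context[name] = None  # the template can test for None and skip the section
    context["missing_fields"] = missing
    context["invite_message"] = INVITE if missing else None
    return context
```

A study that arrives from an automated source with only a PI name would then still render, with the missing sections replaced by a link or message inviting the PI to supply them.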
@wdduncan I'll move this to in progress per the last comment but let me know if you won't be able to add the missing fields this week or maybe it's already done? |
For example, principal_investigator_name was marked optional in the JSON schema but is required for the portal to function.
I believe this involves marking more things as required in the schema, hence I'm putting this issue on the metadata repo.
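As a rough illustration of "missing entries invalidate the data", the check could look something like the sketch below. It is not the ingest code: the `PORTAL_REQUIRED` table and the function names are hypothetical, the example identifier is made up, and only `principal_investigator_name` is taken from this issue.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical explicit list of fields the portal needs, keyed by record type.
# Only principal_investigator_name comes from this issue; "id" is an assumption.
PORTAL_REQUIRED: Dict[str, List[str]] = {
    "Study": ["id", "principal_investigator_name"],
}


def missing_required(record_type: str, record: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent, None, or empty strings."""
    required = PORTAL_REQUIRED.get(record_type, [])
    return [name for name in required if record.get(name) in (None, "")]


def is_valid(record_type: str, record: Mapping[str, Any]) -> bool:
    """A record is valid only if no required field is missing."""
    return not missing_required(record_type, record)


# Made-up example record with no PI name.
incomplete_study = {"id": "example:study-1"}
```

With this table, `missing_required("Study", incomplete_study)` reports `principal_investigator_name`, so the record would be rejected at validation time rather than failing later in the portal.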