Move file descriptors to the official description field of data objects #20

jeffbaumes · 2021-04-01T13:55:18Z

Currently the mapping is in the portal repo but should probably be associated with data objects upstream.

jeffbaumes · 2021-04-06T13:47:35Z

Here is the current mapping of descriptions in the client based on file name. It seems these descriptions should be added as part of the analysis output JSON instead of in the client:

https://github.com/microbiomedata/nmdc-server/blob/master/web/src/v2/components/DataObjectTable.vue#L14-L26

jeffbaumes · 2021-04-06T13:49:01Z

This involves discussing the JSON format of the analysis output (requiring descriptions on data objects). @dehays or @dwinston do you have thoughts here?

dwinston · 2021-04-06T15:45:26Z

I recommend that each data object reference a file type, which can in turn be de-referenced to obtain a string description. I don't think data objects should store full string descriptions such as "Reads QC result fastq (clean data)", as any cosmetic change to these would require multiple-document updates.

Thus, the mapping currently in the front-end code would essentially live upstream and be consumed by the portal.
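
A minimal sketch of this indirection, assuming a hypothetical `file_type` field on each data object and a hypothetical in-memory registry (neither the field name nor the type IDs are fixed anywhere in this thread):

```python
# Hypothetical registry: file type ID -> display description.
# The IDs and the "file_type" field name are illustrative assumptions.
FILE_TYPE_DESCRIPTIONS = {
    "filtered_fastq": "Reads QC result fastq (clean data)",
    "assembly_contigs": "Final assembly contigs fasta",
}


def describe(data_object: dict) -> str:
    """Dereference a data object's file type to its description."""
    file_type = data_object.get("file_type")
    return FILE_TYPE_DESCRIPTIONS.get(file_type, "Unknown file type")


# "nmdc:example" is a made-up identifier used only for illustration.
data_object = {"id": "nmdc:example", "file_type": "filtered_fastq"}
print(describe(data_object))
```

With this shape, a cosmetic change to a description touches one registry entry rather than every data object document.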

jeffbaumes · 2021-04-06T20:23:20Z

I like this idea @dwinston. So this becomes (1) a schema change that adds a file type field to data objects, and (2) an official registry that maps file types to description strings (minimally a simple JSON or CSV file). It would seem to make sense for the portal database to contain the description mapping and the file type of each data object.

@dwinston do you have ideas on how to manage the data object type registry and ensure analysis workflows tag each data object appropriately?

dwinston · 2021-04-06T21:47:35Z

On ensuring appropriate tagging of data objects: this could be part of workflow post-processing -- i.e. this needn't be done as part of a workflow's data-intensive main logic. However, workflow authors would know best about how to type-annotate their workflow data products. They needn't be responsible for implementing the tagging logic, but they would need to be consulted about the logic's correctness (i.e. code review).

On the data object type registry: I think that a collaborative google sheets session among workflow authors could knock out creation in under an hour. The result could live anywhere, perhaps as a CSV in the nmdc-metadata repo, and updated as needed.

jeffbaumes · 2021-04-07T13:09:32Z

> On the data object type registry: I think that a collaborative google sheets session among workflow authors could knock out creation in under an hour.

There is a start to that effort in this document, which our current mapping derives from:

https://docs.google.com/document/d/1uq6J__NbCUezsD16VTVJa-3_HJn5DacASBmFw3xQ3Bc/edit?usp=sharing

> The result could live anywhere, perhaps as a CSV in the nmdc-metadata repo, and updated as needed.

A CSV in nmdc-metadata sounds reasonable. Right now the Google Doc only contains exemplar file names, from which Brandon derived the regex/prefix name matching. Explicit file type IDs applied to analysis output data objects would be cleaner. @dwinston do you want to take a crack at converting to an official CSV and assigning IDs?
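
As a rough sketch of what consuming such a CSV registry could look like: the column names (`id`, `description`) and the type IDs below are assumptions, since no CSV layout has been agreed on here.

```python
import csv
import io

# Hypothetical registry contents; column names and IDs are assumptions.
REGISTRY_CSV = """id,description
assembly_contigs,Final assembly contigs fasta
filter_stats,Reads QC summary statistics
"""


def load_registry(text: str) -> dict:
    """Parse registry CSV text into a mapping of type ID -> description."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: row["description"] for row in reader}


registry = load_registry(REGISTRY_CSV)
print(registry["filter_stats"])
```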

cmungall · 2021-04-07T20:59:54Z

We can do this in the schema

e.g.

file_type_enum:
  permissible_values:
    assembly_contigs.fna:
      description: Final assembly contigs fasta

ssarrafan · 2021-04-27T20:37:18Z

Moving to May sprint per Bill. More discussions required before estimating work.

scanon · 2021-05-17T19:06:38Z

I like @cmungall's proposal. We should probably have the workflow owners help populate this.

scanon · 2021-05-17T19:13:38Z

{
"description": "Functional annotation GFF file for gold:Gp0324006",
"url": "https://data.microbiomedata.org/data/503568_190750/annotation/503568_190750_functional_annotation.gff",
"md5_checksum": "a360c42d1403084110719be0d6d8d47a",
"file_size_bytes": 51348736,
"id": "nmdc:a360c42d1403084110719be0d6d8d47a",
"name": "gold:Gp0324006_Functional annotation GFF file",
"XXXXX": "functional_annotation_gff"
},

So I don't know whether "type" or "data_object_type" would be the better name for the field shown as "XXXXX" above. "type" sounds best to me, but I think its values should be constrained to some curated list of types.

ssarrafan · 2021-05-25T21:23:56Z

@wdduncan please update microbiomedata/nmdc-server#357 when this is completed so we know we can move forward with the next step of this fix. Thank you!

cmungall · 2021-05-26T20:47:47Z

I propose we change

  data object type:
    range: controlled term value
    description: >-
      The type of data object
    examples:
      - value: metagenome_assembly

by changing the range from "controlled term value" to an enum we define

(note: the two are actually similar but we have more fine-grained control with an enum, and we can still map our enum permissible values to terminologies like EDAM, if we like)
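
To illustrate the kind of constraint an enum gives, here is a small Python sketch. The class and its two members are hypothetical stand-ins for whatever permissible values the schema ends up defining; this is not generated from the schema.

```python
from enum import Enum


class DataObjectType(Enum):
    """Hypothetical subset of curated types; member names are illustrative."""

    METAGENOME_ASSEMBLY = "metagenome_assembly"
    FUNCTIONAL_ANNOTATION_GFF = "functional_annotation_gff"


def parse_data_object_type(value: str) -> DataObjectType:
    """Return the enum member for value; raises ValueError if not curated."""
    return DataObjectType(value)


print(parse_data_object_type("metagenome_assembly").name)

try:
    parse_data_object_type("some_uncurated_type")
except ValueError as err:
    print(f"rejected: {err}")
```

Any value outside the curated list is rejected at parse time, which is the extra control an enum gives over a free-form value.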

SamuelPurvine · 2021-05-27T19:40:49Z

Stab at proteomics file descriptors:
MSGFjobs_MASIC_resultant.tsv: Peptide-to-Spectrum Matches and LC-MS abundances
Peptide_Report.tsv: List of peptides and associated abundances filtered by FDR
Protein_Report.tsv: List of proteins derived from Peptide_Report with summed abundances
QC_Metrics.tsv: Report of Peptide-to-Spectrum matching tool performance statistics

ssarrafan · 2021-06-02T23:27:18Z

@wdduncan let me know if this should be assigned to someone else. I've removed the other assignees per discussion with Kjiersten.

jbeezley · 2021-06-04T16:59:28Z

This issue will likely end up as a blocker for microbiomedata/nmdc-server#354 because we need to know the file types inside the database to perform the queries. I can do most of the engineering using the mapping from the client side code, but I don't think those values are necessarily what we want exposed in the bulk download UI.

dehays · 2021-06-10T21:37:17Z

@wdduncan Can the change that Chris describes above be included among those schema changes you will be completing for this schema version? I can work with Shane, Sam and others to get values for those descriptions.

wdduncan · 2021-06-10T21:56:29Z

I can make the changes. Can you get me a list of file types to include in the enum?

wdduncan · 2021-06-11T23:36:19Z

In the schema, I've created the following list of file type enum values.
@SamuelPurvine @jbeezley @jeffbaumes @cmungall @turbomam Please let me know what I need to edit.

file_type_enum:
    permissible_values:
      assembly_contigs.fna: 
        description: Final assembly contigs fasta.
      assembly_scaffolds.fna:
        description: Final assembly scaffolds fasta.
      assembly.agp:
        description: An AGP format file describes the assembly.
      filterStats.txt:
        description: Reads QC summary statistics
      filtered.fastq.gz:
        description: Reads QC result fastq (clean data)
      mapping_stats.txt:
        description: Assembled contigs coverage information
      pairedMapped_sorted.bam:
        description: Sorted bam file of reads mapping back to the final assembly.
      KO TSV:
        description: Tab delimited file for KO annotation.
      EC TSV:
        description: Tab delimited file for EC annotation.
      Protein FAA:
        description: FASTA amino acid file for annotated proteins.
      MSGFjobs_MASIC_resultant.tsv:
        description: Tab delimited file of unfiltered metaproteomics results, both identifications and abundances.
      Peptide_Report.tsv:
        description: Tab delimited file of peptide results filtered to ~5% FDR, including protein and abundance information.
      Protein_Report.tsv:
        description: Tab delimited file of protein results derived from ~5% FDR filtered peptide data, including aggregated abundance information.
      QC_metrics.tsv:
        description: Tab delimited file of aggregate statistics derived from workflow results.

dehays · 2021-06-13T22:51:14Z

@jeffbaumes @jbeezley Can you remind me why the client is using a mapping from the data object name attribute to a set of descriptions instead of using the description attribute of each data object? (You must be doing a regex as part of that mapping, considering name values like "gold:Gp0324006_Functional annotation GFF file".)

Bill has added the file type enum above to the schema, but there isn't a file type attribute on data objects (yet). Seems like you would need that.

jbeezley · 2021-06-14T16:13:33Z

> Can you remind me why the client is using a mapping from the data object name attribute to a set of descriptions instead of using the description attribute of each data object?

I'm not certain, but I suspect it's because the descriptions that exist in the database are different from what we want displayed. E.g. in the database we have "Filtered read data for gold:Gp0115663", but the client-side code replaces it with "Reads QC result fastq (clean data)".

jeffbaumes · 2021-06-14T16:58:15Z

The main reason IMO is that file type and file description should be separate things in the UI, but right now they are conflated as the same thing. Should we make two distinct columns in the table for these?

dehays · 2021-06-14T18:02:40Z

@jeffbaumes I think right now you are inferring a 'file type' by looking at the end of the data object name attribute value - then mapping that to a description to display in the UI. The data object description attribute is populated on all data objects, but is not used.

I spoke with Jon about this this morning. To filter by file type in the bulk download feature, there needs to be a file type attribute to filter on. To get there, we need to (1) check that the enumerated type values that Bill has are the ones people would want to see in the UI, and (2) add a 'file_type' attribute to each data object and populate it.

For the related issue of what description to display - if the data objects have file types, then Bill's file type enum can be used to obtain (and manage) the descriptions to display. This leaves the question of why we need to have a description attribute on the data objects if it is not being used anywhere.
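
A sketch of the filtering step for bulk download, assuming data objects carry a `file_type` attribute. That attribute does not exist yet, and the second data object name, both type values, and the function name are made up for illustration.

```python
# Hypothetical data objects; the "file_type" attribute does not exist yet
# and its values here are illustrative assumptions.
DATA_OBJECTS = [
    {"name": "gold:Gp0324006_Functional annotation GFF file",
     "file_type": "functional_annotation_gff"},
    {"name": "gold:Gp0115663_Filtered read data",
     "file_type": "filtered_fastq"},
]


def select_for_download(data_objects, wanted_types):
    """Keep only data objects whose file_type is one of wanted_types."""
    wanted = set(wanted_types)
    return [obj for obj in data_objects if obj.get("file_type") in wanted]


selected = select_for_download(DATA_OBJECTS, ["functional_annotation_gff"])
print([obj["name"] for obj in selected])
```

Filtering on an explicit attribute avoids re-deriving the type from the end of each name at query time.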

wdduncan · 2021-06-14T18:26:52Z

Just for clarity: the slot/property that will hold the enum is named data object type.

emileyfadrosh · 2021-06-17T04:57:09Z

Reopening. There are two separate issues that need to be resolved:

1. Updates are needed for all workflows to include the appropriate file types and descriptions in the JSONs directly. This would hopefully address the issue for all newly processed data and would enable correct, unambiguous display in the portal. (Request to address this next sprint, please.)
2. For existing data in the portal, we need a solution that would work for display and bulk download (Portal Download - bulk nmdc-server#354).

For #2: we have a table (https://docs.google.com/document/d/1uq6J__NbCUezsD16VTVJa-3_HJn5DacASBmFw3xQ3Bc/edit?usp=sharing) that lists out the current display for Data Object Type and Data Object Description. The third column, File Type Enum, is an attempt to address what would need to be displayed in the portal. I am not sure if this is what you had in mind @wdduncan @dehays? Hopefully I am not further confusing the issue!

dwinston · 2021-06-30T21:57:39Z

data_object_type values are now in mongo data_object_set documents. The values are id values from the file_type_enum mongo collection, and documents there have name and description fields for display as “Data Object Type” and “Data Object Description”, respectively, in the portal.

Because the file_type_enum collection is small (currently 30 documents, i.e. 30 data object types), it seems apt to post it here so the format is clear. Each document holds a mongo filter document (encoded as a JSON string) that determines which data_object_set documents are assigned the corresponding file_type_enum.id as their data_object_type.

/* 1 */
{
    "name" : "FT ICR-MS analysis results",
    "description" : "FT ICR-MS-based metabolite assignment results table",
    "filter" : "{\"url\": {\"$regex\": \"nom\\\\/results\"}, \"description\": {\"$regex\": \"FT ICR-MS\"}}",
    "id" : "wvc6-ya49-36"
}

/* 2 */
{
    "name" : "GC-MS Metabolomics Results",
    "description" : "GC-MS-based metabolite assignment results table",
    "filter" : "{\"url\": {\"$regex\": \"metabolomics\\\\/results\"}}",
    "id" : "cnem-v49k-65"
}

/* 3 */
{
    "name" : "Metaproteomics Workflow Statistics",
    "description" : "Aggregate workflow statistics file",
    "filter" : "{\"url\": {\"$regex\": \"QC_Metrics.tsv\"}}",
    "id" : "c1rq-13k4-88"
}

/* 4 */
{
    "name" : "Protein Report",
    "description" : "Filtered protein report file",
    "filter" : "{\"url\": {\"$regex\": \"Protein_Report.tsv\"}}",
    "id" : "9y8d-cev3-02"
}

/* 5 */
{
    "name" : "Peptide Report",
    "description" : "Filtered peptide report file",
    "filter" : "{\"url\": {\"$regex\": \"Peptide_Report.tsv\"}}",
    "id" : "wwfz-pgpk-29"
}

/* 6 */
{
    "name" : "Unfiltered Metaproteomics Results",
    "description" : "MSGFjobs and MASIC output file",
    "filter" : "{\"url\": {\"$regex\": \"MSGFjobs_MASIC_resultant.tsv\"}}",
    "id" : "wvdw-zpdr-48"
}

/* 7 */
{
    "name" : "Read Count and RPKM",
    "description" : "Annotation read count and RPKM per feature JSON",
    "filter" : "{\"url\": {\"$regex\": \"metat_out_json\\\\/output.json\"}}",
    "id" : "tqf0-k3hq-33"
}

/* 8 */
{
    "name" : "QC non-rRNA R2",
    "description" : "QC removed rRNA reads (R2) fastq",
    "filter" : "{\"url\": {\"$regex\": \"filtered_R2.fastq\"}}",
    "id" : "z7hs-t9kh-91"
}

/* 9 */
{
    "name" : "QC non-rRNA R1",
    "description" : "QC removed rRNA reads (R1) fastq",
    "filter" : "{\"url\": {\"$regex\": \"filtered_R1.fastq\"}}",
    "id" : "1hch-yta4-82"
}

/* 10 */
{
    "name" : "Metagenome Bins",
    "description" : "Metagenome bin contigs fasta",
    "filter" : "{\"url\": {\"$regex\": \"bins\\\\.\\\\d+\\\\.fa\"}}",
    "id" : "by1h-3550-40"
}

/* 11 */
{
    "name" : "CheckM Statistics",
    "description" : "CheckM statistics report",
    "filter" : "{\"url\": {\"$regex\": \"checkm_qa.out\"}}",
    "id" : "3bcs-dgye-29"
}

/* 12 */
{
    "name" : "Krona Plot",
    "description" : "[GOTTCHA2] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"gottcha2.*krona.html\"}}",
    "id" : "efm2-ax0y-52"
}

/* 13 */
{
    "name" : "Krona Plot",
    "description" : "[Kraken2] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*krona.html\"}}",
    "id" : "5m0p-gn5j-53"
}

/* 14 */
{
    "name" : "Classification Report",
    "description" : "[Kraken2] output report file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*report.tsv\"}}",
    "id" : "33w2-pxn6-17"
}

/* 15 */
{
    "name" : "Taxonomic Classification",
    "description" : "[Kraken2] output read classification file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*classification.tsv\"}}",
    "id" : "0hcf-0qqe-98"
}

/* 16 */
{
    "name" : "Krona Plot",
    "description" : "[Centrifuge] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*krona.html\"}}",
    "id" : "72n0-3j60-31"
}

/* 17 */
{
    "name" : "Classification Report",
    "description" : "[Centrifuge] output report file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*report.tsv\"}}",
    "id" : "sak3-sdrj-80"
}

/* 18 */
{
    "name" : "Taxonomic Classification",
    "description" : "[Centrifuge] output read classification file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*classification.tsv\"}}",
    "id" : "rwnf-51gr-63"
}

/* 19 */
{
    "name" : "Structural Annotation GFF",
    "description" : "GFF3 format file with structural annotations",
    "filter" : "{\"url\": {\"$regex\": \"annotation\\\\/.*structural_annotation\\\\.gff\"}}",
    "id" : "m180-dayy-20"
}

/* 20 */
{
    "name" : "Functional Annotation GFF",
    "description" : "GFF3 format file with functional annotations",
    "filter" : "{\"url\": {\"$regex\": \"annotation\\\\/.*functional_annotation\\\\.gff\"}}",
    "id" : "esw9-xah9-64"
}

/* 21 */
{
    "name" : "Annotation Amino Acid FASTA",
    "description" : "FASTA amino acid file for annotated proteins",
    "filter" : "{\"url\": {\"$regex\": \"annotation.*\\\\.faa\"}}",
    "id" : "a6n7-v14z-71"
}

/* 22 */
{
    "name" : "Annotation Enzyme Commission",
    "description" : "Tab delimited file for EC annotation",
    "filter" : "{\"url\": {\"$regex\": \"_ec.tsv\"}}",
    "id" : "xnqj-mxhp-70"
}

/* 23 */
{
    "name" : "Annotation KEGG Orthology",
    "description" : "Tab delimited file for KO annotation",
    "filter" : "{\"url\": {\"$regex\": \"_ko.tsv\"}}",
    "id" : "6w1k-fy0w-38"
}

/* 24 */
{
    "name" : "Assembly Coverage BAM",
    "description" : "Sorted bam file of reads mapping back to the final assembly",
    "filter" : "{\"url\": {\"$regex\": \"pairedMapped_sorted.bam\"}}",
    "id" : "9mqn-qnrp-47"
}

/* 25 */
{
    "name" : "Assembly AGP",
    "description" : "An AGP format file describes the assembly",
    "filter" : "{\"url\": {\"$regex\": \"assembly.agp\"}}",
    "id" : "70rw-5ye3-40"
}

/* 26 */
{
    "name" : "Assembly Scaffolds",
    "description" : "Final assembly scaffolds fasta",
    "filter" : "{\"url\": {\"$regex\": \"assembly_scaffolds.fna\"}}",
    "id" : "g558-0ejx-08"
}

/* 27 */
{
    "name" : "Assembly Contigs",
    "description" : "Final assembly contigs fasta",
    "filter" : "{\"url\": {\"$regex\": \"assembly_contigs.fna\"}}",
    "id" : "8n5m-t7hj-29"
}

/* 28 */
{
    "name" : "Assembly Coverage Stats",
    "description" : "Assembled contigs coverage information",
    "filter" : "{\"url\": {\"$regex\": \"mapping_stats.txt\"}}",
    "id" : "qdq8-xm5g-78"
}

/* 29 */
{
    "name" : "Filtered Sequencing Reads",
    "description" : "Reads QC result fastq (clean data)",
    "filter" : "{\"url\": {\"$regex\": \"filtered.fastq.gz\"}}",
    "id" : "ty5w-9zbs-90"
}

/* 30 */
{
    "name" : "QC Statistics",
    "description" : "Reads QC summary statistics",
    "filter" : "{\"url\": {\"$regex\": \"filterStats.txt\"}}",
    "id" : "trtt-76zn-67"
}
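
For readers unfamiliar with the format, here is a rough Python sketch of how the filter strings in the listing above could be evaluated outside of mongo. It assumes first-match-wins assignment and that Python's `re.search` is close enough to MongoDB's `$regex` for these patterns; neither assumption is confirmed in this thread, and the function names are made up. Only the `{"field": {"$regex": ...}}` form is handled.

```python
import json
import re

# Two entries copied from the listing above (name and description omitted).
# Raw strings hold the once-decoded "filter" value, which is itself JSON.
FILE_TYPES = [
    {"id": "esw9-xah9-64",
     "filter": r'{"url": {"$regex": "annotation\\/.*functional_annotation\\.gff"}}'},
    {"id": "wvc6-ya49-36",
     "filter": r'{"url": {"$regex": "nom\\/results"}, "description": {"$regex": "FT ICR-MS"}}'},
]


def matches(filter_json: str, doc: dict) -> bool:
    """True if every field condition in the filter matches doc (implicit AND)."""
    conditions = json.loads(filter_json)
    return all(
        re.search(cond["$regex"], str(doc.get(field, ""))) is not None
        for field, cond in conditions.items()
    )


def assign_type(doc: dict, file_types: list):
    """Return the id of the first file type whose filter matches, else None."""
    for file_type in file_types:
        if matches(file_type["filter"], doc):
            return file_type["id"]
    return None


# URL and description taken from the example data object earlier in the thread.
data_object = {
    "url": "https://data.microbiomedata.org/data/503568_190750/annotation/503568_190750_functional_annotation.gff",
    "description": "Functional annotation GFF file for gold:Gp0324006",
}
print(assign_type(data_object, FILE_TYPES))
```

In the database itself the decoded filter would presumably be passed directly to a mongo query rather than evaluated in Python like this.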

jeffbaumes assigned jbeezley Apr 6, 2021

jeffbaumes assigned dehays Apr 6, 2021

dehays transferred this issue from microbiomedata/nmdc-metadata Apr 9, 2021

ssarrafan added this to the Sprint 1 milestone Apr 12, 2021

jeffbaumes assigned wdduncan Apr 15, 2021

ssarrafan removed this from To do in NMDC April 2021 Sprint Apr 27, 2021

ssarrafan added this to To do in NMDC May 2021 Sprint via automation Apr 27, 2021

ssarrafan modified the milestones: Sprint 1, Sprint 2 Apr 27, 2021

wdduncan added the LARGE 7-10 days label May 5, 2021

dehays mentioned this issue May 10, 2021

Portal Help - search items microbiomedata/nmdc-server#357

Closed

dehays mentioned this issue May 26, 2021

Add descriptive info to NOM data files are displayed microbiomedata/nmdc-metadata#357

Open

dehays mentioned this issue May 28, 2021

existing NMDC entities attributes to make required #50

Closed

14 tasks

wdduncan removed this from To do in NMDC May 2021 Sprint May 29, 2021

wdduncan added this to To do in NMDC June 2021 Sprint via automation May 29, 2021

ssarrafan unassigned jbeezley and dehays Jun 2, 2021

ssarrafan modified the milestones: Sprint 2, Sprint 3 Jun 4, 2021

wdduncan mentioned this issue Jun 4, 2021

make NMDC data object a ga4gh DRS object? #54

Open

wdduncan mentioned this issue Jun 11, 2021

issue-20: Move file descriptors to the offical description field of data objects #66

Merged

wdduncan moved this from To do to In progress in NMDC June 2021 Sprint Jun 11, 2021

dehays mentioned this issue Jun 13, 2021

Portal Download - bulk microbiomedata/nmdc-server#354

Closed

wdduncan closed this as completed in #66 Jun 16, 2021

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 16, 2021

emileyfadrosh reopened this Jun 17, 2021

NMDC June 2021 Sprint automation moved this from Done to In progress Jun 17, 2021

dwinston closed this as completed Jun 30, 2021

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 30, 2021
