Move file descriptors to the official description field of data objects #20

Closed
jeffbaumes opened this issue Apr 1, 2021 · 25 comments · Fixed by #66

@jeffbaumes

Currently the mapping is in the portal repo but should probably be associated with data objects upstream.

@jeffbaumes
Author

jeffbaumes commented Apr 6, 2021

Here is the current mapping of descriptions in the client based on file name. It seems these descriptions should be added as part of the analysis output JSON instead of in the client:

https://github.com/microbiomedata/nmdc-server/blob/master/web/src/v2/components/DataObjectTable.vue#L14-L26

@jeffbaumes
Author

This involves discussing JSON format on analysis output (requiring descriptions on data objects). @dehays or @dwinston do you have thoughts here?

@dwinston
Collaborator

dwinston commented Apr 6, 2021

I recommend that each data object reference a file type, that can in turn be de-referenced to obtain a string description. I don't think data objects should e.g. store full string descriptions such as "Reads QC result fastq (clean data)", as any cosmetic change to these would require multiple-document updates.

Thus, the mapping currently in the front-end code would essentially live upstream and be consumed by the portal.
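
A minimal sketch of that indirection, assuming a hypothetical `FILE_TYPES` registry and a hypothetical `file_type` field on data objects (neither name is settled in this thread, and the type IDs are made up for illustration):

```python
# Hypothetical registry: file type ID -> display description.
# The IDs are illustrative; the descriptions are ones quoted in this thread.
FILE_TYPES = {
    "reads_qc_fastq": "Reads QC result fastq (clean data)",
    "assembly_contigs": "Final assembly contigs fasta",
}

# Data objects store only the type ID, never the description string.
data_objects = [
    {"id": "example:1", "file_type": "reads_qc_fastq"},
    {"id": "example:2", "file_type": "assembly_contigs"},
    {"id": "example:3", "file_type": "reads_qc_fastq"},
]


def describe(data_object, registry=FILE_TYPES):
    """De-reference a data object's file type to its description."""
    return registry.get(data_object["file_type"], "Unknown file type")


descriptions = [describe(obj) for obj in data_objects]
```

With this layout, a cosmetic change to a description touches one registry entry rather than every data object of that type.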

@jeffbaumes
Author

I like this idea @dwinston. So this becomes (1) a schema change that adds a file type field to data objects, (2) an official registry that maps data types to description strings (minimally a simple JSON or CSV file). It would seem to make sense for the portal database to contain the description mapping and file type of each data object.

@dwinston do you have ideas on how to manage the data object type registry and ensure analysis workflows tag each data object appropriately?

@dwinston
Collaborator

dwinston commented Apr 6, 2021

On ensuring appropriate tagging of data objects: this could be part of workflow post-processing -- i.e. this needn't be done as part of a workflow's data-intensive main logic. However, workflow authors would know best about how to type-annotate their workflow data products. They needn't be responsible for implementing the tagging logic, but they would need to be consulted about the logic's correctness (i.e. code review).

On the data object type registry: I think that a collaborative google sheets session among workflow authors could knock out creation in under an hour. The result could live anywhere, perhaps as a CSV in the nmdc-metadata repo, and updated as needed.

@jeffbaumes
Author

> On the data object type registry: I think that a collaborative google sheets session among workflow authors could knock out creation in under an hour.

There is a start to that effort in this document, which our current mapping derives from:

https://docs.google.com/document/d/1uq6J__NbCUezsD16VTVJa-3_HJn5DacASBmFw3xQ3Bc/edit?usp=sharing

> The result could live anywhere, perhaps as a CSV in the nmdc-metadata repo, and updated as needed.

A CSV in nmdc-metadata sounds reasonable. Right now the google doc only contains exemplar file names, from which Brandon derived a regex/prefix name matching. Explicit file type IDs applied to analysis output data objects would be cleaner. @dwinston do you want to take a crack at converting to an official CSV and assigning IDs?
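
To make the CSV idea concrete, here is a rough sketch of what reading such a registry could look like. The column names and the IDs are assumptions for illustration only; nothing here reflects an agreed format.

```python
import csv
import io

# Hypothetical registry contents; IDs and column names are illustrative only.
REGISTRY_CSV = """id,example_file_name,description
assembly_contigs,assembly_contigs.fna,Final assembly contigs fasta
filter_stats,filterStats.txt,Reads QC summary statistics
"""


def load_registry(text):
    """Parse CSV text into a dict keyed by file type ID, rejecting duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    registry = {}
    for row in reader:
        if row["id"] in registry:
            raise ValueError(f"duplicate file type ID: {row['id']}")
        registry[row["id"]] = row
    return registry


registry = load_registry(REGISTRY_CSV)
```

Rejecting duplicate IDs at load time is one cheap way to keep a hand-edited registry consistent.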

@cmungall
Collaborator

cmungall commented Apr 7, 2021

We can do this in the schema

e.g.

file_type_enum:
  permissible_values:
    assembly_contigs.fna:
      description: Final assembly contigs fasta

@dehays transferred this issue from microbiomedata/nmdc-metadata Apr 9, 2021
@ssarrafan added this to the Sprint 1 milestone Apr 12, 2021
@ssarrafan
Collaborator

Moving to May sprint per Bill. More discussions required before estimating work.

@ssarrafan removed this from To do in NMDC April 2021 Sprint Apr 27, 2021
@ssarrafan added this to To do in NMDC May 2021 Sprint via automation Apr 27, 2021
@ssarrafan modified the milestones: Sprint 1, Sprint 2 Apr 27, 2021
@wdduncan added the LARGE 7-10 days label May 5, 2021
@scanon
Collaborator

scanon commented May 17, 2021

I like @cmungall 's proposal. We should probably have the workflow owners help populate this.

@scanon
Collaborator

scanon commented May 17, 2021

{
  "description": "Functional annotation GFF file for gold:Gp0324006",
  "url": "https://data.microbiomedata.org/data/503568_190750/annotation/503568_190750_functional_annotation.gff",
  "md5_checksum": "a360c42d1403084110719be0d6d8d47a",
  "file_size_bytes": 51348736,
  "id": "nmdc:a360c42d1403084110719be0d6d8d47a",
  "name": "gold:Gp0324006_Functional annotation GFF file",
  "XXXXX": "functional_annotation_gff"
}

So I don't know if "type" would be the best one to use or "data_object_type". It sounds like type would be best but I think it should be constrained to some curated list of types.

@ssarrafan
Collaborator

@wdduncan please update microbiomedata/nmdc-server#357 when this is completed so we know we can move forward with the next step of this fix. Thank you!

@cmungall
Collaborator

I propose we change

  data object type:
    range: controlled term value
    description: >-
      The type of data object
    examples:
      - value: metagenome_assembly

by replacing the range "controlled term value" with an enum that we define.

(note: the two are actually similar but we have more fine-grained control with an enum, and we can still map our enum permissible values to terminologies like EDAM, if we like)
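
As a rough illustration of what an enum range buys on the consuming side, the sketch below mirrors a small enum in Python and rejects values outside it. The class name, member names, and `parse_data_object_type` helper are hypothetical; the two values are the ones that appear as examples in this thread.

```python
from enum import Enum


class DataObjectType(Enum):
    """Hypothetical mirror of a schema enum; not the actual NMDC enum."""

    METAGENOME_ASSEMBLY = "metagenome_assembly"
    FUNCTIONAL_ANNOTATION_GFF = "functional_annotation_gff"


def parse_data_object_type(raw):
    """Return the enum member for raw, or raise ValueError listing allowed values."""
    try:
        return DataObjectType(raw)
    except ValueError:
        allowed = ", ".join(sorted(member.value for member in DataObjectType))
        raise ValueError(
            f"{raw!r} is not a permissible value; expected one of: {allowed}"
        ) from None


parsed = parse_data_object_type("metagenome_assembly")
```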

@SamuelPurvine

Stab at proteomics file descriptors:
MSGFjobs_MASIC_resultant.tsv: Peptide-to-Spectrum Matches and LC-MS abundances
Peptide_Report.tsv: List of peptides and associated abundances filtered by FDR
Protein_Report.tsv: List of proteins derived from Peptide_Report with summed abundances
QC_Metrics.tsv: Report of Peptide-to-Spectrum matching tool performance statistics

@ssarrafan
Collaborator

@wdduncan let me know if this should be assigned to someone else. I've removed the other assignees per discussion with Kjiersten.

@ssarrafan modified the milestones: Sprint 2, Sprint 3 Jun 4, 2021
@jbeezley

jbeezley commented Jun 4, 2021

This issue will likely end up as a blocker for microbiomedata/nmdc-server#354 because we need to know the file types inside the database to perform the queries. I can do most of the engineering using the mapping from the client side code, but I don't think those values are necessarily what we want exposed in the bulk download UI.

@dehays
Contributor

dehays commented Jun 10, 2021

@wdduncan Can the change that Chris describes above be included among those schema changes you will be completing for this schema version? I can work with Shane, Sam and others to get values for those descriptions.

@wdduncan
Contributor

I can make the changes. Can you get me a list of file types to include in the enum?

@wdduncan
Contributor

In the schema, I've created the following list of file enums.
@SamuelPurvine @jbeezley @jeffbaumes @cmungall @turbomam Please let me know what I need to edit.

file_type_enum:
    permissible_values:
      assembly_contigs.fna: 
        description: Final assembly contigs fasta.
      assembly_scaffolds.fna:
        description: Final assembly scaffolds fasta.
      assembly.agp:
        description: An AGP format file describes the assembly.
      filterStats.txt:
        description: Reads QC summary statistics
      filtered.fastq.gz:
        description: Reads QC result fastq (clean data)
      mapping_stats.txt:
        description: Assembled contigs coverage information
      pairedMapped_sorted.bam:
        description: Sorted bam file of reads mapping back to the final assembly.
      KO TSV:
        description: Tab delimited file for KO annotation.
      EC TSV:
        description: Tab delimited file for EC annotation.
      Protein FAA:
        description: FASTA amino acid file for annotated proteins.
      MSGFjobs_MASIC_resultant.tsv:
        description: Tab delimited file of unfiltered metaproteomics results, both identifications and abundances.
      Peptide_Report.tsv:
        description: Tab delimited file of peptide results filtered to ~5% FDR, including protein and abundance information.
      Protein_Report.tsv:
        description: Tab delimited file of protein results derived from ~5% FDR filtered peptide data, including aggregated abundance information.
      QC_Metrics.tsv:
        description: Tab delimited file of aggregate statistics derived from workflow results.

@dehays
Contributor

dehays commented Jun 13, 2021

@jeffbaumes @jbeezley Can you remind me why the client is using a mapping from the data object name attribute to a set of descriptions instead of using the description attribute of each data object? (You must be doing a regex as part of that mapping, considering name values like "gold:Gp0324006_Functional annotation GFF file".)

Bill has added the file type enum above to the schema, but there isn't a file type attribute on data objects (yet). Seems like you would need that.

@jbeezley

> Can you remind me why the client is using a mapping from the data object name attribute to a set of descriptions instead of using the description attribute of each data object?

I'm not certain, but I suspect it's because the descriptions that exist in the database are different from what we want displayed. E.g., in the database we have "Filtered read data for gold:Gp0115663", but the client-side code replaces it with "Reads QC result fastq (clean data)".

@jeffbaumes
Author

The main reason IMO is that file type and file description should be separate things in the UI, but right now they are conflated as the same thing. Should we make two distinct columns in the table for these?

@dehays
Contributor

dehays commented Jun 14, 2021

@jeffbaumes I think right now you are inferring a 'file type' by looking at the end of the data object name attribute value - then mapping that to a description to display in the UI. The data object description attribute is populated on all data objects, but is not used.
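
For context, name-based inference of this kind could look roughly like the sketch below. The table entries and the `infer_description` helper are illustrative assumptions, not the portal's actual mapping (that lives in the client code linked earlier in this thread).

```python
# Hypothetical table: ending of the data object name -> description to display.
NAME_SUFFIX_TO_DESCRIPTION = {
    "Functional annotation GFF file": "GFF3 format file with functional annotations",
    "Structural annotation GFF file": "GFF3 format file with structural annotations",
}


def infer_description(name):
    """Guess a display description from how the data object name ends."""
    for suffix, description in NAME_SUFFIX_TO_DESCRIPTION.items():
        if name.endswith(suffix):
            return description
    return None  # unknown naming pattern: nothing can be inferred


inferred = infer_description("gold:Gp0324006_Functional annotation GFF file")
```

Any change to how names are written makes this return `None` without warning, which is the fragility an explicit type attribute would avoid.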

I spoke with Jon about this this morning. To filter by file type in the bulk download feature, there needs to be a file type attribute to filter on. That requires two steps: 1) check that the enumerated type values Bill has are the ones people would want to see in the UI, and 2) add a 'file_type' attribute to each data object and populate it.

For the related issue of what description to display - if the data objects have file types, then Bill's file type enum can be used to obtain (and manage) the descriptions to display. This leaves the question of why we need to have a description attribute on the data objects if it is not being used anywhere.
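
A small sketch of the filtering step for bulk download, assuming data objects carry the proposed 'file_type' attribute. The sample objects, their URLs, and the `select_urls` helper are hypothetical.

```python
# Hypothetical data objects carrying the proposed 'file_type' attribute.
data_objects = [
    {"id": "example:1", "url": "https://example.org/s1/assembly_contigs.fna",
     "file_type": "assembly_contigs.fna"},
    {"id": "example:2", "url": "https://example.org/s1/filterStats.txt",
     "file_type": "filterStats.txt"},
    {"id": "example:3", "url": "https://example.org/s2/assembly_contigs.fna",
     "file_type": "assembly_contigs.fna"},
    {"id": "example:4", "url": "https://example.org/s2/untyped.bin"},  # not yet typed
]


def select_urls(objects, wanted_types):
    """Return download URLs for objects whose file_type is in wanted_types."""
    wanted = set(wanted_types)
    return [obj["url"] for obj in objects if obj.get("file_type") in wanted]


contig_urls = select_urls(data_objects, ["assembly_contigs.fna"])
```

Objects without a 'file_type' are simply skipped here, so existing data would need to be back-filled before it shows up in such a query.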

@wdduncan
Contributor

Just for clarity: the slot/property that will hold the enum is named "data object type".

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 16, 2021
@emileyfadrosh reopened this Jun 17, 2021
NMDC June 2021 Sprint automation moved this from Done to In progress Jun 17, 2021
@emileyfadrosh
Contributor

Reopening. There are two separate issues that need to be resolved:

  1. Updates are needed for all workflows to include the appropriate file types and descriptions in the JSONs directly. This would hopefully address the issue for all newly processed data and would enable correct, unambiguous display in the portal. (Request to address this next sprint, please.)

  2. For existing data in the portal, we need a solution that would work for display and bulk download (Portal Download - bulk nmdc-server#354).

For #2: we have a table (https://docs.google.com/document/d/1uq6J__NbCUezsD16VTVJa-3_HJn5DacASBmFw3xQ3Bc/edit?usp=sharing) that lists out the current display for Data Object Type and Data Object Description. The third column, File Type Enum, is an attempt to address what would need to be displayed in the portal. I am not sure if this is what you had in mind @wdduncan @dehays? Hopefully I am not further confusing the issue!

@dwinston
Collaborator

data_object_type values are now in mongo data_object_set documents. The values are id values from the file_type_enum mongo collection, and documents there have name and description fields for display as “Data Object Type” and “Data Object Description”, respectively, in the portal.

Because the file_type_enum collection is small (currently 30 documents, i.e. 30 data object types), it seems apt to post it here so the format is clear. Each document holds a mongo filter document (encoded as a JSON string) that determines which data_object_set documents are assigned the corresponding file_type_enum.id as their data_object_type.

/* 1 */
{
    "name" : "FT ICR-MS analysis results",
    "description" : "FT ICR-MS-based metabolite assignment results table",
    "filter" : "{\"url\": {\"$regex\": \"nom\\\\/results\"}, \"description\": {\"$regex\": \"FT ICR-MS\"}}",
    "id" : "wvc6-ya49-36"
}

/* 2 */
{
    "name" : "GC-MS Metabolomics Results",
    "description" : "GC-MS-based metabolite assignment results table",
    "filter" : "{\"url\": {\"$regex\": \"metabolomics\\\\/results\"}}",
    "id" : "cnem-v49k-65"
}

/* 3 */
{
    "name" : "Metaproteomics Workflow Statistics",
    "description" : "Aggregate workflow statistics file",
    "filter" : "{\"url\": {\"$regex\": \"QC_Metrics.tsv\"}}",
    "id" : "c1rq-13k4-88"
}

/* 4 */
{
    "name" : "Protein Report",
    "description" : "Filtered protein report file",
    "filter" : "{\"url\": {\"$regex\": \"Protein_Report.tsv\"}}",
    "id" : "9y8d-cev3-02"
}

/* 5 */
{
    "name" : "Peptide Report",
    "description" : "Filtered peptide report file",
    "filter" : "{\"url\": {\"$regex\": \"Peptide_Report.tsv\"}}",
    "id" : "wwfz-pgpk-29"
}

/* 6 */
{
    "name" : "Unfiltered Metaproteomics Results",
    "description" : "MSGFjobs and MASIC output file",
    "filter" : "{\"url\": {\"$regex\": \"MSGFjobs_MASIC_resultant.tsv\"}}",
    "id" : "wvdw-zpdr-48"
}

/* 7 */
{
    "name" : "Read Count and RPKM",
    "description" : "Annotation read count and RPKM per feature JSON",
    "filter" : "{\"url\": {\"$regex\": \"metat_out_json\\\\/output.json\"}}",
    "id" : "tqf0-k3hq-33"
}

/* 8 */
{
    "name" : "QC non-rRNA R2",
    "description" : "QC removed rRNA reads (R2) fastq",
    "filter" : "{\"url\": {\"$regex\": \"filtered_R2.fastq\"}}",
    "id" : "z7hs-t9kh-91"
}

/* 9 */
{
    "name" : "QC non-rRNA R1",
    "description" : "QC removed rRNA reads (R1) fastq",
    "filter" : "{\"url\": {\"$regex\": \"filtered_R1.fastq\"}}",
    "id" : "1hch-yta4-82"
}

/* 10 */
{
    "name" : "Metagenome Bins",
    "description" : "Metagenome bin contigs fasta",
    "filter" : "{\"url\": {\"$regex\": \"bins\\\\.\\\\d+\\\\.fa\"}}",
    "id" : "by1h-3550-40"
}

/* 11 */
{
    "name" : "CheckM Statistics",
    "description" : "CheckM statistics report",
    "filter" : "{\"url\": {\"$regex\": \"checkm_qa.out\"}}",
    "id" : "3bcs-dgye-29"
}

/* 12 */
{
    "name" : "Krona Plot",
    "description" : "[GOTTCHA2] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"gottcha2.*krona.html\"}}",
    "id" : "efm2-ax0y-52"
}

/* 13 */
{
    "name" : "Krona Plot",
    "description" : "[Kraken2] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*krona.html\"}}",
    "id" : "5m0p-gn5j-53"
}

/* 14 */
{
    "name" : "Classification Report",
    "description" : "[Kraken2] output report file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*report.tsv\"}}",
    "id" : "33w2-pxn6-17"
}

/* 15 */
{
    "name" : "Taxonomic Classification",
    "description" : "[Kraken2] output read classification file",
    "filter" : "{\"url\": {\"$regex\": \"kraken2.*classification.tsv\"}}",
    "id" : "0hcf-0qqe-98"
}

/* 16 */
{
    "name" : "Krona Plot",
    "description" : "[Centrifuge] krona plot HTML file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*krona.html\"}}",
    "id" : "72n0-3j60-31"
}

/* 17 */
{
    "name" : "Classification Report",
    "description" : "[Centrifuge] output report file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*report.tsv\"}}",
    "id" : "sak3-sdrj-80"
}

/* 18 */
{
    "name" : "Taxonomic Classification",
    "description" : "[Centrifuge] output read classification file",
    "filter" : "{\"url\": {\"$regex\": \"centrifuge.*classification.tsv\"}}",
    "id" : "rwnf-51gr-63"
}

/* 19 */
{
    "name" : "Structural Annotation GFF",
    "description" : "GFF3 format file with structural annotations",
    "filter" : "{\"url\": {\"$regex\": \"annotation\\\\/.*structural_annotation\\\\.gff\"}}",
    "id" : "m180-dayy-20"
}

/* 20 */
{
    "name" : "Functional Annotation GFF",
    "description" : "GFF3 format file with functional annotations",
    "filter" : "{\"url\": {\"$regex\": \"annotation\\\\/.*functional_annotation\\\\.gff\"}}",
    "id" : "esw9-xah9-64"
}

/* 21 */
{
    "name" : "Annotation Amino Acid FASTA",
    "description" : "FASTA amino acid file for annotated proteins",
    "filter" : "{\"url\": {\"$regex\": \"annotation.*\\\\.faa\"}}",
    "id" : "a6n7-v14z-71"
}

/* 22 */
{
    "name" : "Annotation Enzyme Commission",
    "description" : "Tab delimited file for EC annotation",
    "filter" : "{\"url\": {\"$regex\": \"_ec.tsv\"}}",
    "id" : "xnqj-mxhp-70"
}

/* 23 */
{
    "name" : "Annotation KEGG Orthology",
    "description" : "Tab delimited file for KO annotation",
    "filter" : "{\"url\": {\"$regex\": \"_ko.tsv\"}}",
    "id" : "6w1k-fy0w-38"
}

/* 24 */
{
    "name" : "Assembly Coverage BAM",
    "description" : "Sorted bam file of reads mapping back to the final assembly",
    "filter" : "{\"url\": {\"$regex\": \"pairedMapped_sorted.bam\"}}",
    "id" : "9mqn-qnrp-47"
}

/* 25 */
{
    "name" : "Assembly AGP",
    "description" : "An AGP format file describes the assembly",
    "filter" : "{\"url\": {\"$regex\": \"assembly.agp\"}}",
    "id" : "70rw-5ye3-40"
}

/* 26 */
{
    "name" : "Assembly Scaffolds",
    "description" : "Final assembly scaffolds fasta",
    "filter" : "{\"url\": {\"$regex\": \"assembly_scaffolds.fna\"}}",
    "id" : "g558-0ejx-08"
}

/* 27 */
{
    "name" : "Assembly Contigs",
    "description" : "Final assembly contigs fasta",
    "filter" : "{\"url\": {\"$regex\": \"assembly_contigs.fna\"}}",
    "id" : "8n5m-t7hj-29"
}

/* 28 */
{
    "name" : "Assembly Coverage Stats",
    "description" : "Assembled contigs coverage information",
    "filter" : "{\"url\": {\"$regex\": \"mapping_stats.txt\"}}",
    "id" : "qdq8-xm5g-78"
}

/* 29 */
{
    "name" : "Filtered Sequencing Reads",
    "description" : "Reads QC result fastq (clean data)",
    "filter" : "{\"url\": {\"$regex\": \"filtered.fastq.gz\"}}",
    "id" : "ty5w-9zbs-90"
}

/* 30 */
{
    "name" : "QC Statistics",
    "description" : "Reads QC summary statistics",
    "filter" : "{\"url\": {\"$regex\": \"filterStats.txt\"}}",
    "id" : "trtt-76zn-67"
}
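
To illustrate how the filter strings above could be applied, here is a sketch that decodes a filter and tests it against a data object in plain Python. It is an approximation under stated assumptions: only the `$regex` operator is handled, Python's `re` stands in for MongoDB's regex engine, and the `matches` and `assign_type` helpers are hypothetical, not the code that populated the collection.

```python
import json
import re

# Three of the file_type_enum documents listed above (name/description omitted).
FILE_TYPE_ENUM = [
    {"id": "wvc6-ya49-36",
     "filter": r'{"url": {"$regex": "nom\\/results"}, "description": {"$regex": "FT ICR-MS"}}'},
    {"id": "by1h-3550-40",
     "filter": r'{"url": {"$regex": "bins\\.\\d+\\.fa"}}'},
    {"id": "esw9-xah9-64",
     "filter": r'{"url": {"$regex": "annotation\\/.*functional_annotation\\.gff"}}'},
]


def matches(filter_json, doc):
    """Return True if doc satisfies every {field: {"$regex": pattern}} clause."""
    conditions = json.loads(filter_json)
    for field, condition in conditions.items():
        value = doc.get(field)
        if not isinstance(value, str):
            return False  # missing or non-string field cannot match a regex
        if re.search(condition["$regex"], value) is None:
            return False
    return True


def assign_type(doc, enum_docs=FILE_TYPE_ENUM):
    """Return the id of the first file type whose filter matches, else None."""
    for enum_doc in enum_docs:
        if matches(enum_doc["filter"], doc):
            return enum_doc["id"]
    return None


# Data object fields taken from an earlier comment in this thread.
gff = {
    "description": "Functional annotation GFF file for gold:Gp0324006",
    "url": "https://data.microbiomedata.org/data/503568_190750/annotation/503568_190750_functional_annotation.gff",
}
gff_type = assign_type(gff)
```

Note that several of the filters leave `.` unescaped (for example `QC_Metrics.tsv`), so in those patterns `.` matches any single character rather than only a literal dot.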

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 30, 2021