Include isa-xlsx for ARC-specification 1.2 #76

Merged: HLWeil merged 18 commits into main from status_quo on Nov 9, 2023

Conversation

HLWeil (Member) commented Oct 13, 2023

This PR introduces the ISA-XLSX specification into the ARC specification.

Until now, the ARC specification mentioned ISA-Tab as a reference for the implementation of the experimental metadata files.
However, many differences between the ISA-Tab specification and our tool implementations have accumulated. I therefore propose the ISA-XLSX specification here.

closes #73
closes #71

Freymaurer (Contributor) left a comment

Please add information akin to ~"Any other column MAY be used in ISA-XLSX, but is not converted to ISA-JSON"

ISA-XLSX.md Outdated

## Inputs and Outputs

Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<InputNodeType>]`.
Member

They MUST be formatted in the pattern Input [<InputNodeType>].

"They MUST be formatted in the pattern Input [<InputNodeType>] and Output [<OutputNodeType>]." ?

Are InputNodeTypes defined?

Member

I think they are listed opaquely in the next sentences:

A Source MUST be indicated with the node type Source Name

for example, indicates that Source Name is an input node type.

I agree that this should be more obvious. There should be a sentence right after that lists the allowed input and output node types.

Member Author

@kMutagene @Brilator
Thanks a lot for your input. Can you check if my last commit clarifies it sufficiently?
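
For readers following this thread, here is a minimal Python sketch of the header pattern under discussion. The regular expression and the function name `parse_io_header` are illustrative assumptions, not part of the specification or of any ARC tooling. The sketch only checks the `Input [<NodeType>]` / `Output [<NodeType>]` shape and does not restrict which node types are allowed.

```python
import re

# Hypothetical pattern: "Input" or "Output", one space, then a bracketed node type.
IO_HEADER = re.compile(r"^(Input|Output) \[([^\[\]]+)\]$")


def parse_io_header(header: str):
    """Return (direction, node_type), or None if the header does not match."""
    match = IO_HEADER.match(header.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


print(parse_io_header("Input [Source Name]"))
print(parse_io_header("Output [Sample Name]"))
print(parse_io_header("Input"))  # no bracketed node type, so no match
```

A list of allowed node types, as requested above, could be layered on top by checking the second tuple element against a fixed set.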

ISA-XLSX.md Outdated
Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<InputNodeType>]`.


A `Source` MUST be indicated with the node type `Source Name`. `Sources` MUST not be used as `Output` nodes.
Member

Does this also imply that there MUST be a Source Name (somewhere in the ARC), i.e. that a Sample Name MUST NOT exist without a Source leading to that Sample?

Or did I misunderstand that isa tables are changed towards allowing Input [Sample Name]?

Member Author

Yes, exactly. With this change, we want to make the annotation more explicit.

Instead of

-----Sheet1-----|------Sheet2------|
Source -> Sample = Source -> Sample

we now have

---Sheet1---|---Sheet2---|
Source -> Sample -> Sample

The second entity in the example above was in actuality a Sample, but ambiguously annotated as a Source in the second sheet. Now instead, we annotate it as what it is, a Sample in both Sheets.

Member

Perfect, but this only answers half the question.

Can there be a sheet in any isa.study or isa.assay starting with a Sample, that does not have a source?

Member Author

I would say yes, and therefore not add any constraint.

To my understanding, Source is only a further specification of a Sample.
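
The sheet chaining described in this thread can be sketched in Python. This is only an illustration under stated assumptions: each sheet is reduced to a list of (input name, output name) rows, the names `Source_1`, `Sample_1` and `Sample_2` are made up, and `link_sheets` and `downstream` are hypothetical helpers. Names are assumed to identify entities across sheets, so an output in one sheet and an input with the same name in another sheet resolve to the same node.

```python
from collections import defaultdict

# Hypothetical rows: (input name, output name) per sheet.
sheet1 = [("Source_1", "Sample_1")]  # Input [Source Name] -> Output [Sample Name]
sheet2 = [("Sample_1", "Sample_2")]  # Input [Sample Name] -> Output [Sample Name]


def link_sheets(*sheets):
    """Build one adjacency map; identical names are treated as the same entity."""
    graph = defaultdict(set)
    for rows in sheets:
        for input_name, output_name in rows:
            graph[input_name].add(output_name)
    return graph


def downstream(graph, start):
    """Collect every node reachable from `start` (iterative depth-first walk)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


graph = link_sheets(sheet1, sheet2)
print(sorted(downstream(graph, "Source_1")))
```

A sheet that starts with a Sample and has no Source upstream works the same way in this sketch, which matches the decision not to add a constraint.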


An `Labeled Extract Material` MUST be indicated with the node type `Labeled Extract Name`.

`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.
Member

I'm confused about those categories of inputs and outputs (afaik they come from ISA). They don't really represent what's happening in the lab. a) Basically all of those would typically be called "sample". b) "extract" or "labeled extract" are just two types of possible outputs of a laboratory workflow, while other outputs types (a sample prepared for microscopy, ground powder, seeds, etc.) are not categorized.
So I don't really get why these exist.

Member Author

Agree 100%, but that's just what is given by ISA. I guess Sample and Data would probably be sufficient tbh.

But with ISA compatibility (isa-json as interface) as a top priority, IMO sticking to the given terminology is our best bet.

Member

I guess Sample and Data would probably be sufficient tbh.

If that proves to be the case based on what people actually use, there is no problem here, I think. There is nothing wrong with the spec providing more nodes than are practically useful; you would not expect users to read the spec anyway. However, if the spec focuses on full compatibility, these must still be included.

Member

Right, fair enough.

Member Author

Yup, good point

ISA-XLSX.md Outdated

`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.

`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant `Data` node to provide names or URIs of file locations.
Member

Similar to the "extract" types, I do not understand how an Image File is different from other Raw Data Files.

ISA-XLSX.md Outdated

For detail on ISA framework terminology, please read the [ISA Abstract Model specification](https://isa-specs.readthedocs.io/en/latest/isamodel.html).

This document describes the ISA Abstract Model reference implementation specified in the ISA-XLSX format. The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet document. Conceptually, using the terminology of the Spreadsheet ML specification [ISO/IEC 29500-1](https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml#:~:text=The%20XLSX%20format%20uses%20the,a%20rectangular%20grid%20of%20cells.), the document comprises one or more worksheets in a workbook. Every worksheet MUST contain one table object storing the metadata. Comments or auxiliary information MAY be stored alongside with table objects in a worksheet.
Member

Every worksheet MUST contain one table object storing the metadata

  1. Means there can be no worksheet without metadata (e.g. random notes)
  2. metadata = ISA metadata ≈ experimental metadata?

Member Author

Will cut this file.

I have this below:

Sheets described in this specification MUST follow one of the two given formats:

Sheets which do not follow any of these two formats are considered additional payload and are ignored in this specification.

ISA-XLSX.md Outdated
ISA-XLSX uses three types of files to capture the experimental metadata:
- Investigation file
- Study file
- Assay file (with associated data files)
Member

  • Assay file (with associated data files)
  1. dataset files?
  2. Protocol files are also "associated" and can be referenced in Protocol REF
  3. Analog for "Study file" above: associated resources and protocols

Member Author

Will cut the part after Assay file

ISA-XLSX.md Outdated

The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files; one assay file may be registered in different studies.

In order to facilitate identification of ISA-XLSX component files, specific naming patterns MUST follow:
Member

MUST be followed?

Member Author

Agree

ISA-XLSX.md Outdated
For maximal portability file names SHOULD contain only ASCII characters not excluded
already (that is `A-Za-z0-9._!#$%&+,;=@^(){}'[]` - we exclude space as many utilities
do not accept spaces in file paths): non-English alphabetic characters cannot be guaranteed
to be supported in all locales. It would be good practice to avoid the shell metacharacters
Member

would be good practice

is recommended? is good practice? should?

Member Author

I'll pick is recommended
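
As an aside, the character set quoted in the excerpt above translates fairly directly into a check. The following Python sketch is an assumption-laden illustration: `is_portable_file_name` is a hypothetical helper, and it covers only the quoted character class `A-Za-z0-9._!#$%&+,;=@^(){}'[]`, not the shell-metacharacter advice that is cut off in the excerpt.

```python
import re

# Character class copied from the quoted spec text; space is deliberately absent.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9._!#$%&+,;=@^(){}'\[\]]+")


def is_portable_file_name(name: str) -> bool:
    """True if every character of `name` is in the quoted ASCII set."""
    return PORTABLE_NAME.fullmatch(name) is not None


print(is_portable_file_name("isa.assay.xlsx"))   # only letters and dots
print(is_portable_file_name("my assay.xlsx"))    # contains a space
```

Since the spec uses SHOULD here, a tool would more likely warn on a failed check than reject the file.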

ISA-XLSX.md Outdated
The `Investigation file` fulfils four needs:

1. to declare key entities, such as factors, protocols, which may be referenced in the other files
2. to track provenance of the terminologies (controlled vocabularies or ontologies) there are used, where applicable
Member

there are used,

of the used terminologies

Member Author

Agree

ISA-XLSX.md Outdated
## INVESTIGATION section

This section is organized in several subsections, described in detail below. The Investigation section provides a
flexible mechanism for grouping two or more Study files where required. When only one Study is created, the values in
Member

When only one Study is created, the values in
this section SHOULD be left empty and the relevant metadata values recorded in the Study section only.

Why?

ISA-XLSX.md Outdated

`Protocol Description` columns MAY be used to specify the description of the `Protocol` node implemented by the `Process` node. Per Annotation Table sheet there MUST be at most one `Protocol Description` column. The value MUST be free text.

`Protocol Uri` columns MAY be used to specify the uri of the `Protocol` node implemented by the `Process` node. Per Annotation Table sheet there MUST be at most one `Protocol Uri` column. The value MUST be free text.
Member

This means I can reference an (external) protocol from e.g. a protocol database?

The value MUST be free text.

Would assume, that it MUST be a URI

Member

More important question here is: do we want such columns Protocol URI, Protocol Description, etc. inside an annotation table sheet?
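
To illustrate the two points raised in this thread, here is a small Python sketch. It is not taken from any ARC tooling: `check_protocol_uri_columns` and `looks_like_uri` are hypothetical helpers, and "has a scheme followed by something" is only an assumed stand-in for URI validity. The quoted spec text itself requires free text only.

```python
from urllib.parse import urlsplit


def check_protocol_uri_columns(headers):
    """At most one `Protocol Uri` column per annotation table sheet."""
    return headers.count("Protocol Uri") <= 1


def looks_like_uri(value: str) -> bool:
    """Loose check (assumption): the value has a scheme and something after it."""
    parts = urlsplit(value.strip())
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


headers = ["Input [Source Name]", "Protocol Uri", "Output [Sample Name]"]
print(check_protocol_uri_columns(headers))
print(looks_like_uri("https://example.org/protocols/42"))  # example.org is a placeholder
print(looks_like_uri("grind the leaves"))                  # free text, no scheme
```

If the value stays free text, the second check would at most be a warning; if it becomes a MUST, it would be an error.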

ISA-XLSX.md Outdated
## Factors

A `Factor` is an independent variable manipulated by an experimentalist with the intention to affect biological systems in a way that can be measured by an assay. This field holds the actual data for the `Factor` named between the
square brackets (as declared in the `Study Factors` section of a top-level metadata sheet) so MUST match; for example, `Factor [compound]`. The value MUST be free text, numeric, or an [`Ontology Annotation`](#ontology-annotations).
Member

so MUST match; for example,

so MUST match, for example,

ISA-XLSX.md Outdated

## Parameters

`Parameters` are all additional information about the experimental setup, that do not fall under the aforementioned 3 categories. It is formatted in the pattern `Parameter [<category term>]`. The value MUST be free text, numeric, or an [`Ontology Annotation`](#ontology-annotations).
Member

Unify first sentences.

  • Characteristics
  • A Factor
  • A Component
  • Parameters

ISA-XLSX.md Outdated

## Others

Columns whose headers do not follow any of the formats described above are considered additional payload and are ignored in this specification.
Member

are ignored

are out of the scope / not affected in this spec
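
A rough Python sketch of what "additional payload" could mean in practice. The list of recognized headers below is deliberately partial and contains only header shapes quoted in this review (`Input [...]`, `Output [...]`, `Factor [...]`, `Parameter [...]`, `Protocol Description`, `Protocol Uri`); `split_columns` is a hypothetical helper, and the column `My notes` is made up.

```python
import re

# Partial, illustrative list: only header shapes quoted in this review.
BRACKET_KINDS = ("Input", "Output", "Factor", "Parameter")
PLAIN_HEADERS = {"Protocol Description", "Protocol Uri"}
BRACKET_HEADER = re.compile(r"^(\w+) \[(.+)\]$")


def split_columns(headers):
    """Split headers into (recognized, additional_payload), keeping order."""
    recognized, payload = [], []
    for header in headers:
        match = BRACKET_HEADER.match(header)
        known = header in PLAIN_HEADERS or (
            match is not None and match.group(1) in BRACKET_KINDS
        )
        (recognized if known else payload).append(header)
    return recognized, payload


headers = ["Input [Source Name]", "Factor [compound]", "My notes", "Output [Sample Name]"]
print(split_columns(headers))
```

The payload columns are kept rather than dropped, which fits the wording that such columns are outside the scope of the specification rather than forbidden.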

| | |
|---------------------|--------------------------------------|
| ASSAY |
| Assay File Name | assays/Proteomics/isa.assay.xlsx |
Member

I'm wondering whether this should be relative in the study and assay files. Or basically relative in all.

So in isa.investigation.xlsx it would be assays/Proteomics/isa.assay.xlsx.
And in the top-level metadata assays/Proteomics/isa.assay.xlsx:isa_assay it should be isa.assay.xlsx?

Member Author

Relative to the top-level root is my current modus operandi for tooling.

I will try to incorporate specifications about this in the next patch (1.2.1) or minor (1.3.0) release.

I think this is something for the arc specification though. In ISA-XLSX it could be done in any other way, if used outside of the ARC context.
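
The two conventions discussed in this thread can be compared with a short Python sketch. Both helper functions are hypothetical and only illustrate the difference; the paths are the ones from the example above, written relative to the ARC root.

```python
from pathlib import PurePosixPath


def resolve_from_root(recorded: str) -> PurePosixPath:
    """Convention A: the recorded path is already relative to the ARC root."""
    return PurePosixPath(recorded)


def resolve_from_file(containing_file: str, recorded: str) -> PurePosixPath:
    """Convention B: the recorded path is relative to the file that mentions it."""
    return PurePosixPath(containing_file).parent / recorded


# Convention A, as used in isa.investigation.xlsx and in current tooling.
a = resolve_from_root("assays/Proteomics/isa.assay.xlsx")
# Convention B, as proposed for the assay file referring to itself.
b = resolve_from_file("assays/Proteomics/isa.assay.xlsx", "isa.assay.xlsx")
print(a == b)  # both conventions point at the same file
```

Under convention B a reader needs to know which file a value came from before it can be resolved, which is one reason to settle the rule at the ARC level.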

HLWeil merged commit 427935a into main Nov 9, 2023
HLWeil deleted the status_quo branch January 12, 2024 08:23
Successfully merging this pull request may close these issues:

- Release and versioning mechanism
- Incorporate status quo specification details

4 participants