import from irods using manifest with columns sample_id, irods_path #67

tnguyensanger · 2020-12-02T00:40:32Z

Partially addresses #27 and #54

We only use columns sample_id and irods_path now in the sample manifests. We no longer require run_ena.

Instead of using run_ena for the read group IDs, we use the IRODS filename, which follows the format {run}_{lane}#{tag}.cram or {run}_{lane}#{tag}.bam. This is sufficient to tell which reads came from which lanelets in the final bam. We do not take in run_ena because it is not always available in the manifests and would require extra code to query.
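As a rough illustration of the naming scheme above, here is a minimal Python sketch (not the pipeline's WDL) that derives a read group ID from an IRODS path. The function name, the regular expression, and the choice to strip the .cram/.bam suffix are assumptions made for illustration; the actual tasks may build the ID differently.

```python
import re
from pathlib import PurePosixPath

# Assumed lanelet filename pattern: {run}_{lane}#{tag}.cram or {run}_{lane}#{tag}.bam
LANELET_RE = re.compile(r"^(?P<run>\d+)_(?P<lane>\d+)#(?P<tag>\d+)\.(?:cram|bam)$")


def read_group_id_from_irods_path(irods_path: str) -> str:
    """Return the lanelet filename without its .cram/.bam suffix.

    Hypothetical helper: raises ValueError if the filename does not
    match the assumed {run}_{lane}#{tag} pattern.
    """
    name = PurePosixPath(irods_path).name
    if LANELET_RE.match(name) is None:
        raise ValueError(f"unexpected lanelet filename: {name!r}")
    # Drop only the final extension; the run, lane and tag stay intact.
    return name.rsplit(".", 1)[0]
```

With a path such as /seq/15370/15370_8#3.cram, the sketch yields 15370_8#3, which is unique per lanelet as long as the run, lane and tag combination is unique.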

However, the run_ena can easily be queried from the Sanger sequencing databases, so if we decide to have the read groups refer to the original lanelets by run_ena, it is possible to do so.

We should confirm with Alistair that using the IRODS filename instead of the ENA run accession is acceptable. In the past, malariagen would use whatever was configured as the lane information as the read group ID of the newly aligned lanelet. This would typically be the {run}_{lane}#{tag} filename of the IRODS cram/bam. Whether we want to continue with this in the future can be debated.

…d group, we use the IRODS filename, which is expected to be in format run_lane#tag.cram or run_lane#tag.bam. We do this because obtaining the run_ena is painful, and unnecessary to differentiate between the lanelets in the final merged bam. We accept only sample_id and irods_path as the manifest columns.

Lfulcrum

I have a couple of questions about the content of the READMEs and suggest one code improvement, but otherwise looks good!

Lfulcrum · 2020-12-02T14:50:14Z

tasks/farm5/ShortReadAlignmentTasks.wdl

@@ -199,9 +248,13 @@ task SamToFastq {
  }
 }

+# User must supply either one of read_group_id or read_group
+# If they supply read_group_id, a fake read_group_id will  be generated as


Should this be "a fake read_group will be generated as"?

pipelines/import-short-read-alignment-vector/README.md

Lfulcrum · 2020-12-03T23:55:55Z

...ead-alignment-and-genotyping-vector/farm5/input_files/small/batch_sample_size_2_lanelets.tsv

+AV0148-C	/seq/15370/15370_8#3.cram
+AV0079-C	/seq/15049/15049_4#30.cram
+AV0079-C	/seq/15163/15163_5#30.cram
+AV0079-C	/seq/15163/15163_6#30.cram


Should this file contain a (fake) run_ena column? If it is an optional column, should we state this explicitly in the README for BatchImportShortReadAlignmentAndGenotyping, ImportShortReadAlignment and ImportShortReadLaneletAlignment?

pipelines/import-short-read-lanelet-alignment-vector/farm5/ImportShortReadLaneletAlignment.wdl

tnguyensanger · 2020-12-07T11:59:52Z

> I don't particularly like the indexing of columns here because it forces the manifest columns to be ordered. Don't think it matters too much, although it would be nice if they could be unordered. We could use a map as mentioned here, provided we also retain the header in the per_sample_manifest_file. I guess this is a bit more work and not really necessary, so I'm ok if you want to leave it as is, or save these changes for a rainy day. I also don't mind making the changes if you think it's worth it.

Good call. Didn't realize that WDL had that feature. Perhaps we'll want to modify the ShortReadAlignment* pipelines to do the same. I'm content to modify only the BatchImportShortReadAlignmentAndGenotyping pipeline in this merge request, and let whoever uses the ShortReadAlignment* pipelines make those changes.

tnguyensanger · 2020-12-07T12:02:32Z

> Should this file contain a (fake) run_ena column? If it is an optional column, should we state this explicitly in the README for BatchImportShortReadAlignmentAndGenotyping, ImportShortReadAlignment and ImportShortReadLaneletAlignment?

No, the run_ena column is being removed completely. It can be added as an optional column. The references to run_ena in the README.md are leftovers from previous commits that should have gone in earlier, when run_ena was required, but weren't checked in. I think the pull request was merged before those commits could be checked in.

I'll modify the README.md accordingly to reflect flexibility in column selection and order. However, sample_id and irods_path are mandatory.

tnguyensanger · 2020-12-07T15:47:22Z

@Lfulcrum We hit a snag with the suggested WDL changes. The suggested as_map function isn't available in the current WDL v1.0. (openwdl/wdl#194 (comment)).

Array[Array[String]] lanelet_infos = read_tsv(per_sample_manifest_file)
Array[String] header = lanelet_infos[0]
scatter (idx in range(length(lanelet_infos)-1)) {
    Array[String] lanelet_info = lanelet_infos[(idx+1)]
    Map[String, String] lanelet_info_map = as_map(zip(header, lanelet_info))
...
}

The other suggestion in the same issue, read_objects (openwdl/wdl#194 (comment)), seems to work on our Cromwell server (v49):

Array[Object] lanelet_infos = read_objects(per_sample_manifest_file)
scatter (idx in range(length(lanelet_infos))) {

  String irods_path = lanelet_infos[idx][LANELET_INFO_COLNAME_IRODS_PATH]
  ...
}

However, read_objects doesn't seem to be supported by the miniwdl used in the GitHub Actions, so it won't pass code validation. I'll sync with @gbggrant to determine the best approach going forward.
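For comparison outside WDL, the header-keyed lookup that as_map(zip(header, row)) and read_objects aim for can be sketched in Python. This is only an illustration of the idea, assuming a tab-separated manifest with a header row; the function name and the error handling are hypothetical and not part of the pipelines.

```python
import csv
import io
from typing import Dict, List

# Columns the discussion treats as mandatory; any others are optional.
MANDATORY_COLUMNS = ("sample_id", "irods_path")


def read_manifest_rows(tsv_text: str) -> List[Dict[str, str]]:
    """Parse a TSV manifest into one dict per row, keyed by header name.

    Hypothetical helper: column order does not matter, but the
    mandatory columns must be present in the header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames or []
    missing = [col for col in MANDATORY_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")
    return [dict(row) for row in reader]
```

Because each row is looked up by column name rather than by index, reordering the columns or adding optional ones does not change how sample_id and irods_path are read.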

gbggrant

Looks good. Very happy to see the simpler input file format.

gbggrant · 2020-12-07T19:14:28Z

tasks/farm5/ShortReadAlignmentTasks.wdl

+  }
+
+  output {
+    File read_group_file = "~{read_group_filename}"


Pretty sure you can just use 'read_group_filename' here:

Suggested change

File read_group_file = "~{read_group_filename}"

File read_group_file = read_group_filename

tnguyensanger · 2020-12-08T10:19:19Z

@Lfulcrum I logged issue #70 for more flexible manifest columns. As requested by @seretol in standup, we will expedite output of production data by checking in the changes for flexible manifest columns in a separate merge request, after we have modified the GitHub Actions. In this merge request, we will keep the manifest column order static and mandatory as "sample_id,irods_path".

tnguyensanger added 3 commits November 30, 2020 14:23

Partially addresses #52. Update comments on usage. Update lsf_groups …

8aec478

…to reflect #66

Merge branch 'master' of https://github.com/malariagen/pipelines into…

8a92e6b

… irods_import

tnguyensanger requested review from Lfulcrum, magnusmanske and seretol December 2, 2020 00:42

Partially addresses #54, #27. Make changes to string concatenation so…

50ec7c4

… that miniwdl passes. Interestingly, womtool will validate on the previous version. We should switch to miniwdl for informal wdl checking on dev machines from now on

tnguyensanger changed the title ~~WIP: import from irods using manifest with columns sample_id, irods_path~~ import from irods using manifest with columns sample_id, irods_path Dec 2, 2020

Lfulcrum requested changes Dec 4, 2020

tnguyensanger requested a review from gbggrant December 7, 2020 16:01

gbggrant approved these changes Dec 7, 2020

Lfulcrum approved these changes Dec 8, 2020

Partially addresses #68, #54. Make sure output of FixMateInformation …

4f0f32d

…only has .bam suffix, not .bam.bam. Update README.md to reflect that run_ena is not required or used. Clean use of read_group_filename in WDL output

tnguyensanger mentioned this pull request Dec 8, 2020

Use flexible fields in manifest #70

Closed

tnguyensanger merged commit f2fef6f into master Dec 8, 2020
