import from irods using manifest with columns sample_id, irods_path #67
…d group, we use the IRODS filename, which is expected to be in the format run_lane#tag.cram or run_lane#tag.bam. We do this because obtaining the run_ena is painful and unnecessary for differentiating between the lanelets in the final merged bam. We accept only sample_id and irods_path as the manifest columns.
… that miniwdl passes. Interestingly, womtool will validate on the previous version. We should switch to miniwdl for informal WDL checking on dev machines from now on.
I have a couple of questions about the content of the READMEs and suggest one code improvement, but otherwise looks good!
@@ -199,9 +248,13 @@ task SamToFastq {
}
}

# User must supply either one of read_group_id or read_group
# If they supply read_group_id, a fake read_group_id will be generated as
Should this be "a fake read_group will be generated as"?
AV0148-C /seq/15370/15370_8#3.cram
AV0079-C /seq/15049/15049_4#30.cram
AV0079-C /seq/15163/15163_5#30.cram
AV0079-C /seq/15163/15163_6#30.cram
Should this file contain a (fake) run_ena column? If it is an optional column, should we state this explicitly in the README for BatchImportShortReadAlignmentAndGenotyping, ImportShortReadAlignment and ImportShortReadLaneletAlignment?
pipelines/import-short-read-lanelet-alignment-vector/farm5/ImportShortReadLaneletAlignment.wdl
Good call. Didn't realize that WDL had that feature. Perhaps we'll want to modify the ShortReadAlignment* pipelines to do something similar. I'm content to modify only the BatchImportShortReadAlignmentAndGenotyping pipeline in this merge request and let whoever uses the ShortReadAlignment* pipelines make those changes.
No, the run_ena column is being removed completely. It can be added as an optional column. The references to run_ena in the README.md are left over from previous commits that should have gone in earlier, when run_ena was required, but weren't checked in. I think the pull request was merged before those commits could be checked in. I'll modify the README.md accordingly to reflect flexibility in column selection and order. But sample_id and irods_path are mandatory.
@Lfulcrum We hit a snag with the suggested WDL changes. The suggested
The other
However,
Looks good. Very happy to see the simpler input file format.
}

output {
File read_group_file = "~{read_group_filename}"
Pretty sure you can just use 'read_group_filename' here:
- File read_group_file = "~{read_group_filename}"
+ File read_group_file = read_group_filename
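For context, here is a minimal sketch of the pattern being suggested, assuming WDL 1.0. The task name, the `read_group_id` input and the filename pattern are hypothetical and not taken from this repository; only `read_group_filename` and `read_group_file` appear in the diff. A `String` declaration is coerced to `File` in the output section, so wrapping it in an interpolated string should not be needed:

```wdl
version 1.0

# Hypothetical task, for illustration only.
task WriteReadGroupSketch {
  input {
    String read_group_id  # assumed input, e.g. "15370_8#3"
  }

  # Assumed filename pattern; the real pipeline may name the file differently.
  String read_group_filename = "~{read_group_id}.read_group.txt"

  command <<<
    printf '%s\n' "~{read_group_id}" > "~{read_group_filename}"
  >>>

  output {
    # The String is coerced to File here; no "~{...}" wrapper required.
    File read_group_file = read_group_filename
  }
}
```

Both forms resolve to the same path relative to the task's working directory; the bare identifier is simply less noisy.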
…only has .bam suffix, not .bam.bam. Update README.md to reflect that run_ena is not required or used. Clean use of read_group_filename in WDL output
@Lfulcrum I logged issue #70 for more flexible manifest columns. As requested by @seretol in standup, we will expedite output of production data by checking in the changes for flexible manifest columns in a separate merge request, after we have modified the GitHub Actions. In this merge request, we will keep the manifest column order static and mandatory as "sample_id,irods_path".
Partially addresses #27 and #54
We only use columns sample_id and irods_path now in the sample manifests. We no longer require run_ena.
Instead of using the run_ena for the read group IDs, we use the IRODS filename, since it uses the format {run}_{lane}#{tag}.cram or {run}_{lane}#{tag}.bam. This is sufficient to tell which reads came from which lanelets in the final bam. We do not take in run_ena since it is not always available in the manifests and would require extra code to query.
However, the run_ena can easily be queried from the Sanger sequencing databases, so if we decide to have the read groups refer to the original lanelets by run_ena, it is possible to do so.
We should confirm with Alistair to ensure that using the IRODS filename instead of the ENA run accession is acceptable. In the past, malariagen would use whatever was configured as the lane information to be the new aligned lanelet read group ID. This would typically be the {run}_{lane}#{tag} filename of the IRODS cram/bam. Whether we want to continue with this in the future can be debated.
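A minimal sketch of the filename-based read group ID described above, assuming WDL 1.0. The workflow name and the `read_group_id` output name are hypothetical, and the actual pipeline may derive the ID differently (for example inside a task's command section). It relies only on the standard library `basename` function with its optional suffix argument:

```wdl
version 1.0

# Hypothetical workflow, for illustration only.
workflow ReadGroupIdSketch {
  input {
    String irods_path  # e.g. /seq/15370/15370_8#3.cram
  }

  output {
    # The inner call drops the directory and a trailing ".cram", if present.
    # The outer call drops a trailing ".bam", if present.
    # Either extension therefore yields {run}_{lane}#{tag}.
    String read_group_id = basename(basename(irods_path, ".cram"), ".bam")
  }
}
```

For `/seq/15370/15370_8#3.cram` this evaluates to `15370_8#3`, which differs per lanelet without needing a run_ena lookup.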