Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import from irods using manifest with columns sample_id, irods_path #67

Merged
merged 5 commits into from
Dec 8, 2020

Conversation

tnguyensanger
Copy link
Contributor

@tnguyensanger tnguyensanger commented Dec 2, 2020

Partially addresses #27 and #54

We only use columns sample_id and irods_path now in the sample manifests. We no longer require run_ena.

Instead of using the run_ena for the read group IDs, we use the IRODS filename, since it uses the format {run}{lane}#{tag}.cram or {run}{lane}#{tag}.bam . This is sufficient to differentiate between which reads came from which lanelets in the final bam. We do not take in run_ena since this is not always available in the manifests, and would require extra code to query it.

However, the run_ena can easily be queried from the Sanger sequencing databases, so if we decide to have the read groups refer to the original lanelets by run_ena, it is possible to do so.

We should confirm with Alistair to ensure that using the IRODS filename instead of the ENA run accession is acceptable. In the past, malariagen would use whatever was configured as the lane information to be the new aligned lanelet read group ID. This would typically be the {run}_{lane}#{tag} filename of the IRODS cram/bam. Whether we want to continue with this in the future can be debated.

…d group, we use the IRODS filename, which is expected to be in format run_lane#tag.cram or run_lane#tag.bam. We do this because obtaining the run_ena is painful, and unecessary to differentiate between the lanelets in the final merged bam. We accept only sample_id and irods_path as the manifest columns.
… that miniwdl passes. Interestingly, womtool will validate on the previous version. We should switch to miniwdl for informal wdl checking on dev machines from now on
@tnguyensanger tnguyensanger changed the title WIP: import from irods using manifest with columns sample_id, irods_path import from irods using manifest with columns sample_id, irods_path Dec 2, 2020
Copy link
Collaborator

@Lfulcrum Lfulcrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of questions about the content of the READMEs and suggest one code improvement, but otherwise looks good!

@@ -199,9 +248,13 @@ task SamToFastq {
}
}

# User must supply either one of read_group_id or read_group
# If they supply read_group_id, a fake read_group_id will be generated as
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "a fake read_group will be generated as"?

AV0148-C /seq/15370/15370_8#3.cram
AV0079-C /seq/15049/15049_4#30.cram
AV0079-C /seq/15163/15163_5#30.cram
AV0079-C /seq/15163/15163_6#30.cram
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file contain a (fake) run_ena column? If it is an optional column, should we state this explicitly in the README for BatchImportShortReadAlignmentAndGenotyping, ImportShortReadAlignment and ImportShortReadLaneletAlignment?

@tnguyensanger
Copy link
Contributor Author

I don't particularly like the indexing of columns here because it forces the manifest columns to be ordered. Don't think it matters too much, although it would be nice if they could be unordered. We could use a map as mentioned here, provided we also retain the header in the per_sample_manifest_file. I guess this is a bit more work and not really necessary, so I'm ok if you want to leave it as is, or save these changes for a rainy day. I also don't mind making the changes if you think it's worth it.

Good call. Didn't realize that WDL had that feature. Perhaps we'll want to modify the ShortReadAlignment* pipelines to do similar. I'm content to just modify the BatchImportShortReadAlignmentAndGenotyping pipeline in this merge request, and let whoever uses ShortReadAlignment* pipelines to make those changes.

@tnguyensanger
Copy link
Contributor Author

Should this file contain a (fake) run_ena column? If it is an optional column, should we state this explicitly in the README for BatchImportShortReadAlignmentAndGenotyping, ImportShortReadAlignment and ImportShortReadLaneletAlignment?

No, the run_ena column is being removed completely. It can be added as an option column. The references to run_ena in the README.md are leftover from previous commits that should have gone in earlier when run_ena was required, but weren't checked in. I think the pull request was merged before those commits could be checked in.

I'll modify the README.me accordingly to reflect flexibility in column selection and order. But sample_id, irods_path are mandatory.

@tnguyensanger
Copy link
Contributor Author

@Lfulcrum We hit a snag with the suggested WDL changes. The suggested as_map function isn't available in the current WDL v1.0. (openwdl/wdl#194 (comment)).

Array[Array[String]] lanelet_infos = read_tsv(per_sample_manifest_file)
Array[String] header = lanelet_infos[0]
scatter (idx in range(length(lanelet_infos)-1)) {
    Array[String] lanelet_info = lanelet_infos[(idx+1)]
    Map[String, String] lanelet_info_map = as_map(zip(header, lanelet_info))
...
}

The other read_objects suggestion in the same issue openwdl/wdl#194 (comment) seems to work on our cromwell server (v49)

Array[Object] lanelet_infos = read_objects(per_sample_manifest_file)
scatter (idx in range(length(lanelet_infos))) {

  String irods_path = lanelet_infos[idx][LANELET_INFO_COLNAME_IRODS_PATH
  ...
}

However, read_objects doesn't seem to be supported by miniwdl used in the github actions. So it won't pass code validation. I'll sync with @gbggrant to determine the best approach going forward.

Copy link
Contributor

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Very happy to see the simpler input file format.

}

output {
File read_group_file = "~{read_group_filename}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure you can just use 'read_group_filename' here:

Suggested change
File read_group_file = "~{read_group_filename}"
File read_group_file = read_group_filename

…only has .bam suffix, not .bam.bam. Update README.md to reflect that run_ena is not required or used. Clean use of read_group_filename in WDL output
@tnguyensanger
Copy link
Contributor Author

@Lfulcrum I logged an issue #70 to for more flexible manifest columns. As per requested by @seretol in standup, we will expedite output of production data by checking in the changes for flexible manifest columns in a separate merge request after we have modified the github actions. In this merge request, we will keep the manifest column order static and mandatory as "sample_id,irods_path".

@tnguyensanger tnguyensanger merged commit f2fef6f into master Dec 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants