Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

Merged: 7 commits merged into gui11aume:feature/tidy on Sep 17, 2021

Conversation

@darachm commented Sep 5, 2021

This is an answer to the issue raised about "tidy output" or tabular output:
#28 (comment)

When clustering DNA barcodes before analyzing combinations of those barcodes,
it's necessary not just to tabulate counts but to know which cluster's barcode each read carries.
starcode currently provides this information, but in a comma-separated format that requires an awkward
awk command (or the like), with potential for errors to accumulate.
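
For illustration, expanding the existing cluster-level output (with --seq-id) into one row per read takes something like the awk below; the three-column layout (centroid, count, comma-separated 1-based sequence numbers) is from memory and may be slightly off:

    # expand "centroid <TAB> count <TAB> id1,id2,..." into "read_id <TAB> centroid",
    # then restore input order by sorting on the numeric read ID
    awk -v OFS='\t' '{ n = split($3, ids, ","); for (i = 1; i <= n; i++) print ids[i], $1 }' \
        clusters.txt | sort -k1,1n > per_read.txt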

This series of edits adds an option to output the canonicalized barcodes on a --per-read basis,
so that the user can sort by the sequence IDs (which, last I checked, are 1-based input row numbers) and then
paste/join them onto the original file. This permits correcting tables of multiple barcodes, with each
type of barcode clustered independently, in a less shell-tangled way.
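
A minimal sketch of the intended use, assuming the --per-read output has the read ID in the first column and the canonical sequence in the second (that layout is what I had in mind, not something fixed yet):

    # cluster, restore input order, and bolt the centroids onto the original table
    starcode -d 2 -i barcodes.txt --per-read -o per_read_raw.txt
    sort -k1,1n per_read_raw.txt > per_read_sorted.txt
    paste original_table.tsv <(cut -f2 per_read_sorted.txt) > corrected_table.tsv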

I believe this is correct, and I added two quick tests in the test folder, under extratests. I am unfamiliar with the
other test-writing formats.

If you're interested in incorporating this feature into starcode, please provide feedback and guidance on how it
can best be integrated into your code.

darachm added 7 commits September 5, 2021 13:59
Commit messages (abridged): incorporate the option as an argument after the showid flag; use it as a separate
output format that prints the canonical sequence on each line; instead of printing the head and then the seq
ids, a pointer to the canonical sequence is passed and printed for each read; works via hand testing, but does
not yet work with sphere clustering.
gui11aume changed the base branch from master to feature/tidy September 17, 2021 15:33
gui11aume merged commit 9195391 into gui11aume:feature/tidy Sep 17, 2021
@gui11aume (Owner)

Thank you for this pull request @darachm. I have rebased it onto the branch feature/tidy. We will perform some tests there before merging to the master branch.

@gui11aume (Owner) commented Sep 17, 2021

@darachm I am looking into the output of the --per-read option. So far I do not see anything wrong.
Do I understand correctly that the order of the reads is the same as the one appearing with --seq-id? Would that order ever make sense for any application? If not, the only use for this output format is to sort on the ID number and match the sequences with the original file, right?

If there is no application for such an intermediate format, starcode should return the final, sorted sequences directly. Also, it seems to me that a better format would be to have the sequence of the original read in the first column and the sequence of the centroid in the second column.
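
Concretely, something like this, with one line per input read, in the order of the input (sequences made up for illustration):

    # original read <TAB> centroid
    TAGCTAGCGATT    TAGCTAGCGATT
    TAGCTAGCGATA    TAGCTAGCGATT
    CGTTACCGGTAA    CGTTACCGGTAA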

I would like to have your thoughts on this.

@darachm (Author) commented Sep 20, 2021

It would indeed be better to output both the original and the centroid sequence separated by a tab, in the exact same order as the input. That is the ideal format for me, and I think it is of general utility.

I don't believe it outputs in the order of the seq IDs, and that's why I chose to print the seq ID, to allow re-ordering. It didn't seem easy to sort by seq ID internally, as the code I was copying is cluster-centric (not input-centric). Outputting the corrected (cluster centroid) sequence in the same order as the input sequences is very useful for using this tool to cluster-correct sequences.

If the output file order is different from the input, this requires sorting both input and output files to do a join, which takes extra time and disk space. For me, I think I'd use the read IDs to pair up all the post-starcode outputs, so then I'd just have to sort the combined output back to the original order. But that's just me.
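
For example, for two barcodes clustered separately, pairing them back up on the read ID would look roughly like this (file names made up, and again assuming a read-ID column followed by the centroid):

    # join the two per-read outputs on the read-ID column, then restore input order
    join -t $'\t' \
        <(sort -k1,1 per_read_bc1.txt) \
        <(sort -k1,1 per_read_bc2.txt) \
      | sort -k1,1n > paired_centroids.txt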

Internally, is it worth the coding time, and is it run-time efficient, to re-write the output generation as separate functions? Perhaps this would make it easier to develop further output formats, or to update all the formats if any of the internal data structures change?

@gui11aume (Owner)

@darachm You are absolutely right that the engine is cluster-centric. This is what requires a bit of thought. Still, it seems to me that the correspondence of sequence to centroid may have many applications, so it's worth adding this option.

If the order of the current output makes no particular sense (which seems to be the case), then we should really sort the reads in the order of the input, and we can drop the sequence ID (because it's 1, 2, 3, 4... anyway).

Changing the output format is mostly a cosmetic change, and even if it is tedious and perhaps buggy, it's nothing in comparison with changing the core data structure. The kinds of bugs we would get by changing the core would be far-reaching, so the time commitment would be much larger, with possible disasters for users. So we are not updating the core any time soon.

@darachm (Author) commented Sep 21, 2021

Oh for sure, not changing the core at all! I was just wondering if you thought it would be best to leave the output block/loops as is, or to re-organize at least this output mode. It felt a bit redundant that I wrote two blocks, one for each algorithm:
darachm@1901d81#diff-681d5f5dc019670cb142a289a50b3438492cc0dcc36da37caa9f188ef53c63cfR546
and
darachm@c637876#diff-681d5f5dc019670cb142a289a50b3438492cc0dcc36da37caa9f188ef53c63cfR636

I'm not experienced in software design, so would you prefer to just add sorting into each block, or to have a separate sorting function that gets passed the clustering outputs and returns a sorted array of sequence->centroid pairs for the output code to use? And then another function for actually generating the per-read output? Something like that?

Whatever's best for maintainability!

@gui11aume (Owner)

@darachm The best is that I give it a shot myself. I had a look at the code and I have an idea of how to do it without too much hassle. I'll get back to you when I have something concrete.

@gui11aume (Owner)

@darachm I pushed the changes on branch feature/tidy. There is no testing whatsoever for now. I would like to make sure that the result is correct before merging with master.

@darachm (Author) commented Sep 23, 2021

Okay, would you like me to test it?
I am using my fork in my pipelines now and getting sane results. I can rebuild the containers with this branch and re-run those.

@gui11aume (Owner)

@darachm I checked that the results were correct on a few files. I will write some unit tests; in the meantime, you can use it on your data and let me know if you find any issue.

@gui11aume (Owner)

@darachm I wrote some basic tests for the option --tidy in commit 0f042eac. From my point of view, this is ready for merging with master. Let me know when you are done testing this mode on your files. If you find some issues, I will fix them. Otherwise I will merge.

@darachm (Author) commented Oct 6, 2021

Hey, sorry for the long wait. I tested it and it seems to be working as expected. I have no systematic tests, but spot checks on large datasets look good. Everything looks fine as far as I can see.

Re performance: zcat, clustering, rejoining, and gzipping about 350 million ~30 bp sequences (a smaller barcode library) took 25 minutes on a 20-core cluster machine, using about 60 GB of RAM. A more complex library, 350 million reads of ~50k different barcodes, took about 32 minutes and 66 GB. The call also had zcat, cut, sed, and paste in it, but I don't think those use much RAM.
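
A rough sketch of that kind of call (file names, field positions, distance, and thread count are made up for illustration):

    # cluster one barcode field and re-attach the corrected sequences to the original table
    zcat reads.tsv.gz | cut -f3 \
      | starcode -d 2 -t 20 --tidy \
      | paste <(zcat reads.tsv.gz) - \
      | gzip > reads_corrected.tsv.gz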

@gui11aume (Owner)

Thanks for checking all this @darachm. I expect that the additional memory required for the tidy mode is small compared to what a typical run already uses. I think users would not really notice the difference.

@gui11aume (Owner) commented Oct 6, 2021

I merged feature/tidy with master and deleted the branch. Thanks again for everything @darachm.
