Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

darachm · 2021-09-05T22:25:04Z

This is an answer to the issue raised about "tidy output" or tabular output:
#28 (comment)

For use of clustering of DNA barcodes before analyzing combinations of these barcodes,
it's necessary not just to tabulate counts but to know where which read has a barcode of each cluster.
starcode currently does provide this information but in a format (commas) that requires an awkward awk command
(or the like, with potential for errors to accumulate).

This series of edits implements the option to output the canonical-ized barcodes on a --per-read basis
such that the user may sort by the sequence-ids (which last I checked were 1-based input rows), and then
paste/join them onto their original file. This permits correcting tables of multiple barcodes with each
independently clustered type of barcode, in a less shell-tangled way.

I believe this is correct, and I threw two quick little tests in teh test folder, under extratests. I am unfamiliar with other
test writing formats.

If you're interested in incorporating this feature into starcode, please provide feedback and guidance for how this
can be best integrated into your code.

incorporate that into argument after showid flag. Actually using as a separate output format

the sequences on each line. Instead of printing the head and then the seq ids, I believe I'm just passing a pointer to the canonical seq and then printing that out each time. Works via hand testing. Won't work with sphere clustering yet.

clustering algorithm

…for the proper use of the outputs

stack that would only fire for the canonical, I think

gui11aume · 2021-09-17T15:34:56Z

Thank you for this pull request @darachm. I have rebased it in the branch feature/tidy. We will perform some tests there before merging to the master branch.

gui11aume · 2021-09-17T22:36:18Z

@darachm I am looking into the output of the --per-read command. So far I do not see anything wrong.
Do I understand correctly that the order of the reads is the same as appearing with --seq-id? Would that order ever make sense for any application? If not, the only use for this output format is to sort on the ID number and match the sequences with the original file, right?

If there is no application for a temporary format, starcode has to return the final sorted sequences. Also, it seems to me that a better format would be to have the sequence of the original read in the first column and the sequence of the centroid in the second column.

I would like to have your thoughts on this.

darachm · 2021-09-20T19:30:01Z

It would indeed be better to output both the original and centroid sequence separated by a tab, in the exact same order as it was input in. That is the ideal format for me, and I think it is of general utility.

I don't believe it outputs in the order of the seq IDs, and that's why I chose to print the seq ID to re-order it. It didn't seem easy to sort by seq ID internally, as the code I was copying is cluster-centric (not input centric). Outputing the corrected (cluster centroid) in the same order as the input sequences is very useful for using this tool to cluster-correct sequences.

If the output file order is different than the input, this would require sorting both input and output files to do a join, which takes extra time and diskspace. For me, I think use the read-IDs to pair up all the post-starcode outputs, so then I'd just have to sort the output back to the original. But that's just me.

Internally, is it worth the coding time and run-time efficient to re-write the output generation as separate functions? Perhaps this would make it easier to do further development of different output formats, or updating all the formats if any of the internal datastructures change?

gui11aume · 2021-09-21T01:21:35Z

@darachm You are absolutely right that the engine is cluster-centric. This is what requires a bit of thought. Still, it seems to me that the correspondence of the sequence to centroid may have many applications so it's worth adding this option.

If the order of the current output makes no particular sense (which seems to be the case), then we should really sort the reads in the order of the input, and we can drop the sequence ID (because it's 1, 2, 3, 4... anyway).

Changing the output format is mostly a cosmetic change, and even if it is tedious and perhaps buggy, it's nothing in comparison of changing the core data structure. The kind of bugs we would have by changing the core would be far-reaching, so the time commitment would be much larger, with possible disasters for users. So we are not updating the core any time soon.

darachm · 2021-09-21T03:13:56Z

Oh for sure, not changing the core at all! I was just wondering if you thought it would be best to leave the output block/loops as is, or to re-organize at least this output mode. It felt a bit redundant that I wrote two blocks, one for each algorithm:
darachm@1901d81#diff-681d5f5dc019670cb142a289a50b3438492cc0dcc36da37caa9f188ef53c63cfR546
and
darachm@c637876#diff-681d5f5dc019670cb142a289a50b3438492cc0dcc36da37caa9f188ef53c63cfR636

I'm not experienced in software design, so would you prefer to just add sorting into each block or to have a separate sorting function that gets passed the clustering outputs and then returns a sorted output array of sequence->centroid pairs for the output to use? Then another function for actually generating the per-read output? Something like that?

Whatever's best for the maintainability !

gui11aume · 2021-09-22T18:35:04Z

@darachm The best is that I give it a shot myself. I had a look a the code and I got an idea of how to do it without too much hassle. I'll get back when I have something concrete.

gui11aume · 2021-09-22T22:35:11Z

@darachm I pushed the changes on branch feature/tidy. There is no testing whatsoever for now. I would like to make sure that the result is correct before merging with master.

darachm · 2021-09-23T14:58:04Z

Okay, do you want testing?
I am using my fork for my pipelines now and getting sane results. I can rebuild containers with this branch and re-run those.

gui11aume · 2021-09-23T15:07:10Z

@darachm I checked that the results were correct on a few files. I will write some unit tests; in the meantime, you can use it on your data and let me know if you find any issue.

gui11aume · 2021-09-23T23:35:03Z

@darachm I wrote some basic tests for the option --tidy in commit 0f042eac. From my point of view, this is ready for merging with master. Let me know when you are done testing this mode on your files. If you find some issues, I will fix them. Otherwise I will merge.

darachm · 2021-10-06T00:16:04Z

Hey sorry for long wait. I tested it, it seemed to be working as normal. I have no systematic tests, but spot checks on large datasets look good. All looks good, as far as I can see.

Re performance: zcat, clustering, rejoining, and gzipping about 350 million ~30bp sequences (smaller barcode library) took 25minutes with 20-core cluster computer, using about 60Gb RAM. A more complex library, 350million of ~50k different barcodes, used about 32minutes and 66GB. Had zcat, cut, sed, paste, in the call as well, but I don't think those use too much RAM.

@gui11aume

gui11aume · 2021-10-06T00:52:06Z

Thanks for checking all this @darachm. I expect that the additional memory required for the tidy mode is small compared to the typical run. I think that the users would not really notice the difference.

gui11aume · 2021-10-06T00:57:50Z

I merged feature/tidy with master and deleted the branch. Thanks again for everything @darachm.

darachm added 7 commits September 5, 2021 13:59

Adding in an option for per-read outputs, and modeling how to

55361b3

incorporate that into argument after showid flag. Actually using as a separate output format

Wrote a little short test in the extratests.sh

ece37f2

Adapted another block to print out the clusters per-read for the sphere

c637876

clustering algorithm

Now the per-read option forces seq-id to be on, as this is necessary …

a1f39b6

…for the proper use of the outputs

Fixing mistake, a bug, an extra duplicated line of pushing into the

0691fbe

stack that would only fire for the canonical, I think

Adding in tests now to test this all, should be good for prime-time

5f4c362

darachm mentioned this pull request Sep 5, 2021

Feature request - add a statistical cutoff for cluster merging? #38

Open

gui11aume changed the base branch from master to feature/tidy September 17, 2021 15:33

gui11aume merged commit 9195391 into gui11aume:feature/tidy Sep 17, 2021

darachm mentioned this pull request Sep 23, 2021

Feature Request: Tidy Output #28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

darachm commented Sep 5, 2021

gui11aume commented Sep 17, 2021

gui11aume commented Sep 17, 2021 •

edited

darachm commented Sep 20, 2021

gui11aume commented Sep 21, 2021

darachm commented Sep 21, 2021 •

edited

gui11aume commented Sep 22, 2021

gui11aume commented Sep 22, 2021

darachm commented Sep 23, 2021

gui11aume commented Sep 23, 2021

gui11aume commented Sep 23, 2021

darachm commented Oct 6, 2021 •

edited

gui11aume commented Oct 6, 2021

gui11aume commented Oct 6, 2021 •

edited

Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

Conversation

darachm commented Sep 5, 2021

gui11aume commented Sep 17, 2021

gui11aume commented Sep 17, 2021 • edited

darachm commented Sep 20, 2021

gui11aume commented Sep 21, 2021

darachm commented Sep 21, 2021 • edited

gui11aume commented Sep 22, 2021

gui11aume commented Sep 22, 2021

darachm commented Sep 23, 2021

gui11aume commented Sep 23, 2021

gui11aume commented Sep 23, 2021

darachm commented Oct 6, 2021 • edited

gui11aume commented Oct 6, 2021

gui11aume commented Oct 6, 2021 • edited

gui11aume commented Sep 17, 2021 •

edited

darachm commented Sep 21, 2021 •

edited

darachm commented Oct 6, 2021 •

edited

gui11aume commented Oct 6, 2021 •

edited