-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40
Conversation
incorporate that into argument after showid flag. Actually using as a separate output format
the sequences on each line. Instead of printing the head and then the seq ids, I believe I'm just passing a pointer to the canonical seq and then printing that out each time. Works via hand testing. Won't work with sphere clustering yet.
clustering algorithm
…for the proper use of the outputs
stack that would only fire for the canonical, I think
Thank you for this pull request @darachm. I have rebased it in the branch feature/tidy. We will perform some tests there before merging to the master branch. |
@darachm I am looking into the output of the If there is no application for a temporary format, I would like to have your thoughts on this. |
It would indeed be better to output both the original and centroid sequence separated by a tab, in the exact same order as it was input in. That is the ideal format for me, and I think it is of general utility. I don't believe it outputs in the order of the seq IDs, and that's why I chose to print the seq ID to re-order it. It didn't seem easy to sort by seq ID internally, as the code I was copying is cluster-centric (not input centric). Outputing the corrected (cluster centroid) in the same order as the input sequences is very useful for using this tool to cluster-correct sequences. If the output file order is different than the input, this would require sorting both input and output files to do a join, which takes extra time and diskspace. For me, I think use the read-IDs to pair up all the post-starcode outputs, so then I'd just have to sort the output back to the original. But that's just me. Internally, is it worth the coding time and run-time efficient to re-write the output generation as separate functions? Perhaps this would make it easier to do further development of different output formats, or updating all the formats if any of the internal datastructures change? |
@darachm You are absolutely right that the engine is cluster-centric. This is what requires a bit of thought. Still, it seems to me that the correspondence of the sequence to centroid may have many applications so it's worth adding this option. If the order of the current output makes no particular sense (which seems to be the case), then we should really sort the reads in the order of the input, and we can drop the sequence ID (because it's 1, 2, 3, 4... anyway). Changing the output format is mostly a cosmetic change, and even if it is tedious and perhaps buggy, it's nothing in comparison of changing the core data structure. The kind of bugs we would have by changing the core would be far-reaching, so the time commitment would be much larger, with possible disasters for users. So we are not updating the core any time soon. |
Oh for sure, not changing the core at all! I was just wondering if you thought it would be best to leave the output block/loops as is, or to re-organize at least this output mode. It felt a bit redundant that I wrote two blocks, one for each algorithm: I'm not experienced in software design, so would you prefer to just add sorting into each block or to have a separate sorting function that gets passed the clustering outputs and then returns a sorted output array of sequence->centroid pairs for the output to use? Then another function for actually generating the per-read output? Something like that? Whatever's best for the maintainability ! |
@darachm The best is that I give it a shot myself. I had a look a the code and I got an idea of how to do it without too much hassle. I'll get back when I have something concrete. |
@darachm I pushed the changes on branch |
Okay, do you want testing? |
@darachm I checked that the results were correct on a few files. I will write some unit tests; in the meantime, you can use it on your data and let me know if you find any issue. |
@darachm I wrote some basic tests for the option |
Hey sorry for long wait. I tested it, it seemed to be working as normal. I have no systematic tests, but spot checks on large datasets look good. All looks good, as far as I can see. Re performance: zcat, clustering, rejoining, and gzipping about 350 million ~30bp sequences (smaller barcode library) took 25minutes with 20-core cluster computer, using about 60Gb RAM. A more complex library, 350million of ~50k different barcodes, used about 32minutes and 66GB. Had zcat, cut, sed, paste, in the call as well, but I don't think those use too much RAM. |
Thanks for checking all this @darachm. I expect that the additional memory required for the tidy mode is small compared to the typical run. I think that the users would not really notice the difference. |
I merged |
This is an answer to the issue raised about "tidy output" or tabular output:
#28 (comment)
For use of clustering of DNA barcodes before analyzing combinations of these barcodes,
it's necessary not just to tabulate counts but to know where which read has a barcode of each cluster.
starcode
currently does provide this information but in a format (commas) that requires an awkward awk command(or the like, with potential for errors to accumulate).
This series of edits implements the option to output the canonical-ized barcodes on a
--per-read
basissuch that the user may sort by the sequence-ids (which last I checked were 1-based input rows), and then
paste/join them onto their original file. This permits correcting tables of multiple barcodes with each
independently clustered type of barcode, in a less shell-tangled way.
I believe this is correct, and I threw two quick little tests in teh test folder, under extratests. I am unfamiliar with other
test writing formats.
If you're interested in incorporating this feature into starcode, please provide feedback and guidance for how this
can be best integrated into your code.