Output 10x counts #160
Two questions:
Regarding 1, the barcodes file contains the actual list of barcodes. It looks like this:
Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter. |
yes, so how will you actually know which barcode comes from which biological sample?
Just writing a text file vs. writing a binary file. Tbh, I don't really have experience with writing these files, but reading |
Because those 3 files will be generated on a per-sample basis. This is the folder structure:
Yeah, unfortunately for some integrations we need those files in text format. I have added the parameter |
91b2d77 to 080c258
As mentioned on Slack, no new functionality should be needed to produce these files - Maybe some All the aligners produce mtx files already, and we use that Python script to build h5ad files from them. In the example output of the workflow:
|
I got your point. However, I have checked those files and there are two problems with them.
My goal is to format those files correctly, independently of the aligner, so we can add another step to upload them automatically for downstream analysis. I am wary of adding those "R formatting" options into the actual aligners' steps, to avoid polluting them with formatting. How would you do that?
OK then, having consistent mtx files for all alignment routes sounds reasonable.
I added a few comments.
Actually, maybe this could be simplified and at the same time used to solve #159:
At the same time we should think about how we organize the output directory. We currently have the
@apeltzer, @fmalmeida, @kafkasl, what do you think? |
Well, I agree that having it standard, so they can be used in the same way afterwards, is a good idea. I guess, for example, instead of adding this Maybe this makes more sense, no?
On this second comment, I totally agree that we should reshape it and I have no comments on it. I liked the structure proposed. |
It depends a bit on what needs to be done. Reading/writing mtx files in a Python script is way slower than h5(ad) files. Not reading (i.e. using the data already in memory) is even better. So purely from a runtime perspective, it is beneficial to read whatever output files the aligners create once and then write all desired outputs. That being said, transposing an mtx matrix could probably be done on the command line using awk, which would be very fast.
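To illustrate why the transpose is cheap: in a Matrix Market coordinate file it only requires swapping the two dimensions on the size line and the two index columns on every entry line, so it can be streamed line by line without loading the matrix. Below is a minimal Python sketch of that idea (the awk version would do the same swap). The function name `transpose_mtx` is hypothetical and not part of the pipeline, and the sketch assumes a `general` coordinate file, which is what count matrices normally use.

```python
import io


def transpose_mtx(src, dst):
    """Stream a Matrix Market coordinate file from src to dst, transposed.

    Assumes the 'general' coordinate format (no symmetry shortcuts).
    src and dst are text file objects.
    """
    size_line_seen = False
    for line in src:
        if line.startswith("%"):
            # Banner and comment lines are copied unchanged.
            dst.write(line)
            continue
        if not line.strip():
            continue
        fields = line.split()
        if not size_line_seen:
            # Size line: rows, columns, number of non-zero entries.
            rows, cols, nnz = fields
            dst.write(f"{cols} {rows} {nnz}\n")
            size_line_seen = True
        else:
            # Entry line: row index, column index, value.
            i, j, *value = fields
            dst.write(" ".join([j, i, *value]) + "\n")


if __name__ == "__main__":
    example = io.StringIO(
        "%%MatrixMarket matrix coordinate integer general\n"
        "3 2 2\n"
        "1 2 5\n"
        "3 1 7\n"
    )
    result = io.StringIO()
    transpose_mtx(example, result)
    print(result.getvalue(), end="")
```

The entries come out in a different order than a column-sorted writer would produce. The Matrix Market format does not require any particular ordering, but a reader that expects sorted entries would need an extra sort step.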
I agree that we should not read often but once and output what is necessary / standardize this a bit. The conversion modules also need to add versions for example, so would be great to do all of the above to make sure we're also following best practices 👍 If that even closes more issues even better 🥳 |
The current |
72e4c53 to 1415f2e
8e43147 to fd5d1ae
@grst @apeltzer I've modified the PR to do the following:
as a side note, the |
dc04849 to 5ac1fa2
5ac1fa2 to c69cf8d
f48823a to b36f935
Thanks @kafkasl! We are getting there, but I think this needs one more iteration!
I added the --export_mtx param that you suggested for exporting the 10x counts, but setting it to false breaks the downstream process mtx_to_seurat, which depends on these matrix counts being exported, so I think it should not be added.
I think it's ok to always export it. When I had performance concerns initially, I thought you'd want to export the merged count matrix including all samples, which could contain hundreds of thousands of cells.
However, I don't see why it should break mtx_to_seurat. It (currently) doesn't use the mtx files generated by your script. (Although it probably should, because that would make the script simpler, or it could convert from h5ad as I argued previously. It doesn't make any sense to handle all special cases twice, once in R and once in Python. Anyway, this is outside the scope of this PR.)
I've enriched the features.tsv files with the gene names; by default they only have the gene IDs. I extracted them from txp2gene for kallisto, and from the geneInfo.tab file for star & cellranger. For alevin, we haven't managed to find where to get that translation info so far.
Great! There's a t2g_3col.tsv in the salmon index directory. It even has its own channel already; it just needs to be emitted from the subworkflow.
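For illustration, the enrichment step described above could look roughly like the following sketch. It assumes the gene ID to gene name mapping has already been loaded into a dictionary from whichever aligner-specific file provides it; the column layout of those files is not assumed here. The function name `add_gene_names` is hypothetical, and falling back to the gene ID when no name is known is an assumption made for this example, not necessarily what the PR does. The three-column layout (ID, name, feature type) follows the 10x features.tsv convention.

```python
import csv
import io


def add_gene_names(gene_ids, id_to_name, out):
    """Write 10x-style features rows: gene ID, gene name, feature type.

    gene_ids: gene IDs in matrix row order.
    id_to_name: dict mapping gene ID to gene name (may be incomplete).
    out: text file object to write to.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for gene_id in gene_ids:
        # Assumption: reuse the ID as the name when no mapping is available.
        name = id_to_name.get(gene_id, gene_id)
        writer.writerow([gene_id, name, "Gene Expression"])


if __name__ == "__main__":
    buffer = io.StringIO()
    add_gene_names(["ENSG1", "ENSG2"], {"ENSG1": "GeneA"}, buffer)
    print(buffer.getvalue(), end="")
```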
Co-authored-by: Gregor Sturm <mail@gregor-sturm.de>
@grst I've addressed all your comments. The most important ones:
|
1c87d7c to d6bc6ca
Ok, now just a few cosmetic things, then OK from my side.
Maybe @apeltzer can have another glance as well.
As per this discussion on slack we now know a way to get the gene mapping for alevin (-> upgrade simpleaf version). But we agreed on following up on that in a separate PR.
Co-authored-by: Gregor Sturm <mail@gregor-sturm.de>
8a99c58 to aae4f5e
LGTM!
This PR adds support to generate 10x count files as output (features.tsv, barcodes.tsv, and matrix.mtx) as part of the pipeline.
Issue: #66
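As a rough picture of what the three files contain, here is a minimal standard-library Python sketch that writes features.tsv, barcodes.tsv, and matrix.mtx for one sample. It is not the script used in this PR: the function name `write_10x_counts` and the input shape (a list of zero-based `(gene_index, barcode_index, count)` triples) are assumptions made for the example. It follows the 10x convention of genes as rows and barcodes as columns, with one-based indices in the mtx file.

```python
from pathlib import Path


def write_10x_counts(outdir, genes, barcodes, entries):
    """Write features.tsv, barcodes.tsv and matrix.mtx into outdir.

    genes: list of (gene_id, gene_name) tuples, in matrix row order.
    barcodes: list of cell barcodes, in matrix column order.
    entries: iterable of zero-based (gene_index, barcode_index, count).
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    (outdir / "barcodes.tsv").write_text("".join(f"{b}\n" for b in barcodes))
    (outdir / "features.tsv").write_text(
        "".join(f"{gid}\t{name}\tGene Expression\n" for gid, name in genes)
    )

    entries = list(entries)
    with open(outdir / "matrix.mtx", "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate integer general\n")
        # Size line: number of genes, number of barcodes, number of entries.
        fh.write(f"{len(genes)} {len(barcodes)} {len(entries)}\n")
        for gene_idx, barcode_idx, count in entries:
            # Matrix Market indices are one-based.
            fh.write(f"{gene_idx + 1} {barcode_idx + 1} {count}\n")


if __name__ == "__main__":
    write_10x_counts(
        "example_sample",
        genes=[("ENSG1", "GeneA"), ("ENSG2", "GeneB")],
        barcodes=["AAAC-1", "AAAG-1", "AAAT-1"],
        entries=[(0, 0, 3), (1, 2, 1)],
    )
```

Calling it once per sample directory matches the per-sample layout discussed earlier in the thread.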
PR checklist
nf-core lint
nextflow run . -profile test,docker --outdir <OUTDIR>
docs/usage.md is updated.
docs/output.md is updated.
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).