Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output 10x counts #160

Merged
merged 18 commits into from
Nov 17, 2022
Merged

Output 10x counts #160

merged 18 commits into from
Nov 17, 2022

Conversation

kafkasl
Copy link
Contributor

@kafkasl kafkasl commented Oct 11, 2022

This PR adds support to generate 10x count files as output (features.tsv, barcodes.tsv, and matrix.mtx) as part of the pipeline.

Issue: #66

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs- [ ] If necessary, also make a PR on the nf-core/scrnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Oct 11, 2022

nf-core lint overall result: Passed ✅

Posted for pipeline commit aae4f5e

+| ✅ 158 tests passed       |+

✅ Tests passed:

Run details

  • nf-core/tools version 2.6
  • Run at 2022-11-17 14:11:35

@grst
Copy link
Member

grst commented Oct 12, 2022

Two questions:

  • isn't this essentially throwing away sample information? I see you are only storing the cell index in the barcodes.tsv.
  • I'm afraid this will be quite slow when the merged dataset get's large (i.e. > 100k cells). This is one of the reasons to prefer binary formats like h5 over text representation. Maybe not run this by default, or at least make it easy to switch off.

@kafkasl
Copy link
Contributor Author

kafkasl commented Oct 12, 2022

Regarding 1, the barcodes file contains the actual barcodes list, it looks like this:

AAACCTGAGCTAGTGG
AAACCTGCATCGACGC
...

Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.

@grst
Copy link
Member

grst commented Oct 12, 2022

Regarding 1, the barcodes file contains the actual barcodes list, it looks like this

yes, so how will you actually know which barcode comes from which biological sample?

Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.

Just writing a text file vs. writing a binary file. Tbh, I don't really have experience with writing these files, but reading feat/barcode/mtx is way slower than reading the h5 files produced by cellranger.

@kafkasl
Copy link
Contributor Author

kafkasl commented Oct 12, 2022

Regarding 1, the barcodes file contains the actual barcodes list, it looks like this

yes, so how will you actually know which barcode comes from which biological sample?

Because those 3 files will be generated on a per-sample basis. This is the folder structure:

├── Sample_X_matrix
│   ├── barcodes.tsv
│   ├── features.tsv
│   └── matrix.mtx
└── Sample_Y_matrix
    ├── barcodes.tsv
    ├── features.tsv
    └── matrix.mtx

Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.

Just writing a text file vs. writing a binary file. Tbh, I don't really have experience with writing these files, but reading feat/barcode/mtx is way slower than reading the h5 files produced by cellranger.

Yeah, unfortunately for some integrations we need those files in text format.

I have added the parameter --output_10x and it's disabled by default.

@grst
Copy link
Member

grst commented Oct 13, 2022

As mentioned on slack, no new functionality should be needed to produce these files - Maybe some publishDir options need to be tuned to get them into the right places in the output directory.

All the aligners produce mtx files already and we use that python script to build h5ad files from them.In the example output of the workflow:

@kafkasl
Copy link
Contributor Author

kafkasl commented Oct 13, 2022

I got your point. However, I have checked those files and there are two problems with them.

  1. The matrix is transposed so it will failed when trying to load them in R
  2. The genes files contains the ensemble IDs but we want that file to include also the gene names (which we are doing in our code).

My goal is to format correctly those files independently of the aligner so we can add another step to upload them automatically for downstream analysis. I am wary of adding those "R formatting" options into the actual aligners' steps to avoid polluting them with formatting. How would you do that?

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then, having consistent mtx files for all alignment routes sounds reasonable.

I added a few comments.

subworkflows/local/mtx_conversion.nf Outdated Show resolved Hide resolved
bin/h5ad_to_10x.py Outdated Show resolved Hide resolved
modules/local/h5ad_to_10x.nf Outdated Show resolved Hide resolved
modules/local/h5ad_to_10x.nf Outdated Show resolved Hide resolved
@grst
Copy link
Member

grst commented Oct 14, 2022

Actually, maybe this could be simplified and at the same time used to solve #159:

At the same time we should think about how we organize the output directory. We currently have the mtx_conversions output directory for each aligner with the h5ad and rds files in them. This is fine, but I think the name is not ideal and it needs some substructure, e.g.

aligner_cellranger/matrices
|
| - per_sample
|   | 
|    - h5ad
|    - mtx
|    - seurat
  - merged
    |
     - h5ad
     - seurat

@apeltzer, @fmalmeida, @kafkasl, what do you think?

@fmalmeida
Copy link
Contributor

fmalmeida commented Oct 14, 2022

Well, I agree that having it standard so they can used in the same way afterwards is a good idea. I gues, for example, instead of adding this export on the pyScript, we can actaully add a step before, which does this standardization on files, and "saves" it and then the other conversions come from these standardised matrices and we work on re-shaping the conversion modules and scripts to use that new (standard) formatting.

Maybe this makes more sense, no?

At the same time we should think about how we organize the output directory. We currently have the mtx_conversions output directory for each aligner with the h5ad and rds files in them. This is fine, but I think the name is not ideal and it needs some substructure, e.g.

On this second comment, I totally agree that we should reshape it and I have no comments on it. I liked the structure proposed.

@grst
Copy link
Member

grst commented Oct 14, 2022

export on the pyScript, we can actaully add a step before, which does this standardization on files, and "saves" it and then the other conversions come from these standardised matrices and we work on re-shaping the conversion modules and scripts to use that new (standard) formatting.

It depends a bit what needs to be done. Reading/writing mtx files in a Python script is way slower than h5(ad) files. Not reading (i.e. using the data already in memory) is even better. So purely from a runtime perspective, it is beneficial to read whatever output files the aligners create once and then write all desired outputs.

That being said, transposing a mtx matrix could probably done on the command line using awk, which would be very fast.

@apeltzer
Copy link
Member

I agree that we should not read often but once and output what is necessary / standardize this a bit. The conversion modules also need to add versions for example, so would be great to do all of the above to make sure we're also following best practices 👍

If that even closes more issues even better 🥳

@kafkasl
Copy link
Contributor Author

kafkasl commented Oct 17, 2022

I think what you mentioned @grst sounds good. We have an idea of how to add the gene symbols and the --export_mtx parameter. We will have to look into how to change the output structure. I don't know what @apeltzer refers to with "adding versions".

@apeltzer
Copy link
Member

The current mxt_... modules do not output the required versions.yml which should be done too 👍🏻 Thats something that was missed but is also highlighted by the nf-core lint tool, so we should adhere to the standard of generating this when the conversion is done :-)

@grst grst added this to the 2.2.0 milestone Oct 21, 2022
@kafkasl
Copy link
Contributor Author

kafkasl commented Nov 11, 2022

@grst @apeltzer I've modified the PR to do the following:

  1. Standardize the mtx_conversions outputs. They now look like this:
├── mtx_conversions
│   ├── Sample_X
│   │   ├── Sample_X_matrix.h5ad
│   │   ├── Sample_X_matrix.rds
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
│   ├── Sample_Y
│   │   ├── Sample_Y_matrix.h5ad
│   │   ├── Sample_Y_matrix.rds
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
  1. Export the 10x counts format files, barcodes.tsv, features.tsv, and matrix.mtx. This has been done inside an existing module from data already in memory so there shouldn't be any performance impact other than writing to disk.
  2. I've enriched the features.tsv files with the gene names, by default they only have the gene IDs. I extracted them from txp2gene for kallisto, and the geneInfo.tab file for star & cellranger. For alevin, we haven't managed to find where to get that translation info so far.
  3. I added the export the 10x counts param --export_mtx that you suggested but setting it to false breaks the downstream process mtx_to_seurat which depends on this matrix counts being exported so I think it should not be added.

as a side note, the prettier command and nf-core schema build have conflicting formatting for nextflow_schema.json

@kafkasl kafkasl force-pushed the output-10x-counts branch 2 times, most recently from dc04849 to 5ac1fa2 Compare November 11, 2022 08:47
@kafkasl kafkasl requested a review from grst November 11, 2022 08:54
Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kafkasl! We are getting there, but I think this needs one more iteration!

I added the export the 10x counts param --export_mtx that you suggested but setting it to false breaks the downstream process mtx_to_seurat which depends on this matrix counts being exported so I think it should not be added.

I think it's ok to always export it. When I had performance concerns initially, I thought you'd want to export the merged count matrix including all samples, which could contain hundreds of thousands of cells.

However, I don't see why it should break the mtx_to_seurat? It (currently) doesn't use the mtx files generated by your script. (Although probably it should because it makes the script simpler / or convert from h5ad as I argued previously -- It doesn't make any sense to handle all special cases twice, once in R and once in Python. Anyway, this is not the scope of this PR)

've enriched the features.tsv files with the gene names, by default they only have the gene IDs. I extracted them from txp2gene for kallisto, and the geneInfo.tab file for star & cellranger. For alevin, we haven't managed to find where to get that translation info so far.

Great! There's a t2g_3col.tsv in the salmon index directory. It even has it's own channel already, just need to emit it from the subworkflow.

bin/cellranger_mtx_to_h5ad.py Outdated Show resolved Hide resolved
bin/cellranger_mtx_to_h5ad.py Outdated Show resolved Hide resolved
nextflow_schema.json Outdated Show resolved Hide resolved
bin/mtx_to_h5ad.py Show resolved Hide resolved
bin/cellranger_mtx_to_h5ad.py Outdated Show resolved Hide resolved
bin/mtx_to_h5ad.py Outdated Show resolved Hide resolved
@kafkasl
Copy link
Contributor Author

kafkasl commented Nov 15, 2022

@grst I've addressed all your comments. The most important ones:

  • mtx_to_h5ad now exports a versions file
  • cellrange_mtx_to_h5ad has been deleted and now uses also mtx_to_h5ad
  • improved the parsing of t2g-like files with your suggestions

@kafkasl kafkasl requested a review from grst November 15, 2022 09:13
Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now just a few cosmetic things, then OK from my side.
Maybe @apeltzer can have another glance as well.

As per this discussion on slack we now know a way to get the gene mapping for alevin (-> upgrade simpleaf version). But we agreed on following up on that in a separate PR.

CHANGELOG.md Show resolved Hide resolved
bin/mtx_to_h5ad.py Outdated Show resolved Hide resolved
bin/mtx_to_h5ad.py Outdated Show resolved Hide resolved
bin/mtx_to_h5ad.py Outdated Show resolved Hide resolved
kafkasl and others added 3 commits November 17, 2022 13:57
Co-authored-by: Gregor Sturm <mail@gregor-sturm.de>
Co-authored-by: Gregor Sturm <mail@gregor-sturm.de>
Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants