Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

low map rate for 10X scATAC-seq #15

Open
badoi opened this issue Jul 2, 2021 · 24 comments
Open

low map rate for 10X scATAC-seq #15

badoi opened this issue Jul 2, 2021 · 24 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@badoi
Copy link

badoi commented Jul 2, 2021

I'm aligning 10x genomics scATAC-seq to the rheMac10 genome. The sequencing is 50bp for R1 and R2 paired-end sequencing on the Novaseq. The genome was built with k17 and w7 for min frag 30. But the alignment rate is 7% which is too low. For 10x scATAC-seq datasets, how was the genome built to better map fragments as low as 20 or 30 bp after index trimming? Are there any other advice to work with this short of sequencing?

In a different dataset where the sequencing strategy was 100bp for R1 and R2, the same macaque index aligned 91% of the reads.

Mapped all reads in 55348.73s.
Number of reads: 1132465712.
Number of mapped reads: 77938628.
Number of uniquely mapped reads: 72568652.
Number of reads have multi-mappings: 5369976.
Number of candidates: 781477061.
Number of mappings: 77938628.
Number of uni-mappings: 72568652.
Number of multi-mappings: 5369976.
Number of barcodes in whitelist: 2890.
Number of corrected barcodes: 41349117.
Sorted, deduped and outputed mappings in 17.12s.

chromap_1741011_1_out.txt

@haowenz
Copy link
Owner

haowenz commented Jul 3, 2021

Can you share the command line you used to run Chromap? It seems that your barcode whitelist file does not match the barcodes in your data. So most of the reads are dropped since their barcodes are not on the whitelist.

@badoi
Copy link
Author

badoi commented Jul 3, 2021

The command I used was the following. I'll check but I remember that 10x only had 1 ATAC barcode list, which is the list I provided. The issue first is that only 77938628/ 1132465712 reads mapped, which is ~6% of reads.

BARCODE_LIST=/home/bnphan/src/cellranger-atac-1.2.0/cellranger-atac-cs/1.2.0/lib/python/barcodes/737K-cratac-v1.txt
GENOME_FASTA=/home/bnphan/resources/genomes/rheMac10/rheMac10.fa
CHROMAP_IDX=/home/bnphan/resources/genomes/chromap/rheMac10/index
ALIGN_BED=${SAMPLE}.aln.bed

~/src/chromap-0.1_x64-linux/chromap --preset atac \
-x $CHROMAP_IDX -r ${GENOME_FASTA} -t 8 \
--remove-pcr-duplicates-at-cell-level \
--bc-error-threshold 2 \
-1 ${FQ1} -2 ${FQ2} -b ${BC1} -o ${ALIGN_BED} \
--barcode-whitelist ${BARCODE_LIST}

@haowenz
Copy link
Owner

haowenz commented Jul 3, 2021

Chromap will check if the barcode is in the whitelist, if not then it tries to correct the barcode. And if the barcode is still not in the whitelist, it will skip mapping the reads. Can you try a run with the parameters you used and also with "--output-mappings-not-in-whitelist" to see if the mapping rate is still low?

Moreover, can I know the 10X assay you used? Single Cell ATAC or Single Cell Multiome? They have different barcode whitelists. Also the time looks weird. Is it possible for you to share 1M read pairs so we may have a look?

@badoi
Copy link
Author

badoi commented Jul 5, 2021

Thanks I'll run with that parameter. I tried aligning over the weekend with K=15 and W=7 and got a lower map rate, so I'll try to go for higher K' and W' indices then with your parameter to see if the barcode whitelist is the issue.

Our collaborators used the scATAC protocol. They haven't tried the Multiome yet.

chromap_1741095_1_out.txt

@mourisl
Copy link
Collaborator

mourisl commented Jul 5, 2021

Based on the two lines:
Number of barcodes in whitelist: 2890.
Number of corrected barcodes: 41349117.

It seems only a tiny fraction of reads' barcodes are in the whitelist. Could you please check whether the barcode used in the scATAC-seq data and the barcode whitelist are matched? Thank you.

@badoi
Copy link
Author

badoi commented Jul 7, 2021

Hi all, I reached out to 10x and they advised that some sequencers provide the reverse complement of the barcodes, so I made a rev-comp whitelist. I get more hits w/ that list, and rerunning now w/ an index built with K20 W7. That should fix both problems of low alignment rate and not enough matches w/ the whitelist. I'll confirm this is indeed correct once the run finishes.

For others who find it useful, the unix command to make a rev comp of the whitelist:

cat 737K-cratac-v1.txt | tr ACGT TGCA | rev | sort > 737K-cratac-v1_revcomp.txt

@haowenz
Copy link
Owner

haowenz commented Jul 7, 2021

Please also try a run with index built with "-k 17 -w 7", which is default. It may give you the better results. Increasing k seems not necessary to me.

@badoi
Copy link
Author

badoi commented Jul 7, 2021

The initial runs were with an index built w/ K17 W7, which produced very low map rates.

@haowenz
Copy link
Owner

haowenz commented Jul 7, 2021

Did you try "--output-mappings-not-in-whitelist" with "k17 w7" and get low mapping rate? Or you can just remove "-b" and run your data as bulk ATAC-seq data and see if the reads map.

It seems to me that you got you low mapping rate in the initial run since your barcode whitelist did not have the reverse complement of the barcodes. Those reads are dropped without even trying the mapping procedure. In this case you will always get low mapping rate, and the value of k and w have no effect on the results. When you use the right barcode whitelist file, the default parameters are supposed to work. Otherwise, there might be other problems.

You can try larger k if you want. But make sure you use an odd value for k (e.g, 19, 21).

@haowenz
Copy link
Owner

haowenz commented Jul 7, 2021

@mourisl I think we should add a new mapping stats saying the number of reads skipped since their barcodes are not on the whitelist. This would help the users and us to identify the problem in the data. What do you think?

@mourisl
Copy link
Collaborator

mourisl commented Jul 7, 2021

Yes. We should be more clear that Chromap will skip the reads whose barcodes are not in the whitelist, at least in the README. I think the line "Number of barcodes in whitelist: 2890. Number of corrected barcodes: 41349117." serves the purpose, but the user might not know that the other reads are filtered by default.

@badoi
Copy link
Author

badoi commented Jul 8, 2021

Now that you mentioned all of that, I think I get the error now. I aligned w/ the default index k17 w7, with the reverse complement barcode list and looks like it's working well. Thanks for your feedback, and I agree that a bit clearer mapping statistics could help warn about this quirk in the future.

Number of reads: 1096959080.
Number of mapped reads: 992840174.
Number of uniquely mapped reads: 917016382.
Number of reads have multi-mappings: 75823792.
Number of candidates: 9525419205.
Number of mappings: 992840174.
Number of uni-mappings: 917016382.
Number of multi-mappings: 75823792.
Number of barcodes in whitelist: 482067912.
Number of corrected barcodes: 44422687.
Sorted, deduped and outputed mappings in 346.11s.
# uni-mappings: 213743382, # multi-mappings: 28074758, total: 241818140.
Number of output mappings (passed filters): 207028528
Total time: 32917.30s.

@badoi badoi closed this as completed Jul 8, 2021
@dannyconrad
Copy link

Thank you @badoi for posting the reverse-complement tip. I was having a similar map rate issue and using the revcomp whitelist fixed it for me 👍

Just as a suggestion to the authors/developers, this would be a good thing to make a note of somewhere in the README (unless you have and I missed it), since this will be a problem for anyone aligning scATACseq reads sequenced via the reverse complement workflow (including NovaSeq w/ v1.5 reagents, HiSeq 4000, NextSeq, etc.).

Additionally, to ease some of the confusion around how mapping rates are achieved, it might make sense to order the final alignment statistics in the order that they are prioritized for filtering. Without the above thread, I also would not have known that only reads with a confirmed cell barcode are used for mapping (though of course, that makes a lot of sense for efficiency), since that stat is one of the last reported.

@haowenz
Copy link
Owner

haowenz commented Nov 18, 2021

Thank you so much for the suggestions! They make a lot of sense. Li and I will take care of this.

@haowenz haowenz reopened this Nov 18, 2021
@haowenz haowenz added documentation Improvements or additions to documentation enhancement New feature or request labels Nov 18, 2021
@haowenz haowenz self-assigned this Nov 18, 2021
@dannyconrad
Copy link

I should also add that I am blown away with how fast Chromap is, by the way. I couldn't believe my eyes when you reported 16X speed enhancement over cellranger-atac, yet here I am aligning 5 different single-cell libraries in a single workday. Excellent work :)

@mourisl
Copy link
Collaborator

mourisl commented Nov 18, 2021

Thank you for the suggestions and the information on the reverse-complement barcodes. We just added a check of whether the barcodes from the data are too different from the whitelist. If they are too different, Chromap will stop early and warn the user about the whitelist issue. So you don't need to wait until the end to find that the whitelist needs to be modified. Hope this will be helpful.

@dannyconrad
Copy link

Excellent! That's a much more elegant solution, since it should also be useful in cases where users mistakenly use the wrong barcode whitelist. Thanks so much for addressing that so quickly.

@bli25
Copy link

bli25 commented Dec 23, 2021

@mourisl @haowenz , does it make sense to add another option (e.g. '--barcode-reversion') to reverse complement barcode FASTQ? As @dannyconrad mentioned, Illumina's new machines (e.g. NextSeq, NovaSeq) implement the reverse complement workflow, in which case index2 (barcode read for 10x) will be reverse complemented. I expect that having reverse complement of the barcode will be a very common use case. Reversing the white list is not an elegant solution and may introduce problems for 10x multiome (how to translate reverse complemented atac barcodes to RNA barcodes?).

Please let me know what you think.

Best,
Bo

@mourisl
Copy link
Collaborator

mourisl commented Dec 23, 2021

I am planning to add the strand information to the option "--read-format" so Chromap will reverse the barcode or sequence when reading in the data. It should be done in the next a few days. Thanks for the suggestions!

@bli25
Copy link

bli25 commented Dec 23, 2021

@mourisl, awesome, thanks a lot for keep developing this great software!

@mourisl
Copy link
Collaborator

mourisl commented Dec 24, 2021

We have extended the fields in the read-format option, and you can set the option like "--read-format bc:0:-1:-" to specify the reverse-complement barcode now.

@bli25
Copy link

bli25 commented Feb 2, 2022

@mourisl ,

I detected one bug related to this new feature. When I set '--read-format bc:0:-1:-', I still got the following error message:

Less than 5% barcodes can be found or corrected based on the barcode whitelist.
Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false positive warning, please run Chromap with the option --skip-barcode-check.

@mourisl
Copy link
Collaborator

mourisl commented Feb 2, 2022

Thanks for the testing! I will check the implementation.

@mourisl
Copy link
Collaborator

mourisl commented Feb 5, 2022

@bli25 Hi Bo, yes, indeed there is a bug at the checking stage. I've fixed that in the "custom_readformat" branch, could you please give it a try? Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants