Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inStrain profile not profiling all genes #167

Closed
rebeccasophiasalcedo opened this issue Mar 4, 2024 · 7 comments
Closed

inStrain profile not profiling all genes #167

rebeccasophiasalcedo opened this issue Mar 4, 2024 · 7 comments

Comments

@rebeccasophiasalcedo
Copy link

Hi!

I'm running inStrain on a bunch of whole metagenome assemblies to calculate gene by gene coverage in each sample. It works like a dream on all of my samples except for one, where it's not profiling the majority of my genes or scaffolds.

Here's my command for that sample specifically:
inStrain profile /scratch/users/rsalcedo/4500m1000m_site/bowtie/bams/4500m1000m_reads_to_contigs_sorted.bam /scratch/users/rsalcedo/OC17wc_sequences/assemblies/4500m1000m_contigs_min1000.fa -o /scratch/users/rsalcedo/4500m1000m_site/inStrain/4500m1000m_inStrain/ -g /scratch/users/rsalcedo/OC17wc_sequences/prodigal_annotations/fna/4500m1000m_contigs_min1000_prodigalORFs.fna -p 32 --skip_mm_profiling --skip_plot_generation -c 1

I ran inStrain version 1.7.1 and my bam was generated with bowtie2.

Right now the gene_info.tsv only has 911 entries and the scaffold_info.tsv entry has 1105 entries. My whole assembly has 983997 genes and 385088 scaffolds. I dug into the raw_data directory and saw that the genes_table.csv has the correct number of entries, but the genes_coverage.csv (2692) and scaffold_list.txt (1105) are both off.

I checked the mapping_info.tsv and it has all 385088 scaffolds listed but looks like most of my scaffolds didn't have any mapped reads. I went back and checked my bam file and it looks okay. I also ran coverM on my assembly and all contigs recruited reads, with ~82% recruiting at 5x coverage or more, so there's a discrepancy between inStrain that says my scaffolds have no recruited reads vs coverM/my bam file that says that there is recruitment.

I think it has something to do with how inStrain is reading in the coverage info for this particular bam file? I'm not sure though. If you have any advice it'd be much appreciated! Here is my log.log file, let me know if I can get you any more information to help problem solve.

Thank you!

@MrOlm
Copy link
Owner

MrOlm commented Mar 4, 2024

Hi @rebeccasophiasalcedo -

Thank you for the detailed bug report! Based on what you've described, it does seem like there's a problem with inStrain reading the reads mapped to some of your scaffolds. This could be due to 1) a mismatch between the scaffold names in the .fasta file and the scaffold names in the .bam file, 2) a weird character in the names of the .fasta file or .bam file, or 3) some other bug with pysam (what inStrain uses to load the reads). A couple of questions-

About how many other samples did you run that worked? A few, or many?

Did the other samples use the same .fasta file? I there anything different about how this sample was processed as compared to the other files?

Thanks,
Matt

@MrOlm
Copy link
Owner

MrOlm commented Mar 4, 2024

Also in looking at the log file, inStrain is reporting that it's removing ~50% of reads during filtering. Could you provide the mapping_info.tsv file as well?

Thanks,
Matt

@rebeccasophiasalcedo
Copy link
Author

I have a total of 28 samples and had no problems with the other 27! And they all used their respective assembly fasta, so the .fa file used in this command is only ever used here for this analysis. Though I generated MAGs using all assemblies (incl. this one) and cross-mapped for metaBAT and ran into no issues there.

I just checked my sam and fasta and the headers look alright. I didn't catch any whitespace characters either.

head 4500m1000m_reads_to_contigs.sam
@hd VN:1.5 SO:unsorted GO:query
@sq SN:OC1703_4500m_1000m_sens_contig_1483280 LN:1041
@sq SN:OC1703_4500m_1000m_sens_contig_3485708 LN:1284

head 4500m1000m_contigs_min1000.fa

OC1703_4500m_1000m_sens_contig_4153184
OC1703_4500m_1000m_sens_contig_593313

Here's the mapping_info.tsv file!

@rebeccasophiasalcedo
Copy link
Author

oh and when I ran coverM I required 95% ANI too so the recruitment of coverM and what inStrain's default requirements are for mapping are comparable!

@MrOlm
Copy link
Owner

MrOlm commented Mar 4, 2024

OK thanks for this. In looking at your mapping_info.tsv file, it looks to be like inStrain thinks that all the reads mapping to lots of the contigs are singletons (unpaired reads). This could be because they are, or because of some sort of bug in inStrain / pysam that's make it think they are.

You could confirm one way or the other by trying to filter the .bam file to remove unpaired reads and then running coverM again (or something like that).

Or, if you could just add --pairing_filter all_reads to just keep the singleton reads and not filter them out.

-Matt

@rebeccasophiasalcedo
Copy link
Author

okay I'll give both of those a shot and report back, thank you!

@rebeccasophiasalcedo
Copy link
Author

running it with --pairing_filter all_reads worked! I think it's because my reads actually are all singletons and probably not a bug based on some samtools digging, thank you so much for helping!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants