Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment outside of contig #2

Closed
MatthiasLienhard opened this issue May 14, 2021 · 11 comments
Closed

Alignment outside of contig #2

MatthiasLienhard opened this issue May 14, 2021 · 11 comments

Comments

@MatthiasLienhard
Copy link

I used your tool (version 0.3, installed with pip) to align isoseq data to the human reference genome. For at least one contig I got reads beyond the last position (I renamed, sorted and binarized the output file reads.sam to CTL_S2_sorted.bam):

$ samtools view -H  CTL_S2_sorted.bam |grep KI270827.1
@SQ     SN:KI270827.1   LN:67707

e.g. Contig KI270827.1 has 67707 bases.
These are the start position of the aligned reads:

$ samtools view CTL_S2_sorted.bam KI270827.1  | cut -f 4
134490
134490
134494
134494
134496
134501
134510
134516
134520
134520
[...]
236450
236450
236450
236450
236450
236450
236453
236464
250325
259444
@ksahlin
Copy link
Owner

ksahlin commented May 14, 2021

This is strange, I will have a better look at this over the weekend. It would help if I can get at least one read(/sequence) that has aligned coordinates outside this contig, as well as your command line calling uLTRA. Would this be possible?

@ksahlin
Copy link
Owner

ksahlin commented May 16, 2021

I decided to wait with tracing down this bug until I get a sequence that produces violating alignment coordinates, as this bug may be nontrivial.

@MatthiasLienhard
Copy link
Author

MatthiasLienhard commented May 17, 2021

I prepared a minimal example, by taking every 100th read from the affected contig:

samtools view CTL_S2_sorted.bam KI270827.1|perl -lane '$c+=1; if ($c%100==0){print "\@read$c\n$F[9]\n+\n$F[10]"}'>test.fastq

This resulted in 10 reads. As the aligned reads did not contain quality strings, there are no quality values in the fastq, but I think it does not matter.
You can get this file and the resulting alignment here:
test.fastq

The reference genome and annotation I used is from GENCODE.
I used your tool with the new parameter options as follows, to reprocess the selected reads:

# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# align the reads
uLTRA align GRCh38.genome.fa test.fastq . --prefix test --index uLTRA_index --isoseq

Strangely, the resulting alignment contained only 4 reads (even though all where aligned to KI270827.1 when origninally processed together with all reads). For each read, two alignments are found, one on chr11 and one on KI270827.1. Sometimes chr11 is reported as primary, sometimes KI270827.1. Except for the strand[1] the alignment to KI270827.1 is the same as in my original uLTRA alignment with all reads.

[1] apparently the sequence in the sam file is reversed if aligned to the - strand, which I did not consider when creating the fastq.

@ksahlin
Copy link
Owner

ksahlin commented May 18, 2021

Hi Matthias,

Thanks for the effort in producing a minimal example, it really helps.

I just wanted to double-check with you if you are sure that you are using v36 on the gencode annotation? The reason that I ask is that I cannot find KI270827.1 in the gencode.v36.chr_patch_hapl_scaff.annotation.gtf (downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.chr_patch_hapl_scaff.annotation.gtf.gz -- which is the "ALL" entry at https://www.gencodegenes.org/human/release_36.html ). It is however present in the GRCh38.genome.fa file. Could you check this with e.g. grep "KI270827.1" [your annotation].gtf

This may be precisely what is causing the bug, but just thought I would check if I have the correct GTF annotation file before I go on a bug hunt.

@MatthiasLienhard
Copy link
Author

I can confirm that there is no annotation on KI270827.1 in the version I was using (v36).

@ksahlin
Copy link
Owner

ksahlin commented May 19, 2021

So I looked into this and here are my observations:

  • "Strangely, the resulting alignment contained only 4 reads"
    I also observed this behavior when using the fastq file that you produced. I do not observe this when I convert the file to fasta (then all 10 reads are aligned). I then modified your fastq so that the quality strings were the same length as the sequences, then I also get all 10 sequences aligned. So this makes me conclude that the fastq parser (silently) bugs out when quality strings are not of the same lengths. I should implement a check for this and also understand why this happens.

  • To the core issue, when I tried to reproduce your issue aligning the fasta version of your reads, I get that all these reads are aligned either to chr11 or contig KI270831.1. Contig KI270831.1 is of length 296895 (@SQ SN:KI270831.1 LN:296895) and all alignments are within the boundary of this contig. I reran a couple of times but didn't observe any stochasticity in the results.

I'm therefore wondering whether in your case, uLTRA made use of an old database.db file created from an earlier run where contig KI270827.1 was present in the GTF annotation? This could be checked by performing a complete rerun of uLTRA by specifying a new index folder path, alternatively removing all contents in the indexing folder.

If you do such a "complete rerun" (and also use a fasta file or properly formatted fastq file) and still observe this behaviour I would have to take more serious debugging action. Let me know.

@MatthiasLienhard
Copy link
Author

I reran the complete pipeline, but the issue persists...
Again, I used the two stage comands directly, with the --disable_infer parameter for the index step. Might that cause the issue?

# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# prepare the isoseq reads
samtools fastq ${isoseq_reads}.bam > ${isoseq_reads}.fastq

# align the reads
uLTRA align GRCh38.genome.fa ${isoseq_reads}.fastq . --prefix ${isoseq_reads}_aligned --index uLTRA_index --isoseq --t 48
samtools view -Sb ${isoseq_reads}_aligned.sam | samtools sort -o ${isoseq_reads}_aligned.bam -
samtools index ${isoseq_reads}_aligned.bam
samtools view ${isoseq_reads}_aligned.bam KI270827.1|wc -l
1001
samtools view ${isoseq_reads}_aligned.bam KI270827.1|head -1
m64080_200120_120529/16976146/ccs       256     KI270827.1      134490  0       24=1X9=1X24=1X28=851N166=1X109=4748N98=865N161=220N133=6911N119=1246N108=1210N101=2653N102=232N121=2393N232= *       0       0       GGCTCATCCCGGAAGGACCGGTGTCTAGGTCACCCTGGAGCGCTCACCCCACCGGCACCCGTGCCCAAGCCCGCCCCTGCAAAGGCAGGCAAGGCCAGGCGGGTGCTGCCTGGGACCCAGTGACTCAGCACCCCTGCCCGGATCAACTGGACTTTTGCCCCCTGCTCCGCCAGCCTCCTGCTTGGATCTCTCCTGGGTCTCCCTGCTGCGCCTGTCCAGGATGCAGGGAGCTCGGGCTCCCAGGGACCAGGGCCGGTCCCCCGGCAGGATGAGCGCTCTAGGCCGGTCCTCGGTCATCTTGCTTACCTACGTGCTGGCCGCCACAGAACTTACCTGCCTCTTCATGCAGTTCTCCATCGTGCCATACCTGTCTCGGAAACTGGGCCTGGATTCCATTGCCTTCGGCTACCTGCAAACCACCTTCGGGGTGCTGCAGCTGCTGGGCGGGCCGGTGTTTGGCAGGTTCGCAGACCAGCGCGGGGCGCGGGCGGCGCTCACGCTCTCCTTCCTGGCTGCCTTGGCGCTCTACCTGCTCCTGGCGGCCGCCTCCAGCCCGGCCCTGCCCGGGGTCTACCTGCTCTTCGCCTCGCGCCTGCCCGGAGCGCTCATGCACACGCTGCCAGCCGCCCAGATGGTCATCACGGACCTGTCGGCACCCGAGGAGCGGCCCGCGGCCCTGGGCCGGCTGGGCCTCTGCTTCGGCGTCGGAGTCATCCTCGGCTCCCTGCTGGGCGGGACCCTGGTCTCCGCGTACGGGATTCAGTGCCCGGCCATCCTGGCTGCCCTGGCCACCCTCCTGGGAGCTGTCCTCAGCTTCACCTGCATCCCCGCCAGCACCAAAGGGGCCAAAACTGACGCCCAGGCTCCACTGCCAGGCGGCCCCCGGGCCAGTGTGTTCGACCTGAAGGCCATCGCCTCCCTGCTGCGGCTGCCAGACGTCCCGAGGATCTTCCTGGTGAAGGTGGCCTCCAACTGCCCCACAGGGCTCTTCATGGTCATGTTCTCCATCATCTCCATGGACTTCTTCCAGCTGGAGGCCGCCCAAGCTGGCTACCTCATGTCCTTCTTCGGGCTCCTCCAGATGGTGACCCAGGGCCTGGTCATCGGGCAGCTGAGCAGCCACTTCTCGGAGGAGGTGCTGCTCCGGGCCAGCGTGCTGGTCTTCATCGTGGTGGGCCTGGCCATGGCCTGGATGTCCAGCGTCTTCCACTTCTGCCTCCTGGTGCCCGGCCTGGTGTTCAGCCTCTGCACCCTCAACGTGGTCACCGACAGCATGCTGATCAAGGCTGTCTCCACCTCGGACACAGGGACCATGCTGGGCCTCTGCGCCTCTGTACAACCACTGCTCCGAACTCTGGGACCCACGGTCGGCGGCCTCCTGTACCGCAGCTTTGGCGTCCCCGTCTTCGGCCACGTGCAGGTTGCTATCAATACCCTTGTCCTCCTGGTCCTCTGGAGGAAACCTATGCCCCAGAGGAAGGACAAAGTCCGGTGACCGCTGCCCAGACACAGACTGGCAATAAACTCCTACTAAATCCC        *       XA:Z:ENST00000610526.4  XC:Z:FSM        NM:i:4
grep ENST00000610526.4 gencode.v36.chr_patch_hapl_scaff.annotation.gtf | head -1
KI270831.1      HAVANA  transcript      134406  157362  .       +       .       gene_id "ENSG00000276130.4"; transcript_id "ENST00000610526.4"; gene_type "protein_coding"; gene_name "SLC22A18"; transcript_type "protein_coding"; transcript_name "SLC22A18-214"; level 2; protein_id "ENSP00000479357.1"; transcript_support_level "1"; hgnc_id "HGNC:10964"; tag "basic"; havana_gene "OTTHUMG00000190753.1"; havana_transcript "OTTHUMT00000485914.1";

As you found, the transcript is actually on KI270831.1, and the position seems to match. However, for me KI270827.1 is reported.
I'll rerun the indexing without the --disable-infer, to see if that solves the issue. How exactly did you process the example?

@ksahlin
Copy link
Owner

ksahlin commented May 20, 2021

Thanks for your time with this! Really helps as the bug may be difficult. I will rerun it on another machine to try to reproduce this bug.

Are you sure that you didn't specify the full path to the gtf file? I get an error ValueError: unknown url type: from gffutilswhen not specifying the full path, as described here

I ran with these commands specifically:

./uLTRA index GRCh38.p13.genome.fa /Users/kxs624/Downloads/uLTRA_issue/gencode.v36.chr_patch_hapl_scaff.annotation.gtf ~/tmp/ULTRA/human_test/ --disable_infer

./uLTRA align ~/Documents/data/genomes/human/HG_38/GRCh38.p13.genome.fa ~/Downloads/uLTRA_issue/test.fast[a/q] ~/tmp/ULTRA/human_test/ --isoseq

I also tried the alignment step using only one core, i.e., with --t 1.

I ran on my Macbook. Pythons dictionary iteration order change across runs and I use a dictionary to store reference name to ref_ID (used internally in program) so will start by examining this property.

@MatthiasLienhard
Copy link
Author

You are right I used absolute paths. I removed them as I thought it makes things more easy to read.

I noticed a mistake on my side, not sure whether this is relevant:
I used GRCh38.genome.fa (e.g. not containing the haplotypes) instead of GRCh38.p13.genome.fa - so the genome fasta did not contain all sequences mentioned in the gtf. Both genomes do contain the sequence for KI270827.1 though. And the tool did not complain about the missing sequences.

I reran the index and alignment steps with GRCh38.p13.genome.fa and got different results, however having the same issue. This time I got 45 alignments on KI270827.1, of which 42 are beyond the end position of the sequence. I also noticed other sequences affected, but I guess this one is very obvious, as it is relatively short. If you need more examples (e.g. the index I produced) let me know.

As you mention that dict implementation might be related to it, my exact python version and linux kernel:

python --version -VV
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
uname -srm
Linux 5.10.24.mx64.375 x86_64

@ksahlin
Copy link
Owner

ksahlin commented Jul 31, 2021

I have not forgotten about this issue. Will take a look at it again after vacation.

ksahlin added a commit that referenced this issue Oct 23, 2022
@ksahlin
Copy link
Owner

ksahlin commented Oct 26, 2022

I can finally close this issue as it, with the highest likelihood, is fixed due to a very similar bug report in #17 that I just fixed.

If you still have this instance easily set up, it would be good with a verification, but I'm not expecting anything as it was so long ago now.

@ksahlin ksahlin closed this as completed Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants