Two alignments from the same subread to the CCS read #4

sjin09 · 2021-09-02T10:54:36Z

Hello,

I was exploring the test dataset and the alignment before applying deepConsensus on my dataset. I noticed that m64014_181209_091052/3146438/0_11098 subread was aligned both in the forward and reverse direction to the m64014_181209_091052/3146438/ccs.

I would assume that both alignments will be used for the consensus calling by deepconsenus, which I believe should not be the case. Or does deepconsensus have an internal filter to account for double counting of the subreads for the target CCS read?

Regards,
Sangjin

danielecook · 2021-09-02T14:40:58Z

m64014_181209_091052/3146438/0_11098    16
m64014_181209_091052/3146438/0_11098    2064

Actually it looks like the second alignment is a supplementary alignment (explain flags).

Due to the way we perform alignments, it is possible for subreads to align multiple times to the same CCS read (a primary alignment + supplementary alignments). DeepConsensus does have an internal filter: we take the first alignment and discard the rest. Here is the code responsible for that filtering:

deepconsensus/deepconsensus/preprocess/merge_datasets_transforms.py

Lines 156 to 170 in a497d55

    
           """Merge aligned read proto and unaligned read proto.""" 
        
           (_, (aligned_reads, unaligned_reads)) = subread_data 
        
           if not aligned_reads or not unaligned_reads: 
        
             return 
        
           aligned = aligned_reads[0] 
        
           unaligned = unaligned_reads[0] 
        
           # Do not error when there exist, for a given read, multiple alignments to 
        
           # correct molecule. Some of the alignments may be supplementary alignments. 
        
           # Keep a count and use the first alignment. 
        
           if len(aligned_reads) > 1: 
        
             logging.info('Unexpected: %d aligned reads for %s', len(aligned_reads), 
        
                          aligned.fragment_name) 
        
             self.multiple_alignments_counter.inc()

Please reopen if you have further questions.

sjin09 · 2021-09-03T11:00:14Z

I find instances where the supplementary alignment occurs before the primary alignment and the deepconsensus internal filter subsequently would used the supplementary alignment instead of the primary alignment for the consensus calling.

The attached image shows that the supplementary alignment for subread m64089_200122_110840/198615/5556_21349 before the primary alignment.

Simple python script to only select primary alignments with matching ZMW before deepConsensus.

import sys
import pysam
import natsort
import argparse

class BAM:
    def __init__(self, line):
        # target
        self.tname = line.reference_name
        self.tzmw = self.tname.split("/")[1]
        # query
        self.qname = line.query_name
        self.qzmw = self.qname.split("/")[1]


def parse_args(args):
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "-i",
        "--input",
        type=str,
        required=True,
        help="path to input BAM file"
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        required=True,
        help="path to output BAM file"
    )
    args = args[1:]
    return parser.parse_args(args)


def bamfilter(infile, outfile):
    insam = pysam.AlignmentFile(infile, "rb", check_sq=False)
    outsam = pysam.AlignmentFile(outfile, "wb", template=insam)
    for line in insam:
        read = BAM(line)
        flag = line.to_string().strip().split()[1]
        if (flag == "0" or flag == "16") and read.tzmw == read.qzmw:
            outsam.write(line)
    outsam.close()
    insam.close()
    pysam.index(outfile)


def main():
    options = parse_args(sys.argv)
    bamfilter(options.input, options.output)
    sys.exit(0)


if __name__ == "__main__":
    main()

MariaNattestad · 2021-09-03T16:56:52Z

We were just looking into this, and it looks like supplying --best-n 1 to the pbmm2 align command makes it output only the primary alignment. The most important thing though is that we keep the alignment to the correct ZMW, so we are still investigating the downstream effects of changes to the alignment and filtering process.

danielecook closed this as completed Sep 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two alignments from the same subread to the CCS read #4

Two alignments from the same subread to the CCS read #4

sjin09 commented Sep 2, 2021

danielecook commented Sep 2, 2021 •

edited

sjin09 commented Sep 3, 2021 •

edited

MariaNattestad commented Sep 3, 2021

Two alignments from the same subread to the CCS read #4

Two alignments from the same subread to the CCS read #4

Comments

sjin09 commented Sep 2, 2021

danielecook commented Sep 2, 2021 • edited

sjin09 commented Sep 3, 2021 • edited

MariaNattestad commented Sep 3, 2021

danielecook commented Sep 2, 2021 •

edited

sjin09 commented Sep 3, 2021 •

edited