In this notebook I trial a working proof of concept of the core functionality.

Main design:
- An indexed BAM file is read by multiple workers in parallel.
- Each worker fetches a region and applies a user-specified function over that region.
- Regions can be specified manually by the user or defined automatically according to different schemes. The first scheme to implement is `byfeature`, which bins the genome into N non-overlapping regions such that each feature, given by its coordinates `(start, end)`, is contained in exactly one region.
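
The `byfeature` idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the scheme as implemented in `mppysam`: the helper name `bin_by_feature` is hypothetical, features are assumed to be 0-based half-open `(start, end)` intervals on a single contig, and regions are balanced by feature count. Overlapping features are first merged into clusters, so a region boundary can never cut through a feature.

```python
def bin_by_feature(features, n_regions):
    """Group (start, end) features into at most n_regions non-overlapping regions.

    Hypothetical helper for illustration; not part of mppysam.
    Assumes 0-based, half-open intervals on a single contig.
    """
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")

    # Merge overlapping features into clusters: [start, end, feature_count].
    clusters = []
    for start, end in sorted(features):
        if clusters and start < clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], end)
            clusters[-1][2] += 1
        else:
            clusters.append([start, end, 1])

    total = sum(count for _, _, count in clusters)
    regions = []
    current = None  # region being built, as [start, end]
    seen = 0        # number of features assigned so far
    for start, end, count in clusters:
        if current is None:
            current = [start, end]
        else:
            current[1] = end
        seen += count
        # Close the region once the cumulative feature count reaches the
        # quota for the next region (integer arithmetic avoids rounding).
        if seen * n_regions >= total * (len(regions) + 1):
            regions.append(tuple(current))
            current = None
    return regions


features = [(0, 10), (5, 20), (30, 40), (50, 60), (55, 70), (80, 90)]
regions = bin_by_feature(features, n_regions=3)
print(regions)
```

Because whole clusters are packed greedily, the function may return fewer than `n_regions` regions when many features overlap, and the regions cover only the span of the features rather than the whole contig.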

In [1]:
import pysam
import mppysam.read_bam as rb
from datetime import datetime

In [2]:
bamfilepath = "../data/SLX-18505_N701_A03_r2.umiAppend.Aligned.out.featureCounts.sorted.bam"

In [3]:
samfile = pysam.AlignmentFile(bamfilepath, "rb")

In [4]:
samfile.lengths[0:9]

(248956422,
 133797422,
 135086622,
 133275309,
 114364328,
 107043718,
 101991189,
 90338345,
 83257441)

In [5]:
samfile.references[0:9]

('1.human',
 '10.human',
 '11.human',
 '12.human',
 '13.human',
 '14.human',
 '15.human',
 '16.human',
 '17.human')

In [6]:
samfile.nreferences

352

In [7]:
# Baseline: read the whole BAM file with a single worker process.
print(datetime.now())
bam = rb.read_bam(bamfilepath, processes=1)
print(datetime.now())

2020-11-27 18:56:07.635149
2020-11-27 18:56:39.494383


In [8]:
# Same read, split across three worker processes.
print(datetime.now())
bam = rb.read_bam(bamfilepath, processes=3)
print(datetime.now())

2020-11-27 18:56:39.504260
2020-11-27 18:57:03.360582


In [9]:
# Restrict the read to a single contig, still with one worker process.
print(datetime.now())
bam = rb.read_bam(bamfilepath, processes=1, contigs=["1.human"])
print(datetime.now())

2020-11-27 18:57:03.365315
2020-11-27 18:57:08.036760


In [10]:
# Read contigs "1.human" to "3.human" with one worker process per contig.
n_contigs = 3
print(datetime.now())
bam = rb.read_bam(
    bamfilepath,
    processes=n_contigs,
    contigs=[f"{chrom}.human" for chrom in range(1, n_contigs + 1)],
)
print(datetime.now())

2020-11-27 18:57:08.046944
2020-11-27 18:57:11.639905
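
To compare the runs, the printed timestamps can be turned into elapsed seconds. The sketch below copies the start and end times from the outputs above; the run labels and the helper name `elapsed_seconds` are my own and only serve the illustration.

```python
from datetime import datetime

# Timestamps copied from the cell outputs above: (label, start, end).
runs = [
    ("all contigs, processes=1",
     "2020-11-27 18:56:07.635149", "2020-11-27 18:56:39.494383"),
    ("all contigs, processes=3",
     "2020-11-27 18:56:39.504260", "2020-11-27 18:57:03.360582"),
    ("1 contig, processes=1",
     "2020-11-27 18:57:03.365315", "2020-11-27 18:57:08.036760"),
    ("3 contigs, processes=3",
     "2020-11-27 18:57:08.046944", "2020-11-27 18:57:11.639905"),
]


def elapsed_seconds(start, end):
    """Return the wall-clock seconds between two ISO-formatted timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


timings = {label: elapsed_seconds(start, end) for label, start, end in runs}
for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} s")

# Ratio of the single-process to the three-process whole-file read.
speedup = timings["all contigs, processes=1"] / timings["all contigs, processes=3"]
print(f"whole-file speedup with 3 processes: {speedup:.2f}x")
```

In a later iteration, wrapping each call with `time.perf_counter()` would give the elapsed time directly and avoid the manual subtraction.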
