# Introduction

The aim of this notebook is to determine a sensible threshold for creating a DUST mask. This came about after conversations with Roberto about how to remove the many variants in AT repeats (or near AT repeats) for the PPQ GWAS. In addition, I also evaluate windowmasker and tantan, but find them to be less useful (especially windowmasker, which masks a large proportion of the genome).

The earlier version of the notebook (20151027_dustmasker) used petl intervals to find overlaps between regions, but I then realised that numpy boolean arrays would be much more efficient (important when trying out lots of different thresholds).
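
As a minimal sketch of the boolean-array idea (the helper name `intervals_to_mask` and the toy coordinates are illustrative assumptions, not taken from the notebook), each annotation becomes one boolean per base, and overlaps reduce to element-wise logical operations:

```python
import numpy as np

def intervals_to_mask(intervals, length):
    """Return a boolean array of the given length that is True inside any
    half-open (start, stop) interval."""
    mask = np.zeros(length, dtype=bool)
    for start, stop in intervals:
        mask[start:stop] = True
    return mask

# Toy 20 bp "chromosome" with hypothetical coordinates
core = intervals_to_mask([(2, 18)], 20)
coding = intervals_to_mask([(4, 8), (12, 16)], 20)
masked = intervals_to_mask([(6, 14)], 20)

# Overlap between any number of annotations is a single element-wise AND
core_coding_masked = core & coding & masked
n_core_coding_masked = int(np.count_nonzero(core_coding_masked))
```

The cost of each overlap is linear in the chromosome length and independent of the number of intervals, which is what makes it cheap to repeat for many thresholds.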

In [1]:
%run _shared_setup.ipynb

python 3.4.3 |Anaconda 2.2.0 (64-bit)| (default, Mar  6 2015, 12:03:53) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
numpy 1.9.2
scipy 0.15.1
pandas 0.15.2
numexpr 2.3.1
pysam 0.8.3
petl 1.0.11
petlx 1.0.3
vcf 0.6.7
h5py 2.4.0
tables 3.1.1
vcfplt 0.8
tbl_pgv_metadata length = 5729
tbl_pgv_locations length = 102
tbl_pf3k_metadata length = 2512
tbl_pf_solaris length = 10879
tbl_assembled_samples length = 11


In [2]:
install_dir = '../opt_4'
REF_GENOME="/lustre/scratch110/malaria/rp7/Pf3k/GATKbuild/Pfalciparum_GeneDB_Aug2015/Pfalciparum.genome.fasta"
regions_fn = '/nfs/users/nfs_r/rp7/src/github/malariagen/pf-crosses/meta/regions-20130225.bed.gz'
# regions_fn = '/Users/rpearson/src/github/malariagen/pf-crosses/meta/regions-20130225.bed.gz'
ref_gff = "%s/snpeff/snpEff/data/Pfalciparum_GeneDB_Aug2015/genes.gff" % install_dir
ref_cds_gff = REF_GENOME.replace('.fasta', '.CDS.gff')

In [3]:
!head -n -34 {ref_gff} | grep -P '\tCDS\t' > {ref_cds_gff}

# Download software

In [4]:
# !wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/dustmasker -O {install_dir}/dustmasker
# !chmod a+x {install_dir}/dustmasker

--2015-10-28 10:05:23--  ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/dustmasker
           => `../opt_4/dustmasker'
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 2607:f220:41e:250::7
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/agarwala/dustmasker ... done.
==> SIZE dustmasker ... 11688414
==> PASV ... done.    ==> RETR dustmasker ... done.
Length: 11688414 (11M) (unauthoritative)


2015-10-28 10:05:31 (1.85 MB/s) - `../opt_4/dustmasker' saved [11688414]



In [None]:
!wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker -O {install_dir}/windowmasker
!chmod a+x {install_dir}/windowmasker

In [33]:
current_dir = !pwd
current_dir = current_dir[0]
!wget http://cbrc3.cbrc.jp/~martin/tantan/tantan-13.zip -O {install_dir}/tantan-13.zip
%cd {install_dir}
!unzip tantan-13.zip
%cd tantan-13
!make
%cd {current_dir}

--2015-10-29 10:54:39--  http://cbrc3.cbrc.jp/~martin/tantan/tantan-13.zip
Resolving wwwcache.sanger.ac.uk (wwwcache.sanger.ac.uk)... 172.18.24.2, 172.18.24.1
Connecting to wwwcache.sanger.ac.uk (wwwcache.sanger.ac.uk)|172.18.24.2|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 113155 (111K) [application/zip]
Saving to: `../opt_4/tantan-13.zip'


2015-10-29 10:54:39 (23.5 MB/s) - `../opt_4/tantan-13.zip' saved [113155/113155]

/nfs/users/nfs_r/rp7/src/github/malariagen/methods-dev/pf3k_techbm/opt_4
Archive:  tantan-13.zip
   creating: tantan-13/
  inflating: tantan-13/ChangeLog.txt  
  inflating: tantan-13/Makefile      
  inflating: tantan-13/README.txt    
   creating: tantan-13/src/
  inflating: tantan-13/src/mcf_score_matrix_probs.hh  
  inflating: tantan-13/src/mcf_score_matrix_probs.cc  
  inflating: tantan-13/src/mcf_fasta_sequence.hh  
  inflating: tantan-13/src/mcf_tantan_options.cc  
  inflating: tantan-13/src/mcf_tantan_options.hh  
  inflating: t

# Run algorithms on ref genome

In [4]:
ref_dict=SeqIO.to_dict(SeqIO.parse(open(REF_GENOME), "fasta"))
chromosome_lengths = [len(ref_dict[chrom]) for chrom in ref_dict]
tbl_chromosomes=(etl.wrap(zip(ref_dict.keys(), chromosome_lengths))
    .pushheader(['chrom', 'stop'])
    .addfield('start', 0)
    .cut(['chrom', 'start', 'stop'])
    .sort('chrom')
)
tbl_chromosomes

chrom,start,stop
Pf3D7_01_v3,0,640851
Pf3D7_02_v3,0,947102
Pf3D7_03_v3,0,1067971
Pf3D7_04_v3,0,1200490
Pf3D7_05_v3,0,1343557


In [5]:
tbl_regions = (etl
    .fromtsv(regions_fn)
    .pushheader(['chrom', 'start', 'stop', 'region'])
    .convertnumbers()
)
tbl_regions.display(10)

chrom,start,stop,region
Pf3D7_01_v3,0,27336,SubtelomericRepeat
Pf3D7_01_v3,27336,92900,SubtelomericHypervariable
Pf3D7_01_v3,92900,457931,Core
Pf3D7_01_v3,457931,460311,Centromere
Pf3D7_01_v3,460311,575900,Core
Pf3D7_01_v3,575900,616691,SubtelomericHypervariable
Pf3D7_01_v3,616691,640851,SubtelomericRepeat
Pf3D7_02_v3,0,23100,SubtelomericRepeat
Pf3D7_02_v3,23100,105800,SubtelomericHypervariable
Pf3D7_02_v3,105800,447300,Core


In [6]:
iscore_array = collections.OrderedDict()
for chromosomes_row in tbl_chromosomes.data():
    chrom=chromosomes_row[0]
    iscore_array[chrom] = np.zeros(chromosomes_row[2], dtype=bool)
    for regions_row in tbl_regions.selecteq('chrom', chrom).selecteq('region', 'Core').data():
        iscore_array[chrom][regions_row[1]:regions_row[2]] = True

In [7]:
tbl_ref_cds_gff = (
    etl.fromgff3(ref_cds_gff)
    .select(lambda rec: rec['end'] > rec['start'])
    .unpackdict('attributes')
    .select(lambda rec: rec['Parent'].endswith('1')) # Some genes appear to have alternative splicings; keep only the first transcript
    .distinct(['seqid', 'start'])
)

In [8]:
tbl_coding_regions = (tbl_ref_cds_gff
    .cut(['seqid', 'start', 'end'])
    .rename('end', 'stop')
    .rename('seqid', 'chrom')
    .convert('start', lambda val: val-1)
)
tbl_coding_regions                   

chrom,start,stop
Pf3D7_01_v3,29509,34762
Pf3D7_01_v3,35887,37126
Pf3D7_01_v3,38981,39923
Pf3D7_01_v3,40153,40207
Pf3D7_01_v3,42366,43617


In [9]:
iscoding_array = collections.OrderedDict()
for chromosomes_row in tbl_chromosomes.data():
    chrom=chromosomes_row[0]
    iscoding_array[chrom] = np.zeros(chromosomes_row[2], dtype=bool)
    for coding_regions_row in tbl_coding_regions.selecteq('chrom', chrom).data():
        iscoding_array[chrom][coding_regions_row[1]:coding_regions_row[2]] = True

In [10]:
def which_lower(string):
    return np.array([str.islower(x) for x in string])
which_lower('abCDeF') 
# np.array([str.islower(x) for x in 'abCDeF'])

array([ True,  True, False, False,  True, False], dtype=bool)
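
The list comprehension in `which_lower` calls `str.islower` once per base. A possible faster variant, sketched here under the assumption that the sequence is plain ASCII (the name `which_lower_fast` is hypothetical), compares the byte codes in one vectorised step:

```python
import numpy as np

def which_lower_fast(sequence):
    """Boolean array marking lower-case (soft-masked) bases in an ASCII sequence."""
    codes = np.frombuffer(str(sequence).encode('ascii'), dtype=np.uint8)
    # ASCII 'a'..'z' are codes 97..122
    return (codes >= ord('a')) & (codes <= ord('z'))

result = which_lower_fast('abCDeF')
```

Passing a Biopython `Seq` should also work, since `str()` is applied before encoding.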

In [11]:
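def find_regions(masked_pos, number_of_regions=3):
    # Collapse sorted masked positions into [start, stop] runs and return at
    # most number_of_regions of them. stop is the last masked position
    # (inclusive), so slice with stop + 1 to include it.
    masked_regions = list()
    if len(masked_pos) == 0:
        return masked_regions
    start = masked_pos[0]
    stop = start
    for pos in masked_pos:
        if pos > (stop + 1):
            masked_regions.append([start, stop])
            start = stop = pos
            if len(masked_regions) >= number_of_regions:
                break
        else:
            stop = pos
    else:
        # Ran out of positions before reaching the limit: record the final run
        masked_regions.append([start, stop])
    return masked_regions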
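
For comparison, runs of masked bases can also be found directly from a boolean array without looping over positions. The sketch below is an alternative, not the method used in this notebook; the name `find_runs` is hypothetical, and it returns half-open `[start, stop)` intervals, so `seq[start:stop]` covers the whole run:

```python
import numpy as np

def find_runs(mask, max_runs=None):
    """Return an (n, 2) array of half-open [start, stop) runs of True values."""
    padded = np.concatenate(([False], np.asarray(mask, dtype=bool), [False]))
    # +1 where a run starts, -1 one position past where it ends
    changes = np.diff(padded.astype(np.int8))
    starts = np.where(changes == 1)[0]
    stops = np.where(changes == -1)[0]
    runs = np.column_stack((starts, stops))
    return runs if max_runs is None else runs[:max_runs]

example = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
runs = find_runs(example)
```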

In [12]:
def summarise_masking(
    classification_array,
    masking_description = "Dust level 20",
    number_of_regions = 10,
    max_sequence_length = 60
):
    number_core_coding_masked = np.count_nonzero(classification_array['Core coding masked'])
    number_core_coding_unmasked = np.count_nonzero(classification_array['Core coding unmasked'])
    number_core_noncoding_masked = np.count_nonzero(classification_array['Core noncoding masked'])
    number_core_noncoding_unmasked = np.count_nonzero(classification_array['Core noncoding unmasked'])
    proportion_core_coding_masked = number_core_coding_masked / (number_core_coding_masked + number_core_coding_unmasked)
    proportion_core_noncoding_masked = number_core_noncoding_masked / (number_core_noncoding_masked + number_core_noncoding_unmasked)
    print("%s: %4.1f%% coding and %4.1f%% non-coding masked" % (
            masking_description,
            proportion_core_coding_masked*100,
            proportion_core_noncoding_masked*100
        )
    )
    coding_masked_pos = np.where(classification_array['Core coding masked'])[0]
    noncoding_masked_pos = np.where(classification_array['Core noncoding masked'])[0]
    coding_masked_regions = find_regions(coding_masked_pos, number_of_regions)
    noncoding_masked_regions = find_regions(noncoding_masked_pos, number_of_regions)    
    
    print("    First %d Pf3D7_01_v3 coding sequences masked:" % number_of_regions)
    for region in coding_masked_regions:
        if region[1] - region[0] > max_sequence_length:
            masked_sequence = "%s[...]" % ref_dict['Pf3D7_01_v3'].seq[region[0]:region[1]][0:max_sequence_length]
        else:
            masked_sequence = ref_dict['Pf3D7_01_v3'].seq[region[0]:region[1]]
        print("        %d - %d: %s" % (
                region[0],
                region[1],
                masked_sequence
            )
        )
    print("    First %d Pf3D7_01_v3 non-coding sequences masked:" % number_of_regions)
    for region in noncoding_masked_regions:
        if region[1] - region[0] > max_sequence_length:
            masked_sequence = "%s[...]" % ref_dict['Pf3D7_01_v3'].seq[region[0]:region[1]][0:max_sequence_length]
        else:
            masked_sequence = ref_dict['Pf3D7_01_v3'].seq[region[0]:region[1]]
        print("        %d - %d: %s" % (
                region[0],
                region[1],
                masked_sequence
            )
        )
    print()
    

In [13]:
def evaluate_dust_threshold(
    dust_level=20,
    verbose=False
):
    masked_genome_fn = "%s.dustmasker.%d.fasta" % (REF_GENOME.replace('.fasta', ''), dust_level)
    
    if verbose:
        print("Running dustmasker %d" % dust_level)
    !{install_dir}/dustmasker \
    -in {REF_GENOME} \
    -outfmt fasta \
    -out {masked_genome_fn} \
    -level {dust_level}

    if verbose:
        print("Reading in fasta %d" % dust_level)
    masked_ref_dict=SeqIO.to_dict(SeqIO.parse(open(masked_genome_fn), "fasta"))

    if verbose:
        print("Creating mask array %d" % dust_level)
    ismasked_array = collections.OrderedDict()
    classification_array = collections.OrderedDict()
    
    genome_length = sum([len(ref_dict[chrom]) for chrom in ref_dict])
    for region_type in [
        'Core coding unmasked',
        'Core coding masked',
        'Core noncoding unmasked',
        'Core noncoding masked',
        'Noncore coding unmasked',
        'Noncore coding masked',
        'Noncore noncoding unmasked',
        'Noncore noncoding masked',
    ]:
        classification_array[region_type] = np.zeros(genome_length, dtype=bool)
        
    offset=0
    for chromosomes_row in tbl_chromosomes.data():
        chrom=chromosomes_row[0]
        masked_ref_dict_chrom = "lcl|%s" % chrom
        if verbose:
            print(chrom)
        chrom_length=chromosomes_row[2]
        ismasked_array[chrom] = which_lower(masked_ref_dict[masked_ref_dict_chrom].seq)
        classification_array['Core coding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core coding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Core noncoding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core noncoding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        classification_array['Noncore coding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore coding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Noncore noncoding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore noncoding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        offset = offset + chrom_length

    summarise_masking(classification_array, "Dust level %d" % dust_level)
                      
#     number_core_coding_masked = np.count_nonzero(classification_array['Core coding masked'])
#     number_core_coding_unmasked = np.count_nonzero(classification_array['Core coding unmasked'])
#     number_core_noncoding_masked = np.count_nonzero(classification_array['Core noncoding masked'])
#     number_core_noncoding_unmasked = np.count_nonzero(classification_array['Core noncoding unmasked'])
#     proportion_core_coding_masked = number_core_coding_masked / (number_core_coding_masked + number_core_coding_unmasked)
#     proportion_core_noncoding_masked = number_core_noncoding_masked / (number_core_noncoding_masked + number_core_noncoding_unmasked)
#     print("dustmasker dust_level=%d: %4.1f%% coding and %4.1f%% non-coding masked" % (
#             dust_level,
#             proportion_core_coding_masked*100,
#             proportion_core_noncoding_masked*100,
# #             ''.join(np.array(ref_dict['Pf3D7_01_v3'].seq)[classification_array['Core coding masked'][0:640851]])[0:60]
#        )
#     )
#     non_coding_masked_pos = np.where(classification_array['Core coding masked'])[0]
#     for masked_coding_region in find_regions(non_coding_masked_pos):
#         print("\t%d - %d: %s" % (
#                 masked_coding_region[0],
#                 masked_coding_region[1],
#                 ref_dict['Pf3D7_01_v3'].seq[masked_coding_region[0]:masked_coding_region[1]]
#             )
#         )

    return(classification_array, masked_ref_dict, ismasked_array)


In [14]:
def evaluate_windowmasker(
    check_dup='true',
    use_dust='false',
    verbose=False
):
    ustat_fn = "%s.windowmasker.%s.%s.ustat" % (REF_GENOME.replace('.fasta', ''), check_dup, use_dust)
    masked_genome_fn = "%s.windowmasker.%s.%s.fasta" % (REF_GENOME.replace('.fasta', ''), check_dup, use_dust)
    
    if verbose:
        print("Running windowmasker check_dup=%s use_dust=%s" % (check_dup, use_dust))
    !{install_dir}/windowmasker -mk_counts \
    -in {REF_GENOME} \
    -checkdup {check_dup} \
    -out {ustat_fn}

    !{install_dir}/windowmasker \
    -ustat {ustat_fn} \
    -in {REF_GENOME} \
    -outfmt fasta \
    -out {masked_genome_fn} \
    -dust {use_dust}

    if verbose:
        print("Reading in fasta check_dup=%s use_dust=%s" % (check_dup, use_dust))
    masked_ref_dict=SeqIO.to_dict(SeqIO.parse(open(masked_genome_fn), "fasta"))

    if verbose:
        print("Creating mask array check_dup=%s use_dust=%s" % (check_dup, use_dust))
    ismasked_array = collections.OrderedDict()
    classification_array = collections.OrderedDict()
    
    genome_length = sum([len(ref_dict[chrom]) for chrom in ref_dict])
    for region_type in [
        'Core coding unmasked',
        'Core coding masked',
        'Core noncoding unmasked',
        'Core noncoding masked',
        'Noncore coding unmasked',
        'Noncore coding masked',
        'Noncore noncoding unmasked',
        'Noncore noncoding masked',
    ]:
        classification_array[region_type] = np.zeros(genome_length, dtype=bool)
        
    offset=0
    for chromosomes_row in tbl_chromosomes.data():
        chrom=chromosomes_row[0]
        if verbose:
            print(chrom)
        chrom_length=chromosomes_row[2]
        ismasked_array[chrom] = which_lower(masked_ref_dict[chrom].seq)
        classification_array['Core coding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core coding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Core noncoding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core noncoding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        classification_array['Noncore coding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore coding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Noncore noncoding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore noncoding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        offset = offset + chrom_length

    summarise_masking(classification_array, "windowmasker check_dup=%s use_dust=%s" % (check_dup, use_dust))

#     number_core_coding_masked = np.count_nonzero(classification_array['Core coding masked'])
#     number_core_coding_unmasked = np.count_nonzero(classification_array['Core coding unmasked'])
#     number_core_noncoding_masked = np.count_nonzero(classification_array['Core noncoding masked'])
#     number_core_noncoding_unmasked = np.count_nonzero(classification_array['Core noncoding unmasked'])
#     proportion_core_coding_masked = number_core_coding_masked / (number_core_coding_masked + number_core_coding_unmasked)
#     proportion_core_noncoding_masked = number_core_noncoding_masked / (number_core_noncoding_masked + number_core_noncoding_unmasked)
#     print("windowmasker check_dup=%s use_dust=%s: %4.1f%% coding and %4.1f%% non-coding masked\n\t%s" % (
#             check_dup,
#             use_dust,
#             proportion_core_coding_masked*100,
#             proportion_core_noncoding_masked*100,
#             ''.join(np.array(ref_dict['Pf3D7_01_v3'].seq)[classification_array['Core coding masked'][0:640851]])[0:60]
#         )
#     )
        
    return(classification_array, masked_ref_dict, ismasked_array)


In [15]:
def evaluate_tantan(
    r=0.005,
    m=None,
    verbose=False
):
    masked_genome_fn = "%s.tantan.%s.%s.fasta" % (REF_GENOME.replace('.fasta', ''), r, m)
    
    if verbose:
        print("Running tantan r=%s m=%s" % (r, m))
    if m is None:
        !{install_dir}/tantan-13/src/tantan -r {r} {REF_GENOME} > {masked_genome_fn}
    elif m == 'atMask.mat':
        !{install_dir}/tantan-13/src/tantan -r {r} -m {install_dir}/tantan-13/test/atMask.mat {REF_GENOME} > \
            {masked_genome_fn}
    else:
        raise ValueError("Unknown option m=%s" % m)

    if verbose:
        print("Reading in fasta r=%s m=%s" % (r, m))
    masked_ref_dict=SeqIO.to_dict(SeqIO.parse(open(masked_genome_fn), "fasta"))

    if verbose:
        print("Creating mask array r=%s m=%s" % (r, m))
    ismasked_array = collections.OrderedDict()
    classification_array = collections.OrderedDict()
    
    genome_length = sum([len(ref_dict[chrom]) for chrom in ref_dict])
    for region_type in [
        'Core coding unmasked',
        'Core coding masked',
        'Core noncoding unmasked',
        'Core noncoding masked',
        'Noncore coding unmasked',
        'Noncore coding masked',
        'Noncore noncoding unmasked',
        'Noncore noncoding masked',
    ]:
        classification_array[region_type] = np.zeros(genome_length, dtype=bool)
        
    offset=0
    for chromosomes_row in tbl_chromosomes.data():
        chrom=chromosomes_row[0]
        if verbose:
            print(chrom)
        chrom_length=chromosomes_row[2]
        ismasked_array[chrom] = which_lower(masked_ref_dict[chrom].seq)
        classification_array['Core coding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core coding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Core noncoding unmasked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Core noncoding masked'][offset:(offset+chrom_length)] = (
            iscore_array[chrom] & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        classification_array['Noncore coding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore coding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & iscoding_array[chrom] & ismasked_array[chrom]
        )
        classification_array['Noncore noncoding unmasked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & np.logical_not(ismasked_array[chrom])
        )
        classification_array['Noncore noncoding masked'][offset:(offset+chrom_length)] = (
            np.logical_not(iscore_array[chrom]) & np.logical_not(iscoding_array[chrom]) & ismasked_array[chrom]
        )
        offset = offset + chrom_length

    summarise_masking(classification_array, "tantan r=%s m=%s" % (r, m))

#     number_core_coding_masked = np.count_nonzero(classification_array['Core coding masked'])
#     number_core_coding_unmasked = np.count_nonzero(classification_array['Core coding unmasked'])
#     number_core_noncoding_masked = np.count_nonzero(classification_array['Core noncoding masked'])
#     number_core_noncoding_unmasked = np.count_nonzero(classification_array['Core noncoding unmasked'])
#     proportion_core_coding_masked = number_core_coding_masked / (number_core_coding_masked + number_core_coding_unmasked)
#     proportion_core_noncoding_masked = number_core_noncoding_masked / (number_core_noncoding_masked + number_core_noncoding_unmasked)
#     print("tantan r=%s m=%s: %4.1f%% coding and %4.1f%% non-coding masked\n\t%s" % (
#             r,
#             m,
#             proportion_core_coding_masked*100,
#             proportion_core_noncoding_masked*100,
#             ''.join(np.array(ref_dict['Pf3D7_01_v3'].seq)[classification_array['Core coding masked'][0:640851]])[0:60]
#         )
#     )
        
    return(classification_array, masked_ref_dict, ismasked_array)
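
The three `evaluate_*` functions repeat the same eight-way classification. One way this could be factored out is sketched below; the helper name `classify_positions` and the toy arrays are assumptions for illustration, while the keys follow the region-type labels used above:

```python
import collections
import numpy as np

def classify_positions(is_core, is_coding, is_masked):
    """Split positions into the eight core/coding/masked categories.

    All three inputs are boolean arrays of equal length.
    """
    classes = collections.OrderedDict()
    for core_label, core in (('Core', is_core), ('Noncore', ~is_core)):
        for coding_label, coding in (('coding', is_coding), ('noncoding', ~is_coding)):
            for masked_label, masked in (('unmasked', ~is_masked), ('masked', is_masked)):
                key = "%s %s %s" % (core_label, coding_label, masked_label)
                classes[key] = core & coding & masked
    return classes

# Toy arrays (hypothetical values)
is_core = np.array([True, True, True, False])
is_coding = np.array([True, True, False, False])
is_masked = np.array([True, False, True, True])
classes = classify_positions(is_core, is_coding, is_masked)
```

Genome-wide arrays could then be built by concatenating the per-chromosome inputs with `np.concatenate` before a single call.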


In [16]:
dustmasker_classification_arrays = collections.OrderedDict()
for dust_level in [20, 30, 40, 50, 60, 70, 80, 90, 100]:
    dustmasker_classification_arrays[str(dust_level)] = evaluate_dust_threshold(dust_level, verbose=False)

................
Dust level 20: 29.0% coding and 70.9% non-coding masked
    First 10 Pf3D7_01_v3 coding sequences masked:
        98885 - 98891: aaaaaa
        99590 - 99597: aaaaaaa
        99980 - 100043: aaaagctgcagaagcagaaatgaagaaaagagctcaaaaaccgaagaagaaaaaaagtag[...]
        100059 - 100066: ggggggg
        100758 - 100764: aaaaaa
        101525 - 101531: aaaaaa
        101577 - 101591: aaaaaagcaaaaaa
        101716 - 102281: atgatgctgaagaaaatgtagaacatgatgctgaagaaaatgttgaagaaaatgtagaag[...]
        104965 - 104971: aaaaaa
        105030 - 105139: taagtaaaaagacttcagtggaagagaaactaaaagaaaaagttgtaaaagaaaataaag[...]
    First 10 Pf3D7_01_v3 non-coding sequences masked:
        92955 - 93094: aataattatatataataatttattttatacttgttttattatatattcctctataaaaca[...]
        93097 - 93103: aaaaaa
        93850 - 93939: atattatattgtgaacaaaaaataattaatataaaaaggtaaatggatgtaatatatata[...]
        94015 - 94021: aaaaaa
        94119 - 94185: tattattattttatattttatatttttagaaaagaatattatatgtttttaaaataaatt

In [16]:
windowmasker_classification_arrays = collections.OrderedDict()
for check_dup in ['true', 'false']:
    windowmasker_classification_arrays[check_dup] = collections.OrderedDict()
    for use_dust in ['true', 'false']:
        windowmasker_classification_arrays[check_dup][use_dust] = evaluate_windowmasker(check_dup, use_dust, verbose=False)

computing the genome length
pass 1
pass 2
windowmasker check_dup=true use_dust=true: 52.0% coding and 87.6% non-coding masked
    First 10 Pf3D7_01_v3 coding sequences masked:
        98885 - 98891: aaaaaa
        98908 - 98941: tttgatgatgaagaaaaaagaaatgaaaataag
        98994 - 99012: actatatatcattttaaa
        99219 - 99219: 
        99582 - 99602: gacattataaaaaaaatgca
        99623 - 99668: ggatattaataaaagaaaatatgattctttaaaagaaaaattaca
        99813 - 99829: cagaaatatttaaatc
        99906 - 99921: agaaaaattatgaat
        99935 - 99956: ctttaaacatataaatgaatt
        99980 - 100043: aaaagctgcagaagcagaaatgaagaaaagagctcaaaaaccgaagaagaaaaaaagtag[...]
    First 10 Pf3D7_01_v3 non-coding sequences masked:
        92900 - 92930: ataatacatgcataatattaaaatgtattt
        92952 - 93094: accaataattatatataataatttattttatacttgttttattatatattcctctataaa[...]
        93097 - 93103: aaaaaa
        93129 - 93149: tagttaaaatatataatctt
        93168 - 93204: ctgtaaatatatacttatgtttattatataaatcaa
        93221

In [16]:
tantan_classification_arrays = collections.OrderedDict()
for m in ['atMask.mat', None]:
    tantan_classification_arrays[str(m)] = collections.OrderedDict()
#     for r in [0.000000000001, 0.000000001, 0.000001, 0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]:
    for r in [0.000000000001, 0.000000001, 0.000001, 0.001, 0.1]:
        tantan_classification_arrays[str(m)][str(r)] = evaluate_tantan(r, m, verbose=False)

tantan r=1e-12 m=atMask.mat:  5.1% coding and  6.8% non-coding masked
    First 10 Pf3D7_01_v3 coding sequences masked:
        100362 - 100535: gaacatgtagaagaacacactgctgatgacgaacatgtagaagaaccaactgttgctgat[...]
        101707 - 101956: atgtagaacatgatgctgaagaaaatgtagaacatgatgctgaagaaaatgttgaagaaa[...]
        101958 - 102269: gtagaagaaaatgttgaagaatatgatgaagaaaatgttgaagaagtagaagaaaatgta[...]
        113105 - 113241: ttcttcttgttcttgtttttgttcttgttctcgtttttgttcttcttcttgttttttttt[...]
        127265 - 127360: attattattattattatgattcatgttattcatataatttatatgattcatattatccat[...]
        127695 - 127758: ttattattattattattattattattattgttgttgttgttatgattattattattgtta[...]
        135562 - 135678: ttataatcatcatcatatatattattatcatcaccacatatattataatcatcatcatat[...]
        136456 - 136720: ttactatgaatatcttcattactatgaatatcttctttaccatgaatatcttcattccta[...]
        137473 - 137557: ttattattattgtttttgttattattattattgtttttgttattattattattgttttta[...]
        144708 - 144826: tccatattgttatttttttgaatattattatttttt

In [18]:
str(ref_dict['Pf3D7_01_v3'].seq[100362:100536])

'gaacatgtagaagaacacactgctgatgacgaacatgtagaagaaccaactgttgctgatgatgaacatgtagaagaaccaactgttgctgatgaacacgtagaagaaccaactgttgctgaagaacatgtagaagaaccaactgttgctgaagaacacgtagaagaaccagct'

In [19]:
str(ref_dict['Pf3D7_01_v3'].seq[101707:101957])

'atgtagaacatgatgctgaagaaaatgtagaacatgatgctgaagaaaatgttgaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgtagaagaaaatgttgaagaaaatgtagaagaaaatgttgaagaaaatgtagaagaaaatgtagaagaaaatgttgaagaatatgatgaagaaaatgttgaaga'

In [58]:
for dust_level in [20, 30, 40, 50, 60, 70]:
    summarise_masking(
        dustmasker_classification_arrays[str(dust_level)][0],
        "Dust level %d" % dust_level
    )

Dust level 20: 29.0% coding and 70.9% non-coding masked
    First 10 Pf3D7_01_v3 coding sequences masked:
        98885 - 98891: aaaaaa
        99590 - 99597: aaaaaaa
        99980 - 100043: aaaagctgcagaagcagaaatgaagaaaagagctcaaaaaccgaagaagaaaaaaagtag[...]
        100059 - 100066: ggggggg
        100758 - 100764: aaaaaa
        101525 - 101531: aaaaaa
        101577 - 101591: aaaaaagcaaaaaa
        101716 - 102281: atgatgctgaagaaaatgtagaacatgatgctgaagaaaatgttgaagaaaatgtagaag[...]
        104965 - 104971: aaaaaa
        105030 - 105139: taagtaaaaagacttcagtggaagagaaactaaaagaaaaagttgtaaaagaaaataaag[...]
    First 10 Pf3D7_01_v3 non-coding sequences masked:
        92955 - 93094: aataattatatataataatttattttatacttgttttattatatattcctctataaaaca[...]
        93097 - 93103: aaaaaa
        93850 - 93939: atattatattgtgaacaaaaaataattaatataaaaaggtaaatggatgtaatatatata[...]
        94015 - 94021: aaaaaa
        94119 - 94185: tattattattttatattttatatttttagaaaagaatattatatgtttttaaaataaatt[...]
        942

In [59]:
for check_dup in ['true', 'false']:
    for use_dust in ['true', 'false']:
        summarise_masking(
            windowmasker_classification_arrays[check_dup][use_dust][0],
            "windowmasker check_dup=%s use_dust=%s" % (check_dup, use_dust)
        )


NameError: name 'windowmasker_classification_arrays' is not defined

In [21]:
# for r in [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]:
for r in [0.000000000001, 0.000000001, 0.000001, 0.001, 0.1]:
    for m in [None, 'atMask.mat']:
        summarise_masking(
            tantan_classification_arrays[str(m)][str(r)][0],
            "tantan r=%s m=%s" % (r, m)
        )
       

tantan r=1e-12 m=None:  8.5% coding and 47.2% non-coding masked
    First 10 Pf3D7_01_v3 coding sequences masked:
        100357 - 100564: atgatgaacatgtagaagaacacactgctgatgacgaacatgtagaagaaccaactgttg[...]
        101705 - 102281: aaatgtagaacatgatgctgaagaaaatgtagaacatgatgctgaagaaaatgttgaaga[...]
        113093 - 113244: tttttgttcttcttcttcttgttcttgtttttgttcttgttctcgtttttgttcttcttc[...]
        115632 - 115672: ttactattattattattattattattattattattattag
        121147 - 121248: gtaaaaaaacagatataatattcccaaaaaagatataataataaatttaattaatataga[...]
        125474 - 125483: agctaaata
        127224 - 127362: ttaatattattacaattattctttatattacaattattattattattattattattatga[...]
        127673 - 127760: aatattattattatatctgttattattattattattattattattattattgttgttgtt[...]
        135537 - 135678: attattatcatcatcatcatatatgttataatcatcatcatatatattattatcatcacc[...]
        135853 - 135902: tttaattttttattttttttattttccttttccttttccttttccgttt
    First 10 Pf3D7_01_v3 non-coding sequences masked:
        93903 - 939

In [19]:
import pickle
dustmasker_classification_arrays_fn = REF_GENOME.replace('.fasta', '.dustmasker_classification_arrays.p')
with open(dustmasker_classification_arrays_fn, "wb") as f:
    pickle.dump(dustmasker_classification_arrays, f)

MemoryError: 
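
The object being pickled holds, for every dust level, eight genome-length boolean arrays plus the masked sequences, which is probably why memory runs out. A lighter option might be to store only the boolean arrays, bit-packed and compressed. This is a sketch under that assumption; `save_masks`, `load_masks`, and the key `level_20` are hypothetical names, and whether it avoids the error on this machine is untested:

```python
import os
import tempfile
import numpy as np

def save_masks(filename, masks):
    """Save a dict of boolean arrays, packing eight positions into each byte."""
    packed = {}
    for name, mask in masks.items():
        packed[name] = np.packbits(mask)
        # packbits pads to a whole byte, so keep the true length alongside
        packed[name + '__length'] = np.array(mask.size)
    np.savez_compressed(filename, **packed)

def load_masks(filename):
    """Inverse of save_masks."""
    masks = {}
    with np.load(filename) as data:
        for name in data.files:
            if name.endswith('__length'):
                continue
            length = int(data[name + '__length'])
            masks[name] = np.unpackbits(data[name])[:length].astype(bool)
    return masks

# Round trip on a toy mask in a temporary directory
mask = np.array([True, False, True, True, False, False, True, False, True, True])
path = os.path.join(tempfile.mkdtemp(), 'masks.npz')
save_masks(path, {'level_20': mask})
restored = load_masks(path)
```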