# Fixed experimental parameters

Results of varying `minRepeatPeriod` showed that we may confine our analysis to: 

In [2]:
! echo "minRepeatPeriod = $($root/bin/jq --raw-output '.makeRegions.minRepeatPeriod' data/minRepeatLength=0/config.json)"

minRepeatPeriod = 6


In [3]:
root = "/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/analysis/locally_assemble_short_reads/trfermikit"

def print_fixed_parameters(config):
    ! $root/bin/jq 'del(.makeRegions.minRepeatLength) | del(.makeRegions.minRepeatPeriod)' $config
    
print_fixed_parameters('data/minRepeatLength=0/config.json')

{
  "makeRegions": {
    "slop": "250",
    "minCoverage": "0",
    "maxCoverage": "200",
    "maxRegionLength": "100000",
    "functionalRegions": "none",
    "genomeBuild": "hg38",
    "overlappedFunctionalRegions": "false",
    "ucscTable": "simpleRepeat"
  },
  "makeCalls": {
    "singleBaseMatchReward": "10",
    "singleBaseMismatchPenalty": "12",
    "gapOpenPenalties": "6,26",
    "gapExte

# The effect of tandem-repeat length on performance

In [2]:
import json
import pandas as pd 

def add_performance(table, truvari_data, tool, calls):
    # Append one row of truvari counts; the last two entries are derived totals.
    table.append([
        tool,
        calls,
        truvari_data['TP-base'],
        truvari_data['FN'],
        truvari_data['FP'],
        truvari_data['TP-base'] + truvari_data['FN'],  # '# real events' column
        truvari_data['TP-base'] + truvari_data['FP']   # '# calls' column
    ])

def create_performance_table(output):
    # (truvari directory suffix, tool, call set) for each comparison against pacbio
    comparisons = [
        ('manta', 'manta', 'all'),
        ('trfermikit', 'trfermikit', 'all'),
        ('trfermikit.unitigSupport', 'trfermikit', 'unitigSupport'),
        ('trfermikit.unitigSupport.thinned', 'trfermikit', 'unitigSupport.thinned'),
    ]
    table = []
    for suffix, tool, calls in comparisons:
        # Each summary.txt is parsed as JSON
        with open('{}/truvari-pacbio-{}/summary.txt'.format(output, suffix)) as json_file:
            add_performance(table, json.load(json_file), tool, calls)
    return table


def visualize_performance_table(output):
    from IPython.display import HTML
    columns = ['tool', 'calls', 'TP', 'FN', 'FP', '# real events', '# calls']
    df_ = pd.DataFrame(
        create_performance_table(output),
        columns=columns
    )
    return HTML(df_.to_html(index=False))

In [3]:
visualize_performance_table('data/minRepeatLength=0')

tool,calls,TP,FN,FP,# real events,# calls
manta,all,1086,3414,1018,4500,2104
trfermikit,all,1746,2754,8052,4500,9798
trfermikit,unitigSupport,1565,2935,2542,4500,4107
trfermikit,unitigSupport.thinned,1458,3042,1688,4500,3146


In [4]:
visualize_performance_table('data/minRepeatLength=50')

tool,calls,TP,FN,FP,# real events,# calls
manta,all,1091,3461,1021,4552,2112
trfermikit,all,1774,2778,7571,4552,9345
trfermikit,unitigSupport,1589,2963,2576,4552,4165
trfermikit,unitigSupport.thinned,1477,3075,1694,4552,3171


In [5]:
visualize_performance_table('data/minRepeatLength=100')

tool,calls,TP,FN,FP,# real events,# calls
manta,all,1077,3501,1020,4578,2097
trfermikit,all,1786,2792,6884,4578,8670
trfermikit,unitigSupport,1590,2988,2613,4578,4203
trfermikit,unitigSupport.thinned,1478,3100,1722,4578,3200
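
The TP, FN and FP columns can be condensed into precision and recall, which makes the call sets easier to compare. The sketch below is not part of the original analysis: the helper name `precision_recall` is hypothetical, and the counts are copied by hand from the `minRepeatLength=100` table above.

```python
# Counts (tool, calls, TP, FN, FP) copied from the minRepeatLength=100 table above.
ROWS = [
    ("manta", "all", 1077, 3501, 1020),
    ("trfermikit", "all", 1786, 2792, 6884),
    ("trfermikit", "unitigSupport", 1590, 2988, 2613),
    ("trfermikit", "unitigSupport.thinned", 1478, 3100, 1722),
]

def precision_recall(tp, fn, fp):
    """Return (precision, recall) computed from TP, FN and FP counts."""
    precision = tp / (tp + fp)  # fraction of calls that match a real event
    recall = tp / (tp + fn)     # fraction of real events that were called
    return precision, recall

for tool, calls, tp, fn, fp in ROWS:
    precision, recall = precision_recall(tp, fn, fp)
    print(f"{tool:<11} {calls:<22} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, the thinned `trfermikit` call set has a recall of about 0.32 against about 0.24 for `manta`, at a somewhat lower precision (about 0.46 against about 0.51).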


Notice that the number of real events and the number of calls, both reported by `truvari`, do not always increase as the constraint on the interrogated regions is relaxed. This appears to be a `truvari` artifact, because the number of real events and the number of calls in the VCFs supplied to `truvari` *do* increase as the constraint is relaxed, as expected:

In [6]:
! bash line_counts.2.sh


minRepeatLength=0
--------------------------------
# manta calls: 2349
# trfermikit calls: 3462
# pacbio calls: 4881
# regions: 402343

minRepeatLength=50
--------------------------------
# manta calls: 2282
# trfermikit calls: 3385
# pacbio calls: 4806
# regions: 183852

minRepeatLength=100
--------------------------------
# manta calls: 2178
# trfermikit calls: 3308
# pacbio calls: 4711
# regions: 89031
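
The contents of `line_counts.2.sh` are not shown in this notebook. As a rough illustration of how per-VCF call counts like those above could be obtained, the sketch below counts the data records in a VCF file. The function name `count_vcf_records` is hypothetical, and the assumption that each call occupies exactly one non-header line may not match what the script actually does.

```python
import gzip

def count_vcf_records(path):
    """Count data records in a VCF: non-empty lines that do not start with '#'.

    Assumption: one call per record line. Plain-text and gzip-compressed
    (.gz) files are both accepted.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))
```

Calling `count_vcf_records` on each VCF supplied to `truvari` would give one number per call set, comparable across the three `minRepeatLength` settings.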


These data show that most of the pacbio DELs (>50bp) that lie in tandem repeats lie in repeats larger than 100bp (4711 of 4881 calls, about 97%). However, tandem-repeat regions larger than 100bp are far less numerous (89031 versus 402343 regions), which makes the runtime of `trfermikit` much shorter. Our experiments show that most of the benefit of `trfermikit` can be obtained in about 2 hours (the average time to run the software for tandem repeats larger than 100bp on a 70X sample).
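
The proportions behind this comparison can be recomputed from the line counts reported earlier. This is a minimal sketch: the counts are copied by hand from the `line_counts.2.sh` output, and the helper name `fraction_retained` is hypothetical.

```python
# Counts copied from the line_counts.2.sh output, keyed by minRepeatLength.
COUNTS = {
    0: {"pacbio": 4881, "regions": 402343},
    50: {"pacbio": 4806, "regions": 183852},
    100: {"pacbio": 4711, "regions": 89031},
}

def fraction_retained(counts, key, threshold, baseline=0):
    """Fraction of `key` items left at `threshold`, relative to `baseline`."""
    return counts[threshold][key] / counts[baseline][key]

for threshold in (50, 100):
    calls = fraction_retained(COUNTS, "pacbio", threshold)
    regions = fraction_retained(COUNTS, "regions", threshold)
    print(f"minRepeatLength={threshold}: "
          f"{calls:.1%} of pacbio calls, {regions:.1%} of regions")
```

Comparing the two fractions at each threshold shows how much of the search space is removed relative to how many pacbio calls are lost.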