# Simulation

The goal of this notebook is to simulate the index mis-assignment encountered in sequencing, and validate the formulae used in the main program against the simulation truth.

In [3]:
indexes = [
    ("A","1"), ("B","1"), ("C", "1"),
    ("A","2"),            ("C", "2"),
    ("A","3"), ("B","3"), ("C", "3"),
    ("A","4"), ("B","4"),
];

### True number of reads
This code generates the true number of reads for each sample (index pair), drawn from a normal distribution.

In [26]:
sample_read_mean = 1e5                # Mean number of reads per sample
sample_read_sd = sample_read_mean*0.5 # Standard deviation of reads per sample (normal dist.)

true_sample_reads = Dict([index => (randn()*sample_read_sd + sample_read_mean) for index = indexes])

Dict{Any,Any} with 10 entries:
  ("A","3") => 142167.72414019416
  ("C","3") => 24622.823125911833
  ("A","1") => 30240.368727347202
  ("C","1") => 172974.40833888104
  ("B","3") => 120788.0926044527
  ("A","4") => 114377.5565550738
  ("B","4") => 32440.83441178038
  ("B","1") => 197304.94797745062
  ("C","2") => 12905.836513986796
  ("A","2") => 125474.84071311245
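Because the standard deviation is half the mean, a normal draw falls below zero roughly 2 % of the time (two standard deviations below the mean). The run shown above happened to produce only positive values. A minimal sketch of one way to guard against negative counts, assuming it is acceptable to clamp at zero and round to whole reads; `draw_read_count` and the toy index list are hypothetical and not part of the notebook:

```julia
# Hypothetical helper: draw a non-negative, integer read count.
# Clamping at zero slightly distorts the normal distribution near zero.
function draw_read_count(mean_reads, sd_reads)
    raw = randn() * sd_reads + mean_reads
    return max(0, round(Int, raw))
end

# Illustrative index pairs only.
toy_indexes = [("A", "1"), ("B", "1"), ("A", "2")]
toy_reads = Dict(index => draw_read_count(1e5, 0.5e5) for index in toy_indexes)
```

Integer counts would also make the later per-read loop range explicit, at the cost of a slightly different distribution.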

### Simulate number of reads
The code below generates read counts for each index combination, given a set of mis-assignment probabilities. The simulation assumes that mis-assignment of index 1 and mis-assignment of index 2 are independent events. Wright and Vetsigian found that double mis-assignment occurs more often than independence would predict, but its rate is still very low, negligible for practical purposes:

| Error type         | Rate     |
|--------------------|----------|
| Incorrect i5       | 0.0604 % |
| Incorrect i7       | 0.0955 % |
| Incorrect sequence | 0.0872 % |
| Multiple incorrect | 0.0003 % |

It is also notable that the i7 and i5 mis-assignment rates (the first and second index, respectively) differ markedly.

We cannot detect sequence mis-assignments at all using demultiplexing data, and most double mis-assignments will not be detectable either.
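As a rough check on the independence assumption, the expected rate of both indexes being wrong can be computed from the single-index rates in the table above and compared with the reported "Multiple incorrect" rate. This sketch assumes that row corresponds mainly to simultaneous i5 and i7 errors, which the table itself does not state:

```julia
# Rates from the table above, converted from percent to fractions.
p_i5 = 0.0604 / 100
p_i7 = 0.0955 / 100
p_multiple_observed = 0.0003 / 100

# Expected rate of both indexes being wrong if the two events are independent.
p_both_independent = p_i5 * p_i7
ratio = p_multiple_observed / p_both_independent

println("Expected under independence: ", p_both_independent * 100, " %")
println("Observed / expected ratio:   ", ratio)
```

The product is about 0.00006 %, so the reported rate is roughly five times higher than independence predicts, while remaining two orders of magnitude below either single-index rate.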

### Simulation parameters (part 2)

In [37]:
index1_misassign_rate = 0.0955 / 100.0
index2_misassign_rate = 0.0604 / 100.0
## sequence_misassign_rate = 0.0872 / 100.0 ; #Not simulated

0.000604

### Possible indexes in the soup
The following dictionaries hold the read multiplicity of each single index in the data. That is, if a sample has a large fraction of the reads, one expects the "free" indexes available for mis-assignment to be correspondingly more abundant. Whether this is actually a valid model depends on the mechanism of mis-assignment.

In [28]:
index1_mult = Dict()
index2_mult = Dict()

for ((index1, index2), num_reads) in true_sample_reads
    index1_mult[index1] = get(index1_mult, index1, 0) + num_reads
    index2_mult[index2] = get(index2_mult, index2, 0) + num_reads
end
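Both dictionaries partition the same set of reads, so the values of each should sum to the total number of reads. This matters later, because a single normalisation constant is used when sampling from either dictionary. A small self-contained check, using made-up toy counts rather than the notebook's data (all `toy_` names are hypothetical):

```julia
# Toy read counts per (index1, index2) pair; the numbers are illustrative only.
toy_reads = Dict(("A", "1") => 100.0, ("B", "1") => 50.0, ("A", "2") => 25.0)

toy_index1_mult = Dict{String,Float64}()
toy_index2_mult = Dict{String,Float64}()
for ((index1, index2), num_reads) in toy_reads
    toy_index1_mult[index1] = get(toy_index1_mult, index1, 0.0) + num_reads
    toy_index2_mult[index2] = get(toy_index2_mult, index2, 0.0) + num_reads
end

# Each per-index dictionary should account for every read exactly once.
total = sum(values(toy_reads))
@assert sum(values(toy_index1_mult)) ≈ total
@assert sum(values(toy_index2_mult)) ≈ total
```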

### Simulated number of reads per index sequence
This is the main simulation loop, which generates read counts per index combination. It is somewhat inefficient and could be optimised to avoid looping over every read.

In [50]:
println("Index 1 miss rate: ", index1_misassign_rate)
println("Index 2 miss rate: ", index2_misassign_rate)

Index 1 miss rate: 0.000955
Index 2 miss rate: 0.000604


In [67]:
read_norm = sum([reads for (seq, reads) = true_sample_reads])

973297.4331081909

In [75]:
sim_index_reads = Dict()

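# Pick a key from read_dict with probability proportional to its value.
# norm must equal the sum of all values in read_dict.
function weightedrandom(norm, read_dict)
    value = rand() * norm
    for (seq, num) in read_dict
        value -= num
        if value <= 0
            return seq
        end
    end
    # Only reached if norm exceeds the sum of the values (or through rounding error)
    throw(ErrorException("weightedrandom: no key selected; check that norm matches the sum of the values"))
end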

for ((index1, index2), num_reads) in true_sample_reads
    # num_reads is a Float64, so this range covers floor(num_reads) reads
    for i = 1:num_reads
        if rand() < index1_misassign_rate
            index1var = weightedrandom(read_norm, index1_mult)
        else
            index1var = index1
        end
        if rand() < index2_misassign_rate
            index2var = weightedrandom(read_norm, index2_mult)
        else
            index2var = index2
        end
        index = (index1var, index2var)
        index_reads = get(sim_index_reads, index, 0)
        sim_index_reads[index] = index_reads + 1
    end
end
sim_index_reads

Dict{Any,Any} with 12 entries:
  ("C","1") => 172835
  ("C","4") => 43
  ("A","2") => 125375
  ("A","3") => 142105
  ("C","3") => 24684
  ("B","2") => 84
  ("A","1") => 30465
  ("B","3") => 120757
  ("A","4") => 114311
  ("B","4") => 32477
  ("B","1") => 197217
  ("C","2") => 12938
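The main loop above tests every read individually, although only about one read in a thousand is mis-assigned. One possible way to avoid that, sketched here for a single index only, is to jump directly from one mis-assignment to the next: with independent reads, the number of correctly assigned reads between two mis-assignments follows a geometric distribution. `count_misassigned` is a hypothetical helper, not part of the notebook, and it does not handle the two indexes jointly:

```julia
# Hypothetical helper: number of mis-assigned reads among n reads, each
# mis-assigned independently with probability p. Instead of testing every
# read, it skips ahead by geometrically distributed gaps.
function count_misassigned(n::Integer, p::Real)
    p <= 0 && return 0
    p >= 1 && return Int(n)
    n_events = 0
    pos = 0
    while true
        # Number of correctly assigned reads before the next mis-assignment.
        gap = floor(Int, log(1 - rand()) / log1p(-p))
        pos += gap + 1
        pos > n && return n_events
        n_events += 1
    end
end

# Example: mis-assigned index 1 reads for a sample of 100 000 reads.
n_wrong_index1 = count_misassigned(100_000, 0.000955)
```

The mis-assigned reads would still need to be distributed over target indexes, for example with `weightedrandom`, and reads with both indexes mis-assigned would need separate treatment.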

#### Test case for weightedrandom
Does it reproduce the distribution?

In [73]:

println("Original index1 multiplicity (fraction): ")
for (seq, num) = index1_mult
    println(seq, ": ", num / read_norm)
end
# Generate weighted random, see if we get the same
wr_num = Dict()
N = 100_000
for i = 1:N
    seq = weightedrandom(read_norm, index1_mult)
    wr_num[seq] = get(wr_num, seq, 0) + 1
end

println("Randomised index1 multiplicity (fraction): ")
for (seq, num) = wr_num
    println(seq, ": ", num / N)
end

Original index1 multiplicity (fraction): 
B: 0.360150826530248
A: 0.42357092098680293
C: 0.2162782524829492
Randomised index1 multiplicity (fraction): 
B: 0.36031
A: 0.4239
C: 0.2158
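The comparison above is done by eye. A sketch of an automated version, using made-up toy weights and a stand-alone variant of the sampler (`weighted_pick` is hypothetical; unlike `weightedrandom` it computes the norm itself and falls back to the last key if rounding leaves a small remainder):

```julia
# Stand-alone variant of the sampler, using the sum of the weights as the norm.
function weighted_pick(weights::Dict)
    value = rand() * sum(values(weights))
    chosen = first(keys(weights))
    for (key, weight) in weights
        chosen = key
        value -= weight
        value <= 0 && break
    end
    return chosen
end

# Illustrative weights only.
toy_weights = Dict("A" => 0.5, "B" => 0.3, "C" => 0.2)
draws = 100_000
counts = Dict(key => 0 for key in keys(toy_weights))
for _ in 1:draws
    counts[weighted_pick(toy_weights)] += 1
end

# With 100 000 draws the standard error of each fraction is below 0.002,
# so a tolerance of 0.01 should essentially never fail.
for (key, weight) in toy_weights
    @assert abs(counts[key] / draws - weight) < 0.01
end
```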


### Setup for analysis

In [7]:
i1s = Set()
sample_single_indexes = map(Set, zip(indexes...))

2-element Array{Set{String},1}:
 Set(String["B","A","C"])    
 Set(String["4","1","2","3"])
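One possible use of these sets, offered as an assumption about the later analysis rather than a description of it: index pairs that combine a known index 1 with a known index 2 but belong to no sample can only receive reads through mis-assignment. The sketch below lists those pairs; the names `index1_set`, `index2_set`, `used` and `unused` are hypothetical:

```julia
indexes = [
    ("A","1"), ("B","1"), ("C","1"),
    ("A","2"),            ("C","2"),
    ("A","3"), ("B","3"), ("C","3"),
    ("A","4"), ("B","4"),
]

# Sets of the individual index 1 and index 2 values, as in the cell above.
index1_set, index2_set = map(Set, zip(indexes...))

# Pairs that combine known single indexes but correspond to no sample.
used = Set(indexes)
unused = [(i1, i2) for i1 in index1_set for i2 in index2_set if !((i1, i2) in used)]
```

For this layout the unused pairs are `("B","2")` and `("C","4")`, the two extra entries that appear in the simulated read counts.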