# Simulation

The goal of this notebook is to simulate the index mis-assignment encountered in sequencing, and validate the formulae used in the main program against the simulation truth.

In [9]:
indexes = [
    ("A","1"), ("B","1"), ("C", "1"),
    ("A","2"),            ("C", "2"),
    ("A","3"), ("B","3"), ("C", "3"),
    ("A","4"), ("B","4"),
];

### True number of reads
This code generates a set of true read counts, one per sample (index pair), drawn from a normal distribution.

In [10]:
sample_read_mean = 10e6               # Mean number of reads per sample
sample_read_sd = sample_read_mean*0.5 # Standard deviation of reads per sample (normal dist.)

true_sample_reads = Dict([index => (randn()*sample_read_sd + sample_read_mean) for index = indexes]);
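
With a standard deviation of half the mean, a normal draw falls below zero whenever it lands more than two standard deviations under the mean, and the draws are not whole numbers. The following is a minimal sketch of one way to guard against that. The helper `draw_read_count` and the `toy_` variables are assumptions for illustration and are not part of the original notebook.

```julia
# Hypothetical helper (not in the original notebook):
# draw a non-negative, whole-number read count from a normal distribution.
function draw_read_count(mean_reads, sd_reads)
    raw = randn() * sd_reads + mean_reads   # same normal draw as in the cell above
    return max(0, round(Int, raw))          # clamp negatives to zero, round to whole reads
end

# Toy index pairs, used here only to keep the example self-contained
toy_indexes = [("A", "1"), ("B", "1"), ("A", "2")]
toy_reads = Dict(index => draw_read_count(10e6, 5e6) for index in toy_indexes)
```

Clamping at zero slightly distorts the distribution, so a strictly positive distribution may be a better fit if the lower tail matters.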

### Simulate number of reads
The code below generates read multiplicities for each index combination, as expected for a given set of mis-assignment probabilities. The simulation logic assumes that the index1 and index2 mis-assignment events are independent. Wright and Vetsigian found a higher double mis-assignment probability than independence would predict, but the rate is still very low and negligible for practical purposes:

| Mis-assignment type | Rate     |
|---------------------|----------|
| Incorrect i5        | 0.0604 % |
| Incorrect i7        | 0.0955 % |
| Incorrect sequence  | 0.0872 % |
| Multiple incorrect  | 0.0003 % |

The i7 and i5 (respectively, the first and second index) mis-assignment rates also differ noticeably: 0.0955 % versus 0.0604 %.

Sequence mis-assignments cannot be detected at all from demultiplexing data, and most double mis-assignments will not be detectable either.
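
The comparison with independence can be checked with a short calculation. This sketch assumes that "Multiple incorrect" counts any two of the three events (i5, i7, sequence) occurring together, and it approximates the independent expectation by the sum of pairwise products. The variable names are illustrative only.

```julia
# Rates from the table above, converted from percent to probabilities
p_i5  = 0.0604 / 100
p_i7  = 0.0955 / 100
p_seq = 0.0872 / 100
p_multi_observed = 0.0003 / 100

# Assumption: "multiple incorrect" means at least two of the three events.
# To leading order, independence predicts the sum of the pairwise products.
p_multi_independent = p_i5 * p_i7 + p_i5 * p_seq + p_i7 * p_seq

ratio = p_multi_observed / p_multi_independent
println("expected under independence: ", p_multi_independent * 100, " %")
println("observed / expected: ", ratio)
```

Under these assumptions the independent expectation comes out near 0.0002 %, so the reported 0.0003 % is roughly one and a half times higher, yet still orders of magnitude below the single mis-assignment rates.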

### Simulation parameters (part 2)

In [23]:
index1_misassign_rate = 0.0955 / 100.0
index2_misassign_rate = 0.0604 / 100.0
sequence_misassign_rate = 0.0872 / 100.0 ;

### Possible indexes in the soup
The following dictionaries hold the read multiplicity of each single index in the data. That is, if a sample has a large fraction of the reads, one expects the "free" indexes available for mis-assignment to be proportionally more abundant as well. Whether this is actually a valid model depends on the mechanism of mis-assignment.

In [26]:
index1_mult = Dict()
index2_mult = Dict()
for ((index1, index2), num_reads) in true_sample_reads
    index1_mult[index1] = get(index1_mult, index1, 0) + num_reads
    index2_mult[index2] = get(index2_mult, index2, 0) + num_reads
end
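
If the free indexes are assumed to be available in proportion to their read multiplicity, the multiplicities can be turned into fractions of the pool. The helper `pool_fractions` and the toy values below are hypothetical and serve only to illustrate that assumption.

```julia
# Hypothetical helper: convert read multiplicities into fractions of the index pool.
function pool_fractions(mult)
    total = sum(values(mult))
    return Dict(index => n / total for (index, n) in mult)
end

# Toy multiplicities, not taken from the simulation
toy_mult = Dict("A" => 30.0, "B" => 10.0, "C" => 60.0)
fractions = pool_fractions(toy_mult)
```

Each fraction can then be read as the probability that a mis-assigned read picks up that particular index, which holds only if the proportional model is valid.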

### Simulated number of reads per index sequence
This is the main simulation loop, which generates read counts per sample.

In [27]:
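sim_index_reads = Dict()

for ((index1, index2), num_reads) in true_sample_reads
    if rand() < index1_misassign_rate
        # Swap index 1 for a different one chosen at random from the pool.
        # Note: the draw is made once per sample, so all of its reads move together.
        others = [i for i in keys(index1_mult) if i != index1]
        index1 = rand(others)
    end
    index = (index1, index2)
    # Accumulate the reads under the (possibly changed) index pair
    sim_index_reads[index] = get(sim_index_reads, index, 0) + num_reads
end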

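A per-read model may be closer to what the section title describes than a per-sample draw. The sketch below is one possible expected-value version for index 1 only, assuming that swapped reads are spread over the other index 1 sequences in proportion to their read multiplicity. The function `expected_index1_swaps` and the toy data are hypothetical, not the notebook's method, and the function needs at least two distinct index 1 values.

```julia
# Hypothetical helper: expected read counts per index pair after index-1 swapping.
# Assumes swapped reads land on other index-1 sequences in proportion to pool size.
function expected_index1_swaps(true_reads, rate)
    # Read multiplicity of each index-1 sequence
    pool = Dict{String,Float64}()
    for ((i1, i2), n) in true_reads
        pool[i1] = get(pool, i1, 0.0) + n
    end

    sim = Dict{Tuple{String,String},Float64}()
    for ((i1, i2), n) in true_reads
        # Reads that keep their original index 1
        sim[(i1, i2)] = get(sim, (i1, i2), 0.0) + n * (1 - rate)
        # Swapped reads are shared among the other index-1 sequences
        others_total = sum(w for (k, w) in pool if k != i1)
        for (k, w) in pool
            k == i1 && continue
            sim[(k, i2)] = get(sim, (k, i2), 0.0) + n * rate * w / others_total
        end
    end
    return sim
end

# Toy data: two samples, 1 % mis-assignment rate
toy_true = Dict(("A", "1") => 100.0, ("B", "2") => 300.0)
toy_sim = expected_index1_swaps(toy_true, 0.01)
```

The total number of reads is conserved, and index pairs that no sample uses receive a small expected count. A stochastic version would replace the expected values with binomial draws.
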
In [6]:
myset = setdiff(Set([1,2]), Set([1]))
first(myset)  # a Set is unordered and cannot be indexed; first() returns one of its elements

### Setup for analysis

In [7]:
i1s = Set()
sample_single_indexes = map(Set, zip(indexes...))

2-element Array{Set{String},1}:
 Set(String["B","A","C"])    
 Set(String["4","1","2","3"])