In [1]:
from simulations import dropletSimulation

In [2]:
import numpy as np

In [3]:
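def random_frequency_vector(length):
    # Random non-negative weights (integers in [0, 370)), normalized to sum to 1.
    unnormalized = np.random.randint(low=0, high=370, size=length)
    return unnormalized / unnormalized.sum()

def random_copy_numbers(length):
    # Random integer copy numbers in [1, 16), i.e. 1 through 15.
    return np.random.randint(low=1, high=16, size=length)

# Interaction coefficients uniform in [-1, 1); base rates uniform in [0, 1).
A = 2*(np.random.random((100,100)) - 0.5)
beta = np.random.random(100)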

simulation = dropletSimulation(
    number_species=100, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(100), frequency_vector=random_frequency_vector(100),
    glv_interaction_coefficients=A, glv_baserate_coefficients=beta, noise_scale=8, seed=42,
    timestep=0.01, batch_window=2,
)

In [4]:
%%time
simulation.run_simulation()

CPU times: user 909 ms, sys: 380 ms, total: 1.29 s
Wall time: 1.5 s


In [5]:
simulation2 = dropletSimulation(
    number_species=100, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(100), frequency_vector=random_frequency_vector(100),
    glv_interaction_coefficients=A, glv_baserate_coefficients=beta, noise_scale=8,
    timestep=0.001, batch_window=20,
)

In [6]:
%%time
simulation2.run_simulation()

CPU times: user 868 ms, sys: 378 ms, total: 1.25 s
Wall time: 3.97 s


In [7]:
simulation3 = dropletSimulation(
    number_species=100, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(100), frequency_vector=random_frequency_vector(100),
    glv_interaction_coefficients=A, glv_baserate_coefficients=beta, noise_scale=8,
    timestep=0.0001, batch_window=200,
)

In [8]:
%%time
simulation3.run_simulation()

CPU times: user 897 ms, sys: 399 ms, total: 1.3 s
Wall time: 33.7 s


In [9]:
np.max(simulation.cells.counts)

624168203877.475

In [10]:
np.max(simulation2.cells.counts)

23712.272496690533

In [11]:
np.max(simulation3.cells.counts)

10000.0

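Dang, OK, nice: the carrying capacity limit (10,000) works a lot better with smaller time steps. The max cell count goes from ~6e11 at timestep=0.01 to ~2.4e4 at 0.001 to exactly 10000 at 0.0001. Yay!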
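
As a rough illustration of why a smaller time step respects the cap better, here is a toy sketch using explicit Euler on plain logistic growth. The growth rate, starting count, and total time are made-up values, and the update rule is an assumption, not necessarily what `dropletSimulation` does internally; only the capacity of 10,000 is taken from the outputs above. The real simulation also has noise and interaction terms, so its blow-up at large time steps is far bigger than the mild overshoot this toy shows.

```python
# Illustration only: plain logistic growth, NOT the dropletSimulation update rule.
def max_count_euler(timestep, total_time=2.0, rate=250.0,
                    capacity=10_000.0, start=100.0):
    """Integrate dx/dt = rate * x * (1 - x / capacity) with explicit Euler
    and return the largest value reached along the trajectory."""
    x = start
    largest = x
    for _ in range(round(total_time / timestep)):
        x += timestep * rate * x * (1.0 - x / capacity)
        largest = max(largest, x)
    return largest

# Hypothetical parameters: rate * timestep is 2.5, 0.25, and 0.025 here.
for dt in (0.01, 0.001, 0.0001):
    print(f"timestep={dt}: max count = {max_count_euler(dt):.3f}")
```

When `rate * timestep` is below 1, each Euler step moves the count toward the capacity without crossing it; once the product gets large, the steps overshoot and the trajectory bounces around above and below the cap.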

In [12]:
np.max(simulation3.reads.counts)

15412.548733897793

In [13]:
simulation4 = dropletSimulation(
    number_species=10, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(10), frequency_vector=random_frequency_vector(10),
    glv_interaction_coefficients=A[:10, :10], glv_baserate_coefficients=beta[:10], noise_scale=8,
    timestep=0.0001, batch_window=200,
)

In [14]:
%%time
simulation4.run_simulation()

CPU times: user 818 ms, sys: 328 ms, total: 1.15 s
Wall time: 32.1 s


In [15]:
np.max(simulation4.cells.counts)

10000.00000004811

In [16]:
np.max(simulation4.reads.counts)

13234.261582990615

In [17]:
simulation5 = dropletSimulation(
    number_species=1000, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(1000), frequency_vector=random_frequency_vector(1000),
    glv_interaction_coefficients=2*(np.random.random((1000,1000)) - 0.5),
    glv_baserate_coefficients=np.random.random(1000),
    noise_scale=8, timestep=0.0001, batch_window=200,
)

In [18]:
%%time
simulation5.run_simulation()

CPU times: user 1.88 s, sys: 1.56 s, total: 3.44 s
Wall time: 32.1 s


In [19]:
np.max(simulation5.cells.counts)

10000.0

In [20]:
np.max(simulation5.reads.counts)

12272.134271466313

HOLY COW! The run time is actually nearly independent of the number of strains/species now (about 32 s wall time for 10, 100, and 1000 species)! Yay!

I might not have to use any more dumb hacks! Yay!

Oh well, I forgot that the coefficients will still be more expensive to compute though (they are pairwise, so there are one million of them for 1000 species versus 100 for 10 species, but anyway).

In [21]:
simulation6 = dropletSimulation(
    number_species=1000, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(1000), frequency_vector=random_frequency_vector(1000),
    glv_interaction_coefficients=2*(np.random.random((1000,1000)) - 0.5),
    glv_baserate_coefficients=np.random.random(1000),
    noise_scale=8, timestep=0.0002, batch_window=100,
)

In [22]:
%%time
simulation6.run_simulation()

CPU times: user 1.95 s, sys: 1.55 s, total: 3.5 s
Wall time: 17.6 s


In [23]:
np.max(simulation6.cells.counts)

10425.320558545247

In [24]:
np.max(simulation6.reads.counts)

60626.140328808004

Less accurate than the other ones (max cell count ~10425 instead of 10000), but probably not too bad, especially since it's about twice as fast (17.6 s versus 32.1 s wall time).

In [25]:
simulation7 = dropletSimulation(
    number_species=1000, number_droplets=1000, number_batches=5,
    copy_numbers=random_copy_numbers(1000), frequency_vector=random_frequency_vector(1000),
    glv_interaction_coefficients=2*(np.random.random((1000,1000)) - 0.5),
    glv_baserate_coefficients=np.random.random(1000),
    noise_scale=8, timestep=0.0005, batch_window=20,
)

In [26]:
%%time
simulation7.run_simulation()

CPU times: user 1.84 s, sys: 1.47 s, total: 3.31 s
Wall time: 6.16 s


In [27]:
np.max(simulation7.cells.counts)

7899.869247461537

In [28]:
np.max(simulation7.reads.counts)

29438.231687435455

In [29]:
%%time
simulation3.group_droplets()

CPU times: user 868 ms, sys: 32.2 ms, total: 900 ms
Wall time: 903 ms


In [30]:
%%time
simulation4.group_droplets()

CPU times: user 14.3 ms, sys: 1.85 ms, total: 16.1 ms
Wall time: 14.8 ms


In [31]:
%%time
simulation5.group_droplets()

CPU times: user 1min 19s, sys: 2.96 s, total: 1min 22s
Wall time: 1min 22s


OK yeah, this is about quadratic scaling in the number of species (14.8 ms, 903 ms, and 1 min 22 s for 10, 100, and 1000 species), and that's just for the droplet grouping, not even the actual coefficient calculations...

So maybe keeping the number of strains small would still be good? Maybe, but we could plausibly still do 100, or at least 20 or so.
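
A quick way to sanity-check the "about quadratic" impression is to compute the exponent implied by each pair of consecutive timings. This is only a sketch: the wall times are copied by hand from the `group_droplets()` cells above (all at 1,000 droplets per batch), each is a single run, and `scaling_exponent` is a helper made up for this note.

```python
import math

# Wall times in seconds for group_droplets(), keyed by number of species.
# Copied from the %%time outputs above: 14.8 ms, 903 ms, 1min 22s.
wall_time_by_species = {10: 0.0148, 100: 0.903, 1000: 82.0}

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Exponent k in t ~ n**k implied by two (size, time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

sizes = sorted(wall_time_by_species)
pairs = list(zip(sizes, sizes[1:]))
exponents = [
    scaling_exponent(a, wall_time_by_species[a], b, wall_time_by_species[b])
    for a, b in pairs
]
for (a, b), k in zip(pairs, exponents):
    print(f"{a} -> {b} species: exponent ~ {k:.2f}")
```

An exponent near 2 for both pairs would support the quadratic reading; with only three single-run data points this is a rough check, not a proper benchmark.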

In [32]:
%%time
simulation6.group_droplets()

CPU times: user 1min 24s, sys: 22.2 s, total: 1min 46s
Wall time: 1min 47s


In [33]:
%%time
simulation7.group_droplets()

CPU times: user 1min 19s, sys: 3.61 s, total: 1min 23s
Wall time: 1min 23s


Keep in mind that although this doesn't look that bad for 1000 species, it's also only 1,000 droplets per batch. It would get worse with more droplets per batch (I'm not sure about the exact scaling, but last time I checked it was super-linear, so it's not great). Maybe it wouldn't be 8 hours this time, but still.

In [34]:
simulation8 = dropletSimulation(
    number_species=1000, number_droplets=10000, number_batches=5,
    copy_numbers=random_copy_numbers(1000), frequency_vector=random_frequency_vector(1000),
    glv_interaction_coefficients=2*(np.random.random((1000,1000)) - 0.5),
    glv_baserate_coefficients=np.random.random(1000),
    noise_scale=8, timestep=0.0001, batch_window=200,
)

In [35]:
%%time
simulation8.run_simulation()

CPU times: user 17.5 s, sys: 29.3 s, total: 46.8 s
Wall time: 5min 44s


Notice this is 10,000 droplets per batch this time.

So will the following only take 10 times longer to run (which would be about 15 minutes)?

Or will it be even worse?

In [36]:
%%time
simulation8.group_droplets()

CPU times: user 17min 9s, sys: 12.3 s, total: 17min 21s
Wall time: 17min 24s


And keep in mind too that the cost is not just in the droplet grouping: you then actually have to go and compute the log ratio coefficients (and now also other coefficients). Grouping the droplets once amortizes some time across those computations, but if the grouping itself is really slow (like in this case), each log ratio variant computation could be slow and expensive too, so...

In [37]:
%%time
simulation8.reads.get_fitness_coefficients()

CPU times: user 21.6 s, sys: 4.9 s, total: 26.5 s
Wall time: 27 s


array([[[ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ],
        [        nan,         nan,         nan,         nan,
                 nan],
        [        nan,         nan,         nan,         nan,
                 nan],
        ...,
        [        nan,         nan,         nan,         nan,
                 nan],
        [        nan,         nan,         nan,         nan,
                 nan],
        [        nan,         nan,         nan,         nan,
                 nan]],

       [[        nan,         nan,         nan,         nan,
                 nan],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ],
        [        nan,         nan,         nan,         nan,
                 nan],
        ...,
        [        nan,         nan,         nan, -2.39587673,
                 nan],
        [ 0.50572676,         nan,         nan,         nan,
                 nan],
        [        nan,         nan,         nan, 

Wow, dang, that's actually not bad at all, especially considering that we are talking about _one million_ log ratio coefficients (and then five times over), like wow.

About 1/39 the compute time of the droplet grouping (27 s versus 17 min 24 s wall time), impressive.

Let's also check that the accuracy of these is what we expect from using such a small time step:

In [38]:
np.max(simulation8.cells.counts)

10000.000000000005

### 100,000 droplets

In [40]:
simulation9 = dropletSimulation(
    number_species=11, number_droplets=100000, number_batches=5,
    copy_numbers=random_copy_numbers(11), frequency_vector=random_frequency_vector(11),
    glv_interaction_coefficients=2*(np.random.random((11,11)) - 0.5),
    glv_baserate_coefficients=np.random.random(11),
    noise_scale=8, timestep=0.0001, batch_window=200,
)

In [41]:
%%time
simulation9.run_simulation()

CPU times: user 1min 4s, sys: 11.2 s, total: 1min 15s
Wall time: 51min 11s


Let's also check that the accuracy of these is what we expect from using such a small time step:

In [42]:
np.max(simulation9.cells.counts)

10000.874344376047

Increase number of droplets, decrease number of strains...

In [43]:
%%time
simulation9.group_droplets()

CPU times: user 1.08 s, sys: 66.3 ms, total: 1.15 s
Wall time: 1.17 s


In [44]:
%%time
simulation9.reads.get_fitness_coefficients()

CPU times: user 206 ms, sys: 19 ms, total: 225 ms
Wall time: 223 ms


array([[[ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ],
        [-0.01825097, -0.0103619 , -0.03298558, -0.15337361,
         -0.17263177],
        [-0.03135004,  0.01516   ,  0.03386091, -0.17329343,
         -0.12162302],
        [ 0.00517588,  0.02271118,  0.02730199,  0.09451613,
          0.15043179],
        [ 0.0232786 ,  0.04171229,  0.07555189,  0.0844107 ,
          0.1625998 ],
        [-0.06693257, -0.03944812,  0.16756002,  0.33471494,
         -0.03309491],
        [-0.00179352, -0.02499574, -0.0277908 , -0.05063811,
         -0.00851819],
        [-0.00135798, -0.02903899, -0.13542305, -0.01616914,
         -0.01249851],
        [ 0.0093042 , -0.02840098,  0.0815181 ,  0.15453214,
          0.16477393],
        [-0.00250099,  0.05505745,  0.0502911 ,  0.0260532 ,
          0.10427131],
        [ 0.004338  , -0.03608141, -0.00548705,  0.03409703,
         -0.11218546]],

       [[-0.00172497,  0.0639816 ,  0.01531354,  0.05848579,
          0.1

Now let's stop compromising on the number of strains:

In [45]:
simulation10 = dropletSimulation(
    number_species=1000, number_droplets=100000, number_batches=5,
    copy_numbers=random_copy_numbers(1000), frequency_vector=random_frequency_vector(1000),
    glv_interaction_coefficients=2*(np.random.random((1000,1000)) - 0.5),
    glv_baserate_coefficients=np.random.random(1000),
    noise_scale=8, timestep=0.0001, batch_window=200,
)

Computer keeps on dying before/instead of running these, which seems like a bad sign...

**Update:** apparently it's because my computer keeps running out of memory. Does this mean I wrote bad code that has a major memory leak? Uh oh, I'm not sure.

**2nd Update:** Well, the code may be bad, but with 1000 species, 100,000 droplets, and 5 batches we have 500 million entries per numpy array, and there are two arrays that size, so 1 billion entries. Each entry is a float (float64 by default in numpy), which takes 8 bytes, so that works out to about 8 GB of RAM before counting any temporary copies. I vaguely remember that last time I was only doing 10,000 droplets with 1000 species, which would need 1/10 the memory, so it probably isn't too surprising that this doesn't run.
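
The back-of-the-envelope memory estimate can be written out as a few lines of arithmetic. This is a sketch under assumptions: that there are exactly two count arrays of the same size (as described in the note above), and that every entry is an 8-byte float; the variable names are made up for this note.

```python
import struct

number_species, number_droplets, number_batches = 1000, 100_000, 5
number_arrays = 2                       # assumption: two count arrays of this size
bytes_per_entry = struct.calcsize("d")  # C double, same width as numpy float64

entries_per_array = number_species * number_droplets * number_batches
total_bytes = number_arrays * entries_per_array * bytes_per_entry

print(f"{entries_per_array:,} entries per array")
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.2f} GiB) "
      f"for {number_arrays} arrays")
```

This only counts the arrays that persist; any intermediate arrays created during a simulation step would come on top of it.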

In [46]:
%%time
simulation10.run_simulation()

CPU times: user 2min, sys: 1min 32s, total: 3min 33s
Wall time: 52min 35s


Let's also check that the accuracy of these is what we expect from using such a small time step:

In [47]:
np.max(simulation10.cells.counts)

10000.457629454848

Is the following only ~10 times as long as the previous `group_droplets`? Or ~100?

(This killed my kernel at least once, so I'm not sure whether it will actually run or not...)
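
For a rough guess at the answer, one can extrapolate from the two 1000-species `group_droplets()` timings already recorded above (simulation5 at 1,000 droplets and simulation8 at 10,000 droplets). This is a two-point power-law fit and nothing more: it assumes the same exponent holds for the next factor of 10, and it ignores memory pressure, which could make the real run much slower or kill it outright.

```python
import math

# group_droplets() wall times in seconds at 1000 species, from the cells above.
t_1k_droplets = 82.0      # simulation5: 1,000 droplets per batch (1min 22s)
t_10k_droplets = 1044.0   # simulation8: 10,000 droplets per batch (17min 24s)

# Exponent k in time ~ droplets**k implied by the two measurements.
exponent = math.log(t_10k_droplets / t_1k_droplets) / math.log(10)

# Apply the same factor once more to go from 10,000 to 100,000 droplets.
predicted_100k = t_10k_droplets * 10 ** exponent

print(f"exponent in number of droplets ~ {exponent:.2f}")
print(f"extrapolated time for 100,000 droplets ~ {predicted_100k / 3600:.1f} hours")
```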

In [None]:
%%time
simulation10.group_droplets()

In [None]:
%%time
simulation10.reads.get_fitness_coefficients()