# 2022-01-17 Problems running probabilistic fits on ARC4

After trying to run full-scale probabilistic fits on ARC4, I have found that they are seeing nowhere near the 10x speedup over my laptop that I have seen for the deterministic fits, and possibly even a slowdown. 1000 parameterisations for `oVAoEAoAN` took 6 h 11 min on ARC4, i.e., about 22 s per parameterisation, whereas according to my 2022-01-05b notes I have seen about 10 s per parameterisation for these on my computer.

I now verified this again, running `do_5_probabilistic_fitting.py` for 8 parameterisations of `oVAoEAoAN` on my computer, which took 55 s (using parallelisation), i.e., about 7 s per parameterisation.
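
As a quick check of the arithmetic above, the per-parameterisation times are just wall-clock time divided by the number of parameterisations. The helper name `seconds_per_param` is made up for this note and is not part of the repository.

```python
def seconds_per_param(hours, minutes, seconds, n_params):
    """Wall-clock time per parameterisation, in seconds."""
    total_s = 3600 * hours + 60 * minutes + seconds
    return total_s / n_params

# ARC4: 1000 parameterisations of oVAoEAoAN in 6 h 11 min
arc4 = seconds_per_param(6, 11, 0, 1000)
# My computer: 8 parameterisations of oVAoEAoAN in 55 s
laptop = seconds_per_param(0, 0, 55, 8)

print(f"ARC4: {arc4:.1f} s, my computer: {laptop:.1f} s per parameterisation")
```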


## Testing some different things on my computer

(The tests below were made using temporary edits to `do_5_probabilistic_fitting.py`, not committed to the repository.)

The most obvious possible culprit that comes to mind is the random draws, which happen in the probabilistic fits but not in the deterministic ones - maybe they somehow become a bottleneck in the parallel implementation? To investigate this I am trying some runs with and without random draws on my computer, both in parallel and serially.

Running:
1. 8 parameterisations `oVAoEAoSNvoPF`
    * Serial
        * 3 min 16 s
        * 3 min 30 s
    * Parallel - a 2.55x speedup:
        * 1 min 18 s
        * 1 min 22 s
2. 8 parameterisations `oVAoEAoSNvoPF` but with zero noise magnitude for all parameterisations (turns off the noisy perception in the model, so no random draws)
    * Serial
        * 2 min 57 s
    * Parallel - a 2.60x speedup
        * 1 min 8 s
3. 8 parameterisations `oVA`
    * Serial 
        * 3 min 6 s
    * Parallel - a 2.70x speedup:
        * 1 min 9 s

So at least on my computer the random draws don't seem to add much runtime at all - the runtimes are quite similar between models with and without random draws, both in parallel and serial simulation mode. If there had been some form of serial bottleneck from the random draws, I would have expected a smaller speedup from parallelisation in experiment 1 above than from 2 or 3, but the speedups are quite similar.
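
The speedups quoted in the list above can be recomputed from the timings, here as mean serial time over mean parallel time. That choice of averaging is an assumption on my part, but it reproduces the quoted values to within rounding.

```python
from statistics import mean

# Wall-clock times in seconds, copied from the list above
runs = {
    "1. oVAoEAoSNvoPF": {"serial": [196, 210], "parallel": [78, 82]},
    "2. oVAoEAoSNvoPF, zero noise": {"serial": [177], "parallel": [68]},
    "3. oVA": {"serial": [186], "parallel": [69]},
}

def speedup(times):
    """Mean serial time divided by mean parallel time."""
    return mean(times["serial"]) / mean(times["parallel"])

speedups = {name: speedup(times) for name, times in runs.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```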

## Pre-drawing noise?

To see if it makes any difference, I am also now testing a modification of `sc_scenario_perception.Perception` that draws the full noise vector at construction time rather than one noise value per time step. (In this test I am only doing this for the draw from the unidimensional normal distribution, for the noisy observation of the other agent's position, and not for the draw from the two-dimensional normal distribution that generates a perception output from the `oPF` model. If there were any improvements I should still see them, though.)

Running:
1. 8 parameterisations `oVAoEAoSNvoPF`
    * Serial
        * 3 min 18 s
    * Parallel
        * 1 min 19 s
        * 1 min 29 s
        
So no speed improvement at all really over the previous implementation.
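
For reference, the difference between the two approaches is roughly as in the sketch below. These two classes are hypothetical stand-ins written for this note, not the actual `sc_scenario_perception.Perception` code, and they use the standard library `random` module rather than whatever the real implementation uses.

```python
import random

class PerStepNoise:
    """Draws one Gaussian noise value at each time step."""

    def __init__(self, sigma, n_steps, seed):
        # n_steps is unused here; kept so both classes share an interface
        self.sigma = sigma
        self.rng = random.Random(seed)

    def noise_at(self, i_step):
        return self.rng.gauss(0.0, self.sigma)

class PreDrawnNoise:
    """Draws the full noise vector once, at construction time."""

    def __init__(self, sigma, n_steps, seed):
        rng = random.Random(seed)
        self.noise = [rng.gauss(0.0, sigma) for _ in range(n_steps)]

    def noise_at(self, i_step):
        return self.noise[i_step]

n_steps = 100
per_step = PerStepNoise(sigma=0.5, n_steps=n_steps, seed=1)
pre_drawn = PreDrawnNoise(sigma=0.5, n_steps=n_steps, seed=1)
a = [per_step.noise_at(i) for i in range(n_steps)]
b = [pre_drawn.noise_at(i) for i in range(n_steps)]
print("Same noise sequence:", a == b)
```

With the same seed and the time steps visited in order, both variants consume the generator identically, so the only thing that changes is when the draws happen.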

There is still the possibility though that when I run 80 workers in parallel, all asking for random draws, this becomes a bottleneck for some reason on the ARC4 node. Hence:

## Setting up a test on ARC4

I am setting up a test `test_random_draws.py`, which is just a version of `do_5_probabilistic_fitting.py` that tests `oVAoEAoSNvoPF` with and without parallelisation, and with and without zeros throughout for the noise magnitude. I have also manually created copies of the repo on ARC4 with and without an alternative version of `sc_scenario_perception` that pre-draws noise. In all these runs I have now also modified `sc_scenario_perception` so that the `oPF` model no longer draws from the Kalman posterior, so where it says pre-drawn noise below, all of the noise is truly pre-drawn. (Edit: This was the intention, but I am realising now that for the noisy non-pre-drawn runs on ARC4 the `oPF` model did draw from the Kalman posterior, so those two rows below are not 100% identical across my computer and ARC4.)

Results (8 parameterisations on my computer, 125 on ARC4):

| Noise         | Parallel | My computer (8 parameterisations) | ARC4 (125 parameterisations) |
|---------------|----------|-----------------------------------|------------------------------|
| N             | N        | 2 min 11 s (16.4 s / param.)      | 52 min (25.0 s / param.)     |
| N             | Y        | 1 min 15 s (9.4 s / param.)       | 2 min 11 s (1.0 s / param.)  |
| Y             | N        | 3 min 7 s (23.4 s / param.)       | 1 h 6 min (31.7 s / param.)  |
| Y             | Y        | 1 min 17 s (9.6 s / param.)       | 2 min 44 s (1.3 s / param.)  |
| Y - pre-drawn | N        | 3 min 49 s (28.6 s / param.)      | 58 min (27.8 s / param.)     |
| Y - pre-drawn | Y        | 1 min 37 s (12.1 s / param.)      | 2 min 26 s (1.2 s / param.)  |

(I have saved the ARC4 logs for the runs above in the normal place. I have also saved `test_random_draws.py` and the alternative `sc_scenario_perception` implementation, as well as the corresponding ARC4 logs in `SCPaper/tests/test random draws`.)
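
To compare the two machines row by row, the table's wall-clock times can be converted to time per parameterisation and then to a ratio. This is only a sketch on the numbers in the table; the variable names are made up for this note.

```python
# (noise, parallel) -> (my computer seconds, ARC4 seconds), from the table
table = {
    ("N", "N"): (131, 52 * 60),
    ("N", "Y"): (75, 131),
    ("Y", "N"): (187, 66 * 60),
    ("Y", "Y"): (77, 164),
    ("Y - pre-drawn", "N"): (229, 58 * 60),
    ("Y - pre-drawn", "Y"): (97, 146),
}
N_LAPTOP, N_ARC4 = 8, 125  # parameterisations per run on each machine

ratios = {}
for key, (laptop_s, arc4_s) in table.items():
    laptop_pp = laptop_s / N_LAPTOP  # seconds per parameterisation
    arc4_pp = arc4_s / N_ARC4
    # Ratio above 1 means ARC4 was faster per parameterisation
    ratios[key] = laptop_pp / arc4_pp
    noise, parallel = key
    print(f"noise={noise}, parallel={parallel}: "
          f"{laptop_pp:.1f} s vs {arc4_pp:.1f} s, ratio {ratios[key]:.1f}x")
```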

Now, weirdly, I see the expected $\approx$10x speedup over my computer on ARC4 with parallelisation. So either the earlier slowness has been some kind of fluke, or the temporary version of `sc_scenario_perception` used here, which doesn't draw from the Kalman posterior, somehow fixed the problem. That would be surprising, though: as I understand my code, the non-noisy model in the table above shouldn't come near that part of the code, and the same goes for the `oAN` models, which have also been slow on ARC4 in the same way. (Edit: My mistake mentioned above is actually helpful here, because the non-pre-drawn noisy runs were mistakenly run with the original version of `sc_scenario_perception`, yet still showed the 10x speedup with parallelisation.) But if it is a fluke, it is a remarkably consistent one, because it has happened over three consecutive ARC4 runs of `do_5_probabilistic_fitting.py`. Maybe some sort of temporary problem on ARC4?

Looking closer at the ARC4 completed/aborted status emails for my past `do_5...` runs, I can see that the CPU time usage reported has been nearly identical to the wallclock time of the run (rather than the $\approx$40x factor I would expect with full parallelisation, and which I have seen for example for my latest `do_1...` ARC4 run), so this suggests that for some reason these runs have been non-parallelised. That aligns with the $\approx$2x slowdown I see for ARC4 compared to my computer in the table above, for non-parallelised runs, because that's the kind of ARC4 performance I have been seeing for these slow `do_5...` runs also. So maybe an ARC4 fluke/problem that doesn't assign CPUs as it should? 
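
One way to check this from inside a job, rather than from the status emails, would be to log how many CPUs the process is allowed to use and the ratio of CPU time to wall-clock time. The sketch below uses only the Python standard library; the function names are made up for this note, and it is not how ARC4 computes the figures in its emails. A ratio near 1 would suggest serial execution, and a ratio near the number of workers would suggest the parallelisation is working.

```python
import os
import time

def available_cpus():
    """Number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # only available on some platforms, e.g. Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

def cpu_to_wall_ratio(func, *args, **kwargs):
    """Run func and return (result, CPU time / wall-clock time).

    CPU time is user + system time of this process plus that of any
    child processes that have finished and been waited for.
    """
    t0 = os.times()
    wall0 = time.perf_counter()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall0
    t1 = os.times()
    cpu = sum(t1[:4]) - sum(t0[:4])
    return result, (cpu / wall if wall > 0 else float("nan"))

def busy_work(n):
    """Placeholder workload standing in for a fitting run."""
    return sum(i * i for i in range(n))

result, ratio = cpu_to_wall_ratio(busy_work, 200_000)
print(f"Available CPUs: {available_cpus()}, CPU/wall ratio: {ratio:.2f}")
```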

I have now started further `do_5...` runs on ARC4, but they continue to run at serial speed. I also attempted one run for just 125 parameterisations (like in the test tabulated above), to see if that had anything to do with it, but it is still running at serial speed. I really don't understand the difference between `do_5...` and `test_random_draws.py`. Maybe I should nuke the COMMOTIONS repo on ARC4 and start afresh? For now I will run the probabilistic fits on my computer, since it is feasible, but for the combined fits etc. I would like to figure this out.

Finally, wrt the specific question of whether to pre-draw noise or not, there are no indications above of notable speedups from pre-drawing noise. On ARC4 there is a hint of a small speedup and on my computer a somewhat stronger hint of a slowdown, but with the variability inherent in these probabilistic runs I would say that this is inconclusive, and that it makes sense to keep the code unchanged.

## Conclusion

I still don't know why `do_5...` runs at serial speed on ARC4. I will run the probabilistic fits on my computer now, and then nuke the COMMOTIONS repo on ARC4 and start afresh and see if I get the same problems for the combined fits.