Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup loop_ecology that was too slow for the evol simulation #58

Merged
merged 1 commit into from Apr 29, 2022

Conversation

ZHG2017
Copy link
Contributor

@ZHG2017 ZHG2017 commented Apr 28, 2022

Hi, I have made the exact improvement in the loop_ecology() function from the ecology.R, which will improve the performance for the evol simulations where the last for loop to construct species_list was identified as a bottleneck.

ohagen
ohagen previously approved these changes Apr 28, 2022
Copy link
Member

@ohagen ohagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seams to describe the described speed up and be compatible with reported unit test on species list and richness plotting objects.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

Thank you for looking into this and contributing to the code.

Unfortunately the code does not fully recreate the species object class (Admittedly that's not properly covered by the unit tests):

source("R/gen3sis_species.R")
source("R/divergence.R")
library(microbenchmark)

data <- list()
data$all_species <- readRDS("your_species_object_here")

test <- function(data){
  species_list <- list()
  for (species in data$all_species) {
    cells <- names(species[["abundance"]])[species[["abundance"]] != 0]
    updated_species <- limit_species_to_cells(species, cells)
    species_list <- append(species_list, list(updated_species))
  }
  return(species_list)
}

test2 <- function(data){
  species_list <- vector(mode = "list", length = length(data$all_species))
  species_list <- lapply(1:length(data$all_species), function(i) {
    append(species_list[[i]], limit_species_to_cells(data$all_species[[i]], names(data$all_species[[i]][["abundance"]])[data$all_species[[i]][["abundance"]] != 0]))
  })
  return(species_list)
}

View(test(data))
View(test2(data))
identical(test(data), test2(data))
microbenchmark(test(data), test2(data), times = 10)

While it's a bit faster than the current implementation I'm not convinced that this is an actual bottleneck? Test with a 5828 species file I had at hand said ~700ms vs 420ms (median) for the current vs new implementation. Do you have profiling data that shows this to be the bottleneck?

Nevertheless thank you again for looking into this.

@benj919 benj919 requested a review from ohagen April 28, 2022 13:42
@benj919 benj919 dismissed ohagen’s stale review April 28, 2022 13:43

see code/comment on non-compatibility

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

@benj919 Hi, Sorry to have forgotten to mention that only several evol simulations(~20) are really slow with the original implementation, the modified part becomes somehow a bottleneck only with one of those evol landscape configurations, like the l12421th andscapes config file, and it becomes really slow only after about 200 iterations, at the very beginning it is still fast enough.
The identical() function seems to fail for some special cases like the following one:

>  set.seed(5)
>  y <- rnorm(3)
> y
[1] -0.8408555  1.3843593 -1.2554919
> z <- c(-0.8408555, 1.3843593, -1.2554919)
> z
[1] -0.8408555  1.3843593 -1.2554919
> identical(y, z)
[1] FALSE
> typeof(y)
[1] "double"
> typeof(z)
[1] "double"
> 

Wheras I printed out the content of species_list for both the current implementation and the improved one into 2 different text files then compared the 2 files with linux command diff and I see no difference.

PS: The evol simulation with the 12421th landscapes configuration could not end in 48 hours. For the current implementation, after about 200 steps, the concerned code segment took about 40 seconds meanwhile the improved implementation takes about 6 seconds.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

Obviously the identical function fails for the doubles here, printing them rounds and truncates values from the true value. If you compare the View() calls in the sample code above you'll see the missing class attributes.

set.seed(5)
x <- rnorm(3)
> print(x, digits = 5)
[1] -0.84086  1.38436 -1.25549
> print(x, digits = 10)
[1] -0.8408554808  1.3843593435 -1.2554918626
> print(x, digits = 15)
[1] -0.840855480786298  1.384359343478584 -1.255491862627667
> print(x, digits = 22)
[1] -0.8408554807862984592504  1.3843593434785843587775 -1.2554918626276665438724

y <- c(-0.8408554807862984592504,  1.3843593434785843587775, -1.2554918626276665438724)
> identical(x,y)
[1] TRUE

How many species do you have in your simulations? I don't now what your enumeration of simulations stands for. Do you have a profiling trace for them?

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

The one I used for the number 12421 has a parameter as max_number_of_species = 200000, I used the files provided by ohagen, and I do not have a profiling trace for the current implementation because the concerned simulation does not seem to be able to finish in 48 hours.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

I have added the explicit return statement to test2. If there's none R will (silently) return the last object accessed in a function.

Do you have some profiling data for any of your simulations? Simulations with a lot of species routinely run for a long time. How did you identify this part of the code as the bottleneck?

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

I submitted a lot of simulations on a cluster and found a few that did not finish in 48 hours. I then picked up one simulation with all its config files and landscapes files, I locally keep running it with some timing code inserted in the package. Then I found out the ecology_loop function becomes slower and slower, I then put more timing instructions in that function to figure out which lines are the cause. If we keep running the package for the evol simulation 12420 provided by ohagen, the current implementation will become slower at each iteration, after 200 iterations, the concerned code segment will take up to 40 seconds. The total runtime becomes bigger at each step, that's how I figured out this as a bottleneck. WIth the modified implementation, the simulation will not become too slower, at step 200th, the concerned code with modification only takes about 6 seconds. By the way, I compared also the resulting species plots generated at each iteration between the current implementation and the modified one, I see no difference.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

Also you can check the objects to themself before and after applying the function:

> identical(data$all_species, test(data))
[1] TRUE
> identical(data$all_species, test2(data))
[1] FALSE

Copy link
Member

@ohagen ohagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZHG2017 , I forwarded @benj919 the last optimization report you sent me, which might contain some hints to the question he asked. I will merge this after @benj919 questions are replied.
@benj919, I will add an further issue if you think that adding the tests that @ZHG2017 developed could be handy for our checking routine.

@ohagen ohagen mentioned this pull request Apr 28, 2022
@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

For optimized code please ensure that the results are bitwise identical to the previous results. Unless there is a proven reason to have the code change the result.

For identifying bottlenecks please familiarize yourself with the profiling functions in R to get a reliable overview on what's going on.

I have forwarded @ohagen the scripts I used during the initial optimizations to check for correctness and a calling script with optparse that has a switch available to enable and run the profiling data collection.

@ohagen
Copy link
Member

ohagen commented Apr 28, 2022

thanks @benj919! I will FW this to @ZHG2017

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

I don't think I could use any script to identify the bottleneck because the concerned simulation will last for more than 48 hours that I have never waited until its end to identify this bottleneck. At the moment, I have no idea how to get the bitwise identical result as previous one. So the issue becomes either we do not have the bitwise identical result or abandon those slow simulations.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

You don't need those simulations to finish to start looking for trends. You can stop them after 48h and evaluate the profiling traces already. How many time steps did those simulations finish in 48hrs? Out of those how many hours were spent in the above optimized code fragment?

We know that simulations with many species have an escalating cost associated with them. That's why we originally implemented those abort conditions on too many species being alive.

You should get the code for the bitwise comparison results from ohagen soon.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

Let me provide a bit more details, there is a folder for the config files(more than one file inside), there is another folder for the landscape files(more than one file inside), the R script will use both folders for one specific simulation. My question is then how could we benchmark and identify the bottleneck of this simulation that will never finish ?

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

By not finishing it. The logfiles should tell you roughly how many timesteps a simulation can finish within 48hrs. By adjusting either the start or end timestep you can limit the simulation to run for a shorter amount of time. Or you just abort the simulation after 48hrs and compare the partial outputs.

Once you have identified and optimized the bottlenecks present in those partial runs you can repeat the process and progress further into the simulation in 48hrs. Depending on the input size you should be able to figure out the trends within 12hrs or even shorter runs already.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

I need to mention more details, keep running this simulation locally, we could see that at each step, the simulation runs slower than previous step, the printed out timing I ever used shows that, the only code segment that will increase its runtime at each step is the one I tried to improve. And locally, I could see obvious speedup with the modification, so I wonder what would be the necessity to run another benchmark for identification of bottleneck. By the way, I am reading your script, I see there is the Rprof() used for the benchmark purpose that I have ever considered before, but I personally don't like it since it shows too many irrelevant informations , like those intermediate functions calls, which will be troublesome.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

You need to show the overall behaviour of the code. If the simulation gets slower with every time step the base assumption is that every piece of code gets slower. The interesting part is which piece gets disproportionally slower. For that you need to either do the profiling, or if you want to go with the timing you need to fully time all the major parts of the code. One you have that information you can, for example runtime vs time step, where every bar is split into the time per component/function. Do you have graphs for this showing the behaviour of all code parts?

If this is the only code segment that gets slower the simulation would have to go through ~4000+ time steps at 40sec/iteration?

Rprof can be used to profile by code line as well, which usually promotes the high level functions to the top of the list; Rprof(..., line.profiling = TRUE) and summaryRprof(..., lines="show"). I agree that it shows a lot of things, but they are needed to decide on what is actually important and what is not.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

I first put some timing for each function in the main loop of the package and saw only the loop_ecology() had this behavior, the other functions like loop_dispersal() or loop_evolution() does not have such a behavior. Then I did some timing inside the loop_ecology() and figured out the code segment I identified as bottleneck will increase its runtime from less one second to about 40 seconds and after another 50 steps, it will take up to 60 seconds.
By the way, the current implementation first defines an empty list then at each iteration, some new data as element would be appended to it, where I think there will be extra cost like allocating new memory then copy the existing data to the new memory and add the new data to the new memory, if each iteration will have this extra cost, it is quite logical to have this code segment slower and slower for large dataframe appending. My implementation first preallocates the number of elements then append real data to the right location, where I think there will be no extra cost to allocate any new memory at each iteration.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

Hi, I have got another question, I have just tried the following code:

  species_list <- list()
  for (species in data$all_species) {
    cells <- names(species[["abundance"]])[species[["abundance"]] != 0]
    updated_species <- limit_species_to_cells(species, cells)
    species_list <- append(species_list, list(updated_species))
  }

cat("----View(species_list)---->", "\n")
View(species_list)

But got an error like the following:

Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class ‘"gen3sis_species"’ to a data.frame
Calls: run_simulation ... <Anonymous> -> as.data.frame -> as.data.frame.default

So I wonder how could I get all the attributs of species_list either using View() or anything else ? Could you please give me any suggestion ?

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

here's a one-line lapply implementation that preserves the attributes:

test3 <- function(data) {
  species_list <- lapply(data$all_species, function(i){limit_species_to_cells(i, names(i[["abundance"]])[i[["abundance"]] != 0])})
  return(species_list)
}

> microbenchmark(test(data), test2(data), test3(data), times = 10)
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
  test(data) 457.5162 575.8027 695.6613 688.5292 840.1387 997.7278    10
 test2(data) 337.8429 379.9543 466.8935 421.4749 491.9027 874.7930    10
 test3(data) 286.2038 302.7601 352.8637 365.8548 386.3862 429.8704    10

> identical(data$all_species, test3(data))
[1] TRUE

View() is an interactive function that can't usually be printed. the outer species_list is just a list, you can get the attributes of elements in it with the attributes function:

> attributes(data$all_species[[1]])
$names
[1] "id"         "abundance"  "traits"     "divergence"

$class
[1] "gen3sis_species"

that can be print(attributes(data$all_species[[1]]))'ed but not be cat()'ed.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

OK, I see, thank you ! Let me run a partial benchmark locally to see if it still works. I did not know this syntax, I need to learn more syntax about R later.

@benj919
Copy link
Collaborator

benj919 commented Apr 28, 2022

on the View error, are you using Rstudio? View() is very helpful for manually/visually inspecting elements. As is the environment browser it offers.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

on the View error, are you using Rstudio? View() is very helpful for manually/visually inspecting elements. As is the environment browser it offers.

Oh, I see. In fact, I used the command Rscript to launch the code, I do not use Rstudio or any IDE.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

The runtime looks good:

step = 1914 , species alive = 2002 , species total = 15377 
------>length(data$all_species): 15681 
------>The for loop at the end of loop_ecology() took: 4.048102 
------>The for loop2 at the end of loop_ecology() took: 0.8134422 

I use Sys.time() et the beginning and at the end of this loop to get the runtime for the current implementation as the first timing message then another timing is for the optimized one, I keep both implementations in the same script just to compare the 2 implementations. By the way, the simulation is running in the reversed order from 2000 to 0.
And it starts to slow down:

step = 1880 , species alive = 2067 , species total = 24796 
------>length(data$all_species): 25067 
------>The for loop at the end of loop_ecology() took: 10.09261 
------>The for loop2 at the end of loop_ecology() took: 1.393903 
And just after 20 more iterations:
------>The for loop at the end of loop_ecology() took: 16.61986 
------>The for loop2 at the end of loop_ecology() took: 1.422436 
step = 1862 , species alive = 2118 , species total = 29920 
------>length(data$all_species): 30217 
------>The for loop at the end of loop_ecology() took: 17.90114 
------>The for loop2 at the end of loop_ecology() took: 1.848372 
step = 1861 , species alive = 2131 , species total = 30217 
------>length(data$all_species): 30486 
------>The for loop at the end of loop_ecology() took: 14.44838 
------>The for loop2 at the end of loop_ecology() took: 1.406939 
step = 1860 , species alive = 2135 , species total = 30486 
------>length(data$all_species): 30781 
------>The for loop at the end of loop_ecology() took: 15.86668 
------>The for loop2 at the end of loop_ecology() took: 1.732291 

After about 200 iterations, the runtime shows that the current implementation becomes a bottleneck:

step = 1800 , species alive = 2164 , species total = 47550 
------>length(data$all_species): 47824 
------>The for loop at the end of loop_ecology() took: 41.90425 
------>The for loop2 at the end of loop_ecology() took: 2.345014
step = 1799 , species alive = 2114 , species total = 47824 
------>length(data$all_species): 48091 
------>The for loop at the end of loop_ecology() took: 51.74762 
------>The for loop2 at the end of loop_ecology() took: 2.577074 
step = 1798 , species alive = 2105 , species total = 48091 
------>length(data$all_species): 48369 
------>The for loop at the end of loop_ecology() took: 53.185 
------>The for loop2 at the end of loop_ecology() took: 2.387722 

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 28, 2022

Hi benj919, I did a few plots checking as well as bitwise species_list checking between the current implementation and the improved one using identical(), so I have adopted your suggested modification and pushed again for merging. Thank you for the suggested modification to get bitwise correctness !

@ohagen
Copy link
Member

ohagen commented Apr 29, 2022

Thanks @benj919 and @ZHG2017 ! I guess we are now bitwise \o/ I am going ahead.

@ohagen ohagen merged commit c4aad85 into project-gen3sis:master Apr 29, 2022
@benj919
Copy link
Collaborator

benj919 commented Apr 29, 2022

Allocating the species_list is superfluous but otherwise this should work. I'd recommend merging the future changes into the development branch though for keeping track on what's in the released package and what's still under development.

@ZHG2017
Copy link
Contributor Author

ZHG2017 commented Apr 29, 2022

Allocating the species_list is superfluous but otherwise this should work. I'd recommend merging the future changes into the development branch though for keeping track on what's in the released package and what's still under development.

OK, I thiought it was just a small modification so I asked to merge directly to the master branch, the next time I will ask for merging into the development branch first.

@ohagen
Copy link
Member

ohagen commented Apr 29, 2022

Agree with Benji!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants