Worker node memory errors #861
It would be good to have an open issue for investigating approaches to return to the smaller nodes. I imagine there is a significant compute cost to the upgrade that we might not want to bear forever.
@MattHJensen Yes, I agree. Once this version is deployed, we can do some profiling and figure out what the issue is. My hunch is that it has to do with building the tables. For now, we could keep the number of nodes down to 5 and see how much that would cost.
@MattHJensen said in PolicyBrain issue #861:
Once the age tables are implemented, and we go from 10 dist/diff tables to 30 dist/diff tables, TaxBrain will need the […]
This could be the case, but I pulled a result from a run I did locally to get an idea of how big the final 10-year file is. Below I show the process of pulling the results (for clarity, I removed a couple of commands that threw errors):
and the size of the JSON file containing the results:
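(The original commands and output are not reproduced here. As a rough stand-in, with a made-up file name rather than the actual results file, checking the size of a saved results JSON from Python might look like the following.)

```python
# Hypothetical illustration of checking the size of a saved results file;
# "results_10yr.json" is a placeholder name, not the actual output file.
import os

path = "results_10yr.json"
size_kb = os.path.getsize(path) / 1024
print(f"{path}: {size_kb:.0f} KB")
```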
The final size of the results is pretty small at 390 KB. I think there is another issue with how the CPS file is treated when the tables are created. There are often several ways to do the same thing in pandas, and this benchmarking repo shows that some approaches can be faster or use less memory than others. Before we resign ourselves to using a more powerful computer, I think we should take a look at whether we are using the pandas functions in an optimal way and whether there is a less memory-intensive approach to building these tables.
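As one purely illustrative example of the kind of check I have in mind (the column names and data below are invented, not actual Tax-Calculator output), downcasting the numeric columns of an intermediate table can shrink its footprint considerably:

```python
# Sketch: compare the memory footprint of default dtypes vs. downcast dtypes
# for a CPS-sized table. Column names and values are invented for illustration.
import numpy as np
import pandas as pd

n = 400_000
df = pd.DataFrame({
    "expanded_income": np.random.rand(n) * 1e5,
    "iitax": np.random.rand(n) * 1e4,
    "agi_bin": np.random.randint(0, 10, size=n),
})

# Footprint with the default float64/int64 dtypes.
print("default dtypes:", df.memory_usage(deep=True).sum() / 1e6, "MB")

# Downcast numeric columns: one low-cost way to shrink intermediate tables.
slim = df.copy()
slim["expanded_income"] = pd.to_numeric(slim["expanded_income"], downcast="float")
slim["iitax"] = pd.to_numeric(slim["iitax"], downcast="float")
slim["agi_bin"] = pd.to_numeric(slim["agi_bin"], downcast="integer")
print("downcast dtypes:", slim.memory_usage(deep=True).sum() / 1e6, "MB")
```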
@hdoupe said:
Of course, the JSON objects that hold the table results are tiny. It is all the work involved in constructing the JSON object that takes memory. For example, sorting a dataframe containing over 400,000 rows takes a lot of memory (a rough way to observe that spike is sketched after this comment). Before you start changing the Tax-Calculator […]
@hdoupe, in your discussion of the AWS instances, you don't mention how many Tax-Calculator processes are running on each AWS instance. Is it just one process per instance? Or are you running two or more to use the multiple CPUs in each instance?
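To put a number on the sorting cost mentioned above, here is a rough, self-contained sketch (random data; psutil is used only for illustration and is not part of the current worker setup) of watching resident memory around a sort of a 400,000-row frame:

```python
# Sketch: observe the resident-memory jump caused by sorting a large frame.
import os

import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    """Resident set size of this process in megabytes."""
    return proc.memory_info().rss / 1e6

df = pd.DataFrame(np.random.rand(400_000, 50))
print("before sort:", round(rss_mb()), "MB")

# sort_values returns a reordered copy by default, so peak memory during the
# call is roughly double the size of the frame being sorted.
sorted_df = df.sort_values(by=0)
print("after sort:", round(rss_mb()), "MB")
```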
@martinholmer said:
Ah, right. Do you think it's possible that these dataframes are being duplicated somehow? To me, it seems excessive that a data set I would expect to be only a few hundred megabytes, even after all of the new tax variables are added, is blowing up to 15 GB. That said, there may be no way around it.
Thanks for the advice. I don't have very much knowledge about spot instances, but my impression is that they are useful for non-urgent computing. That makes them great for scientific computing, because you can run models overnight and during off-peak hours. On the other hand, PolicyBrain wants to kick off the simulation immediately and return the results relatively quickly, which is why I wasn't sure they would be a good fit. I could be totally wrong, though. Please correct me if I am. I'll do some further research into this, and we can talk about whether they are a good fit for our needs.
We are running one process per machine. Granted, this was all set up before I started, and I was not given any information about why we chose these types of machines. I just deduced that we chose instances with a high memory-to-CPU ratio because we were running independent processes with relatively large memory requirements. I'm completely open to rethinking what type of machine we want to run these processes on. It could be that a setup with fewer instances, each with more memory and CPUs, is more cost-effective than more instances with fewer CPUs and less memory. @martinholmer Thank you for the thoughtful questions and advice. This is an area where I have very little experience (none prior to starting at OSPC), but I'm trying to pick things up as quickly as I can.
Resolved via PSLmodels/Tax-Calculator#1942. The test app is currently running on r4.large instances with no problems. We are still seeing a memory usage spike of around 12.5 GB, which gives us about a 2 GB margin of error. This is something we should keep in mind. One way to test this in the future is to run Tax-Calculator in Docker containers. You can cap the memory available to a container, up to your machine's limits. For example, you could simulate having only 5 GB of RAM, and your process would be killed if it went over, just as it would on a machine with only 5 GB of RAM. However, my knowledge of this area is pretty limited, so if anyone else has ideas on how to do this, I am interested in discussing them. Thanks for quickly fixing this @martinholmer.
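For what it's worth, a minimal sketch of that Docker idea (the image name and command are hypothetical placeholders, not the actual PR #832 containers) is to pass a memory cap to `docker run`:

```python
# Sketch: launch a worker container with a hard memory cap so a run that
# exceeds it is killed, mimicking a smaller machine. The image name and
# command are hypothetical placeholders.
import subprocess

subprocess.run([
    "docker", "run", "--rm",
    "--memory", "5g",              # pretend the host only has ~5 GB of RAM
    "hypothetical/worker-image",   # placeholder image
    "python", "run_reform.py",     # placeholder entry point
], check=True)
```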
Tax-Calculator 0.17.0 is using too much memory on our AWS machines for reforms that use the CPS file. Our current servers are memory-optimized Ubuntu 16.04 r3.large AWS machines, which means they have 2 CPUs and 16 GB of RAM. The stack trace from the memory error looks like:
I wanted to make sure there wasn't an issue with a particular server, so I detached one of the production nodes, attached it to the test server, and updated it to use Tax-Calculator 0.17.0 and PB v1.5.0_rc6. However, the same issue occurred. Next, since a new memory-optimized generation, r4, has replaced the current r3 generation, and rather than go through the trouble of rebuilding the current worker-node environment, I did some further testing and minor tweaks of the containerized worker-node environment from PR #832 and deployed that on an r4.large server. Again, I ran into memory constraints. I then spun up the next-largest machine, an r4.xlarge, which has 4 CPUs and 30 GB of RAM. This did the trick. The test server is currently running on this instance.
Note that we do not need 30 GB of RAM; it is simply the next size up. The highest observed memory usage on this server is about 15 GB, so we were just over the limit for the r4.large machines. Memory usage sits at around 4-8 GB until the tables are built, at which point it climbs to around 10-15+ GB.
I plan to spin up 7-10 of the r4.xlarge machines for the production app. I will then manually install Docker on each machine and spin up the containers built from the Dockerfiles in PR #832. The computation time for each year has also increased to about 90 seconds, so I will update the job-processing-time variable in a follow-up PR. This is similar to issue #750.