GeoSAFE crash when dealing with very large vector layers. #538

Open
lucernae opened this issue Mar 27, 2019 · 1 comment

Comments

@lucernae
Collaborator

lucernae commented Mar 27, 2019

Problem

This issue might be related to other issues because of its generic symptoms.

Saravana tried to run an analysis with a building layer of modest size (a 60 MB shapefile).
Even though the extent is big, the analysis completed in around 20 minutes or so after my recent patch to GeoSAFE.
However, the problem happens when he tries to run a second analysis: the server crashes. This bug is reproducible consistently.

Here's an excerpt from the email conversation (containing my conclusion):

Hi Bala, sorry, just got a chance to work on it now.

So, if I understand correctly, the behaviour can be reproduced consistently.
Your logs and screenshots are very helpful.
I have come to these conclusions:

1. uWSGI crashed because it was unable to keep up with the requests. Your droplet has 4 CPUs, so this should not be a problem; the cause lies elsewhere.
2. To generate map tiles on the fly for newly created/uploaded layers, we use the QGIS Server backend. By default we use 4 containers to offload the job of generating these tiles. I can see that all layers have thumbnails, so that is not the problem: QGIS is able to generate the tiles.
3. However, I noticed the impact layer is around 100 MB, and QGIS failed to render all the tiles for a given extent. We have actually had bigger layers in the past (around gigabytes of data), so I'm not sure why it can't handle this one. But looking at the logs (Rancher logs) and the memory spike, I'm confident this is the cause: QGIS tried to render a tile, failed, and leaked memory.
4. Now, the CPU spike is caused by the swap process (because eventually the leaked memory exceeds the available memory). But swap also has limits. The stagnant RAM usage you saw is swap trying to move memory away. But since it's a leak, it will keep growing until the container is deleted or the server eventually crashes.


Based on this information, I copied all your data (database and media files) from your droplet to replicate this on my machine.
From there, we will decide how to handle this crash. I'm not sure at the moment, because the cause possibly comes from QGIS Server, which is a complete package on its own. So we will probably try to find a workaround to avoid the memory spike.

That concludes my report as of now. :)
I will post again in the Slack channel after trying this out.

Regards,


-- 
Rizky Maulana Nugraha
Senior Software Engineer
Kartoza
rizky@kartoza.com





On 25 Mar 2019, at 06.41, Da CodeKid <damacusr@gmail.com> wrote:

Hi Rizky,

Unfortunately I've destroyed that server (the one that spiked to 100%). Just to confirm, I created another server over the weekend and ended up with the same result.

I just created a new droplet (with the same configuration - 8GB / 4 CPU) and am running the first analysis. The spike occurs only when I run the analysis for the second time.

Server info is attached below in case you'd like to collect the logs (I'll destroy the server 24 hours from now). I've enabled password login for the server and added your GitHub account to the Rancher login as well (use the same password for the /admin portal too).

The eventual conclusion is that the QGIS Server backend was the cause and is not optimized to handle the memory leak.

A proposed solution will follow after further investigation.

@lucernae lucernae added the bug label Mar 27, 2019
@lucernae lucernae self-assigned this Mar 27, 2019
@lucernae
Collaborator Author

Update on the investigation:

I replicated the behaviour on my own machine with the following specs:

CPU: Skylake 4 core, 4.0 GHz
RAM: 16 GB
Disk: plenty/not a problem
OS: Ubuntu, but GeoSAFE runs on Rancher, the same way as the prod environment.

The crash did happen after the second analysis. The second analysis itself finished with no problem, but the whole computer crashed when I tried to view the layer.
Thus, I conclude the problem lies with QGIS Server rendering.

It turns out the reason for the memory leak is that each container runs several Apache worker threads (pretty normal, actually). However, for this kind of file, a 100 MB GeoJSON layer, a thread takes time to open the file and render it. While that thread is still working, another thread tries to access the layer (to render a different tile location), somehow gets a permission-denied error, and dies without cleanup. This happens again and again, accumulating dead memory until the container itself is destroyed.
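
For anyone who wants to watch this happen, a rough monitoring sketch (an assumption on my side: psutil is available inside the QGIS Server container, and the process names are guesses) that samples the resident memory of the Apache/QGIS processes so the accumulating dead memory is visible before the container gets killed:

```python
# Rough sketch, not part of GeoSAFE: sample total RSS of the rendering
# processes every 10 seconds. Process names below are assumptions.
import time
import psutil

WATCHED = ("apache2", "httpd", "qgis_mapserv.fcgi")  # assumed process names

while True:
    total = 0
    for proc in psutil.process_iter(attrs=["name", "memory_info"]):
        info = proc.info
        # memory_info can be None when access is denied
        if info["name"] in WATCHED and info["memory_info"] is not None:
            total += info["memory_info"].rss
    print("watched RSS: %.1f MB" % (total / 1024.0 / 1024.0))
    time.sleep(10)
```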

To check whether this comes from Apache or from QGIS's own code, I tried to load the layer in QGIS Desktop, and the memory it uses is a whopping 8 GB for a mere 100 MB GeoJSON layer.

Wow....

It only uses around 500 MB if the GeoJSON is converted to a shapefile. So I guess the problem comes from the data format. Maybe string types are not handled properly.
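
To poke at the "string types are not handled properly" hypothesis, a small diagnostic sketch with the GDAL/OGR Python bindings that compares the field definitions OGR reports for the GeoJSON layer and the converted shapefile (paths are placeholders, not the actual data):

```python
# Diagnostic sketch: print feature count and field type/width per layer.
# String fields with width 0 are unbounded in some drivers, which is the
# kind of difference I suspect between GeoJSON and shapefile.
from osgeo import ogr

def describe(path):
    ds = ogr.Open(path)
    layer = ds.GetLayer(0)
    defn = layer.GetLayerDefn()
    print(path, "features:", layer.GetFeatureCount())
    for i in range(defn.GetFieldCount()):
        fd = defn.GetFieldDefn(i)
        print("  %-24s type=%-10s width=%d"
              % (fd.GetName(), fd.GetTypeName(), fd.GetWidth()))

describe("buildings.geojson")  # placeholder path
describe("buildings.shp")      # placeholder path
```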

Possible Solution

Because the problem lies with the data format and how QGIS handles it, we can use the following alternatives as short-term solutions:

  1. Do not upload big vector layers in GeoJSON format. A shapefile will do to optimize the size. (I know, I also hate it, but the memory consumption doesn't lie.)
  2. Even if we use shapefiles as input layers, the analysis output will be in GeoJSON format... which is a problem for now. I don't have any workaround other than limiting memory consumption through the Apache config (see the sketch after this list). The plan is to limit it to 1 worker/thread per container and disable KeepAlive (so the thread dies and is created anew, with clean memory). The scale settings can be configured in Rancher depending on how many containers the host can handle. This will make rendering a little slower, but at least (I hope) it won't crash, and the site will self-heal (if a container crashes, a new one will be created).
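
For reference, this is roughly the kind of Apache tuning I mean in point 2; the values are illustrative only and assume Apache 2.4 with mpm_prefork inside the QGIS Server container, not the final config:

```apache
# Illustrative only: one worker per container, no keep-alive, and the
# worker is recycled after every connection so leaked memory is reclaimed.
KeepAlive Off

<IfModule mpm_prefork_module>
    StartServers            1
    MinSpareServers         1
    MaxSpareServers         1
    MaxRequestWorkers       1
    MaxConnectionsPerChild  1
</IfModule>
```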

Long term solutions:

  1. Use GeoPackage, of course, and move on (a conversion sketch follows below).
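
A minimal conversion sketch using the GDAL Python bindings (assuming GDAL >= 2.1 with Python bindings available; file names are placeholders), covering both the short-term shapefile workaround and the long-term GeoPackage target:

```python
# Convert a big GeoJSON layer to lighter formats. Note that the shapefile
# driver truncates field names to 10 characters and bounds string widths,
# which is part of why it is only a short-term workaround.
from osgeo import gdal

gdal.UseExceptions()

# GeoJSON -> shapefile (short-term workaround)
gdal.VectorTranslate("buildings.shp", "buildings.geojson",
                     format="ESRI Shapefile")

# GeoJSON -> GeoPackage (long-term target format)
gdal.VectorTranslate("buildings.gpkg", "buildings.geojson", format="GPKG")
```

The same can be done from the shell with the ogr2ogr CLI, e.g. `ogr2ogr -f GPKG buildings.gpkg buildings.geojson`.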

Short term solution no. 2 seems practical, but we need to build/test a new QGIS-Server backend container optimized for Rancher like this.
