New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve mapserver clustering performance #5503

Open
szekerest opened this Issue Oct 27, 2017 · 4 comments

Comments

Projects
None yet
3 participants
@szekerest
Member

szekerest commented Oct 27, 2017

Though the current clustering algorithm implemented in mapserver is accurate, but it doesn't perform very well if large number of points should be clustered. Clustering 100k points may take several minutes to be completed and this may be inconvenient for an online application. Because of the stateless nature of the rendering we cannot preserve pre-processed information between the sessions (like the points in the quadtree) to increase performance.

It was a requirement to provide a more basic variation of the algorithm which provides a faster (but less accurate) result, where we don't expect to get the same result as soon as the order of the points from the source is changing. After some investigations the following approach would be sufficient for the purpose:

  1. We read the source points sequentially in the same way as the current method and store only the cluster points in the quadtree (not all the points).
  2. For each point read from the database, we check if that can be assigned to any of the existing clusters. If yes we store that point in a linked list for later use.
  3. If we cannot find a cluster for the incoming point, we form a new cluster from this point and store that in the quadtree.
  4. When we complete reading points from the source, we assign the points from the linked list to the nearest cluster.

To switch on this appoach, the PROCESSING "CLUSTER_ALGORITHM=SIMPLE" should be used.

With this approach the real positions of the first clustered point is retrieved by default, average position is not calculated for the cluster. The average positions may have caused that some of the clusters fall within the specified distance eventually.

NOTE: The simplified algorithm will igore the FILTER parameter because it would have the same effect as setting the FILTER at layer level. This algorithm will support the GROUP parameter in the same way as with the FULL algorithm

szekerest added a commit to szekerest/mapserver that referenced this issue Oct 27, 2017

@jmckenna

This comment has been minimized.

Member

jmckenna commented Oct 27, 2017

I'm curious of your 100k points test, and how this new approach does. What speeds are you seeing with this new approach with that 100k point test?

@szekerest

This comment has been minimized.

Member

szekerest commented Oct 29, 2017

The change has been uploaded to https://github.com/szekerest/mapserver.git (branch simplecluster) so it can be tested. Clustering my test data of 70k points was 3 seconds (MAXDISTANCE=100, REGION= 'ellipse')

@pelord

This comment has been minimized.

pelord commented Oct 30, 2017

Would it be appropriate, for the sake of performance, to include a minimum threshold of display scale? For example, under the 1: 2000 scale, clustering stops. This could be an optional parameter and this would help to present correctly for superposed entities. Currently only the first clustered point gives its attributes to the symbolized point, the other does not have attributes.

@szekerest

This comment has been minimized.

Member

szekerest commented Oct 30, 2017

@pelord Fixed an issue related to GetFeatureInfo for this new algorithm, probably this is what you mention at the second part. Regarding to the first part if this solution doesn't work, I think we should somehow trigger clustering only for specific classes, where the scale dependency can be set as usual.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment