
Optimise tagLocations #9946

Open
daemon1024 opened this issue Jul 25, 2021 · 16 comments · Fixed by #10028
Assignees: daemon1024
Labels: enhancement, optimization

Comments

@daemon1024
Member

daemon1024 commented Jul 25, 2021

When we first open the map endpoint, the first and default markers that load on the map are tag locations.

But when we start interacting with the map, it starts sending loads of requests to the API.

pl-tagloc.mp4

This eventually causes throttling.

We could debounce queries on these events. ( cc @jywarren )

Potential sources for these event triggers:

map.on('load viewreset resize zoomend moveend', contentMap);

function contentLayerParser(map, markers_hash, map_tagname) {
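
A minimal debounce sketch (just an illustration; the timer-based wrapper and the 250 ms delay are assumptions and would need tuning):

```js
// Sketch: collapse rapid pan/zoom events into a single call to contentMap.
// The 250 ms delay is an arbitrary assumption.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), delay);
  };
}

map.on('load viewreset resize zoomend moveend', debounce(contentMap, 250));
```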

Ref #9698

daemon1024 added the enhancement and optimization labels Jul 25, 2021
@daemon1024 daemon1024 self-assigned this Jul 25, 2021
daemon1024 added a commit that referenced this issue Jul 25, 2021
@daemon1024
Member Author

Well, it may help somewhat, but I don't think debouncing is helping as much as I thought it would.

On a wider map, the endpoint itself is slow.

[screenshot]

But unlike our findings in #7556,
[screenshot]
I don't see any particular SQL bottleneck; in fact, I don't see anything useful in Skylight.

On switching to allocation insights,

[screenshot]

I see a lot of new events that are hogging a lot of time; I will check it out.

Link to the Skylight info for these screenshots: https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1627225620/5m/endpoints/Srch::Search%20%5BGET%5D%20srch%2Ftaglocations?mode=memory

@jywarren
Member

One clue to figure out is: when you change the limit, does it vastly affect the speed, as @17sushmita noted in the call today? That might make sense, but if it doesn't help much, the question is whether the limit actually reduces query time, or whether we are just adding a limit near the end of a chain of queries and not getting much benefit from it.

If a low limit does help a lot, we could think about hitting the endpoint with a low limit, but either randomizing it or maybe better yet paginating it by recency or some other sorting function? Then as time goes on we hit the same endpoint but with page=1, page=2 that kind of thing, so we slowly build up with shorter queries. That could look a bit weird but would be like debouncing - spacing out queries giving the DB time to recover memory? Testing would tell if this actually helps overall.

I would like to see @17sushmita and you both try out this idea of truncating longitude/latitude queries to see if this improves our load time. The window of data we fetch would be a bit different than the viewed region. You can learn a little about this kind of truncation for a different purpose (privacy) here: https://github.com/publiclab/leaflet-blurred-location but it also results in better database optimization, with some caveats. We could discuss this approach in a new issue if you both prefer?
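
To illustrate the truncation idea (only a sketch; the query parameter names follow the taglocations endpoint, everything else is assumed), rounding the viewport outward to whole degrees means slightly different views produce the same, cacheable query string:

```js
// Sketch: truncate the current viewport to integer degrees so nearby views
// share one query string (and therefore one cache entry).
function truncatedQuery(map) {
  const b = map.getBounds();
  const params = {
    nwlat: Math.ceil(b.getNorth()),
    selat: Math.floor(b.getSouth()),
    nwlng: Math.floor(b.getWest()),
    selng: Math.ceil(b.getEast())
  };
  return '/api/srch/taglocations?' + new URLSearchParams(params).toString();
}
```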

@daemon1024
Member Author

daemon1024 commented Aug 2, 2021

One clue to figure out is: when you change the limit, does it vastly affect the speed? That might make sense, but if it doesn't help much, the question is whether the limit actually reduces query time, or whether we are just adding a limit near the end of a chain of queries and not getting much benefit from it.

Limits are helping for larger bounding boxes; it's somewhere between 2x and 4x slower when I remove the limit.

One thing that intrigues me:
https://unstable.publiclab.org/api/srch/taglocations?nwlat=52&selat=-52&nwlng=-150&selng=150

I tried this query, which took around 47s to load, but since there's no limit this should be close to the maximum number of taglocations we can get?

[screenshot]

and we have a fully loaded map now, which is more loaded than I have ever seen :P

I am not sure how to put it, but there's a big time tradeoff initially, and this is the most content we have ever seen. What if we cache this specific content, send it out initially, and later update the cache in the background? That should eliminate the time tradeoff.

@jywarren
Member

jywarren commented Aug 3, 2021

Comparison: https://publiclab.org/unearthing-pvd

@daemon1024
Member Author

daemon1024 commented Aug 3, 2021

Provided we cache things, this much data won't be easy to navigate and might also cause some client-side slowness (like you noted for the link you referenced, which needed WebGL? I am not sure if I understood that correctly). To mitigate that, we can try map clustering; a sketch is below.
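
For example, a minimal sketch with the Leaflet.markercluster plugin (assuming the cached data is an array of points with lat/lon fields, which is an assumption about the JSON shape):

```js
// Sketch: group nearby markers into clusters so a fully loaded map stays
// navigable. Requires the Leaflet.markercluster plugin.
const clusterGroup = L.markerClusterGroup();

tagLocations.forEach(function (point) {
  // point.lat / point.lon field names are assumed for illustration
  clusterGroup.addLayer(L.marker([point.lat, point.lon]));
});

map.addLayer(clusterGroup);
```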

@jywarren
Member

jywarren commented Aug 3, 2021

Handling 15k responses wasn't too bad with the GL-based layer. I honestly don't think we'll get that high given the optimizations possible, but you can see how it's implemented in the code here - not too bad!

https://github.com/publiclab/leaflet-environmental-layers/blob/7782b57a8ae0a08134d32b7c29a09f0e5923f518/src/unearthing.js#L35-L61

@daemon1024 daemon1024 mentioned this issue Aug 5, 2021
@daemon1024
Member Author

Here's what the current unstable looks like; it loads the markers in under 20ms with 0b5080e.

simplescreenrecorder-2021-08-06_02.20.57.mp4

What I did is store the JSON data from the response of https://publiclab.org/api/srch/taglocations?nwlat=52&selat=-52&nwlng=-150&selng=150&limit=1000 under static/tagLocStore.json and reference the raw GitHub content in our map. (We can update the bounding box to cover the maximum extent of the map, and maybe increase our limit if needed.)

But we can't really fetch the data once and forget about it, so I made a GitHub scheduled action in #9997 which will get the data from our endpoint every hour and commit any changes directly onto our main branch. We can choose to keep it on a separate branch if it's not suitable to commit onto the main branch, since we can fetch the raw JSON content either way.
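
On the client side the idea is roughly this (a sketch; the raw URL and the field names are assumptions):

```js
// Sketch: load the precached JSON committed by the scheduled action and add
// markers without hitting the Rails endpoint. URL and field names are assumed.
fetch('https://raw.githubusercontent.com/publiclab/plots2/main/static/tagLocStore.json')
  .then(function (response) { return response.json(); })
  .then(function (points) {
    points.forEach(function (p) {
      L.marker([p.lat, p.lon]).addTo(map);
    });
  });
```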

What do you think about this?

@jywarren
Member

jywarren commented Aug 6, 2021

This looks pretty cool! Could we achieve something similar with action caching? https://guides.rubyonrails.org/caching_with_rails.html

I know it's precaching though so I guess we could have a script that prefetches the cache daily...

Also, would there be scaling issues as the number of markers increases?

@jywarren
Member

jywarren commented Aug 6, 2021

And if we should fetch for different regions as it becomes too many markers, couldn't we precache by lat/lon boxes?

@daemon1024
Member Author

daemon1024 commented Aug 6, 2021

And if we should fetch for different regions as it becomes too many markers, couldn't we precache by lat/lon boxes?

That's where the microservice fits in, I think 👀👀
At a more granular level, it would store individual elements as objects that make up clusters as required. Our current Ruby endpoint needs to deal with a lot of things, but since we will be precaching the processed part, these elements won't take much time to load.
As for the scraping/precaching part, the GitHub Actions workflow should provide the proof of concept for that! Instead of storing it in the repository, we can send a POST request to our microservice to store it.

@daemon1024
Member Author

Could we achieve something similar with action caching? https://guides.rubyonrails.org/caching_with_rails.html

I tried to do things on the Ruby side, but it didn't go well.

I know it's precaching though so I guess we could have a script that prefetches the cache daily...

Like updating the file every day instead of every hour, as I'm doing right now?

Also, would there be scaling issues as the number of markers increases?

I don't think so 🤔 We would maybe need to update the limit number, though. Currently we have 1152 records, I believe. I am not sure if there's a provision to remove the limit on our endpoint right now.

@daemon1024
Member Author

daemon1024 commented Aug 6, 2021

@jywarren I am kind of unclear about what my next steps should be here.

@daemon1024
Member Author

Map cluster demo based on the above precached JSON data:

simplescreenrecorder-2021-08-10_21.31.37.mp4

@daemon1024
Member Author

daemon1024 commented Aug 12, 2021

I believe you raised a query about us limiting the points to 1000. I realised we have around 1152 points right now in total, so maybe we can set the limit to 1500 for the API call we make in #9997? Otherwise we can make another endpoint which just returns everything without any limit.

@jywarren
Member

Hi @daemon1024 - we have a few different issues here. In the near term, displaying the most recent 1500 and caching them seems reasonable.

But over the long term, we might expect this to grow to 10k, 20k, right? And we assume we wouldn't want to display just the most recent or most popular. So we need:

a) a way to prioritize which to show in the overall view, which for now is complete but will become a partial view as the dataset grows (so, by recency? popularity? a switch between the two, with some minimal UI to toggle, like on a tag page -- https://publiclab.org/tag/water-quality), and

b) a way to make an efficient sub-query of a zoomed-in view. That is, as you zoom in, we are not fetching data for the whole globe, so we can show more of a "complete" view as you zoom in.

That'll mean a query that is geographically confined, but also one which is cache-able. We can cache by geographic region, but you want the cache key to be very standard, even if people don't zoom in exactly the same way. So we could cache by latitude/longitude range, for example -- cache by integer grid squares, say. Then we could limit to some number (500? 1000?) for each lat/lon grid square, and cache those. But we'd have to ensure Leaflet knows it can fetch by those - like:

  • /api/nearby______/?bbox=71,-144,72,-143

for example. It would have to fetch all of the grid squares that intersect with the current view, and each one can be individually cached. There may be a standard way to do this. Or we may need to write code to find all grid squares that intersect the current view box and manually fetch them all. This code could be modeled on the Leaflet Blurred Location grid square code, because actually it already does this dividing in order to draw grid squares (although they're not always integer lat/lon - they vary based on zoom).
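
A rough sketch of that client-side lookup (the endpoint path and bbox parameter are illustrative, and addMarkers is a hypothetical helper):

```js
// Sketch: split the current view into integer lat/lon grid squares, then fetch
// each square as its own bbox query so every square can be cached on its own.
function gridSquaresInView(map) {
  const b = map.getBounds();
  const squares = [];
  for (let lat = Math.floor(b.getSouth()); lat < Math.ceil(b.getNorth()); lat++) {
    for (let lng = Math.floor(b.getWest()); lng < Math.ceil(b.getEast()); lng++) {
      squares.push([lat, lng, lat + 1, lng + 1]); // south, west, north, east
    }
  }
  return squares;
}

gridSquaresInView(map).forEach(function (bbox) {
  // The endpoint path and bbox parameter are illustrative, not the current API.
  fetch('/api/nearby/?bbox=' + bbox.join(','))
    .then(function (r) { return r.json(); })
    .then(addMarkers); // addMarkers is a hypothetical helper
});
```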

Finally, we'd have to think about the schedule for caching these, as we wouldn't want to send requests for 100 grid squares and expire their caches all at once, as that would hit the database hard as well, just as we wanted to avoid. So what are some strategies for a rolling expiration of these tiles?

Anyhow, I know this is a lot. You may want to read up a bit on how geographic systems manage this. There might be something in Leaflet plugins: https://leafletjs.com/plugins.html or there may be an assumption that people will use pre-made geo servers like, I dunno, ESRI, or GeoServer or MapServer. There may be a lighter-weight one that we could consider as a microservice. Or we could just come up with a good rolling Rails cache expiration strategy. It may be that Leaflet has some plugins that help it fetch marker data by grid square. Or we may need to write that part ourselves.

Finally, do we "trash" loaded data as you continue to browse around? What if you build up like 50,000 points by moving the map and continuing to load more data as you zoom and pan? Do we need to "expire" the markers you loaded before, so we don't overload the interface? This could be thought about later, as we don't yet have that many, but it might eventually become a problem.

This is a lot! But I hope it helps you structure the research and planning around this. I'm happy to answer some questions!

@jywarren jywarren reopened this Aug 17, 2021
daemon1024 added a commit to daemon1024/plots2 that referenced this issue Aug 18, 2021
jywarren pushed a commit that referenced this issue Aug 21, 2021
* debounce changes in map content

Ref #9946

* increase limit of taglocations

Since we have an optimised API now, we can increase the limit to increase content

* add query limit as a search parameter

* minor refactor
reginaalyssa pushed a commit to reginaalyssa/plots2 that referenced this issue Oct 16, 2021
billymoroney1 pushed a commit to billymoroney1/plots2 that referenced this issue Dec 28, 2021
@jywarren
Member

jywarren commented Jan 4, 2022

I just wanted to say that in #10123 we are doing quite well now. I think it's possible we can call this "solved" but if we want to really push performance, we could keep it open and/or try other strategies. But maps are super useful and plenty dense now!
