Reduce duplication of rendering effort #101

Open
zerebubuth opened this Issue Aug 13, 2016 · 14 comments

zerebubuth commented Aug 13, 2016

From openstreetmap/chef#85, with apologies to anyone following the breadcrumbs...

The rendering machines are, currently, completely independent. This is great for redundancy and fail-over, as they are effectively the same. However, it means duplication of tiles stored on disk and tiles rendered. Duplication of tiles on disk is somewhat desirable in the case of fail-over, but duplicating the renders is entirely pointless.

Adding a 3rd server, therefore, is unlikely to reduce the rendering load on the existing servers by a third. However, a lot of the load comes from serving still-fresh tiles off disk to "back-stop" the CDN, and that would be split (more or less evenly) amongst the servers.

What would be great, as @pnorman and I were discussing the other day, is a way to "broadcast" rendered tiles in a PUB-SUB fashion amongst the rendering servers so that they can opportunistically fill their own caches with work from other machines. At the moment, it's no more than an idea, but it seems like a feasible change to renderd.

Currently the two servers are independent, and clients go to one based on geoip. This means that the rendering workload is not fully duplicated between the two servers, as users in the US tend to view tiles in the US and users in Germany tend to view tiles in Germany. This has been tested by swapping locations and seeing an increase in load.

Unfortunately, this doesn't scale well to higher numbers of servers.

Yesterday, yevaud rendered 963,135 distinct metatiles and orm rendered 859,036, of which 303,923 were common to both. If only one copy of each of those had been rendered, that would be an overall saving of 17%, which is nowhere near as large as I'd have hoped.

If a single server has a capacity of 1, then two have an effective capacity of 2 × 0.83 = 1.66, since each spends 17% of its work duplicating the other's.

If you assume the statistics remain the same, so that when rendering a tile there is a 17% chance that any given other server already has it, then with three servers there is a 31% chance (1 − 0.83²) that one of the two other servers has the tile. With each server spending 31% of its capacity duplicating work, the total capacity is 3 × 0.69 ≈ 2.07, an increase of 25% over two servers. If everything were distributed ideally it would be an increase of 50% (2x the actual gain).

My gut tells me that the statistics will not be the same and 3 servers will be slightly better than this model, but it gives us a place to start.

With four servers, there is a 43% chance of duplicating work (1 − 0.83³) and a total capacity of 2.29, an increase of 11% instead of the ideal 33% (3x the actual gain).

So we could set up three servers and not be too badly off for duplication, but beyond that it gets worse.
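
For anyone who wants to play with the numbers, here is a minimal sketch of the model above (my own illustration, assuming the measured 17% pairwise overlap stays constant as servers are added):

```python
def effective_capacity(n_servers, pairwise_overlap=0.17):
    # Chance that at least one of the other (n - 1) servers has already
    # rendered a given metatile, i.e. the fraction of work that is duplicated.
    p_duplicate = 1 - (1 - pairwise_overlap) ** (n_servers - 1)
    return n_servers * (1 - p_duplicate)

for n in (1, 2, 3, 4):
    print(n, round(effective_capacity(n), 2))
# prints 1.0, 1.66, 2.07 and 2.29 -- matching the figures above
```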

apmon commented Aug 13, 2016

Those are some interesting numbers for the overlap of rendering! Thanks.

Given the (perhaps) lower than expected overlap, pushing rendered tiles out to other servers might actually have an overall negative effect. Given that tile storage size is potentially also a significant limitation on overall rendering load, if tiles from a different server get pushed onto the other tile stores, the cleanup mechanism would likely need to be even more aggressive, and could thus delete tiles needed for the region served by that server.

Do we have any statistics on how many tiles get rendered, then deleted by the cleanup process, and then re-rendered shortly afterwards because they were needed again?

The alternative to a push model would be a pull model: if a tile store doesn't have an (up-to-date) tile already rendered, it first asks the other tile stores whether they have it and pulls it from there before attempting to render it itself. That way only needed tiles get pulled across, minimizing unnecessary duplication in storage. Two potential downsides (apart from the human labor involved) would be that adding an extra query step into the process adds latency (although, if the tile stores are all in the same country, the network latency might not be significant compared to rendering duration), and that it might add tile-storage lookup IO load, which already seems to be a major issue given the large latency spikes in tile serving.
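
Roughly, the pull step could look something like the sketch below (the peer URLs and helper names are made up, purely to illustrate the idea):

```python
import requests

# Hypothetical peer metatile stores exposed over HTTP (made-up URLs).
PEER_STORES = ["https://orm.example/metatiles", "https://yevaud.example/metatiles"]

def fetch_or_render(metatile_path, render_locally, timeout=2):
    """Ask the other tile stores for an already-rendered metatile first;
    only render locally if no peer has it."""
    for peer in PEER_STORES:
        try:
            resp = requests.get(f"{peer}/{metatile_path}", timeout=timeout)
            if resp.status_code == 200:
                return resp.content       # reuse the peer's render
        except requests.RequestException:
            continue                      # peer unreachable: try the next one
    return render_locally(metatile_path)  # nobody has it: render as today
```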

Going to something like Ceph or other shared, clustered tile storage would probably minimize the amount of development work needed; on the other hand, I am not sure how well any of them are optimized for having storage nodes in different data centers with "low bandwidth, high latency" connections between some of the nodes in the cluster.

If there is consensus on any software development work needed in the mod_tile/renderd domain, I could probably help out with the work.

tomhughes commented Aug 13, 2016

Going to ceph will minimise the amount of work in mod_tile, sure, but maximise the amount of operations work, as we have no experience of it.

That said, we want to get ceph (or something equivalent) going for other reasons, so this might be a good starting point, at least once we have the third tile server (I'm told three is the practical minimum for ceph).

What are the overheads for retrieval with something like ceph though?

jburgess777 commented Aug 13, 2016

I was wondering whether adding another proxy cache directly in front of each of Yevaud and Orm might help. If I remember correctly, we already allow the squid proxies to fetch from each other so we might be able to get squid to do the sharing for us.

Another option might be to convince the existing proxies to select which origin server to use based on the requested tile. Assuming both origins are working, there should then be no duplicate rendering. If squid isn't capable of picking the origin based on the path, then an HTTP load-balancing layer could probably achieve the same.

Perhaps a simple algorithm like splitting the odd/even zoom levels between the two servers.
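
As an illustration of how small such a rule could be (hypothetical host names, and assuming the usual /zoom/x/y.png request path):

```python
# Hypothetical assignment: even zoom levels go to one origin, odd to the other.
ORIGINS = {0: "yevaud.example", 1: "orm.example"}

def origin_for(tile_path):
    """Pick the rendering origin from the zoom level in the request path."""
    zoom = int(tile_path.strip("/").split("/")[0])  # "/14/8514/5504.png" -> 14
    return ORIGINS[zoom % 2]
```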

tomhughes commented Aug 13, 2016

Err, that's what we've been discussing. Obviously we could direct traffic based on the tile, which would avoid duplication, but the downside is that if a server fails the other one will be guaranteed not to have half the tiles.

apmon commented Aug 13, 2016

One other, comparatively straightforward, option could be as follows: the tile storage directory of each tile server gets mounted read-only (e.g. via NFS) on the other tile servers. After pulling a request from the queue, a rendering thread checks whether the metatile is available in any of the other tile stores (in the rendering threads we don't have to worry as much about blocking and latency). If yes, it copies the metatile over to its own tile storage and skips the rendering; otherwise it proceeds as normal. That could be implemented with a "few lines of code" and probably not too much operations work to configure the NFS. The open question there would be what this does to the IO load on the other tile stores and to the bandwidth requirement between sites.

The conceptual downside is that the de-duplication check is done post-queuing, i.e. a request can sit in the rendering queue for a long time even though the tile already exists on a different server and could therefore have been dealt with quickly, without queueing.
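
In rough outline (the mount points and paths below are made up), the check in the rendering thread might be little more than:

```python
import os
import shutil

PEER_MOUNTS = ["/mnt/orm-tiles"]  # hypothetical read-only NFS mounts of the other tile stores
LOCAL_STORE = "/srv/tiles"        # hypothetical local metatile directory

def copy_from_peer_or_render(rel_path, render):
    """Called after a request has been pulled off the rendering queue."""
    for mount in PEER_MOUNTS:
        peer_file = os.path.join(mount, rel_path)
        if os.path.exists(peer_file):                 # a peer has already rendered it
            local_file = os.path.join(LOCAL_STORE, rel_path)
            os.makedirs(os.path.dirname(local_file), exist_ok=True)
            shutil.copyfile(peer_file, local_file)    # copy it over and skip rendering
            return local_file
    return render(rel_path)                           # otherwise proceed as normal
```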

pnorman commented Aug 13, 2016

I suppose the big question is: how much bandwidth is available between the different rendering servers at different datacenters?

tomhughes commented Aug 13, 2016

@apmon NFS is not appropriate outside of a local network.

@pnorman I don't think bandwidth is a major issue between the data centres where we have tile servers currently.

jburgess777 commented Aug 13, 2016

What I'm really saying is that I think we could probably find a solution which does not involve shared filesystems or copying the tiles to both places. The servers are already short on disk space as it is.

apmon commented Aug 13, 2016

@jburgess777 how would the intermediary proxy know whether one of the tile servers/renderers has a tile? A normal HTTP request to the tile server would also trigger a rendering if it doesn't have that tile already. Or are you talking about basically moving the tile storage into that proxy (i.e. have a proxy cache of 1-2 TB upwards and then much smaller storage on the rendering servers)?

@tomhughes Which aspect of NFS are you particularly concerned about in this respect? But it doesn't have to be NFS, any way of getting easy read-only access to the underlying meta-tile storage would work fine. E.g. one could likely equally easily program it to just use something like https or scp.

jburgess777 commented Aug 13, 2016

@apmon A little of both. I don't think the intermediary needs to be particularly smart. Start by assuming there are caches in front of Yevaud (SquidY) and Orm (SquidO). If a request arrives at SquidY then one of several things can happen:

  • SquidY has the tile already and so serves it itself
  • SquidY doesn't have the tile so asks SquidO if it has it
    • If yes, the tile comes from SquidO and is then cached & returned by SquidY
    • If no, SquidY requests it from an origin server. If we are trying to keep duplicate rendering to a minimum then it could choose Orm or Yevaud based on some policy (e.g. odd/even zoom). Or, if that is not a concern, it could always fall back to its local origin, Yevaud in this case.

This is all theory; I have no idea if squid works as I described. If it can't, then maybe something else like varnish could do it instead. What I'm suggesting is that HTTP caching and redirection is pretty much a solved problem, and I am not convinced we need to spend development effort creating something new.

If we had fixed rules for which origin is responsible for rendering a given tile, then just some redirection using something like haproxy would do. These rules would immediately tell us which origin server would likely have the tile. If the default origin is down then we would have to fail over to a machine that is unlikely to have the tile, but that is no different to what we do today.

Something like haproxy could run on Orm+Yevaud. On the other hand, adding other servers, or enhancing the roles of the existing proxy caches, would let us expand the effective caching beyond the disk space available in the current machines.
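
In outline, the per-request flow would be something like this (pure illustration with made-up helper callables; whether squid, varnish or haproxy can express it directly is exactly the open question):

```python
def handle_request(tile, local_cache, sibling_cache,
                   pick_origin, origin_is_up, fallback_origin, fetch):
    """Sketch of what one front-end cache (e.g. SquidY) would do per request."""
    if tile in local_cache:                  # serve it ourselves
        return local_cache[tile]
    if tile in sibling_cache:                # ask the other cache (SquidO) first
        local_cache[tile] = sibling_cache[tile]
        return local_cache[tile]
    origin = pick_origin(tile)               # fixed policy, e.g. odd/even zoom
    if not origin_is_up(origin):
        origin = fallback_origin(origin)     # failover: that origin probably lacks the tile
    local_cache[tile] = fetch(origin, tile)  # render/serve at the origin, then cache
    return local_cache[tile]
```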

pnorman commented Aug 14, 2016

  • SquidY doesn't have the tile so asks SquidO if it has it
    • If yes, the tile comes from SquidO and is then cached & returned by SquidY
    • If no, SquidY requests it from an origin server. If we are trying to keep duplicate rendering to a minimum then it could choose Orm or Yevaud based on some policy (e.g. odd/even zoom). Or, if that is not a concern, it could always fall back to its local origin, Yevaud in this case.

I believe that where there are two caches at the same datacenter they fall back to each other like this, then fall back to the rendering server they are assigned to. If they checked another cache in a different datacenter it might cause problems: a) which cache would you check, b) you're adding more latency, possibly a significant amount for the more remote caches.

If the default origin is down then we would have to fail over to a machine that is unlikely to have the tile, but that is no different to what we do today.

It'd be worse. Right now when the other render server goes down the hit rate drops on the traffic that is moved, but it's not 0%. With this it would be close to that.

jburgess777 commented Aug 14, 2016

a) which cache would you check, b) you're adding more latency, possibly a significant amount for the more remote caches.

In the example given there is only a single alternate cache. The latency between the two existing UK hosts is <10ms, a lot less than the 100ms+ disk latencies we have been seeing recently.

Right now when the other render server goes down the hit rate drops on the traffic that is moved, but it's not 0%. With this it would be close to that.

I think 50% would be more realistic; we are still serving requests that would have been served locally. Based on the previous data, we are already doing a fairly good job of reducing the duplicate renderings, so we are probably only talking about dropping from 60-70% hits to 50%.

pnorman commented Aug 14, 2016

a) which cache would you check, b) you're adding more latency, possibly a significant amount for the more remote caches.

In the example given there is only a single alternate cache. The latency between the two existing UK hosts is <10ms, a lot less than the 100ms+ disk latencies we have been seeing recently.

Ah, you're talking about inserting another layer of caching between the CDN and the render hosts?

jekader commented Aug 18, 2016

I want to comment on optimizing storage of identical sea tiles. A deduplication script can be implemented in userspace to replace identical files with hardlinks. In that case just a single inode is allocated and the data is stored only once (and most likely sits in the OS cache, which will drive I/O down). This could even be done right after rendering: the renderer could compare the checksum of a generated PNG with that of a known empty tile and, if they match, create a hardlink instead of writing the file. Yet another approach is to use a filesystem with block-level deduplication built in: then userspace doesn't have to be altered.
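
A minimal userspace sketch of the hardlink idea (the file names below are made up; the canonical empty-sea tile and its checksum would need to be set up once):

```python
import hashlib
import os

SEA_TILE = "/srv/tiles/empty_sea.png"  # hypothetical canonical empty-sea tile
with open(SEA_TILE, "rb") as f:
    SEA_DIGEST = hashlib.sha256(f.read()).hexdigest()

def store_tile(tmp_path, final_path):
    """Store a freshly rendered tile, hardlinking it if it is an empty sea tile."""
    with open(tmp_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    if digest == SEA_DIGEST:
        os.link(SEA_TILE, final_path)     # one shared inode, pixel data stored once
        os.unlink(tmp_path)
    else:
        os.replace(tmp_path, final_path)  # a real tile: move it into place
```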
