
Establish an object store #169

Closed

zerebubuth opened this issue Aug 2, 2017 · 31 comments

@zerebubuth (Collaborator) commented Aug 2, 2017

From the planning document:

Currently, there are two “services” / planet servers (ironbelly and grisu), each of which stores data for the website (user profile images and GPX traces) as well as planet and replication information. When switching between sites, synchronising this data is onerous, and can cause unexpected behaviour (e.g. skipped or duplicate replication files).

In order to make this more reliable, we intend to implement a storage cluster, probably based on Ceph and CephFS.

The next step is to put hardware into “core” sites to support this. We have some hardware “rescued” from Imperial College, which may be suitable. One open question is about the reliability of such hardware, and possibly we will want to have a plan B in case it proves unreliable. The costs of both plan A and plan B should be set out in the overall plan.

The "rescued" hardware is the tiamat-NN machines hosted by UCL which are up at the time of writing, and are in Chef, but don't have any special roles.

I'm not sure what step we're at, but we need to do some or all of:

  • Find a way to incorporate the public Ceph cookbook into our Chef system, preferably by including it directly from the public repository, although we might have to vendor it in.
  • Build a Ceph cluster across the tiamat-NN nodes.
  • Run some benchmarks to verify that we can write to it at the rates we want to.
  • Either use CephFS, or work out software modifications to the Rails port and the GPX daemon to support S3-like access through rados-gw (a rough sketch of such access is shown after this list).
  • Write the next step of the plan.
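As a rough illustration of the S3-style access mentioned above, here is a minimal sketch using boto3 against a radosgw endpoint. The endpoint URL, bucket name, object key and credentials are placeholders, not anything that exists in our infrastructure:

```python
import boto3

# Hypothetical radosgw endpoint and credentials -- placeholders only.
s3 = boto3.client(
    "s3",
    endpoint_url="https://radosgw.example.openstreetmap.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store a GPX trace as an object, keyed by trace id.
with open("trace-12345.gpx", "rb") as f:
    s3.put_object(Bucket="gpx-traces", Key="traces/12345.gpx", Body=f)

# Fetch it back, as the Rails port or the GPX daemon would on request.
body = s3.get_object(Bucket="gpx-traces", Key="traces/12345.gpx")["Body"].read()
print(len(body), "bytes")
```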

@Firefishy, I remember we discussed getting together to do some Chef work to make it easier to use public cookbooks, but I've forgotten all the details. Please could you refresh my memory?

@grinapo commented Aug 2, 2017

We are heavily using ceph clusters (mainly through rbd, no cephfs), so if you need real-life experience, feel free to ask. If you're already using it elsewhere, then ignore my interjection.

@gravitystorm (Collaborator) commented Aug 30, 2017

@grinapo Thanks! I'm using it elsewhere, but only for radosgw. If you have any experience with using ceph across multiple sites that would be particularly interesting for us.

FWIW, I think it's best for us to use ceph via radosgw rather than cephfs, but that discussion can come later on since we don't have a cluster available at all yet!

@grinapo commented Aug 31, 2017

Well, what do you want to know? :-) We're running several clusters across multiple datacenters, with various settings. Nearly all of the use is rbd-based: either mapped and mounted as a Linux block device (for containers) via the kernel driver, or used by qemu/kvm with librbd.

There are multiple potential issues which can easily be avoided: you ought to have more nodes than the planned replica count, and similarly sized OSDs are much easier to handle, since large differences in size are not handled well by CRUSH.

The upcoming release uses the bluestore backend; tests showed significantly better latencies than the XFS-based filestore backend, and it seems to be stable even under stress.

There is a massive difference between SSD and HDD IOPS performance, even with journaling (both filestore and bluestore): my current HDD-based cluster (SATA, not SAS, with SSD journaling) sustains ~3000 IOPS under mixed random read/write use (it can be pushed over 6000 with artificial loads), while people report roughly ten times that for SSDs.

IO delays on the MAN do not cause problems, but they definitely limit the maximum IOPS (and, to a much smaller extent, the maximum throughput); on the lab cluster, about 15 ms of latency was the limit separating "useful" from "useless". I haven't tested packet loss: if you have a lossy connection, nothing really works well. (I can test it, though; I have a lab cluster where I can play devil with the network settings by means of the 'netem' [network emulation] qdisc.)

We had various machine crashes, power outages, and admin errors, and ceph simply survived them all with 3 replicas. When availability dropped below quorum, everything got stuck in iowait, but once the cluster was back together things moved on. You should plan for this iowait and ensure that machines remain accessible despite it.

Replication is not backup; some people tend to mix the two up. I guess you see what I mean. ;-)

We never tried to fill any of the clusters to the brim, but sometimes undersized OSDs ended up with full disks, and ceph handled it well (though it complained a lot). It's really easy to expand, though.

You should have dedicated OSD hosts, and possibly dedicated MONs; mixing them with any other load doesn't play out well.

If you use multiple locations (whether racks or countries), the possible quorum states should be investigated: what happens if you have two locations and the network breaks, or two rooms and one loses its power, etc. Generally odd numbers are good, whether hosts, rooms or countries. ;-) 3 is the bare minimum, 5 is usually better (since there ought to be at least 3 replicas around).
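To make the quorum arithmetic concrete, a small illustrative sketch (the monitor counts are hypothetical): Ceph monitors need a strict majority to form a quorum, so an even split across two sites cannot survive losing either site, while three sites can lose any one of them.

```python
def has_quorum(surviving_mons: int, total_mons: int) -> bool:
    """A MON quorum requires a strict majority of all monitors."""
    return surviving_mons > total_mons // 2

# Two sites with 2+2 monitors: losing either site loses quorum.
print(has_quorum(2, 4))   # False

# Three sites with 2+2+1 monitors: losing any single site keeps quorum.
print(has_quorum(3, 5))   # True (a 2-MON site is lost)
print(has_quorum(4, 5))   # True (the 1-MON site is lost)
```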

Generally I am very happy with ceph; we had lots of problems with various other components, but ceph was the least visible among them. It usually just works.

@kosfar commented Sep 20, 2017

@Firefishy if you want to use Ceph across different sites, you had better avoid the parts of Ceph that implement strong consistency (RBD, CephFS) and stick to RGW, which operates under eventual consistency and thus asynchronous ACKs across writes. Otherwise, IO latency may climb to numbers that are not acceptable for you, and you also put a bunch of additional points of failure between your cluster nodes. It is not impossible, but it depends on your use case and how much complexity you want to put into your configuration.

I mostly agree with what @grinapo has described above. I would strongly suggest going with the latest LTS release that uses Bluestore, to get the performance but mostly the integrity benefits (embedded checksumming) it brings. The other important thing to take into account is your failure domains and your gear quality (at both server and network level). If you plan this part correctly, then you can build a CRUSH map that will offer you the desired availability. For performance, just multiply your drives' throughput/IOPS, divide by the replica size, and allow a 20-30% overhead for Ceph (a rough worked example follows below). There are a lot of configuration settings, both in Ceph and the OS, that will lead you to this "maximum" point, but in general Ceph defaults are acceptable. Of course your network should work as expected and have the capacity to sustain the bandwidth you can squeeze out of your servers.
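As a rough worked example of that rule of thumb (the drive count and per-drive IOPS below are made-up illustrative figures, not measurements from any real cluster):

```python
# Back-of-the-envelope estimate following the rule of thumb above:
# sum the drives, divide by the replica size, subtract ~20-30% Ceph overhead.
num_osds = 24                 # hypothetical: 24 HDDs across the cluster
iops_per_drive = 150          # hypothetical per-HDD random IOPS
replica_size = 3
ceph_overhead = 0.25          # middle of the 20-30% range

raw_iops = num_osds * iops_per_drive
usable_iops = raw_iops / replica_size * (1 - ceph_overhead)
print(f"~{usable_iops:.0f} client IOPS")   # ~900 with these numbers
```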

One side note: Ceph is like a Swiss Army knife in this ecosystem; it provides block storage, filesystems, and object storage. If you plan to use a software-defined storage technology for more than one of these use cases, then Ceph is probably what you want to invest in. But if you plan for only one of those use cases, GlusterFS for filesystems or Swift for object storage may prove simpler to learn and implement, at least at the beginning.

@grinapo commented Sep 20, 2017

We had terrible IOPS experiences with GlusterFS, to the point of it being unusable.

Also, ceph defaults are pretty much usable without tweaking (while an enormous amount of configuration is certainly possible), and the installation is usually smooth, provided the hardware thrown at it is well thought out. And since I have recently upgraded plenty of ceph clusters across various versions, I can say that upgrades also went rather smoothly, and the release and upgrade notes are well written.

(However, I have no experience with object stores, Swift or otherwise, and they may fit some workloads better.)

@kosfar commented Sep 20, 2017

GlusterFS is known for its poor defaults and needs tuning. However, it is more mature than CephFS, which was only declared stable a year ago. For most of the Ceph clusters I administer, the Ceph defaults are unacceptable too. It only takes 4 tunables to multiply your IOPS and throughput by 5-10x, and that's with simple plain rados benchmarks alone: no clients, no caches, no filesystems. Even then, you simply cannot avoid the overhead of the Ceph code doing all its magic (if you want to avoid that overhead, you need a different SDS technology). What I am saying is that we cannot talk about performance in a meaningful way if we do not define the baseline performance of the hardware we have in hand and our use case's requirements.

IMHO, Ceph has a steeper learning curve than Swift and GlusterFS, but it is a very good investment of time and knowledge if you plan to provide more than one type of service over your commodity storage nodes. Plus you get very good docs, as @grinapo said, a bunch of official automation tools to help you with installation, troubleshooting and management, and a vibrant community to discuss your concerns with.

@pnorman (Collaborator) commented Sep 20, 2017

I think one particular advantage of Ceph over other options is that more map-related software has been designed to interact with it than with other object stores.

@grinapo commented May 3, 2018

Just noting back that we have been running multiple bluestore-based Ceph clusters, both mixed (spinning disks with NVMe cache) and all-SSD, without any problems through machine crashes, electrical problems and such. Upgrades all went flawlessly, which is a huge plus.
(We're using "strongly consistent" rbd, so it's not as fast as it could possibly be with object storage, but I still can't complain.)
(Oh, and there are tons of operational statistics. And I mean it. :-))

@pnorman (Collaborator) commented May 3, 2018

I can't speak for ops with absolute certainty, but I believe the current blocker is a lack of OWG/sysadmin time to set up ceph, along with other more urgent matters.

@grinapo commented May 9, 2018

As a side note, if anyone is interested I can configure a ceph system to test (the installation time for ceph is really negligible; my time may be limited, but I can possibly do it faster than the months we'd wait for ops people to be free :-)), provided you have the required hardware online and working, but I will not touch Chef. :-) It's okay if you test it and then scrap the whole system. We usually do the same: bring up a lab cluster, make it available to the devs' clients, and then the devs can play with it as time permits. Contact me privately if you're interested, or tell me whether I should contact the ops group. (I cannot offer my time permanently, though; I'm pretty busy from time to time, but since I play with ceph anyway, it may be okay for now.)
Some design questions should be answered before doing anything, though.

@pnorman added the service:new label on Feb 27, 2019

@pnorman (Collaborator) commented Mar 7, 2019

I've been thinking about ceph since object stores came up in discussions about vector tiles. All the vector tile implementations we might run are designed to use an object store for storing tiles.

With our architecture of rendering servers at sites with nothing else, I think we'd want two ceph clusters if we end up hosting a vector-tile-based layer: one general purpose, and one specific to tiles.

(see also #214)

@pnorman (Collaborator) commented May 30, 2019

Right now we're facing a chicken and egg problem, and this issue has been stalled for some time. We could consider buying object store space from one of the commercial providers to allow us to start refactoring systems (e.g. openstreetmap-website, aerial imagery, etc) to make use of it.

Given our open source policy, the obvious choice is something supporting the openstack API (swift), which would include anything powered by ceph. The openstack list of providers may be useful here.

Note: I've disclosed this internally already: my employer offers an object store, but I am not involved in that part of the company.

@gravitystorm (Collaborator) commented May 30, 2019

the openstack API (swift)

What advantages would using the swift API provide? In my experience, there's a lot more support for S3-compatible API client libraries, for example Rails' ActiveStorage or the various tilelive libraries. I haven't come across many things that only support Swift APIs without also supporting S3 APIs.

@pnorman (Collaborator) commented May 31, 2019

What advantages would using the swift API provide? In my experience, there's a lot more support for S3-compatible API client libraries, for example Rails' ActiveStorage or the various tilelive libraries. I haven't come across many things that only support Swift APIs without also supporting S3 APIs.

I was basing it on our open source policy. If we want to go with an S3 API I'm okay with that, but the call really needs to be made by someone other than me.

@tomhughes (Member) commented May 31, 2019

Does the open source policy really talk about the APIs we use, as against the software that implements a particular instance of an API?

What does it even mean for an API to be "open source" exactly?

That said, the whole point of this conversation is using a commercial managed solution, and it's unlikely that any such solution will be completely open source, although some may be built primarily on open source components.

@pnorman (Collaborator) commented May 31, 2019

If we're fine with S3, should I draw up a request for submissions? We can also evaluate the providers identified internally, but I think we should also make it open. Hopefully there are some providers interested in supporting OSM who would respond.

@tomhughes (Member) commented May 31, 2019

Sounds good to me.

@pnorman (Collaborator) commented Jun 1, 2019

So that we have some concrete numbers for evaluating costs, could you check the number of files and the total apparent size (du --apparent-size) for user profile images, GPX files, and tiles?

@pnorman changed the title from "Experimental Ceph cluster on tiamat nodes" to "Establish an object store" on Jun 1, 2019

@tomhughes (Member) commented Jun 2, 2019

These numbers were gathered with an rsync trick (to get count and size at the same time), but I believe the size is the actual file size as requested. (A sketch of an equivalent one-pass count is included after the figures below.)

First up the user images:

Number of files: 874,334
Total file size: 40,946,136,146 bytes

...then the GPX traces:

Number of files: 2,452,596
Total file size: 547,901,732,659 bytes

...and the images generated for the GPX traces:

Number of files: 4,902,960
Total file size: 13,425,544,997 bytes
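For reference, a minimal sketch of how such counts and apparent sizes could be gathered in a single pass, equivalent in spirit to du --apparent-size plus a file count (the path is just a placeholder taken from the /store layout discussed later):

```python
import os

def count_and_size(root):
    """Walk a tree and return (file count, total apparent size in bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size   # apparent size, not disk blocks
                count += 1
            except OSError:
                pass  # file vanished or unreadable; skip it
    return count, total

files, size = count_and_size("/store/rails")   # placeholder path
print(f"Number of files: {files:,}")
print(f"Total file size: {size:,} bytes")
```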

I'm not sure exactly which tiles you're interested in? While we would probably want something like this for vector tiles, I'm not aware of any plan to use it for the current bitmap tiles.

@pnorman (Collaborator) commented Jun 3, 2019

Thanks.

I'm not sure exactly which tiles you're interested in? While we would probably want something like this for vector tiles, I'm not aware of any plan to use it for the current bitmap tiles.

I've got the data I need for tiles. Although not perfect, it gets me an estimate for GET and PUT requests per second, and something for size and bandwidth.

@pnorman (Collaborator) commented Jun 6, 2019

Based on the above numbers and munin, I have three scenarios for evaluating pricing. These scenarios are not precise; the point is to have numbers to compare providers on, and by the time we get to some of these projects the numbers will have changed. (A rough cross-check of the derived traffic figures follows the scenarios below.)

Short-term (nfs)

10M objects (reported usage)
500GB storage (reported usage)
40M GET/month (estimate from suspect NFS graphs)
1M PUT/month (guess)
2TB/month outbound (calculated from above)
200GB/month inbound (calculated from above)

Medium term (planet.osm.org)

5M objects (Calculated from # diffs + planet dumps + state files for 3 years)
12TB storage (below)
150M GET/month (Munin)
100k PUT/month (Calculated from # diffs + planet dumps + state files)
100TB/month outbound (Munin)
200GB/month inbound (Estimate)

Long-term (some form of tile serving)

360M objects (z0-z14 tiles)
1TB storage (render.osm.org numbers + experience)
7B GET/month (render.osm.org traffic)
2B PUT/month (render.osm.org throughput converted to tiles, scaled down somewhat based on experience)
50TB/month outbound (render.osm.org traffic)
10TB/month inbound (render.osm.org traffic)

@tomhughes (Member) commented Jun 6, 2019

Do you really think every user image and trace will be PUT every month? That seems... unlikely...

What are you counting in the 32TB for planet?

@pnorman (Collaborator) commented Jun 6, 2019

Do you really think every user image and trace will be PUT every month? That seems... unlikely...

I got the numbers from ironbelly's NFS munin graphs. NFS isn't my area, so it's possible I'm mis-reading it.

32TB is based on the current /store volume on ironbelly being 32TB, >90% full, and rounding up.

@tomhughes (Member) commented Jun 6, 2019

I'm not sure which NFS numbers you were looking at, but they make no sense: the GPX-related files are basically only written once, when they are created, and the user images only as and when somebody changes their image.

I'm guessing that whatever "writes" you're seeing are not real and are just some NFS artefact that wouldn't be present with a proper object store.

I think the 32TB figure is rather misleading, because /store is a mish-mash of stuff, only some of which is likely to be a candidate for an object store:

Directory             Size    Notes
/store/backup         3.3TB   Backups of various services
/store/elasticsearch  63GB    Storage for the logstash elasticsearch instance
/store/logs           14TB    Archived logs from planet, tile and www [should be massively trimmed]
/store/planet         12TB    This is the real candidate for object storage
/store/planetdump     384GB   Temporary storage used during planet dump generation
/store/rails          487GB   Already accounted for (user images and GPX traces)

@pnorman (Collaborator) commented Jun 13, 2019

I've updated the numbers above and tweaked some of my other estimates. For some of the tile server numbers I'm relying on experience to scale them up or down from what munin alone would suggest - there are so many unknowns that I'm just trying to pick something within the range of possibilities.

@gravitystorm (Collaborator) commented Jun 13, 2019

What does it even mean for an API to be "open source" exactly?

Yeah, that's something that would be up for debate. I think for OSMF the question could be "is there an open-source self-hosted option compatible with this commercially provided API?". Which there is for S3-compatible APIs, but not for certain other object stores. So with S3 we're not building in proprietary lock-in by using a third-party service.
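To illustrate the portability point: the same S3 client code can be pointed at a commercial provider or at a self-hosted store simply by changing the endpoint. A minimal sketch, with hypothetical endpoints and bucket name (credentials are assumed to come from the usual environment/config chain):

```python
import boto3

def make_client(endpoint_url):
    """The same S3 API calls work against any S3-compatible backend."""
    return boto3.client("s3", endpoint_url=endpoint_url)

# Hypothetical endpoints: a commercial provider today...
commercial = make_client("https://s3.example-provider.com")
# ...or a self-hosted Ceph radosgw later, with no client code changes.
self_hosted = make_client("https://radosgw.internal.example.org")

for client in (commercial, self_hosted):
    client.put_object(Bucket="osm-user-images", Key="avatars/1.png",
                      Body=b"\x89PNG...")   # placeholder payload
```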

@gravitystorm (Collaborator) commented Jun 13, 2019

should I draw up a request for submissions?

FWIW, I never envisioned a full RFP process as the first step. These object stores are routinely commercially available, so I figured the first step would be to just pick one, sign up with an OSMF account, and try it out. For small expenditures (e.g. < £100 per month) I think it's fine to try it and reassess later. And for the first project, it doesn't need to be seen as a long-term commitment to that provider, since we can easily move stuff between providers - the harder work is probably reconfiguring our services to make use of the object store, regardless of provider.

So I'm saying: don't worry too much about the long term; better to sign up with any one of them and start reconfiguring!

@tomhughes (Member) commented Jun 15, 2019

I'm broadly with @gravitystorm here in that I think we're overthinking/overengineering this - in particular, I think trying at this point to specify something that can cope with "tiles", when we have absolutely no plans at the moment to use it for tiles, or any idea what that might actually mean, is not very sensible.

Given that the cost for the initial use case of the user images and GPX traces seems very small (bear in mind we're already paying just over £100 a month for our existing use storing database logs), there seems no obvious reason not to go ahead.

@pnorman (Collaborator) commented Jun 28, 2019

OWG has decided to go ahead with S3 as an object store on a trial basis.

It's now up to the developers of the website to decide how to use it.

cc @gravitystorm

@pnorman closed this on Jun 28, 2019

@pnorman unpinned this issue on Jun 28, 2019

@gravitystorm (Collaborator) commented Jul 3, 2019

Thanks, @tomhughes has already started work on this.

Remember that there are plenty of other non-website-related things in the OWG remit that could also benefit from being moved to the object store, and these can be worked on in parallel!

@tomhughes (Member) commented Jul 3, 2019

The initial go-ahead is for the user images and GPX data; once we have some experience with that, and with how the costs pan out, we can think about expanding.
