Replies: 7 comments 4 replies
-
Support on this. We started a new project last year that needs to store about 800M files with an average size of 380KB, growing by about 250M files per year, so MooseFS (and LizardFS) is not suited. BTW, we had to resort to testing some commercially supported solutions. One of them is customized from CephFS and has a feature that can be turned on to support billions of small files.
-
Interesting experience, @bash99. Thanks. I'd say that on properly specced hardware MooseFS can probably handle 800M files, but you might consider using a block device to accommodate them and reduce load on the metadata (master) server. IMHO CephFS is absolute rubbish; I regret wasting so much time on it. Ceph has (or had) no real concept of data integrity: OSDs merely detect inconsistencies (eventually) but don't even repair them automatically.
-
Dmitry,
So how do you know that the slowness is coming from the replication? One way to figure it out would be to do a test with, say, one million files stored as a single copy. After you get the total time for this test, you can repeat the same test with 2 copies, 3 copies, 4 copies, etc.
If the slowness is not related so much to the replication but instead comes from the metadata server operations, I don't think this new method is going to help much. In addition, what you are proposing doesn't fit the existing MooseFS architecture either (IMHO).
-- Marco
On 5/21/20 2:51 AM, Dmitry Smirnov wrote:
[quotes Dmitry's original post in full; the post is reproduced at the end of this thread]
-
This is not my conclusion. I've said that small-file replication is slower, and that comes from observing a cluster where small files are segregated to dedicated chunkservers. The slowness is not from the number of replicas but from the number of chunks, because HDD performance degrades significantly when a file system accommodates millions of files.
Nonsense. It can be implemented in chunkservers alone, without modifying any other components. Aggregation of smaller chunks into chunk files can be retrofitted without changing the architecture. In essence it is merely batch replication of several chunks together as one archive/meta-chunk.
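For illustration, a minimal Python sketch of that meta-chunk idea, assuming a tar-like container; all names, paths and the on-disk layout below are invented for the sketch, not MooseFS internals:

```python
# Hypothetical sketch only: paths, names and layout are invented and are
# not MooseFS internals.
import io
import tarfile
from pathlib import Path

def pack_meta_chunk(chunk_paths):
    """Bundle several small chunk files into a single in-memory tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for p in chunk_paths:
            tar.add(p, arcname=Path(p).name)
    return buf.getvalue()          # ship this as ONE replication payload

def unpack_meta_chunk(payload, dest_dir):
    """Receiving chunkserver restores the individual chunk files."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r") as tar:
        tar.extractall(dest_dir)

# e.g. one transfer instead of thousands:
# payload = pack_meta_chunk(sorted(Path("/var/lib/mfs/hdd0").glob("chunk_*.mfs")))
```

The point of the batching is that the receiving side pays one transfer setup cost per archive instead of one per chunk.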
-
"Hide" gobs of small objects by storing up to X chunks in a fixed-size pod, and replicate entire pods in a single stream. Such an architectural decision could greatly speed up data replication when MooseFS is used for WORM / CDN / archival types of workloads, but at the expense of extra complexity. Not sure how fragmented the pods would become on really busy systems with unpredictable workload sizes. @marcomilano - transactional overhead is a thing: streaming replication of entire pods, versus CHUNKS_READ_REP_LIMIT-throttled per-chunk copies, would make a cross-DC migration of 1.5B+ chunks much less of a PITA.
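To put rough numbers on that transactional overhead, here is a toy serial-transfer model; every figure in it is an assumption for illustration, not a measurement, and real replication overlaps transfers:

```python
# Rough model of why per-chunk replication hurts at 1.5B chunks.
# All numbers are assumptions for illustration, and the model is serial.
chunks    = 1_500_000_000        # chunk count from the comment above
avg_chunk = 380 * 1024           # bytes; small-file-sized chunks (assumed)
rtt       = 0.030                # assumed 30 ms per-transfer setup cost
bw        = 1.25e9               # assumed 10 Gbit/s link, in bytes/second

per_chunk  = rtt + avg_chunk / bw            # setup cost dominates small chunks
naive_days = chunks * per_chunk / 86_400

pod_size = 64                                # chunks packed per fixed-size pod
per_pod  = rtt + pod_size * avg_chunk / bw   # one setup cost amortized over 64
pod_days = (chunks / pod_size) * per_pod / 86_400

print(f"per-chunk: {naive_days:,.0f} days, per-pod: {pod_days:,.0f} days")
# under these assumptions: per-chunk ~526 days vs per-pod ~14 days
```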
-
You could look at storing the files in a relational database or a graph database instead, because DBMSs tend to handle small objects much better than filesystems do. You could also consider using memcached (with extstore) for storage of the files. I don't think the stated problem is particularly a flaw in MooseFS, since the problem is common to all filesystems.
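As a sketch of the DBMS route, a tiny blob store on SQLite from the Python standard library; the schema and key names are invented for illustration:

```python
# Minimal sketch of the DBMS suggestion using SQLite from the Python
# standard library; table and key names are invented for illustration.
import sqlite3

db = sqlite3.connect("smallfiles.db")
db.execute("CREATE TABLE IF NOT EXISTS blobs (path TEXT PRIMARY KEY, data BLOB NOT NULL)")

def put(path, payload):
    """Store one small file; millions of rows live in a single database file."""
    db.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (path, payload))
    db.commit()

def get(path):
    """Fetch a small file back by its path key."""
    row = db.execute("SELECT data FROM blobs WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None

put("mail/inbox/0001.eml", b"Subject: hello\r\n\r\nbody")
assert get("mail/inbox/0001.eml").startswith(b"Subject:")
```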
-
I work around this by storing such files in a ZFS pool built from xTB loopback files backed by LizardFS 2.6.0 (probably moving back to MooseFS soon). Such a potential feature would be great, provided the metadata were also encapsulated into the SuperChunk. I've used this trick on and off for years and am doing it live right now with IMAP mailbox storage, as a way to move to a more agile data stack for this service.
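For anyone curious, a hedged sketch of that kind of setup; the mount point, pool name and sizes below are assumptions, it needs root and ZFS installed, and file-backed vdevs are documented mainly for testing:

```python
# Sketch of the loopback trick under stated assumptions: /mnt/mfs is a
# MooseFS/LizardFS mount and "smallpool" is a made-up pool name.
import subprocess

MFS_MOUNT = "/mnt/mfs"                                  # assumed mount point
BACKING = [f"{MFS_MOUNT}/zvol{i}.img" for i in range(4)]
SIZE = 1 * 1024**4                                      # 1 TiB per backing file

for path in BACKING:
    with open(path, "wb") as f:
        f.truncate(SIZE)          # sparse file: no data blocks allocated yet

# ZFS accepts absolute file paths as vdevs, so the cluster now replicates
# four big files instead of millions of tiny ones.
subprocess.run(["zpool", "create", "smallpool", *BACKING], check=True)
subprocess.run(["zfs", "create", "smallpool/mailboxes"], check=True)
```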
-
MooseFS has a young (still immature) rival, SeaweedFS (https://github.com/chrislusf/seaweedfs), which has some good design ideas.
One particularly good idea is to store multiple small files in "volumes" and replicate per volume.
It would be great to implement such a design in MooseFS by introducing a new type of chunk file and storing small files within such chunks.
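As a sketch of what such a chunk file could look like, here is a toy append-only volume with an in-memory offset index; the format is invented for illustration and is neither the SeaweedFS layout nor a concrete MooseFS proposal:

```python
# Toy append-only "volume": many small files in one big file plus an
# in-memory offset index, so disk I/O and replication see one large file.
import struct

class Volume:
    HEADER = struct.Struct("<I")          # 4-byte payload length prefix

    def __init__(self, path):
        self.f = open(path, "a+b")        # one large file per volume
        self.index = {}                   # file_id -> (offset, size)

    def put(self, file_id, payload):
        self.f.seek(0, 2)                 # always append at the end
        start = self.f.tell()
        self.f.write(self.HEADER.pack(len(payload)) + payload)
        self.index[file_id] = (start + self.HEADER.size, len(payload))

    def get(self, file_id):
        offset, size = self.index[file_id]
        self.f.seek(offset)               # one seek + one read per small file
        return self.f.read(size)

vol = Volume("volume_0001.dat")
vol.put("photo-1", b"\x89PNG...tiny image...")
assert vol.get("photo-1").startswith(b"\x89PNG")
```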
MooseFS performance with small files is far from optimal. I've been storing 30 million files (average size ~26 KB) in my cluster and found numerous problems, such as replication being a few orders of magnitude slower, etc. A large number of chunks/files kills the performance of rotational HDDs, so I moved the small files to SSDs to see how much it would help, but I was ultimately still disappointed with the results.
I decided to compare MooseFS performance to SeaweedFS with regard to small files.
I copied all 30 million small files to SeaweedFS and then measured the time taken by `rsync` to copy the small files to an empty 100GB SSD. It took 874 minutes (14h 34m) to fill the SSD with small files from MooseFS, while it took 356 minutes (5h 56m) to fill it from SeaweedFS. SeaweedFS accomplished the task roughly 2.5 times faster than MooseFS, but there is a catch: I had placed the small files on SSD-backed chunkservers, while SeaweedFS had the data on rotational HDDs. That's serious evidence of an efficient design when one can demonstrate 2+ times better performance on HDD versus SSD, isn't it?