
Another scalability question #7986

Closed
CyberCyclone opened this issue Jul 29, 2019 · 8 comments

@CyberCyclone CyberCyclone commented Jul 29, 2019

I've read this in the docs and highlighted the main points:

As with MinIO in stand-alone mode, distributed MinIO has a per tenant limit of minimum of 2 and maximum of 32 servers. There are no limits on number of disks across these servers. If you need a multiple tenant setup, you can easily spin up multiple MinIO instances managed by orchestration tools like Kubernetes, Docker Swarm etc.

Note that with distributed MinIO you can play around with the number of nodes and drives as long as the limits are adhered to. For example, you can have 2 nodes with 4 drives each, 4 nodes with 4 drives each, 8 nodes with 2 drives each, 32 servers with 64 drives each and so on.

Initially I read this to mean that you can adjust the nodes and disks at any point, but after trying and failing, I found multiple people who hit the same issue and discovered it's not possible. After reading the issues posted here, I still don't understand the reasoning behind it.

According to issue #4712, someone pointed out that sticking with a single bucket is a bad design choice even on AWS S3. But the S3 docs read to me like they're encouraging us to limit the number of buckets we create, especially since there's no upper limit on object creation and no performance loss from using one bucket, whereas there is a hard limit on bucket creation (1,000 buckets).

We know AWS doesn't enforce a restriction of "once a bucket has been created, you're stuck with whatever storage space". So why does Minio stick with this logic? It seems like a step backwards to me, or a discouragement from using Minio as a true production alternative to S3.

I can understand that from a technical point of view it's easier just to lock it down and pass the problem to the developer to handle, but that takes the concept of a SIMPLE Storage Service and adds an extra layer of vulnerability (e.g., having ANOTHER service that has to be managed and maintained on top of Minio just to behave the way AWS S3 does). ElasticSearch and RethinkDB can effortlessly scale / shard their data, and RethinkDB has a proxy built in to manage large scale.

So the logic to "know" where an object lives and in what bucket now has to live with the application and be stored within the application. While this isn't too much "trouble", it once again takes the idea of a "Simple" Storage Service and forces users to add logic to their application that they didn't previously need with S3. I.e., this isn't a simple replacement for AWS S3.
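
For example, here's a minimal (purely hypothetical) sketch of the kind of routing layer I mean, where the application has to keep its own map of which cluster owns which bucket. The endpoints and helper names are invented for illustration, not part of any MinIO API:

```go
package main

import "fmt"

// clusterFor maps a bucket to the MinIO cluster endpoint that owns it.
// In a real application this table would live in the application database
// and need to be kept in sync as clusters are added or decommissioned.
var clusterFor = map[string]string{
	"uploads-2018": "https://cluster1.example.com",
	"uploads-2019": "https://cluster2.example.com",
}

// objectURL resolves the full URL of an object by first looking up
// which cluster its bucket lives on.
func objectURL(bucket, key string) (string, error) {
	endpoint, ok := clusterFor[bucket]
	if !ok {
		return "", fmt.Errorf("no cluster registered for bucket %q", bucket)
	}
	return fmt.Sprintf("%s/%s/%s", endpoint, bucket, key), nil
}

func main() {
	u, err := objectURL("uploads-2019", "avatars/42.png")
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // https://cluster2.example.com/uploads-2019/avatars/42.png
}
```

None of this is hard, but it's bookkeeping that plain S3 never asked for.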

I'm also dreading what's going to happen when I want to decommission a server / cluster as it's going to require updating the application database to point to the new domain / bucket. Plus writing logic into the app to stop it trying to use the old cluster.

While this is all doable, it takes the fun out of building applications when you also have to maintain the infrastructure, which is the whole reason I'm using RethinkDB instead of MongoDB (a real pain to manage).

@harshavardhana harshavardhana self-assigned this Jul 29, 2019

@CyberCyclone CyberCyclone commented Jul 29, 2019

I've had a bit more time to think about it and I'm still puzzled as to why it's designed not to be scalable. I don't see how it makes a difference starting off with 4 nodes and increasing to 32 nodes over time compared to starting out directly with 32 nodes.

From a technical point of view I know it's not as simple as that, but I still don't understand why it's not a core feature.


@teury teury commented Aug 1, 2019

Has anyone used OpenStack Storage (Swift)? That would be a real alternative to S3, with effectively unlimited scalability.


@harshavardhana harshavardhana commented Aug 1, 2019

MinIO's fundamentals come from our early days at GlusterFS (most of us are from that early team), where we had to build a scalable NAS. We built it successfully, but through that arduous experience we realized that at scale, around 1PiB or more, adding disks and rebalancing made it hard to run clusters safely and led to instabilities in the system. Those instabilities were a direct function of the complexity, and through this experience we realized that complexity breeds bugs.

Almost all the systems built after GlusterFS have been prone to similar complexity in deployment and management at scale. This led us to conclude that the primary problem of all storage systems is complexity, so we chose to avoid the rebalancing problem altogether and turn it into a deployment problem instead.

At the scale MinIO is targeting for today's needs, rebalancing becomes a disproportionately complex problem and wouldn't scale to the amount of data we are talking about. By assuming practical deployment scenarios, MinIO came up with the following model.

What you essentially do in production is deploy in scalable units of 16 or 32 servers that define your deployment unit. Once you have this setup, all future setups/clusters can be federated together at the bucket level ad infinitum. With today's high-density hardware we can deploy servers with 106 drives of 8TiB each, which for a 32-server cluster comes to a default usable capacity of up to 13PiB in a single cluster.
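
As a back-of-the-envelope check of that figure (a sketch only, assuming 32 servers, 106 x 8TiB drives per server, and the default erasure coding, which reserves roughly half the raw space for parity):

```go
package main

import "fmt"

func main() {
	const (
		servers         = 32
		drivesPerServer = 106
		driveTiB        = 8.0
	)

	rawTiB := float64(servers*drivesPerServer) * driveTiB
	usableTiB := rawTiB / 2 // default erasure coding: roughly half data, half parity

	fmt.Printf("raw:    %.0f TiB (~%.1f PiB)\n", rawTiB, rawTiB/1024)
	fmt.Printf("usable: %.0f TiB (~%.1f PiB)\n", usableTiB, usableTiB/1024)
	// raw:    27136 TiB (~26.5 PiB)
	// usable: 13568 TiB (~13.2 PiB)
}
```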

MinIO doesn't restrict the number of drives, so you can in fact give MinIO as many drives as you want.

MinIO believes in simplicity of deployment and long-term management of storage clusters, and in focusing on building an ecosystem of libraries and tools around that - all of this learned, albeit painfully, from managing distributed filesystems in the past.

I hope this explains properly why we took this approach @CyberCyclone - thanks for opening the issue - feel free to join our https://slack.min.io - we will be more than happy to answer any other questions that you may have.


@CyberCyclone CyberCyclone commented Aug 4, 2019

Thanks for the explanation @harshavardhana. I'd like to keep the conversation here so that anyone else in the same situation as me can hopefully understand it from this thread.

From what I understand of the explanation, if we're talking about large data sets (close to a PiB), then adding nodes to the cluster is simply not practical? E.g., it has to reconfigure the existing data? I guess at that size that could take days or weeks.

I guess that's where my misunderstanding may have come from as the docs explain that data is recovered at the object level, not the drive level. So my guess was that when adding nodes, it simply used those new nodes for new objects. I'm pretty sure that's wrong as it's based on N/2, so the N would be re-configuring the data to accommodate the new nodes?

My use case is a lot smaller. I.e., starting off with 4 Odroid HC1 units with a 500GB drive each, and when that's full I was wanting to add another 4 to the cluster. So I'm never going to hit petabytes.

So based on that, creating new clusters with their own URL (cluster1.domain.com, cluster2.domain.com) is the only practical way to scale?


@harshavardhana harshavardhana commented Aug 4, 2019

I guess that's where my misunderstanding may have come from as the docs explain that data is recovered at the object level, not the drive level. So my guess was that when adding nodes, it simply used those new nodes for new objects. I'm pretty sure that's wrong as it's based on N/2, so the N would be re-configuring the data to accommodate the new nodes?

It isn't that simple. What you are suggesting is a hash-less model, which has a lookup(N) problem: being software, we have no way of knowing where an object exists in the namespace, so figuring out where a file really lives means looking in up to N places. In GlusterFS we used elastic hashing, where we had to rebalance the hashes when disks were added, but that had other complications: we then had to rebalance the objects to get the hashing right, and without doing so each object incurred a performance penalty from a double lookup, because there was a new hash location on top of the existing older hash location.

We rely on consistent hashing to keep the namespace lookup requirements simple, with the fundamental principle that we need to keep complexity out of our system and instead turn it into a deployment problem that we solve differently, for example through federation.
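
To illustrate the lookup problem (a toy sketch only, not MinIO's actual placement code): with deterministic hash-based placement, where an object lives is a pure function of its name and the fixed cluster layout, so changing the number of erasure sets changes where existing objects map unless you rebalance them or pay for extra lookups:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// placement returns which erasure set a key maps to for a fixed set count.
func placement(key string, sets int) int {
	return int(crc32.ChecksumIEEE([]byte(key)) % uint32(sets))
}

func main() {
	keys := []string{"photos/2019/a.jpg", "photos/2019/b.jpg", "backups/db.tar"}
	for _, k := range keys {
		// The same key may map to a different set once the set count changes,
		// which is exactly the rebalancing problem we chose to avoid.
		fmt.Printf("%-20s 4 sets -> %d   8 sets -> %d\n",
			k, placement(k, 4), placement(k, 8))
	}
}
```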

My use case is a lot smaller. I.e., starting off with 4 Odroid HC1 units with a 500GB drive each, and when that's full I was wanting to add another 4 to the cluster. So I'm never going to hit petabytes.

There is no problem using MinIO in this situation @CyberCyclone

So based on that, creating new clusters with their own URL (cluster1.domain.com, cluster2.domain.com) is the only practical way to scale?

Yes, that is the practical way, and it is less error-prone in practice. Your data is never moved from one cluster to another. To help with cases where you want to shuffle data between two different buckets on two different clusters, we are planning to provide an mc admin command that helps you shuffle buckets, purely as an opt-in, admin-driven behavior.


@harshavardhana harshavardhana commented Aug 9, 2019

Closing this issue as answered. Let us know if you have further questions @CyberCyclone


@abessifi abessifi commented Aug 12, 2019

Thanks @harshavardhana for these insights!

MinIO's fundamentals come from our early days at GlusterFS (most of us are from that early team), where we had to build a scalable NAS. We built it successfully, but through that arduous experience we realized that at scale, around 1PiB or more, adding disks and rebalancing made it hard to run clusters safely and led to instabilities in the system.

How does this compare to other S3 implementations like Swift and Ceph Object GW, where horizontal scalability is still a core feature, from what I understand?

What you essentially do in production is deploy in scalable units of 16 or 32 servers that define your deployment unit. Once you have this setup, all future setups/clusters can be federated together at the bucket level ad infinitum.

Is bucket lookup using federation (etcd + CoreDNS) production ready?


@harshavardhana harshavardhana commented Aug 12, 2019

How does this compare to other S3 implementations like Swift and Ceph Object GW, where horizontal scalability is still a core feature, from what I understand?

There is really no comparison; both were built for a different time and different needs. MinIO is horizontally scalable too, just not in the traditional sense: it scales through federation. Our reason for breaking away from the traditional model was to avoid the complexity and keep MinIO deployments predictable.

Is bucket lookup using federation (etcd + CoreDNS) production ready?

Of course, it has been for a couple of years already, with multiple deployments in production.
