I've read this in the docs and highlighted the main points:
Initially I read this to mean you can adjust the nodes and disks at any point, but after trying and failing, I've seen multiple people report the same issue and discover it's not possible. After reading the issues posted here, I'm still not understanding the logic behind it.
In issue #4712, someone pointed out that it's a bad design choice for AWS S3 to stick with 1 bucket. To me, the S3 docs seem to encourage limiting the number of buckets created, especially since there's no upper limit on object creation and no performance loss, but there is a limit on bucket creation (a 1,000 hard limit).
We know AWS doesn't enforce a restriction of "once a bucket has been created, you're stuck with whatever storage space you started with". So why does Minio stick with this logic? It seems like a step backwards to me, or a discouragement from using Minio as a true production alternative to S3.
I can understand that from a technical point of view it's easier to just lock it and pass it to the developer to handle, but it's now taken the concept of the SIMPLE Storage Service and added an extra layer of vulnerability (e.g., ANOTHER service that has to be managed and maintained on top of Minio just to behave the way AWS S3 does). ElasticSearch and RethinkDB can effortlessly scale / shard their data, and RethinkDB has a proxy built in to manage large scale.
So the logic to "know" where an object lives and in what bucket now has to live with the application and be stored within the application. While this isn't too much "trouble", it once again takes the idea of the "Simple" Storage Service and forces users to add logic to their application that they didn't need with S3. I.e., this isn't a simple replacement for AWS S3.
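That application-side bookkeeping might look something like the sketch below (the cluster names, endpoints, and bucket names are all invented for illustration; this is not part of any MinIO API):

```python
# Hypothetical sketch of the application-side routing described above:
# the app has to remember which federated cluster owns each bucket.
# All names and endpoints here are made up for illustration.
CLUSTER_FOR_BUCKET = {
    "user-uploads": "https://cluster1.domain.com",
    "thumbnails": "https://cluster2.domain.com",
}

def endpoint_for(bucket: str) -> str:
    """Return the cluster endpoint that owns this bucket."""
    try:
        return CLUSTER_FOR_BUCKET[bucket]
    except KeyError:
        raise ValueError(f"no cluster registered for bucket {bucket!r}")

print(endpoint_for("thumbnails"))  # https://cluster2.domain.com
```

With a single S3 endpoint this table wouldn't exist at all, which is the extra complexity being objected to.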
I'm also dreading what's going to happen when I want to decommission a server / cluster, as it's going to require updating the application database to point to the new domain / bucket, plus writing logic into the app to stop it from trying to use the old cluster.
While this is all doable, it takes the fun out of making applications when you have to maintain the infrastructure, which is the whole reason I'm using RethinkDB instead of MongoDB (a real pain to manage).
I've had a bit more time to think about it, and I'm still puzzled as to why it's designed not to be scalable. I don't see how starting off with 4 nodes and increasing to 32 nodes over time differs from starting out directly with 32 nodes.
From a technical point of view I know it's not as simple as that, but I still don't understand why it's not a core feature.
MinIO's fundamentals come from our early days at GlusterFS (most of us are from that early team), where we had to build a scalable NAS. We built it successfully, but through that arduous experience we realized that at scale - around 1PiB or more - rebalancing a running cluster as soon as disks were added led to instability in the system. The instability was a direct function of the complexity, and through this experience we realized that complexity breeds bugs.
Almost all the systems built after GlusterFS have been prone to similar complexities in deployment and management at scale. This led to the conclusion that the primary problem of all storage systems is complexity. So we chose to avoid the rebalancing problem and instead convert it into a deployment problem.
At the scale MinIO is targeting for today's needs, rebalancing would become a disproportionately complex problem and wouldn't scale to the amount of data we are talking about. By assuming practical deployment scenarios, MinIO came up with the following model.
What you essentially do in production is deploy in scalable units of 16 or 32 servers that define your deployment unit. Once you have this setup, all future setups / clusters can be federated together at the bucket level ad infinitum. With today's high-density hardware we can deploy servers that have 106 drives of 8TiB each, which by default comes to a usable capacity of up to 13PiB in a single cluster.
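As a rough sanity check on the capacity figure above (a sketch assuming a 16-server unit; this computes only the raw drive total, and actual usable capacity depends on the erasure-code parity configured):

```python
# Raw capacity arithmetic for the deployment unit described above.
# Assumption: a 16-server unit. Usable capacity depends on erasure-code
# parity, so this is only the raw drive total.
SERVERS = 16
DRIVES_PER_SERVER = 106
DRIVE_TIB = 8

raw_tib = SERVERS * DRIVES_PER_SERVER * DRIVE_TIB
raw_pib = raw_tib / 1024
print(f"{raw_tib} TiB ~= {raw_pib:.2f} PiB")  # 13568 TiB ~= 13.25 PiB
```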
MinIO doesn't restrict the number of drives, so you can in fact give MinIO as many drives as you want.
MinIO believes in simplicity of deployment and long-term management of storage clusters, and focuses on building an ecosystem of libraries and tools - all of this learned, albeit painfully, from managing distributed filesystems in the past.
I hope this explains properly why we took this approach @CyberCyclone - thanks for opening the issue. Feel free to join our https://slack.min.io - we will be more than happy to answer any other questions you may have.
Thanks for the explanation @harshavardhana. I'd like to keep the conversation in here so that if someone else has the same situation as me hopefully they can understand it from here.
From what I'm understanding of the explanation, if we're talking about large data sets (close to the PiB range), then adding nodes to the cluster is simply not practical? E.g., it has to reconfigure the existing data? I guess at that size it could take days / weeks.
I guess that's where my misunderstanding may have come from, as the docs explain that data is recovered at the object level, not the drive level. So my guess was that when adding nodes, it would simply use those new nodes for new objects. I'm pretty sure that's wrong, as it's based on N/2, so the N would require reconfiguring the data to accommodate the new nodes?
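For context on the N/2 mentioned here, a toy sketch (the numbers are illustrative assumptions, not a statement of MinIO's internals): with MinIO-style erasure coding, an erasure set of N drives is split into N/2 data and N/2 parity shards by default, and growing N changes that layout for data already written.

```python
# Illustrative N/2 erasure-coding arithmetic (assumed defaults, not
# MinIO's actual code): N drives split evenly into data and parity,
# so an object survives the loss of up to N/2 drives.
N = 16
data_shards = N // 2
parity_shards = N // 2
max_drive_failures_readable = parity_shards
print(data_shards, parity_shards, max_drive_failures_readable)  # 8 8 8
```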
My use case is a lot smaller. I.e., starting off with 4 Odroid HC1s with a 500GB drive each, and when that's full, I was wanting to add another 4 to the cluster. So I'm never going to hit petabytes.
So based on that, creating new clusters with their own URL (cluster1.domain.com, cluster2.domain.com) is the only practical way to scale?
It isn't that simple. What you are suggesting is a hashing-less model, which has a lookup(N) problem: when you need to figure out where a file really exists, the software has no way of knowing where the object lives in the namespace, so it becomes lookup(N). In GlusterFS we used elastic hashing, where we had to rebalance the hashes when disks were added, but this had other complications: we had to rebalance the objects to get the hashing right, and without that each object incurred a performance penalty from a double lookup, with a new hash location pointing to an existing older hash location.
We rely on consistent hashing to keep the namespace lookup requirements simple, with the fundamental principle that we need to keep complexity out of the system and instead move it into a deployment problem that we solve differently, such as with federation.
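A toy sketch of that principle (not MinIO's actual code; the set count and hash function are assumptions): with a fixed-size cluster, the object name deterministically hashes to one location, so lookup is a single hash computation instead of lookup(N).

```python
# Toy sketch (assumed details, not MinIO's implementation) of why a
# fixed deployment unit keeps lookups O(1): the object name hashes
# deterministically to one erasure set, so no per-object location
# table and no search across N locations is needed.
import hashlib

NUM_ERASURE_SETS = 16  # fixed at deployment time (assumption)

def erasure_set_for(object_name: str) -> int:
    """Map an object name to an erasure set with a stable hash."""
    digest = hashlib.sha256(object_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_ERASURE_SETS

# The same name always maps to the same set - a single hash, no lookup(N).
assert erasure_set_for("photos/cat.jpg") == erasure_set_for("photos/cat.jpg")
print(erasure_set_for("photos/cat.jpg"))
```

Note that if `NUM_ERASURE_SETS` changed (i.e., nodes were added), every existing mapping would move - which is exactly the rebalancing problem the fixed deployment unit avoids.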
There is no problem using it in this situation @CyberCyclone
Yes, that is the practical way and less error-prone in practice. Your data is never moved from one cluster to another. Now, to help facilitate cases where, say, you want to shuffle data between two different buckets on two different clusters - we are planning to provide
Thanks @harshavardhana for these insights!
What about comparing to other S3 implementations like
Is bucket lookup using federation (etcd + coredns) production-ready?
There is really no comparison; both were built for a different time and different needs. MinIO is horizontally scalable too, but not in the traditional sense - rather, through federation. Our idea in breaking away from the traditional model was to avoid complexity and keep MinIO deployments predictable.
Of course - it has been for a couple of years already, with multiple deployments in production.