Vertical scaling of TiKV and PD #191
Creating a new stateful set seems a promising approach if users don't care about data migration during vertical scaling and only want to reduce cost. I think we should keep this low priority until users really want this feature.
We are putting new users in a very difficult situation because there is no way they can know what instance size they need. Can we document the recommended approach right now? Would it be to go offline and do a backup + restore?
Our official documentation already has the recommendations: https://github.com/pingcap/docs/blob/master/op-guide/recommendation.md I'll add this link to the user guide documentation. With a new stateful set, the old pods have to go offline and their data has to be migrated to the new stateful set's pods, because the PVCs and PVs are fresh ones; this approach keeps the TiDB service online. The backup + restore approach, by contrast, creates a new cluster, and the TiDB service has to be switched over manually, which involves a short outage.
That states the minimum recommendations for large data sets. Users may discover that they need more resources and then would like to scale up vertically. Additionally, many users can reduce cost by using fewer resources when they are starting off.
In theory, if we use cloud disks (e.g. Persistent Disk on GCP) we should be able to scale vertically with relative ease.
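As a rough sketch of what that could look like: with network-attached disks, vertical scaling can amount to raising the TiKV resource figures and letting the pods restart and reattach their volumes. The field layout below is a hypothetical Helm-values shape, not necessarily the chart's actual schema, and the numbers are illustrative:

```yaml
# Hypothetical values fragment: bump TiKV resources, then roll the pods.
# Data survives the restart because the PVs are network-attached.
tikv:
  resources:
    requests:
      cpu: "8"
      memory: 16Gi
    limits:
      cpu: "16"
      memory: 32Gi
```

This only works when the new size still fits on the existing nodes; otherwise the pods need to move to a bigger node pool, which is the migration case discussed below.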
Closed via pingcap/docs#1468. Vertical scaling of TiKV pods that exceeds the resource capacity of the current nodes is treated as a migration: https://pingcap.com/docs/stable/tidb-in-kubernetes/maintain/kubernetes-node/
Scaling horizontally is not always a substitute for scaling vertically.
For example, local SSD storage on GKE is limited to 1.5 TB. If you are near that limit with TiKV, you can scale out to get more CPU/memory (even if the cost structure is not optimal). However, if you want to reduce CPU/memory usage, the only way is to scale down vertically.
A more common workflow occurs when just starting out: it is ideal to keep your instances as small as possible while your workload is still small. However, you may eventually run into performance issues if a single machine doesn't have enough RAM.
Another related workflow is simply changing your instance type because the current one has insufficient network capacity, for example.
In general, there is an optimal cost structure for a particular workload that is satisfied by a particular instance size.
I think of this problem in terms of TiKV, but it is equally applicable to PD. With PD, I assume one never actually wants to scale horizontally as data increases.
I think the ideal scaling workflow would be: add a new node pool with instances of the desired size, deploy new TiKV processes to them (perhaps as a new stateful set), wait for the new set to catch up, evict leaders from the old set, and then remove it. As I understand it, the big challenge is avoiding overloading the cluster during these operations.
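The leader-eviction and removal steps above might be driven with pd-ctl. In this sketch the PD endpoint and the old store IDs are placeholders, and it assumes the new stateful set is already deployed and caught up:

```shell
# PD endpoint and store IDs below are illustrative placeholders.
PD="http://pd-0.pd:2379"

# Throttle scheduling first to limit migration load on the cluster
# (limit values are illustrative).
pd-ctl -u "$PD" config set leader-schedule-limit 4
pd-ctl -u "$PD" config set region-schedule-limit 8

# Evict Raft leaders from each old TiKV store.
for store in 1 2 3; do
  pd-ctl -u "$PD" scheduler add evict-leader-scheduler "$store"
done

# After each old store's leader_count drops to 0, mark it for removal
# so PD migrates its regions to the new stores.
for store in 1 2 3; do
  pd-ctl -u "$PD" store delete "$store"
done
```

`store delete` is graceful: the store goes offline only after its regions have been rescheduled, which is what keeps the service online during the swap.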