This repository has been archived by the owner on May 11, 2024. It is now read-only.

Multiple KVC resources copying data to the same node #38

Closed
dmsuehir opened this issue May 1, 2018 · 9 comments
Labels
urgency/high high urgency - needs to address now

Comments

@dmsuehir
Contributor

dmsuehir commented May 1, 2018

Yesterday, we used KVC to copy data from GCS (via the S3 interface) to 6 replicas in the cluster. We saw the pods start running, and then noticed that 3 of them were running on the same node. It doesn't make sense for KVC to copy the same data 3 times to the same node, and this caused the node to run out of disk space.

@karthikvadla

karthikvadla commented May 1, 2018

NAME                                                READY     STATUS    RESTARTS   AGE       IP             NODE
kvc-845d468d84-gd4m5                                1/1       Running   0          25m       10.72.13.156   gke-dls-us-n1-standard-4-1ba3a893-gvr1
kvc-resource-8360b237-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.227    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-836203d3-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.228    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-83630e86-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.21.145   gke-dls-us-n1-highmem-8-skylake-82af83b4-w7jt
kvc-resource-83642634-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.19.150   gke-dls-us-n1-highmem-8-skylake-82af83b4-8nvh
kvc-resource-8365b972-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.229    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-83675c8b-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.16.149   gke-dls-us-n1-standard-4-1ba3a893-wvlw

@dmsuehir dmsuehir changed the title Multiple resources copying data to the same node Multiple KVC resources copying data to the same node May 1, 2018
@ashahba
Member

ashahba commented May 1, 2018

We probably need a check: if replicas is greater than the number of hosts matched by nodeAffinity, reduce the replica count to match the number of hosts.
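A minimal sketch of the check described above, assuming a hypothetical helper name (`effective_replicas`) rather than KVC's actual API: clamp the requested replica count to the number of eligible nodes and warn the user instead of doubling up pods on a node.

```python
# Illustrative sketch only; function and variable names are assumptions,
# not part of the KVC codebase.

def effective_replicas(requested: int, eligible_nodes: list) -> int:
    """Return the replica count to actually schedule, capped at the
    number of nodes that satisfy the nodeAffinity criteria."""
    if requested > len(eligible_nodes):
        print(f"warning: only {len(eligible_nodes)} eligible nodes; "
              f"reducing replicas from {requested} to {len(eligible_nodes)}")
        return len(eligible_nodes)
    return requested

print(effective_replicas(6, ["node-a", "node-b", "node-c"]))  # 3
```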

@balajismaniam
Contributor

@ashahba I don't understand your suggestion here. Can you explain?

@ashahba
Member

ashahba commented May 1, 2018

@balajismaniam I think if a user asked for 6 replicas, but while populating nodeAffinity during scheduling we only found 3 nodes that meet the criteria (this could happen for several reasons, for example the remaining nodes not having enough disk space left), then we should not deploy pods multiple times on the same node just to meet the requested replica count.

We need to decide what action we take in that scenario before implementing a solution.
My suggestion is: print an error message notifying users that they can only ask for at most (in this example) 3 replicas.

@balajismaniam
Contributor

@ashahba That is not a good idea. There is a fix for this in #37; there was an error with how pod anti-affinity was set up.
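For context on the kind of fix #37 describes, here is a minimal sketch (as a plain Python dict mirroring the Kubernetes pod spec) of the pod anti-affinity that keeps two pods of the same resource off one node. The label key and value are illustrative assumptions, not the actual labels KVC uses.

```python
# Hypothetical example of Kubernetes pod anti-affinity; the label
# "kvc.example/resource" is made up for illustration.
anti_affinity = {
    "podAntiAffinity": {
        # "required" (not "preferred") makes the scheduler refuse to
        # co-locate pods with matching labels on the same hostname.
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {
                    "matchLabels": {"kvc.example/resource": "my-resource"}
                },
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
}

print(anti_affinity["podAntiAffinity"]
      ["requiredDuringSchedulingIgnoredDuringExecution"][0]["topologyKey"])
```

With `topologyKey: kubernetes.io/hostname`, "same topology domain" means "same node", which is exactly the co-location this issue reports.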

@Ajay191191
Contributor

Ajay191191 commented May 1, 2018

In a way, we already do what you said @ashahba here, but in this case that was not the cause.

@ashahba
Member

ashahba commented May 1, 2018

Thanks @balajismaniam and @Ajay191191 .

One last question:
What if /mnt/stateful_partition is 100% full on half of the nodes and the user asks for exactly as many replicas as len(nodeList)? Do we end up scheduling some pods on the same node, which in turn copy the same data to that node twice?

@Ajay191191
Contributor

Ajay191191 commented May 1, 2018

At the moment, disk pressure is not taken into consideration when scheduling the pods. That's one of the things we'd like to include, and it should be possible once we implement the reconciler. Right now, the pods are scheduled on nodes even under high disk pressure, which can result in pod and CR failures.
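A sketch of what such a reconciler-time check could look like: filter out nodes whose `DiskPressure` condition is `True` before choosing placement targets. The node data is shaped like the Kubernetes Node status API; the function name is an assumption for illustration.

```python
# Illustrative only: skip nodes reporting DiskPressure, using the
# standard Kubernetes node condition shape.

def schedulable_nodes(nodes):
    """Return names of nodes whose DiskPressure condition is not True."""
    ok = []
    for node in nodes:
        conditions = {c["type"]: c["status"]
                      for c in node["status"]["conditions"]}
        if conditions.get("DiskPressure") != "True":
            ok.append(node["metadata"]["name"])
    return ok

nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "False"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "True"}]}},
]
print(schedulable_nodes(nodes))  # ['node-a']
```

This would not replace kubelet eviction behavior; it only avoids picking an already-pressured node as a copy target in the first place.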

@dzungductran dzungductran added the urgency/high high urgency - needs to address now label May 1, 2018
@dzungductran dzungductran added this to the v0.1.0 milestone May 1, 2018
@balajismaniam
Contributor

Fixed by #37.
