This repository has been archived by the owner on May 11, 2024. It is now read-only.

Multiple KVC resources copying data to the same node #38

Closed
dmsuehir opened this issue May 1, 2018 · 9 comments
Labels
urgency/high high urgency - needs to address now

Comments

@dmsuehir
Contributor

dmsuehir commented May 1, 2018

Yesterday, we used KVC to copy data from GCS (via the S3 interface) to 6 replicas in the cluster. We saw the pods start running, and then noticed that 3 of them were running on the same node. It doesn't make sense for KVC to copy the same data 3 times to the same node, and this caused the node to run out of disk space.

@karthikvadla

karthikvadla commented May 1, 2018

NAME                                                READY     STATUS    RESTARTS   AGE       IP             NODE
kvc-845d468d84-gd4m5                                1/1       Running   0          25m       10.72.13.156   gke-dls-us-n1-standard-4-1ba3a893-gvr1
kvc-resource-8360b237-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.227    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-836203d3-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.228    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-83630e86-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.21.145   gke-dls-us-n1-highmem-8-skylake-82af83b4-w7jt
kvc-resource-83642634-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.19.150   gke-dls-us-n1-highmem-8-skylake-82af83b4-8nvh
kvc-resource-8365b972-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.0.229    gke-dls-us-n1-highmem-8-skylake-82af83b4-7gc4
kvc-resource-83675c8b-4cce-11e8-9e1e-0a580a480d9c   1/1       Running   0          1m        10.72.16.149   gke-dls-us-n1-standard-4-1ba3a893-wvlw

@dmsuehir dmsuehir changed the title Multiple resources copying data to the same node Multiple KVC resources copying data to the same node May 1, 2018
@ashahba
Member

ashahba commented May 1, 2018

We probably need a check: if replicas is greater than the number of hosts matched by nodeAffinity, reduce the replica count to match the number of hosts.
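A minimal sketch of the check described above, assuming a hypothetical helper name (`effective_replicas`) rather than KVC's actual API: clamp the requested replica count to the number of eligible nodes and warn the user instead of doubling up pods on a node.

```python
# Illustrative sketch only; function and variable names are assumptions,
# not part of the KVC codebase.

def effective_replicas(requested: int, eligible_nodes: list) -> int:
    """Return the replica count to actually schedule, capped at the
    number of nodes that satisfy the nodeAffinity criteria."""
    if requested > len(eligible_nodes):
        print(f"warning: only {len(eligible_nodes)} eligible nodes; "
              f"reducing replicas from {requested} to {len(eligible_nodes)}")
        return len(eligible_nodes)
    return requested

print(effective_replicas(6, ["node-a", "node-b", "node-c"]))  # 3
```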

@balajismaniam
Contributor

@ashahba I don't understand your suggestion here. Can you explain?

@ashahba
Member

ashahba commented May 1, 2018

@balajismaniam I think if a user asked for 6 replicas, but while populating nodeAffinity during scheduling we only found 3 nodes that meet the criteria (this could happen for several reasons, for example the remaining nodes not having enough disk space left), then we should not deploy pods multiple times on the same node just to meet the requested replica count.

We need to decide what action we take in that scenario before implementing a solution.
My suggestion is: print an error message notifying users that they can only ask for at most (in this example) 3 replicas.

@balajismaniam
Contributor

@ashahba That is not a good idea. There is a fix for this in #37; there was an error with how pod anti-affinity was set up.
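For context on the kind of fix #37 describes, here is a minimal sketch (as a plain Python dict mirroring the Kubernetes pod spec) of the pod anti-affinity that keeps two pods of the same resource off one node. The label key and value are illustrative assumptions, not the actual labels KVC uses.

```python
# Hypothetical example of Kubernetes pod anti-affinity; the label
# "kvc.example/resource" is made up for illustration.
anti_affinity = {
    "podAntiAffinity": {
        # "required" (not "preferred") makes the scheduler refuse to
        # co-locate pods with matching labels on the same hostname.
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {
                    "matchLabels": {"kvc.example/resource": "my-resource"}
                },
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
}

print(anti_affinity["podAntiAffinity"]
      ["requiredDuringSchedulingIgnoredDuringExecution"][0]["topologyKey"])
```

With `topologyKey: kubernetes.io/hostname`, "same topology domain" means "same node", which is exactly the co-location this issue reports.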

@Ajay191191
Contributor

Ajay191191 commented May 1, 2018

In a way, we already do what you said @ashahba here, but in this case that was not the cause.

@ashahba
Member

ashahba commented May 1, 2018

Thanks @balajismaniam and @Ajay191191 .

One last question:
What if /mnt/stateful_partition is 100% full on half of the nodes and the user asks for exactly as many replicas as len(nodeList)? Do we end up scheduling some pods on the same node, which in turn copy the same data to that node twice?

@Ajay191191
Contributor

Ajay191191 commented May 1, 2018

At the moment, disk pressure is not taken into consideration when scheduling the pods. That's one of the things we'd like to include, and it should be possible once we implement the reconciler. Right now, the pods are scheduled on nodes even under high disk pressure, which can result in pod and CR failures.
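A sketch of what such a reconciler-time check could look like: filter out nodes whose `DiskPressure` condition is `True` before choosing placement targets. The node data is shaped like the Kubernetes Node status API; the function name is an assumption for illustration.

```python
# Illustrative only: skip nodes reporting DiskPressure, using the
# standard Kubernetes node condition shape.

def schedulable_nodes(nodes):
    """Return names of nodes whose DiskPressure condition is not True."""
    ok = []
    for node in nodes:
        conditions = {c["type"]: c["status"]
                      for c in node["status"]["conditions"]}
        if conditions.get("DiskPressure") != "True":
            ok.append(node["metadata"]["name"])
    return ok

nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "False"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "True"}]}},
]
print(schedulable_nodes(nodes))  # ['node-a']
```

This would not replace kubelet eviction behavior; it only avoids picking an already-pressured node as a copy target in the first place.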

@dzungductran dzungductran added the urgency/high high urgency - needs to address now label May 1, 2018
@dzungductran dzungductran added this to the v0.1.0 milestone May 1, 2018
@balajismaniam
Contributor

Fixed by #37.
