Balanced resource allocation priority to include volume count on nodes. #60525

Merged

2 commits merged into kubernetes:master from ravisantoshgudimetla:scheduler-pvc, Mar 31, 2018

Conversation

ravisantoshgudimetla (Contributor) commented Feb 27, 2018

Scheduler balanced resource allocation priority to include volume count on nodes.

/cc @aveshagarwal @abhgupta

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #58232

Release note:

Balanced resource allocation priority in scheduler to include volume count on node 
ravisantoshgudimetla (Contributor) commented Mar 1, 2018

/sig scheduling

variance := float64(((cpuFraction - mean) * (cpuFraction - mean)) + ((memoryFraction - mean) * (memoryFraction - mean)) + ((volumeFraction - mean) * (volumeFraction - mean)))
// If variance is higher, inverse would be smaller. So inverting it.
// TODO: There is a chance of overflow. Ensure that it is less than MaxPriority.
return int64((float64(1) / variance))

wgliang (Member) Mar 2, 2018

Variance's value may be zero, which could cause a crash; you'd better verify it.

ravisantoshgudimetla (Contributor) Mar 2, 2018

The only way for that to happen is if the requested memory, CPU, etc. are 0 or negative, and in either case we would never reach any of the priority functions. But I understand your point; I will add a check.
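
For reference, a minimal sketch of the kind of guard being discussed (hedged: the scorer merged later in this thread switches to 1 - variance and avoids the division entirely):

// Guard against a zero variance before inverting it (sketch only).
if variance == 0 {
	return int64(schedulerapi.MaxPriority)
}
return int64(float64(1) / variance)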

ravisantoshgudimetla (Contributor) commented Mar 2, 2018

k8s-ci-robot requested review from bsalamat and k82cn Mar 2, 2018

ravisantoshgudimetla changed the title from [WIP] Balanced resource allocation priority to include volume count on nodes. to Balanced resource allocation priority to include volume count on nodes. Mar 3, 2018

ravisantoshgudimetla (Contributor) commented Mar 4, 2018

/retest

// alpha: v1.11
//
// Include volumes on node to be considered for balanced resource allocation.
VolumesOnNodeForBalancing utilfeature.Feature = "VolumesOnNodeForBalancing"

bsalamat (Contributor) Mar 5, 2018

nit: I would rename to BalanceAttachedNodeVolumes and would update the comment to say that this impacts scheduling decisions and scheduler tries to place a Pod that requires an attached volume on a Node with fewer volumes.
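
A sketch of the gate after the suggested rename (the name BalanceAttachedNodeVolumes is the one that appears in later excerpts in this thread; the comment wording here is illustrative):

// alpha: v1.11
//
// Include volumes attached to a node when computing balanced resource
// allocation, so that the scheduler prefers placing a pod that requires an
// attached volume on a node with fewer attached volumes.
BalanceAttachedNodeVolumes utilfeature.Feature = "BalanceAttachedNodeVolumes"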

// 0-10 with 0 representing well balanced allocation and 10 poorly balanced. Subtracting it from
// 10 leads to the score which also scales from 0 to 10 while 10 representing well balanced.
diff := math.Abs(cpuFraction - memoryFraction)
return int64((1 - diff) * float64(schedulerapi.MaxPriority))

bsalamat (Contributor) Mar 5, 2018

Please remove the blank line.
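
For context on the excerpt above, a quick worked example: with cpuFraction = 0.5 and memoryFraction = 0.7, diff = |0.5 - 0.7| = 0.2, so the score is (1 - 0.2) * MaxPriority = 8 when MaxPriority is 10; identical fractions give the maximum score of 10.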

func getExistingVolumeCountForNode(pods []*v1.Pod, maxVolumes int) int {
volumeCount := 0
for _, pod := range pods {
volumeCount += len(pod.Spec.Volumes)

bsalamat (Contributor) Mar 5, 2018

Can't we have shared volumes?

cc/ @msau42

msau42 (Member) Mar 5, 2018

Do you care about counting bind mounts? There is always a bind-mount (done by the container runtime) for all the volumes referenced in the container spec.

At the pod level, you can share between pods, but the number of mounts depends on the volume type. Sharing between pods on the same node is not as common. Typically, scheduler tries to spread pods across nodes.

  • In the case of NFS, each pod has its own NFS mount.
  • For something like GCE PD (sharing should be a rare use case), you will have only one global node mount, but a bind mount per pod.
  • Volume types like EmptyDir, ConfigMap, Secrets, and Downward API are not mounted.
  • HostPath volumes may or may not be mounted; we can't tell from the spec.

ravisantoshgudimetla (Contributor) Mar 5, 2018

This is a function used only in testing; that's why I did not bother checking that. If you think it's important, I can add a check.

bsalamat (Contributor) Mar 6, 2018

Thanks, Ravi, for pointing that out. I missed the fact that it is a test-only function. We don't need to change it.

@@ -451,7 +451,10 @@ func (c *MaxPDVolumeCountChecker) predicate(pod *v1.Pod, meta algorithm.Predicat
// violates MaxEBSVolumeCount or MaxGCEPDVolumeCount
return false, []algorithm.PredicateFailureReason{ErrMaxVolumeCountExceeded}, nil
}
// Computing them here for lack of better way to share things between predicates and priorities.

bsalamat (Contributor) Mar 5, 2018

NodeInfo is a part of the scheduler cache, but we update the information in this predicate. This is not a good idea: if the predicate is not executed, the information in the cache goes stale, which could lead to subtle bugs. Why don't we update the values in NodeInfo.AddPod() and .RemovePod()?

ravisantoshgudimetla (Contributor) Mar 5, 2018

Actually, we are not caching anything now. For example, say we chose node n1 during pod p1's scheduling. When we schedule the next pod p2, we recompute the attached volumes and the max number of volumes again. As I mentioned in the comment, for lack of a proper data structure through which we can share information across predicates and priorities, I am using NodeInfo. I don't claim this is clear or better; I am open to suggestions.

bsalamat (Contributor) Mar 6, 2018

Actually, we are not caching anything now.

I understand you don't need to cache anything and this is just a way of passing information around, but the nodeInfo pointer is part of the scheduler's cache. When you update the fields of the struct, you are effectively keeping the information in the scheduler's cache. The existence of such information in the scheduler cache would be confusing for someone who reads the code later.
Ideally, we should keep this in some temporary state that is valid for a single schedule cycle and is destroyed at the end of the cycle, but we don't have such a structure and would need to add one if needed.

ravisantoshgudimetla (Contributor) Mar 6, 2018

Ok let me think of some alternate ways to do this and get back to you. Thanks again for the review.

k82cn (Member) Mar 6, 2018

Agree with Bobby, prefer not to update NodeInfo outside of cache.

}
// Compute variance for all the three fractions.
mean := (cpuFraction + memoryFraction + volumeFraction) / float64(3)
variance := float64(((cpuFraction - mean) * (cpuFraction - mean)) + ((memoryFraction - mean) * (memoryFraction - mean)) + ((volumeFraction - mean) * (volumeFraction - mean)))

bsalamat (Contributor) Mar 5, 2018

variance could be larger than 1, causing 1 - variance (below) to be negative. I guess you have forgotten to divide by 3.

ravisantoshgudimetla (Contributor) Mar 5, 2018

I guess you have forgotten to divide by 3.

🤣 Lol, thanks for catching it :)

variance could be larger than 1, causing 1 - variance (below) to be negative.

I thought about that, but it seems our sample set always consists of fractions, and

  • fraction - mean (where the mean is also a fraction) is again a fraction;
  • the square of (fraction - mean) is even smaller than the original fraction.

Not sure if I missed any edge case.

bsalamat (Contributor) Mar 6, 2018

You are right, but given that you have not divided by 3, it could be larger than 1. For example, if each component is 0.9, then variance will be 0.81 + 0.81 + 0.81. If you divide by 3, variance will be less than 1.

ravisantoshgudimetla (Contributor) Mar 6, 2018

So you were commenting on my version of variance (without the division by 3); I thought you were commenting on variance in general, which can indeed be greater than 1 as well. Sure, I will make that change.

bsalamat (Contributor) Mar 6, 2018

Yes, I was referring to this implementation.

}
// Compute variance for all the three fractions.
mean := (cpuFraction + memoryFraction + volumeFraction) / float64(3)
variance := float64(((cpuFraction - mean) * (cpuFraction - mean)) + ((memoryFraction - mean) * (memoryFraction - mean)) + ((volumeFraction - mean) * (volumeFraction - mean)))

k82cn (Member) Mar 6, 2018

Here's the definition of variance; I'm not sure why we use it here?

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

ravisantoshgudimetla (Contributor) Mar 6, 2018

So, we wanted to know how close the CPU utilization, memory utilization, and number of volumes on a node are to each other; the closer they are, the better balanced the node. I tried explaining the same here: #58232 (comment)

I could have used standard deviation as well, but I decided to stop at variance.

k82cn (Member) Mar 6, 2018

Great, got that.

k82cn (Member) Mar 6, 2018

Maybe standard deviation is better; it'll also address Bobby's comments above :)

bsalamat (Contributor) Mar 6, 2018

Just dividing by 3 addresses my comment, and since we only want to compare one node against others, standard deviation and variance work equally well here. Given that taking the square root is computationally more expensive, I would just use variance.

ravisantoshgudimetla (Contributor) Mar 6, 2018

Yes, computational cost was the reason that I stopped going towards standard deviation.

k82cn (Member) Mar 7, 2018

ok to me :)
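
To summarize the outcome of this thread, a sketch of the corrected computation (the variance is divided by the number of fractions so it stays within [0, 1], then converted to a score; names follow the excerpts above, but this is an illustration rather than the exact merged code):

// Compute the mean and population variance of the three fractions.
mean := (cpuFraction + memoryFraction + volumeFraction) / float64(3)
variance := ((cpuFraction-mean)*(cpuFraction-mean) +
	(memoryFraction-mean)*(memoryFraction-mean) +
	(volumeFraction-mean)*(volumeFraction-mean)) / float64(3)
// Each fraction is in [0, 1], so the variance is in [0, 1] as well. A lower
// variance means a better balanced node, so invert it into the score.
return int64((1 - variance) * float64(schedulerapi.MaxPriority))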

// Computing them here for lack of better way to share things between predicates and priorities.
// TODO: @ravig Remove this.
nodeInfo.AllocatableVolumes = c.maxVolumes - numExistingVolumes

msau42 (Member) Mar 6, 2018

This volume predicate only counts for 3 volume types, and only one of them can run in a single cluster.

  • GCE PD
  • AWS EBS
  • Azure disk

Did you also want to count mounts for other volume types not included here, like NFS, Gluster, Ceph?

Also, the limits here are more about how many disks can be attached to a single node, not how many could be mounted. So if you had multiple pods sharing the same volume on the same node, this predicate would only count 1 of them, even though there will be a mount for each pod. But multiple pods sharing the same volume on a single node is not the common use case.

msau42 (Member) Mar 6, 2018

In the future, it may be possible that you can have multiple volume types in the same cluster that each have their own max limit as well.

ravisantoshgudimetla (Contributor) Mar 6, 2018

Thanks for the information @msau42. I thought we were going in the direction of a ConfigMap holding the max number of volumes per node based on the cloud provider and CSI plugin, after which we could use that ConfigMap to get the information.

msau42 (Member) Mar 6, 2018

I think the main challenge is that this predicate is only meant to restrict max number of attached volumes, not overall number of mounts. If you want to spread number of attached volumes, then this predicate will work.

But it sounds like you want to spread out number of mounts, including for non-attachable plugins.

If that's the case, then you may need to consider a different method of counting mounts, such as just counting the total number of volumes in assigned pods. But that still won't be completely accurate because each volume type may create a different number of mounts, like I mentioned earlier.

ravisantoshgudimetla (Contributor) Mar 6, 2018

But it sounds like you want to spread out number of mounts, including for non-attachable plugins.

So, we wanted to find nodes that have a balanced number of volumes here; I'm not sure where this PR gave the impression of spreading mounts. While finding a node with balanced mounts would be even more useful, I think that would depend on a lower-level API talking to the cloud provider, or on listing /etc/mtab on Linux, etc., and is beyond the scope of what we want to do here.

ravisantoshgudimetla (Contributor) Mar 6, 2018

Maybe improving the wording in the comments here to explicitly say we're balancing attachable volume counts would make it more clear.

Apologies if the comments are causing confusion. I thought I made it clear at least from the start of the PR; if you think otherwise, I can edit the existing comments. Please let me know where the confusion is. Even if it is in code, I am happy to modify it.

msau42 (Member) Mar 6, 2018

Maybe we can be more explicit in some of the variable names. Like, instead of AllocableVolumes, something like AttachableVolumesLimit.

ravisantoshgudimetla (Contributor) Mar 6, 2018

I don't think AttachableVolumesLimit actually reflects the allocatable volumes on the node. My intention was to have a variable that holds the number of volumes that can still be allocated on the node; AttachableVolumesLimit may be better suited for c.maxVolumes. I don't have a strong opinion on it, though.

msau42 (Member) Mar 6, 2018

Oh sorry, yes, what about something like "CurrentAttachedVolumesCount"? I want to be explicit that we are talking about attachable counts here. There is another completely unrelated volume feature that uses the term "allocatable", so it would be better to align with the terminology that we use in the documentation and API fields.

ravisantoshgudimetla (Contributor) Mar 6, 2018

Sure. I can do that.

@@ -36,21 +37,36 @@ var (
BalancedResourceAllocationMap = balancedResourcePriority.PriorityMap
)
func balancedResourceScorer(requested, allocable *schedulercache.Resource) int64 {
func balancedResourceScorer(requested, allocable *schedulercache.Resource, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64 {

k82cn (Member) Mar 6, 2018

Is it possible to also merge volumes into Resource, so the interface doesn't change?

ravisantoshgudimetla (Contributor) Mar 6, 2018

I thought about it, but it falls mostly under scalar resources, and I think there is no definitive way on the kubelet side to introspect the node and get that information. It would be ideal to have that kind of information there, though. @bsalamat WDYT?

bsalamat (Contributor) Mar 6, 2018

is it possible to also merge volumes into Resource? so no interface changed.

That would require updating the scheduler cache with the volume information, which in turn would require kubelets to report volume counts, etc. as part of allocatable resources. That would be a larger project requiring involvement of the node team. I wouldn't go down this path if we can solve the problem without changes to the kubelet.

msau42 (Member) Mar 6, 2018

@gnufied is working on a design for dynamic node reporting of attachable volume limits per plugin. So I think when that is implemented, the limits will actually be in the Node object.

bsalamat (Contributor) Mar 6, 2018

I think when that is implemented, the limits will actually be in the Node object.

That will be great and will make our logic cleaner.

ravisantoshgudimetla (Contributor) Mar 6, 2018

So I think when that is implemented, the limits will actually be in the Node object.

Thanks @msau42, that will be awesome. Oh, btw, then we could actually cache this in NodeInfo :).

k82cn (Member) Mar 7, 2018

That may relate to our previous comments: don't update NodeInfo outside of the cache :)

I think when that is implemented, the limits will actually be in the Node object.

That will be great and will make our logic cleaner.

Great, @gnufied any related doc/issue?

k8s-ci-robot added size/XL and removed size/L labels Mar 8, 2018

ravisantoshgudimetla (Contributor) commented Mar 8, 2018

@bsalamat @k82cn I have attempted to create a transient data structure that lives only for the duration of a single scheduling cycle. Commit 7359fe2 has it.

Sample usage of the structure can be found at https://github.com/ravisantoshgudimetla/kubernetes/blob/c6b9f133fe381f91394df02b6cf0f4c41dd5d6fe/pkg/scheduler/algorithm/priorities/resource_allocation.go. Please let me know your thoughts on the approach.
cc @aveshagarwal

ravisantoshgudimetla (Contributor) commented Mar 16, 2018

ping @bsalamat @k82cn. I will fix the test cases, but before that I want your opinions on the approach taken and whether it makes sense. This can wait till 1.10 is out.

bsalamat (Contributor) commented Mar 16, 2018

@ravisantoshgudimetla
Two issues that I see with the current approach:

  1. It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler. Unfortunately, I can't think of a clean option at the moment. One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.
  2. Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.
ravisantoshgudimetla (Contributor) commented Mar 16, 2018

Thanks @bsalamat

It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler.

That's why the PR has grown to 39 files and counting :(.

One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.

I had started along similar lines, but putting this struct inside the NodeInfo struct would be cleaner. I will use this approach.

Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.

Good point. So far, most of the fields in the struct are updated in only one location, but I will make sure we have locking in place for the transient node info.

ravisantoshgudimetla (Contributor) commented Mar 22, 2018

@bsalamat I have added a transient structure with comments in the NodeInfo struct as discussed earlier. Please let me know if this is what you wanted.

// TransientInfo holds the information pertaining to a scheduling cycle. This will be destructed at the end of
// scheduling cycle.
// TODO: @ravig. Remove this once we have a clear approach for message passing across predicates and priorities.
TransientInfo *transientSchedulerInfo

bsalamat (Contributor) Mar 23, 2018

Yes, this is what I suggested. Thanks.

ravisantoshgudimetla (Contributor) Mar 23, 2018

Ok, Thanks. I have almost all tests fixed. PTAL.
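
As a reader aid, a sketch of roughly what the transient structure looks like, based on the names used in the excerpts in this thread (the exact merged shape may differ):

// nodeTransientInfo holds per-node values that are only meaningful within a
// single scheduling cycle.
type nodeTransientInfo struct {
	// AllocatableVolumesCount is the number of volumes that can still be attached to the node.
	AllocatableVolumesCount int
	// RequestedVolumes is the number of volumes requested by the pod being scheduled.
	RequestedVolumes int
}

// transientSchedulerInfo wraps the transient data with a lock (sync.Mutex), since
// predicates and priorities for different nodes run in parallel.
type transientSchedulerInfo struct {
	TransientLock sync.Mutex
	TransNodeInfo nodeTransientInfo
}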

nodeInfo.TransientInfo.TransientLock.Lock()
defer nodeInfo.TransientInfo.TransientLock.Unlock()
nodeInfo.TransientInfo.TransNodeInfo.AllocatableVolumesCount = c.maxVolumes - numExistingVolumes
nodeInfo.TransientInfo.TransNodeInfo.RequestedVolumes = numNewVolumes

bsalamat (Contributor) Mar 24, 2018

Please remove the blank line

ravisantoshgudimetla (Contributor) Mar 25, 2018

Thanks. Updated.

schedulerapi "k8s.io/kubernetes/pkg/scheduler/api"
"k8s.io/kubernetes/pkg/scheduler/schedulercache"
"math"

bsalamat (Contributor) Mar 24, 2018

Please put go standard libraries at the top, like it was before.

@@ -451,7 +450,13 @@ func (c *MaxPDVolumeCountChecker) predicate(pod *v1.Pod, meta algorithm.Predicat
// violates MaxEBSVolumeCount or MaxGCEPDVolumeCount
return false, []algorithm.PredicateFailureReason{ErrMaxVolumeCountExceeded}, nil
}
if nodeInfo != nil && nodeInfo.TransientInfo != nil {

bsalamat (Contributor) Mar 24, 2018

Could you check the flag gate here as well?

@@ -112,6 +112,8 @@ func (cache *schedulerCache) UpdateNodeNameToInfoMap(nodeNameToInfo map[string]*
cache.mu.Lock()
defer cache.mu.Unlock()
for name, info := range cache.nodes {
// Transient scheduler info is added here.
info.TransientInfo = newTransientSchedulerInfo()

bsalamat (Contributor) Mar 24, 2018

Please check the flag gate here too. I wonder how much impact this has on performance, particularly the impact of allocating and garbage-collecting memory for all the nodes in every scheduling cycle. I would initialize it globally once (maybe with the NodeInfo initialization) and just clear it here, to avoid frequent allocation/garbage collection.

ravisantoshgudimetla (Contributor) Mar 25, 2018

Thanks Bobby for your review.

I wonder how much is impact of this on performance, particularly the impact of allocating/garbage collecting memory for all the nodes in every scheduling cycle

I had a similar concern, so I added a comment saying to add fields to the transient structure only if necessary. I have now added a benchmark with 1k nodes and 30 pods on each node for the UpdateNodeNameToInfoMap function.

But for the following run, I changed the node count to 10k with 30 pods on each node to stress it further.

Without transient info structure.

 /usr/local/go/bin/go test -v k8s.io/kubernetes/pkg/scheduler/schedulercache -bench ^BenchmarkUpdate1kNodes30kPods$ -run=^$ -cpuprofile=cpu_with10knodeswithout.out
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/pkg/scheduler/schedulercache
BenchmarkUpdate1kNodes30kPods-8   	     100	  16567535 ns/op
PASS
ok  	k8s.io/kubernetes/pkg/scheduler/schedulercache	5.613s

With transient info structure.

/usr/local/go/bin/go test -v k8s.io/kubernetes/pkg/scheduler/schedulercache -bench ^BenchmarkUpdate1kNodes30kPods$ -run=^$ -cpuprofile=cpu_with10knodeswith.out
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/pkg/scheduler/schedulercache
BenchmarkUpdate1kNodes30kPods-8   	     100	  16487893 ns/op
PASS
ok  	k8s.io/kubernetes/pkg/scheduler/schedulercache	5.601s

So, there is not much difference between the two runs. Not sure if my analysis is wrong, but it could be because these structures are being allocated on the stack, as pointers are being passed around. I am also attaching call-graph PDFs with and without the transient info struct allocation. The functions I looked at include the runtime's malloc.go and mgcmark.go.

cpu_with10knodeswith.pdf
cpu_with10knodeswithout.pdf

I would initialize globally once (may be with the NodeInfo initialization) and I would just clear it here to avoid frequent allocation/garbage collection.

The problem is that we call the UpdateNodeNameToInfoMap function at the start of every scheduling cycle, and we need to initialize these values every time. Also, I am not sure what clearing them means in the context of garbage collection; resetting the needed values to 0?

bsalamat (Contributor) Mar 26, 2018

Thanks for your analysis. Looks like adding the transient info struct has not had much impact on performance.

it could be because of these structures being allocated on stack as pointers are being passed around.

I doubt they are stack-allocated when built with the rest of the code. The pointer escapes the allocating function, but when you run UpdateNodeNameToInfoMap by itself in a tight loop, there is a chance that the pointer is stack-allocated. That could be why the benchmark numbers are so close.

Also, I am not sure what clearing them means in the context of garbage collection, resetting the needed values to 0?

Yes. I meant setting the struct members to an invalid value when clearing them. The invalid value could be 0 or a negative value that they never get when they are used.

ravisantoshgudimetla (Contributor) Mar 27, 2018

The pointer escapes the allocating function, but when you run UpdateNodeNameToInfoMap by itself in a tight loop, then there is chance that the pointer is stack allocated.

Thanks. That makes sense too..

Yes. I meant setting the struct members to an invalid value when clearing them. The invalid value could be 0 or a negative value that they never get when they are used.

I have now added a reset function that resets the values to 0 instead of allocating new structs again. PTAL.
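
As a side note, one way to see whether these structs actually escape to the heap is to rerun the benchmark with the compiler's escape-analysis diagnostics enabled (illustrative command):

go test -gcflags='-m' k8s.io/kubernetes/pkg/scheduler/schedulercache -bench ^BenchmarkUpdate1kNodes30kPods$ -run=^$ 2>&1 | grep 'escapes to heap'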

@@ -112,6 +114,10 @@ func (cache *schedulerCache) UpdateNodeNameToInfoMap(nodeNameToInfo map[string]*
cache.mu.Lock()
defer cache.mu.Unlock()
for name, info := range cache.nodes {
// Transient scheduler info is added here.
if utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes) && info.TransientInfo != nil {
info.TransientInfo.TransNodeInfo.reset()

bsalamat (Contributor) Mar 27, 2018

Shouldn't you allocate TransientInfo when it is nil? Also, I think you should be using resetTransientSchedulerInfo instead of reset.

ravisantoshgudimetla (Contributor) Mar 27, 2018

Shouldn't you allocate TransientInfo when it is nil

I added it initially and then removed it after looking at the code; it seems that to reach this point, we must already have called SetNode, which initializes this struct.

I think you should be using resetTransientSchedulerInfo instead of reset.

Yes, thanks for catching it. I'm too reliant on my IDE these days :(. Made the changes.

ravisantoshgudimetla (Contributor) commented Mar 27, 2018

/retest

ravisantoshgudimetla (Contributor) commented Mar 29, 2018

/retest

ravisantoshgudimetla (Contributor) commented Mar 29, 2018

/retest

ravisantoshgudimetla (Contributor) commented Mar 29, 2018

@bsalamat I addressed your comments. PTAL.

// Transient scheduler info is reset here.
if info.TransientInfo != nil {
info.TransientInfo.resetTransientSchedulerInfo()
} else if info.TransientInfo == nil {

bsalamat (Contributor) Mar 30, 2018

You do not need the "if" part.

In fact, given that TransientInfo is initialized with nodeInfo, please remove the whole else part and not initialize it here.

ravisantoshgudimetla (Contributor) Mar 30, 2018

Done Bobby. PTAL.

}
// reset resets the nodeTransientInfo.
func (transNodeInfo *nodeTransientInfo) reset() {

bsalamat (Contributor) Mar 30, 2018

One last thing: this method is not needed. I would remove it and inline the body in resetTransientSchedulerInfo to prevent future mistakes of calling it directly (without locking).

ravisantoshgudimetla (Contributor) Mar 30, 2018

Updated, thanks. I did not want one big function that resets all the transient values; that was the reason I separated that block into a different function. But I understand your point about this function being callable directly without acquiring the lock.
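
A sketch of the resulting helper, with the field resets inlined under the lock as suggested (illustrative; the merged code may differ in detail):

// resetTransientSchedulerInfo resets the transient, per-scheduling-cycle data.
// The reset happens under the lock so callers never observe a partially reset state.
func (transientSchedInfo *transientSchedulerInfo) resetTransientSchedulerInfo() {
	transientSchedInfo.TransientLock.Lock()
	defer transientSchedInfo.TransientLock.Unlock()
	// Reset the transient node info fields to their zero values.
	transientSchedInfo.TransNodeInfo.AllocatableVolumesCount = 0
	transientSchedInfo.TransNodeInfo.RequestedVolumes = 0
}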

k8s-ci-robot added the lgtm label Mar 30, 2018

k8s-ci-robot (Contributor) commented Mar 30, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ravisantoshgudimetla (Contributor) commented Mar 31, 2018

/retest

k8s-merge-robot (Contributor) commented Mar 31, 2018

Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-merge-robot merged commit 9847c8e into kubernetes:master Mar 31, 2018

14 checks passed

  • Submit Queue: Queued to run github e2e tests a second time.
  • cla/linuxfoundation: ravisantoshgudimetla authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.

ravisantoshgudimetla deleted the ravisantoshgudimetla:scheduler-pvc branch Mar 31, 2018

gnufied (Member) commented May 3, 2018

@k82cn @msau42 @ravisantoshgudimetla @bsalamat I am sorry I did not get a chance to comment on this before, but one big problem with what we merged in this PR is - it will only work for Azure, AWS and GCE volume types. It will not work for any other volume type.

Also, it sounds like this proposal set out to fix "balance PVCs on a node", but in reality, because it requires a maximum limit of volumes, it will only EVER work for volume types that have such upper limits. In a nutshell, this proposal will probably not work for something like glusterfs, iscsi, etc. for the foreseeable future.

ravisantoshgudimetla (Contributor) commented May 4, 2018

@gnufied I think this works for any volume type as long as max limit on volumes parameter is set.

it will only EVER work for volume types that have such upper limits.

I think the limit comes from the maximum number of volumes that can be attached to a node, irrespective of volume type (glusterfs, iscsi, etc.). IOW, even if glusterfs provides unlimited volumes, there is a limit on the number of volumes that can be attached to a machine, depending on the operating system and filesystem used.

gnufied (Member) commented May 4, 2018

The problem I was talking about is that since Kubernetes has no knowledge of the applicable limits for volume types like glusterfs, the balanced volume allocation will not work for them. Even though there will indeed be a theoretical limit for glusterfs etc. (before it saturates the network), that information is not exposed inside Kubernetes, and there is no way for an admin to set those limits.

I discussed this offline with @jsafrane and @childsb, and we think it may be possible to return a "dummy" upper limit for those volume types (a high number such as, say, 1000) so that the fractions this proposal requires can still be calculated. Do we really care about the real maximum volume count for the purpose of calculating the fractions? It appears to me that any high number should do the job and this design will still work. @ravisantoshgudimetla, correct me if I am wrong.
