Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nil pointer dereference fix #1031

Merged
merged 1 commit into from
Jan 14, 2022
Merged

Nil pointer dereference fix #1031

merged 1 commit into from
Jan 14, 2022

Conversation

nealormsbee
Copy link
Contributor

@nealormsbee nealormsbee commented Jan 13, 2022

What does this PR change?

  • Fixes a comment typo.

  • Causes GPU idle coefficient to be calculated with GPU costs instead of RAM costs.

  • Formats a key composition for checking a map correctly. Was trying to
    access map using node name as a key when sharing idle by node, but
    needs to access using <cluster>/<node>.

  • Checks to make sure that the key is present in
    allocTotalsAfterFilters, as well as allocTotals. I cannot fathom what
    would cause a key to be missing from this map, but that seems to be the
    error reported, and I do know that a missing key likely means that the
    subsequent coefficient we are attempting to compute should simply be 0.

Does this PR rely on any other PRs?

No

How does this PR impact users? (This is the kind of thing that goes in release notes!)

Fixes a bug where requesting an allocation summary with shareIdle=false&idleByNode=true would crash the pod.
Also guards against another potential nil pointer dereference in the case where two maps are desynced.

Links to Issues or ZD tickets this PR addresses or fixes

N/A

How was this PR tested?

An integration test is being written in our integration test suite.
Additionally, manually by querying /allocation/summary with different combinations of filters, shareIdle, and idleByNode options. Notably:

  • All combinations of shareIdle and idleByNode true/false values
  • Filtering by namespace, cluster, and node

Have you made an update to documentation?

No

- Formats a key composition for checking a map correctly. Was trying to
  access map using node name as a key when sharing idle by node, but
needs to access using <cluster>/<node>

- Checks to make sure that the key is present in
  allocTotalsAfterFilters, as well as allocTotals. I cannot fathom what
would cause a key to be missing from this map, but that seems to be the
error reported, and I do know that a missing key likely means that the
subsequent coefficient we are attempting to compute should simply be 0.
Copy link
Contributor Author

@nealormsbee nealormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few clarifying comments about the changes

key = ia.Properties.Node
key = fmt.Sprintf("%s/%s", ia.Properties.Cluster, ia.Properties.Node)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was the cause of the reported error. cluster/node is used in all of the other key creation in this function if idleByNode is true. Here, it was just node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, yes, good catch.

Comment on lines +834 to +839
filteredAlloc, ok := allocTotalsAfterFilters[key]
if ok {
cpuFilterCoeff = filteredAlloc.CPUCost / allocTotals[key].CPUCost
} else {
cpuFilterCoeff = 0.0
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to create a situation where allocTotals had a key for an idle allocation that allocTotalsAfterFilters did not. Planning to ask Niko about it when he gets back, because maybe it's possible.

In any case, the initial stack trace we saw lead us to believe that key may be present in allocTotals but not allocTotalsAfterFilters. If that is the case, then I believe the allocation in question would have been filtered out, and the correct action is to set this ratio to 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to talk about this one if you want to. In general, the fix looks good 😎

Comment on lines -838 to +844
if allocTotals[key].RAMCost > 0.0 {
gpuFilterCoeff = allocTotalsAfterFilters[key].RAMCost / allocTotals[key].RAMCost
if allocTotals[key].GPUCost > 0.0 {
filteredAlloc, ok := allocTotalsAfterFilters[key]
Copy link
Contributor Author

@nealormsbee nealormsbee Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpuCoeff was being determined using RAMKeys, which I think is a typo or a copy/paste miss.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow good catch! Thank you!

Copy link
Contributor

@michaelmdresser michaelmdresser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the overview in review comments, it helped a lot.

@nealormsbee
Copy link
Contributor Author

Thanks for the review, @michaelmdresser !

@AjayTripathy this PR is against develop, but there's another one open against master. Just let me know whether you want to do a patch release and I'll merge one & close the other.

@nealormsbee nealormsbee merged commit c01025d into develop Jan 14, 2022
@michaelmdresser michaelmdresser deleted the neal/nil-pointer-fix branch June 23, 2023 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants