Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GC uid races and handling of conflicting ownerReferences #92743

Merged
merged 13 commits into from Nov 17, 2020

Commits on Nov 17, 2020

  1. Configuration menu
    Copy the full SHA
    09bdf76 View commit details
    Browse the repository at this point in the history
  2. Add GC integration race test

    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    14f7f32 View commit details
    Browse the repository at this point in the history
  3. Switch GC absentOwnerCache to full reference

    Before deleting an object based on absent owners, GC verifies absence of those owners with a live lookup.
    
    The coordinates used to perform that live lookup are the ones specified in the ownerReference of the child.
    
    In order to performantly delete multiple children from the same parent (e.g. 1000 pods from a replicaset),
    a 404 response to a lookup is cached in absentOwnerCache.
    
    Previously, the cache was a simple uid set. However, since children can disagree on the coordinates
    that should be used to look up a given uid, the cache should record the exact coordinates verified absent.
    This is a [apiVersion, kind, namespace, name, uid] tuple.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    445f20d View commit details
    Browse the repository at this point in the history
  4. Avoid marking virtual nodes as observed when they haven't been

    Virtual nodes can be added to the GC graph in order to represent objects
    which have not been observed via an informer, but are referenced via ownerReferences.
    
    These virtual nodes are requeued into attemptToDelete until they are observed via an informer,
    or successfully verified absent via a live lookup. Previously, both of those code paths
    called markObserved() to stop requeuing into attemptToDelete.
    
    Because it is useful to know whether a particular node has been observed via
    a real informer event, this commit does the following:
    
    * adds a `virtual bool` attribute to graph events so we know which ones came from a real informer
    * limits the markObserved() call to the code path where a real informer event is observed
    * uses an alternative mechanism to stop requeueing into attemptToDelete when a virtual node is verified absent via a live lookup
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    30eb668 View commit details
    Browse the repository at this point in the history
  5. Refactor identityFromEvent

    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    cb7b9ed View commit details
    Browse the repository at this point in the history
  6. Replace virtual node with observed node if identity differs

    If the graph contains a virtual node (because some child object referenced it in an OwnerRef),
    and a real informer event is observed for that uid at different coordinates,
    we want to fix the coordinates of the node in the graph to match the actual coordinates.
    
    The safe way to do this is to clone the node, replace the identity in the clone,
    then replace the node with the clone.
    
    Modifying the identity directly is not safe because it is accessed lock-free from many code paths.
    
    Replacing the node in the graph from processGraphChanges is safe because it is the only graph writer.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    cae56be View commit details
    Browse the repository at this point in the history
  7. Short-circuit attemptToDelete loop for virtual nodes that are removed…

    … or observed
    
    Virtual nodes are added to the attemptToDelete queue, and continue getting requeued
    until they are successfully verified absent or are observed via informer.
    
    In the meantime, if the real object associated with that UID is observed via informer,
    or is observed to be deleted via informer, the graph node for that UID can be removed
    or marked as observed. In that case, we should stop retrying to get the virtual node coordinates.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    78317ed View commit details
    Browse the repository at this point in the history
  8. Enqueue dependents for deletion when their ownerReference does not ma…

    …tch observed parent coordinates
    
    When adding a dependent to the graph, we ensure there is a node representing each owner reference,
    and add the dependent to each parent node.
    
    If the parent node already exists, and the dependent's ownerReference
    coordinates disagree with the verified coordinates, add the dependent to the attemptToDelete queue.
    
    This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
    If all of the owners can be verified absent, the dependent will be deleted.
    If some are still present, or if there are errors looking them up, the dependent will not be deleted.
    
    If the parent node has been observed via informer event (so we know the coordinates are accurate),
    and the verified owner is namespaced, and the dependent is not in the same namespace,
    an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    ac8d419 View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    b8d7ecf View commit details
    Browse the repository at this point in the history
  10. Handle virtual delete events when children don't agree on owner coord…

    …inates
    
    If a virtual delete event is received for a node whose dependents disagree on the parent's coordinates:
    1. propagate the delete to children that matched the verified absent coordinates
    2. if the existing node is virtual, select a new set of coordinates from the remaining dependents
    3. do not delete the parent node from the graph if the parent node is non-virtual,
       or if there are dependents that do not agree with the virtual delete event coordinates
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    b655f22 View commit details
    Browse the repository at this point in the history
  11. Queue non-matching children for deletion when a virtual node is marke…

    …d as observed
    
    When we observe valid coordinates for a previously virtual node,
    if there are dependents that do not agree with those coordinates,
    add them to the attemptToDelete queue.
    
    This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
    If all of the owners can be verified absent, the dependent will be deleted.
    If some are still present, or if there are errors looking them up, the dependent will not be deleted.
    
    If the verified owner is namespaced, and the dependent is not in the same namespace,
    an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    221e4aa View commit details
    Browse the repository at this point in the history
  12. Log cluster-scoped owners referencing namespaced owners, avoid retryi…

    …ng lookups forever
    
    If a cluster-scoped dependent references a namespace-scoped owner,
    this is an invalid relationship, and the lookup will never succeed in attemptToDelete.
    
    Short-circuit requeueing in attemptToDelete and log.
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    603a0b0 View commit details
    Browse the repository at this point in the history
  13. Add GC unit tests

    Adds unit tests covering the problematic scenarios identified
    around conflicting data in child owner references
    
                          Before   After
    package level         51%      68%
    garbagecollector.go   60%      75%
    graph_builder.go      50%      81%
    graph.go              50%      68%
    
    Added/improved coverage of key functions that had lacking unit test coverage:
    
    * attemptToDeleteWorker
    * attemptToDeleteItem
    * processGraphChanges (added coverage of all added code)
    liggitt committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    e491c3b View commit details
    Browse the repository at this point in the history