Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in handling of reserved resources #1228

Merged
merged 2 commits into from
Apr 8, 2021

Conversation

dralley
Copy link
Contributor

@dralley dralley commented Apr 2, 2021

@dralley dralley force-pushed the tasking-race branch 3 times, most recently from 268672b to 9e3b9ff Compare April 2, 2021 17:23
@pulpbot
Copy link
Member

pulpbot commented Apr 2, 2021

Attached issue: https://pulp.plan.io/issues/8352

@dralley dralley force-pushed the tasking-race branch 5 times, most recently from 643bb52 to d9b481f Compare April 2, 2021 18:04
@daviddavis
Copy link
Contributor

The code looks good to me. I am going to try to test out/debug the code some tomorrow with some sleep statements so I can understand it a bit better.

# no worker is ready so we need to wait
time.sleep(0.25)
continue
with transaction.atomic():
Copy link
Contributor Author

@dralley dralley Apr 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm about 80% certain we still need the select_for_update() here also to make it "complete" (these queries don't touch the actual worker so there's nothing for the lock to block otherwise)... however, apparently it doesn't get along with the distinct() on line 46 (they are apparently incompatible constructs?). Any ideas? What is that distinct() actually doing and do we need it?

Copy link
Contributor Author

@dralley dralley Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "simple" option might be to just ignore locking during acquire_worker and just drop a worker = Worker.objects.select_for_update().get(pk=worker.pk) on line 127 below so that just the resource reservation is covered by the lock. I don't like the double-query but it would work. And it might be "good enough" if we want a fast patch with minimal risk.

Would still like to do better if possible though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented ^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dralley select_for_update is 100% needed here to prevent related tables (resource reservation) from modification if another transaction holds a lock. An alternative to double query can be rewriting the query on line 48 to eliminate DISTINCT. However as a workaround and immediate solution it is acceptable in my opinion.

@daviddavis daviddavis requested a review from mdellweg April 6, 2021 14:20
@fao89
Copy link
Member

fao89 commented Apr 6, 2021

@cutwater

Comment on lines 138 to 149
except Worker.DoesNotExist:
# from _acquire_worker
# no worker is ready so we need to wait
time.sleep(0.25)
continue
except IntegrityError:
# we have a worker but we can't create the reservations so wait
time.sleep(0.25)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except Worker.DoesNotExist:
# from _acquire_worker
# no worker is ready so we need to wait
time.sleep(0.25)
continue
except IntegrityError:
# we have a worker but we can't create the reservations so wait
time.sleep(0.25)
continue
except (Worker.DoesNotExist, IntegrityError):
time.sleep(0.25)
continue

# no worker is ready so we need to wait
time.sleep(0.25)
continue
with transaction.atomic():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dralley select_for_update is 100% needed here to prevent related tables (resource reservation) from modification if another transaction holds a lock. An alternative to double query can be rewriting the query on line 48 to eliminate DISTINCT. However as a workaround and immediate solution it is acceptable in my opinion.

try:
worker = _acquire_worker(resources)

worker = Worker.objects.select_for_update().get(pk=worker.pk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is still susceptible to a race condition?

  1. Worker is returned from _acquire_worker()
  2. Worker is cleaned up
  3. Worker is fetched here with get() and assigned reservations

Copy link
Contributor

@cutwater cutwater Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Yes, it will. Worker status is not checked here.

Copy link
Contributor

@daviddavis daviddavis Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I believe this should work:

Worker.objects.select_for_update().filter(pk__in=Worker.objects.filter(reservations__resource__in=resources))

It performs a single query by using a subselect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from this question, the rest looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daviddavis Is this meant to go on line 132 or on line 48? It has different semantics from line 48.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could potentially work for line 48:

return Worker.objects.select_for_update().filter(pk__in=Worker.objects.filter(reservations__resource__in=resources)).get()

@dralley dralley force-pushed the tasking-race branch 2 times, most recently from 4971456 to eb445bb Compare April 7, 2021 15:14
Copy link
Contributor

@cutwater cutwater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dralley
Copy link
Contributor Author

dralley commented Apr 7, 2021

@daviddavis This change appears as though it might be causing deadlocks? If I run the tests locally it hangs on the very first one.

And the CI has been running for nearly an hour

[noissue]
@dralley
Copy link
Contributor Author

dralley commented Apr 7, 2021

@cutwater @mdellweg Please re-review the last patch

@dralley dralley force-pushed the tasking-race branch 3 times, most recently from 746e14e to 777c476 Compare April 7, 2021 18:35
try:
# A randomly-selected Worker instance that has zero ReservedResource entries
# associated with it.
return workers_qs_with_counts.filter(reservations__count=0).order_by("?")[0]
return Worker.objects.select_for_update().filter(pk__in=workers_qs_no_res).first()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the selected worker isn't necessarily random anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix that. What if we did this:

workers_qs_no_res = (
        workers_qs.annotate(models.Count("reservations"))
        .filter(reservations__count=0)
        .order_by("?")[0:1]
    )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think this is better:

Suggested change
return Worker.objects.select_for_update().filter(pk__in=workers_qs_no_res).first()
return Worker.objects.select_for_update().filter(pk__in=workers_qs_no_res[0:1]).first()

.order_by("?")
.first()
)
except IndexError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If query fetches no rows, .first() will not throw IndexError, but return None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

limit(1).get() ?

Copy link
Contributor

@daviddavis daviddavis Apr 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no limit() method for queryset. You have to use [0:1].get().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my experience says, that get raises the MultipleReturned exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to get an exception here instead of simply checking first() return value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also .get() doesn't raise IndexError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdellweg yes, my bad. .get() raises MultipleReturned indeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early morning drowsiness :) That's probably the simplest solution, I can do that.

@dralley dralley force-pushed the tasking-race branch 2 times, most recently from c4b3b0c to 33c1694 Compare April 8, 2021 13:49
Comment on lines 68 to 75
try:
# A randomly-selected Worker instance that has zero ReservedResource entries
# associated with it.
return (
Worker.objects.select_for_update()
.filter(pk__in=workers_qs_no_res)
.order_by("?")[0:1]
.get()
)
except IndexError:
# If all Workers have at least one ReservedResource entry.
raise Worker.model.DoesNotExist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
# A randomly-selected Worker instance that has zero ReservedResource entries
# associated with it.
return (
Worker.objects.select_for_update()
.filter(pk__in=workers_qs_no_res)
.order_by("?")[0:1]
.get()
)
except IndexError:
# If all Workers have at least one ReservedResource entry.
raise Worker.model.DoesNotExist()
# A randomly-selected Worker instance that has zero ReservedResource entries
# associated with it.
return (
Worker.objects.select_for_update()
.filter(pk__in=workers_qs_no_res)
.order_by("?")[0:1]
.get()
)

Copy link
Contributor

@cutwater cutwater Apr 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this approach (queryset[0:1].get()) is offered by official Django docs, it looks odd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cutwater I went with your recommendation to just check the value of what is returned by first()

@dralley dralley force-pushed the tasking-race branch 2 times, most recently from 1285b09 to 32a2742 Compare April 8, 2021 13:53

if not worker:
# If all Workers have at least one ReservedResource entry.
raise Worker.model.DoesNotExist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Worker.DoesNotExist()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an alias - this is copied over so if it didn't work it would have broken before now. But I'll change it anyway

Worker.objects.select_for_update().filter(pk__in=workers_without_res).order_by("?").first()
)

if not worker:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: I'd rather use an explicit check here
if worker is not None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to if worker is None

@dralley dralley force-pushed the tasking-race branch 2 times, most recently from 61b4564 to fe73534 Compare April 8, 2021 13:59
worker=worker, resource=resource
)
TaskReservedResource.objects.create(resource=reservation, task=task_status)
except (Worker.DoesNotExist, IntegrityError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having break statement at the end of the loop looks odd.

IMO this is more readable:


except (Worker.DoesNotExist, IntegrityError):
    # if worker is ready, or we have a worker but we can't create the reservations, wait
    time.sleep(0.25)
else:
    break

)
TaskReservedResource.objects.create(resource=reservation, task=task_status)
except (Worker.DoesNotExist, IntegrityError):
# if worker is ready, or we have a worker but we can't create the reservations, wait
time.sleep(0.25)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that continue is not needed here anymore.

Copy link
Contributor

@cutwater cutwater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cutwater
Copy link
Contributor

cutwater commented Apr 8, 2021

@dralley Thank you for addressing my comments.

@dralley
Copy link
Contributor Author

dralley commented Apr 8, 2021

CI failure looks unrelated

@dralley dralley merged commit 289c1cd into pulp:master Apr 8, 2021
@dralley dralley deleted the tasking-race branch April 8, 2021 14:53
@dralley
Copy link
Contributor Author

dralley commented Apr 8, 2021

@cutwater Merged. Do you need a new build with this patch or do you plan to upgrade to 3.12?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants