Pass blocking IO in content handler to a separate executor #1116

Merged
merged 1 commit into pulp:master from blocking-in-content-handler on Jun 3, 2021

Conversation

@dralley (Contributor) commented Feb 5, 2021

The standard async executor is single-threaded (which is perfectly fine for async code)
but blocking IO shreds its throughput. Let's create a separate ThreadPool executor for
blocking IO so that our async code is actually async.

[noissue]
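
In outline, the pattern this PR introduces looks like the following sketch (names are illustrative; the PR's actual hunks appear in the review below):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Illustrative: a small pool dedicated to blocking work, so the
# single-threaded event loop never blocks on the ORM or the filesystem.
io_pool_exc = ThreadPoolExecutor(max_workers=2)

def query_blocking():
    # Stand-in for an ORM query or file read that would otherwise stall the loop.
    return "result"

async def handler():
    loop = asyncio.get_running_loop()
    # The blocking callable runs on a pool thread; this coroutine suspends,
    # letting the loop serve other requests in the meantime.
    return await loop.run_in_executor(io_pool_exc, query_blocking)

print(asyncio.run(handler()))
```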

@dralley (Contributor, Author) commented Feb 5, 2021

Consider this my "day of learning" project :)

No idea how much difference it will make for content serving performance. I'll try to make a tool to take measurements.

If we just don't want to merge this right now, that's fine. Just the exercise of finding these problems felt worthwhile.

@dralley dralley changed the title Pass blocking IO in content handler to a threadpool executor Pass blocking IO in content handler to a separate executor Feb 5, 2021
@pulpbot (Member) commented Feb 5, 2021

WARNING!!! This PR is not attached to an issue. In most cases this is not advisable. Please see our PR docs for more information about how to attach this PR to an issue.

@dralley dralley force-pushed the blocking-in-content-handler branch 6 times, most recently from 01af976 to a995692 Compare February 7, 2021 17:14
@bmbouter (Member) commented Feb 8, 2021

Thank you for exploring this. Maybe we want to do something like this, but we need to think a lot about it. Is this how we should be increasing throughput?

First let me check my understanding: Am I right in understanding that this doesn't decrease the latency per request, but it increases the number of clients that can be served concurrently (aka increasing client throughput)?

If that's the case, how does this compare against:

  • Having gunicorn manage the threadpool with more threads per worker: https://docs.gunicorn.org/en/stable/settings.html#threads
  • Increasing the workers per gunicorn process?
  • Increasing the number of gunicorn processes? <--- idea: if most of our time is just spent waiting, in theory we can do this to increase throughput without actually seeing an increase in CPU/RAM usage.

@dralley (Contributor, Author) commented Feb 8, 2021

@bmbouter It improves both the throughput and latency by roughly 2x. I did notice, though, that there was no significant improvement beyond one or two threads in the threadpool on my system; I suppose the queries are fast enough that they don't stack up too much. But that might not be true on a system with an HDD. Or maybe the HDD would be equally incapable of handling many queries at once, idk.

I don't have any numbers on how it compares against more processes. I think more workers is something we might want regardless. And Adam Winberg did see good results by increasing the count.

Re: gunicorn threads, though - those don't apply to the aiohttp worker type we are using; see the note at the bottom of the documentation you linked. Having our own threadpool for blocking IO serves essentially the same purpose as more worker threads do in a standard blocking-IO HTTP server, though.

I'm in general agreement that we need to play with the parameters to get the biggest impact, but this PR also has separate value: right now we have all this async code which is not actually async, which basically violates all the assumptions async code and servers and whatnot are supposed to make. And Django 3+ has the ability to detect this and complains very loudly about it (it refuses to work unless you set a certain ENV variable). I haven't tested, but this will likely ease the pain of migrating to newer Django when we do so.
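
For reference, the Django 3.1 failure mode described above looks like this (a sketch; the model import path is an assumption):

```python
import asyncio
from pulpcore.app.models import Artifact  # hypothetical import path

async def not_actually_async():
    # With a running event loop, Django 3.1+ raises
    # django.core.exceptions.SynchronousOnlyOperation here unless the
    # DJANGO_ALLOW_ASYNC_UNSAFE environment variable is set.
    return Artifact.objects.first()

asyncio.run(not_actually_async())
```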

@mdellweg (Member) commented Feb 8, 2021

I can only second what you say about breaking the assumptions of async code. With async, the only allowed way to wait for anything is by handing control back to the main loop (aka awaiting something: await <...>).
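
A tiny self-contained illustration of that point: a single blocking call stalls every coroutine on the loop.

```python
import asyncio
import time

async def ticker():
    for _ in range(4):
        print("tick")
        await asyncio.sleep(0.5)  # yields control back to the loop

async def blocking_work():
    time.sleep(2)  # never yields: all other coroutines stall for 2 seconds

async def main():
    await asyncio.gather(ticker(), blocking_work())

# The first tick prints, then nothing for ~2s while time.sleep() holds the
# loop hostage; the remaining ticks bunch up afterwards.
asyncio.run(main())
```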

@bmbouter (Member) commented Feb 8, 2021

It does offer improvements, but it also brings a huge increase in complexity. We need Pulp's code to be very simple. People already find the content app hard to work on.

I want to state an idea to see what you all think: the real measure of our code is how efficiently it turns hardware into user experience.

I think that's the key concern, actually, given that we can start more gunicorn processes and more gunicorn workers per process. This is the whole horizontal versus vertical scaling debate. Unless we're increasing the hardware efficiency all we're doing is choosing to vertically scale instead of horizontally scale, and in doing so taking on a significant code complexity.

Maybe scaling vertically is the best thing to do, but maybe it's not. We don't really know until we evaluate a variety of solutions. I'd like us to look at the problem more holistically before we continue down this road.

@dralley (Contributor, Author) commented Feb 8, 2021

Unless we're increasing the hardware efficiency all we're doing is choosing to vertically scale instead of horizontally scale, and in doing so taking on a significant code complexity.

But isn't the entire purpose of async to improve hardware efficiency by:

  • eliminating the wasted memory needed for hundreds or thousands of threads on servers that service hundreds or thousands of concurrent connections / requests
  • reducing the cost of context switches between them / improving the scheduling of them

-- at the expense of more complex code?

Right now we have added complexity from async, but without a lot of the benefits that async is supposed to bring, and without the extra horizontal scaling tunables traditional webservers have.

edit: Parts of this were not as constructive as they should have been and this venue is the wrong place for them. I removed them to stay focused on the merits of this PR.

@mdellweg (Member) commented Feb 9, 2021

I am with dralley here. The whole functionality of the async technology and the aiohttp framework is based on the assumption that no coroutine is wasting time. Blocking calls will introduce lag in places that we cannot even anticipate.
And offloading tasks into thread executors is the official way to make originally-blocking calls well behaved. Clearly the ultimate goal must be to switch to both async file access and an async postgres adapter (and I am still unsure whether that will introduce more or less complexity than the thread executors in the end).

@daviddavis daviddavis marked this pull request as draft February 16, 2021 14:49
@dralley dralley force-pushed the blocking-in-content-handler branch 4 times, most recently from 89c305c to 7b6942f Compare April 7, 2021 03:53
@dralley (Contributor, Author) commented Apr 7, 2021

@daviddavis @mdellweg This PR does actually make Django 3.1 happy in terms of not complaining about using the ORM in an async context (at least for the content app; the sync pipeline still throws errors everywhere).

And then adding Environment="DJANGO_ALLOW_ASYNC_UNSAFE=1" to the worker .service template "fixes" that. We could theoretically use it here too but given the other benefits we've seen from this approach, I don't know if we'd want to.

Either way, migration to 3.x might not be that difficult.

@dralley dralley force-pushed the blocking-in-content-handler branch 3 times, most recently from 4876700 to 758de50 Compare April 7, 2021 13:35
@dralley dralley marked this pull request as ready for review May 11, 2021 02:36
@dralley dralley force-pushed the blocking-in-content-handler branch from 758de50 to 3d59264 Compare May 11, 2021 02:39
```python
# Django ORM is blocking, as are most of our file operations. This means that the
# standard single-threaded async executor cannot help us. We need to create a separate,
# thread-based executor to pass our heavy blocking IO work to.
io_pool_exc = ThreadPoolExecutor(max_workers=2)
```
dralley (Contributor, Author):

I did experiment with the numbers a bit and wasn't seeing much difference past 2 threads, although I wasn't especially thorough. Once I get back to content app perf work, we can tune this better.

@dralley dralley force-pushed the blocking-in-content-handler branch 4 times, most recently from b2618bd to 8fb2055 Compare May 25, 2021 22:41
```diff
 while True:
-    content_app_status, created = ContentAppStatus.objects.get_or_create(name=name)
+    content_app_status, created = await loop.run_in_executor(io_pool_exc, get_status_blocking)
     if not created:
```
(Member):

Would it help readability to use a lambda here?

dralley (Contributor, Author):

I don't think so, personally. In my mind it makes sense to explicitly label the functions as blocking IO; it feels more self-documenting.

@mdellweg (Member) left a comment:

I would use set_default_executor and then run_in_executor(None, ...) to not have to carry that variable around.
Also I'd prefer to use the args argument of run_in_executor and functools.partial over defining local functions in all places.

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.set_default_executor
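
A sketch combining both suggestions (query_blocking is a hypothetical stand-in for one of our blocking calls):

```python
import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

def query_blocking(path, *, limit=10):
    # Hypothetical stand-in for a blocking ORM or filesystem call.
    return (path, limit)

async def handler(path):
    loop = asyncio.get_running_loop()
    # None selects the loop's default executor (set below). Positional args
    # pass straight through; functools.partial covers keyword arguments.
    return await loop.run_in_executor(
        None, functools.partial(query_blocking, path, limit=5)
    )

loop = asyncio.new_event_loop()
loop.set_default_executor(ThreadPoolExecutor(max_workers=2))
print(loop.run_until_complete(handler("pulp/api")))
```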

pulpcore/content/handler.py: outdated review thread, resolved
```diff
-return cls.distribution_model.objects.get(base_path__in=base_paths)
+return cls.distribution_model.objects.select_related(
+    "repository", "repository_version", "publication"
+).get(base_path__in=base_paths)
```
(Member):

This is an unrelated optimization, right?

(Contributor):

Yes, I suggested using this since we access these values later and they are secretly queries that would need to be wrapped into a thread executor.

pulpcore/content/handler.py: outdated review thread, resolved
```diff
-        return directory_list
-    else:
-        raise PathNotResolved(path)
+return await loop.run_in_executor(io_pool_exc, list_directory_blocking)
```
(Member):

Maybe this function can benefit from a decorator that wraps it into the thread pool.

dralley (Contributor, Author):

asgiref already ships a sync_to_async decorator for exactly this, so we wouldn't have to write our own.

(Member):

That sounds good. Does it work with WSGI though? At some point we should consider switching the content app to use ASGI, but we still use WSGI today.

dralley (Contributor, Author):

I'm pretty sure it does, but we can find that out later, when we adopt Django 3.

dralley (Contributor, Author):

I'd rather not roll our own, is what I'm saying. And I think this is fine for now, personally.
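
If the decorator route were ever taken, a minimal sketch (reusing the PR's io_pool_exc; asgiref's sync_to_async is the ready-made equivalent the thread leans toward) could look like:

```python
import asyncio
import functools

def run_in_io_pool(func):
    """Turn a blocking function into a coroutine that runs on io_pool_exc."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            io_pool_exc, functools.partial(func, *args, **kwargs)
        )
    return wrapper

@run_in_io_pool
def list_directory(path):
    ...  # blocking filesystem work, now awaitable: await list_directory(path)
```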

Comment on lines 383 to 385
```python
def match_distribution_blocking():
    return self._match_distribution(path)

distro = await loop.run_in_executor(io_pool_exc, match_distribution_blocking)
```
(Member):

Suggested change:

```diff
-def match_distribution_blocking():
-    return self._match_distribution(path)
-
-distro = await loop.run_in_executor(io_pool_exc, match_distribution_blocking)
+distro = await loop.run_in_executor(io_pool_exc, self._distribution_blocking, path)
```

@dralley (Contributor, Author) commented May 26, 2021:

My personal preference is to stick to one consistent way of doing the wrapping that is explicit about why things are being wrapped (the _blocking suffix).

Also, I kinda think that making the calls stick out strongly from the rest of the code is a feature, not a bug, considering our long-term goal is to eliminate as many of them as possible. This is a critical hot loop so it's not necessarily a bad thing for the "slow" parts to stick out like a sore thumb.

Even if we do caching, I think we should still be looking for ways to reduce the number of queries.

Comment on lines 439 to 440
```python
def get_published_artifact_blocking():
    publication.published_artifact.get(relative_path=index_path)

await loop.run_in_executor(io_pool_exc, get_published_artifact_blocking)
```
(Member):

There is a pythonic way to pass keyword args without the need for get_published_artifact_blocking:

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio-pass-keywords

Suggested change:

```diff
-def get_published_artifact_blocking():
-    publication.published_artifact.get(relative_path=index_path)
-
-await loop.run_in_executor(io_pool_exc, get_published_artifact_blocking)
+await loop.run_in_executor(io_pool_exc, functools.partial(publication.published_artifact.get, relative_path=index_path))
```

dralley (Contributor, Author):

Perhaps, but considering that we can't use that approach in the large majority of cases, I have a mild preference for sticking to one consistent way of doing things.

@dralley dralley force-pushed the blocking-in-content-handler branch 4 times, most recently from 59e4ea4 to 53c13e9 Compare May 26, 2021 19:30
@mdellweg (Member) left a comment:

I think this is ready to go...

The one suggestion I want to reiterate is to use set_default_executor so we don't have to carry the io_pool_exc variable around.

Django's ORM is thread-safe, right?

@dralley (Contributor, Author) commented May 27, 2021

As far as I can tell, yes, it's thread-safe by virtue of creating a new thread-local connection object for every new thread, rather than re-using the connection object.
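
A quick way to observe that (sketch; assumes a configured Django project):

```python
import threading
from django.db import connections

def report():
    # connections stores one DatabaseWrapper per thread, so the ids differ.
    print(threading.current_thread().name, id(connections["default"]))

report()                                # MainThread, one id
worker = threading.Thread(target=report)
worker.start()                          # worker thread, a different id
worker.join()
```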

```python
nonlocal repo_version
nonlocal publication
# secretly an ORM query
if repository:
```
(Contributor):

Same thing here: this if statement shouldn't need to be blocking (obviously the code inside it is), but having one less blocking call will help with performance.

Comment on lines 503 to 512
```python
if repository or repo_version:
    if repository:

        def get_latest_version_blocking():
            return distro.repository.latest_version()

        repo_version = await loop.run_in_executor(
            io_pool_exc, get_latest_version_blocking
        )
```

(Contributor):

This is a regression (probably from rebasing). The check for repository has already happened and repo_version has already been set.

Comment on lines 518 to 504
```python
ContentArtifact.objects.get(
    content__in=repo_version.content, relative_path=index_path
)
```
(Contributor):

Should have a select_related("artifact") here for fewer database queries.

dralley (Contributor, Author):

It makes no difference here, because the value is never actually used. We only do the query to know whether or not it exists (and then do something else in the exception case).

(Contributor):

I see. Would a filter().exists() be better than using a try/except block?

dralley (Contributor, Author):

Potentially, yeah
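
For illustration, a hedged sketch of the exists() variant being discussed, using the names from the hunk above:

```python
# .exists() issues a cheap SELECT ... EXISTS instead of fetching a row and
# relying on a DoesNotExist exception for control flow.
index_is_published = ContentArtifact.objects.filter(
    content__in=repo_version.content, relative_path=index_path
).exists()
```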

```diff
@@ -459,23 +548,34 @@ async def _match_and_stream(self, path, request):
             pass
         else:
             if ca.artifact:
-                return self._serve_content_artifact(ca, headers)
+                return await self._serve_content_artifact(ca, headers)
             else:
                 return await self._stream_content_artifact(
                     request, StreamResponse(headers=headers), ca
                 )

         if distro.remote:
```
@gerrod3 (Contributor) commented May 27, 2021:

I think this is a query; not sure how many plugins use this field. (The distro's remote field, since GH didn't make it obvious what I was talking about.)

dralley (Contributor, Author):

Fixed with select_related()

Comment on lines 720 to 725
```python
def blocking_io():
    artifact_file = content_artifact.artifact.file
    artifact_name = artifact_file.name
    return (artifact_file, artifact_name)

(artifact_file, artifact_name) = await loop.run_in_executor(io_pool_exc, blocking_io)
```
(Contributor):

If all instances of content_artifact have select_related("artifact") applied before this function, then this blocking wrapper could be removed.
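
Roughly what that would look like (illustrative; some_pk is a hypothetical lookup):

```python
# With select_related("artifact"), the related Artifact row arrives in the
# same query, so the attribute access below triggers no hidden query and
# needs no executor hop.
content_artifact = ContentArtifact.objects.select_related("artifact").get(pk=some_pk)
artifact_file = content_artifact.artifact.file  # no extra query here
```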

@bmbouter (Member) commented:

This PR looks good to me, but I'd like to give one final look once @gerrod3 and @dralley say it's final. Thanks everyone for this great work!

@dralley dralley force-pushed the blocking-in-content-handler branch 3 times, most recently from 779a31d to d7a432b Compare June 2, 2021 23:27
@gerrod3 (Contributor) left a comment:

Looks good to me, just one question about using an exists() query.

Commit message:

The standard async executor is single-threaded and blocking IO shreds
its throughput. Let's create a separate executor for blocking IO.

fixes: #8821
https://pulp.plan.io/issues/8821
@dralley dralley force-pushed the blocking-in-content-handler branch from d7a432b to a0cec39 Compare June 3, 2021 15:52
@dralley (Contributor, Author) commented Jun 3, 2021

@bmbouter This is ready for review

@bmbouter (Member) left a comment:

Thank you so much for putting this together!

@dralley dralley merged commit 91a57d3 into pulp:master Jun 3, 2021
@dralley dralley deleted the blocking-in-content-handler branch June 3, 2021 18:15