Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlalchemy: make sqlalchemy instrumentation thread safe #315

Merged
merged 4 commits into from
Mar 25, 2021

Conversation

cbrand
Copy link
Contributor

@cbrand cbrand commented Feb 4, 2021

Description

This change ensures the SQLAlchemy instrumentation to b thread safe.

In tests using Flask with gunicorn and uwsgi we were often getting a "Trying to end span which has already been ended" error. After checking we realized that this has to do with the SQLAlchemy EngineTracer using one local variable, namely current_span to try to track the span of the SQL request. If threading is involved and multiple SQL commands are running at the same time, this breaks as current_span will be overwritten before it has been ended. The new span will then be tried to be ended twice.

To remedy this, the changes introduce a tracking mechanism over SQLAlchemy cursors, tracking via a weakref dictionary all spans which are currently existing on a cursor level.

To still support error handling, we also store the span in a thread local variable to esnure that _handle_error can be utilized. This will break if a query is executed in another thread then _before_cur_exec has been called. I think that this is highly unlikely to actually outside of an artifical test written happen.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Run parallel requests on a web server with ab executing multiple sql statements in parallel and check jaeger if the instances are recorded as expected.

Does This PR Require a Core Repo Change?

  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added (no functionality which can be unit tested)
  • Documentation has been updated (no change necessary)

@cbrand cbrand requested a review from a team as a code owner February 4, 2021 22:37
@cbrand cbrand requested review from codeboten and lzchen and removed request for a team February 4, 2021 22:37
@cbrand cbrand force-pushed the main branch 3 times, most recently from 8d3a8fe to a71b838 Compare February 5, 2021 08:24
@cbrand cbrand changed the title sqlalchemy: make sqlalchemy thread safe sqlalchemy: make sqlalchemy instrumentation thread safe Feb 5, 2021
@cbrand
Copy link
Contributor Author

cbrand commented Feb 21, 2021

@codeboten @lzchen Sorry for pushing this once.

I think this is a serious issue with the current sqlalchemy instrumentation which occurs on all systems which use sqlalchemy in production and breaks functionality there occuring into traces not being sent to jaeger and dropped due to the UDP buffer being full. So it would be good to get this merged and soon into a new version.

We currently also forked the project and host an internal version, but I would really like to see it also being back in the mainline.

@codeboten
Copy link
Contributor

@cbrand not sure if this is something a test could be added to verify or not, any thoughts?

@cbrand
Copy link
Contributor Author

cbrand commented Feb 22, 2021

@cbrand not sure if this is something a test could be added to verify or not, any thoughts?

I couldn't think of a way to write a test which would reliably test this behavior. That's why I didn't add one (See also the comment in the checklist)

@lzchen
Copy link
Contributor

lzchen commented Feb 24, 2021

@codeboten
Should we proceed without a test?

@owais
Copy link
Contributor

owais commented Mar 5, 2021

Can we add integration/docker test? It might be easier to model this use case over there.

@cbrand
Copy link
Contributor Author

cbrand commented Mar 10, 2021

@owais @lzchen @codeboten I don't think that will help either as it requires accidental timing on multple threads which isn't really controllable on an integration test level.

I would recommend merging this with the existing unit tests, ensuring that during "normal" flow it runs as expected.

@owais
Copy link
Contributor

owais commented Mar 10, 2021

@owais @lzchen @codeboten I don't think that will help either as it requires accidental timing on multiple threads which isn't really controllable on an integration test level.

I would recommend merging this with the existing unit tests, ensuring that during "normal" flow it runs as expected.

I'm fine with merging this if this is a critical fix but I feel strongly that it should be followed up with some sort of tests to ensure the issue this PR fixed doesn't happen again. I understand it can be harder to reproduce but if we know the exact scenario it shouldn't be hard to write an example script that deliberately triggers the case and if we have that, we have an integration test.

Original description above states:

If threading is involved and multiple SQL commands are running at the same time, this breaks as current_span will be overwritten before it has been ended. The new span will then be tried to be ended twice.

With some thread coordination and/or time.sleep, I think this should be simple enough to reproduce even in a unit test.

@cbrand
Copy link
Contributor Author

cbrand commented Mar 18, 2021

Sorry for not responding any more, I suffer under some very tight deadlines at the moment. I don't have capacity to do this now or in the near future.

My recommendation are thus giving the time constraints I am currently under:

  • If you don't feel comfortable merging this at all please retract the current implementation of SQLAlchemy or clearly state that its broken in the documentation and package description, because it has the tendendency to drop all other traces in the overall system.
  • Otherwise please feel free @owais or @lzchen to implement the unit test which you envision for feeling more comfortable deploying this.

I think keeping the status quo is the worst possible option though, so I would definitely recommend to not do this.

@owais
Copy link
Contributor

owais commented Mar 18, 2021

I say let's merge this and open an issue to add the test.

@lzchen
Copy link
Contributor

lzchen commented Mar 22, 2021

@codeboten
Could you create an issue for this? I don't have all the context.

Copy link
Contributor

@codeboten codeboten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a followup item to add the tests here #383

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants