
Send Zipkin spans in json format in a separate process #8150

Conversation

cattibrie (Contributor) commented Aug 8, 2019

Problem

In the py_zipkin implementation, Zipkin spans are sent only after they have all been recorded. When the number of spans reaches several hundred thousand, the time overhead of encoding and sending them is significant.
For example, for 70,000 spans it is ~7 seconds with thrift encoding and ~16 seconds with JSON.

Because of how thrift encoding is implemented in py_zipkin, the time to encode in JSON is smaller than in thrift.
The overhead of sending that many spans is also too big.

Solution

Zipkin spans are encoded in json and sent to the Zipkin server from a separate process.
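A minimal sketch of that approach (hypothetical helper name; the PR's actual classes are ZipkinSpanSender and HTTPTransportHandler): dump the JSON-encoded spans to a file, then hand the upload to curl in a child process so the parent does not block on the POST:

```python
import json
import subprocess
import tempfile

def send_spans_async(spans, endpoint):
    """Write JSON-encoded spans to a file and POST them from a child
    process, so the main process does not block on network time."""
    with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False) as f:
        json.dump(spans, f)
        file_path = f.name
    # curl reads the payload from the file ("@path" syntax) and runs
    # in a separate process; the parent returns immediately.
    args = [
        "curl", "-s", "-X", "POST",
        "-H", "Content-Type: application/json",
        "--data", "@{}".format(file_path),
        endpoint,
    ]
    return subprocess.Popen(args,
                            stdin=subprocess.DEVNULL,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
```

The endpoint URL and the decision to shell out to curl (rather than use requests) mirror the discussion below; error handling is elided in this sketch.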

Possible follow-up work:

  • make thrift encoding faster in py-zipkin.
  • encode and send v2 spans from the rust part (without passing them to python to be encoded and sent with py-zipkin).
  • record fewer spans (e.g. do not record spans that take less than 100 milliseconds).

Result

This is a temporary solution that allows encoding and sending Zipkin spans without noticeable overhead.

@cattibrie cattibrie force-pushed the cattibrie:etyurina/DPB-11348-send-zipkin-spans-async-in-json branch from 70cd4f7 to e2a4d58 Aug 9, 2019

@cattibrie cattibrie changed the title WIP: Send zipkin spans in json format in a separate process Send Zipkin spans in json format in a separate process Aug 9, 2019

illicitonion (Contributor) left a comment

Generally looks good, a few things to clean up and document :) Thanks!

)
command = 'curl -v -X POST -H "Content-Type: application/json" --data @{} {}'.format(file_path, self.endpoint)
args = command.split(' ')
p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

illicitonion (Contributor), Aug 9, 2019

We probably want to connect all of stdin, stdout, and stderr to /dev/null by doing something like:
Popen(args, stdin=os.devnull, stdout=os.devnull, stderr=os.devnull)
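One note on the sketch above: os.devnull is just the path string "/dev/null"; Popen's stream arguments expect the subprocess.DEVNULL constant (Python 3.3+) or an open file object. A working version of the detached call (with a stand-in command in place of curl):

```python
import subprocess
import sys

# Detach the child from all three standard streams. subprocess.DEVNULL
# (not the os.devnull path string) is the constant Popen understands.
child = subprocess.Popen(
    [sys.executable, "-c", "print('sending spans')"],  # stand-in for curl
    stdin=subprocess.DEVNULL,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
```

On Python 2 (which pants still supported at the time), the equivalent is passing an `open(os.devnull, 'r+b')` file object instead.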

args = command.split(' ')
p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

logger.debug("pid of child process that sends spans to Zipkin server: {}".format(p.pid))

illicitonion (Contributor), Aug 9, 2019
Suggested change
logger.debug("pid of child process that sends spans to Zipkin server: {}".format(p.pid))
logger.debug("Sending spans to Zipkin server from pid {}".format(p.pid))
@@ -191,3 +204,129 @@ def bulk_record_workunits(self, engine_workunits):

def from_secs_and_nanos_to_float(secs, nanos):
return secs + (nanos / NANOSECONDS_PER_SECOND)


def emit_spans(self):

illicitonion (Contributor), Aug 9, 2019

Can you add some comments explaining why this exists, what it's mostly copied from, and what changes you've made to it?


class ZipkinSpanSender(object):

MAX_BATCH_SIZE = 100

illicitonion (Contributor), Aug 9, 2019

I would inline this constant because it's only used in one place, and it's clear how it's used from context.

Having this named as it currently is, and existing where it does, implies that this value will always be used.

Otherwise, you could perhaps call it DEFAULT_MAX_BATCH_SIZE :)
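For context, whatever the constant ends up being called, it bounds how many spans go into a single HTTP payload. A minimal chunking helper (hypothetical, not from the PR's diff) shows the intent:

```python
def batch_spans(spans, max_batch_size=100):
    """Split spans into batches of at most max_batch_size each,
    so no single HTTP POST carries an unbounded payload."""
    return [spans[i:i + max_batch_size]
            for i in range(0, len(spans), max_batch_size)]
```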


MAX_BATCH_SIZE = 100

def __init__(self, transport_handler, max_span_batch_size, encoder, zipkin_spans_dir, encoding):

illicitonion (Contributor), Aug 9, 2019

If things are optional, we tend to make them kwargs with default values; so something like:
def __init__(self, transport_handler, encoder, zipkin_spans_dir, encoding, max_span_batch_size=100):

But I'm also not sure why this is optional in the first place; why don't we require the caller to specify the max_span_batch_size?

cattibrie (Author, Contributor), Aug 10, 2019

I am not sure that a caller knows (or should know) what a max_span_batch_size value should be.
Since max_span_batch_size is an option (#8075), I will not use MAX_BATCH_SIZE and will pass this value to ZipkinHandler from the command-line option.


MAX_BATCH_SIZE = 100

def __init__(self, transport_handler, max_span_batch_size, encoder, zipkin_spans_dir, encoding):

illicitonion (Contributor), Aug 9, 2019

I'd add to a docstring here that the encoder/encoding must be JSON (because the class assumes that it is)

headers={'Content-Type': 'application/x-thrift'},
)
command = 'curl -v -X POST -H "Content-Type: application/json" --data @{} {}'.format(file_path, self.endpoint)
args = command.split(' ')

illicitonion (Contributor), Aug 9, 2019

I suspect you don't actually want to split on spaces, because you have some quoted things. Maybe just make your list manually:

command = [
  "curl",
  "-v",
  "-X", "POST",
  "-H", "Content-Type: application/json",
  ...
]
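The problem with str.split(' ') is concrete: the quoted header tears into two argv entries. Building the list by hand (as suggested) or using shlex.split preserves the quoting:

```python
import shlex

command = ('curl -X POST -H "Content-Type: application/json" '
           'http://localhost:9411/api/v2/spans')

# Naive splitting breaks the quoted header into two argv entries:
naive = command.split(' ')
# shlex.split honors shell quoting, keeping the header as one argument:
quoted = shlex.split(command)
```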

@cattibrie cattibrie force-pushed the cattibrie:etyurina/DPB-11348-send-zipkin-spans-async-in-json branch 2 times, most recently from 9884ae4 to a55af72 Aug 11, 2019

illicitonion (Contributor) left a comment

Much cleaner!

Just some really minor comments :) Thanks!!

data=payload,
headers={'Content-Type': 'application/x-thrift'},
)
if os.path.exists(self.zipkin_spans_dir):

illicitonion (Contributor), Aug 13, 2019

Let's remove this check - we generally trust that people don't mess with the workdir underneath us.

If we want to keep it, I'd probably phrase it more as:

if not os.path.exists(self.zipkin_spans_dir):
  log.error("Not uploading zipkin spans because directory {} got deleted".format(self.zipkin_spans_dir))
  return

Two things about that:

  1. Doing the "if not - early return" thing means you don't end up indenting all of the code a bunch (which means you have less context to think about when reading it)
  2. Logging that we're not doing something because of something we didn't expect, which indicates something went wrong somewhere

args = ['curl', '-v',
'-X', 'POST',
'-H', '"Content-Type: application/json"',

illicitonion (Contributor), Aug 13, 2019

I don't think you need the double-quoting here

@@ -19,19 +21,43 @@


class HTTPTransportHandler(BaseTransportHandler):

illicitonion (Contributor), Aug 13, 2019

Maybe rename this to AsyncHTTPTransportHandler?

self.zipkin_spans_dir = os.path.join(self.run_tracker.run_info_dir, 'zipkin')
self.handler = HTTPTransportHandler(endpoint, self.zipkin_spans_dir, self.encoding)

# Derictory to store encoded spans.

illicitonion (Contributor), Aug 13, 2019
Suggested change
# Derictory to store encoded spans.
# Directory to store encoded spans.

@cattibrie cattibrie force-pushed the cattibrie:etyurina/DPB-11348-send-zipkin-spans-async-in-json branch from 9fd5e72 to 9b98a0a Aug 14, 2019

illicitonion (Contributor) left a comment

Looks good to merge to me - I'm confused about the broken shard... Anyone have any ideas? @Eric-Arellano @stuhood @benjyw ?

illicitonion (Contributor) commented Aug 14, 2019

Blowing away the cache appears to have fixed that, but now a test is timing out... Probably unrelated?

stuhood (Member) commented Aug 14, 2019

Blowing away the cache appears to have fixed that, but now a test is timing out... Probably unrelated?

Please see https://groups.google.com/forum/#!topic/pants-devel/pN2fzqCIk-k/discussion

@cattibrie cattibrie force-pushed the cattibrie:etyurina/DPB-11348-send-zipkin-spans-async-in-json branch from 9b98a0a to 9475a9a Aug 14, 2019

@cattibrie cattibrie force-pushed the cattibrie:etyurina/DPB-11348-send-zipkin-spans-async-in-json branch from 9475a9a to 61a96de Aug 19, 2019

cattibrie (Author, Contributor) commented Aug 19, 2019

@stuhood I would be very thankful if you could blow away the cache for the last red shard and merge the PR. Thanks in advance!

@stuhood stuhood merged commit c81d21d into pantsbuild:master Aug 20, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
patliu85 added a commit to twitter/pants that referenced this pull request Aug 20, 2019
Send Zipkin spans in json format in a separate process (pantsbuild#8150)
illicitonion added a commit that referenced this pull request Aug 29, 2019
Send Zipkin spans in json format in a separate process (#8150)