Line protocol write API #2696

Merged
merged 5 commits into alpha1 from jw-write-path May 29, 2015

Conversation

@jwilder
Collaborator

jwilder commented May 29, 2015

This PR adds a new write HTTP endpoint (/write_points) that uses a text-based line protocol instead of JSON. The protocol is a list of points separated by newlines (\n).

Each point is composed of three blocks separated by whitespace. The first block is the measurement name and tags separated by commas. The second block is fields separated by commas. The last block is optional and is the timestamp for the point as a unix epoch in nanoseconds.

measurement[,tag=value,tag2=value2...] field=value[,field2=value2...] [unixnano]

Each point must have a measurement name. Tags are optional. Measurement names, tag keys, and tag values cannot contain spaces. If a value contains a comma, it needs to be escaped as \,.

Each point must have at least one field. The format of a field is name=value. Fields can be one of four types: integer, float, boolean, or string. Integers are all numeric and cannot have a decimal point (.). Floats are all numeric and must have a decimal point. Booleans are the values true and false. Strings must be surrounded by double quotes ("). If a string value contains a quote, it must be escaped as \". There can be no spaces between consecutive field values.

For example,

cpu,host=serverA,region=us-west value=1.0 10000000000
cpu,host=serverB,region=us-west value=3.3 10000000000
cpu,host=serverB,region=us-east user=123415235,event="overloaded" 20000000000
mem,host=serverB,region=us-east swapping=true 2000000000

Points written in this format should be sent to the /write_points endpoint. The request should be a POST with the points in the body of the request. The content can also be gzip encoded.
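As a sketch of building a line in this format (a hypothetical client-side helper, not code from this PR), tags are written in sorted key order since that is the form the server parses fastest:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// marshalPoint builds one line of the protocol described above:
// measurement[,tags] fields [unixnano]. Tag and field values are
// passed pre-formatted; sorting the field keys just makes the
// output deterministic, only the tag ordering matters to the server.
func marshalPoint(measurement string, tags, fields map[string]string, unixnano int64) string {
	var sb strings.Builder
	sb.WriteString(measurement)

	tagKeys := make([]string, 0, len(tags))
	for k := range tags {
		tagKeys = append(tagKeys, k)
	}
	sort.Strings(tagKeys)
	for _, k := range tagKeys {
		sb.WriteString("," + k + "=" + tags[k])
	}

	fieldKeys := make([]string, 0, len(fields))
	for k := range fields {
		fieldKeys = append(fieldKeys, k)
	}
	sort.Strings(fieldKeys)
	parts := make([]string, 0, len(fieldKeys))
	for _, k := range fieldKeys {
		parts = append(parts, k+"="+fields[k])
	}
	sb.WriteString(" " + strings.Join(parts, ","))

	sb.WriteString(fmt.Sprintf(" %d", unixnano))
	return sb.String()
}

func main() {
	fmt.Println(marshalPoint("cpu",
		map[string]string{"region": "us-west", "host": "serverA"},
		map[string]string{"value": "1.0"},
		10000000000))
	// cpu,host=serverA,region=us-west value=1.0 10000000000
}
```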

The following URL params may also be sent:

  • db: required. The database to write points to.
  • rp: optional. The retention policy to write points to. If not specified, the default retention policy will be used.
  • precision: optional. The precision of the timestamps (n, u, ms, s, m, h). If not specified, n is used.
  • consistency: optional. The write consistency level required for the write to succeed. Can be one of one, any, all, quorum. Defaults to all.
  • u: optional. The username for authentication.
  • p: optional. The password for authentication.

A successful response to the request will return a 204. If a parameter or point is not valid, a 400 will be returned.
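Assembling the request URL from those params can be sketched as follows (the localhost:8086 address is an assumption, not something this PR specifies):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildWriteURL assembles a /write_points URL from the parameters
// listed above; db is required, the rest are optional.
func buildWriteURL(db, rp, precision, consistency string) string {
	params := url.Values{}
	params.Set("db", db)
	if rp != "" {
		params.Set("rp", rp)
	}
	if precision != "" {
		params.Set("precision", precision)
	}
	if consistency != "" {
		params.Set("consistency", consistency)
	}
	// localhost:8086 is an assumed server address.
	return "http://localhost:8086/write_points?" + params.Encode()
}

func main() {
	// The newline-separated points go in the POST body,
	// optionally gzip-encoded.
	fmt.Println(buildWriteURL("mydb", "default", "s", "one"))
	// http://localhost:8086/write_points?consistency=one&db=mydb&precision=s&rp=default
}
```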


PR Notes:

The parser has been tuned to minimize allocations and extra work during parsing. For example, the raw byte slice read in is held onto as long as possible, until there is a need to modify it. Similarly, values are not unmarshaled into Go types until necessary. It also tries to validate the input in a single pass over the data as much as possible. Tags need to be sorted, so it is preferable to send them already sorted to avoid sorting on the server. The sort has been tuned as well so that it performs consistently over a large range of inputs.
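The "avoid sorting when the input is already sorted" idea can be sketched like this (illustrative only, not the PR's actual parser code):

```go
package main

import (
	"fmt"
	"sort"
)

// sortTagsIfNeeded only pays for a sort when the client did not
// pre-sort the tag key=value pairs, mirroring the tuning described
// in the PR notes.
func sortTagsIfNeeded(tags []string) []string {
	if !sort.StringsAreSorted(tags) {
		sort.Strings(tags)
	}
	return tags
}

func main() {
	// Pairs as they might arrive on the wire, unsorted.
	fmt.Println(sortTagsIfNeeded([]string{"region=us-west", "host=serverA"}))
	// [host=serverA region=us-west]
}
```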

My local benchmarks have parsing performing at around 750k-2M points/sec, depending on the shape of the point data.

jwilder added some commits May 28, 2015

Add text protocol parsing and serialization for points
This changes the implementation of point to minimize the extra
processing needed to parse and marshal point data through the system.

@jwilder jwilder added the 2 - Working label May 29, 2015

@pauldix

Member

pauldix commented May 29, 2015

Measurement names, tags, tag values, field names, and field values should all be able to have spaces, provided they are escaped. I assume the characters to escape are the comma, double quote, equals sign, and space.

The default consistency should be changed to one.

@pauldix

Member

pauldix commented May 29, 2015

Overall looks good except for the updates to support spaces. Can you also update the Go client to use this endpoint and format instead?

@jwilder

Collaborator

jwilder commented May 29, 2015

Ok. I'll fix the escaping and default consistency. I'll do a separate pr for the client.

@otoolep

Contributor

otoolep commented May 29, 2015

Good idea. Simple, but we can still curl it up.

@beckettsean

Contributor

beckettsean commented May 29, 2015

"The last block is optional and is the timestamp for the point as a unix epoch in nanoseconds." and "precision: optional The precision of the time stamps (n, u, ms, s, m, h). If not specified, n is used." seem to be in conflict with each other.

Perhaps that first sentence should just be "The last block is optional and is the timestamp for the point as a unix epoch." The documentation for the optional precision query param does say the default is nanoseconds if none is supplied.

@beckettsean

Contributor

beckettsean commented May 29, 2015

Will this line protocol support Unicode characters in strings or only ASCII?

@pauldix

Member

pauldix commented May 29, 2015

@beckettsean you should note prominently in the docs that the tags should be sorted by the tag keys.

@corylanou

Contributor

corylanou commented May 29, 2015

Sorted for optimal performance, right? If they aren't sorted, we will sort them, but I think that currently drops performance by ~50%?

@jwilder

Collaborator

jwilder commented May 29, 2015

@beckettsean I added a test for unicode just now. Seems to work for string field values at least. I'll need to check the other spots though.

jwilder added a commit that referenced this pull request May 29, 2015

@jwilder jwilder merged commit 99bc7d2 into alpha1 May 29, 2015

1 check passed

ci/circleci Your tests passed on CircleCI!

@jwilder jwilder removed the 2 - Working label May 29, 2015

@jwilder jwilder deleted the jw-write-path branch May 29, 2015

@beckettsean

Contributor

beckettsean commented May 29, 2015

@pauldix For the docs, Tag Keys should be ASCII sorted? Alpha sorted?

basically, what's the order for the following characters:

1 a A å Å _ - , . <space> <tab>

Are any of those illegal characters for this protocol?

@pauldix

Member

pauldix commented May 29, 2015

@corylanou that's right. In the docs for the protocol we should push them in the right direction. Meaning they should sort the tags.

Also, even though the timestamp is optional, we should push them to include it. This becomes important if they're running a cluster and they get a response back on the client side that said it was only a partial write. In most cases they would want to repost the data. However, they'll potentially get duplicates. UNLESS they include the timestamp for each point in their request to write.

@beckettsean

Contributor

beckettsean commented May 29, 2015

@pauldix we will make clear in the docs the potential gotchas with not supplying a timestamp. The particular issue you describe only affects clusters, correct? For single node setups the writes are atomic so no partial write is possible, as far as I understand it.

@pauldix

Member

pauldix commented May 29, 2015

@beckettsean technically it's still possible with a single server if they're writing data with different timestamps that end up dividing the points across multiple shards.

However, if they do a write of a bunch of points with no timestamp specified, and it's a single server, then a partial write isn't possible. It'll either succeed entirely or fail i.e. Atomic.

@gunnaraasen

Member

gunnaraasen commented May 29, 2015

Similar to what @pauldix mentioned. It would be great if measurement, tag key, tag values, field keys, and field values all followed the query language's identifier definition as closely as possible.

  • unquoted identifiers must start with an upper or lowercase ASCII character or "_"
  • unquoted identifiers may contain only ASCII letters, decimal digits, and "_"
  • double quoted identifiers can contain any unicode character other than a new line
  • double quoted identifiers can contain escaped " characters (i.e., \")

And string literals are single quoted.

Not sure if those rules make sense for the line format, but users will probably expect it to be similar to the query language identifier definition.

@neonstalwart

Contributor

neonstalwart commented May 29, 2015

how about rather than a new endpoint, continue to use /write but switch behavior based on the Content-Type header? this would be text/plain and the JSON would be application/json. that's a fairly common way to interact via HTTP.

@pauldix

Member

pauldix commented May 29, 2015

@gunnaraasen Those rules don't quite make sense for the line format mainly because this is so strict we know where the tag keys vs tag values are. The query language is much more flexible so it has more limitations.

Requiring double quotes around the identifiers (measurement names, tag keys, and field keys) would be unnecessary for the line protocol for writing and would bloat the message.

@pauldix

Member

pauldix commented May 29, 2015

@neonstalwart the problem is that previously we didn't require a Content-Type to be set, so that could break the existing write API for a bunch of people.

Going forward, the JSON write endpoint is going to be deprecated and this new endpoint is going to be the preferred way of writing data in.

Also, in my experience, many people have trouble interacting with HTTP APIs that require you to set things in the headers. I know it's part of the thing, but if we're optimizing for ease of use, a different endpoint is the best.

@neonstalwart

Contributor

neonstalwart commented May 29, 2015

the problem is that previously we didn't require a Content-Type to be set, so that could break the existing write API for a bunch of people.

that's kind of a weak argument given timestamp -> time, name -> measurement.

@pauldix

Member

pauldix commented May 29, 2015

@neonstalwart I guess that's true since we broke it before. Not buying the usability argument? :)

@neonstalwart

Contributor

neonstalwart commented May 29, 2015

Going forward, the JSON write endpoint is going to be deprecated and this new endpoint is going to be the preferred way of writing data in.

😞 would you reconsider? i think HTTP+JSON is more or less the de facto way i interact with things these days. it's of course not the only way but with the move towards putting each service in its own container and using HTTP to communicate between components it seems very common in my experience.

Also, in my experience, many people have trouble interacting with HTTP APIs that require you to set things in the headers. I know it's part of the thing, but if we're optimizing for ease of use, a different endpoint is the best.

i agree that many people find it difficult to use HTTP. i just think that a /write that accepts json and a /write_points that accepts plain text is kind of awkward. to be kind of brutal, as a consumer considering a product, the API feels cheap (not well crafted and thought out) which leaves me wondering about the quality of the code. the API is your public face of this product.

@majst01

Contributor

majst01 commented Jun 8, 2015

Ok this did the trick, same applies to my java impl :-)

@allgeek

allgeek commented Jun 9, 2015

Everywhere under the hood we use a key to identify series. We use this in the underlying storage and we use it to route requests and writes within a cluster. In the case of the line protocol, that's the bytes up to the first space. We can get at it without doing any additional work or allocations.

That sounds like a huge benefit, and overall I'm definitely a fan of the line protocol approach for this use case (and of course the performance impact as a result). However, I'm wondering if this series-key optimization may be problematic if clients aren't pre-sorting the tags as suggested. If the parser re-sorts when necessary but this shortcut is being used, wouldn't these keys still be the unsorted version? In the clustering redesign blog post, an assumption was made that equal points are duplicates. Would there be issues with receiving these two ('equal') points, but not really considering them duplicates?

cpu,host=serverA,region=uswest value=23.2

cpu,region=uswest,host=serverA value=23.2

It seems highly unlikely that a client would send 'duplicates' in different orders like this, but not knowing the full impact of the assumptions being made around these keys and handling of duplicates, my developer's spidey sense was tingling with potential edge case issues when I read the above quote..

@otoolep

Contributor

otoolep commented Jun 10, 2015

The problem you speak of @allgeek is accounted for deep in the system. Rest assured, the two example points you show are considered the same point; the tags are sorted by key on every point before performing the identity check:

https://github.com/influxdb/influxdb/blob/master/tsdb/meta.go#L1099
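The normalization being described can be sketched like this (illustrative only, ignoring escaped commas and spaces; this is not the linked meta.go code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalKey takes the series key (the bytes up to the first space)
// and sorts the tag pairs, so differently ordered tag sets compare equal.
func canonicalKey(line string) string {
	key := strings.SplitN(line, " ", 2)[0]
	parts := strings.Split(key, ",")
	sort.Strings(parts[1:]) // measurement stays first; tag=value pairs are sorted
	return strings.Join(parts, ",")
}

func main() {
	a := canonicalKey("cpu,host=serverA,region=uswest value=23.2")
	b := canonicalKey("cpu,region=uswest,host=serverA value=23.2")
	fmt.Println(a == b, a)
	// true cpu,host=serverA,region=uswest
}
```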

@jwilder

Collaborator

jwilder commented Jun 10, 2015

@allgeek For best performance, you should send them pre-sorted if you can. If they are not sorted, they will be sorted before being stored. If they are already sorted, we don't attempt a sort. The parsing throughput drops by approximately 50% for unsorted tags, proportional to the number of tags present. It is still ~16x faster than JSON though, and moves the bottleneck closer to the disks.

The two points you show would have the same key but different timestamps, since a timestamp is not specified. They would be two different points in the same series. If they both had the same timestamp, then they would be duplicate points.

@otoolep

Contributor

otoolep commented Jun 10, 2015

@jwilder is correct to point out the requirement for an identical timestamp. Just to be clear, in your example I assumed the same timestamp for each point.

@randywallace

randywallace commented Jun 12, 2015

Directly related to this PR, I hacked out a quick rubygem that facilitates using the LineProtocol. I wrote it so that I can follow this up with easy integration into our sensu infrastructure. Comments/concerns/complaints/PR's are warmly welcomed: https://github.com/randywallace/influxdb-lineprotocol-writer-ruby

From my testing, it does correctly sort tag keys automatically.

I spent about 4 hours on this, and in that time couldn't find a reasonable way to get nanosecond/microsecond precision in Ruby (although for our use cases it isn't at all useful); I also didn't test SSL. If anyone wants to help with that, it's appreciated.

@ORANGE-XFM

ORANGE-XFM commented Jun 12, 2015

spaces within strings (i.e. within double quotes) seem to need escaping. example:

    test,a="hello" value=1 --> 204 no content
    test,a="hello there" value=1 --> 400 bad request
    test,a="hello\ there" value=1 --> 204 no content

is this normal? why is this escaping needed?

@jwilder

Collaborator

jwilder commented Jun 22, 2015

@xfmoulet Tag values should not be double-quoted. Just escape the spaces with \ and don't surround with double-quotes.

test,a=hello\ there value=1
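A client-side escaping helper along these lines might look like this (hypothetical sketch, not part of this PR; the escape set follows the discussion above: comma, space, equals sign):

```go
package main

import (
	"fmt"
	"strings"
)

// escapeTag backslash-escapes the characters that would otherwise
// terminate a tag value: comma, space, and equals sign. Tag values
// are not quoted in the line protocol.
func escapeTag(s string) string {
	return strings.NewReplacer(
		",", `\,`,
		" ", `\ `,
		"=", `\=`,
	).Replace(s)
}

func main() {
	fmt.Println("test,a=" + escapeTag("hello there") + " value=1")
	// test,a=hello\ there value=1
}
```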
@edlane

edlane commented Aug 17, 2016

@jwilder I would like to revisit the assumptions made above considering the recent improvements to JSON parsing claimed here: https://github.com/buger/jsonparser#benchmarks
JSON parsing performance has been an acknowledged embarrassment to the Go community in the past, losing out to Python, Ruby, Lua, .... and of course to C.

Rather than abandoning JSON support entirely would it be better to just use a faster library?

@jwilder

Collaborator

jwilder commented Aug 17, 2016

@edlane The JSON endpoint was disabled back in 0.11 and removed in 0.12.

@edlane

edlane commented Aug 17, 2016

@jwilder Yes, that is why I asked the question.

@jwilder

Collaborator

jwilder commented Aug 17, 2016

@edlane We've moved on from JSON on the write path and are very unlikely to add it back. We have a proposal for a v2 line protocol that we're considering as well. JSON also presents some problems with sending int64 values because it only has float numbers.

We have the same performance/memory issues on the query side related to JSON and have been adding support for other formats (CSV, MessagePack). Switching the marshaller to something more performant on the query side might be worthwhile, though.
