Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parsers.xpath): Add Concise Binary Object Representation parser #13480

Merged
merged 10 commits into from
Jun 28, 2023

Conversation

srebhan
Copy link
Contributor

@srebhan srebhan commented Jun 22, 2023

resolves #13464

This PR adds support for parsing Concise Binary Object Representation (CBOR) messages using XPath expressions.

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins labels Jun 22, 2023
@srebhan srebhan mentioned this pull request Jun 22, 2023
@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Jun 23, 2023
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome and so simple. What do you think about some additional test cases for other layouts that we can reference as examples? Maybe we can ask the requester for some?

I will let you merge once you hear back from the original requester.

go.mod Outdated Show resolved Hide resolved
@tulku
Copy link
Contributor

tulku commented Jun 23, 2023

Hi! Sorry for the delay in commenting. Thank you very much for working on this!
I also have some problems with the XPaths queries. However, I can already provide some feedback.

I like the batch config because it makes it more robust to new keys being added to the messages. Currently, when using this configuration all fields are of type string (as documented here), even when using xpath_native_types = true.

As far as I understood from tests and documentation, xpath_native_types is used when fields are explicitly defined, but no XPath type function (like number()) is called in the field definition (like in this example. It would be nice to respect the field types even in batch mode.

Another inconvenience I found when testing with our payloads is that we use numbers as keys. Not even numbers in a string, like "123", just 123. This was done to reduce the size of the messages we send. We map those numerical keys to names on reception. Currently, I don't care about this mapping, I would like to store directly these numbers as field names. (Field names could be strings with the numbers, that's fine). I could not figure out how to match numbers for fields in XPath queries, and even if they were "123" strings, you cannot have names that start with numbers as far as I understood.

I don't know if these numbers as keys are really something that could be changed (as it looks like XPath does not support it), but its something that we could change on our end, and send shorter strings instead of numbers, sacrificing some message size. However, respecting the types would be really nice, so that we could use the batch configuration option.

I will provide examples derived from our messages, changing the keys, so that they work, but I need a bit more time (and learn more about XPath haha)

Thanks again! This is amazing!

@srebhan
Copy link
Contributor Author

srebhan commented Jun 23, 2023

What do you think about some additional test cases for other layouts that we can reference as examples? Maybe we can ask the requester for some?

Totally would love to see more test cases/examples! I just don't have real world data so if the requester provides some including "query targets" that would be awesome...

@srebhan
Copy link
Contributor Author

srebhan commented Jun 23, 2023

[...] Currently, when using this configuration all fields are of type string (as documented here), even when using xpath_native_types = true.

This should be fixed now. Would love to get a confirmation on this with the latest version.

Another inconvenience I found when testing with our payloads is that we use numbers as keys. Not even numbers in a string, like "123", just 123. This was done to reduce the size of the messages we send. We map those numerical keys to names on reception. Currently, I don't care about this mapping, I would like to store directly these numbers as field names. (Field names could be strings with the numbers, that's fine). I could not figure out how to match numbers for fields in XPath queries, and even if they were "123" strings, you cannot have names that start with numbers as far as I understood.

This is a tricky thing... Numerical node names are not supported in XPath so I need to convert them to strings. Furthermore, Telegraf only supports string-types for field names. This being said, the best I can offer is to use the string-representation of these numbers, but you need to re-convert them to integers on the receiver side... Or even easier you send the field-names/keys as string like "123".

In any case, if you can provide some example data and queries with their expected result I would love to add them as test-cases to prevent regressions by future changes and as examples for other users.

@tulku
Copy link
Contributor

tulku commented Jun 23, 2023

@srebhan thanks for the update. I can confirm that the xpath_native_types = true now works as expected. This is a great addition!

Regarding the numbers, it would not be a problem if the numbers (123) are converted to strings ("123") in the Telegraf field names. I don't really need them to number once they are received. However, from what I understood numerical node names are not supported even if they are strings with numbers. Hopefully, I am wrong though.

I will provide a series of examples, with the original keys that we use, another with the "123" type strings (which I could not really get working with the XPath queries either), and a third variation with regular names as keys.

@srebhan
Copy link
Contributor Author

srebhan commented Jun 23, 2023

However, from what I understood numerical node names are not supported even if they are strings with numbers. Hopefully, I am wrong though.

As long as the names are strings, the content is irrelevant. A node name of "123" is fine because the datatype is string. So you should be wrong indeed. :-)

Btw: Did you think about zipping the transmitted packages? If they contain a decent amount of strings you could get out a good compression ratio...

@tulku
Copy link
Contributor

tulku commented Jun 24, 2023

@srebhan Regarding having numbers as keys. In CBOR they are valid, but not in XPath/Telegraf, so I think it would make sense to always convert the keys to strings (and document it as such in the plugin). Currently, the parser (even when using batch configuration) will output errors like:

2023-06-24T10:44:11Z D! [inputs.http_listener_v2] Parse error: object key 261 (uint64) is not a string

and there is no way to configure the plugin to avoid this problem (at least that I can see) and accept the inputs.

@tulku
Copy link
Contributor

tulku commented Jun 24, 2023

Regarding queries of node names that are numbers only, I am also having trouble with them:

D! [inputs.http_listener_v2] Parse error: failed to query timestamp: failed to compile query "//260": expression must evaluate to a node-set

With the following configuration:

  [[inputs.http_listener_v2.xpath]]
    timestamp = "//260"
    metric_name = "'cbor'"
    field_selection = "//*"

However, if I change the keys to start with a letter, for example the timestamp now is "k260" instead of just "260", and change the configuration so that:

timestamp = "//k260"

Works, and the timestamp in that key is used as time for the metric.

@tulku
Copy link
Contributor

tulku commented Jun 24, 2023

I was finally able to put some examples together. I was unsure how to better provide them, so I pushed them here:
https://github.com/tulku/cbor_examples

There you'll find Python code that generates the CBOR (and Json for reference) messages which are cleaned-up versions of real messages we manage, as well as the Telegraf configuration I was using to do the tests.

I am happy with the resulting configuration:

  1. Known keys can be translated to their real names (done in this section)
  2. Unknown keys are just saved as field names using field_selection.
  3. I can mark known keys to be tags, with this section.
  4. Values can be bytestrings, and those are properly stored in influx. I also included a Python script to query those values and verify that the stored value is correct. The only API that works is query_raw as csv or the table API (of the python influx_client) just crashes when getting the bytestring data.

However, for this to work, I needed to change the keys from numbers to strings and add a letter at the start (k in this case).

As stated before, it would be great if:

  1. The plugin converts non-string keys to string.
  2. I would figure out how to make XPath queries that work with all numbers in the strings. If it is indeed impossible, maybe the plugin could add a letter as part of the 'string' operation. Though this is very hacky.

Regarding the binary data, I tried adding this line:

fields_bytes_as_hex = ["sensors_data", "sensors_calibration", "k6433", "k6434"]

right after this one, but it did not have any effect. That would be nice to avoid the problem of storing non-character bytes. Not sure if this is not how I am supposed to use it, but I could not make it work and it would be great.

@tulku
Copy link
Contributor

tulku commented Jun 24, 2023

I forgot to talk about the base64 experiments included in the linked repo. It would be nice to have an option like fields_bytes_as_hex but that would encode to base64 instead of an hex string.
I know that this is out of the scope of this PR and probably better suited for a processor plugin that just does the encoding.

Having the base64 option is nice because the resulting string is significantly smaller than when encoding as a hex string.
For example for the binary example included in the repo:
Original size as f16: 1536 bytes, encoded in b64: 2097 bytes, and as a hex string: 3121 bytes, almost 1.5 times bigger.

@tulku
Copy link
Contributor

tulku commented Jun 26, 2023

Thanks for adding support for numerical keys @srebhan! I just tested it here and works great. I really appreciate the work put into this.

Out of the things that I tested, the only remaining issue would be that I can't get the option fields_bytes_as_hex to work as expected.

@telegraf-tiger
Copy link
Contributor

@srebhan
Copy link
Contributor Author

srebhan commented Jun 27, 2023

@tulku hex encoding works for me. Added unit-test to show it...

@tulku
Copy link
Contributor

tulku commented Jun 27, 2023

Thank you very much for the test! I realized what was my problem.

This configuration abstract works for me:

  [[inputs.http_listener_v2.xpath]]
    timestamp = "//n260"
    metric_name = "'binary_test_1'"
    field_selection = "descendant::*[not(local-name()='n260')]"
    fields_bytes_as_hex = ["n6433", "n6434"]

    [inputs.http_listener_v2.xpath.fields]
      sys_alive = "//n261"
      sys_boot_count = "//n263"

Mean that keys n6433 and n6434 will be appropriately converted to hex strings.

However, what I was trying to do and failing was defining a field, and having that field be converted to a hex string, for example:

  [[inputs.http_listener_v2.xpath]]
    timestamp = "//n260"
    metric_name = "'binary_test_1'"
    field_selection = "descendant::*[not(local-name()='n260')]"
    fields_bytes_as_hex = ["sensor_data", "sensor_calibration"]

    [inputs.http_listener_v2.xpath.fields]
      sensor_data = "//n6433"
      sensor_calibration = "//n6434"

If in this example, I change fields_bytes_as_hex = ["sensor_data", "sensor_calibration"] to fields_bytes_as_hex = ["n6433", "n6434"] it will create 4 fields, two of them as binary strings directly and two of them converted to hex strings.

Thank you a lot for the clarification and the extra unittests! And sorry for the extra work!

Again, it is amazing to have this plugin now. Thank you a lot! From what I could test, this is ready to merge.

@srebhan
Copy link
Contributor Author

srebhan commented Jun 28, 2023

@tulku can you please open a separate issue for the hex-encoding as this is true for all formats XPath supports. If this is ok for you, I would love to merge this PR and handle the bug-fix in another one to get it out in the next bugfix release...

@srebhan
Copy link
Contributor Author

srebhan commented Jun 28, 2023

@powersj can you please give the PR a final review and then we are good to go. As mentioned, the hex-encoding problem is a separate issue that should receive a bug-fix PR so we can roll it out in v1.27.2 in case I manage to fix it in time...

@srebhan srebhan requested a review from powersj June 28, 2023 19:53
@powersj powersj merged commit c35cabd into influxdata:master Jun 28, 2023
24 checks passed
@tulku
Copy link
Contributor

tulku commented Jun 29, 2023

@srebhan @powersj Again, thanks a lot for this work!
I think that the current behavior of hex-encoding is not necessarily wrong, we might need to clarify that it only works on original keys and not on defined fields.

I was thinking of adding a base64 encoding option. Would you review/support such a feature? If that is the case, I can create an issue and attempt a PR for it.

@srebhan
Copy link
Contributor Author

srebhan commented Jul 3, 2023

@tulku it is wrong. You request this on explicit fields and it does not work. That's clearly unexpected and a bug. Would be nice if you could open an issue so we do not forget about this.

@tulku
Copy link
Contributor

tulku commented Jul 3, 2023

@srebhan issue created. Thanks again! (#13531)

@srebhan srebhan added this to the v1.28.0 milestone Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CBOR parser plugin
3 participants