
Add json_array_parser parser and assign_keys transformer #30644

Merged
merged 30 commits into open-telemetry:main on Jan 23, 2024

Conversation

RoeiDimi
Contributor

@RoeiDimi RoeiDimi commented Jan 17, 2024

Description:

We use the otel-collector as infrastructure and receive many types of data from a client. The data the client sends is always some form of JSON, and in one use case it is a simple headerless JSON array, so we need a way to parse it and match headers to each field (something similar to what csv_parser does, but with support for the types available in JSON and for nested objects).
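
For illustration, here is a minimal sketch of the kind of pipeline this enables, using the two operators added in this PR. The operator chain mirrors the config discussed later in this thread; the key names are hypothetical examples:

    operators:
      - type: json_array_parser     # parse the headerless JSON array, preserving JSON types and nested objects
        parse_from: body
        parse_to: attributes.tmp
      - type: assign_keys           # match a key to each positional value, like a csv_parser header
        field: attributes.tmp
        keys: ["TimeGenerated", "SourceIP", "SourcePort"]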

Link to tracking Issue:
#30321

Testing:

  • Unit tests
    All the tests found in csv_parser were copied and adjusted, adding test scenarios for different types (numbers, booleans, null) as well as a test for parsing a nested object as part of the JSON array.
  • End-to-end tests
    Used generated traffic on a running otel collector that uses the parser and verified the data is as expected in the destination table.

Documentation:

@RoeiDimi RoeiDimi requested review from djaglowski and a team as code owners January 17, 2024 13:29

linux-foundation-easycla bot commented Jan 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@RoeiDimi RoeiDimi changed the title from "Add a headerless jarray parser operator" to "Add json_array_parser parser and assign_keys transformer" on Jan 22, 2024
Member

@djaglowski djaglowski left a comment


Thanks @RoeiDimi, the design feels right. I left a lot of comments but mostly just nits. A few minor things I think we need to take care of though.

Review comment threads (resolved) on:
  • pkg/stanza/docs/operators/assign_keys.md
  • pkg/stanza/operator/parser/json_array/json_array_parser.go
  • pkg/stanza/operator/transformer/assign_keys/assign_keys.go
@djaglowski
Member

@RoeiDimi, I was able to resolve the issue by adding the new dependency directly to the root module. Thanks for the new features and for iterating on the design with me.

@djaglowski djaglowski merged commit aee7b70 into open-telemetry:main Jan 23, 2024
85 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 23, 2024
@RoeiDimi
Contributor Author

RoeiDimi commented Jan 23, 2024

Thanks @djaglowski, it's been a pleasure working together!

One last question though: assign_keys can't really parse the values straight into attributes (the way csv_parser can).
Can you think of an easy way to make that happen?

I mean, I now understand that all this may not solve my initial issue completely.
With the initial design I could do this:

    - type: csv_parser
      header: TimeGenerated,SourceIP,SourcePort,DestinationIP,DestinationPort,Protocol,SentBytes,ReceivedBytes,ExtID
      parse_from: body
      parse_to: attributes
      is_jarray: true

We have generic exporter code that reads all the key-value pairs from attributes and sends them.

Can you think of a way I could use the json_array_parser -> assign_keys design to achieve that?

@djaglowski
Member

djaglowski commented Jan 23, 2024

@RoeiDimi, it looks like we lost track of the goal somewhat, but luckily you should be able to flatten after assigning keys:

    operators:
      - type: json_array_parser
        parse_from: body
        parse_to: attributes.tmp
      - type: assign_keys
        keys: ["a", "b", "c"]
        field: attributes.tmp
      - type: flatten
        field: attributes.tmp

Edit: I should point out this is not entirely by luck, but rather is a natural benefit of the highly decomposed approach we've always pursued with these operators. With a robust set of simple operations, we can typically compose them to do what's needed.
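
For example, under this config a record would flow roughly as follows (the values are hypothetical):

    # body:                    ["1.2.3.4", 443, true]
    # after json_array_parser: attributes.tmp == ["1.2.3.4", 443, true]
    # after assign_keys:       attributes.tmp == {"a": "1.2.3.4", "b": 443, "c": true}
    # after flatten:           attributes contains {"a": "1.2.3.4", "b": 443, "c": true} and tmp is removed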

@RoeiDimi
Contributor Author

@djaglowski yeah that's pretty cool, thank you so much! :)

I apologize, but I have another question:
About the feature gate we added - how do I enable it? I saw this link

So I tried running the collector with all of the following variations, but none worked:

  • --feature-gates=logs.assignKeys,logs.jsonParserArray
  • --feature-gates=assignKeys,+jsonParserArray
  • --feature-gates=assignKeys,jsonParserArray
  • --feature-gates=+logs.assignKeys,+logs.jsonParserArray

@sumo-drosiek
Member

sumo-drosiek commented Jan 24, 2024

@djaglowski, @RoeiDimi Shouldn't we have this functionality in the OTTL language? I believe I was able to achieve the same functionality using the transformprocessor:

exporters:
  logging:
    verbosity: detailed
processors:
  transform/parsejson:
    log_statements:
      - context: log
        statements:
          # ParseJSON expects `{}` so I need to build it manually
          - set(body, ParseJSON(Concat(["{\"tmp\":", body, "}"], "")))
          - set(body["key1"], body["tmp"][0])
          - set(body["key2"], body["tmp"][1])
          - set(body["key3"], body["tmp"][2])
          - delete_key(body, "tmp")
receivers:
  filelog/containers:
    start_at: beginning
    include:
      - ./tmp/logs/5.json
service:
  pipelines:
    logs/containers:
      exporters:
      - logging
      processors:
      - transform/parsejson
      receivers:
      - filelog/containers

./tmp/logs/5.json

["value1", "value2", "value3"]

output:

LogRecord #0
ObservedTimestamp: 2024-01-24 08:11:49.193363736 +0000 UTC
Timestamp: 1970-01-01 00:00:00 +0000 UTC
SeverityText: 
SeverityNumber: Unspecified(0)
Body: Map({"key1":"value1","key2":"value2","key3":"value3"})
Attributes:
     -> log.file.name: Str(5.json)
Trace ID: 
Span ID: 
Flags: 0
        {"kind": "exporter", "data_type": "logs", "name": "logging"}

@RoeiDimi
Contributor Author

RoeiDimi commented Jan 24, 2024

@sumo-drosiek @djaglowski, this suggestion may be interesting, because after running large-scale stress tests it seems that using the three operators together (json_array_parser, assign_keys, flatten) results in a massive performance hit. Chaining operators does not seem very efficient in general: the initial design, with a single operator (headerless_jarray_parser) doing the same logic as json_array_parser + assign_keys combined, performed significantly better.

Anyway @sumo-drosiek, I'd like to test your suggestion. I tried adding this to the builder config:

processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/transformprocessor v0.90.0

but I'm getting an error saying the revision is wrong:
go: github.com/open-telemetry/opentelemetry-collector-contrib/transformprocessor@v0.90.0: reading github.com/open-telemetry/opentelemetry-collector-contrib/transformprocessor/go.mod at revision transformprocessor/v0.90.0: unknown revision transformprocessor/v0.90.0

Is it not v0.90.0? Can you please share the revision you used?

Thanks guys for helping me through this entirely new world of OTEL collector hehe..

@sumo-drosiek
Member

@RoeiDimi It should be github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor
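
For reference, the corrected builder config entry would then be (assuming the same v0.90.0 version as above):

    processors:
      - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor v0.90.0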

@RoeiDimi
Contributor Author

RoeiDimi commented Jan 25, 2024

Hi @djaglowski
I know this is closed as completed, but after trying to use the solution we went for, I see too big a performance hit to be able to use it. Using these three operators is at least 2x slower than using the original headerless_jarray_parser alone.
I also tried @sumo-drosiek's suggestion (thank you for trying to help!) and it shows roughly the same numbers as using the three operators.

For my benchmarks I'm using Azure Kubernetes, single node, single pod, 5 CPUs. I tested the following:

  1. headerless-jarray-parser can handle ~10-11 million events per minute
  2. The three operators chained (or the processor suggestion): ~4-5 million events per minute
  3. Just for testing, I removed the flatten to see whether it is the only issue, but it is only part of it: I got ~6-7 million events per minute
  4. json_array_parser alone got me back to ~10-11 million events per minute

It also makes sense in a way, as headerless-jarray-parser could iterate over the input array only once, create a map, and then replace attributes with that map.

I totally get the refactor we went for from a modularity point of view but is there any chance you'd agree to now also adding headerless-jarray-parser for performance reasons?

@djaglowski
Member

@RoeiDimi, thanks for providing detailed performance numbers.

I think given the notable performance benefit, it's reasonable to revisit the combined operator.

json_array_parser and assign_keys are both useful in their own right, even if composition affects performance. I'd like to keep these capabilities in place regardless.

If we were to apply the same terminology update as we arrived at, we are talking about a headerless_json_array_parser. Rather than having two closely related parsers, headerless_json_array_parser and json_array_parser, we should consider what this looks like to support both use cases. In other words, pull the header/less part into the config. I imagine this could work as follows:

  • Re-add header settings, equivalent to those in the csv parser.
  • If neither header setting is used, then the parser behaves exactly as it does today. That is, the result placed in parse_to is expected to be []any.
  • If either header setting is used, then we perform the parsing step and then apply the header, and ultimately place an object in parse_to.

The benefit of this approach is that the operator just has the ability to apply an optional behavior, but one that is closely related to the first step of parsing the array. WDYT?
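
A rough sketch of how the two modes might look side by side, assuming the header setting mirrors the csv_parser's comma-delimited header (names and defaults here are illustrative, not final):

    # Headerless mode (current behavior): parse_to receives a []any
    - type: json_array_parser
      parse_from: body
      parse_to: attributes.tmp

    # Header mode (proposed): the header is applied and parse_to receives an object
    - type: json_array_parser
      parse_from: body
      parse_to: attributes
      header: TimeGenerated,SourceIP,SourcePort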

@RoeiDimi
Contributor Author

@djaglowski yes, that's a good, clean idea
I guess we can now continue here - #30814
:)

djaglowski pushed a commit that referenced this pull request Jan 31, 2024
**Description:**
Adding a feature following #30644.
This feature allows the json_array_parser to accept a comma-delimited header; for every JSON array it parses, it outputs a map whose keys are the header fields and whose values are the ones parsed from the input JSON array.

This feature was added mainly for performance reasons, as from a functional point of view it is mostly similar to chaining the two operators `json_array_parser -> assign_keys`.

**Link to tracking Issue:**

#30321

**Testing:**

- Unit tests
- End-to-end tests: used generated traffic on a running otel collector that uses the parser and verified the data is as expected in the destination table and that performance looks good

**Documentation:**

-
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/58cc91ca30eabbd35c074d79db8630fc474164d9/pkg/stanza/docs/operators/json_array_parser.md)
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Feb 1, 2024

anthoai97 pushed a commit to anthoai97/opentelemetry-collector-contrib that referenced this pull request Feb 12, 2024