Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix compound query_key and mixed mappings #1330

Merged
merged 16 commits into from
Dec 13, 2023

Conversation

jmacdone
Copy link
Contributor

@jmacdone jmacdone commented Dec 3, 2023

Description

Fixes to allow top_count_keys to still function when using a list of query_key values.

Checklist

  • I have reviewed the contributing guidelines.
  • I have included unit tests for my changes or additions.
  • I have successfully run make test-docker with my changes.
  • I have manually tested all relevant modes of the change in this PR.
  • I have updated the documentation.
  • I have updated the changelog.

Questions or Comments

tl;dr fc583f7 and e0795fe are bug fixes. 9d1f291 is a guess.

I didn't attempt unit tests as a. get_hits_terms looks strongly coupled with elasticsearch results and b. there wasn't an existing get_hits_terms unit test (that I could find)

Confirmed through manual testing that a single query_key still behaves as before and the list of query_key values now populates top_events_* as expected.

I have low confidence 9d1f291 is the correct way to handle raw_count_keys=false but it's working (finally) as expected, when dealing with a mix of text terms and keyword, ip, int, etc. terms. As it stands now, with this PR, a rule with this config works as I expect.

query_key:
  - serialnumber.keyword
  - src_computer.keyword

top_count_keys:
  - src_displayname.keyword
  - src_computer.keyword
  - src_uid    # mapping type: keyword
  - src        # mapping of type: ip
 
raw_count_keys: false

Manual testing done on a python 3.11 image (similar to the Dockerfile) with vscode debugger checkpoints around the changes

"dev.containers.source": "https://github.com/devcontainers/images",
"dev.containers.variant": "3.11-bookworm"

make test-docker

py311: OK (585.32=setup[491.51]+cmd[91.95,1.86] seconds)
docs: OK (492.05=setup[476.75]+cmd[15.30] seconds)
congratulations :) (1077.43 seconds)

@nsano-rururu
Copy link
Collaborator

I forgot to update CHANGELOG.md

@jertel
Copy link
Owner

jertel commented Dec 3, 2023

Does this related at all to #982 ?

Also some unit test coverage are needed for these changes, even if there hadn't been coverage previously. The new .keyword logic should be a straightforward unit test since there are already lookup tests in util_test.py. For the get_hits_terms changes if you have to split out the logic into its own helper function so that you can isolate the unit testing to just that logic, that is fine, and preferable anyway.

Regarding @nsano-rururu's comment I think he meant "You forgot to update the changelog".

@jmacdone
Copy link
Contributor Author

A status update: I've been grinding on this for a while. While trying to write unit tests the current PR, (it feels like) I keep finding things like

qk = self.rules.get('query_key')
for event in data:
if qk:
key = hashable(lookup_es_key(event, qk))
and
qk = lookup_es_key(match, rule['query_key'])
that assume query_key will be a string representing a single fieldname, but
rule['query_key'] = ','.join(raw_query_key)
will create a comma-joined string of fieldnames (multiple).

So, I'm scratching my head as to what would be the right fix. I'm leaning towards adding guards to check for compound_query_key before using query_key for those two examples (and any others found).

But, I'm also considering "should query_key just always be normalized to a list?" and then remove the "hidden" compound_query_key created as a side-effect of loading rule options and chase after all the broken that'd create. Probably way more work, but this somehow feels "more correct" at the moment.

@jmacdone
Copy link
Contributor Author

Does this related at all to #982 ?

They could very well be related. One of the symptoms of this PR's bug is "man, why does top_events never seem to work for me?!"

@jertel
Copy link
Owner

jertel commented Dec 11, 2023

From what I recall query_key is intentionally not supposed to be a list, since Elasticsearch's aggregations will respond using the comma delimited string as the bucket key for a compound query key, and we need to repeatedly fetch that bucket using the comma delimited form. I believe this is the case, it's been a while since I dove into that area.

Now that I'm looking more closely at these code changes it appears your new split(", ") will not work correctly in some cases. The rule loader uses join(",") which will create that compound string without spaces. When I first reviewed this I was thinking split(delim) treated any character found in delim as the delimiter string, but that's not the case. Instead it treats the full delim string as the delimiter. A safer approach would be to check if qk.contains(",") to determine if this is a compound key or not on line 471 in elastalert.py. Then later where qk_list[i] is used, add a .strip() to that reference to remove any extra whitespace for instances when the query_key value is specified with spaces in the comma delimited format.

That said, I'm now wondering what scenario it was where you found whitespace characters in the query_key string value in the first place, which led to your change toward split(", "). Were you manually specifying a "query_key: key1, key2" in the rule YAML file, instead of using the list format?

@jmacdone
Copy link
Contributor Author

jmacdone commented Dec 11, 2023

what scenario it was where you found whitespace characters in the query_key string value in the first place

I've been wondering the same. After more debugger time, I see the get_hits() method calls the process_hits() method which adds rules options (like rule['query_key']) to the _source dict returned from elastic.

hit['_source'][rule['query_key']] = ', '.join([str(value) for value in values])

remove any extra whitespace for instances when the query_key value is specified with spaces in the comma delimited format

It looks it used to tried to work that way ~5 years ago
76ab593

@jmacdone
Copy link
Contributor Author

jmacdone commented Dec 11, 2023

I'm not sure how to how find the blame for a deleted line, but I'm curious as to when and why this one was removed Ah, this was meant to handle the whitespace stripping, but didn't, because it never used the qk_tmp value

qk_tmp = qk.replace(" ", "")
qk_list = qk.split(",")

and because the qk_tmp was never used, it was stripped in 8abff23

leaving us with the old commaspace-split replaced with a comma-split. i.e. 76ab593 introduced buggy behavior.

@jertel
Copy link
Owner

jertel commented Dec 11, 2023

Isn't this line

hit['_source'][rule['query_key']] = ', '.join([str(value) for value in values])
comma-space joining the values of the returned query keys? It looks like it is setting that new string as the value for the query_key key, rather than the key itself.

@jmacdone
Copy link
Contributor Author

Confusing, right? Sometimes qk is a key, and sometimes holds the value of a key. In one of my example hits,

hit['_source'][rule['query_key']] = ', '.join([str(value) for value in values])
results in

hit['_source']['serialnumber.keyword,src_computer.keyword'] = 'NXHFPAA001234567890, LAPTOP-M16RFTYS'

and when it comes time to alert, the alert calls get_hits_terms(..., qk='NXHFPAA001234567890, LAPTOP-M16RFTYS', ...) which can't be split properly by

def get_hits_terms(self, rule, starttime, endtime, index, key, qk=None, size=None):
rule_filter = copy.copy(rule['filter'])
if qk:
qk_list = qk.split(",")

And this is in the context of a cardinality rule configured with

query_key: 
  - serialnumber.keyword
  - src_computer.keyword
raw_count_keys: false

@jertel
Copy link
Owner

jertel commented Dec 11, 2023

Yes, that qk variable needs to be renamed qk_value to eliminate confusion. Then there's the issue with a pre-joined value itself containing a comma, and later the code thinks it's splitting apart the query key into their respective values, but now the extra embedded commas are going to trip up that split. It could use a better design. Perhaps a quick fix is to join those values with a more explicit delimiter, such as __EA2_DELIM__ and then the join and split statements can share a single constant var. It would help avoid future maintainers and show how those two join and split statements are connected.

@jertel
Copy link
Owner

jertel commented Dec 11, 2023

Thinking more on this... The drawback to using that explicit delimiter is that is any other code or existing users relying on that field are now going to see that ugly delimiter in their alert message.

If we leave it to the comma delimited approach we could at least do a sanity check after the split and see if the length of the split value array matches the compound_query_key array. If it doesn't then at least log an error for visibility, or even append a message to the value that the top counts couldn't be accurately determined.

But this isn't even the problem you're trying to solve so apologies for muddying up the issue.

Back to the original code review, there may still be some benefit leaving the split as-is and using a .strip() on the qk_list[i] reference.

jmacdone pushed a commit to jmacdone/elastalert2 that referenced this pull request Dec 12, 2023
@jmacdone
Copy link
Contributor Author

  • added a requirement for >=py3.9 (it seems the project wants to be on 3.12 regardless)
  • rebased from master
  • used re.split() to handle variations of comma and commaspace
  • added the suggested waring if query_key values just-so-happen to include commas
  • snark block comments
  • re-tested with test-docker

I left the commits "organic". But if the project prefers consolidated commits, I can do that.

@jertel
Copy link
Owner

jertel commented Dec 12, 2023

I'm not seeing any new commits.

jmacdone pushed a commit to jmacdone/elastalert2 that referenced this pull request Dec 12, 2023
James Macdonell added 13 commits December 12, 2023 06:26
otherwise, a space gets prefixed to the value
The `raw_count_keys` handler now tests the intended value while iterating
over `compound_query_key` terms.  Previously, it was checking the
method param `key` and erroneously skipping the add_keyword_postfix()
conditional
allow finding of qk values from matches when using raw_count_keys=false

hard-coding '.keyword' is probably wrong.  Not sure where to get the string_multi_field_name from in this context
To use string.removesuffix()

and f-strings (>=3.6)

and type-hints for builtin collections

and because it looks like the project is standardizing on 3.12 anyway
@jmacdone
Copy link
Contributor Author

🤦‍♂️ git did what I asked rather than what I wanted. Better now?

Copy link
Owner

@jertel jertel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this improvement, and for adding the log warning.

@jertel jertel merged commit f94739f into jertel:master Dec 13, 2023
1 check passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants