
address issues with NetBox database and Logstash's NetBox cache #259

Closed
mmguero opened this issue Sep 12, 2023 · 5 comments

@mmguero
Collaborator

mmguero commented Sep 12, 2023

I'm not going to hold up the v23.09.0 release for this, but this is an issue I just discovered that needs to be addressed:

The Logstash Ruby script in charge of enrichment from NetBox uses some LRU caches to avoid frequent, repetitive API lookups of the same items.

However, the NetBox database is not static: things can be added (not so much of an issue, since a cache miss will just cause the thing to be looked up), but they can also be deleted and changed.

So Logstash could be going along and have information in its cache that is now invalid because the underlying database has been changed.
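
To make the stale-cache problem concrete, here is a minimal sketch of an LRU cache in plain Ruby. It is not the code from Malcolm's enrichment script: the class name `TinyLRU`, the hash standing in for the NetBox database, and the device names are all hypothetical. It only shows that an entry in an LRU cache is never re-queried once it has been stored, so a later change in the backing store goes unnoticed.

```ruby
# Minimal LRU cache (illustrative only; not the class used in Malcolm).
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @data = {} # Ruby hashes preserve insertion order, oldest first
  end

  # Return the cached value, or run the block (the "API lookup") on a miss.
  def getset(key)
    if @data.key?(key)
      value = @data.delete(key) # re-insert to mark as most recently used
      @data[key] = value
      return value
    end
    value = yield
    @data[key] = value
    @data.shift if @data.size > @max_size # evict the least recently used entry
    value
  end
end

# Hypothetical stand-in for the NetBox database.
fake_netbox = { "10.0.0.5" => "plc-01" }
lru = TinyLRU.new(100)

before_change = lru.getset("10.0.0.5") { fake_netbox["10.0.0.5"] }
fake_netbox["10.0.0.5"] = "hmi-02" # the record is changed in NetBox
after_change = lru.getset("10.0.0.5") { fake_netbox["10.0.0.5"] }

puts before_change # plc-01
puts after_change  # plc-01 (stale: the cache hit skips the lookup)
```

An LRU cache only drops an entry when it runs out of room, so in this sketch the stale value would stay until enough other keys pushed it out.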

I don't think there's a good way to "trigger" Logstash when changes are written to NetBox (externally to auto-population, like through the UI or through ./scripts/netbox-restore). But we need to consider how to handle this.

At the very least, a complete wipe/restore of the database via ./scripts/netbox-restore needs to trigger a complete Logstash restart (or, if we want to be more subtle about it, somehow notify Logstash to clear all of its caches).

This needs some careful thought to figure out how to deal with it. On one hand, we know NetBox is using PostgreSQL and Redis to do its own database and caching: maybe we simply don't need to cache in Logstash at all and it won't be too expensive to just call the API every time? I don't know.

@mmguero mmguero added bug Something isn't working logstash Relating to Malcolm's use of Logstash netbox Related to Malcolm's use of NetBox labels Sep 12, 2023
@mmguero mmguero added this to the v23.10.0 milestone Sep 12, 2023
@mmguero mmguero modified the milestones: v23.10.0, v23.11.0 Oct 23, 2023
@mmguero
Collaborator Author

mmguero commented Oct 26, 2023

After looking a little closer at this, I think dropping the caching in Logstash is the approach I'm going to take. Since NetBox is already caching things, I think we're adding complication without a lot of gain by doing our own caching on top of that.

@mmguero mmguero self-assigned this Oct 26, 2023
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Oct 26, 2023
…olab#259); work in progress, almost certainly broken in this state
@mmguero
Collaborator Author

mmguero commented Oct 26, 2023

Working on it, but as of mmguero-dev/Malcolm@cab66bd it's broken (seems to be a concurrency issue). Just as a reminder to myself...

mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Oct 27, 2023
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Oct 27, 2023
@mmguero
Collaborator Author

mmguero commented Oct 27, 2023

The good news is I've removed our extra layer of caching and it does seem to be more consistent now. The bad news is the CPU load increases quite a bit, especially when doing autopopulation, due to the increased API load. I need to do some benchmarks to compare, and also compare when autopopulation is turned on vs. not.

@mmguero
Collaborator Author

mmguero commented Oct 27, 2023

Okay, on further analysis I think we're actually going to have to come to a middle ground: still do the caching, but with two changes: 1) decrease the TTL significantly (from 600 seconds down to maybe 30 or 60 seconds) and 2) make ALL of the caches in the file TTL caches, rather than having the site/role/device type caches be plain LRU.

I think removing the caching completely will just make it too slow.
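
The middle ground described above can be sketched as a small TTL cache in plain Ruby. This is a sketch under assumptions, not the code from the referenced commits: the class name `TinyTTLCache`, the environment variable name `EXAMPLE_NETBOX_CACHE_TTL`, and the injectable clock (used here only so the demo does not have to sleep) are all hypothetical.

```ruby
# Minimal thread-safe TTL cache (illustrative only; names are hypothetical).
class TinyTTLCache
  def initialize(ttl_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @data = {}
    @lock = Mutex.new # Logstash runs filters on several worker threads
  end

  # Return a fresh cached value, or run the block (the "API lookup") when
  # the key is missing or its entry is older than the TTL.
  def getset(key)
    @lock.synchronize do
      now = @clock.call
      entry = @data[key]
      return entry[:value] if entry && (now - entry[:stored_at]) < @ttl

      value = yield
      @data[key] = { value: value, stored_at: now }
      value
    end
  end
end

# Assumed variable name for illustration; defaults to 30 seconds when unset.
ttl = Integer(ENV.fetch("EXAMPLE_NETBOX_CACHE_TTL", "30"))

fake_now = 0.0 # fake clock so the demo is deterministic
ttl_cache = TinyTTLCache.new(ttl, clock: -> { fake_now })
backing = { "10.0.0.5" => "plc-01" } # hypothetical stand-in for NetBox

fresh = ttl_cache.getset("10.0.0.5") { backing["10.0.0.5"] }
backing["10.0.0.5"] = "hmi-02" # the record is changed in NetBox

fake_now = ttl / 2.0
within_ttl = ttl_cache.getset("10.0.0.5") { backing["10.0.0.5"] }

fake_now = ttl + 1.0
after_ttl = ttl_cache.getset("10.0.0.5") { backing["10.0.0.5"] }

puts fresh      # plc-01
puts within_ttl # plc-01 (still cached)
puts after_ttl  # hmi-02 (entry expired, so the lookup ran again)
```

With this shape, a stale entry survives for at most one TTL period, which is the trade-off between consistency and API load that the shorter TTL is meant to tune.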

mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Oct 27, 2023
…olab#259); restore caching for performance reasons, but decrease TTL significantly and allow it to be specified via environment variable
@mmguero
Collaborator Author

mmguero commented Oct 27, 2023

Closing as per my last comment and commit.

Some rough benchmarks (the last number is milliseconds in the filter):

With no caching, autopopulate on:

ruby_netbox_enrich_destination_ip_segment;87489;87489;3017294
ruby_netbox_enrich_source_ip_segment;96610;96610;3411570
ruby_netbox_enrich_source_ip_device;96610;96610;10968202
ruby_netbox_enrich_destination_ip_device;87489;87489;11361163

With no caching, autopopulate off:

ruby_netbox_enrich_destination_ip_segment;85755;85755;169437
ruby_netbox_enrich_source_ip_segment;94305;94305;222455
ruby_netbox_enrich_destination_ip_device;85755;85755;958852
ruby_netbox_enrich_source_ip_device;94305;94305;1607698

Although this is rough (the event counts aren't exactly the same between runs), you can see that without the caching it's like 10x+ worse.
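
Because the event counts differ between the two runs, one way to compare them is to normalize to milliseconds per event. The sketch below does that for the figures listed above, taking the two result sets in the order they appear. Reading the second and third fields as event counts is an assumption (the thread only states that the last number is milliseconds in the filter), and the helper name `ms_per_event` is hypothetical.

```ruby
# Result lines copied from the two benchmark listings above.
first_run = <<~DATA
  ruby_netbox_enrich_destination_ip_segment;87489;87489;3017294
  ruby_netbox_enrich_source_ip_segment;96610;96610;3411570
  ruby_netbox_enrich_source_ip_device;96610;96610;10968202
  ruby_netbox_enrich_destination_ip_device;87489;87489;11361163
DATA

second_run = <<~DATA
  ruby_netbox_enrich_destination_ip_segment;85755;85755;169437
  ruby_netbox_enrich_source_ip_segment;94305;94305;222455
  ruby_netbox_enrich_destination_ip_device;85755;85755;958852
  ruby_netbox_enrich_source_ip_device;94305;94305;1607698
DATA

# Map each filter name to milliseconds spent per event.
# Assumes the second field is the number of events the filter processed.
def ms_per_event(text)
  text.lines.map(&:strip).reject(&:empty?).map do |line|
    name, events, _events_again, millis = line.split(";")
    [name, millis.to_f / events.to_i]
  end.to_h
end

first = ms_per_event(first_run)
second = ms_per_event(second_run)

# Ratio of the first run to the second run, per filter.
ratios = first.map { |name, ms| [name, ms / second.fetch(name)] }.to_h

ratios.each do |name, ratio|
  puts format("%-42s %7.2f vs %6.2f ms/event (%.1fx)", name, first[name], second[name], ratio)
end
```

Normalizing per event keeps the comparison from being skewed by the slightly different event totals in each run.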

@mmguero mmguero closed this as completed Oct 27, 2023
This was referenced Dec 4, 2023
Projects
Status: Released