Fleet performance overhaul and manual labels implementation #2251
I will work on rebasing these commits to address the merge conflicts, fix up the tests, and break this down into a few logical commits.
@directionless I've rebased, squashed the commits into a more coherent grouping, and fixed up the tests. Please review when you have a chance.
For context on the performance improvements, I ran the server on AWS infrastructure and used the osquery simulator to load test it.
Hardware
- ~160,000 hosts online, ~700,000 enrolled
- Stable response times from load balancer
- Requests per minute increasing as simulated hosts start up
- CPU utilization on Fleet servers (note very large time span so only the last segment is relevant): CPU up to about 50%
For context on the performance numbers presented above - what do the numbers look like before the change? As in, without this change applied, but using the same hardware for fleet, what do the graphs look like as we scale up # of hosts, and is the max # of hosts supported on that hardware much lower? |
Made a first pass. It's big enough it's hard for me to absorb the higher level architectural changes. Though I'm not sure I need to.
A bunch of nits. Mostly comments. I think one thing for discussion, but I've kinda lost track
IPAddress: "99.100.101.103",
},
}
h2.PrimaryIP = "99.100.101.103"
I know this IP was in a prior test. But it's probably assignable; it's not reserved.
Adding replies to some comments and questions.
@nyanshak I can't make a direct comparison to the current version, as doing this kind of testing is quite costly and time consuming for me. What I can say is that most folks seem to observe that the frontend becomes unusable around 20k hosts. Folks are getting to higher host counts by avoiding the frontend and only using fleetctl; however, running live queries becomes unreliable in the 30k range or so (depending on hardware for the MySQL server). These changes allow Fleet to easily scale to those host counts on modest hardware configurations, and to much higher host counts (as demonstrated by the testing metrics) using more robust hardware.
Fleet used significant resources storing the full network interface information for each host. This data was unused, except to get the IP and MAC of the primary interface. With these changes, only those pieces of data are stored.
- Calculate and store primary IP and MAC
- Remove transaction for storing full interfaces
- Update targets search to use new IP and MAC columns
- Update frontend to use new columns
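To make the "calculate and store primary IP and MAC" step concrete, here is a minimal sketch of how a primary interface could be picked from the interfaces a host reports. The `Interface` type, the `primaryIPAndMAC` function, and the selection heuristic (skip loopback and link-local addresses, prefer IPv4) are assumptions for illustration, not necessarily what this PR does.

```go
package main

import (
	"fmt"
	"net"
)

// Interface is a hypothetical, simplified view of one network interface
// reported by a host. It is not Fleet's actual type.
type Interface struct {
	Name string
	IP   string
	MAC  string
}

// primaryIPAndMAC returns the IP and MAC of the first interface whose address
// is neither loopback nor link-local, preferring IPv4 over IPv6.
// The heuristic is an assumption made for this sketch.
func primaryIPAndMAC(ifaces []Interface) (ip, mac string) {
	var fallback *Interface
	for i := range ifaces {
		parsed := net.ParseIP(ifaces[i].IP)
		if parsed == nil || parsed.IsLoopback() || parsed.IsLinkLocalUnicast() {
			continue // unusable as a primary address
		}
		if parsed.To4() != nil {
			return ifaces[i].IP, ifaces[i].MAC // first usable IPv4 wins
		}
		if fallback == nil {
			fallback = &ifaces[i] // remember the first usable IPv6
		}
	}
	if fallback != nil {
		return fallback.IP, fallback.MAC
	}
	return "", ""
}

func main() {
	ip, mac := primaryIPAndMAC([]Interface{
		{Name: "lo0", IP: "127.0.0.1", MAC: "00:00:00:00:00:00"},
		{Name: "en0", IP: "fe80::1", MAC: "aa:bb:cc:dd:ee:ff"},
		{Name: "en0", IP: "192.0.2.10", MAC: "aa:bb:cc:dd:ee:ff"},
	})
	fmt.Println(ip, mac)
}
```

Only the two returned strings would need to be persisted per host, rather than one row per interface.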
This change optimizes live queries by pushing the computation of query targets to the creation time of the query, and efficiently caching the targets in Redis. This results in a huge performance improvement both at steady-state and when running live queries.
- Live queries are stored using a bitfield in Redis, taking advantage of bitfield operations to be extremely efficient.
- Only run Redis live query test when REDIS_TEST is set in environment
- Ensure that live queries are only sent to hosts when there is a client listening for results. Addresses an existing issue in Fleet along with appropriate cleanup for the refactored live query backend.
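The bitfield idea can be sketched without a Redis connection. The example below assumes one bit per host ID, with bit offsets laid out the way Redis `SETBIT`/`GETBIT` address them (offset 0 is the most significant bit of the first byte). The `targetBitfield` type and its methods are hypothetical names for illustration; the actual key layout used by this PR may differ.

```go
package main

import "fmt"

// targetBitfield is an in-memory stand-in for the Redis string value that
// would hold one live query's targets: bit N set means host ID N is targeted.
// Hypothetical type for illustration only.
type targetBitfield []byte

// set marks hostID as targeted, growing the field as needed
// (comparable to SETBIT key hostID 1).
func (b *targetBitfield) set(hostID uint) {
	idx := hostID / 8
	for uint(len(*b)) <= idx {
		*b = append(*b, 0)
	}
	(*b)[idx] |= 1 << (7 - hostID%8)
}

// isSet reports whether hostID is targeted (comparable to GETBIT key hostID).
// Offsets beyond the end of the field read as 0, as they do in Redis.
func (b targetBitfield) isSet(hostID uint) bool {
	idx := hostID / 8
	if idx >= uint(len(b)) {
		return false
	}
	return b[idx]&(1<<(7-hostID%8)) != 0
}

func main() {
	var targets targetBitfield
	// Targets are computed once, when the live query is created.
	for _, id := range []uint{3, 42} {
		targets.set(id)
	}
	// On each host check-in, membership is a single bit test.
	fmt.Println(targets.isSet(42), targets.isSet(43), len(targets))
}
```

With this layout, a query targeting any subset of 700,000 hosts fits in under 90 KB, and the per-check-in cost is one bit lookup per active live query rather than a MySQL join.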
This commit takes advantage of the existing pagination APIs in the Fleet server, and provides additional APIs to support pagination in the web UI. Doing this dramatically reduces the response sizes for requests from the UI, and limits the performance impact of UI clients on the Fleet and MySQL servers.
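As a rough illustration of why pagination shrinks response sizes, the sketch below turns a page number and page size into a `LIMIT`/`OFFSET` clause. The `ListOptions` struct, the `appendLimitOffset` helper, and the `maxPerPage` cap are hypothetical names and values chosen for this example and are not taken from the Fleet codebase.

```go
package main

import "fmt"

// ListOptions is a hypothetical set of pagination parameters parsed from a
// request such as ?page=2&per_page=100.
type ListOptions struct {
	Page    uint // zero-based page index
	PerPage uint // page size; 0 means "do not paginate"
}

// maxPerPage is an assumed upper bound so one request cannot ask for
// every host at once.
const maxPerPage = 1000

// appendLimitOffset appends a LIMIT/OFFSET clause to a SQL query.
// Only integers are formatted into the string, so no user-controlled
// text reaches the SQL.
func appendLimitOffset(query string, opt ListOptions) string {
	if opt.PerPage == 0 {
		return query
	}
	perPage := opt.PerPage
	if perPage > maxPerPage {
		perPage = maxPerPage
	}
	return fmt.Sprintf("%s LIMIT %d OFFSET %d", query, perPage, perPage*opt.Page)
}

func main() {
	q := appendLimitOffset("SELECT * FROM hosts ORDER BY id", ListOptions{Page: 2, PerPage: 100})
	fmt.Println(q)
}
```

The UI then fetches one page at a time, so both the JSON payload and the MySQL result set stay bounded regardless of how many hosts are enrolled.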
Label membership is now stored in the label_membership table. This is done in preparation for adding "manual" labels, as previously label membership was associated directly with label query executions. Label queries are now all executed at the same time, rather than on separate intervals. This simplifies the calculation of which distributed queries need to be run when a host checks in.
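The separation of membership from query executions can be pictured as a set of (label, host) pairs that is updated whenever a host reports label query results. The sketch below uses an in-memory map in place of the `label_membership` table; the type and method names are hypothetical, and the real schema may carry additional columns.

```go
package main

import (
	"fmt"
	"sort"
)

// membershipKey identifies one row of a label_membership-style table.
// Hypothetical type for illustration.
type membershipKey struct {
	LabelID uint
	HostID  uint
}

// labelMembership is an in-memory stand-in for the table: a row exists
// exactly when the host is currently a member of the label.
type labelMembership map[membershipKey]struct{}

// record applies one host's label query results: a matching label inserts
// the row, a non-matching label deletes it. Labels absent from results
// (for example, manually managed ones) are left untouched.
func (m labelMembership) record(hostID uint, results map[uint]bool) {
	for labelID, matches := range results {
		key := membershipKey{LabelID: labelID, HostID: hostID}
		if matches {
			m[key] = struct{}{}
		} else {
			delete(m, key)
		}
	}
}

// labelsForHost returns the sorted label IDs the host belongs to.
func (m labelMembership) labelsForHost(hostID uint) []uint {
	var ids []uint
	for key := range m {
		if key.HostID == hostID {
			ids = append(ids, key.LabelID)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	m := labelMembership{}
	m.record(7, map[uint]bool{1: true, 2: true, 3: false})
	m.record(7, map[uint]bool{2: false}) // host 7 no longer matches label 2
	fmt.Println(m.labelsForHost(7))
}
```

Because membership rows exist independently of any query run, a label whose rows are written directly rather than by a query fits the same structure.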
"Manual" labels can be specified by hostname, allowing users to specify the membership of a label without having to use a dynamic query. See the included documentation.
Cleans up some repetition in tests.
Getting a single host with `fleetctl get host foobar` will look up the host with the matching hostname, uuid, osquery identifier, or node key, and provide the full host details along with the labels the host is a member of.
- Debounce frontend to reduce number of target searches in live query.
- More efficiently calculate label counts in live query and hosts dashboard. Instead of using the (slow) CountHostsInTargets function, retrieve the host counts while looking up the labels.
- Optimize targets search query. Removing the nested query retrieves the same logical result set, but substantially optimizes MySQL CPU usage. Testing indicates about a 50% reduction in MySQL CPU usage for the frontend targets search API call after applying this change.
@directionless I've updated per your comments and rebased to fix merge conflicts. Please let me know how it looks.
Let's try.
It's big enough I don't think I can wrap my head around all the nuance. But it seems like it's probably fine to try. And I'm happy enough with the redis implementation
This pull request contains commits addressing the major performance issues experienced by larger Fleet installations, along with an implementation of “manual” host labels. These features come in the same PR as the implementations are tied to each other.
Thank you to Bloomberg for supporting development of these features.
Performance
- Implement pagination support for hosts in the web UI. This addresses long page load times and high load on the Fleet server when a large number of hosts are enrolled. Fixes #2204 (Implement Paging for `/api/v1/kolide/hosts`), fixes #2051 (hosts query taking a ton of RAM).
- Store only the MAC and IP of the primary network interface for hosts (as the full data set is unused by the UI). This reduces load on the Fleet server. Closes #2207 (Proposal: Remove network_interfaces from host details). Users who require this data in API responses can use the "additional" query capability introduced in #2236 (Add capability to collect "additional" information from hosts).
- Move live query operations on the server from MySQL to Redis. This reduces load on the Fleet server at steady-state, and substantially during live query execution. Closes #2199 (Fleet unable to return all query results for ad hoc queries), closes #2195 (Fleet unable to return query results for somewhat large number of hosts).
- Simplify label query execution by executing all label queries on the same interval. This reduces load on the Fleet server when hosts check in for live queries.
Manual Labels
`fleetctl` yaml definitions. Closes #1967 ([feature request] Host grouping feature based on custom value, instead of labels) and partially addresses #1994 (Feature Idea: multiple enrollment secrets w/ automatic host labeling).