Fleet performance overhaul and manual labels implementation #2251

zwass · 2020-07-01T16:37:48Z

This pull request contains commits addressing the major performance issues experienced by larger Fleet installations, along with an implementation of “manual” host labels. Both arrive in the same PR because their implementations are tied to each other.

Thank you to Bloomberg for supporting development of these features.

Performance

Implement pagination support for hosts in the web UI. This addresses long page load times and high load on the Fleet server when a large number of hosts are enrolled. Fixes "Implement Paging for /api/v1/kolide/hosts" (#2204), fixes "hosts query taking a ton of RAM" (#2051).
Store only the MAC and IP of the primary network interface for hosts (as the full data set is unused by the UI). This reduces load on the Fleet server. Closes "Proposal: Remove network_interfaces from host details" (#2207). Users who require this data in API responses can use the "additional" query capability introduced in "Add capability to collect "additional" information from hosts" (#2236).
Move live query operations on the server from MySQL to Redis. This reduces load on the Fleet server at steady-state, and substantially during live query execution. Closes "Fleet unable to return all query results for ad hoc queries" (#2199), closes "Fleet unable to return query results for somewhat large number of hosts" (#2195).
Simplify label query execution by executing all label queries on the same interval. This reduces load on the Fleet server when hosts check in for live queries.

Manual Labels

Allow users to specify sets of hosts as a label, in addition to the existing dynamic labels feature. This feature is available through fleetctl yaml definitions. Closes "[feature request] Host grouping feature based on custom value, instead of labels" (#1967) and partially addresses "Feature Idea: multiple enrollment secrets w/ automatic host labeling" (#1994).
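As a rough illustration of what resolving a manual label involves, the sketch below turns a list of hostnames into a deduplicated set of host IDs and reports any hostnames that match no enrolled host. The function name `resolveManualLabel` and the `hostIDsByName` map are hypothetical and stand in for a datastore lookup; this is not the code in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// resolveManualLabel is a hypothetical helper: it maps the hostnames listed in
// a manual label to host IDs. hostIDsByName stands in for a datastore lookup.
// It returns the sorted, deduplicated IDs and the hostnames that were not found.
func resolveManualLabel(hostnames []string, hostIDsByName map[string]uint) (ids []uint, unknown []string) {
	seen := make(map[uint]bool)
	for _, name := range hostnames {
		id, ok := hostIDsByName[name]
		if !ok {
			// The hostname does not match any enrolled host.
			unknown = append(unknown, name)
			continue
		}
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids, unknown
}

func main() {
	enrolled := map[string]uint{"web-1": 1, "web-2": 2}
	ids, unknown := resolveManualLabel([]string{"web-2", "web-1", "web-2", "ghost"}, enrolled)
	fmt.Println(ids, unknown)
}
```

Returning the unmatched hostnames separately lets a caller decide whether a missing host is an error or simply a host that has not enrolled yet.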

zwass · 2020-07-01T16:40:07Z

I will work on rebasing these commits to address the merge conflicts, fix up the tests, and break this down into a few logical commits.

zwass · 2020-07-06T21:47:55Z

@directionless I've rebased, squashed the commits into a more coherent grouping, and fixed up the tests. Please review when you have a chance.

zwass · 2020-07-08T21:48:02Z

For context on the performance improvements, I ran the server on AWS infrastructure and used the osquery simulator to load test it.

Hardware
MySQL: db.m4.4xlarge (core count: 8, vCPU: 16, memory: 64 GB)
Redis: cache.t2.micro (vCPU: 1, memory: 0.555 GB)
Fleet servers: 6 server instances running in containers on AWS Fargate

~160,000 hosts online, ~700,000 enrolled

Stable response times from load balancer

Requests per minute increasing as simulated hosts start up

CPU utilization on Fleet servers (note very large time span so only the last segment is relevant) - CPU up to about 50%

CPU utilization on MySQL server got up to about 50%

CPU utilization on Redis server up to about 25%

nyanshak · 2020-07-09T21:02:08Z

For context on the performance numbers presented above - what do the numbers look like before the change?

As in, without this change applied, but using the same hardware for fleet, what do the graphs look like as we scale up # of hosts, and is the max # of hosts supported on that hardware much lower?

directionless

Made a first pass. It's big enough it's hard for me to absorb the higher level architectural changes. Though I'm not sure I need to.

A bunch of nits. Mostly comments. I think one thing for discussion, but I've kinda lost track

cmd/fleetctl/get.go

server/datastore/mysql/labels.go

server/live_query/redis_live_query_test.go

frontend/kolide/entities/hosts.js

directionless · 2020-07-09T21:55:52Z

server/datastore/datastore_hosts_test.go

-			IPAddress: "99.100.101.103",
-		},
-	}
+	h2.PrimaryIP = "99.100.101.103"


I know this IP was in the prior test. But it's probably assignable; it's not reserved.
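For reference, RFC 5737 reserves three IPv4 blocks for documentation and examples, and a test fixture could draw from those instead. A minimal sketch of such a check follows; the helper name `isDocumentationIP` is made up for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// documentationCIDRs lists the IPv4 blocks that RFC 5737 reserves for
// documentation and examples.
var documentationCIDRs = []string{
	"192.0.2.0/24",    // TEST-NET-1
	"198.51.100.0/24", // TEST-NET-2
	"203.0.113.0/24",  // TEST-NET-3
}

// isDocumentationIP (hypothetical helper) reports whether addr falls inside
// one of the RFC 5737 blocks. Unparseable input yields false.
func isDocumentationIP(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	for _, cidr := range documentationCIDRs {
		_, block, err := net.ParseCIDR(cidr)
		if err != nil {
			continue
		}
		if block.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isDocumentationIP("99.100.101.103")) // false
	fmt.Println(isDocumentationIP("203.0.113.103"))  // true
}
```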

server/service/transport_hosts.go

server/test/functions.go

zwass

Adding replies to some comments and questions.

frontend/kolide/entities/hosts.js

server/datastore/mysql/labels.go

server/live_query/redis_live_query.go

server/test/functions.go

zwass · 2020-07-14T19:39:17Z

@nyanshak I can't make a direct comparison to the current version as doing this kind of testing is quite costly and time consuming for me.

What I can say is that most folks seem to observe that the frontend becomes unusable around 20k hosts. Folks are getting to higher host counts by avoiding the frontend and only using fleetctl, however running live queries becomes unreliable in the 30k range or so (depending on hardware for MySQL server).

These changes allow Fleet to easily scale to those host counts on modest hardware configurations, and to much higher host counts (as demonstrated by the testing metrics) using more robust hardware.

Fleet used significant resources storing the full network interface information for each host. This data was unused, except to get the IP and MAC of the primary interface. With these changes, only those pieces of data are stored.
- Calculate and store primary IP and MAC
- Remove transaction for storing full interfaces
- Update targets search to use new IP and MAC columns
- Update frontend to use new columns
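To make "calculate primary IP and MAC" concrete, here is a minimal sketch of one possible selection rule: skip loopback and link-local addresses, prefer IPv4, and fall back to the first usable IPv6 address. The `iface` type, the `primaryInterface` function, and the rule itself are assumptions for illustration; the PR may choose the primary interface differently.

```go
package main

import (
	"fmt"
	"net"
)

// iface is a hypothetical stand-in for one reported network interface address.
type iface struct {
	IP  string
	MAC string
}

// primaryInterface returns the IP and MAC of the first interface whose address
// is neither loopback nor link-local, preferring IPv4 over IPv6.
// This selection rule is an assumption, not necessarily what Fleet does.
func primaryInterface(ifaces []iface) (ip, mac string) {
	var fallback *iface
	for i := range ifaces {
		parsed := net.ParseIP(ifaces[i].IP)
		if parsed == nil || parsed.IsLoopback() || parsed.IsLinkLocalUnicast() {
			continue
		}
		if parsed.To4() != nil {
			// First usable IPv4 address wins.
			return ifaces[i].IP, ifaces[i].MAC
		}
		if fallback == nil {
			// Remember the first usable IPv6 address in case no IPv4 exists.
			fallback = &ifaces[i]
		}
	}
	if fallback != nil {
		return fallback.IP, fallback.MAC
	}
	return "", ""
}

func main() {
	ip, mac := primaryInterface([]iface{
		{IP: "127.0.0.1", MAC: "00:00:00:00:00:00"},
		{IP: "fe80::1", MAC: "aa:bb:cc:dd:ee:01"},
		{IP: "192.0.2.10", MAC: "aa:bb:cc:dd:ee:02"},
	})
	fmt.Println(ip, mac)
}
```

Storing only this pair per host replaces a variable-length set of interface rows with two fixed columns, which is where the reduction in load would come from.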

This change optimizes live queries by pushing the computation of query targets to the creation time of the query, and efficiently caching the targets in Redis. This results in a huge performance improvement both at steady-state and when running live queries.
- Live queries are stored in a Redis bitfield, taking advantage of bitfield operations to be extremely efficient.
- Only run Redis live query test when REDIS_TEST is set in environment
- Ensure that live queries are only sent to hosts when there is a client listening for results. Addresses an existing issue in Fleet along with appropriate cleanup for the refactored live query backend.
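The bitfield idea can be sketched without a Redis connection. In the example below, an in-memory byte slice stands in for the Redis string value: bit N is set when host ID N is targeted, which is what Redis SETBIT would do, and the per-host check mirrors GETBIT. The type and function names are hypothetical, and the actual key layout used in this PR is not shown here.

```go
package main

import "fmt"

// targetBitfield is an in-memory stand-in for a Redis string used as a
// bitfield. Bit N set means host ID N is targeted by the live query.
type targetBitfield []byte

// newTargetBitfield computes the targets once, at query creation time.
func newTargetBitfield(hostIDs []uint) targetBitfield {
	var max uint
	for _, id := range hostIDs {
		if id > max {
			max = id
		}
	}
	bits := make(targetBitfield, max/8+1)
	for _, id := range hostIDs {
		// Redis numbers bits starting from the most significant bit of each
		// byte, so offset 0 is the high bit of byte 0.
		bits[id/8] |= 1 << (7 - id%8)
	}
	return bits
}

// targets reports whether hostID is targeted: a constant-time bit test,
// analogous to GETBIT, with no per-host rows to scan.
func (b targetBitfield) targets(hostID uint) bool {
	idx := hostID / 8
	if idx >= uint(len(b)) {
		return false
	}
	return b[idx]&(1<<(7-hostID%8)) != 0
}

func main() {
	bits := newTargetBitfield([]uint{0, 9, 42})
	fmt.Println(bits.targets(9), bits.targets(10), len(bits))
}
```

At one bit per host ID, a query targeting hosts with IDs up to 700,000 needs roughly 88 KB, and each check-in costs a single bit lookup per running query.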

This commit takes advantage of the existing pagination APIs in the Fleet server, and provides additional APIs to support pagination in the web UI. Doing this dramatically reduces the response sizes for requests from the UI, and limits the performance impact of UI clients on the Fleet and MySQL servers.
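A minimal sketch of how page parameters might translate into a bounded MySQL query follows. The `listOptions` struct, the `defaultPerPage` value, and `appendPaging` are hypothetical names chosen for illustration, and zero-based page numbering is an assumption.

```go
package main

import "fmt"

// listOptions is a hypothetical stand-in for the paging parameters of a
// list request. Page is assumed to be zero-based.
type listOptions struct {
	Page    uint
	PerPage uint
}

// defaultPerPage is an assumed default used when the client sends no page size.
const defaultPerPage = 100

// appendPaging appends LIMIT and OFFSET clauses to a query so that only one
// page of rows is read and serialized. Both values are integers formatted by
// the server, so no user-supplied string reaches the SQL text.
func appendPaging(sql string, opt listOptions) string {
	perPage := opt.PerPage
	if perPage == 0 {
		perPage = defaultPerPage
	}
	sql = fmt.Sprintf("%s LIMIT %d", sql, perPage)
	if offset := perPage * opt.Page; offset > 0 {
		sql = fmt.Sprintf("%s OFFSET %d", sql, offset)
	}
	return sql
}

func main() {
	fmt.Println(appendPaging("SELECT * FROM hosts ORDER BY id", listOptions{Page: 2, PerPage: 50}))
	// SELECT * FROM hosts ORDER BY id LIMIT 50 OFFSET 100
}
```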

Label membership is now stored in the label_membership table. This is done in preparation for adding "manual" labels, as previously label membership was associated directly with label query executions. Label queries are now all executed at the same time, rather than on separate intervals. This simplifies the calculation of which distributed queries need to be run when a host checks in.

"Manual" labels can be specified by hostname, allowing users to specify the membership of a label without having to use a dynamic query. See the included documentation.

Cleans up some repetition in tests.

Getting a single host with `fleetctl get host foobar` will look up the host with the matching hostname, uuid, osquery identifier, or node key, and provide the full host details along with the labels the host is a member of.
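The lookup described above can be pictured as matching one identifier string against several host fields. The sketch below does this over an in-memory slice; the `host` struct, its field names, and `findHost` are hypothetical, and a real implementation would presumably do the match in a database query.

```go
package main

import "fmt"

// host is a hypothetical, trimmed-down host record.
type host struct {
	Hostname      string
	UUID          string
	OsqueryHostID string
	NodeKey       string
}

// findHost returns the first host whose hostname, UUID, osquery identifier,
// or node key equals identifier.
func findHost(hosts []host, identifier string) (host, bool) {
	for _, h := range hosts {
		switch identifier {
		case h.Hostname, h.UUID, h.OsqueryHostID, h.NodeKey:
			return h, true
		}
	}
	return host{}, false
}

func main() {
	hosts := []host{
		{Hostname: "foobar", UUID: "uuid-1", OsqueryHostID: "osq-1", NodeKey: "key-1"},
		{Hostname: "bazqux", UUID: "uuid-2", OsqueryHostID: "osq-2", NodeKey: "key-2"},
	}
	h, ok := findHost(hosts, "uuid-2")
	fmt.Println(h.Hostname, ok)
}
```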

- Debounce frontend to reduce number of target searches in live query.
- More efficiently calculate label counts in live query and hosts dashboard. Instead of using the (slow) CountHostsInTargets function, retrieve the host counts while looking up the labels.
- Optimize targets search query. Removing the nested query retrieves the same logical result set, but substantially optimizes MySQL CPU usage. Testing indicates about a 50% reduction in MySQL CPU usage for the frontend targets search API call after applying this change.

zwass · 2020-07-16T01:02:07Z

@directionless I've updated per your comments and rebased to fix merge conflicts. Please let me know how it looks.

directionless

Let's try.

It's big enough I don't think I can wrap my head around all the nuance. But it seems like it's probably fine to try. And I'm happy enough with the redis implementation

zwass added the Component: Technical Debt, Component: Frontend, Feature: CLI, Feature: Labels, and Performance labels Jul 1, 2020

zwass requested a review from directionless July 1, 2020 16:37

directionless suggested changes Jul 10, 2020

zwass commented Jul 14, 2020

zwass added 8 commits July 15, 2020 17:54

Implement manual labels

d8ce0c0

"Manual" labels can be specified by hostname, allowing users to specify the membership of a label without having to use a dynamic query. See the included documentation.

Extract functionName into helper

42d28a6

Cleans up some repetition in tests.

zwass requested a review from directionless July 16, 2020 01:01

directionless approved these changes Jul 17, 2020

Clean up and comments before merge.

891a88e

zwass merged commit 7494513 into kolide:master Jul 21, 2020

zwass deleted the perf-manual-labels branch July 21, 2020 21:05
