This repository has been archived by the owner on Dec 15, 2020. It is now read-only.

Fleet performance overhaul and manual labels implementation #2251

Merged
merged 9 commits into from Jul 21, 2020

@zwass zwass commented Jul 1, 2020

This pull request contains commits addressing the major performance issues experienced by larger Fleet installations, along with an implementation of “manual” host labels. These features come in the same PR as the implementations are tied to each other.

Thank you to Bloomberg for supporting development of these features.

Performance

Manual Labels

zwass commented Jul 1, 2020

I will work on rebasing these commits to address the merge conflicts, fix up the tests, and break this down into a few logical commits.

zwass commented Jul 6, 2020

@directionless I've rebased, squashed the commits into a more coherent grouping, and fixed up the tests. Please review when you have a chance.

zwass commented Jul 8, 2020

For context on the performance improvements, I ran the server on AWS infrastructure and used the osquery simulator to load test it.

Hardware
MySQL: db.m4.4xlarge (core count: 8, vCPU: 16, memory: 64 GB)
Redis: cache.t2.micro (vCPU: 1, memory: 0.555 GB)
Fleet servers: 6 server instances running in containers on AWS Fargate

~160,000 hosts online, ~700,000 enrolled

Screen Shot 2020-04-14 at 9 48 01 PM

Stable response times from load balancer

Screen Shot 2020-04-14 at 9 44 33 PM

Requests per minute increasing as simulated hosts start up

Screen Shot 2020-04-14 at 9 43 56 PM

CPU utilization on Fleet servers got up to about 50% (note the very large time span; only the last segment is relevant)

Screen Shot 2020-04-14 at 9 42 57 PM

CPU utilization on MySQL server got up to about 50%
Screen Shot 2020-04-14 at 9 42 45 PM

CPU utilization on Redis server up to about 25%
Screen Shot 2020-04-14 at 9 54 56 PM

nyanshak commented Jul 9, 2020

For context on the performance numbers presented above - what do the numbers look like before the change?

As in, without this change applied, but using the same hardware for fleet, what do the graphs look like as we scale up # of hosts, and is the max # of hosts supported on that hardware much lower?

@directionless directionless left a comment

Made a first pass. It's big enough it's hard for me to absorb the higher level architectural changes. Though I'm not sure I need to.

A bunch of nits. Mostly comments. I think one thing for discussion, but I've kinda lost track

cmd/fleetctl/get.go
server/datastore/mysql/labels.go (outdated)
server/datastore/mysql/labels.go (outdated)
server/datastore/mysql/labels.go
server/datastore/mysql/labels.go
server/live_query/redis_live_query_test.go
frontend/kolide/entities/hosts.js
IPAddress: "99.100.101.103",
},
}
h2.PrimaryIP = "99.100.101.103"

I know this IP was in a prior test. But it's probably assignable; it's not reserved.
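
To illustrate the point about reserved addresses, here is a minimal Go sketch that checks whether a test address falls inside one of the IPv4 blocks that RFC 5737 reserves for documentation and examples. The helper name `isDocumentationIP` is hypothetical and is not part of Fleet.

```go
package main

import (
	"fmt"
	"net"
)

// documentationNets lists the IPv4 blocks reserved for documentation
// by RFC 5737 (TEST-NET-1, TEST-NET-2, TEST-NET-3).
var documentationNets = []string{"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}

// isDocumentationIP reports whether s parses as an IP address inside one
// of the RFC 5737 blocks. It is a hypothetical helper, not Fleet code.
func isDocumentationIP(s string) bool {
	ip := net.ParseIP(s)
	if ip == nil {
		return false
	}
	for _, cidr := range documentationNets {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			continue
		}
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// The address used in the test is outside the reserved blocks.
	fmt.Println(isDocumentationIP("99.100.101.103"))
	// An address inside TEST-NET-1 would be safe for fixtures.
	fmt.Println(isDocumentationIP("192.0.2.103"))
}
```

An address from one of these blocks should never collide with a real host.
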

server/service/transport_hosts.go (outdated)
server/test/functions.go

@zwass zwass left a comment

Adding replies to some comments and questions.

frontend/kolide/entities/hosts.js
server/datastore/mysql/labels.go
server/datastore/mysql/labels.go
server/datastore/mysql/labels.go
server/datastore/mysql/labels.go
server/live_query/redis_live_query.go
server/test/functions.go

zwass commented Jul 14, 2020

@nyanshak I can't make a direct comparison to the current version, as doing this kind of testing is quite costly and time-consuming for me.

What I can say is that most folks seem to observe that the frontend becomes unusable around 20k hosts. Folks are getting to higher host counts by avoiding the frontend and only using fleetctl; however, running live queries becomes unreliable in the 30k range or so (depending on the hardware for the MySQL server).

These changes allow Fleet to easily scale to those host counts on modest hardware configurations, and to much higher host counts (as demonstrated by the testing metrics) using more robust hardware.

Fleet used significant resources storing the full network interface
information for each host. This data was unused, except to get the
IP and MAC of the primary interface. With these changes, only those
pieces of data are stored.

- Calculate and store primary IP and MAC
- Remove transaction for storing full interfaces
- Update targets search to use new IP and MAC columns
- Update frontend to use new columns

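
As an illustration of the first bullet, the following Go sketch picks a primary IP and MAC from a host's interface list. The `iface` type and the selection rule (first address that is neither loopback nor link-local) are assumptions made for this example; the actual rule used in this PR may differ.

```go
package main

import (
	"fmt"
	"net"
)

// iface is a hypothetical, simplified view of one reported network
// interface; it is not Fleet's actual type.
type iface struct {
	IP  string
	MAC string
}

// primaryInterface returns the IP and MAC of the first interface whose
// address parses and is neither loopback nor link-local. The selection
// rule is an assumption for illustration only.
func primaryInterface(ifaces []iface) (ip, mac string, ok bool) {
	for _, i := range ifaces {
		parsed := net.ParseIP(i.IP)
		if parsed == nil || parsed.IsLoopback() || parsed.IsLinkLocalUnicast() {
			continue
		}
		return i.IP, i.MAC, true
	}
	return "", "", false
}

func main() {
	ip, mac, ok := primaryInterface([]iface{
		{IP: "127.0.0.1", MAC: "00:00:00:00:00:00"},
		{IP: "fe80::1", MAC: "aa:bb:cc:dd:ee:01"},
		{IP: "192.0.2.10", MAC: "aa:bb:cc:dd:ee:02"},
	})
	fmt.Println(ip, mac, ok)
}
```

Storing only these two values per host means a single row update at check-in, rather than a transaction over every interface row.
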
This change optimizes live queries by pushing the computation of query
targets to the creation time of the query, and efficiently caching the
targets in Redis. This results in a huge performance improvement both at
steady state and when running live queries.

- Live queries are stored using a bitfield in Redis, and take
advantage of bitfield operations to be extremely efficient.

- Only run Redis live query test when REDIS_TEST is set in environment

- Ensure that live queries are only sent to hosts when there is a client
listening for results. Addresses an existing issue in Fleet along with
appropriate cleanup for the refactored live query backend.
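
The bitfield idea can be sketched in plain Go without a Redis connection. The sketch assumes one bit per host ID, set when the host is a target of the query; the function names are hypothetical, and the real key layout in this PR is not shown here. The bit addressing (offset 0 is the most significant bit of the first byte) mirrors what Redis `SETBIT` and `GETBIT` use.

```go
package main

import "fmt"

// setBit marks hostID as a target, growing the field as needed.
// Offset 0 is the most significant bit of byte 0, matching the
// addressing used by Redis SETBIT/GETBIT.
func setBit(field []byte, hostID uint) []byte {
	idx := int(hostID / 8)
	for len(field) <= idx {
		field = append(field, 0)
	}
	field[idx] |= byte(1) << (7 - hostID%8)
	return field
}

// getBit reports whether hostID is marked as a target. IDs beyond the
// end of the field read as unset, as they do in Redis.
func getBit(field []byte, hostID uint) bool {
	idx := int(hostID / 8)
	if idx >= len(field) {
		return false
	}
	return field[idx]&(byte(1)<<(7-hostID%8)) != 0
}

func main() {
	var targets []byte
	for _, id := range []uint{3, 9, 160000} {
		targets = setBit(targets, id)
	}
	// One bit per host ID keeps the whole target set compact.
	fmt.Println("bytes used:", len(targets))
	fmt.Println("host 9 targeted:", getBit(targets, 9))
	fmt.Println("host 10 targeted:", getBit(targets, 10))
}
```

With this layout, deciding whether a checking-in host should receive a query is a single bit read, and a target set covering host IDs up to 160,000 fits in about 20 KB.
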
This commit takes advantage of the existing pagination APIs in the Fleet
server, and provides additional APIs to support pagination in the web
UI. Doing this dramatically reduces the response sizes for requests from
the UI, and limits the performance impact of UI clients on the Fleet and
MySQL servers.
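
A minimal sketch of how pagination parameters could translate into a bounded SQL query is shown below. The `listOptions` type, the zero-based page index, the default page size, and the column names are all assumptions made for illustration, not the API added in this commit.

```go
package main

import "fmt"

// defaultPerPage is an assumed fallback page size, not a value from Fleet.
const defaultPerPage = 100

// listOptions is a hypothetical stand-in for the pagination options
// sent by the UI.
type listOptions struct {
	Page    uint // zero-based page index (assumption)
	PerPage uint // rows per page; 0 means "use the default"
}

// limitOffset converts the options into SQL LIMIT and OFFSET values.
func limitOffset(opt listOptions) (limit, offset uint) {
	limit = opt.PerPage
	if limit == 0 {
		limit = defaultPerPage
	}
	return limit, opt.Page * limit
}

func main() {
	limit, offset := limitOffset(listOptions{Page: 2, PerPage: 50})
	// Only integers are formatted into the statement here.
	fmt.Printf("SELECT id, hostname FROM hosts ORDER BY id LIMIT %d OFFSET %d\n", limit, offset)
}
```

Each UI request then returns at most one page of hosts instead of the full host list.
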
Label membership is now stored in the label_membership table. This is
done in preparation for adding "manual" labels, as previously label
membership was associated directly with label query executions.

Label queries are now all executed at the same time, rather than on
separate intervals. This simplifies the calculation of which distributed
queries need to be run when a host checks in.

"Manual" labels can be specified by hostname, allowing users to specify
the membership of a label without having to use a dynamic query. See the
included documentation.
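
As a rough sketch of the idea, a manual label's hostname list can be resolved into rows for the label_membership table as follows. The `membership` type, the function name, and the handling of unknown hostnames are assumptions for illustration; the included documentation describes the actual behavior.

```go
package main

import "fmt"

// membership is a hypothetical representation of one row in the
// label_membership table.
type membership struct {
	LabelID uint
	HostID  uint
}

// manualMembership resolves hostnames to membership rows using a
// hostname-to-ID lookup. Hostnames with no matching host are returned
// separately, and duplicate hosts are added only once.
func manualMembership(labelID uint, hostnames []string, hostIDs map[string]uint) (rows []membership, unknown []string) {
	seen := make(map[uint]bool)
	for _, name := range hostnames {
		id, ok := hostIDs[name]
		if !ok {
			unknown = append(unknown, name)
			continue
		}
		if seen[id] {
			continue
		}
		seen[id] = true
		rows = append(rows, membership{LabelID: labelID, HostID: id})
	}
	return rows, unknown
}

func main() {
	hostIDs := map[string]uint{"web-1": 10, "web-2": 11}
	rows, unknown := manualMembership(7, []string{"web-1", "web-2", "web-1", "db-9"}, hostIDs)
	fmt.Println(rows, unknown)
}
```

Because membership is stored as plain rows, the same table can serve both query-driven and manual labels.
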
Cleans up some repetition in tests.

Getting a single host with `fleetctl get host foobar` will look up the
host with the matching hostname, uuid, osquery identifier, or node key,
and provide the full host details along with the labels the host is a
member of.
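
The lookup described above can be sketched in Go as a match against any of the four identifiers. This is an in-memory illustration with a hypothetical `host` type; the server presumably performs the equivalent match in MySQL.

```go
package main

import "fmt"

// host is a hypothetical, trimmed-down host record.
type host struct {
	ID            uint
	Hostname      string
	UUID          string
	OsqueryHostID string
	NodeKey       string
}

// findHost returns the first host whose hostname, UUID, osquery
// identifier, or node key equals identifier.
func findHost(hosts []host, identifier string) (host, bool) {
	if identifier == "" {
		return host{}, false
	}
	for _, h := range hosts {
		switch identifier {
		case h.Hostname, h.UUID, h.OsqueryHostID, h.NodeKey:
			return h, true
		}
	}
	return host{}, false
}

func main() {
	hosts := []host{
		{ID: 1, Hostname: "foobar", UUID: "uuid-1", OsqueryHostID: "osq-1", NodeKey: "key-1"},
		{ID: 2, Hostname: "bazqux", UUID: "uuid-2", OsqueryHostID: "osq-2", NodeKey: "key-2"},
	}
	h, ok := findHost(hosts, "uuid-2")
	fmt.Println(h.Hostname, ok)
}
```
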
- Debounce frontend to reduce number of target searches in live query.
- More efficiently calculate label counts in live query and hosts
  dashboard. Instead of using the (slow) CountHostsInTargets function,
  retrieve the host counts while looking up the labels.
- Optimize targets search query. Removing the nested query retrieves the
  same logical result set, but substantially optimizes MySQL CPU usage.
  Testing indicates about a 50% reduction in MySQL CPU usage for the
  frontend targets search API call after applying this change.
@zwass zwass requested a review from directionless July 16, 2020 01:01

zwass commented Jul 16, 2020

@directionless I've updated per your comments and rebased to fix merge conflicts. Please let me know how it looks.

@directionless directionless left a comment

Let's try.

It's big enough I don't think I can wrap my head around all the nuance. But it seems like it's probably fine to try. And I'm happy enough with the Redis implementation.

@zwass zwass merged commit 7494513 into kolide:master Jul 21, 2020
@zwass zwass deleted the perf-manual-labels branch July 21, 2020 21:05