This repository has been archived by the owner on Jul 31, 2024. It is now read-only.

Add a global locate sample rate, with optional rate controller #1426

Merged
jwhitlock merged 10 commits into mozilla:main from global-sample-rate-1398 on Dec 9, 2020

Conversation

jwhitlock
Member

@jwhitlock jwhitlock commented Nov 20, 2020

For issue #1398, add a global sample rate for locate observations. Automatically adjust it, with the goal of processing as many observations as possible without accumulating a growing backlog (and without manual intervention).

To make the data queues easier to monitor, the queue metrics have two new tags:

  • queue_type: data or task
  • data_type: For data queues, the type of data. bluetooth, cell, and (most of all) wifi contribute to the backlog calculation

DataQueue and derived classes store the data_type, passed as a new initialization variable.
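
A minimal sketch of how the two tags and the backlog calculation fit together. The queue names, sizes, and the `key:value` tag format are assumptions for illustration, as is leaving `data_type` off task queues; only the tag names and the three backlog data types come from the description above.

```python
# Hypothetical queue inventory; names and sizes are made up for illustration.
QUEUES = [
    {"name": "example_wifi", "queue_type": "data", "data_type": "wifi", "size": 700},
    {"name": "example_cell", "queue_type": "data", "data_type": "cell", "size": 120},
    {"name": "example_blue", "queue_type": "data", "data_type": "bluetooth", "size": 30},
    {"name": "example_report", "queue_type": "data", "data_type": "report", "size": 50},
    {"name": "example_celery", "queue_type": "task", "data_type": None, "size": 5},
]

# Data types that count toward the backlog, per the PR description.
BACKLOG_DATA_TYPES = {"bluetooth", "cell", "wifi"}


def queue_tags(queue):
    """Build the metric tags for one queue (tag format is an assumption)."""
    tags = [f"queue_type:{queue['queue_type']}"]
    if queue["queue_type"] == "data":
        tags.append(f"data_type:{queue['data_type']}")
    return tags


def station_backlog(queues):
    """Sum the sizes of the data queues that contribute to the backlog."""
    return sum(
        q["size"]
        for q in queues
        if q["queue_type"] == "data" and q["data_type"] in BACKLOG_DATA_TYPES
    )
```

With the made-up sizes above, only the wifi, cell, and bluetooth queues are summed; the `report` and task queues are measured and tagged but ignored for the backlog.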

The queue metric task monitor_queue_size is now monitor_queue_size_and_rate_control, and runs the rate controller if enabled. The rate controller is implemented by a PID controller provided by the simple-pid library. Simulation showed that only proportional control was needed, so it is more of a P Controller.
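
To illustrate proportional-only control, here is a sketch in plain Python. It does not use simple-pid and is not the code in this PR; the function name, the clamping of the output to the range 0 to target, and the conversion of that output to a percentage are all assumptions.

```python
def proportional_rate(backlog, target, kp=8.0):
    """Return a sample rate in percent (0.0 to 100.0) from the queue backlog.

    Assumed scheme: the setpoint is the target queue size, the controller
    output is clamped to [0, target], and the rate is output / target.
    """
    error = target - backlog  # positive while the backlog is below the target
    output = kp * error  # proportional term only (Ki = Kd = 0)
    output = max(0.0, min(float(target), output))  # clamp to [0, target]
    return 100.0 * output / target
```

Under these assumptions, with Kp = 8 and a target of 1000, the rate stays at 100% until the backlog passes 875, then falls linearly to 0% at a backlog of 1000.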

This PR adds some new Redis keys:

  • global_locate_sample_rate: Read in the web app, assumed 100.0 (100%) if unset. Set in the task app if the rate controller is enabled.
  • rate_controller_target: The target maximum data queue size, as an integer. Must be set by an administrator to enable the rate controller.
  • rate_controller_enabled: Read in the task app, 1 to enable the rate controller, 0 or unset to disable. Set by an administrator.
  • rate_controller_kp, rate_controller_ki, rate_controller_kd: Kp, Ki, and Kd, the proportional, integral, and derivative gain terms. They are set to defaults (8, 0, and 0) when the rate controller is enabled, and could be adjusted by an administrator.
  • rate_controller_state: The internal state of the PID controller, as a JSON-encoded string, used to reload it when the task runs.
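
The key semantics above can be sketched as a loader that applies the documented defaults. A plain dict stands in for Redis here, and the function names and the rule that negative or non-numeric gains count as invalid are assumptions, not the PR's implementation.

```python
import math

# Default gains from the PR description: Kp = 8, Ki = 0, Kd = 0.
DEFAULT_GAINS = {"kp": 8.0, "ki": 0.0, "kd": 0.0}


def _to_number(raw, cast):
    """Convert a raw Redis value (bytes, str, or None), or return None."""
    try:
        value = cast(raw)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def load_rate_controller_settings(store):
    """Read the rate controller keys from a dict-like store with defaults."""
    enabled = store.get("rate_controller_enabled") in (b"1", "1")
    target = _to_number(store.get("rate_controller_target"), int)
    if enabled and (target is None or target <= 0):
        enabled = False  # no usable target, so the controller cannot run

    settings = {"enabled": enabled, "target": target}
    for name, default in DEFAULT_GAINS.items():
        value = _to_number(store.get(f"rate_controller_{name}"), float)
        # Assumption: missing, non-numeric, or negative gains use the default.
        settings[name] = default if value is None or value < 0 else value
    return settings
```

For example, a store holding only `rate_controller_enabled` set to `1` loads as disabled, while adding `rate_controller_target` set to `1000` loads as enabled with gains 8, 0, and 0.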

There are new metrics as well:

  • rate_control.locate: The current value of global_locate_sample_rate, or 100.0 if unset
  • rate_control.locate.kp, rate_control.locate.ki, rate_control.locate.kd: The current values of PID gains Kp, Ki, and Kd.
  • rate_control.locate.pterm, rate_control.locate.iterm, rate_control.locate.dterm: The internal components of the PID controller, for debugging and adjusting the PID gains.

There are new documents for rate control, as well as updates to the metrics docs.

@jwhitlock jwhitlock self-assigned this Nov 20, 2020
@jwhitlock jwhitlock marked this pull request as draft November 20, 2020 18:00
@jwhitlock jwhitlock force-pushed the global-sample-rate-1398 branch 3 times, most recently from ce0f735 to 41dfacd Compare December 3, 2020 19:04
Change sample rate calculation to use float math and
random.random() instead of random.randint(). Use mocks to test the
algorithm directly rather than statistically.
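
A sketch of the float-based sampling decision and of testing it with a mock instead of statistically. The function name and the choice to multiply a per-request rate by the global rate are assumptions for illustration.

```python
import random
from unittest import mock


def should_sample(sample_rate_percent, global_rate_percent=100.0):
    """Decide whether to keep one observation, using float math."""
    effective = (sample_rate_percent / 100.0) * (global_rate_percent / 100.0)
    # random.random() returns a float in [0.0, 1.0), so an effective rate
    # of 1.0 always samples and an effective rate of 0.0 never does.
    return random.random() < effective


# Deterministic check: pin random.random() rather than counting outcomes.
with mock.patch("random.random", return_value=0.49):
    assert should_sample(100.0, global_rate_percent=50.0)
with mock.patch("random.random", return_value=0.5):
    assert not should_sample(100.0, global_rate_percent=50.0)
```

Pinning the random value tests the boundary exactly, which a statistical test over many draws can only approximate.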
The global locate sample rate will allow reducing sampling when backend
processing is overloaded.
These queues have not generated any metrics in the last year, probably
because the names are queue_export_* rather than export_queue_*.
@jwhitlock jwhitlock changed the title WIP: Add a dynamically controlled global locate sample rate Add a global locate sample rate, with optional rate controller Dec 8, 2020
Add tags for queue_type (task or data) and data_type (various) to the
queue entries. This will make it easier to filter and aggregate data in
Grafana.

Tests now have a list of the expected queues and their tags. This
highlighted some missing queues in the documentation.
The station data backlog is computed while measuring the queue sizes,
and the rate controller parameters are read from Redis. If the rate
controller is enabled, the controller is initialized, and the
previous internal state (if any) is loaded. A new global locate sample
rate is determined and written to Redis, along with the rate controller
state.

Since the rate controller parameters will be set manually, they are
validated, and the rate controller turned off if they are invalid.
New validation is used in the API to read the global sample rate as
well, to make it a little safer, defaulting to 100%.
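
A sketch of what a defensive read of the global sample rate could look like. The function name is hypothetical, and treating out-of-range values the same as unset (rather than clamping them) is an assumption.

```python
import math


def read_global_locate_sample_rate(raw):
    """Parse the raw Redis value (bytes, str, or None) as a percentage.

    Falls back to 100.0 when the value is unset, not a number, or
    outside 0.0 to 100.0.
    """
    try:
        rate = float(raw)
    except (TypeError, ValueError):
        return 100.0  # unset or unparseable
    if not math.isfinite(rate) or not 0.0 <= rate <= 100.0:
        return 100.0  # assumption: out-of-range values are treated as unset
    return rate
```

Falling back to 100% means a bad value written by hand fails open: the API keeps sampling at the full rate instead of silently dropping observations.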
Instead of auto-disabling rate control, set the PID parameters to
reasonable settings.
@jwhitlock jwhitlock marked this pull request as ready for review December 9, 2020 00:16
@jwhitlock
Member Author

I'm a little bothered by the way the rate jumps around when under rate control. This could be solved by a filter on the rate output (some suggest a second PID controller on the output). However, it is more important to get some real-world data from using this code, so I'm stopping here.

Locally, I ran:

  • make build docs lint test, standard QA stuff
  • Three terminals:
    • make run for the web server. No change is expected.
    • make runcelery for the backend. The new tags for the queues are emitted.
    • make shell so I could run redis-cli -h redis

The rate controller can't be tested outside of simulations (in private docs and private repositories, since it uses production data for simulated traffic), but I was able to manually test:

  • Set only rate_controller_enabled to 1: it is reset to 0 when the queue task runs.
  • Set global_locate_sample_rate to 50, rate_controller_target to 1000, and rate_controller_enabled to 1: when the queue task runs, the rate controller runs, sets the rate to 100.0, sets the gains to the defaults, and emits the metrics.

Let's get some real-world usage data.

@jwhitlock jwhitlock merged commit 244eccd into mozilla:main Dec 9, 2020
@jwhitlock jwhitlock deleted the global-sample-rate-1398 branch December 9, 2020 16:07