
.service subpkg #483

Merged
merged 38 commits into master from service_subpkg on Mar 10, 2023
Conversation

goodboy (Contributor) commented Mar 8, 2023:

Move all actor-service control APIs and the service manager (basically all the stuff that was in `piker._daemon`) into a new `piker.service` sub-package; this will serve (get it..) as the high-level actor-service orchestration sub-system: the business logic for piker's distributed architecture and runtime deployment.


Of note,

  • includes improvements to .service._ahab which repair a regression: a silent startup failure with marketstore due to the log-processing poll sleep period change..
    • adds proper async log msg proxying without blocking the ahabd supervisor task
    • correctly proxies pikerd logging through to the docker super code in general..
  • moved the _ahab.py docker supervisor into the sub-package as well since it is literally an external, containerized service supervisor 😂
  • moved both the .data.elastix and .marketstore mods here as well until we figure out how to formalize our storage layer subsystems, and since most of that code is related to db interaction/management vs. data processing.
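
For orientation, the resulting layout looks roughly like the following (a sketch inferred from the bullets above and the ToDo below; exact file names are approximate):

```
piker/
  service/
    __init__.py        # actor-service control APIs + service manager
    _ahab.py           # dockerized service supervisor
    _actor_runtime.py  # runtime startup machinery
    elastic.py         # elasticsearch mod (was .data.elastix)
    marketstore.py     # marketstore mod
```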

ToDo:

  • extend the tests for both dbs to have a client connect and do something or other..
  • add cancellation tests for the dockerized services
    => moved these to a follow-up testing task: ahabd supervision tests (#487)
  • consider breaking out the daemon endpoints to a ._daemon.py module in each sub-system package?
  • move Services api into its own module?
  • move the startup machinery into _runtime.py? (** chose .service._actor_runtime.py instead **)
  • move the ._exec.py stuff (which is more or less just Qt runtime startup) into this pkg as well? => also moved to: How to expose/organize <daemon>d entrypoints? (#488)
  • make test harness use the tmp_dir fixture for our config dir path during testing
    • required augmenting the runtime init to passthrough data to tractor._state._runtime_vars

@goodboy added labels on Mar 8, 2023: integration (external stack and/or lib augmentations), (sub-)systems (general sw design and eng), tsdb (time series db stuff)
if delete:
    for fqsn in symbols:
        syms = await storage.client.list_symbols()
        breakpoint()
Contributor commented:

left in a breakpoint unintentionally?

goodboy (Contributor, Author) replied:

ish, was WIP, fixed now (ish) via 6a0ae58

@@ -322,37 +354,94 @@ async def open_ahabd(
     ) = ep_func(client)
     cntr = Container(dcntr)

-    with trio.move_on_after(start_timeout):
-        found = await cntr.process_logs_until(start_lambda)
+    conf: ChainMap[str, Any] = ChainMap(
jaredgoldman (Contributor) commented Mar 9, 2023:

Just curious, why use a ChainMap here instead of a dict?

goodboy (Contributor, Author) commented Mar 9, 2023:

Because it's easier than writing a lot of dict.get(blah, default) stuff and makes everything extensible for if we ever need to offer overrides of container configs for specific use cases..

if you haven't read it yet, check out the docs on this type btw, this is kinda why it was added to stdlib afaiu 🏄🏼
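
For illustration, a minimal sketch of the pattern under discussion — per-container overrides layered over supervisor defaults, with lookups falling through to the first map that has the key (names here are illustrative, not the actual `_ahab` code):

```python
from collections import ChainMap
from typing import Any

# supervisor-side defaults (values match the set listed further below)
defaults: dict[str, Any] = {
    'startup_timeout': 1.0,
    'startup_query_period': 0.001,
    'log_msg_key': 'msg',
}

# an endpoint only supplies the keys it wants to override..
overrides: dict[str, Any] = {'startup_timeout': 10.0}

# ..and lookups fall through: overrides first, then defaults.
conf: ChainMap[str, Any] = ChainMap(overrides, defaults)
assert conf['startup_timeout'] == 10.0  # from overrides
assert conf['log_msg_key'] == 'msg'     # from defaults
```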

@@ -135,7 +135,7 @@ def start_marketstore(

 # create dirs when dne
 if not os.path.isdir(config._config_dir):
-    Path(config._config_dir).mkdir(parents=True, exist_ok=True)
+    Path(config._config_dir).mkdir(parents=True, exist_ok=True)
Contributor commented:
ws?

@@ -659,7 +662,7 @@ async def tsdb_history_update(
# - https://github.com/pikers/piker/issues/98
#
 profiler = Profiler(
-    disabled=False,  # not pg_profile_enabled(),
+    disabled=True,  # not pg_profile_enabled(),
Contributor commented:
is default true on the Profiler purposeful?

jaredgoldman (Contributor) left a review:

lgtm once ci is passing bruddr

goodboy (Contributor, Author) commented Mar 9, 2023:
🙏🏼 to da CI gawdz

goodboy (Contributor, Author) commented Mar 9, 2023:
nothing like hanging for 40 mins

Adds a `piker storage` subcmd with a `-d` flag to wipe a particular
fqsn's time series (both 1s and 60s). Obviously this needs to be
extended much more but provides a start point.

With the addition of new `elasticsearch` docker support in
#464, adjustments were made
to container startup sync logic (particularly the `trio` checkpoint
sleep period - which itself is a hack around a sync client api) which
caused a regression in upstream startup logic wherein container error
logs were not being bubbled up correctly causing a silent failure mode:

- `marketstore` container started with corrupt input config
- `ahabd` super code timed out on startup phase due to a larger log
  polling period, skipped processing startup logs from the container,
  and continued on as though the container was started
- history client failed on grpc connection with no clear error about why
  the connection failed.

Here we revert to the old poll period (1ms) to avoid any more silent
failures and further extend supervisor control through a configuration
override mechanism. To address the underlying design issue, this patch
adds support for container-endpoint-callbacks to override supervisor
startup configuration parameters via the 2nd value in their returned
tuple: the already delivered configuration `dict` value.

The current exposed values include:
    {
        'startup_timeout': 1.0,
        'startup_query_period': 0.001,
        'log_msg_key': 'msg',
    },

This allows for container-specific control over the startup-sync query
period (the hack mentioned above) as well as the expected log msg key
and of course the startup timeout.
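
As a sketch of how an endpoint callback might use this mechanism (the function name and docker calls here are assumptions for illustration, not the exact shipped API):

```python
import docker

def start_elasticsearch_ep(client: docker.DockerClient):
    # spawn the container via the (sync) docker-py client..
    dcntr = client.containers.run(
        'docker.elastic.co/elasticsearch/elasticsearch:7.17.4',
        detach=True,
    )
    # ..and hand back per-container overrides as the 2nd tuple value;
    # the supervisor ChainMaps these over its defaults.
    return dcntr, {
        'startup_timeout': 240.0,     # ES can take a while to come up
        'startup_query_period': 0.1,
        'log_msg_key': 'message',     # ES json logs use a 'message' key
    }
```
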
Previously we would make the `ahabd` supervisor-actor sync to docker
container startup using pseudo-blocking log message processing.

This has issues,
- we're forced to do a hacky "yield back to `trio`" in order to be
  "fake async" when reading the log stream and further,
- blocking on a message is fragile and often slow.

Instead, run the log processor in a background task and in the parent
task poll for the container to be in the client list using a similar
pseudo-async poll pattern. This allows the super to `Context.started()`
sooner (when the container is actually registered as "up") and thus
unblock its (remote) caller faster whilst still doing full log msg
proxying!

Deatz:
- adds `Container.cuid: str` a unique container id for logging.
- correctly proxy through the `loglevel: str` from `pikerd` caller task.
- shield around `Container.cancel()` in the teardown block and use
  cancel level logging in that method.
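
A minimal `trio` sketch of that pattern (assumed names, not the actual `_ahab` code): the log proxy runs as a background task while the parent pseudo-async-polls the sync docker client for the container to register as running:

```python
import logging

import trio

log = logging.getLogger('ahabd-sketch')


async def proxy_logs(cntr) -> None:
    # "fake async": the docker-py log stream is sync, so yield back
    # to trio between reads instead of blocking the event loop.
    for entry in cntr.logs(stream=True):
        log.info(entry.decode())
        await trio.sleep(0)


async def supervise(client, cntr, ctx) -> None:
    async with trio.open_nursery() as tn:
        # full log msg proxying runs in the background..
        tn.start_soon(proxy_logs, cntr)

        # ..while the parent task polls the client list until the
        # container shows up as running.
        while not any(c.id == cntr.id for c in client.containers.list()):
            await trio.sleep(0.001)

        # unblock the (remote) caller as soon as the container is up
        await ctx.started(cntr.id)
```
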
For now just moves everything that was in `piker._daemon` to a subpkg
module but a reorg is coming pronto!

Not really sure there's much we can do besides dump Grpc stuff when we
detect an "error" `str` for the moment..

Either way leave a buncha complaints (as always) and do linting
fixups..

Thanks @esme! XD

Also, do a linter pass and remove a buncha unused references.

Due to making ahabd supervisor init more async we need to be more
tolerant to mkts server startup: the grpc machinery needs to be up
otherwise a client which connects too early may just hang on requests..

Add a reconnect loop (which might end up getting factored into client
code too) so that we only block on requests once we know the client
connection is actually responsive.
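
Such a reconnect loop might look roughly like this (a sketch assuming a storage client exposing the `list_symbols()` rpc seen earlier; not the exact shipped code):

```python
import trio


async def wait_till_responsive(
    storage_client,
    timeout: float = 30.0,
    period: float = 0.1,
) -> None:
    # only let real requests through once a cheap rpc round-trips,
    # proving the grpc machinery is actually up and answering.
    with trio.fail_after(timeout):
        while True:
            try:
                await storage_client.list_symbols()
                return
            except Exception:
                # conn not ready yet; back off and retry
                await trio.sleep(period)
```
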
Provides a more correct solution (particularly for distributed testing)
to override the `piker` configuration directory by reading the path from
a specific `tractor._state._runtime_vars` entry that can be provided by
the test harness.

Also fix some typing and comments.
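
Conceptually the read side looks something like this (the entry name below is an assumption for illustration; the real key is whatever the test harness injects):

```python
import tractor


def get_config_dir(default: str) -> str:
    # `_runtime_vars` is a plain dict propagated to every actor in the
    # tree; fall back to the normal config dir when no override is set.
    rvs: dict = tractor._state._runtime_vars
    return rvs.get('_config_dir_override', default)
```
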
Needed to move the startup sequence inside the `try:` block to guarantee
we always do the (now shielded) `.cancel()` call if we get a cancel
during startup.

Also, support an optional `started_afunc` field in the config if
backends want to just provide a one-off blocking async func to sync
container startup. Add a `drop_root_perms: bool` to allow persisting
sudo perms for testing or dynamic container spawning purposes.
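
The teardown guarantee described here boils down to this shape (a sketch; only the async `Container.cancel()` comes from the commit above, the rest of the interface is assumed):

```python
import trio


async def supervise(cntr) -> None:
    try:
        # startup now lives inside the `try:` so a cancel that arrives
        # mid-startup still runs the (shielded) teardown below.
        await cntr.astart()  # assumed async start helper
        await trio.sleep_forever()
    finally:
        # shield the teardown so it can't itself be interrupted by the
        # in-flight cancellation.
        with trio.CancelScope(shield=True):
            await cntr.cancel()
```
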
goodboy (Contributor, Author) commented Mar 10, 2023:

THERE. moved all outstanding bullets to new follow up issues 🏄🏼

So whoever decides to review this once more and decides we gud, please merge.
If not i'll land later tonight 💥

goodboy (Contributor, Author) commented Mar 10, 2023:
Heh, i can tell y'all aren't testing non-disti mode.. 😂

This broke non-disti-mode actor tree spawn / runtime, seemingly because
the cli entrypoint for a `piker chart` also sends these values down
through the call stack independently? Pretty sure we don't need to send
the `enable_modules` from the chart actor anyway.
@goodboy goodboy merged commit eb51033 into master Mar 10, 2023
@goodboy goodboy deleted the service_subpkg branch March 10, 2023 15:37