Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Standalone watcher #1560

Open
wants to merge 69 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
69 commits
Select commit Hold shift + click to select a range
5054c5b
impl standalone watcher
fregataa Sep 10, 2023
77b9e8f
add news fragment
fregataa Sep 10, 2023
d3b8cd4
get lock file path from FileLock for mount, umount job
fregataa Sep 11, 2023
5fe914b
update news fragment
fregataa Sep 11, 2023
259b4e5
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Sep 11, 2023
cc1e8ff
little touch on setting watcher logging and server init step
fregataa Sep 11, 2023
ece0ca3
let the final WatcherPlugin classes implement abstracmethods
fregataa Sep 11, 2023
84f44f9
remove legacy agent config
fregataa Sep 11, 2023
0ff6c00
impl mount,umount at BaseWatcher
fregataa Sep 11, 2023
c554c24
Merge branch 'main' into feature/standalone-storage-watcher
achimnol Sep 11, 2023
8d4e1a6
impl auth middleware
fregataa Sep 11, 2023
c0fdaf1
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Sep 11, 2023
510d539
Merge remote-tracking branch 'origin/feature/standalone-storage-watch…
fregataa Sep 11, 2023
b656d27
get stop signal
fregataa Sep 12, 2023
20faef3
more config setting
fregataa Sep 12, 2023
880b244
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Sep 13, 2023
26856d8
update config and add sample config
fregataa Sep 13, 2023
128091b
refactor watcher init step and config
fregataa Sep 13, 2023
ce02d74
remove commented codes
fregataa Sep 13, 2023
4d1895d
Merge branch 'main' into feature/standalone-storage-watcher
achimnol Sep 17, 2023
bfeff48
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Sep 25, 2023
c268142
add README
fregataa Sep 25, 2023
1df59e3
follow up type hints
fregataa Sep 25, 2023
72063f7
Merge remote-tracking branch 'origin/feature/standalone-storage-watch…
fregataa Sep 25, 2023
6fae0d3
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Oct 1, 2023
7abf147
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Oct 24, 2023
965870c
add watcher to common BUILD dependents
fregataa Oct 24, 2023
f2e334b
Merge branch 'main' into feature/standalone-storage-watcher
achimnol Nov 2, 2023
6eb948f
repo: Update BUILD to work with latest scie build config
achimnol Nov 2, 2023
48bc1d4
fix: Unify the config file loading mechanism and error message
achimnol Nov 2, 2023
ca54466
refactor: Unify configuration error handling
achimnol Nov 2, 2023
9ebb29c
lint: Fix formatting of watcher/BUILD
achimnol Nov 2, 2023
6127829
test: We should use the fixture containers, not local config
achimnol Nov 2, 2023
b792710
Merge branch 'main' into feature/standalone-storage-watcher
achimnol Nov 3, 2023
e7078ee
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Nov 7, 2023
d9357dc
impl API to query webapp plugins
fregataa Nov 7, 2023
6e56ce6
Merge remote-tracking branch 'origin/feature/standalone-storage-watch…
fregataa Nov 7, 2023
788b326
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Nov 9, 2023
8751eca
consistant plugin list API
fregataa Nov 9, 2023
ca245fb
add route prefix field to watcher plugin
fregataa Nov 9, 2023
19ec9b5
refactor get plugin func
fregataa Nov 10, 2023
d9a8bac
let each watcher webapp has own watcher instance
fregataa Nov 10, 2023
61e9d54
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Nov 14, 2023
dca80a9
impl mount checker
fregataa Nov 14, 2023
4ed7778
fix polling mount
fregataa Nov 14, 2023
1c1804e
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Nov 27, 2023
100999c
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Jan 28, 2024
cfc7988
apply formatting
fregataa Jan 28, 2024
6b014ff
dont depend on agent src-watcher in BUILD file
fregataa Jan 29, 2024
d532b44
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Jan 29, 2024
3e3ecfa
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Feb 6, 2024
7814351
refactor event
fregataa Feb 6, 2024
2525e3d
remove mount-prefix and save mountpoints on etcd
fregataa Feb 8, 2024
2a49839
remove event producer/dispatcher from root-ctx and fix some umount code
fregataa Feb 8, 2024
ac09b4e
refactor utils.mount() API
fregataa Feb 9, 2024
2790dce
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Feb 9, 2024
4043282
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Feb 18, 2024
9a657ba
rename poll_mount_check API to poll_directory_mount
fregataa Feb 19, 2024
02df9e5
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Feb 19, 2024
670deb9
migrate agent/storage repo to bai monorepo
fregataa Feb 19, 2024
0a0fc7a
update watcher BUILD file
fregataa Feb 19, 2024
01ef92d
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Feb 25, 2024
872be43
update news fragment
fregataa Feb 25, 2024
6f478f1
leave TODO to comments
fregataa Feb 25, 2024
a037a39
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Mar 5, 2024
710604e
follow-up LogSeverity type
fregataa Mar 5, 2024
c147f81
resolve dependencies on BUILD
fregataa Mar 5, 2024
e3b4308
Merge branch 'main' into feature/standalone-storage-watcher
fregataa Mar 6, 2024
6358fd2
add README to agent/storage-proxy watcher
fregataa Mar 6, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions changes/1560.feature.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Implement storage watcher and agent watcher with new standalone watcher interface and remove legacy watchers and their configs.
2 changes: 0 additions & 2 deletions configs/agent/halfstack.toml
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,6 @@ var-base-path = "./var/lib/backend.ai"
# allow-compute-plugins = []
# block-compute-plugins = []
image-commit-path = "./tmp/backend.ai/commit/"
mount-path = "./vfroot/local"
cohabiting-storage-proxy = true


[container]
Expand Down
7 changes: 0 additions & 7 deletions configs/agent/sample.toml
Original file line number Diff line number Diff line change
Expand Up @@ -52,13 +52,6 @@ agent-sock-port = 6007
# This affects the per-sgroup configuration scope.
scaling-group = "default"

# Set the volume mount path for the agent node.
# mount-path = "/vfroot"

# A boolean flag to indicate if there is any cohabiting storage proxy within the same node.
# If is there any cohabiting storage proxy, then events such as mount or umount will be ignored. [default: true]
# cohabiting-storage-proxy = true

# Create a PID file so that daemon managers could keep track of us.
# If set to an empty string, it does NOT create the PID file.
# pid-file = "./agent.pid" # env: BACKEND_PID_FILE
Expand Down
63 changes: 63 additions & 0 deletions configs/watcher/sample.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
[etcd]
namespace = "local" # env: BACKEND_NAMESPACE
addr = { host = "127.0.0.1", port = 8121 } # env: BACKEND_ETCD_ADDR (host:port)
user = "" # env: BACKEND_ETCD_USER
password = "" # env: BACKEND_ETCD_PASSWORD


[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"
drivers = ["console"]

[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true

# One of: "simple", "verbose"
format = "simple"

[debug]
# Enable the debug mode by overriding the global loglevel and "ai.backend" loglevel.
enabled = false

[watcher]
# Override the name of this watcher node.
# If empty or unspecified, the config builds this from the hostname by prefixing it with "i-",
# like "i-watcher-hostname". The "i-watcher-" prefix is not mandatory, though.
# Explicit configuration may be required if the hostname changes frequently,
# to handle the event bus messages consistently.
# This affects the per-node configuration scope.
# node-id = "i-watcher-storage-proxy-01"

# One of: "asyncio", "uvloop"
# This changes the event loop backend.
# uvloop is a fast libuv-based implementation but sometimes has
# compatibility issues.
# event-loop = "uvloop"

# The base directory to put domain sockets for IPC.
# Normally you don't have to change it.
# NOTE: If Docker is installed via Snap (https://snapcraft.io/docker),
# you must change this to a directory under your *home* directory.
# ipc-base-path = "/tmp/backend.ai/ipc"

# Set the SSL certificate chain and the private keys used for serving the API requests.
ssl-enabled = false
#ssl-cert = "/etc/backend.ai/ssl/apiserver-fullchain.pem" # env: BACKNED_SSL_CERT
#ssl-privkey = "/etc/backend.ai/ssl/apiserver-privkey.pem" # env: BACKNED_SSL_KEY


# Set module's config.
# <MODULE_NAME> should be the `name` property of watcher plugin.
# It is recommended to define a watcher config that inherits from BaseWatcherConfig
# for better config validation and strict typing.
# [module.<MODULE_NAME>]
1 change: 0 additions & 1 deletion scripts/install-dev.sh
Original file line number Diff line number Diff line change
Expand Up @@ -888,7 +888,6 @@ configure_backendai() {
sed '/scratch-nfs-/d' ./agent.toml > ./agent.toml.sed
mv ./agent.toml.sed ./agent.toml
fi
sed_inplace "s@\(# \)\{0,1\}mount-path = .*@mount-path = "'"'"${ROOT_PATH}/${VFOLDER_REL_PATH}"'"'"@" ./agent.toml
if [ $ENABLE_CUDA -eq 1 ]; then
sed_inplace "s/# allow-compute-plugins =.*/allow-compute-plugins = [\"ai.backend.accelerator.cuda_open\"]/" ./agent.toml
elif [ $ENABLE_CUDA_MOCK -eq 1 ]; then
Expand Down
15 changes: 0 additions & 15 deletions src/ai/backend/agent/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -14,19 +14,6 @@ python_sources(
"src/ai/backend/helpers:src",
"//:reqs#backend.ai-krunner-static-gnu", # not auto-inferred
":resources",
"!./__init__.py:src-watcher",
],
)
python_sources(
name="src-watcher",
sources=[
"__init__.py",
"watcher.py",
],
dependencies=[
":resources",
# Exclude transitive dependencies of the agent itself!
"!./__init__.py:src",
],
)

Expand Down Expand Up @@ -80,14 +67,12 @@ pex_binary(
"//src/ai/backend/accelerator/mock:lib",
"//src/ai/backend/accelerator/mock:buildscript",
"!!stubs/trafaret:stubs",
"!!./__init__.py:src-watcher",
],
)
pex_binary(
name="pex-watcher",
entry_point="watcher.py",
dependencies=[
":src-watcher",
"!!stubs/trafaret:stubs",
"!!./__init__.py:src",
],
Expand Down
54 changes: 27 additions & 27 deletions src/ai/backend/agent/agent.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@

from ai.backend.common import msgpack, redis_helper
from ai.backend.common.config import model_definition_iv
from ai.backend.common.defs import REDIS_STAT_DB, REDIS_STREAM_DB
from ai.backend.common.defs import MOUNT_MAP_KEY, REDIS_STAT_DB, REDIS_STREAM_DB
from ai.backend.common.docker import MAX_KERNELSPEC, MIN_KERNELSPEC, ImageRef
from ai.backend.common.events import (
AbstractEvent,
Expand Down Expand Up @@ -2388,41 +2388,40 @@ async def handle_volume_mount(
source: AgentId,
event: DoVolumeMountEvent,
) -> None:
if context.local_config["agent"]["cohabiting-storage-proxy"]:
log.debug("Storage proxy is in the same node. Skip the volume task.")
# TODO: This should be removed after agent-watcher is fully implemented
err_msg: str | None = None

mountpoint = Path(event.dir_name)
if mountpoint.is_mount():
log.debug("Volume is already mounted. Skip the volume task.")
await context.event_producer.produce_event(
VolumeMounted(
"already-mounted",
str(context.id),
VolumeMountableNodeType.AGENT,
str(mountpoint),
"",
event.quota_scope_id,
)
)
return
mount_prefix = await context.etcd.get("volumes/_mount")
volume_mount_prefix: str | None = context.local_config["agent"]["mount-path"]
if volume_mount_prefix is None:
volume_mount_prefix = "./"
real_path = Path(volume_mount_prefix, event.dir_name)
err_msg: str | None = None
try:
await mount(
str(real_path),
str(mountpoint),
event.fs_location,
event.fs_type,
event.cmd_options,
event.edit_fstab,
event.fstab_path,
mount_prefix,
)
await context.etcd.put(f"{MOUNT_MAP_KEY}/{str(mountpoint)}", event.fs_location)
except VolumeMountFailed as e:
err_msg = str(e)
await context.event_producer.produce_event(
VolumeMounted(
"mount-event",
str(context.id),
VolumeMountableNodeType.AGENT,
str(real_path),
event.quota_scope_id,
str(mountpoint),
err_msg,
)
)
Expand All @@ -2433,40 +2432,41 @@ async def handle_volume_umount(
source: AgentId,
event: DoVolumeUnmountEvent,
) -> None:
if context.local_config["agent"]["cohabiting-storage-proxy"]:
log.debug("Storage proxy is in the same node. Skip the volume task.")
# TODO: This should be removed after agent-watcher is fully implemented
timeout = await context.etcd.get("config/watcher/file-io-timeout")
err_msg: str | None = None

mountpoint = Path(event.dir_name)
if not mountpoint.is_mount():
log.debug("Volume is already umounted. Skip the volume task.")
await context.event_producer.produce_event(
VolumeUnmounted(
"already-umounted",
str(context.id),
VolumeMountableNodeType.AGENT,
str(mountpoint),
"",
event.quota_scope_id,
)
)
return
mount_prefix = await context.etcd.get("volumes/_mount")
timeout = await context.etcd.get("config/watcher/file-io-timeout")
volume_mount_prefix = context.local_config["agent"]["mount-path"]
real_path = Path(volume_mount_prefix, event.dir_name)
err_msg: str | None = None
try:
did_umount = await umount(
str(real_path),
mount_prefix,
str(mountpoint),
event.edit_fstab,
event.fstab_path,
timeout_sec=float(timeout) if timeout is not None else None,
)
await context.etcd.delete(f"{MOUNT_MAP_KEY}/{str(mountpoint)}")
except VolumeMountFailed as e:
err_msg = str(e)
if not did_umount:
log.warning(f"{real_path} does not exist. Skip umount")
log.warning(f"{mountpoint} does not exist. Skip umount")
await context.event_producer.produce_event(
VolumeUnmounted(
"umount-event",
str(context.id),
VolumeMountableNodeType.AGENT,
str(real_path),
event.quota_scope_id,
str(mountpoint),
err_msg,
)
)
2 changes: 0 additions & 2 deletions src/ai/backend/agent/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,6 @@
t.Key("var-base-path", default="./var/lib/backend.ai"): tx.Path(
type="dir", auto_create=True
),
t.Key("mount-path", default=None): t.Null | tx.Path(type="dir", auto_create=True),
t.Key("cohabiting-storage-proxy", default=True): t.Bool(),
t.Key("public-host", default=None): t.Null | t.String,
t.Key("region", default=None): t.Null | t.String,
t.Key("instance-type", default=None): t.Null | t.String,
Expand Down
64 changes: 31 additions & 33 deletions src/ai/backend/agent/server.py
Original file line number Diff line number Diff line change
Expand Up @@ -883,38 +883,33 @@ def main(
# Determine where to read configuration.
try:
raw_cfg, cfg_src_path = config.read_from_file(config_path, "agent")
except config.ConfigurationError as e:
print(
"ConfigurationError: Could not read or validate the storage-proxy local config:",
file=sys.stderr,
)
print(pformat(e.invalid_data), file=sys.stderr)
raise click.Abort()

# Override the read config with environment variables (for legacy).
config.override_with_env(raw_cfg, ("etcd", "namespace"), "BACKEND_NAMESPACE")
config.override_with_env(raw_cfg, ("etcd", "addr"), "BACKEND_ETCD_ADDR")
config.override_with_env(raw_cfg, ("etcd", "user"), "BACKEND_ETCD_USER")
config.override_with_env(raw_cfg, ("etcd", "password"), "BACKEND_ETCD_PASSWORD")
config.override_with_env(
raw_cfg, ("agent", "rpc-listen-addr", "host"), "BACKEND_AGENT_HOST_OVERRIDE"
)
config.override_with_env(raw_cfg, ("agent", "rpc-listen-addr", "port"), "BACKEND_AGENT_PORT")
config.override_with_env(raw_cfg, ("agent", "pid-file"), "BACKEND_PID_FILE")
config.override_with_env(raw_cfg, ("container", "port-range"), "BACKEND_CONTAINER_PORT_RANGE")
config.override_with_env(raw_cfg, ("container", "bind-host"), "BACKEND_BIND_HOST_OVERRIDE")
config.override_with_env(raw_cfg, ("container", "sandbox-type"), "BACKEND_SANDBOX_TYPE")
config.override_with_env(raw_cfg, ("container", "scratch-root"), "BACKEND_SCRATCH_ROOT")

if debug:
log_level = LogSeverity.DEBUG
config.override_key(raw_cfg, ("debug", "enabled"), log_level == LogSeverity.DEBUG)
config.override_key(raw_cfg, ("logging", "level"), log_level)
config.override_key(raw_cfg, ("logging", "pkg-ns", "ai.backend"), log_level)

# Validate and fill configurations
# (allow_extra will make configs to be forward-copmatible)
try:
# Override the read config with environment variables (for legacy).
config.override_with_env(raw_cfg, ("etcd", "namespace"), "BACKEND_NAMESPACE")
config.override_with_env(raw_cfg, ("etcd", "addr"), "BACKEND_ETCD_ADDR")
config.override_with_env(raw_cfg, ("etcd", "user"), "BACKEND_ETCD_USER")
config.override_with_env(raw_cfg, ("etcd", "password"), "BACKEND_ETCD_PASSWORD")
config.override_with_env(
raw_cfg, ("agent", "rpc-listen-addr", "host"), "BACKEND_AGENT_HOST_OVERRIDE"
)
config.override_with_env(
raw_cfg, ("agent", "rpc-listen-addr", "port"), "BACKEND_AGENT_PORT"
)
config.override_with_env(raw_cfg, ("agent", "pid-file"), "BACKEND_PID_FILE")
config.override_with_env(
raw_cfg, ("container", "port-range"), "BACKEND_CONTAINER_PORT_RANGE"
)
config.override_with_env(raw_cfg, ("container", "bind-host"), "BACKEND_BIND_HOST_OVERRIDE")
config.override_with_env(raw_cfg, ("container", "sandbox-type"), "BACKEND_SANDBOX_TYPE")
config.override_with_env(raw_cfg, ("container", "scratch-root"), "BACKEND_SCRATCH_ROOT")
if debug:
log_level = LogSeverity.DEBUG
config.override_key(raw_cfg, ("debug", "enabled"), log_level == LogSeverity.DEBUG)
config.override_key(raw_cfg, ("logging", "level"), log_level)
config.override_key(raw_cfg, ("logging", "pkg-ns", "ai.backend"), log_level)

# Validate and fill configurations
# (allow_extra will make configs to be forward-copmatible)
cfg = config.check(raw_cfg, agent_local_config_iv)
if cfg["agent"]["backend"] == AgentBackend.KUBERNETES:
if cfg["container"]["scratch-type"] == "k8s-nfs" and (
Expand All @@ -931,7 +926,10 @@ def main(
pprint(cfg)
cfg["_src"] = cfg_src_path
except config.ConfigurationError as e:
print("ConfigurationError: Validation of agent local config has failed:", file=sys.stderr)
print(
"ConfigurationError: Could not read or validate the agent local config:",
file=sys.stderr,
)
print(pformat(e.invalid_data), file=sys.stderr)
raise click.Abort()

Expand Down Expand Up @@ -1000,7 +998,7 @@ def main(
log.info("runtime: {0}", utils.env_info())

log_config = logging.getLogger("ai.backend.agent.config")
if log_level == "DEBUG":
if log_level == LogSeverity.DEBUG:
log_config.debug("debug mode enabled.")

if cfg["agent"]["event-loop"] == "uvloop":
Expand Down