[POC] Multi-Model Puller #989
Conversation
Hi @ifilonenko. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@yuzisun @ellistarn @yuzliu @cliveseldon @rakelkar @njhill for review
cmd/puller/main.go
Outdated
func main() {
	flag.Parse()
	puller.OnConfigChange(func(e puller.EventWrapper) {
		log.Println("Send a request to:", e.LoadState)
This section will do a load / unload call via a local HTTP request to 0.0.0.0:xxxx.
EventWrapper has information pertaining to:
- Which model
- Whether load / unload
- What information -- should it be a load
For framework, is this string going to relate at all to the image in the Predictor spec, i.e. "tensorflow:v1.12", "patched_tensorflow:v1.12", or will it simply be a well-known enum of supported MMS frameworks?
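For illustration, the types discussed here might look roughly like the following; the State constants and the StorageUri field appear elsewhere in this diff, while the JSON tags and the Framework field are assumptions, not the PR's actual definitions:

```go
package main

import "fmt"

// State mirrors the Load/Unload states seen in watcher.go.
type State string

const (
	ShouldLoad   State = "Load"
	ShouldUnload State = "Unload"
)

// ModelDefinition is a hypothetical shape for the per-model JSON config;
// StorageUri appears later in downloader.go, Framework is the open question above.
type ModelDefinition struct {
	StorageUri string `json:"storageUri"`
	Framework  string `json:"framework"`
}

// EventWrapper carries which model changed, whether to load or unload,
// and -- should it be a load -- the parsed model definition.
type EventWrapper struct {
	ModelName string
	LoadState State
	ModelDef  *ModelDefinition
}

func main() {
	e := EventWrapper{
		ModelName: "model1",
		LoadState: ShouldLoad,
		ModelDef:  &ModelDefinition{StorageUri: "s3://bucket/model1", Framework: "tensorflow"},
	}
	fmt.Println(e.ModelName, e.LoadState, e.ModelDef.StorageUri)
}
```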
cmd/puller/main.go
Outdated
	log.Println("Send a request to:", e.LoadState)
	log.Println("for model", e.ModelName)
})
puller.WatchConfig(*modelDir)
Should we use "informers" here? https://firehydrant.io/blog/stay-informed-with-kubernetes-informers/
@ellistarn We'd like to not depend on k8s for model puller, likewise for model server.
Let me know when there is a verdict on this.
I tend to agree with @yuzisun on this
puller.Dockerfile
Outdated
@@ -0,0 +1,21 @@
# Build the inference-puller binary
FROM golang:1.13.0 as builder
We should move to 1.14 asap
No worries there, should I do that in this PR? Or is that for another time?
I was thinking of doing well-known enums, personally. But the nuances of that decision will be in a separate PR, so we can move that discussion there. This PR will rely on that one. [EDIT] PR can be found here: #992
pkg/puller/watcher.go
Outdated
ShouldLoad   State = "Load"
ShouldUnload State = "Unload"

writeOrCreateMask = fsnotify.Write | fsnotify.Create
What does the writeOrCreateMask state mean? Should a configMap update only result in two states: "Load" or "Unload"?
nvm, I saw how it is used later 👍
pkg/puller/watcher.go
Outdated
// 2 - if the model file was removed as a result of deletion or renaming
if p.onConfigChange != nil {
	ext := filepath.Ext(event.Name)
	isEdit := event.Op&writeOrCreateMask != 0
Using bitwise AND makes me nervous here, let's make sure we have good tests for it :)
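To make that concrete, the mask logic can be exercised with a small self-contained check. The Op constants below are local stand-ins mirroring fsnotify's bit-flag layout (Create, Write, Remove, Rename, Chmod declared via iota), not the library itself:

```go
package main

import "fmt"

// Op is a local stand-in for fsnotify.Op: a bitmask of event kinds.
type Op uint32

const (
	Create Op = 1 << iota // 1
	Write                 // 2
	Remove                // 4
	Rename                // 8
	Chmod                 // 16
)

const writeOrCreateMask = Write | Create

// isEdit reports whether the event op includes a Write or Create bit.
func isEdit(op Op) bool { return op&writeOrCreateMask != 0 }

// isRemove reports whether the event op includes the Remove bit.
func isRemove(op Op) bool { return op&Remove != 0 }

func main() {
	// An op can carry several bits at once, which is why the mask is needed.
	fmt.Println(isEdit(Create), isEdit(Write|Chmod), isEdit(Chmod), isRemove(Remove|Rename))
	// → true true false true
}
```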
pkg/puller/watcher.go
Outdated
isValidFile := ext == p.fileExtension
if isValidFile && (isEdit || isRemove) {
	fileName := strings.TrimSuffix(filepath.Base(event.Name), ext)
	if isRemove {
👍 I like that we have a file for each model in the configMap. We don't need to do a diff on every ConfigMap change which makes the logic cleaner and simpler.
We can add version information in the filename (e.g. ...) once we are ready to add VersionPolicy in the TrainedModel CRD.
Do you think that can be contained in the content of the JSON instead of the filename? Although we are getting ahead of ourselves, that will be for a later PR :)
Yeah we can discuss the details once we add version policy.
cmd/puller/main.go
Outdated
)

var (
	modelDir = flag.String("model_dir", "/tmp", "directory for multi-model config files")
Can we default to /mnt/models?
These are config files, I thought we would default /mnt/models for model files, but configs would go into a different location. Lmk where you prefer.
Maybe change the flag name to config-dir since model_dir implies that's where the models go (to me at least), and how about /mnt/config as the default?
pkg/puller/downloader.go
Outdated
err := try.Do(func(attempt int) (bool, error) {
	err := download(event.ModelDef.StorageUri)
	if err != nil {
		time.Sleep(1 * time.Second) // wait a second
Why do we wait for a second here?
Just a dummy thing for retry purposes, I don’t mind if we retry immediately. It’s a WIP
cmd/puller/main.go
Outdated
var (
	modelDir   = flag.String("model_dir", "/tmp", "directory for multi-model config files")
	numWorkers = flag.Int("num_workers", 1, "number of workers for parallel downloads")
The convention for arguments is num-workers.
pkg/puller/puller.go
Outdated
func InitiatePullers(numWorkers int, numRetries int) {
	p.NumWorkers = numWorkers
	p.NumRetries = numRetries
	p.Channel = make(chan EventWrapper)
We need a buffer size here, otherwise it is going to block the sender if the receiver is still consuming a previous event.
Should the buffer be the number of workers?
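A sketch of the buffered-channel suggestion, assuming the buffer is sized to numWorkers (whether that is the right size is the open question above):

```go
package main

import "fmt"

type EventWrapper struct{ ModelName string }

// enqueueWithoutBlocking shows that a buffered channel lets the watcher
// enqueue up to cap(ch) events before any worker has started receiving;
// with an unbuffered channel every send below would block.
func enqueueWithoutBlocking(numWorkers int) []string {
	ch := make(chan EventWrapper, numWorkers) // buffered: sends below don't block
	for i := 0; i < numWorkers; i++ {
		ch <- EventWrapper{ModelName: fmt.Sprintf("model%d", i)}
	}
	close(ch)

	var queued []string
	for e := range ch {
		queued = append(queued, e.ModelName)
	}
	return queued
}

func main() {
	fmt.Println(enqueueWithoutBlocking(4)) // → [model0 model1 model2 model3]
}
```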
@ifilonenko this is a great start, thanks! I think there's a fair bit more to be considered though to make this robust, and to do so while keeping it as efficient/reactive as possible may be nontrivial.
For example:
- Careful coordination may be needed w.r.t. management of per-model state to avoid race conditions.
- What happens if a model is removed while it is downloading? Or added while a prior model with the same name is still in the process of unloading?
- How will changes to existing model definitions be handled in a controlled manner? In some model servers this may entail unloading the existing model followed by loading the new one. As an aside, this is partly why I feel that it could be best for TrainedModels to be immutable - maybe for another discussion...
- What happens if the puller container dies/restarts and in the meantime models are removed from the configmap? There will be no watcher events for these deletions and so they would remain orphaned with files left on the shared volume and/or loaded in the model server. So an audit of the model data dir would be needed on startup to diff with the current configmap. In support of this I think a special file will need to be written alongside the model data to reflect download completion (along with source URI) and subsequently successful loading into the model server.
- What happens if the model server dies/restarts? Depending on its control mechanism, some way of querying its current config might be needed to reconcile state otherwise lost models won't be reinstated. TF-Serving (and maybe others) sidestep this problem via use of a declarative manifest
- Related to that last point, we may want to consider both declarative and command-based cases w.r.t. interfacing with different model servers... load/unload events aren't very useful in the TF Serving case since it expects the full manifest to be amended and determines the deltas/actions itself.
What would make most sense imo is a level-triggered reconciliation loop analogous to a Kube controller, which compares/converges desired/actual states. Where desired state is the configmap mount dir, and actual state includes the model data shared volume contents, model server's own reported state, and any in-progress file download and/or deletion jobs.
Finally, what's the current thinking w.r.t. reporting status back? Ideally the TrainedModel CR Status should reflect whether the model is loading / loaded / failed download / failed model-server load.
BTW I'd be happy to help with this and apologies if some/all was already planned in some way...
pkg/puller/watcher.go
Outdated
var w *Watcher

type Watcher struct {
	modelDir string
Similar to the other comment, maybe call it modelConfigDir for clarity?
pkg/puller/watcher.go
Outdated
type Watcher struct {
	modelDir       string
	onConfigChange func(EventWrapper)
	fileExtension  string
Do file extensions really need to be handled? How about keeping it simple and just assume that the designated directory is only for models and that model name == filename?
pkg/puller/watcher.go
Outdated
func OnConfigChange(run func(in EventWrapper)) {
	log.Println("Applying onConfigChange")
	w.onConfigChange = run
}
nit: could just have this as a second arg to the WatchConfig func?
pkg/puller/watcher.go
Outdated
isEdit := event.Op&writeOrCreateMask != 0
isRemove := event.Op&fsnotify.Remove != 0
isValidFile := ext == w.fileExtension
if isValidFile && (isEdit || isRemove) {
I don't think this is going to work as intended; in particular, it won't detect changes to existing files, since Kubernetes creates them as symlinks pointing to files within a hidden sibling dir called ..data, which is itself symlinked to a versioned dir (also hidden) containing a consistent revision of the configmap.
If keys are added to / removed from the configmap, then you will get events for the corresponding symlinks being created, which will be what you observed when testing, but they won't produce their own events when key contents change.
My suggestion would be to:
- Keep a filename/key -> ModelDefinition map in memory of the current state
- Watch only for fsnotify.Create events corresponding to the ..data symlink
- On that event, resolve the ..data symlink to its (versioned) target dir and read its contents; then produce any load/unload events based on diffing this with the map in memory (with a retry loop for IO errors in case the configmap is updated concurrently)
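The diffing step in that last bullet might be sketched like this; diffModels and the shape of ModelDefinition are assumptions, and symlink resolution, JSON parsing, and IO retries are deliberately omitted:

```go
package main

import (
	"fmt"
	"sort"
)

// ModelDefinition is a hypothetical stand-in for the parsed per-model JSON.
type ModelDefinition struct{ StorageUri string }

// diffModels compares the previous in-memory state with the freshly read
// configmap contents: models that are new or whose StorageUri changed should
// be loaded, and models that disappeared should be unloaded.
func diffModels(prev, cur map[string]ModelDefinition) (load, unload []string) {
	for name, def := range cur {
		if old, ok := prev[name]; !ok || old.StorageUri != def.StorageUri {
			load = append(load, name)
		}
	}
	for name := range prev {
		if _, ok := cur[name]; !ok {
			unload = append(unload, name)
		}
	}
	return load, unload
}

func main() {
	prev := map[string]ModelDefinition{"a": {"s3://x/a"}, "b": {"s3://x/b"}}
	cur := map[string]ModelDefinition{"a": {"s3://x/a-v2"}, "c": {"s3://x/c"}}
	load, unload := diffModels(prev, cur)
	sort.Strings(load) // map iteration order is random, so sort for display
	fmt.Println(load, unload) // → [a c] [b]
}
```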
Great comments! Thanks @njhill!
All the events for the same model should be processed by the same worker; we can create a set of channels and assign a worker to each channel, so events for the same model are consistent-hashed to the same channel and processed by the same worker.
I think this depends on how KFServing manages the model version, definitely another big discussion.
Once the model puller restarts, it should reconcile the model dir with the current configmap; yes, I agree an audit trail file is needed to facilitate this reconciliation process. Each successful download should write a file with the storage uri there; if the file is not present, or the storage uri in the file is different from the one in the configmap, we should trigger a new download.
TFServing can recover from the model configuration during restarts; for other model servers I think the model puller needs some way to detect the model server status and resync all the models. On the other hand, the trained model controller also resyncs the models periodically, which then triggers load/unload to the model server.
Correct, we have planned this work to send model status probes from the trained model controller so status can be reflected.
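The per-model worker assignment could be sketched as follows, using an FNV hash as one possible choice; the channel count, buffer size, and names are all illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type EventWrapper struct{ ModelName string }

// workerFor hashes the model name so that every event for a given model
// always maps to the same worker index (and therefore the same channel),
// keeping per-model event ordering intact.
func workerFor(modelName string, numWorkers int) int {
	h := fnv.New32a()
	h.Write([]byte(modelName))
	return int(h.Sum32()) % numWorkers
}

func main() {
	const numWorkers = 4
	channels := make([]chan EventWrapper, numWorkers)
	for i := range channels {
		channels[i] = make(chan EventWrapper, 8) // one buffered channel per worker
	}

	// Two events for the same model land on the same channel, preserving order.
	idx := workerFor("model1", numWorkers)
	channels[idx] <- EventWrapper{ModelName: "model1"}
	channels[idx] <- EventWrapper{ModelName: "model1"}
	fmt.Println("model1 -> worker", idx, "queued:", len(channels[idx]))
}
```

As noted in the next comment, this is functional but can delay unrelated work for a busy model; it trades some latency for per-model ordering.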
Thanks @yuzisun
I'm not sure about the hashing approach. It would be functional, but risks delaying work, sometimes resulting in longer load times than necessary. There are use cases with a high rate of model churn that are sensitive to the total time it takes for a model to become available. For tfserving there is also a question of how best to trigger the config reloads.
Agree, please include me because I have some thoughts :)
Great, but curious: from the trainedmodel controller to whom exactly? Imo this should come from the puller somehow, rather than directly from the model server (whether that is push or pull), since there could be other kinds of failures related to downloading for example, and it will be keeping track of the model server's state anyhow, as mentioned in the model-server restart recovery point. To me this seems analogous to how Pod CRs are manipulated in the context of scheduling to Nodes. Various actors contribute to the Pod's Status and update it directly IIRC (specifically the Conditions). In particular it's the kubelet that monitors and updates the running status of the Pod, and here I would suggest Puller/Model is kind of equivalent to Kubelet/Pod. But that would probably mean coupling the puller to Kube, which I also agree seems to be a nice thing to avoid if possible.
@ellistarn, @yuzisun and I are considering removing version information from the ModelSpec's framework enum and from this model configuration file. The reason is that we can validate the framework against different model servers, but validating whether X model server supports Y framework's Z version is pretty difficult. And if we allow users to put the ML framework version in the TrainedModel CR, they will expect that KFServing can validate framework versions against model server versions properly.
/retest
This is a great launching point. Thanks for putting in so much effort in refinements so far.
/lgtm
# Copy in the go src
WORKDIR /go/src/github.com/kubeflow/kfserving
COPY pkg/ pkg/
I learned a cool trick: if you first do
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
Docker will correctly layer your images so that you don't need to redownload on build unless your go.mod or go.sum change.
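Put together, the suggested layering might look like the fragment below. This is only a sketch based on the snippet above; the exact set of COPY lines depends on the repo layout:

```dockerfile
# Build the inference-puller binary
FROM golang:1.13.0 as builder

WORKDIR /go/src/github.com/kubeflow/kfserving

# Copy only the module files first: the download layer is cached and is
# invalidated only when go.mod or go.sum change.
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download

# Now copy the source; edits here no longer bust the download cache.
COPY pkg/ pkg/
COPY cmd/ cmd/
```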
^ that is neat, will need to re-organize for that. good call, because the build took forevahh
Can you elaborate? Move it before the pkg COPY?
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuzisun

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment.
Follow-on changes to kserve#989 based on remaining review suggestions.
- Simplified configmap change diffing
- Connect watcher and puller with event channel
- Have puller track in-progress ops per model via op completion channel and tie lifecycle of per-model channel+goroutine pairs to this
* Puller streamlining/simplification

  Follow-on changes to #989 based on remaining review suggestions.
  - Simplified configmap change diffing
  - Connect watcher and puller with event channel
  - Have puller track in-progress ops per model via op completion channel and tie lifecycle of per-model channel+goroutine pairs to this

* Minor change: fully decouple puller from watcher

* Address some of the review comments

  The complete ModelOp struct is now passed all the way back and forth.
What this PR does / why we need it:
This PR is needed when trying to define the multi-model story. The agent is responsible for pulling models into a model directory that the model server will use to host.
Fixes #1040
Design:
We assume the ConfigMap looks like this:
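The ConfigMap example appears to have been lost in rendering; an illustrative reconstruction (the models-config name and the per-model JSON filenames come from this description, all other keys and values are hypothetical) might look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: models-config
data:
  model_name1.json: |
    {"storageUri": "s3://bucket/model1", "framework": "tensorflow"}
  model_name2.json: |
    {"storageUri": "s3://bucket/model2", "framework": "tensorflow"}
```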
As such, model_name1.json and model_name2.json will each be files in the directory into which models-config will be mounted. This puller will watch (or pull) from this directory should any models be added or removed from the ConfigMap, and send the appropriate request to the load / unload methods.

Special notes for your reviewer:
Tested by running:
by modifying the data with the addition of model1, then the addition of model2:

Release note: