Skip to content

Updates to ModelKit format#635

Merged
amisevsk merged 8 commits intokitops-ml:mainfrom
amisevsk:modelkit-updates
Dec 6, 2024
Merged

Updates to ModelKit format#635
amisevsk merged 8 commits intokitops-ml:mainfrom
amisevsk:modelkit-updates

Conversation

@amisevsk
Copy link
Copy Markdown
Contributor

@amisevsk amisevsk commented Dec 3, 2024

Description

This PR implements the updates discussed in #489:

  • We've added fields to the ModelKit manifest's config to store layer digests and diff IDs, so that we can match Kitfile fields to layers more easily (rather than relying on order)
  • We're now packing tar layers slightly differently -- instead of packing the contents of a layer's path, we pack full directory structure from the pack context. In short, if you extract the tarballs for all layers in a manifest, you should get back the original file structure

This change is backwards compatible with ModelKits packed prior to this change -- they can still be unpacked, etc. This change does, however, break reproducible digests for older ModelKits (packing the same data with Kit prior to this change will result in a ModelKit with a different digest due to the changes to tar files and the config). Within Kit versions, digests are still reproducible.

Linked issues

Closes #489

Add fields 'digest' and 'diffId' to Kitfile fields that represent
ModelKit layers, which are only serialized in JSON format. When the
Kitfile is used as the ModelKit OCI artifact config, it should contain
digests and diff IDs, but when it is deserialized to YAML it should be
the Kitfile as normal.
Change how files are packed into tarballs for ModelKit layers: instead
of packing files relative to the layer's path within a Kitfile, we
instead pack them relative to the context directory for the kit command
(i.e. kit pack <path>).

This means that a ModelKit's layers can be extracted directly to
reproduce the files/directories from the original repository, without
relying on the CLI to place files at the correct path.

Additionally, this commit uses UID 1000 for the file owner inside the
tarball instead of 0 (root). This makes it easier to consume ModelKit
layers as layers in an OCI image.
We need to handle existing ModelKits in addition to the new tar format
-- for new ModelKits, we can unpack the tars directly (and simply), but
for older ones, we need to pre-create the layer's directory and
concatenate paths from the tarball.

To avoid overcomplicating things (since we're still relatively early),
we detect new ModelKits via the digest/diffId fields in the config; this
is a lot simpler to support than using new versions on our media types.
@amisevsk amisevsk requested a review from gorkem December 3, 2024 19:32
Comment thread pkg/lib/kitfile/layer.go
// Clean path to ensure consistent format (./path vs path/ vs path)
path = filepath.Clean(path)

if layerIgnored, err := ignore.Matches(path, path); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call to ignore.Matches is ood to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the ignore matcher is used for both the .kitignore file and for intersecting layers (where e.g. a dataset layer packs a subdirectory of a code layer). The second argument is used to determine whether the path is ignored due to it being packed by another layer in the ModelKit -- using path, path here functionally disables the intersecting dirs ignore check. The function signature is

func Matches(path, layerPath string) (bool, error)

Alternatively, we could split the ignore into two checks (i.e. ignored due to kitignore vs ignored due to being part of another layer, I just wanted to keep the check simple.

Comment thread pkg/lib/kitfile/layer.go
Comment on lines +132 to 142
aToB, errA := filepath.Rel(a, b)
bToA, errB := filepath.Rel(b, a)
if errA != nil || errB != nil {
plog.Logf(output.LogLevelWarn, "Cannot compare directories %s and %s, skipping path", a, b)
return false
}
if strings.Contains(aToB, "..") && strings.Contains(bToA, "..") {
return false
}
return true
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would HasPrefix be more approprate instead of Contains?

Copy link
Copy Markdown
Contributor Author

@amisevsk amisevsk Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be, I opted for Contains so as to not worry about ./.., C:path, etc.

I'd have to test how relative paths are handled on Windows to be sure. We might also have weird bugs around switching drives on Windows, I'm not sure (not familiar with how it all works on that side of things)

Comment thread pkg/cmd/unpack/unpack.go Outdated
_, relPath, err = filesystem.VerifySubpath(opts.unpackDir, config.Model.Path)
if err != nil {
return fmt.Errorf("error resolving model path: %w", err)
if config.Model.LayerInfo != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check if there is a config.Model ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only enter this branch of the switch if we encounter a model layer, which implies the existence of a model field in the Kitfile (i.e. how did we pack a model layer without a model section in the Kitfile?).

Comment thread pkg/cmd/unpack/unpack.go Outdated
_, relPath, err = filesystem.VerifySubpath(opts.unpackDir, docsEntry.Path)
if err != nil {
return fmt.Errorf("error resolving path %s for docs: %w", docsEntry.Path, err)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated code for handling every layer. Should it be refactored ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just kind of the way things end up happening in Golang. I can try to rework it but in my experience it ends up decreasing readability :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest commit should make this a little clearer.

Updates to handle empty directories, since we're hitting a few edge
cases:

* If there are no files to be packed, don't initialize a progress bar,
  since it will never be filled. This causes pack to hang
* Print a warning if the computed size of a layer is 0 (ignoring
  directories) to warn the user of a potential mistake
* Check that target directory exists; since we're walking from the pack
  context, we no longer automatically fail when a layer's path does not
  exist.
@amisevsk amisevsk merged commit e0e5f35 into kitops-ml:main Dec 6, 2024
@amisevsk amisevsk deleted the modelkit-updates branch December 6, 2024 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider QoL improvements to tar artifacts generated by kit

2 participants