Add ability to automatically generate Kitfiles#661
Conversation
df6f11c to
f2011c2
Compare
| if err != nil { | ||
| return "", fmt.Errorf("failed to read input: %w", err) | ||
| } | ||
| return strings.TrimSpace(string(bytes)), nil |
There was a problem hiding this comment.
Should this be trimming? What if the prompt is intended to have spaces?
There was a problem hiding this comment.
strings.TrimSpace only trims leading and trailing whitespace.
There was a problem hiding this comment.
right... but I was thinking is there a case where we want to keep the leading/trailing spaces
There was a problem hiding this comment.
I can't imagine one, and not trimming likely leads to weirder behaviour than otherwise. This code was originally used for handling login (username and password) where we definitely want to trim white-space, since the read input includes the newline delimiter.
| opts.configHome = configHome | ||
| opts.path = args[0] | ||
|
|
||
| if opts.modelkitName == "" { |
There was a problem hiding this comment.
Should we still prompt for if name or description has been passed as options?
There was a problem hiding this comment.
We'll only prompt if we didn't get a name/description (and if the session is interactive). If you specify --name or --desc we don't prompt (unless you somehow pass in an empty name/description).
| // Generate a basic Kitfile by looking at the contents of a directory. Parameter | ||
| // packageOpt can be used to define metadata for the Kitfile (i.e. the package | ||
| // section), which is left empty if the parameter is nil. | ||
| func GenerateKitfile(baseDir string, packageOpt *artifact.Package) (*artifact.KitFile, error) { |
There was a problem hiding this comment.
I think this needs to look into the sub directories as well. If somebody organizes their files under data/ for instance data/training.csv it should be detected.
There was a problem hiding this comment.
Yeah, I was hoping to defer that to a future improvement since it introduces a fair bit of complexity in the general case.
- If we're trying to grab specific files (e.g. a dataset layer for
path: data/training.csv, then we need to decide what to do with the rest of the directory- If that's the only file, we don't need to include the directory itself, since it'll be included in the dataset layer, which means signalling another condition back to the caller
- If there are other files, we need to either classify them or include the whole directory as code (?) layer
- Then we also need to handle the case where we've got a catch-all
codesection
- If we want to include the whole directory (i.e.
path: data/) then we need to figure out a heuristic for deciding what type the whole directory is (most common file type? I'm not sure here) - If a
data/dir has 20 .csv files, do we include a singledata/layer, or generate a Kitfile with 20 dataset layers (one for each file).
There was a problem hiding this comment.
Alright, I've added a (messy) implementation of this in commit adf3372. For e.g. https://huggingface.co/onnx-community/dpt-dinov2-small-kitti/tree/main it generates the following Kitfile:
manifestVersion: 1.0.0
model:
name: onnx/model
path: onnx/model.onnx
parts:
- path: onnx/model_bnb4.onnx
- path: onnx/model_fp16.onnx
- path: onnx/model_int8.onnx
- path: onnx/model_q4.onnx
- path: onnx/model_q4f16.onnx
- path: onnx/model_quantized.onnx
- path: onnx/model_uint8.onnx
- path: config.json
- path: preprocessor_config.json
- path: quantize_config.json
code:
- path: .
docs:
- path: README.md
description: Readme filewhere the code section includes the .gitattributes file and the .git directory (since I cloned it locally)
|
It may be worth linking to this list here. https://github.com/trailofbits/ml-file-formats?tab=readme-ov-file |
I glanced over that list at one point, but I'm not sure how valuable it is in general. I can add it but it doesn't feel like an authoritative source for this sort of thing. |
f2011c2 to
51c8638
Compare
Try to attach the detected license type to a Kitfile layer rather than the overall package section -- i.e. if we have a model, assume the license applies to the model, otherwise, stick it with a dataset or code section. Fallback to package if nothing else makes sense.
…se.txt When deciding if a file is a license file, include all files with `license` as the prefix, to ensure we catch e.g. LICENSE.txt.
Add .joblib as model-type suffix, and move .csv to datasets instead of metadata
Read the contents of any directories and attempt to determine their type for the purposes of Kitfile generation; if a directory contains multiple files but e.g. they are all 'dataset'-type files, add the directory as a dataset layer. This is not a very pretty implementation, but hopefully covers a decent number of common cases.
bacde8a to
adf3372
Compare
gorkem
left a comment
There was a problem hiding this comment.
When I try this on
/scikitlearn-tabular
├── Kitfile
├── data
│ └── online_boutique_v1.csv
── models
│ └── model.onnx
└── notebooks
└── ML_MLFlow.ipynb
I am getting the following. I was expecting the model.onnx to be added at least.
manifestVersion: 1.0.0
package:
name: test
description: this
code:
- path: .
| output.Logf(output.LogLevelWarn, "Unable to determine license type") | ||
| } | ||
| output.Logf(output.LogLevelTrace, "Detected license %s for license file", detectedLicenseType) | ||
| detectedLicenseType = licenseType |
There was a problem hiding this comment.
This line should be before the logging.
There was a problem hiding this comment.
Woops, good catch! Fixed.
|
@gorkem Hm, are you sure you've fetched the latest changes? I rebased on main as part of it so you would need to I tested with (files are just basic text files containing "test") and I see the Kitfile manifestVersion: 1.0.0
model:
name: model
path: models/model.onnx
code:
- path: notebooks
datasets:
- path: data |
7fbd3d6 to
3dee98c
Compare
|
Figured it out -- I was mixing up dirEntry.Name() and the full path to the directory. If you ran the init command from the directory you want to generate a Kitfile in it would work because it's looking for directories in the current working dir. This fails if kit is run from a different directory (I had tried to test this using |
When reading the contents of directories while generating Kitfiles, make sure we use the full path to open the directory.
3dee98c to
8879bfd
Compare
Description
Add
kit initcommand, that can be used to generate a basic Kitfile from the contents of a directory. It basically does this by looking at files and trying to guess what they might be, based on file extensions.I've tested it against a few hand-written Kitfiles and (modulo user-provided metadata) it does an okay job :)
Linked issues
Closes #124
Related: #564