Add manifest documentation #103
docs/manifest.md
Outdated
subset is added or modified.

Each subset contains two properties:
- A location which points to the location where the underlying data is stored
Maybe worth mentioning that the location is relative to the base path
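To illustrate the suggestion, here is a minimal sketch of how a relative subset location could be resolved against the base path. The manifest layout and the key names (`metadata`, `base_path`, `subsets`, `location`) as well as the helper `resolve_location` are assumptions made for this example, not something confirmed by the diff shown here.

```python
# Hypothetical manifest fragment; the key names are assumptions for illustration.
manifest = {
    "metadata": {"base_path": "gs://bucket/artifacts"},
    "subsets": {
        "images": {"location": "/images"},
        "captions": {"location": "/captions"},
    },
}


def resolve_location(manifest: dict, subset_name: str) -> str:
    """Join a subset's relative location onto the manifest's base path."""
    base_path = manifest["metadata"]["base_path"].rstrip("/")
    relative = manifest["subsets"][subset_name]["location"].strip("/")
    return f"{base_path}/{relative}"
```

With this layout, `resolve_location(manifest, "images")` yields the path under the base path, which is why the docs should state that the location is not absolute.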
docs/manifest.md
Outdated
### Subsets

Each subset is a different view on the data contained in the dataset. Different subsets could
I wouldn't say it's a different view; it's just a part of the dataset.
Reworded into "different features".
Thanks for adding!
```
}
```

All subsets should contain at least the data points defined by the index.
A question regarding this point: say we have an image subset and have gathered different subsets linked to it (captions, aesthetic score, ...), where each subset was probably obtained in a separate component. Then we extend the image subset (e.g. LAION retrieval) and the index along with it.
seed_images (image_subset) -> caption_component (add caption subset) -> aesthetic_filter (add aesthetic score subset) -> image retrieval (add index and extend image_subset)
How do we handle the missing entries in the other subsets (captions, aesthetic score, ...) if we only plan on adding them in subsequent components?
... -> image retrieval (add index and extend image_subset) -> caption_component -> filter_component
I think this is not possible with the current framework? This question might have already been addressed, or I could have missed it.
Indeed, this is not supported. There are two ways around this:
- Reorder the components in your pipeline (e.g. in your example, move the image retrieval to the 2nd place)
- Work with non-linear DAGs when this is supported in the future (e.g. in your example, the 2nd part of the pipeline could branch off and later merge again)
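To make the constraint concrete, here is a rough sketch of the rule "all subsets should contain at least the data points defined by the index", using plain sets of ids. The function `find_missing_entries` and the example ids are hypothetical and only illustrate the check; they are not part of the framework.

```python
def find_missing_entries(index_ids: set, subsets: dict) -> dict:
    """Map each subset name to the index ids it does not cover."""
    missing = {}
    for name, ids in subsets.items():
        gap = index_ids - ids
        if gap:
            missing[name] = gap
    return missing


# Retrieval placed last: the index and image subset were extended,
# but captions were only computed for the seed image "a".
index_ids = {"a", "b", "c"}
retrieval_last = {"images": {"a", "b", "c"}, "captions": {"a"}}

# Retrieval moved to 2nd place: captions are computed after the extension,
# so every subset covers the full index.
retrieval_second = {"images": {"a", "b", "c"}, "captions": {"a", "b", "c"}}
```

In this sketch, `find_missing_entries(index_ids, retrieval_last)` reports a gap for the captions subset, while the reordered pipeline leaves nothing missing, which matches the first workaround above.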
Fixes #62
Please check if there's anything I forgot to explain :)