Recursive support for captioning/tagging scripts #400
Merged
Hi! Thanks for the good work as always.
In this PR, I want to propose some small changes, mostly about recursive support for captioning/tagging scripts so users can annotate their dataset recursively. However, I am not sure if it is implemented correctly, so I hope for your review to make it better.
For the recursive args, I borrowed glob_images_pathlib() from train_util.
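For context, recursive image collection with pathlib can be sketched like this. This is my own illustration, not the actual glob_images_pathlib() from train_util; the extension list and sorting behavior are assumptions:

```python
from pathlib import Path

# Assumed extension list for illustration; the real helper may differ.
IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".bmp"]

def glob_images(dir_path: Path, recursive: bool = False) -> list[Path]:
    """Collect image files, optionally descending into sub-directories."""
    image_paths: list[Path] = []
    for ext in IMAGE_EXTENSIONS:
        pattern = f"*{ext}"
        if recursive:
            # rglob() walks every sub-directory under dir_path
            image_paths += dir_path.rglob(pattern)
        else:
            # glob() only matches files directly inside dir_path
            image_paths += dir_path.glob(pattern)
    # de-duplicate and sort for a stable processing order
    return sorted(set(image_paths))
```

With a flag like --recursive, the scripts would simply pass `recursive=True` here instead of globbing only the top-level folder.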
tag_images_by_wd14_tagger.py:
- Added --recursive to find and preprocess datasets inside sub-directories.
- Added --remove_underscore args.
- Added --undesired_tags, so users can delete undesired tags from the tagging process.
- Added character tags support as well as --character_threshold. I don't know if it's a good idea, but I think it might be helpful for character training. SmilingWolf released SmilingWolf/wd-v1-4-convnextv2-tagger-v2, and I think it's a great and up-to-date model for character tagging.
- Renamed --thresh to --general_threshold.
- Added --frequency_tags to print tag frequency after the tagging process is done.
- --debug works with a new output template.
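A sketch of how this kind of tag post-processing could fit together. The parameter names mirror the CLI flags above, but the function itself is hypothetical, not the PR's actual code:

```python
from collections import Counter
from typing import Optional

def postprocess_tags(
    general_probs: dict[str, float],    # general tag -> model confidence
    character_probs: dict[str, float],  # character tag -> model confidence
    general_threshold: float = 0.35,    # like --general_threshold
    character_threshold: float = 0.35,  # like --character_threshold
    undesired_tags: frozenset = frozenset(),  # like --undesired_tags
    remove_underscore: bool = True,     # like --remove_underscore
    frequency: Optional[Counter] = None,  # shared Counter for --frequency_tags
) -> list[str]:
    # Keep tags above their respective thresholds
    tags = [t for t, p in general_probs.items() if p >= general_threshold]
    tags += [t for t, p in character_probs.items() if p >= character_threshold]
    # Drop tags the user does not want in the captions
    tags = [t for t in tags if t not in undesired_tags]
    if remove_underscore:
        tags = [t.replace("_", " ") for t in tags]
    if frequency is not None:
        # Tallied across all images so totals can be printed at the end
        frequency.update(tags)
    return tags
```

Separate thresholds let users keep character tags even when they come out less confident than general tags, which is the motivation behind --character_threshold.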
make_captions.py:
- Added --recursive to find and preprocess datasets inside sub-directories.

make_captions_by_git.py:
- Added --recursive to find and preprocess datasets inside sub-directories.
prepare_buckets_latents.py:
- Added --recursive to find and preprocess datasets inside sub-directories. I thought this was already covered by --full_path, but that cannot preprocess datasets inside sub-directories and make latents of them. It might not be useful for multi-concept training, but I think it's useful for multi-directory training: users can keep each dataset inside its respective folder and preprocess them all without needing to re-run the script every time.
This is QoL rather than a new change, but I also added --save_precision_as with choices of ["fp16", "bf16", "float"].
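For illustration, the precision choice comes down to a dtype cast before the latents are saved. A minimal numpy sketch with a hypothetical function name; numpy has no bfloat16, so only fp16/float are shown here, and the actual torch-based scripts would map "bf16" to torch.bfloat16:

```python
import numpy as np

# Choice names follow --save_precision_as; "bf16" omitted because numpy
# lacks bfloat16 (it would be torch.bfloat16 in the real scripts).
_DTYPES = {"fp16": np.float16, "float": np.float32}

def cast_for_saving(latents: np.ndarray, save_precision_as: str) -> np.ndarray:
    """Cast latents to the requested precision before writing them out."""
    if save_precision_as not in _DTYPES:
        raise ValueError(f"unsupported in this sketch: {save_precision_as}")
    return latents.astype(_DTYPES[save_precision_as])
```

Saving as fp16 roughly halves the on-disk size of the cached latents compared to float32, which is why an option like this is a quality-of-life win.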
Thank you!