
Support for Prodigy(Dadapt variety for Dylora) #585

Merged: 16 commits merged into kohya-ss:dev on Jun 15, 2023

Conversation

@sdbds (Contributor) commented Jun 12, 2023

@kohya-ss (Owner) commented:

Thanks, looks very good! I will check it out when I have time!

@kohya-ss kohya-ss changed the base branch from main to dev June 15, 2023 12:12
@kohya-ss kohya-ss merged commit e97d67a into kohya-ss:dev Jun 15, 2023
@jimtalksdata commented Jul 11, 2023

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.

Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

@sdbds (Contributor, Author) commented Jul 11, 2023

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.

Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

I would suggest modifying the value of d0 to accommodate SDXL (5e-7) as well as DyLoRA (5e-4); these models need a different initial learning rate than the default.
You can read about my experience here:
https://civitai.com/articles/1022/sdxl-trainingbdsqlsz-lora-training-advanced-tutorial2best-optimizerprodigy-is-all-you-need

@FurkanGozukara commented:

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.

Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

How do you use the generated safetensors file? Can you use it with the diffusers pipeline?

@jimtalksdata commented Jul 11, 2023

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.
Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

I would suggest modifying the value of d0 to accommodate SDXL(5e-7) as well as dylora(5e-4), which are models that require a larger initial learning rate. you can see more experience in here https://civitai.com/articles/1022/sdxl-trainingbdsqlsz-lora-training-advanced-tutorial2best-optimizerprodigy-is-all-you-need

Thanks for the tutorial, good stuff. It could use a straightforward way to set d0 (the initial LR), though, since otherwise the algorithm will just "waste time" at 1e-6 in the beginning.

Edit: never mind, I see; just add d0=(number) and d_coef=(number) to the optimizer args.
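
For anyone who lands here later, a minimal sketch of what that looks like with this repo's --optimizer_args (the d0 and d_coef values below are placeholders, not recommendations):

--optimizer_type="Prodigy" --learning_rate=1.0 --optimizer_args d0=5e-7 d_coef=2

With Prodigy the base learning_rate is typically left at 1.0, since the optimizer scales it by its own estimate of d.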

@jimtalksdata commented:

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.
Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

how do you use the generated safetensors file? can you use with diffusers pipeline?

The LoRA pipeline works with ComfyUI at the moment. I don't know about other implementations.

@sdbds (Contributor, Author) commented Jul 11, 2023

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.
Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

I would suggest modifying the value of d0 to accommodate SDXL(5e-7) as well as dylora(5e-4), which are models that require a larger initial learning rate. you can see more experience in here https://civitai.com/articles/1022/sdxl-trainingbdsqlsz-lora-training-advanced-tutorial2best-optimizerprodigy-is-all-you-need

Thanks for the tutorial, good stuff. Could use a straightforward way to set d0 (the initial LR) if I know that the algorithm will just "waste time" at 1e-6 in the beginning.

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.

Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

The official implementation only has 2 (Prodigy); there is no resetting code.
I wrote 3 myself, but I haven't tested it yet:

# Algorithm 3: D-Adaptation with Resetting (untested sketch)
import numpy as np

def dadapt_with_resetting(grad_f, x0, d0, G, n):
    """grad_f: gradient oracle, x0: initial point, d0 > 0: initial distance
    estimate, G: gradient-norm bound, n: number of iterations."""
    d = d0                           # current distance estimate
    x = np.asarray(x0, dtype=float)  # current point
    x0_r = x.copy()                  # anchor point of the current reset
    s_r = np.zeros_like(x)           # gradient sum within the current reset
    weighted_gsq = 0.0               # running sum of gamma_i * ||g_i||^2 within the reset
    r = 0                            # reset counter
    k = 0                            # iteration counter within the current reset
    history = [x.copy()]             # all points visited, for the final average

    for j in range(n):
        # Gradient at the current point
        g = grad_f(x)
        # Update the gradient sum for this reset
        s_r = s_r + g
        # Step size from the D-Adaptation formula
        gamma = np.sqrt(d / (G**2 + np.sum(s_r**2)))
        # Dual-averaging style step from the reset anchor
        x = x0_r - gamma * s_r
        history.append(x.copy())
        # Accumulate gamma_i * ||g_i||^2 for the distance estimate
        weighted_gsq += gamma * np.sum(g**2)
        # New distance estimate
        s_norm = np.linalg.norm(s_r)
        d_hat = (gamma * s_norm**2 - weighted_gsq) / (2 * s_norm)
        k += 1
        # Reset once the distance estimate has at least doubled
        if d_hat > 2 * d:
            d = d_hat
            # Restart from the current point with fresh accumulators
            x0_r = x.copy()
            s_r = np.zeros_like(x)
            weighted_gsq = 0.0
            k = 0
            r += 1

    # Return the average of all points visited
    return np.mean(history, axis=0)

@FurkanGozukara commented:

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.
Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

how do you use the generated safetensors file? can you use with diffusers pipeline?

The LORA pipeline works with ComfyUI at the moment. Don't know about other impl.

Can you share an example JSON file please?

@jimtalksdata commented Jul 11, 2023

Something like this? This uses the "new" refiner workflow.

I assume you can also train a LoRA for the refiner, but I'm unsure of the purpose, how to use it, or what sort of training set you would use. So this is also likely not the correct final workflow.

https://pastebin.com/j6LnygzJ

@FurkanGozukara commented:

Works well for LORAs on SDXL. Convergence to the optimal LR can be a bit slow (1000 steps) compared to DAdapt, or maybe it's just SDXL being big. Does not blow up compared to DAdapt though. Needs more testing.
Question: which version was implemented, Prodigy (2 in the paper) or Resetting (3)?

I would suggest modifying the value of d0 to accommodate SDXL(5e-7) as well as dylora(5e-4), which are models that require a larger initial learning rate. you can see more experience in here https://civitai.com/articles/1022/sdxl-trainingbdsqlsz-lora-training-advanced-tutorial2best-optimizerprodigy-is-all-you-need

Hello, so many parameters are missing here.

Can you share a full command, like the one below?

I tried it like this and it didn't work:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:\sdxl_lora\img" --reg_data_dir="F:\sdxl_lora\reg" --resolution="1024,1024" --output_dir="F:\sdxl_lora\model" --logging_dir="F:\sdxl_lora\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="test10" --lr_scheduler_num_cycles="8" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5200" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Prodigy" --optimizer_args scale_parameter=False relative_step=False warmup_init=False scale_v_pred_loss_like_noise_pred=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale
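
For what it's worth, the optimizer_args in that command (scale_parameter, relative_step, warmup_init) are arguments of other optimizers such as Adafactor, not Prodigy. A Prodigy-style run would instead pass Prodigy's own constructor arguments, roughly like the fragment below (argument names from the prodigyopt package; the values are only illustrative, with d0=5e-7 taken from the SDXL suggestion above, not a tested recommendation):

--optimizer_type="Prodigy" --learning_rate=1.0 --unet_lr=1.0 --text_encoder_lr=1.0 --optimizer_args weight_decay=0.01 decouple=True use_bias_correction=True safeguard_warmup=True d0=5e-7 d_coef=2

The rest of the command (paths, resolution, precision, bucketing, etc.) can stay as it is.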

5 participants