
Question for Finetuning #36

Closed
xyIsHere opened this issue Jun 30, 2023 · 17 comments

@xyIsHere

Dear author,

I tried to reproduce your work and currently want to validate the results generated by the fine-tuned SD model. I used a single A100 GPU for fine-tuning (currently at around epoch 130) and tested with the script "sr_val_ddpm_text_T_vqganfin_old.py", resetting the ckpt path and changing dec_w to 0.0. The test results show almost no difference from the input, but the validation results in the training log look pretty good. Do you have any idea about this issue?

Thanks a lot!

@IceClear
Owner

Hi. Without the training settings and figures for comparison, it is hard to tell what the problem is. Maybe the test data distribution is very different from the validation one.

@xyIsHere
Author

Thanks for your very quick response. I got results similar to those in issue #26.

The following zip file contains the training config. I just followed your suggestion and reset the model path.
2023-06-25T16-19-48-project.zip

And here are the results (right: result; left: input). The bottom one is a sample from the validation set, which was also used for training. The top one does show some difference from the input, but the result is still very far from that of the model you provided.
image
image

Thanks!

@IceClear
Owner

Hi.
First, the input is different: my result is generated from a resized 128x128 image, while it seems you directly used the original 720x720 image.
Second, my results are generated with CFW weight = 0.5.
Third, my model is trained on 8 V100 GPUs, so the total training batch size, i.e., 48x4, should be much larger than yours, I guess.
From my experience, you can train longer for better results.

@xyIsHere
Author

xyIsHere commented Jul 3, 2023

Thanks! By the way, how long did the fine-tuning stage take you? Maybe just around 24 hours?

@IceClear
Owner

IceClear commented Jul 3, 2023

For several days. The longer the better. One week should be enough.

@xyIsHere
Author

xyIsHere commented Jul 3, 2023

Thanks a lot for your help. Actually, I have not trained the CFW yet; currently I just want to make sure my fine-tuning results are reasonable, so I don't think either the resolution or the CFW weight is the reason.
image
As shown in the figure above, I use the same image as input and test with different fine-tuned models: the stablesr_000117.ckpt that you provided, and a model trained on 4 A100 cards with a batch size of 12 and accumulate_grad_batches of 4 (I fine-tuned for about 24 hours to get epoch_000131.ckpt).

The command that I used is:

python scripts/sr_val_ddpm_text_T_vqganfin_old.py --config configs/stableSRNew/v2-finetune-test.yaml --ckpt ./pretrained_models/stablesr_000117.ckpt --vqgan_ckpt ./pretrained_models/vqgan_cfw_00011.ckpt --init-img ./inputs/test_example --outdir out_landscape/ --ddpm_steps 200 --dec_w 0.0 --suffix 'stablesr117'

I set dec_w to 0.0, so the result is achieved by the fine-tuned model alone, without CFW. Do you have any other suggestions for debugging? Or do you think I just need to train for more days?

@xyIsHere
Author

xyIsHere commented Jul 3, 2023

So the provided model (stablesr_000117.ckpt) was fine-tuned for around a week? I did not expect a 117-epoch model to take that much time.

@IceClear
Owner

IceClear commented Jul 3, 2023

An A100 is more than 2x faster than a V100, and I do not remember the exact training time of the 512 model.
It is hard to say whether there is a problem.
You may check the performance of different epochs on real images; the performance may vary across epochs.
From my experience, training longer does improve the performance.
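
For reference, a minimal sketch of automating that per-epoch check by looping the test command from earlier in this thread over saved checkpoints (the checkpoint directory, output layout, and glob pattern are assumptions):

```python
import subprocess
from pathlib import Path

# Hypothetical checkpoint directory; adjust to wherever the training run
# saves its epoch_*.ckpt files.
for ckpt in sorted(Path("logs/checkpoints").glob("epoch_*.ckpt")):
    # Run the same inference script once per checkpoint so the outputs of
    # different epochs can be compared on the same real test images.
    subprocess.run([
        "python", "scripts/sr_val_ddpm_text_T_vqganfin_old.py",
        "--config", "configs/stableSRNew/v2-finetune-test.yaml",
        "--ckpt", str(ckpt),
        "--vqgan_ckpt", "./pretrained_models/vqgan_cfw_00011.ckpt",
        "--init-img", "./inputs/test_example",
        "--outdir", f"out_epochs/{ckpt.stem}",
        "--ddpm_steps", "200",
        "--dec_w", "0.0",
        "--suffix", ckpt.stem,
    ], check=True)
```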

@xyIsHere
Author

xyIsHere commented Jul 3, 2023

Thank you so much. I will keep running the experiments and let you know if there is any update.

@xyIsHere
Author

xyIsHere commented Jul 4, 2023

Dear author,
Could you also show me how to use the thop package to print the params and FLOPs of StableSR? I tried to do this with the test script (vqganfin_old.py) but have not succeeded yet. Thanks!

@xyIsHere
Author

xyIsHere commented Jul 4, 2023

I finally solved this problem.
image
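
For anyone hitting the same wall, a minimal sketch of the thop call (the module and input shape below are placeholders, not StableSR's actual entry point, which also needs its conditioning inputs):

```python
import torch
from thop import profile

# Placeholder module; in practice this would be the loaded StableSR network,
# with dummy inputs matching its expected shapes.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
dummy_input = torch.randn(1, 4, 64, 64)  # latent-sized dummy tensor

# thop.profile returns (MACs, params) for one forward pass.
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e9:.3f} G | Params: {params / 1e6:.3f} M")
```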

@xyIsHere
Author

xyIsHere commented Jul 5, 2023

Dear author,
Here the training log is attached. I'm wondering whether there is any difference from yours?
train_reproduce_4card_bs12.log

@BobbyZ04

BobbyZ04 commented Jul 6, 2023

Hi, may I ask how you generated the latents for the second-stage training? They are supposed to be 4D, right? I got an error saying the dimensions are incorrect, like this:
image

The latent shape I checked is:
image

This is where I generated them:
image

Thank you
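
As a point of reference, a minimal sketch of the 4D shape contract, using a diffusers KL autoencoder as a stand-in (StableSR uses its own VQGAN/CFW autoencoder, so the model and checkpoint here are assumptions; the key point is keeping the batch dimension so the saved latent stays (B, C, H, W)):

```python
import torch
from diffusers import AutoencoderKL

# Stand-in autoencoder for illustration only; not StableSR's first-stage model.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

with torch.no_grad():
    img = torch.randn(1, 3, 512, 512)              # (B, C, H, W) input image
    latent = vae.encode(img).latent_dist.sample()  # -> (1, 4, 64, 64), still 4D

# Save without squeezing; dropping the batch dimension yields a 3D tensor
# and the kind of dimension error shown above.
torch.save(latent.cpu(), "sample_latent.pt")
```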

@BobbyZ04

BobbyZ04 commented Jul 6, 2023

Thanks a lot!

@xyIsHere
Author

xyIsHere commented Jul 6, 2023


I have currently only conducted the fine-tuning experiments and have not trained the CFW yet, since my fine-tuning result is not good enough to train the CFW. How about your fine-tuning results? For training the CFW, issue #28 might help you.

@BobbyZ04

BobbyZ04 commented Jul 6, 2023


I think they make sense, but yeah, they are different from the author's results. Maybe you can try fixing the seeds for inference to check whether the fine-tuning is successful? https://huggingface.co/docs/diffusers/using-diffusers/reproducibility
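
For reference, a minimal sketch of fixing seeds along the lines of that guide; plain PyTorch sampling scripts can seed the global RNGs, while diffusers pipelines take an explicit Generator (the pipeline call below is a hypothetical placeholder):

```python
import torch

seed = 42

# Plain PyTorch scripts: seed the global RNGs before sampling.
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# diffusers pipelines: pass an explicit Generator for per-call reproducibility.
generator = torch.Generator(device="cpu").manual_seed(seed)
# image = pipe(prompt, generator=generator).images[0]  # hypothetical call
```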

@xyIsHere
Author

xyIsHere commented Jul 7, 2023


I'm wondering whether it would be possible to share one example that you generated using only the fine-tuned model? Thanks a lot!
