
Some details about the training parameters. #2

Closed
lionel3 opened this issue Aug 7, 2018 · 5 comments


lionel3 commented Aug 7, 2018

I trained the network from scratch but got poor results.
Below are my training parameters (the same as in the training guide you provide).

--program_name=twingan
--dataset_name="image_only"
--dataset_dir="dir to celeba tfrecord files"
--unpaired_target_dataset_name="anime_faces"
--unpaired_target_dataset_dir="dir to the anime tfrecord you provided"
--train_dir="dir to save results"
--dataset_split_name=train
--preprocessing_name="danbooru"
--resize_mode=RANDOM_CROP
--do_random_cropping=True
--learning_rate=0.0001
--learning_rate_decay_type=fixed
--is_training=True
--generator_network="pggan"
--loss_architecture=dragan
--pggan_max_num_channels=256
--generator_norm_type=batch_renorm
--use_ttur=True
--num_images_per_resolution=50000

Compared with the official PGGAN repo, I found some differences:

  1. num_images_per_resolution in TwinGAN is 50000, while in PGGAN it is 600000.
  2. TwinGAN uses RANDOM_CROP, while PGGAN resizes the image directly.

Could you please help me out here?


jerryli27 commented Aug 7, 2018

Yes, you're right. Sorry for the incorrect documentation. I'll push a newer version shortly.

  1. The num_images_per_resolution I used was 300000. Of course 600000 should also work, but it takes longer to train.
  2. Please change to --resize_mode=RESHAPE.

FYI, --do_random_cropping=True is there in case the face is not at the center of the image at inference time; you can also try RANDOM_CROP if the output quality is too poor in that situation.
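To illustrate the difference between the two modes, here is a minimal TF 1.x-style sketch (hypothetical helper name, not the repo's actual preprocessing code):

```python
import tensorflow as tf

def resize_for_training(image, target_hw, resize_mode="RESHAPE"):
    """Hypothetical sketch of the two --resize_mode options discussed above.

    RESHAPE:     resize the whole image to target_hw x target_hw (what PGGAN does).
    RANDOM_CROP: take a random target_hw x target_hw crop, which can help when
                 faces are not centered at inference time.
    """
    if resize_mode == "RESHAPE":
        return tf.image.resize_images(image, [target_hw, target_hw])
    elif resize_mode == "RANDOM_CROP":
        return tf.random_crop(image, [target_hw, target_hw, 3])
    raise ValueError("Unknown resize_mode: %s" % resize_mode)
```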

I am rerunning the exact code that I provided in the training example code. It will take a day or two for me to verify that it works.


lionel3 commented Aug 8, 2018

Thanks for your answer.

Besides, when training with
--hw_to_batch_size="{4: 16, 8: 16, 16: 16, 32: 16, 64: 12, 128: 12, 256: 12, 512: 6}",
I got ResourceExhaustedError: OOM when allocating tensor with ... during the fade-in phase from resolution 128 to 256. I got the same error when trying 2 GPUs.
I am not familiar with TensorFlow; I guess there may be a bug in the multi-GPU training.

I will try to reproduce the error and share more training details once I have an idle GPU.

jerryli27 commented

I added two lines to the training script. It should work now:

--gradient_penalty_lambda=0.25
--use_unet=True
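For context, a DRAGAN-style penalty with this lambda roughly computes the following (a minimal TF 1.x sketch with a hypothetical discriminator_fn, not the repo's exact code):

```python
import tensorflow as tf

def dragan_gradient_penalty(discriminator_fn, real_images, penalty_lambda=0.25):
    """Rough sketch of a DRAGAN-style gradient penalty (hypothetical helper).

    penalty_lambda corresponds to --gradient_penalty_lambda above.
    discriminator_fn maps a batch of images to discriminator logits.
    """
    # Perturb real samples within roughly half a standard deviation of the batch.
    _, batch_variance = tf.nn.moments(real_images, axes=[0, 1, 2, 3])
    noise = 0.5 * tf.sqrt(batch_variance) * tf.random_uniform(tf.shape(real_images))
    alpha = tf.random_uniform(tf.shape(real_images), minval=0.0, maxval=1.0)
    interpolated = real_images + alpha * noise
    # Push the discriminator's gradient norm at the perturbed points toward 1.
    gradients = tf.gradients(discriminator_fn(interpolated), [interpolated])[0]
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(gradients), axis=[1, 2, 3]) + 1e-10)
    return penalty_lambda * tf.reduce_mean(tf.square(grad_norm - 1.0))
```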

The whole script now looks like:

# Assume you have data like
# ./data/celeba/train-00000-of-00100.tfrecord,
# ./data/celeba/train-00001-of-00100.tfrecord ...
python pggan_runner.py \
  --program_name=twingan \
  --dataset_name="image_only" \
  --dataset_dir="./data/celeba/" \
  --unpaired_target_dataset_name="anime_faces" \
  --unpaired_target_dataset_dir="./data/anime_faces/" \
  --train_dir="./checkpoints/twingan_faces/" \
  --dataset_split_name=train \
  --preprocessing_name="danbooru" \
  --resize_mode=RESHAPE \
  --do_random_cropping=True \
  --learning_rate=0.0001 \
  --learning_rate_decay_type=fixed \
  --is_training=True \
  --generator_network="pggan" \
  --use_unet=True \
  --num_images_per_resolution=300000 \
  --loss_architecture=dragan \
  --gradient_penalty_lambda=0.25 \
  --pggan_max_num_channels=256 \
  --generator_norm_type=batch_renorm \
  --hw_to_batch_size="{4: 8, 8: 8, 16: 8, 32: 8, 64: 8, 128: 4, 256: 3, 512: 2}"
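The smaller per-resolution batch sizes here are what keep memory in check during the 128 → 256 fade-in; a minimal sketch of how such a mapping can be consumed (hypothetical helper, not the repo's actual parsing code):

```python
import ast

def batch_size_for_resolution(hw_to_batch_size_flag, resolution):
    """Pick the batch size for the current PGGAN resolution (sketch only).

    hw_to_batch_size_flag is the string passed via --hw_to_batch_size.
    Smaller batches at higher resolutions avoid OOM during fade-in.
    """
    mapping = ast.literal_eval(hw_to_batch_size_flag)
    return mapping[resolution]

# Example: the batch size used once the network grows to 256x256.
print(batch_size_for_resolution(
    "{4: 8, 8: 8, 16: 8, 32: 8, 64: 8, 128: 4, 256: 3, 512: 2}", 256))  # -> 3
```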

I haven't tested the multi-GPU setting thoroughly yet due to hardware limits, so yes, there may be a bug, but you can try adding the following flags.

--sync_replicas=False
--replicas_to_aggregate=1
--num_clones=2
--worker_replicas=1

I updated the training readme with the comments above.


lionel3 commented Aug 9, 2018

Thanks, I will try it out asap.

lionel3 closed this as completed Aug 9, 2018
lionel3 reopened this Aug 9, 2018
jerryli27 self-assigned this Aug 10, 2018
jerryli27 commented

Hi @lionel3, I updated the training documentation. There was indeed a bug in my default parameters. After fixing it, I am able to reproduce my previous results.

Please sync to the latest version and see https://github.com/jerryli27/TwinGAN/blob/master/docs/training.md .

The parameters I added are:

--do_pixel_norm=True
--l_content_weight=0.1
--l_cycle_weight=1.0
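For reference, --do_pixel_norm presumably enables PGGAN-style pixelwise feature normalization, and the two weights scale the content and cycle-consistency loss terms; a minimal sketch of the normalization (not the repo's exact code):

```python
import tensorflow as tf

def pixel_norm(features, epsilon=1e-8):
    """Pixelwise feature-vector normalization from the PGGAN paper (sketch).

    Each spatial location's channel vector is rescaled to unit average square,
    which is what --do_pixel_norm=True enables in spirit.
    """
    return features * tf.rsqrt(
        tf.reduce_mean(tf.square(features), axis=-1, keepdims=True) + epsilon)
```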

Please reopen this issue if you cannot reproduce. Thanks!
