Training stops without an error #13

AidasK · 2018-07-14T10:54:00Z

I had been running training for 2 days when it stopped without an error. I had reached epoch 31 on the celeba problem.

nohup python -u train.py --problem celeba --image_size 256 --n_level 6 --depth 32 --flow_permutation 2 --flow_coupling 0 --seed 0 --learntop --lr 0.001 --n_bits_x 5 --data_dir /mnt/celeba/mnt/host/celeba-reshard-tfr/ &

I have now set os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'; maybe it will be more verbose next time.

cat logs/train.txt
{"n_batch_init": 256, "flow_coupling": 0, "weight_y": 0.0, "restore_path": "", "verbose": false, "n_batch_test": 50, "n_batch_train": 64, "anchor_size": 32, "epochs": 1000000, "epochs_warmup": 10, "n_bits_x": 5, "rnd_crop": false, "category": "", "depth": 32, "learntop": true, "n_bins": 32.0, "beta1": 0.9, "n_train": 50000, "n_test": 3000, "seed": 0, "logdir": "./logs", "lr": 0.001, "full_test_its": 3000, "ycond": false, "n_levels": 6, "fmap": 1, "dal": 1, "top_shape": [4, 4, 384], "local_batch_test": 1, "local_batch_train": 1, "optimizer": "adamax", "polyak_epochs": 1, "pmap": 16, "weight_decay": 1.0, "n_sample": 1, "test_its": 47, "image_size": 256, "data_dir": "/mnt/celeba/mnt/host/celeba-reshard-tfr/", "train_its": 782, "epochs_full_sample": 50, "n_y": 1, "width": 512, "problem": "celeba", "epochs_full_valid": 50, "local_batch_init": 4, "direct_iterator": true, "gradient_checkpointing": 1, "flow_permutation": 2}
{"pred_loss": "1.0000", "train_time": 4076, "bits_x": "2.0117", "n_processed": 50048, "loss": "2.0117", "epoch": 1, "bits_y": "0.0000", "n_images": 782}
{"pred_loss": "1.0000", "train_time": 7960, "bits_x": "1.4431", "n_processed": 100096, "loss": "1.4431", "epoch": 2, "bits_y": "0.0000", "n_images": 1564}
{"pred_loss": "1.0000", "train_time": 11833, "bits_x": "1.3894", "n_processed": 150144, "loss": "1.3894", "epoch": 3, "bits_y": "0.0000", "n_images": 2346}
{"pred_loss": "1.0000", "train_time": 15698, "bits_x": "1.3369", "n_processed": 200192, "loss": "1.3369", "epoch": 4, "bits_y": "0.0000", "n_images": 3128}
{"pred_loss": "1.0000", "train_time": 19555, "bits_x": "1.3023", "n_processed": 250240, "loss": "1.3023", "epoch": 5, "bits_y": "0.0000", "n_images": 3910}
{"pred_loss": "1.0000", "train_time": 23406, "bits_x": "1.2827", "n_processed": 300288, "loss": "1.2827", "epoch": 6, "bits_y": "0.0000", "n_images": 4692}
{"pred_loss": "1.0000", "train_time": 27373, "bits_x": "1.2652", "n_processed": 350336, "loss": "1.2652", "epoch": 7, "bits_y": "0.0000", "n_images": 5474}
{"pred_loss": "1.0000", "train_time": 31235, "bits_x": "1.2522", "n_processed": 400384, "loss": "1.2522", "epoch": 8, "bits_y": "0.0000", "n_images": 6256}
{"pred_loss": "1.0000", "train_time": 35093, "bits_x": "1.2383", "n_processed": 450432, "loss": "1.2383", "epoch": 9, "bits_y": "0.0000", "n_images": 7038}
{"pred_loss": "1.0000", "train_time": 38955, "bits_x": "1.2361", "n_processed": 500480, "loss": "1.2361", "epoch": 10, "bits_y": "0.0000", "n_images": 7820}
{"pred_loss": "1.0000", "train_time": 42828, "bits_x": "1.2206", "n_processed": 550528, "loss": "1.2206", "epoch": 11, "bits_y": "0.0000", "n_images": 8602}
{"pred_loss": "1.0000", "train_time": 46698, "bits_x": "1.2128", "n_processed": 600576, "loss": "1.2128", "epoch": 12, "bits_y": "0.0000", "n_images": 9384}
{"pred_loss": "1.0000", "train_time": 50564, "bits_x": "1.1951", "n_processed": 650624, "loss": "1.1951", "epoch": 13, "bits_y": "0.0000", "n_images": 10166}
{"pred_loss": "1.0000", "train_time": 54421, "bits_x": "1.1983", "n_processed": 700672, "loss": "1.1983", "epoch": 14, "bits_y": "0.0000", "n_images": 10948}
{"pred_loss": "1.0000", "train_time": 58295, "bits_x": "1.1887", "n_processed": 750720, "loss": "1.1887", "epoch": 15, "bits_y": "0.0000", "n_images": 11730}
{"pred_loss": "1.0000", "train_time": 62163, "bits_x": "1.1754", "n_processed": 800768, "loss": "1.1754", "epoch": 16, "bits_y": "0.0000", "n_images": 12512}
{"pred_loss": "1.0000", "train_time": 66025, "bits_x": "1.1826", "n_processed": 850816, "loss": "1.1826", "epoch": 17, "bits_y": "0.0000", "n_images": 13294}
{"pred_loss": "1.0000", "train_time": 69890, "bits_x": "1.1680", "n_processed": 900864, "loss": "1.1680", "epoch": 18, "bits_y": "0.0000", "n_images": 14076}
{"pred_loss": "1.0000", "train_time": 73756, "bits_x": "1.1749", "n_processed": 950912, "loss": "1.1749", "epoch": 19, "bits_y": "0.0000", "n_images": 14858}
{"pred_loss": "1.0000", "train_time": 77620, "bits_x": "1.1742", "n_processed": 1000960, "loss": "1.1742", "epoch": 20, "bits_y": "0.0000", "n_images": 15640}
{"pred_loss": "1.0000", "train_time": 81488, "bits_x": "1.1676", "n_processed": 1051008, "loss": "1.1676", "epoch": 21, "bits_y": "0.0000", "n_images": 16422}
{"pred_loss": "1.0000", "train_time": 85357, "bits_x": "1.1604", "n_processed": 1101056, "loss": "1.1604", "epoch": 22, "bits_y": "0.0000", "n_images": 17204}
{"pred_loss": "1.0000", "train_time": 89222, "bits_x": "1.1595", "n_processed": 1151104, "loss": "1.1595", "epoch": 23, "bits_y": "0.0000", "n_images": 17986}
{"pred_loss": "1.0000", "train_time": 93085, "bits_x": "1.1667", "n_processed": 1201152, "loss": "1.1667", "epoch": 24, "bits_y": "0.0000", "n_images": 18768}
{"pred_loss": "1.0000", "train_time": 96944, "bits_x": "1.1598", "n_processed": 1251200, "loss": "1.1598", "epoch": 25, "bits_y": "0.0000", "n_images": 19550}
{"pred_loss": "1.0000", "train_time": 100799, "bits_x": "1.1596", "n_processed": 1301248, "loss": "1.1596", "epoch": 26, "bits_y": "0.0000", "n_images": 20332}
{"pred_loss": "1.0000", "train_time": 104652, "bits_x": "1.1489", "n_processed": 1351296, "loss": "1.1489", "epoch": 27, "bits_y": "0.0000", "n_images": 21114}
{"pred_loss": "1.0000", "train_time": 108512, "bits_x": "1.1517", "n_processed": 1401344, "loss": "1.1517", "epoch": 28, "bits_y": "0.0000", "n_images": 21896}
{"pred_loss": "1.0000", "train_time": 112365, "bits_x": "1.1525", "n_processed": 1451392, "loss": "1.1525", "epoch": 29, "bits_y": "0.0000", "n_images": 22678}
{"pred_loss": "1.0000", "train_time": 116231, "bits_x": "1.1481", "n_processed": 1501440, "loss": "1.1481", "epoch": 30, "bits_y": "0.0000", "n_images": 23460}
{"pred_loss": "1.0000", "train_time": 120128, "bits_x": "1.1306", "n_processed": 1551488, "loss": "1.1306", "epoch": 31, "bits_y": "0.0000", "n_images": 24242}


prafullasd · 2018-08-02T18:41:08Z

Maybe you ran out of CPU memory? Check using htop?
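
If watching htop is not practical for a multi-day nohup run, one option is to log memory use from inside the training process so that growth shows up in the output. The following is a minimal sketch using only the Python standard library (Unix only); `peak_rss_mb` is a hypothetical helper, not part of train.py, and calling it once per epoch is an assumption about where it would fit.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


# Hypothetical usage: call this once per epoch next to the existing logging.
print("peak RSS: %.1f MiB" % peak_rss_mb(), flush=True)
```

If the value keeps growing from epoch to epoch, running out of host memory becomes a more likely explanation. On Linux, a process killed by the kernel's out-of-memory killer usually leaves a message in the kernel log, which can be inspected with dmesg.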

prafullasd closed this as completed Aug 2, 2018
