
How much memory to run the cifar10 example #112

Open
qtz93 opened this issue Nov 12, 2019 · 4 comments
qtz93 commented Nov 12, 2019

The following is the output printed when I ran "./scripts/cifar10_macro_search.sh" (the process was killed):

jovyan@219394b7f111$ ./scripts/cifar10_macro_search.sh 
--------------------------------------------------------------------------------
Path outputs exists. Remove and remake.
--------------------------------------------------------------------------------
Logging to outputs/stdout
--------------------------------------------------------------------------------
batch_size...................................................................128
child_block_size...............................................................3
child_cutout_size...........................................................None
child_drop_path_keep_prob....................................................0.6
child_filter_size..............................................................5
child_fixed_arc.............................................................None
child_grad_bound.............................................................5.0
child_keep_prob..............................................................0.9
child_l2_reg.............................................................0.00025
child_lr.....................................................................0.1
child_lr_T_0..................................................................10
child_lr_T_mul.................................................................2
child_lr_cosine.............................................................True
child_lr_dec_every...........................................................100
child_lr_dec_rate............................................................0.1
child_lr_max................................................................0.05
child_lr_min..............................................................0.0005
child_num_aggregate.........................................................None
child_num_branches.............................................................6
child_num_cells................................................................5
child_num_layers..............................................................12
child_num_replicas.............................................................1
child_out_filters.............................................................36
child_out_filters_scale........................................................1
child_skip_pattern..........................................................None
child_sync_replicas........................................................False
child_use_aux_heads.........................................................True
controller_bl_dec...........................................................0.99
controller_entropy_weight.................................................0.0001
controller_forwards_limit......................................................2
controller_keep_prob.........................................................0.5
controller_l2_reg............................................................0.0
controller_lr..............................................................0.001
controller_lr_dec_rate.......................................................1.0
controller_num_aggregate......................................................20
controller_num_replicas........................................................1
controller_op_tanh_reduce....................................................2.5
controller_search_whole_channels............................................True
controller_skip_target.......................................................0.4
controller_skip_weight.......................................................0.8
controller_sync_replicas....................................................True
controller_tanh_constant.....................................................1.5
controller_temperature......................................................None
controller_train_every.........................................................1
controller_train_steps........................................................50
controller_training.........................................................True
controller_use_critic......................................................False
data_format.................................................................NCHW
data_path...........................................................data/cifar10
eval_every_epochs..............................................................1
log_every.....................................................................50
num_epochs...................................................................310
output_dir...............................................................outputs
reset_output_dir............................................................True
search_for.................................................................macro
--------------------------------------------------------------------------------
Reading data
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
test_batch
Prepropcess: [subtract mean], [divide std]
mean: [125.34512 122.94169 113.83898]
std: [63.02383 62.13708 66.74233]
--------------------------------------------------------------------------------
Build model child
Build data ops
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/models.py:83: shuffle_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size)`.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:753: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:753: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:861: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/tensor_array_ops.py:162: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/models.py:125: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
--------------------------------------------------------------------------------
Building ConvController
--------------------------------------------------------------------------------
Build controller sampler
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:157: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:158: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:238: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
--------------------------------------------------------------------------------
Build train graph

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:578: average_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.average_pooling2d instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:581: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
Tensor("child/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_1/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_2/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_3/pool_at_3/from_4/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_4/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_5/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_6/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_7/pool_at_7/from_8/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_8/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_9/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_10/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_11/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:233: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Model has 697860 params
--------------------------------------------------------------------------------
Build valid graph
Tensor("child_1/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
Build test graph
Tensor("child_2/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
Build valid graph on shuffled data
Tensor("child_3/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
<tf.Variable 'controller/lstm/layer_0/w:0' shape=(128, 256) dtype=float32_ref>
<tf.Variable 'controller/g_emb:0' shape=(1, 64) dtype=float32_ref>
<tf.Variable 'controller/emb/w:0' shape=(6, 64) dtype=float32_ref>
<tf.Variable 'controller/softmax/w:0' shape=(64, 6) dtype=float32_ref>
<tf.Variable 'controller/attention/w_1:0' shape=(64, 64) dtype=float32_ref>
<tf.Variable 'controller/attention/w_2:0' shape=(64, 64) dtype=float32_ref>
<tf.Variable 'controller/attention/v:0' shape=(64, 1) dtype=float32_ref>
WARNING:tensorflow:From /home/jovyan/sources/enas/src/utils.py:231: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The `SyncReplicaOptimizer` class is deprecated. For synchrononous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py:1294: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
--------------------------------------------------------------------------------
Starting session
2019-11-12 08:28:15.808271: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-12 08:28:16.940255: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557a92d62660 executing computations on platform CUDA. Devices:
2019-11-12 08:28:16.940310: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-11-12 08:28:16.940324: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-11-12 08:28:17.072808: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3597780000 Hz
2019-11-12 08:28:17.073571: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557a92d56010 executing computations on platform Host. Devices:
2019-11-12 08:28:17.073612: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-12 08:28:17.073827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 5.08GiB
2019-11-12 08:28:17.073934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 5.21GiB
2019-11-12 08:28:17.074477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-11-12 08:28:17.092715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-12 08:28:17.092752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2019-11-12 08:28:17.092768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2019-11-12 08:28:17.092779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2019-11-12 08:28:17.093006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4900 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2019-11-12 08:28:17.093536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 5038 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:809: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
./scripts/cifar10_macro_search.sh: line 40: 10644 Killed                  python src/cifar10/main.py --data_format="NCHW" --search_for="macro" --reset_output_dir --data_path="data/cifar10" --output_dir="outputs" --batch_size=128 --num_epochs=310 --log_every=50 --eval_every_epochs=1 --child_use_aux_heads --child_num_layers=12 --child_out_filters=36 --child_l2_reg=0.00025 --child_num_branches=6 --child_num_cell_layers=5 --child_keep_prob=0.90 --child_drop_path_keep_prob=0.60 --child_lr_cosine --child_lr_max=0.05 --child_lr_min=0.0005 --child_lr_T_0=10 --child_lr_T_mul=2 --controller_training --controller_search_whole_channels --controller_entropy_weight=0.0001 --controller_train_every=1 --controller_sync_replicas --controller_num_aggregate=20 --controller_train_steps=50 --controller_lr=0.001 --controller_tanh_constant=1.5 --controller_op_tanh_reduce=2.5 --controller_skip_target=0.4 --controller_skip_weight=0.8 "$@"

My environment is configured as follows:
System: CentOS 7
GPU: NVIDIA GeForce GTX 1080 Ti × 2
Memory: 16 GB
TensorFlow: 1.13.1 (GPU)
Python: 3.6
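For reference, a silent "Killed" with no Python traceback (as in the log above) is typically the Linux OOM killer terminating the process when host RAM runs out. One way to confirm this, assuming you can read the kernel log on the machine:

```shell
# Look for OOM-killer entries around the time the process died.
# Lines like "Out of memory: Killed process <pid> (python)" confirm
# the kernel killed the job for exhausting host memory.
dmesg -T | grep -iE "out of memory|killed process"
```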

qtz93 commented Nov 13, 2019

???

mkamein commented Dec 18, 2019

Hey! I think I am facing a similar problem. My script gets killed after the "Starting session" section, but no actual errors are displayed. Did you manage to solve this problem?

@racheljose21

Does anyone have a solution? I'm facing the same issue

nott0 commented Oct 30, 2020

I believe you don't have enough memory. Try increasing it to at least 32 GB, or reduce the amount of training data loaded in data_util.py.
