Can this code train with GPU? #2

kaikaizhu · 2020-03-17T02:53:40Z

I want to train this code on GPUs. What can I do? Thanks a lot!

bhack · 2020-03-17T17:41:14Z

And on Edge TPU (Coral)? Also, will it be available on TF Hub?

kaikaizhu · 2020-03-18T11:08:34Z

If I only have GPUs, can I use the trained weights provided by this project to test my own pictures?

hoangphucITJP · 2020-03-20T09:15:55Z

@kaikaizhu, I guess you can, according to https://cloud.google.com/tpu/docs/using-estimator-api:

Models written using TPUEstimator work across CPUs, GPUs, single TPU devices, and whole TPU pods, generally with no code changes.

and the TPUEstimator is used in this repo:
https://github.com/google/automl/blob/master/efficientdet/main.py#L239

liminghuiv · 2020-03-21T04:50:05Z

I am also interested in training on GPU. Is there a tutorial? Thanks a lot.

mingxingtan · 2020-03-22T04:35:08Z

Some command line examples

Train on GPU:

python main.py --training_file_pattern=/coco_tfrecord/train* --model_name=efficientdet-d0 --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False

Eval on GPU:

# Assuming /tmp/efficientdet-d0/ contains your checkpoint.
python main.py --mode=eval --model_name=efficientdet-d0 --model_dir=/tmp/efficientdet-d0/ --validation_file_pattern=/coco_tfrecord/val* --val_json_file=/coco_tfrecord/instances_val2017.json --hparams="use_bfloat16=false" --use_tpu=False

Run inference on a single image:

# pip install pytype pycocotools
python model_inspect.py --runmode=infer --model_name=efficientdet-d0 --ckpt_path=/tmp/efficientdet-d0/ --input_image=/tmp/img1.jpg --output_image_dir=/tmp/det1/

I will add a tutorial colab soon.
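
If you prefer to launch these commands from Python, the flags can be assembled programmatically. The following is a minimal sketch, not part of the repo: `build_command` is a hypothetical helper, the flag names and values are taken from the commands above, and the paths are placeholders for your own data and output directories.

```python
import shlex
import sys


def build_command(script, flags):
    """Return an argv list such as [python, 'main.py', '--use_tpu=False', ...]."""
    argv = [sys.executable, script]
    for name, value in flags.items():
        argv.append(f"--{name}={value}")
    return argv


# Flags as used in the GPU training command; adjust the paths to your setup.
train_flags = {
    "training_file_pattern": "/coco_tfrecord/train*",
    "model_name": "efficientdet-d0",
    "model_dir": "/tmp/efficientnet/",
    "hparams": "use_bfloat16=false",
    "use_tpu": "False",
}

train_cmd = build_command("main.py", train_flags)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in train_cmd))

# To actually start training, run this from the efficientdet directory:
# import subprocess
# subprocess.run(train_cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so the `train*` pattern reaches `main.py` unexpanded.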

airqj · 2020-03-22T14:30:13Z

@mingxingtan
The flag "--use_tpu=False" makes training use the CPU instead of the TPU, and it is very slow.
Do we need to change some code to train EfficientDet on GPU?

Jilliansea · 2020-03-23T14:11:00Z

@mingxingtan Hi, I want to run detection on a list of images given in a 'txt' file. I changed the code of build_input, but it still fails in post-processing. Because the inference batch size is 1, when I send all the images to the model they are still handled as one batch, so the number of anchors becomes bigger than the index...
So, could you please publish inference code that converts the ckpt to pb and runs inference with the pb model on multiple images?
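
One possible workaround, assuming the inference graph is built for a fixed batch size, is to split the image list into chunks of that size and run one inference call per chunk. The sketch below only covers reading and chunking the list; `read_image_list` and `batches` are hypothetical helpers, and it does not cover the ckpt-to-pb conversion.

```python
from pathlib import Path


def read_image_list(txt_path):
    """Return the non-empty, stripped lines of a text file as image paths."""
    lines = Path(txt_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder paths; in practice they would come from read_image_list("images.txt").
paths = ["/tmp/img1.jpg", "/tmp/img2.jpg", "/tmp/img3.jpg"]

for batch in batches(paths, batch_size=1):
    # Each batch would be passed to one inference call of matching batch size.
    print(batch)
```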

liminghuiv · 2020-03-23T17:12:40Z

Hi @mingxingtan, thanks for taking a look at it. I checked the efficientdet/main.py code:
line 53:
flags.DEFINE_bool('use_tpu', True, 'Use TPUs rather than CPUs')
It seems that it will use the CPU instead of the GPU if we set use_tpu to False.


mingxingtan · 2020-03-24T22:14:53Z

@liminghuiv that comment was wrong, and I have just fixed it. Estimator automatically uses the GPU if you have one; otherwise it uses the CPU.
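
To see which of the two cases applies on a given machine, you can ask TensorFlow which GPUs it sees before starting training. This is a minimal sketch assuming TensorFlow 2.x; `visible_gpus` is a hypothetical helper, not something from this repo.

```python
def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.config.list_physical_devices is the name in newer releases; older
    # ones (such as 2.0) expose it under tf.config.experimental.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return list_devices("GPU")


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment.")
elif gpus:
    print("TensorFlow sees", len(gpus), "GPU(s):", gpus)
else:
    print("No GPU is visible to TensorFlow, so training would run on the CPU.")
```

An empty list usually points to a CPU-only TensorFlow build or a CUDA/driver mismatch rather than to anything in this repo.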

ruodingt · 2020-05-29T03:18:18Z

Hi @mingxingtan
Thanks so much for sharing your fantastic work.

I have a similar problem here: the TPU estimator does not train on my GPU.
I am using TensorFlow 2.0.0 (a Docker image from the official TensorFlow Docker Hub).

Although the system has a V100 GPU, it still only trains on the CPU.
Could you give me some tips?

Thank you.


I0529 03:07:38.554332 139669136197440 main.py:383] {'name': 'efficientdet-d0', 'act_type': 'swish', 'image_size': (512, 512), 'input_rand_hflip': True, 'train_scale_min': 0.1, 'train_scale_max': 2.0, 'autoaugment_policy': None, 'use_augmix': False, 'augmix_params': (3, -1, 1), 'num_classes': 20, 'skip_crowd_during_training': True, 'label_id_mapping': None, 'min_level': 3, 'max_level': 7, 'num_scales': 3, 'aspect_ratios': [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)], 'anchor_scale': 4.0, 'is_training_bn': True, 'momentum': 0.9, 'optimizer': 'sgd', 'learning_rate': 0.08, 'lr_warmup_init': 0.008, 'lr_warmup_epoch': 1.0, 'first_lr_drop_epoch': 200.0, 'second_lr_drop_epoch': 250.0, 'poly_lr_power': 0.9, 'clip_gradients_norm': 10.0, 'num_epochs': 18000, 'data_format': 'channels_last', 'alpha': 0.25, 'gamma': 1.5, 'delta': 0.1, 'box_loss_weight': 50.0, 'iou_loss_type': None, 'iou_loss_weight': 1.0, 'weight_decay': 4e-05, 'strategy': '', 'precision': None, 'box_class_repeats': 3, 'fpn_cell_repeats': 3, 'fpn_num_filters': 64, 'separable_conv': True, 'apply_bn_for_resampling': True, 'conv_after_downsample': False, 'conv_bn_act_pattern': False, 'use_native_resize_op': True, 'pooling_type': None, 'fpn_name': None, 'fpn_weight_method': None, 'fpn_config': None, 'survival_prob': None, 'lr_decay_method': 'cosine', 'moving_average_decay': 0.9998, 'ckpt_var_scope': None, 'var_exclude_expr': '.*/class-predict/.*', 'backbone_name': 'efficientnet-b0', 'backbone_config': None, 'var_freeze_expr': None, 'resnet_depth': 50, 'model_name': 'efficientdet-d0', 'iterations_per_loop': 100, 'model_dir': '../output/exp-001-baseline-d0', 'num_shards': 1, 'num_examples_per_epoch': 2000, 'backbone_ckpt': '/home/appuser/project/pretrained/efficientnet-b0', 'ckpt': None, 'val_json_file': None, 'testdev_dir': None, 'mode': 'train_and_eval', 'DATA_CONF': {'CATEGORIES_IN_RANGE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'EVAL_SCOPE': 
['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'METRIC_SCOPE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external'], 'EVAL': ['coco_stack_out/user-data-2020-Apr-R3_B10M3-34.json'], 'IMAGE_BASEDIR': '../data/', 'SUB_MASK_CATEGORY': ['calculus'], 'TRAIN': ['coco_stack_out/web_decay_600-26-full.json', 'coco_stack_out/gingivitis_web_490-31-full.json', 'coco_stack_out/calculus_web_230-28-full.json', 'coco_stack_out/mturk_mar_2020r-30-full.json', 'coco_stack_out/legacy_decay-25-full.json', 'coco_stack_out/mturk50_mar16_ro-37.json', 'coco_stack_out/tooth_crawl_web_A-36.json', 'coco_stack_out/spotty_stain_web_A-35.json']}}
I0529 03:07:38.554489 139669136197440 main.py:274] Starting training cycle, epoch: 0 / 18000.
INFO:tensorflow:Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0529 03:07:38.554966 139669136197440 estimator.py:212] Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0529 03:07:38.555698 139669136197440 tpu_context.py:221] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0529 03:07:38.556049 139669136197440 tpu_context.py:223] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0529 03:07:38.561795 139669136197440 deprecation.py:506] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0529 03:07:38.562173 139669136197440 deprecation.py:323] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-05-29 03:07:38.570108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-29 03:07:38.573494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.574404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-05-29 03:07:38.574616: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-05-29 03:07:38.575953: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-05-29 03:07:38.577151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-05-29 03:07:38.577451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-05-29 03:07:38.579049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-05-29 03:07:38.580268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-05-29 03:07:38.584087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-29 03:07:38.584177: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.585136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.586012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
WARNING:tensorflow:From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0529 03:07:38.623125 139669136197440 deprecation.py:323] From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
W0529 03:07:39.093873 139669136197440 ag_logging.py:146] Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
INFO:tensorflow:Calling model_fn.
I0529 03:07:39.843184 139669136197440 estimator.py:1147] Calling model_fn.
INFO:tensorflow:Running train on CPU

linkrain-a · 2020-06-11T03:52:13Z

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

Find your answer through this link:
https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ

mingxingtan self-assigned this Mar 22, 2020

mingxingtan added a commit that referenced this issue Mar 24, 2020

Fix a doc string error.

e90a471

Pointed out by #2.

mingxingtan closed this as completed Jun 23, 2020
