Can this code train with GPU? #2

Closed
kaikaizhu opened this issue Mar 17, 2020 · 11 comments

@kaikaizhu

I want to train this code on GPUs. What should I do? Thanks a lot!

@bhack

bhack commented Mar 17, 2020

And on Edge TPU (Coral)? Also, will it be available on TF Hub?

@kaikaizhu
Author

If I only have GPUs, can I use the trained weights provided by this project to test my own pictures?

@hoangphucITJP

@kaikaizhu, I guess you can, according to https://cloud.google.com/tpu/docs/using-estimator-api:

Models written using TPUEstimator work across CPUs, GPUs, single TPU devices, and whole TPU pods, generally with no code changes.

and the TPUEstimator is used in this repo:
https://github.com/google/automl/blob/master/efficientdet/main.py#L239

@liminghuiv

I am also interested in training on GPU. Is there a tutorial? Thanks a lot.

@mingxingtan
Member

mingxingtan commented Mar 22, 2020

Some command-line examples:

  1. Train on GPU:

python main.py --training_file_pattern=/coco_tfrecord/train* --model_name=efficientdet-d0 --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False

  2. Eval on GPU:

# Assuming /tmp/efficientdet-d0/ contains your checkpoint.
python main.py --mode=eval --model_name=efficientdet-d0 --model_dir=/tmp/efficientdet-d0/ --validation_file_pattern=/coco_tfrecord/val* --val_json_file=/coco_tfrecord/instances_val2017.json --hparams="use_bfloat16=false" --use_tpu=False

  3. Run inference on a single image:

# pip install pytype pycocotools
python model_inspect.py --runmode=infer --model_name=efficientdet-d0 --ckpt_path=/tmp/efficientdet-d0/ --input_image=/tmp/img1.jpg --output_image_dir=/tmp/det1/

I will add a tutorial colab soon.
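
The train and eval commands in this comment share the same GPU-related flags. As a minimal sketch (not part of the repository), the commands could be assembled from Python so those flags stay consistent. `build_command` and `GPU_FLAGS` are hypothetical names, the paths are the placeholder paths from the commands, and the sketch assumes training writes to the same `/tmp/efficientdet-d0/` directory that the eval command reads from.

```python
import shlex

# Flags that the GPU train/eval commands in this thread pass to main.py.
GPU_FLAGS = ['--hparams=use_bfloat16=false', '--use_tpu=False']


def build_command(script, **flags):
    """Return an argv list for `python <script> --key=value ...`.

    Hypothetical helper for illustration; it only formats flags and does
    not check that the script actually accepts them.
    """
    argv = ['python', script]
    argv += ['--{}={}'.format(key, value) for key, value in flags.items()]
    if script == 'main.py':
        argv += GPU_FLAGS
    return argv


train_cmd = build_command(
    'main.py',
    training_file_pattern='/coco_tfrecord/train*',
    model_name='efficientdet-d0',
    model_dir='/tmp/efficientdet-d0/',  # assumed: same directory the eval command reads
)

# Print a copy-pastable command line; pass train_cmd to subprocess.run to execute it.
print(' '.join(shlex.quote(arg) for arg in train_cmd))
```

Building the command as a list rather than a single string avoids shell-quoting surprises with the `train*` file pattern.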

@mingxingtan mingxingtan self-assigned this Mar 22, 2020
@airqj

airqj commented Mar 22, 2020

@mingxingtan
The flag "--use_tpu=False" makes training use the CPU instead of the TPU, and it is very slow.
Do we need to change some code to train EfficientDet on GPU?

@Jilliansea

@mingxingtan Hi, I want to run detection on a list of images given in a 'txt' file. I changed the code of build_input, but it still fails in post-processing. Because the inference batch size is 1, when I send all the images to the model they are still treated as one batch, and then the anchor numbers become bigger than the index...
So, could you please publish inference code that converts the ckpt to a pb and runs inference with the pb model on multiple images?
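
One way to approach this request is to read the image paths from the text file and feed them to the model in fixed-size groups, so that each call sees the batch size the post-processing expects. The sketch below covers only the path handling. `read_image_list` and `batched` are hypothetical helpers, the one-path-per-line file layout is an assumption, and the model call itself is left out.

```python
def read_image_list(lines):
    """Return non-empty, stripped image paths from an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example with an in-memory "file"; in practice pass an open file object.
paths = read_image_list(['/tmp/img1.jpg\n', '\n', '/tmp/img2.jpg\n', '/tmp/img3.jpg\n'])
for batch in batched(paths, batch_size=1):
    # Each batch would be handed to the inference call here.
    print(batch)
```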

@liminghuiv

Hi @mingxingtan, thanks for taking a look at it. I checked the efficientdet/main.py code, line 53:
flags.DEFINE_bool('use_tpu', True, 'Use TPUs rather than CPUs')
It seems that it will use the CPU instead of the GPU if we set use_tpu to False.

mingxingtan added a commit that referenced this issue Mar 24, 2020
Pointed out by #2.
@mingxingtan
Member

@liminghuiv that comment was wrong, and I have just fixed it. Estimator automatically uses a GPU if one is available; otherwise it uses the CPU.
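
A quick way to check whether TensorFlow can see a GPU before starting a run is to list the physical devices. This is a minimal sketch, not code from the repository: `visible_gpus` is a hypothetical helper, and it assumes a TensorFlow version that provides `tf.config.list_physical_devices` or its older `tf.config.experimental` counterpart.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if none are visible.

    Hypothetical helper; returns [] when TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Newer releases expose the call directly; older ones under `experimental`.
    list_devices = getattr(tf.config, 'list_physical_devices', None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return [device.name for device in list_devices('GPU')]


print('Visible GPUs:', visible_gpus())
```

An empty list suggests the run would fall back to the CPU, in which case the CUDA and driver setup is worth checking first.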

@ruodingt

ruodingt commented May 29, 2020

Hi @mingxingtan
Thanks so much for sharing your fantastic work.

I have a similar problem: the TPU estimator does not train on my GPU.
I am using TensorFlow 2.0.0 (a Docker image from the official TF Docker Hub).

Although the system has a V100 GPU, it still only trains on the CPU.
Could you give me some tips?

Thank you.


I0529 03:07:38.554332 139669136197440 main.py:383] {'name': 'efficientdet-d0', 'act_type': 'swish', 'image_size': (512, 512), 'input_rand_hflip': True, 'train_scale_min': 0.1, 'train_scale_max': 2.0, 'autoaugment_policy': None, 'use_augmix': False, 'augmix_params': (3, -1, 1), 'num_classes': 20, 'skip_crowd_during_training': True, 'label_id_mapping': None, 'min_level': 3, 'max_level': 7, 'num_scales': 3, 'aspect_ratios': [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)], 'anchor_scale': 4.0, 'is_training_bn': True, 'momentum': 0.9, 'optimizer': 'sgd', 'learning_rate': 0.08, 'lr_warmup_init': 0.008, 'lr_warmup_epoch': 1.0, 'first_lr_drop_epoch': 200.0, 'second_lr_drop_epoch': 250.0, 'poly_lr_power': 0.9, 'clip_gradients_norm': 10.0, 'num_epochs': 18000, 'data_format': 'channels_last', 'alpha': 0.25, 'gamma': 1.5, 'delta': 0.1, 'box_loss_weight': 50.0, 'iou_loss_type': None, 'iou_loss_weight': 1.0, 'weight_decay': 4e-05, 'strategy': '', 'precision': None, 'box_class_repeats': 3, 'fpn_cell_repeats': 3, 'fpn_num_filters': 64, 'separable_conv': True, 'apply_bn_for_resampling': True, 'conv_after_downsample': False, 'conv_bn_act_pattern': False, 'use_native_resize_op': True, 'pooling_type': None, 'fpn_name': None, 'fpn_weight_method': None, 'fpn_config': None, 'survival_prob': None, 'lr_decay_method': 'cosine', 'moving_average_decay': 0.9998, 'ckpt_var_scope': None, 'var_exclude_expr': '.*/class-predict/.*', 'backbone_name': 'efficientnet-b0', 'backbone_config': None, 'var_freeze_expr': None, 'resnet_depth': 50, 'model_name': 'efficientdet-d0', 'iterations_per_loop': 100, 'model_dir': '../output/exp-001-baseline-d0', 'num_shards': 1, 'num_examples_per_epoch': 2000, 'backbone_ckpt': '/home/appuser/project/pretrained/efficientnet-b0', 'ckpt': None, 'val_json_file': None, 'testdev_dir': None, 'mode': 'train_and_eval', 'DATA_CONF': {'CATEGORIES_IN_RANGE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'EVAL_SCOPE': 
['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'METRIC_SCOPE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external'], 'EVAL': ['coco_stack_out/user-data-2020-Apr-R3_B10M3-34.json'], 'IMAGE_BASEDIR': '../data/', 'SUB_MASK_CATEGORY': ['calculus'], 'TRAIN': ['coco_stack_out/web_decay_600-26-full.json', 'coco_stack_out/gingivitis_web_490-31-full.json', 'coco_stack_out/calculus_web_230-28-full.json', 'coco_stack_out/mturk_mar_2020r-30-full.json', 'coco_stack_out/legacy_decay-25-full.json', 'coco_stack_out/mturk50_mar16_ro-37.json', 'coco_stack_out/tooth_crawl_web_A-36.json', 'coco_stack_out/spotty_stain_web_A-35.json']}}
I0529 03:07:38.554489 139669136197440 main.py:274] Starting training cycle, epoch: 0 / 18000.
INFO:tensorflow:Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0529 03:07:38.554966 139669136197440 estimator.py:212] Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0529 03:07:38.555698 139669136197440 tpu_context.py:221] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0529 03:07:38.556049 139669136197440 tpu_context.py:223] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0529 03:07:38.561795 139669136197440 deprecation.py:506] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0529 03:07:38.562173 139669136197440 deprecation.py:323] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-05-29 03:07:38.570108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-29 03:07:38.573494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.574404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-05-29 03:07:38.574616: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-05-29 03:07:38.575953: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-05-29 03:07:38.577151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-05-29 03:07:38.577451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-05-29 03:07:38.579049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-05-29 03:07:38.580268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-05-29 03:07:38.584087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-29 03:07:38.584177: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.585136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.586012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
WARNING:tensorflow:From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0529 03:07:38.623125 139669136197440 deprecation.py:323] From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
W0529 03:07:39.093873 139669136197440 ag_logging.py:146] Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
INFO:tensorflow:Calling model_fn.
I0529 03:07:39.843184 139669136197440 estimator.py:1147] Calling model_fn.
INFO:tensorflow:Running train on CPU
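
Logs like this one are long, and the lines relevant to device selection are easy to miss. As a rough sketch, the snippet below pulls out the messages about detected GPUs and the device reported by the estimator. The patterns are taken from lines in this log. `summarize_devices` is a hypothetical helper, and other TensorFlow versions may word these messages differently.

```python
import re

# Patterns copied from messages that appear in the log in this thread.
GPU_FOUND = re.compile(r'Adding visible gpu devices: (.+)')
RUNNING_ON = re.compile(r'Running (\w+) on (\S+)')


def summarize_devices(log_lines):
    """Collect visible GPU ids and the (mode, device) pairs reported in a log."""
    gpus, running = [], []
    for line in log_lines:
        match = GPU_FOUND.search(line)
        if match:
            gpus.append(match.group(1).strip())
        match = RUNNING_ON.search(line)
        if match:
            running.append((match.group(1), match.group(2)))
    return {'visible_gpus': gpus, 'running_on': running}


sample = [
    '2020-05-29 03:07:38.586012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0',
    'INFO:tensorflow:Running train on CPU',
]
print(summarize_devices(sample))
```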

@linkrain-a

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

You can find the answer at this link:
https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ
