I trained the model on a custom dataset, and the training finished without any issues (I needed to decrease the crop size, as discussed in the FAQ).
However, when running the evaluation (mode=eval), the compute node runs out of memory. I'm using the same compute node and GPU as during training.
So my main question is: what is the difference between training and evaluation? Are the images not resized to the same size as during training, with the original resolution used instead?
Here is the experiment config file:
# proto-file: deeplab2/config.proto
# proto-message: ExperimentOptions
#
# Panoptic-DeepLab with ResNet-50-beta model variant and output stride 32.
#
############### PLEASE READ THIS BEFORE USING THIS CONFIG ###############
# Before using this config, you need to update the following fields:
# - experiment_name: Use a unique experiment name for each experiment.
# - initial_checkpoint: Update the path to the initial checkpoint.
# - train_dataset_options.file_pattern: Update the path to the
#   training set, e.g., your_dataset/train*.tfrecord
# - eval_dataset_options.file_pattern: Update the path to the
#   validation set, e.g., your_dataset/eval*.tfrecord
# - (optional) set merge_semantic_and_instance_with_tf_op: true, if you
#   could successfully compile the provided efficient merging operation
#   under the folder `tensorflow_ops`.
#########################################################################
#
# The `resnet50_beta` model variant replaces the first 7x7 convolutions in the
# original `resnet50` with three 3x3 convolutions, which is useful for dense
# prediction tasks.
#
# References:
# For resnet-50-beta, see
# https://github.com/tensorflow/models/blob/master/research/deeplab/core/resnet_v1_beta.py
# For Panoptic-DeepLab, see
# - Bowen Cheng, et al. "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline
#   for Bottom-Up Panoptic Segmentation." In CVPR, 2020.

# Use a unique experiment_name for each experiment.
experiment_name: "air_flange_panoptic_segmentation_resnet50"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: "air_flange_panoptic_segmentation_resnet50/ckpt-2000"
  backbone {
    name: "resnet50_beta"
    output_stride: 32
  }
  decoder {
    feature_key: "res5"
    decoder_channels: 256
    aspp_channels: 256
    atrous_rates: 3
    atrous_rates: 6
    atrous_rates: 9
  }
  panoptic_deeplab {
    low_level {
      feature_key: "res3"
      channels_project: 64
    }
    low_level {
      feature_key: "res2"
      channels_project: 32
    }
    instance {
      low_level_override {
        feature_key: "res3"
        channels_project: 32
      }
      low_level_override {
        feature_key: "res2"
        channels_project: 16
      }
      instance_decoder_override {
        feature_key: "res5"
        decoder_channels: 128
        atrous_rates: 3
        atrous_rates: 6
        atrous_rates: 9
      }
      center_head {
        output_channels: 1
        head_channels: 32
      }
      regression_head {
        output_channels: 2
        head_channels: 32
      }
    }
    semantic_head {
      output_channels: 2
      head_channels: 256
    }
  }
}
trainer_options {
  save_checkpoints_steps: 1000
  save_summaries_steps: 100
  steps_per_loop: 100
  loss_options {
    semantic_loss {
      name: "softmax_cross_entropy"
      weight: 1.0
      top_k_percent: 0.2
    }
    center_loss {
      name: "mse"
      weight: 200
    }
    regression_loss {
      name: "l1"
      weight: 0.01
    }
  }
  solver_options {
    base_learning_rate: 0.00025
    # training_number_of_steps: 60000
    training_number_of_steps: 2000
  }
}
train_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to training set.
  file_pattern: "train*.tfrecord"
  # Adjust the batch_size accordingly to better fit your GPU/TPU memory.
  # Also see Q1 in g3doc/faq.md.
  batch_size: 8
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  augmentations {
    min_scale_factor: 0.5
    max_scale_factor: 2.0
    scale_factor_step_size: 0.1
    autoaugment_policy_name: "simple_classification_policy_magnitude_scale_0.2"
  }
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
eval_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to validation set.
  file_pattern: "val*.tfrecord"
  batch_size: 1
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  # Add options to make the evaluation loss comparable to the training loss.
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
evaluator_options {
  continuous_eval_timeout: 43200
  stuff_area_limit: 2048
  center_score_threshold: 0.1
  nms_kernel: 13
  save_predictions: true
  save_raw_predictions: false
  # Use pure tf functions (i.e., no CUDA kernel) to merge semantic and
  # instance maps. For faster speed, compile TensorFlow with provided kernel
  # implementation under the folder `tensorflow_ops`, and set
  # merge_semantic_and_instance_with_tf_op to true.
  merge_semantic_and_instance_with_tf_op: false
  eval_interval: 1000
}
Based on your provided config, you have set min_resize_value = max_resize_value = 641, so the input is resized on the fly such that its longest side is 641. This resized input is fed to the network during both training and evaluation. One thing to note, however, is that during evaluation we further resize the prediction to the same size as the original groundtruth label (see this line), which ensures the predictions are evaluated against the intact, original groundtruth label.
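In other words, the memory-heavy step at evaluation time is not the forward pass (which runs at the 641-side resolution) but the upsampling of the prediction back to the original resolution. A minimal illustrative sketch of that behavior, with hypothetical names rather than the actual deeplab2 internals:

import tensorflow as tf

def resize_prediction_to_label_size(logits, label_height, label_width):
  # logits: [batch, h, w, num_classes], produced at the resized resolution.
  # The result has the full original label resolution, so for very
  # high-resolution images this single tensor can exhaust GPU memory.
  return tf.image.resize(
      logits, [label_height, label_width],
      method=tf.image.ResizeMethod.BILINEAR)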
If your dataset resolution is so large that not even one image fits on a GPU during evaluation, maybe you could resize the images to a smaller size and create a new dataset.
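If you go that route, resize the images with a smooth filter but the panoptic labels with nearest-neighbor interpolation, so the class and instance IDs stay intact. A hypothetical pre-processing sketch (the paths, helper name, and 2048-pixel target are placeholders, not part of deeplab2):

from PIL import Image

def downscale_pair(image_path, label_path, out_image_path, out_label_path,
                   longest_side=2048):
  # Downscale an image/label pair so the longest side is at most
  # `longest_side`, then write both out for TFRecord creation.
  image = Image.open(image_path)
  label = Image.open(label_path)
  scale = longest_side / max(image.size)
  if scale < 1.0:
    new_size = (round(image.size[0] * scale), round(image.size[1] * scale))
    image = image.resize(new_size, Image.BILINEAR)
    # Nearest-neighbor keeps label/instance IDs valid; bilinear would blend
    # neighboring IDs into meaningless values.
    label = label.resize(new_size, Image.NEAREST)
  image.save(out_image_path)
  label.save(out_label_path)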
Thanks for your response. Indeed, the issue is that a single image cannot fit on the GPU during evaluation due to its very high resolution. What does work is using a high-memory CPU node.
Thanks for your help!