I trained the model on a custom dataset, and the training finished without any issues (I needed to decrease the crop size, as discussed in the FAQ).
However, when running the evaluation (mode=eval), the compute node runs out of memory. I'm using the same compute node and GPU as during training.
So my main question is: what is the difference between training and evaluation? Are the images not resized to the same size as during training, with the original resolution used instead?
Here is the experiment config file:
# proto-file: deeplab2/config.proto
# proto-message: ExperimentOptions
#
# Panoptic-DeepLab with ResNet-50-beta model variant and output stride 32.
#
############### PLEASE READ THIS BEFORE USING THIS CONFIG ###############
# Before using this config, you need to update the following fields:
# - experiment_name: Use a unique experiment name for each experiment.
# - initial_checkpoint: Update the path to the initial checkpoint.
# - train_dataset_options.file_pattern: Update the path to the
#   training set, e.g., your_dataset/train*.tfrecord
# - eval_dataset_options.file_pattern: Update the path to the
#   validation set, e.g., your_dataset/eval*.tfrecord
# - (optional) set merge_semantic_and_instance_with_tf_op: true, if you
#   could successfully compile the provided efficient merging operation
#   under the folder `tensorflow_ops`.
#########################################################################
#
# The `resnet50_beta` model variant replaces the first 7x7 convolutions in the
# original `resnet50` with three 3x3 convolutions, which is useful for dense
# prediction tasks.
#
# References:
# For resnet-50-beta, see
# https://github.com/tensorflow/models/blob/master/research/deeplab/core/resnet_v1_beta.py
# For Panoptic-DeepLab, see
# - Bowen Cheng, et al. "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline
#   for Bottom-Up Panoptic Segmentation." In CVPR, 2020.

# Use a unique experiment_name for each experiment.
experiment_name: "air_flange_panoptic_segmentation_resnet50"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: "air_flange_panoptic_segmentation_resnet50/ckpt-2000"
  backbone {
    name: "resnet50_beta"
    output_stride: 32
  }
  decoder {
    feature_key: "res5"
    decoder_channels: 256
    aspp_channels: 256
    atrous_rates: 3
    atrous_rates: 6
    atrous_rates: 9
  }
  panoptic_deeplab {
    low_level {
      feature_key: "res3"
      channels_project: 64
    }
    low_level {
      feature_key: "res2"
      channels_project: 32
    }
    instance {
      low_level_override {
        feature_key: "res3"
        channels_project: 32
      }
      low_level_override {
        feature_key: "res2"
        channels_project: 16
      }
      instance_decoder_override {
        feature_key: "res5"
        decoder_channels: 128
        atrous_rates: 3
        atrous_rates: 6
        atrous_rates: 9
      }
      center_head {
        output_channels: 1
        head_channels: 32
      }
      regression_head {
        output_channels: 2
        head_channels: 32
      }
    }
    semantic_head {
      output_channels: 2
      head_channels: 256
    }
  }
}
trainer_options {
  save_checkpoints_steps: 1000
  save_summaries_steps: 100
  steps_per_loop: 100
  loss_options {
    semantic_loss {
      name: "softmax_cross_entropy"
      weight: 1.0
      top_k_percent: 0.2
    }
    center_loss {
      name: "mse"
      weight: 200
    }
    regression_loss {
      name: "l1"
      weight: 0.01
    }
  }
  solver_options {
    base_learning_rate: 0.00025
    # training_number_of_steps: 60000
    training_number_of_steps: 2000
  }
}
train_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to training set.
  file_pattern: "train*.tfrecord"
  # Adjust the batch_size accordingly to better fit your GPU/TPU memory.
  # Also see Q1 in g3doc/faq.md.
  batch_size: 8
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  augmentations {
    min_scale_factor: 0.5
    max_scale_factor: 2.0
    scale_factor_step_size: 0.1
    autoaugment_policy_name: "simple_classification_policy_magnitude_scale_0.2"
  }
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
eval_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to validation set.
  file_pattern: "val*.tfrecord"
  batch_size: 1
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  # Add options to make the evaluation loss comparable to the training loss.
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
evaluator_options {
  continuous_eval_timeout: 43200
  stuff_area_limit: 2048
  center_score_threshold: 0.1
  nms_kernel: 13
  save_predictions: true
  save_raw_predictions: false
  # Use pure tf functions (i.e., no CUDA kernel) to merge semantic and
  # instance maps. For faster speed, compile TensorFlow with provided kernel
  # implementation under the folder `tensorflow_ops`, and set
  # merge_semantic_and_instance_with_tf_op to true.
  merge_semantic_and_instance_with_tf_op: false
  eval_interval: 1000
}
Based on your provided config, you have set min_resize_value = max_resize_value = 641, so the input is resized on the fly such that its longest side is 641. This resized input is fed to the network during both training and evaluation. One thing to note, however, is that during evaluation we further resize the prediction to the same size as the original groundtruth label (see this line), which ensures the predictions are evaluated against the intact, original groundtruth label.
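In other words, the memory-heavy step at evaluation time is not the forward pass (which runs at the 641-side resolution) but the upsampling of the prediction back to the original resolution. A minimal illustrative sketch of that behavior, with hypothetical names rather than the actual deeplab2 internals:

import tensorflow as tf

def resize_prediction_to_label_size(logits, label_height, label_width):
  # logits: [batch, h, w, num_classes], produced at the resized resolution.
  # The result has the full original label resolution, so for very
  # high-resolution images this single tensor can exhaust GPU memory.
  return tf.image.resize(
      logits, [label_height, label_width],
      method=tf.image.ResizeMethod.BILINEAR)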
If your dataset resolution is so large that not even one image fits on a GPU during evaluation, maybe you could resize the images to a smaller size and create a new dataset.
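If you go that route, resize the images with a smooth filter but the panoptic labels with nearest-neighbor interpolation, so the class and instance IDs stay intact. A hypothetical pre-processing sketch (the paths, helper name, and 2048-pixel target are placeholders, not part of deeplab2):

from PIL import Image

def downscale_pair(image_path, label_path, out_image_path, out_label_path,
                   longest_side=2048):
  # Downscale an image/label pair so the longest side is at most
  # `longest_side`, then write both out for TFRecord creation.
  image = Image.open(image_path)
  label = Image.open(label_path)
  scale = longest_side / max(image.size)
  if scale < 1.0:
    new_size = (round(image.size[0] * scale), round(image.size[1] * scale))
    image = image.resize(new_size, Image.BILINEAR)
    # Nearest-neighbor keeps label/instance IDs valid; bilinear would blend
    # neighboring IDs into meaningless values.
    label = label.resize(new_size, Image.NEAREST)
  image.save(out_image_path)
  label.save(out_label_path)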
Thanks for your response. Indeed, the issue is that a single image cannot fit on the GPU during evaluation due to its very high resolution. What does work is using a high-memory CPU node.
Thanks for your help!