Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finetuning res101 model pretrained models error #557

Open
jsuit opened this issue May 25, 2019 · 28 comments
Open

Finetuning res101 model pretrained models error #557

jsuit opened this issue May 25, 2019 · 28 comments

Comments

@jsuit
Copy link

jsuit commented May 25, 2019

So if you try and retrain the resnet101 (faster_rcnn_1_10_9771.pth) you get the following error:

   File "trainval_net.py", line 336, in <module>
      optimizer.step()
   File "... pytorch_0.4/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
      buf.mul_(momentum).add_(1 - dampening, d_p)
   RuntimeError: The expanded size of the tensor (512) must match the existing size (1024) at 
  non-singleton dimension 1

It's the optimizer version knowledge of the parameters is different than what it actually is. Also, you get the same error if you use, /home/jonathan/faster-rcnn.pytorch/faster_rcnn_1_6_9771.pth

@EMCP
Copy link
Contributor

EMCP commented May 26, 2019

this was fixed a while ago I believe.. I've successfully retrained res101 now

#515

@AlexanderHustinx
Copy link

Can you include your command-line argument and the full stack-trace?
Might be able to help you out if I can recreate the issue.

Also, what version of Python, PyTorch, CUDA, are you using?

@jsuit
Copy link
Author

jsuit commented May 27, 2019

pytorch 0.4
cuda = 9.0
python 3.6.6
faster_rcnn_1_10_9771.pth (gotten from link provided on README page)

Stack trace:

Preparing training data...
done
loading annotations into memory...
Done (t=2.96s)
creating index...
index created!
before filtering, there are 236574 images...
after filtering, there are 234532 images...
234532 roidb entries
Loading pretrained data/pretrained_model/resnet101_caffe.pth
loading checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth
loaded checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth
Traceback (most recent call last):
File "trainval_net.py", line 336, in
optimizer.step()
File python3.6/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (512) must match the existing size (1024) at non-singleton dimension 1

@AlexanderHustinx
Copy link

Could you also include the command-line argument?
i.e. what you enter in your terminal to run the code

For example:
python demo.py --dataset pascal --net res101 --checksession 1 --checkepoch 7 --checkpoint 10021 --cuda

@jsuit
Copy link
Author

jsuit commented May 28, 2019

   python trainval_net.py --r True --cuda --bs 1

I'm basically saying resume training for the load_name:

(basically, I just changed lines 275-276 in trainval_net.py to
load_name = my path to faster_rcnn_1_10_9771.pth)

Here's the only part I changed. Everything else is the same as the main branch.
if args.resume:
load_name = "path to faster_rcnn_1_10_9771.pth"
print("loading checkpoint %s" % (load_name))
checkpoint = torch.load(load_name)
args.session = checkpoint['session']
args.start_epoch = checkpoint['epoch']
fasterRCNN.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])

my default network is res101 and I'm training on coco2014.

@AlexanderHustinx
Copy link

Could you try to add the following tag to your command:
--dataset coco
I think you might be loading the pascal_voc dataset by default.

Additionally, you could try using all the command-line parameters:
python trainval_net.py --dataset coco --net res101 --r True --checksession 1 --checkepoch 10 --checkpoint 9771 --bs 1 --cuda
It is possible you might be overlooking something in your adjusted code.

@zychen2016
Copy link

have you solved this problem ? @jsuit

@zychen2016
Copy link

zychen2016 commented Nov 23, 2019

Hello, @AlexanderHustinx , I have the same problem

Using res101 pretrained model with trainval_net.py.

But using the same model with test_net.py and demo.py , there is no problem

Using pytorch-1.0 branch

@AlexanderHustinx
Copy link

Could you show the stack trace and error?
Maybe I can help spot the issue

@zychen2016
Copy link

zychen2016 commented Nov 24, 2019

Could you show the stack trace and error?
Maybe I can help spot the issue

Thanks. @AlexanderHustinx

Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net vgg16 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 6 --checkpoint 10021 --use_tfb

Called with args:
Namespace(batch_size=1, checkepoch=6, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='vgg16', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'vgg16',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 20,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 256,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 10,
'DOUBLE_BIAS': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.01,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0005},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth
loaded checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth
Traceback (most recent call last):
File "trainval_net.py", line 332, in
optimizer.step()
File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (4096) must match the size of tensor b (3) at non-singleton dimension 3

Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net res101 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 7 --checkpoint 10021 --use_tfb

Called with args:
Namespace(batch_size=1, checkepoch=7, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='res101', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'res101',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 20,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 20,
'DOUBLE_BIAS': False,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth
loaded checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth
Traceback (most recent call last):
File "trainval_net.py", line 332, in
optimizer.step()
File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (512) must match the size of tensor b (1024) at non-singleton dimension 1

  1. However , In test_net.py and demo.py , it's Ok.

@AlexanderHustinx
Copy link

Oef, okay this has been a while for me 😅

So you can successfully run the test_net.py and demo.py.
Can you run a training session from scratch? (Not resuming)

@zychen2016
Copy link

Oef, okay this has been a while for me

So you can successfully run the test_net.py and demo.py.
Can you run a training session from scratch? (Not resuming)

Thank you!
Yes, I already train from scratch.

@AlexanderHustinx
Copy link

Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?

@zychen2016
Copy link

zychen2016 commented Nov 25, 2019

Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?

No, Just download pascal_voc and res101_caffe and faster_1_6_10021.pth.

Using pytorch-1.0 branch

@AlexanderHustinx
Copy link

Can you try and train any model for 1 epoch and then continue training it afterwards?
Also, what OS are you running?
And which version of PyTorch? 1.0.0, 1.0.1, or higher?

@zychen2016
Copy link

zychen2016 commented Nov 25, 2019

Can you try and train any model for 1 epoch and then continue training it afterwards?
Also, what OS are you running?
And which version of PyTorch? 1.0.0, 1.0.1, or higher?

Using my own dataset train from scratch and resume from any epoch ,It's ok.
Maybe the config set is different from authors sets.
we just dowaload pascal_voc data like https://github.com/rbgirshick/py-faster-rcnn#beyond-the-demo-installation-for-training-and-testing-models

user@user:/data2/CZY/data/faster-rcnn.pytorch/data/VOCdevkit2007$ tree -d
.
├── annotations_cache
├── local
│   ├── VOC2006
│   └── VOC2007
├── results
│   ├── VOC2006
│   │   └── Main
│   └── VOC2007
│       ├── Layout
│       ├── Main
│       └── Segmentation
├── VOC2007
│   ├── Annotations
│   ├── ImageSets
│   │   ├── Layout
│   │   ├── Main
│   │   └── Segmentation
│   ├── JPEGImages
│   ├── SegmentationClass
│   └── SegmentationObject
└── VOCcode

Ubuntu 16.04
PyTorch 1.0.0

Here, I have another question:
After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object,
It will harm for the model performance?
for example :
https://imgaug.readthedocs.io/en/latest/_images/rotation.jpg

@AlexanderHustinx
Copy link

I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?

After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object,
It will harm for the model performance?

Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough.
But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.

@zychen2016
Copy link

zychen2016 commented Nov 26, 2019

I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?

After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object,
It will harm for the model performance?

Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough.
But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.

Thanks.
Using default configs for pascal_voc.
In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?

Could you have another way to do data augment?

@AlexanderHustinx
Copy link

In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?

You are correct. It depends on your dataset, e.g. if you have a dataset of only small objects you'll want smaller anchor scales; if there are only big objects using larger scales will suffice; when the objects range from small to large, using more scales likely benefits your performance.
Same goes for ratios, if you assume only long/narrow objects, i.e. objects where width > height, you might want to consider ratios that give e.g. w\h = [1, 1.5, 2], instead of the default [0.5, 1, 2]

Note that when changing the scales (for PASCAL VOC by default: [8, 16, 32]), these values aren't the final sizes in pixes. Instead they will still be multiplied by 16 as part of another hyperparameter in the code. So the final scales would be [128, 256, 512].

Could you have another way to do data augment?

This really depends on your dataset, the expected orientation of the objects etc. Faster RCNN is not rotation invariant, so if you want to find e.g. a quokka (as in your example) in cases where it is standing and it is laying, you need to train on both cases. e.g. by rotating the image (though just more data is obviously better)

An easy and almost always valid data augmentation for object detection is (horizontal) flipping.
You can also consider random cropping: leaving out parts of the image, sometimes also leaving out parts of the objects (in the PASCAL VOC dataset these objects would be labeled as 'truncated')

@zychen2016
Copy link

I happend another question,could you help me?
For the test result, I want to get coco metrics.
So I want to use --dataset coco parameters to train my own data, It must be that my image name must be the same format as COCO?
Or how to train my own datasets with coco format?

@AlexanderHustinx

@AlexanderHustinx
Copy link

You'll need to have a look at the lib/datasets/imdb.py and lib/datasets/coco.py to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset to lib/datasets/factory.py.

An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here:
https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets
You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.

Alternatively, you could have a look at:
https://github.com/deboc/py-faster-rcnn/tree/master/help

@zychen2016
Copy link

zychen2016 commented Dec 5, 2019

You'll need to have a look at the lib/datasets/imdb.py and lib/datasets/coco.py to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset to lib/datasets/factory.py.

An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here:
https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets
You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.

Alternatively, you could have a look at:
https://github.com/deboc/py-faster-rcnn/tree/master/help

Thank you very much for your help!
When run trainval_net.py with --use_tfb,There are only has train loss curve ,no val loss curve.
It is right?In my opinions,There should be val loss curve.

update:
I know , may be I should write some code in test_net.py to save losses and use tensorboard to visualizing. Right?

@AlexanderHustinx
Copy link

AlexanderHustinx commented Dec 5, 2019

You're right, there is no actual validation set used.
You can write a small if-statement and for-loop in the trainval_net.py to e.g. once every 500 epochs run a validation set.
To do so you must make sure that the validation set is not in the training set, and during the validation set you use with torch.no_grad(): in order to make sure you don't train on the validation set.

The test_net.py doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.

@zychen2016
Copy link

You're right, there is no actual validation set used.
You can write a small if-statement and for-loop in the trainval_net.py to e.g. once every 500 epochs run a validation set.
To do so you must make sure that the validation set is not in the training set, and during the validation set you use with torch.no_grad(): in order to make sure you don't train on the validation set.

The test_net.py doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.

Thank you! I will have a try.

@zychen2016
Copy link

zychen2016 commented Dec 6, 2019

I have a try with this code in trainval_net.py
@AlexanderHustinx

1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?
2. Could you review my code? Thanks

   if args.dataset == "pascal_voc":
       args.imdb_name = "voc_2007_trainval"
       args.imdbval_name = "voc_2007_test"
       ###############################
       # validation dataset
       ###############################
+      args.imdbval_name_val="voc_2007_val"
       args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20']
   imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
+  imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val)
   train_size = len(roidb)
+  val_size=len(roidb_val)
   sampler_batch = sampler(train_size, args.batch_size)
+  sampler_batch_val = sampler(val_size, args.batch_size)
 
   dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \
                            imdb.num_classes, training=True)
+  dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False)
   dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers)
+  dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers)
     data_iter = iter(dataloader)
+    data_iter_val=iter(dataloader_val)
 
+      if step%500==0:
+            while True:
+                data_val=next(data_iter_val)
+                with torch.no_grad():
+                       im_val_data.resize_(data_val[0].size()).copy_(data_val[0])
+                       im_val_info.resize_(data_val[1].size()).copy_(data_val[1])
+                       gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2])
+                       num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3])
+                       rois_val, cls_prob_val, bbox_pred_val, \
+                       rpn_loss_cls_val, rpn_loss_box_val, \
+                       RCNN_loss_cls_val, RCNN_loss_bbox_val, \
+                       rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes)
+                loss_rpn_cls_val = rpn_loss_cls_val.item()
+                loss_rpn_box_val = rpn_loss_box_val.item()
+                loss_rcnn_cls_val = RCNN_loss_cls_val.item()
+                loss_rcnn_box_val = RCNN_loss_bbox_val.item()
+                loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \
+                 + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean()
+                loss_temp_val += loss_val.item()
+            val_info = {
+                'loss_val': loss_temp_val,
+                'loss_rpn_cls_val': loss_rpn_cls_val,
+                'loss_rpn_box_val': loss_rpn_box_val,
+                'loss_rcnn_cls_val': loss_rcnn_cls_val,
+                'loss_rcnn_box_val': loss_rcnn_box_val
+            }
+            logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step)
 
     save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))

errors:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch  1][iter    0/1673] loss: 4.0778, lr: 1.00e-03
			fg/bg=(119/393), time cost: 2.804799
			rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047
Traceback (most recent call last):
  File "trainval_net.py", line 377, in <module>
    data_val=next(data_iter_val)
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333

@AlexanderHustinx
Copy link

  1. You don't need to use the entire valuation set when validating. I would very much recommend it though! That way you can more accurately track your model's progression over time.

I would run the validation using a batch size of 1 instead of your normal batch size, that more closely resembles the results you would get during test time.

  1. I can't really spot an error immediately, but as it's complaining about the size I assume that using a batch size of 1 during validation will likely solve the problem.

3?. You need to make sure that your training set doe not include your validation set, otherwise the added value of the validation set is negated.
I mention this because your code states to use voc_2007_trainval for training and voc_2007_val for validation. voc_2007_trainval includes the validation set, so be careful there.

@Saharkakavand
Copy link

I have a try with this code in trainval_net.py
@AlexanderHustinx

1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?
2. Could you review my code? Thanks

   if args.dataset == "pascal_voc":
       args.imdb_name = "voc_2007_trainval"
       args.imdbval_name = "voc_2007_test"
       ###############################
       # validation dataset
       ###############################
+      args.imdbval_name_val="voc_2007_val"
       args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20']
   imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
+  imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val)
   train_size = len(roidb)
+  val_size=len(roidb_val)
   sampler_batch = sampler(train_size, args.batch_size)
+  sampler_batch_val = sampler(val_size, args.batch_size)
 
   dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \
                            imdb.num_classes, training=True)
+  dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False)
   dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers)
+  dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers)
     data_iter = iter(dataloader)
+    data_iter_val=iter(dataloader_val)
 
+      if step%500==0:
+            while True:
+                data_val=next(data_iter_val)
+                with torch.no_grad():
+                       im_val_data.resize_(data_val[0].size()).copy_(data_val[0])
+                       im_val_info.resize_(data_val[1].size()).copy_(data_val[1])
+                       gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2])
+                       num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3])
+                       rois_val, cls_prob_val, bbox_pred_val, \
+                       rpn_loss_cls_val, rpn_loss_box_val, \
+                       RCNN_loss_cls_val, RCNN_loss_bbox_val, \
+                       rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes)
+                loss_rpn_cls_val = rpn_loss_cls_val.item()
+                loss_rpn_box_val = rpn_loss_box_val.item()
+                loss_rcnn_cls_val = RCNN_loss_cls_val.item()
+                loss_rcnn_box_val = RCNN_loss_bbox_val.item()
+                loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \
+                 + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean()
+                loss_temp_val += loss_val.item()
+            val_info = {
+                'loss_val': loss_temp_val,
+                'loss_rpn_cls_val': loss_rpn_cls_val,
+                'loss_rpn_box_val': loss_rpn_box_val,
+                'loss_rcnn_cls_val': loss_rcnn_cls_val,
+                'loss_rcnn_box_val': loss_rcnn_box_val
+            }
+            logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step)
 
     save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))

errors:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch  1][iter    0/1673] loss: 4.0778, lr: 1.00e-03
			fg/bg=(119/393), time cost: 2.804799
			rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047
Traceback (most recent call last):
  File "trainval_net.py", line 377, in <module>
    data_val=next(data_iter_val)
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333

Hi, did you solve the problem for validation?

@tyshiwo1
Copy link

tyshiwo1 commented Jul 14, 2020

I think you may try these codes in trainval_net.py. It seems to work :

if args.resume:
    load_name = os.path.join(output_dir,
      'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
    print("loading checkpoint %s" % (load_name))
    checkpoint = torch.load(load_name)
    #args.session = checkpoint['session']
    #args.start_epoch = checkpoint['epoch']
    #fasterRCNN.load_state_dict(checkpoint['model'])
    #optimizer.load_state_dict(checkpoint['optimizer'])
    #lr = optimizer.param_groups[0]['lr'] 
    for k, v in fasterRCNN.state_dict().items():
      if k in checkpoint['model']:
        param = torch.from_numpy(np.asarray(checkpoint['model'][k].cpu()))
        v.copy_(param)
      else:
        print('lose [in faster_cnn]: {}'.format(k))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants