
Low accuracy when re-training with the provided config files of the pretrained models #8

Closed
JackHenry1992 opened this issue Jan 6, 2020 · 12 comments


@JackHenry1992 commented Jan 6, 2020

Thank you for sharing your excellent work.
I used pretrained_models/regat_implicit/ban_1_implicit_vqa_196/hps.json for training with 2 GPUs (10GB each). The CUDA version is 10.0 and Python is 3.6. All datasets are downloaded.
The training dataset is VQA 2.0.

Here is my detailed config:

     epochs: 20  
     base_lr: 0.001  
     lr_decay_start: 15  
     lr_decay_rate: 0.25  
     lr_decay_step: 2  
     lr_decay_based_on_val: false  
     grad_accu_steps: 1  
     grad_clip: 0.25  
     weight_decay: 0  
     batch_size: 128  
     output: "saved_models/regat_implicit/ban_1_implicit_vqa_196"  
     save_optim: false  
     log_interval: -1  
     seed: 196  
     checkpoint: ""  
     dataset: "vqa"  
     data_folder: "./data"  
     use_both: false  
     use_vg: false  
     adaptive: true  
     relation_type: "implicit"  
     fusion: "ban"  
     tfidf: true  
     op: "c"  
     num_hid: 1024  
     ban_gamma: 1  
     mutan_gamma: 2  
     imp_pos_emb_dim: 64  
     spa_label_num: 11  
     sem_label_num: 15  
     dir_num: 2  
     relation_dim: 1024  
     nongt_dim: 20  
     num_heads: 16  
     num_steps: 1  
     residual_connection: true  
     label_bias: false  
     lr_decrease_start: 15
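
For reference, a minimal sketch (assuming the standard step-decay reading of these fields; this is not the repository's exact code) of how lr_decay_start, lr_decay_step, and lr_decay_rate determine the schedule, consistent with the "LR decay epochs: 15,17,19" line in the training logs below:

    # Step-decay schedule implied by the config above (illustrative only).
    base_lr, decay_start, decay_step, decay_rate, epochs = 0.001, 15, 2, 0.25, 20

    decay_epochs = list(range(decay_start, epochs, decay_step))
    print(decay_epochs)  # [15, 17, 19]

    lr = base_lr
    for epoch in range(epochs):
        if epoch in decay_epochs:
            lr *= decay_rate  # quarter the learning rate at each decay epoch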

The results are very poor. After 20 epochs, the log.txt shows:

epoch 15, time: 746.65  
	train_loss: 4.19, norm: 3.2776, score: 44.76  
	eval score: 43.99 (92.66)  
	entropy:  0.00  
saving current model weights to folder  
lr: 0.0005  
epoch 16, time: 726.26  
	train_loss: 4.16, norm: 4.1697, score: 45.21  
	eval score: 44.36 (92.66)  
	entropy:  0.01  
saving current model weights to folder  
decreased lr: 0.0001  
epoch 17, time: 750.88  
	train_loss: 4.12, norm: 2.4974, score: 45.56  
	eval score: 44.41 (92.66)  
	entropy:  0.01  
saving current model weights to folder  
lr: 0.0001  
epoch 18, time: 745.58  
	train_loss: 4.16, norm: 4.6050, score: 45.61  
	eval score: 44.47 (92.66)  
	entropy:  0.01  
saving current model weights to folder  
decreased lr: 0.0000  
epoch 19, time: 743.70  
	train_loss: 4.10, norm: 3.3921, score: 45.80  
	eval score: 44.46 (92.66)  
	entropy:  0.01

I also trained using pretrained_models/regat_implicit/butd_implicit_vqa_6371, and it reached 58.

Can you give me some advice about reproducing the accuracy score of the paper?

@linjieli222 (Owner)

Hi, thank you for your interest. Were you able to evaluate the pretrained model? If so, could you share the evaluation results?

Thanks.

@JackHenry1992 (Author) commented Jan 6, 2020

I have already evaluated the pretrained models provided by your project, and the results match your paper.

Evaluation VQA-ReGAT
Found 2 GPU cards for eval
loading dictionary from ./data/glove/dictionary.pkl
Evaluating on vqa dataset with model trained on vqa dataset
loading features from h5 file ./data/Bottom-up-features-adaptive/val.hdf5
Setting semantic adj matrix to None...
Setting spatial adj matrix to None...
Building ReGAT model with implicit relation and ban fusion method
In ImplicitRelationEncoder, num of graph propogate steps: 1, residual_connection: True
Loading weights from pretrained_models/regat_implicit/ban_1_implicit_vqa_196/model.pth
        Unexpected_keys: []
        Missing_keys: []
100%|
eval score: 65.96

But when training with the same hps.json provided with the pretrained models, the result is far too poor.
The training code is the original main.py; I just run python3 main.py --config config/xxx.json.
I have followed all of your steps.

@linjieli222 (Owner) commented Jan 6, 2020

The pretrained models were trained with 4 GPUs (16GB each), so the effective batch size is 64 × 4 = 256. Since you are using 2 GPUs, your effective batch size is 64 × 2 = 128. My assumption is that the learning rate may be too big for your batch size.
But the accuracy should not be this low. From the log you showed above, even the training accuracy is not improving over the last 6 epochs. I would suspect something is wrong with the training data. Could you try evaluating the pretrained model on the training dataset and share your results?
Thanks!
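
If the learning rate is the culprit, the usual heuristic is to scale it linearly with the effective batch size (Goyal et al., "Accurate, Large Minibatch SGD"). A minimal sketch; scale_lr is a hypothetical helper, not part of this repository:

    def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
        """Linear scaling heuristic: keep lr / effective_batch_size constant."""
        return base_lr * new_batch / base_batch

    # Reference setup: 4 GPUs x 64 = 256 effective batch at base_lr = 0.001.
    # With 2 GPUs x 64 = 128, the heuristic suggests halving the lr:
    print(scale_lr(0.001, base_batch=256, new_batch=128))  # 0.0005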

@linjieli222 (Owner)

FYI, if you run into errors like the following during evaluation, please pull the repo again.

 File "./model/graph_att_layer.py", line 87, in forward
    self.dim_group[1])
RuntimeError: shape '[64, 20, 16, 64]' is invalid for input of size 1245184

Sorry about the inconvenience.
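
For context, that RuntimeError is the usual symptom of a view/reshape whose target shape no longer matches the tensor's element count, e.g. when the feature width produced upstream and the multi-head split disagree. A toy PyTorch illustration, with made-up dimensions that do not match the exact sizes in the traceback:

    import torch

    batch, nongt_dim, num_heads, head_dim = 64, 20, 16, 64

    # Correct case: feature width equals num_heads * head_dim (16 * 64 = 1024).
    x = torch.randn(batch, nongt_dim, num_heads * head_dim)
    x = x.view(batch, nongt_dim, num_heads, head_dim)  # splits heads cleanly

    # Broken case: an upstream layer produced a different width (here 960),
    # so the same view cannot tile the elements into [64, 20, 16, 64].
    y = torch.randn(batch, nongt_dim, 960)
    try:
        y.view(batch, nongt_dim, num_heads, head_dim)
    except RuntimeError as e:
        print(e)  # shape '[64, 20, 16, 64]' is invalid for input of size 1228800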

@JackHenry1992 (Author) commented Jan 7, 2020

I sincerely appreciate your reply.
The model is trained with batch size 128 × 2 (2 GPUs, each with a batch size of 128).
After modifying args.split='Train', I re-ran the evaluation command python3 eval.py --output_folder pretrained_models/regat_implicit/ban_1_implicit_vqa_196/, and the score is shown below:

2020-01-07-11-21-19
Evaluation VQA-ReGAT
Found 2 GPU cards for eval
loading dictionary from ./data/glove/dictionary.pkl
Evaluating on vqa dataset with model trained on vqa dataset
loading features from h5 file ./data/Bottom-up-features-adaptive/train.hdf5
Setting semantic adj matrix to None...
Setting spatial adj matrix to None...
Building ReGAT model with implicit relation and ban fusion method
In ImplicitRelationEncoder, num of graph propogate steps: 1, residual_connection: True
Loading weights from pretrained_models/regat_implicit/ban_1_implicit_vqa_196/model.pth
        Unexpected_keys: []
        Missing_keys: []
100%|████████████████████████████████| 3467/3467 [05:55<00:00,  9.72it/s]
eval score: 83.84

No errors were encountered during the evaluation phase. But during training, one error occurs (only once, around epoch 0):

nParams=        46455506
optim: adamax lr=0.0010, decay_step=2, decay_rate=0.25,grad_clip=0.25
LR decay epochs: 15,17,19
gradual warmup lr: 0.0005
100%|████████████████████████████████████| 1734/1734 [11:29<00:00,  1.92it/s]
epoch 0, time: 825.39
        train_loss: 100227.52, norm: 12704909.1170, score: 24.98
        eval score: 28.65 (92.66)
        entropy:  0.02
saving current model weights to folder

gradual warmup lr: 0.0010
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/admin/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/admin/miniconda3/lib/python3.6/site-packages/tqdm/_monitor.py", line 62, in run
    for instance in self.tqdm_cls._instances:
  File "/home/admin/miniconda3/lib/python3.6/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

100%|███████████████████████████████████| 1734/1734 [10:37<00:00,  1.94it/s]
epoch 1, time: 774.55
        train_loss: 22.79, norm: 4514.3939, score: 31.47
        eval score: 33.77 (92.66)
        entropy:  0.04
saving current model weights to folder

The model then continues to train as normal. I am not sure whether this affects performance.

@linjieli222 (Owner)

Thank you for your patience.

I have never seen this error before. It seems to happen inside the tqdm package, so I don't think it would affect the model performance.

One thing I do notice is that the train_loss and norm are extremely large at epoch 0. Usually they are around 10, not 100,000. Can you share the exact config file and command you used for training? I will try to run it on my end to replicate the error.

Thanks!
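
For what it's worth, the "Set changed size during iteration" crash above is a known race condition in tqdm's background monitor thread (fixed in later tqdm releases); it only kills that monitor thread, not the training loop. Upgrading tqdm, or disabling the monitor before creating any progress bars, should make it disappear:

    import tqdm

    # Disable tqdm's background monitor thread (the thread that crashes
    # here) via its documented monitor_interval class attribute.
    tqdm.tqdm.monitor_interval = 0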

@JackHenry1992 (Author)

Here is my detailed config (since *.json files cannot be uploaded, I have renamed it to hps.txt):

hps.txt

Did you use a pretrained model to initialize the network parameters?

@xuewyang

I am running the same code but get better results than yours. I ran 20 epochs, though I think more epochs should be run.
epoch 15, time: 997.29
train_loss: 2.95, norm: 1.6762, score: 66.00
eval score: 58.68 (92.66)
entropy: 4.76
saving current model weights to folder
lr: 0.0005
epoch 16, time: 996.12
train_loss: 2.91, norm: 1.9155, score: 66.71
eval score: 58.87 (92.66)
entropy: 4.74
saving current model weights to folder
decreased lr: 0.0001
epoch 17, time: 999.82
train_loss: 2.86, norm: 1.8121, score: 67.62
eval score: 58.88 (92.66)
entropy: 4.74
saving current model weights to folder
lr: 0.0001
epoch 18, time: 994.42
train_loss: 2.85, norm: 1.7187, score: 67.76
eval score: 58.86 (92.66)
entropy: 4.74
saving current model weights to folder
decreased lr: 0.0000
epoch 19, time: 1010.46
train_loss: 2.84, norm: 1.7219, score: 68.01
eval score: 58.84 (92.66)
entropy: 4.74
saving current model weights to folder

@xuewyang

Probably the lr should be adjusted.

@linjieli222 (Owner)

Closed due to inactivity. The aforementioned error is not reproducible on my end.

@alice-cool

> I am running the same codes. But I get better results than yours. I run 20 epochs. But I think more epochs should be run. […]

How is the relation type encoded in the explicit encoder? I could not find it represented in the code. If you have an answer, please help me; sorry to bother you. I guess the relation-type labels come from the datasets, and the code does not include an auxiliary classifier for the 15 semantic types and 11 geometric types.

@linjieli222 (Owner)

To replicate the results from our paper, please follow the instructions to download the exact data.

For the spatial adj matrix, please refer to #9.
For the semantic adj matrix, we are not releasing the model at the moment. But it is a very small and simple classification model trained on Visual Genome; you can refer to this paper:
Ting Yao et al., "Exploring Visual Relationship for Image Captioning", ECCV 2018
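
To make that description concrete: a hedged sketch of what such a small pairwise classifier could look like (all names and dimensions here are illustrative assumptions, not the authors' released model), taking subject/object visual features and predicting one of the 15 semantic relation classes:

    import torch
    import torch.nn as nn

    class SemanticRelationClassifier(nn.Module):
        """Illustrative only: a small MLP over subject/object features
        (e.g. 2048-d bottom-up features) predicting one of 15 semantic
        relation classes, including a "no relation" class."""

        def __init__(self, feat_dim=2048, hidden=512, num_classes=15):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * feat_dim, hidden),
                nn.ReLU(inplace=True),
                nn.Dropout(0.2),
                nn.Linear(hidden, num_classes),
            )

        def forward(self, subj_feat, obj_feat):
            return self.mlp(torch.cat([subj_feat, obj_feat], dim=-1))

    # Trained with cross-entropy on Visual Genome relation annotations, then
    # run offline over object pairs to build the semantic adjacency matrix.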
