
Confirmation for some training details #7

Closed
strongwolf opened this issue Jun 16, 2020 · 9 comments

@strongwolf

Hi. I want to confirm some details about the second, self-training stage. Are all the hyper-parameters (including the batch size, the positive/negative thresholds, the number of proposals in the RCNN head, etc.) the same for both the supervised and unsupervised losses? Also, is the unsupervised loss imposed on both the RPN and the RCNN head? Thanks.

@zizhaozhang
Collaborator

Yes, all hyper-parameters for data with human labels and data with pseudo labels (predicted offline) are treated the same.

This is done simply by calling the forward function multiple times, once on each of the paired batches: https://github.com/google-research/ssl_detection/blob/master/detection/modeling/generalized_stac_rcnn.py#L181
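For intuition, here is a minimal sketch of that pattern (hypothetical names, not the repo's actual API): the same detector forward pass runs once on the labeled batch and once on the pseudo-labeled batch, and the two losses are summed.

```python
# Sketch only: hypothetical names, not the actual generalized_stac_rcnn API.
def total_loss(model, labeled_batch, pseudo_batch, unsup_weight=1.0):
    # Forward pass on images with human annotations; returns the summed
    # RPN + RCNN losses for this batch.
    sup_loss = model.forward(labeled_batch)
    # Identical forward pass on images with offline-predicted pseudo labels;
    # the same anchors, thresholds, and proposal sampling apply.
    unsup_loss = model.forward(pseudo_batch)
    # Both terms include RPN and RCNN head losses.
    return sup_loss + unsup_weight * unsup_loss
```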

@strongwolf
Author

Thank you very much. I have two more questions.
First, in the first stage, is the '1x' learning schedule still 12k, 16k, and 18k iterations? If so, I think that number is too large for the 1, 2, 5, and 10% datasets; it would be over 100 epochs at batch size 8.
Second, is the second training stage fine-tuned from the first stage, or trained from ImageNet weights?

@zizhaozhang
Collaborator

1. Yes, your understanding is right: we kept the same number of iterations for both the 1st and 2nd stages. So for smaller amounts of labeled data, this is equivalent to training for more epochs (see the sketch below).

2. We train from ImageNet weights with unlabeled data (with pseudo labels) and labeled data.
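For concreteness, a back-of-the-envelope sketch of what the fixed 18k-step schedule implies in epochs (the COCO train2017 size is approximate; batch size 8 is taken from the question above):

```python
# Rough epochs implied by a fixed step schedule on COCO subsets (approximate).
coco_train = 118_000        # ~size of COCO train2017
batch_size = 8
total_steps = 18_000        # end of the 12k/16k/18k '1x' schedule

for frac in (0.01, 0.02, 0.05, 0.10):
    labeled = coco_train * frac
    epochs = total_steps * batch_size / labeled
    print(f"{frac:.0%} labeled: ~{epochs:.0f} epochs")
# 1% labeled: ~122 epochs
# 2% labeled: ~61 epochs
# 5% labeled: ~24 epochs
# 10% labeled: ~12 epochs
```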

@strongwolf
Author

strongwolf commented Jul 1, 2020

I have one more question. The number of unlabeled images is much larger than the number of labeled ones. If each per-GPU batch contains one labeled image and one unlabeled image, a problem may arise: by the time the model has stopped under-fitting the unlabeled data, it has already over-fitted the labeled data. I don't know whether this is an issue. When I tried to reproduce the method in another framework, I found that the model's performance in the 10% label + 90% unlabel setting doesn't increase much when the learning rate is decayed at the 9th epoch. In contrast, in the 10% label + 20% unlabel setting, performance does increase after the lr decay and results in a higher mAP than the 10% label + 90% unlabel setting.

@zizhaozhang
Collaborator

zizhaozhang commented Jul 2, 2020

Hi, thanks for the follow-up.

I am not quite sure what the question is here: under-fitting or over-fitting? Generalization to your new framework? Or that the learning-rate decay does not improve performance much? Would you mind elaborating and separating the questions?

@strongwolf
Author

> Hi, thanks for the follow-up.
>
> I am not quite sure what the question is here: under-fitting or over-fitting? Generalization to your new framework? Or that the learning-rate decay does not improve performance much? Would you mind elaborating and separating the questions?

I have trained with your code and everything is fine. But when I reproduce it in another framework, some things confuse me. My learning schedule is based on the unlabeled data, and I decay the lr at the 8th epoch. In the 10% label + 90% unlabel case, since there is 9 times as much unlabeled data as labeled data, 8 epochs over the unlabeled data means 72 epochs over the labeled data. I think 72 epochs is too long for 10% labeled data, and the model has over-fitted the 10% labeled data by then, which I guess is why performance doesn't increase when the lr is decayed. In the 10% label + 20% unlabel case, 8 epochs over the unlabeled data means only 16 epochs over the labeled data, and performance does increase after the lr decay because 16 epochs is acceptable.
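To make that arithmetic explicit (a sketch of the reasoning above, using the ratios from the two settings):

```python
# Effective labeled epochs when the schedule is counted over the unlabeled set
# and each batch pairs one labeled image with one unlabeled image (1:1).
decay_epoch_unlabeled = 8

for labeled, unlabeled in [(0.10, 0.90), (0.10, 0.20)]:
    ratio = unlabeled / labeled        # unlabeled images per labeled image
    labeled_epochs = decay_epoch_unlabeled * ratio
    print(f"{labeled:.0%} label + {unlabeled:.0%} unlabel: "
          f"lr decays after ~{labeled_epochs:.0f} labeled epochs")
# 10% label + 90% unlabel: lr decays after ~72 labeled epochs
# 10% label + 20% unlabel: lr decays after ~16 labeled epochs
```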

I am also not sure how important the ratio between labeled and unlabeled data within a batch is. In classification, many papers claim that the batch size for unlabeled data should be larger than that for labeled data.
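For reference, classification-style SSL typically oversamples unlabeled data per step (e.g. FixMatch uses an unlabeled-to-labeled ratio μ = 7). A minimal sketch of that kind of batch composition, assuming plain Python iterables as data sources:

```python
import itertools

def mixed_batches(labeled_data, unlabeled_data, batch_size=8, mu=7):
    """Yield (labeled, unlabeled) batch pairs with an unlabeled batch
    mu times larger; the small labeled set is cycled so every step
    sees both kinds of data."""
    labeled_iter = itertools.cycle(labeled_data)
    unlabeled_iter = iter(unlabeled_data)
    while True:
        x_l = list(itertools.islice(labeled_iter, batch_size))
        x_u = list(itertools.islice(unlabeled_iter, batch_size * mu))
        if len(x_u) < batch_size * mu:   # stop after one pass over unlabeled
            return
        yield x_l, x_u
```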

@kihyuks

kihyuks commented Jul 8, 2020

We decided to rely on training "steps" (e.g., 12k, 16k, 18k iterations) rather than "epochs" to determine the training schedule in this work. In our experiments, we used the exact same number of training steps for the 1, 2, 5, and 10% labeled-data settings. This might be suboptimal for certain settings, but we observed consistent performance improvements while avoiding extra hyperparameter tuning.

We haven't tried increasing the size of the unlabeled batch in this work due to a tight GPU memory budget, but it could be a good addition for a possible performance boost.
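A sketch of what such a step-based schedule looks like; the 10x decay at the 12k/16k boundaries is the common Detectron-style convention assumed here, not read from this repo's config:

```python
def learning_rate(step, base_lr=0.01, boundaries=(12_000, 16_000), gamma=0.1):
    # Piecewise-constant schedule: multiply the lr by gamma at each boundary.
    # The same boundaries are reused for every labeled fraction.
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= gamma
    return lr

for step in (0, 12_000, 16_000):
    print(step, learning_rate(step))   # 0.01 -> ~0.001 -> ~0.0001
```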

@Chrisfsj2051

Hi @kihyuks and @zizhaozhang, it's great to see such interesting work with remarkable results.

However, I have some difficulty understanding the training configs. According to the code, only one image is processed in a single "step". However, it seems that in stage 2 a "step" contains two images (one labeled and one unlabeled). In that case, the number of training samples seen is steps in stage 1 and steps * 2 in stage 2.

I would really appreciate it if you could point out whether I've understood this correctly.

@zizhaozhang
Collaborator

@Chrisfsj2051 Your understanding is correct: although the number of steps is the same, stage 2 (the SSL setting) views more images than stage 1.
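So under the same step budget, the image counts differ only by the per-step batch composition (a trivial sketch of the point above; the 18k figure is the schedule discussed earlier):

```python
steps = 18_000
print("stage 1:", steps * 1)   # supervised only: one image per step
print("stage 2:", steps * 2)   # SSL: one labeled + one unlabeled per step
```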
