
Code stuck indefinitely when fine-tuning on CodeSearch even with suggested fixes #39

Closed
atonkamanda opened this issue Apr 30, 2021 · 2 comments


@atonkamanda

Hello, I am facing exactly the same problem as in #25, and I tried every fix you proposed without success.

Let me clarify the situation: I ran the data-processing script for Python only, which was supposed to produce 17 batch txt files. I deleted 8 of them to free disk space, so only 9 are left, but I don't think they are the reason the script gets stuck. (Correct me if I'm wrong.)

My torch and transformers packages are up to date.

When I run

python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 1 \
--per_gpu_eval_batch_size 1 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ./data/codesearch/train_valid/python \
--output_dir ./models/python  \
--model_name_or_path microsoft/codebert-base

So with batch size one, as you required, I get the following:

- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
04/30/2021 14:41:43 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='./data/codesearch/train_valid/python', dev_file='valid.txt', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=False, do_train=True, eval_all_checkpoints=True, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_seq_length=200, max_steps=-1, model_name_or_path='microsoft/codebert-base', model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=8.0, output_dir='./models/python', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=1, pred_model_dir=None, save_steps=50, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, task_name='codesearch', test_file='shared_task_dev_top10_concat.tsv', test_result_dir='test_results.tsv', tokenizer_name='', train_file='train.txt', warmup_steps=0, weight_decay=0.0)
04/30/2021 14:41:43 - INFO - __main__ -   Loading features from cached file ./data/codesearch/train_valid/python/cached_train_train_codebert-base_200_codesearch
04/30/2021 14:43:07 - INFO - __main__ -   ***** Running training *****
04/30/2021 14:43:07 - INFO - __main__ -     Num examples = 824342
04/30/2021 14:43:07 - INFO - __main__ -     Num Epochs = 8
04/30/2021 14:43:07 - INFO - __main__ -     Instantaneous batch size per GPU = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 1
04/30/2021 14:43:07 - INFO - __main__ -     Gradient Accumulation steps = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total optimization steps = 6594736
Epoch:   0%|                                              | 0/8 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
Epoch:   0%|                                              | 0/8 [05:51<?, ?it/s]
Traceback (most recent call last):
  File "run_classifier.py", line 596, in <module>
    main()
  File "run_classifier.py", line 544, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, optimizer)
  File "run_classifier.py", line 127, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/optimization.py", line 345, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
KeyboardInterrupt
root@C.936370:~/CodeBERT/CodeBERT/codesearch$ 

It stays stuck on Epoch: 0%| | 0/8 [00:00<?, ?it/s] and only outputs the traceback above when I stop it with a keyboard interrupt.

The GPU is busy while this runs (I don't know whether that means it is actually training without giving any feedback, since I don't wait long before interrupting, and the previous reporter suggested it doesn't), and here is the hardware configuration:
[screenshot of the hardware configuration]

The GPU is running at 66% with batch size = 1 and was at 99% with batch size = 32.

I will try again later with a better GPU and more memory to store all 17 batch files, but I would be glad if you could help me, as nothing guarantees that will fix it.

Let me know if I can give you more information to help you figure it out; I check the issue tracker every day.

(If the codesearch script is intrinsically buggy, may I ask you to share a link to an already trained codesearch model that we can download?)

Thank you, and keep me updated.

@fengzhangyin
Collaborator

fengzhangyin commented Apr 30, 2021

  1. The 17 batch files produced by the data-processing script are the test set, and the number of those files does not matter.
  2. Although there is no feedback, I think it is training. You can sample some examples (say 1000; currently the number of training examples is 824,342) from the training and validation files to speed up checking whether the code is correct, as in the sketch after this list. You can also try to print something during training: https://github.com/microsoft/CodeBERT/blob/master/CodeBERT/codesearch/run_classifier.py#L98-L105
  3. You can change the transformers version to 2.5.0 (the version I used) and try again.
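A minimal sketch of the sampling idea in point 2, assuming train.txt and valid.txt are plain text files with one example per line; the 1000-example count, the *_small.txt file names, and the paths are only illustrative:

import random

def sample_file(src_path, dst_path, n=1000, seed=42):
    # Read every line, draw a random subset, and write it to a smaller file.
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.seed(seed)
    subset = random.sample(lines, min(n, len(lines)))
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(subset)

data_dir = "./data/codesearch/train_valid/python"
sample_file(f"{data_dir}/train.txt", f"{data_dir}/train_small.txt")
sample_file(f"{data_dir}/valid.txt", f"{data_dir}/valid_small.txt")

You could then pass --train_file train_small.txt --dev_file valid_small.txt to run_classifier.py; if that tiny run completes an epoch, the full run is most likely training, just very slowly. Pinning the library version from point 3 can be done with pip install transformers==2.5.0 (assuming a pip-managed environment).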

@atonkamanda
Author

Switching to transformers==2.5.0 didn't fix the problem.

I will try my luck with the CodeXGLUE pipeline, as it takes less time to train anyway: https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-WebQuery

Thank you very much for your help.
