
Code stuck indefinitely when fine-tuning on CodeSearch even with suggested fixes #39

Closed
atonkamanda opened this issue Apr 30, 2021 · 2 comments


@atonkamanda

Hello, I am facing exactly the same problem as in #25, and I tried every fix you proposed without success.

Let me clarify the situation: I ran the data-processing script for Python only, which was supposed to produce 17 batch txt files. I deleted 8 of them to free disk space, so only 9 are left, but I don't think they are the reason the script gets stuck. (Correct me if I'm wrong.)

My torch and transformers packages are up to date.

When I run

python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 1 \
--per_gpu_eval_batch_size 1 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ./data/codesearch/train_valid/python \
--output_dir ./models/python  \
--model_name_or_path microsoft/codebert-base

So with batch size one, as you required, I get the following:

- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
04/30/2021 14:41:43 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='./data/codesearch/train_valid/python', dev_file='valid.txt', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=False, do_train=True, eval_all_checkpoints=True, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_seq_length=200, max_steps=-1, model_name_or_path='microsoft/codebert-base', model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=8.0, output_dir='./models/python', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=1, pred_model_dir=None, save_steps=50, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, task_name='codesearch', test_file='shared_task_dev_top10_concat.tsv', test_result_dir='test_results.tsv', tokenizer_name='', train_file='train.txt', warmup_steps=0, weight_decay=0.0)
04/30/2021 14:41:43 - INFO - __main__ -   Loading features from cached file ./data/codesearch/train_valid/python/cached_train_train_codebert-base_200_codesearch
04/30/2021 14:43:07 - INFO - __main__ -   ***** Running training *****
04/30/2021 14:43:07 - INFO - __main__ -     Num examples = 824342
04/30/2021 14:43:07 - INFO - __main__ -     Num Epochs = 8
04/30/2021 14:43:07 - INFO - __main__ -     Instantaneous batch size per GPU = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 1
04/30/2021 14:43:07 - INFO - __main__ -     Gradient Accumulation steps = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total optimization steps = 6594736
Epoch:   0%|                                              | 0/8 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
Epoch:   0%|                                              | 0/8 [05:51<?, ?it/s]
Traceback (most recent call last):
  File "run_classifier.py", line 596, in <module>
    main()
  File "run_classifier.py", line 544, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, optimizer)
  File "run_classifier.py", line 127, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/optimization.py", line 345, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
KeyboardInterrupt
root@C.936370:~/CodeBERT/CodeBERT/codesearch$ 

It stays stuck on Epoch: 0%| | 0/8 [00:00<?, ?it/s] and only outputs the traceback above when I stop it with a keyboard interrupt.

The GPU is busy while this runs (I don't know whether that means it is actually training without giving any feedback, since I don't wait long before interrupting, and the previous reporter suggested it doesn't), and here is the hardware configuration:
[screenshot of the hardware configuration]

The GPU is running at 66% with batch size = 1 and was at 99% with batch size = 32.

I will try again later with a better GPU and more memory to store all 17 batch files, but I would be glad if you could help me, as nothing guarantees that will fix it.

Let me know if I can give you more information to help you figure it out; I check the issue tracker every day.

(If the codesearch script is intrinsically buggy, may I ask you to share a link to an already trained codesearch model that we can download?)

Thank you, and keep me updated.

@fengzhangyin
Collaborator

fengzhangyin commented Apr 30, 2021

  1. The 17 batch files produced by the data-processing script are the test set, and the number of those files does not matter.
  2. Although there is no feedback, I think it is training. You can sample some examples (say 1000; currently the number of training examples is 824,342) from the training and validation files to speed up checking whether the code is correct, as in the sketch after this list. You can also try to print something during training: https://github.com/microsoft/CodeBERT/blob/master/CodeBERT/codesearch/run_classifier.py#L98-L105
  3. You can change the transformers version to 2.5.0 (the version I used) and try again.
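A minimal sketch of the sampling idea in point 2, assuming train.txt and valid.txt are plain text files with one example per line; the 1000-example count, the *_small.txt file names, and the paths are only illustrative:

import random

def sample_file(src_path, dst_path, n=1000, seed=42):
    # Read every line, draw a random subset, and write it to a smaller file.
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.seed(seed)
    subset = random.sample(lines, min(n, len(lines)))
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(subset)

data_dir = "./data/codesearch/train_valid/python"
sample_file(f"{data_dir}/train.txt", f"{data_dir}/train_small.txt")
sample_file(f"{data_dir}/valid.txt", f"{data_dir}/valid_small.txt")

You could then pass --train_file train_small.txt --dev_file valid_small.txt to run_classifier.py; if that tiny run completes an epoch, the full run is most likely training, just very slowly. Pinning the library version from point 3 can be done with pip install transformers==2.5.0 (assuming a pip-managed environment).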

@atonkamanda
Author

Switching to transformers==2.5.0 didn't fix the problem.

I will try my luck with the CodeXGLUE pipeline, as it takes less time to train anyway: https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-WebQuery

Thank you very much for your help.
