Hi there!
Happy new year!
When running token-level code completion (Java), fine-tuning completes and the model checkpoint is saved, but evaluation and inference then fail with the following error message:
```
01/05/2022 10:45:21 - INFO - __main__ - 3855034, 0.7707994274499265
01/05/2022 10:48:48 - INFO - __main__ - 400 are done!
01/05/2022 10:48:48 - INFO - __main__ - 5134001, 0.7672330020971948
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 1382 sample gt_str != true_gt
```
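For context, the traceback shows that `post_process` compares the ground truth it reconstructs from the tokenized data (`gt_str`) against the raw lines of `test.txt` (`true_gts`), and sample 1382 is the first mismatch. Here is a minimal standalone sketch to look at that sample (the path is taken from my commands below; the snippet itself is not code from `run_lm.py`):

```python
# Standalone check: print the raw ground-truth line for the failing sample.
# The path mirrors $DATADIR from the commands below.
import os

data_dir = "../dataset/javaCorpus/token_completion"
idx = 1382  # sample index reported by the AssertionError

with open(os.path.join(data_dir, "test.txt")) as f:
    true_gts = f.readlines()

print(f"test.txt contains {len(true_gts)} lines")
print(f"raw ground truth for sample {idx}:")
print(true_gts[idx].strip())
```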
The fine-tuning command I used was:
```bash
LANG=java                                      # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java       # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=1                                 # modify YOUR_GPU_NUM
CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
    --data_dir=$DATADIR \
    --lit_file=$LITFILE \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=1024 \
    --do_train \
    --gpu_per_node $PER_NODE_GPU \
    --learning_rate=8e-5 \
    --weight_decay=0.01 \
    --evaluate_during_training \
    --per_gpu_train_batch_size=1 \
    --per_gpu_eval_batch_size=2 \
    --gradient_accumulation_steps=8 \
    --num_train_epochs=5 \
    --logging_steps=100 \
    --save_steps=1000 \
    --seed=42 \
    --overwrite_output_dir \
    --not_pretrain
```
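As a quick sanity check of the data the commands point at (a sketch assuming the usual train/dev/test naming under `$DATADIR`; adjust if your split names differ), something like this reports the per-split line counts:

```python
# Quick sanity check (assumes standard train/dev/test naming under $DATADIR):
# report line counts for each split of the token_completion data.
import os

data_dir = "../dataset/javaCorpus/token_completion"
for split in ("train", "dev", "test"):
    path = os.path.join(data_dir, f"{split}.txt")
    if os.path.exists(path):
        with open(path) as f:
            print(f"{split}.txt: {sum(1 for _ in f)} lines")
    else:
        print(f"{split}.txt: missing")
```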
The evaluation and inference command I used was:
```bash
export CUDA_VISIBLE_DEVICES=2
LANG=java                                          # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint-3000-3.3398   # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log
python -u run_lm.py \
    --data_dir=$DATADIR \
    --lit_file=$LITFILE \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=1024 \
    --do_eval \
    --per_gpu_eval_batch_size=16 \
    --logging_steps=100 \
    --seed=42
```
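To see *where* the reconstructed string and the raw ground truth diverge, rather than only getting the assert message, a small helper like the one below could be called just before the assertion in `post_process` (around line 478 in the traceback). The helper is hypothetical, not part of `run_lm.py`:

```python
# Hypothetical helper (not from run_lm.py): report the first character at
# which the reconstructed ground truth differs from the raw test.txt line,
# with a little surrounding context, to narrow down detokenization drift.
def first_divergence(reconstructed: str, raw: str, context: int = 40) -> str:
    raw = raw.strip()
    n = min(len(reconstructed), len(raw))
    for i in range(n):
        if reconstructed[i] != raw[i]:
            lo = max(0, i - context)
            return (f"diverge at char {i}: "
                    f"{reconstructed[lo:i + context]!r} vs {raw[lo:i + context]!r}")
    if len(reconstructed) != len(raw):
        return (f"one string is a prefix of the other "
                f"(lengths {len(reconstructed)} vs {len(raw)})")
    return "strings are identical"

# e.g. print(first_divergence(gt_str, true_gts[cnt])) just before the assert
```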
Any comments or suggestions will be appreciated, thanks in advance!