AssertionError: 1382 sample gt_str != true_gt #98

@changranelk

Hi there!
Happy new year!

When running token-level code completion (Java), finetuning completes and the model checkpoint is saved, but evaluation and inference then fail with the following error message:

01/05/2022 10:45:21 - INFO - __main__ -   3855034, 0.7707994274499265
01/05/2022 10:48:48 - INFO - __main__ -   400 are done!
01/05/2022 10:48:48 - INFO - __main__ -   5134001, 0.7672330020971948
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 1382 sample gt_str != true_gt
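
The assertion fires in post_process (run_lm.py, line 478) when the detokenized ground truth for a sample no longer matches the corresponding line of the raw test file. As a minimal debugging sketch (the path is the DATADIR from my commands below; 1382 is the failing index from the message above), the raw line can be inspected like this:

import os

data_dir = "../dataset/javaCorpus/token_completion"

# Read the raw test file that post_process compares against.
with open(os.path.join(data_dir, "test.txt")) as f:
    true_gts = f.readlines()

# 1382 is the sample index reported in the AssertionError; repr() makes
# whitespace and escape differences visible.
print(repr(true_gts[1382].strip()))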

The finetuning command I used was:

LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java        # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=1       # modify YOUR_GPU_NUM

CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=8e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=1 \
        --per_gpu_eval_batch_size=2 \
        --gradient_accumulation_steps=8 \
        --num_train_epochs=5 \
        --logging_steps=100 \
        --save_steps=1000 \
        --seed=42 \
        --overwrite_output_dir \
        --not_pretrain

The evaluation and inference command I used was:

export CUDA_VISIBLE_DEVICES=2
LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint-3000-3.3398       # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42 
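
If the two strings for the failing sample can be captured (for example by printing them just before the assertion in post_process), a character-level diff makes it easy to see whether the mismatch is whitespace, literal normalization, or truncation. A minimal sketch, assuming hypothetical variables gt_str and true_gt holding the two strings:

import difflib

def show_divergence(gt_str: str, true_gt: str) -> None:
    # Unified diff of the two strings, split into lines, so the first
    # point of divergence stands out.
    diff = difflib.unified_diff(
        gt_str.splitlines(keepends=True),
        true_gt.splitlines(keepends=True),
        fromfile="gt_str",
        tofile="true_gt",
    )
    print("".join(diff), end="")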

Any comments or suggestions would be appreciated. Thanks in advance!
