AssertionError: 1382 sample gt_str != true_gt #98

@changranelk

Hi there!
Happy new year!

When running token-level code completion (Java), finetuning completes and the model checkpoint is saved, but evaluation and inference then fail with the following error message:

01/05/2022 10:45:21 - INFO - __main__ -   3855034, 0.7707994274499265
01/05/2022 10:48:48 - INFO - __main__ -   400 are done!
01/05/2022 10:48:48 - INFO - __main__ -   5134001, 0.7672330020971948
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 1382 sample gt_str != true_gt
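
The assertion fires in post_process (run_lm.py, line 478) when the detokenized ground truth for a sample no longer matches the corresponding line of the raw test file. As a minimal debugging sketch (the path is the DATADIR from my commands below; 1382 is the failing index from the message above), the raw line can be inspected like this:

import os

data_dir = "../dataset/javaCorpus/token_completion"

# Read the raw test file that post_process compares against.
with open(os.path.join(data_dir, "test.txt")) as f:
    true_gts = f.readlines()

# 1382 is the sample index reported in the AssertionError; repr() makes
# whitespace and escape differences visible.
print(repr(true_gts[1382].strip()))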

The finetuning command I used was:

LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java        # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=1       # modify YOUR_GPU_NUM

CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=8e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=1 \
        --per_gpu_eval_batch_size=2 \
        --gradient_accumulation_steps=8 \
        --num_train_epochs=5 \
        --logging_steps=100 \
        --save_steps=1000 \
        --seed=42 \
        --overwrite_output_dir \
        --not_pretrain

The evaluation and inference command I used was:

export CUDA_VISIBLE_DEVICES=2
LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint-3000-3.3398       # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42 
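
If the two strings for the failing sample can be captured (for example by printing them just before the assertion in post_process), a character-level diff makes it easy to see whether the mismatch is whitespace, literal normalization, or truncation. A minimal sketch, assuming hypothetical variables gt_str and true_gt holding the two strings:

import difflib

def show_divergence(gt_str: str, true_gt: str) -> None:
    # Unified diff of the two strings, split into lines, so the first
    # point of divergence stands out.
    diff = difflib.unified_diff(
        gt_str.splitlines(keepends=True),
        true_gt.splitlines(keepends=True),
        fromfile="gt_str",
        tofile="true_gt",
    )
    print("".join(diff), end="")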

Any comments or suggestions would be appreciated. Thanks in advance!
