Hello,
I have been trying to reproduce the SFT and KD experiments by fine-tuning OPT and GPT models and then evaluating them. However, the evaluation results differ depending on which evaluation method I use.
First, I used the evaluate() function in finetune.py (line 551). Then, I used the standalone evaluation script, evaluate_main.py. The two methods do not give consistent results for the same model.
I compared the two implementations to find the source of the discrepancy. The main difference I found is in the dataset class: finetune.py uses LMDataset for all splits (train, dev, and test) around lines 147–151, while evaluate_main.py uses PromptDataset (line 29).
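One way to check whether the dataset class is really the cause would be to compare the tokenized inputs each script feeds to the model. Below is a minimal sketch, assuming the samples from each dataset can be reduced to plain lists of token IDs; `first_mismatch` is a hypothetical helper and not part of the repository, and how to extract token IDs from LMDataset and PromptDataset is left open because I am not sure of their exact interfaces.

```python
from itertools import zip_longest


def first_mismatch(samples_a, samples_b):
    """Return (index, sample_a, sample_b) for the first differing sample.

    Each argument is an iterable of token-ID sequences, one per sample.
    Returns None if both iterables contain the same sequences in the same order.
    A missing sample (one iterable is shorter) is reported as None.
    """
    for i, (a, b) in enumerate(zip_longest(samples_a, samples_b)):
        if a is None or b is None or list(a) != list(b):
            return i, a, b
    return None


# Toy data standing in for token IDs taken from the two datasets (assumption).
from_finetune = [[1, 2, 3], [4, 5, 6]]
from_evaluate_main = [[1, 2, 3], [4, 5]]
print(first_mismatch(from_finetune, from_evaluate_main))
```

If this reports a mismatch on the first few test samples, the two scripts are not evaluating on the same inputs, which would explain the different scores.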
Could you please clarify what causes this inconsistency and which of the two results should be reported? I would also appreciate any guidance on how to align the two evaluation procedures.
Thank you