MiniLLM: Results on test split between evaluate() in finetune.py and evaluate_main() in evaluate_main.py are different #418

Hello,

I have been trying to reproduce the SFT and KD experiments by fine-tuning OPT and GPT models and then evaluating them. However, the evaluation results differ depending on which evaluation method I use.

First, I used the evaluate() function in finetune.py (line 551). Then, I used the standalone evaluation script, evaluate_main.py. The results from the two on the test split do not match.

I compared both implementations to identify the source of the discrepancy. The main difference I found is that finetune.py uses the LMDataset class for all splits (train, dev, and test) around lines 147–151, while evaluate_main.py uses PromptDataset (line 29).
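To check whether the two dataset classes actually feed the model different inputs, a comparison along the following lines could be run. This is only a minimal sketch: `diff_prompts` is a hypothetical helper that is not part of the repository, the `pad_id` default is an assumed value, and it assumes the prompt token IDs from each path have already been dumped into plain Python lists (extracting them from LMDataset and PromptDataset is not shown).

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple


def diff_prompts(
    a: Sequence[Sequence[int]],
    b: Sequence[Sequence[int]],
    pad_id: int = 0,  # assumed padding id; replace with the tokenizer's real one
) -> List[Tuple[int, str]]:
    """Compare two lists of token-id sequences sample by sample.

    Hypothetical helper: returns (sample_index, description) for every
    sample whose token ids differ once leading/trailing padding is removed.
    """

    def strip_pad(ids: Sequence[int]) -> List[int]:
        # Drop pad_id from both ends so left- vs right-padding is ignored.
        # Note: this also drops real tokens equal to pad_id at the edges.
        ids = list(ids)
        start, end = 0, len(ids)
        while start < end and ids[start] == pad_id:
            start += 1
        while end > start and ids[end - 1] == pad_id:
            end -= 1
        return ids[start:end]

    report: List[Tuple[int, str]] = []
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x is None or y is None:
            # One side has fewer samples than the other.
            report.append((i, "missing sample in one input"))
            continue
        x, y = strip_pad(x), strip_pad(y)
        if x == y:
            continue
        if len(x) != len(y):
            report.append((i, f"length {len(x)} vs {len(y)}"))
        else:
            pos = next(k for k, (p, q) in enumerate(zip(x, y)) if p != q)
            report.append((i, f"first mismatch at token {pos}"))
    return report
```

An empty report would suggest that both paths see the same prompts and that the gap comes from somewhere else (for example generation settings or metric computation). A non-empty report would point at preprocessing differences such as truncation or prompt formatting.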

Could you please clarify why the two evaluation results differ? I would also appreciate any guidance on how to properly align both evaluation procedures.

Thank you
