
[New Task] Add AlpacaEval LC #139

Open
YannDubs opened this issue Apr 2, 2024 · 9 comments
Labels
feature request · new task

Comments

YannDubs commented Apr 2, 2024

Great library; a lightweight library for all the main evals was really needed! 💯

I just came across this line: is there any interest in adding length-controlled AlpacaEval to lighteval? If so, I'm happy to help, e.g. if you want a minimal script that doesn't depend on alpaca_eval.

Let me know

@clefourrier
Member

Hi!
This would be amazing, thanks for the suggestion!

Ideally, we would add it as a community task for now, and once we have non-regression tests on the results, we'll move it to the extended tasks.

However, since it uses an LLM as a judge, we would first want to move the LLM-as-a-judge code that @NathanHB developed for MTBench into the metrics, and allow selecting several judges. (We want this to be homogeneous for easier debugging.)

If you are interested, you can start with that part; otherwise you can wait for us to add it, as it should be integrated soon.

clefourrier added the feature request and new task labels Apr 2, 2024

YannDubs commented Apr 2, 2024

I saw the PR, it looks great and homogeneity definitely makes sense.

Adding AlpacaEval might require a few changes for homogenization though.

The pipeline for AlpacaEval at a high level is:

  1. For each instruction, decode the model outputs and add the reference outputs.
  2. Randomize the order of the model and the reference. One becomes M and the other m but the mapping is random. This is important given that LLM judges typically prefer the last output.
  3. OpenAI's GPT-4 Preview judge states its preference by outputting a single token (M or m), scored with logprobs. Outputting only a single token decreases the eval time and cost and simplifies decoding; using logprobs improves statistical efficiency and alleviates decoding issues.
  4. Extract the raw preference by taking the probability of the evaluated model's token (say M), normalized by the total probability of M and m (see the sketch after this list).
  5. Control for the length bias of the preference by fitting a simple GLM on all the preferences from that model. This takes seconds even on a single CPU.
  6. Average all the length-controlled preferences over the AlpacaEval set to get the final LC win rate.
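
For concreteness, here is a minimal sketch of steps 2-4 in Python. The helper names and the shape of the logprob dictionary are illustrative rather than part of alpaca_eval or lighteval, and the judge prompt plus the actual API call (a single-token completion with logprobs enabled) are left out:

    import math
    import random


    def randomize_pair(model_output: str, reference_output: str, seed: int):
        """Step 2: randomly map the model and reference outputs to the labels M and m."""
        rng = random.Random(seed)
        model_is_M = rng.random() < 0.5
        if model_is_M:
            return {"M": model_output, "m": reference_output}, "M"
        return {"M": reference_output, "m": model_output}, "m"


    def extract_preference(token_logprobs: dict, model_label: str) -> float:
        """Steps 3-4: from the judge's logprobs over the single tokens M and m,
        return p(model's label) normalized by p(M) + p(m)."""
        p_upper = math.exp(token_logprobs.get("M", float("-inf")))
        p_lower = math.exp(token_logprobs.get("m", float("-inf")))
        p_model = p_upper if model_label == "M" else p_lower
        return p_model / (p_upper + p_lower)

The only real requirement on the judge backend here is that it can return the logprobs of the two candidate tokens.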

I only had a quick skim through the MTBench PR, but my understanding is that steps 2, 3 and 4 would all require slight changes to JudgeOpenAI. Step 5 would require a more significant change as it requires processing all the preferences together. I'm not sure where you'd want that step.

I'm curious to hear your thoughts!

I won't have the time to do such a homogenization, and in any case I guess you'd prefer to choose the right abstraction yourselves! But I'm happy to help if there's interest in supporting AlpacaEval, e.g. by writing a minimal implementation.


clefourrier commented Apr 4, 2024

Thanks for detailing these steps!
We'll edit the LLM-as-a-judge metric on our side and come back to you once we're ready; we'd love to support AlpacaEval with your help :)


NathanHB commented Apr 4, 2024

Hi! Thanks for your interest in lighteval!
It seems like integrating alpaca_eval would require a custom function in the JudgeOpenAI class, as it is not as simple as calling the judge and extracting an answer.
I opened a PR to move the extended code to the metrics.

I only had a quick skim through the MTBench PR, but my understanding is that steps 2, 3 and 4 would all require slight changes to JudgeOpenAI. Step 5 would require a more significant change as it requires processing all the preferences together. I'm not sure where you'd want that step.

Step 5 should be easy to add. We have a system that allows plugging in functions that act on the whole corpus instead of on individual samples.

For example,

    mt_bench_metric = SampleLevelMetricGrouping(
        metric=["single_turn", "multi_turn"],
        higher_is_better=True,
        category=MetricCategory.GENERATIVE_MULTI_TURN,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=LlmAsJudge(
            judge_model_name="gpt-3.5-turbo", template_path="src/lighteval/tasks/extended/mt_bench/judge_prompts.jsonl"
        ).compute_multi_turn,
        corpus_level_fn={
            "single_turn": np.mean,
            "multi_turn": np.mean,
        },
    )

Here, each sample is evaluated by the judge and the whole corpus is aggregated using the mean of all samples. We could replace np.mean with a function implementing step 5. :)

That would make a metric for AlpacaEval look like:

    alpaca_metric = SampleLevelMetric(
        metric="lc_alpaca",
        higher_is_better=True,
        category=MetricCategory.GENERATIVE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=LlmAsJudge(
            judge_model_name="gpt-4", template_path="path/to/alpaca_judge_template.jsonl"
        ).compute_alpaca,
        corpus_level_fn=length_controlled_mean,
    )
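
For illustration, a rough sketch of what such a length_controlled_mean could look like. The input format (per-sample dicts with preference and length_delta keys) and the hand-rolled logistic fit are assumptions made for this sketch; the actual AlpacaEval LC estimator fits a more careful GLM in the alpaca_eval repo:

    import numpy as np
    from scipy.optimize import minimize


    def length_controlled_mean(samples: list) -> float:
        """Corpus-level aggregation sketch for step 5.

        Assumes each item is a dict carrying the raw preference from step 4 and the
        length difference (model output length minus reference output length).
        """
        y = np.array([s["preference"] for s in samples], dtype=float)      # in [0, 1]
        delta = np.array([s["length_delta"] for s in samples], dtype=float)
        delta = delta / (delta.std() + 1e-8)  # scale only, so delta == 0 keeps its meaning

        def neg_log_likelihood(params):
            intercept, slope = params
            p = 1.0 / (1.0 + np.exp(-(intercept + slope * delta)))
            eps = 1e-12
            return -np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

        intercept, _ = minimize(neg_log_likelihood, x0=np.zeros(2)).x
        # Predicted preference at zero length difference, i.e. the length-controlled win rate.
        return float(1.0 / (1.0 + np.exp(-intercept)))

The key point is just that the aggregation needs the whole corpus at once, which is exactly what corpus_level_fn provides.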


YannDubs commented Apr 8, 2024

Great to know that there's a place for a corpus-level function; I can write a minimal length_controlled_mean when the time comes. Let me know if you have questions about the rest!


clefourrier commented Apr 17, 2024

Hi @YannDubs !
We have now extended and merged the model-as-a-judge metrics; do you think they would work for you in their current state?

@YannDubs
Author

Hey @clefourrier!

So the current JudgeOpenAI still seems pretty specialized to MT-Bench: it makes a few assumptions that will not hold for AlpacaEval and, more generally, for other LLM-judge benchmarks. For example:

  1. regular expression for __process_judge_response
  2. working only with the text of the output
  3. "single-math-v1-multi-turn" for the reference prompt

1 and 2 are what we originally did in AlpacaEval, but we switched to token logprobs as it's cheaper and statistically more efficient.

Do you want different classes (say an MTBenchJudge class and an AlpacaEvalJudge class) or different parameters in the main Judge class? I can implement something minimal next weekend. But it will probably be easier if you end up writing the final abstraction that you would like to keep!

@clefourrier
Member

Tagging @NathanHB since he worked on it the most, but imo it would be great to have the option to pass different parameters to the main Judge class, and we'll load it with different metric definitions, like the above examples for mt_bench_metric vs alpaca_metric.

@NathanHB
Member

Hi @YannDubs! Having multiple parameters passed to the judge would be our preferred way, for example using a parameter to switch between using logprobs and using the text answer (see the sketch below).
As for point 3, it is simply a matter of changing the llm_judge prompt.
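
To make the parameter idea concrete, here is a hypothetical sketch of what those parameters could look like; none of these names exist in lighteval today, it is just one way a single judge class could cover both the MT-Bench text mode and an AlpacaEval logprob mode:

    from dataclasses import dataclass
    from typing import Callable, Optional


    @dataclass
    class JudgeConfig:
        """Illustrative parameters for a single, configurable judge."""
        judge_model_name: str = "gpt-4"
        template_path: str = "path/to/judge_template.jsonl"
        use_logprobs: bool = False        # False: parse the judge's text answer (MT-Bench style)
        max_new_tokens: int = 1024        # can be set to 1 when use_logprobs is True (AlpacaEval style)
        process_text: Optional[Callable[[str], float]] = None       # e.g. regex score extraction
        process_logprobs: Optional[Callable[[dict], float]] = None  # e.g. normalized P(M) vs P(m)


    def score_judgement(config: JudgeConfig, response_text: str, response_logprobs: dict) -> float:
        """Dispatch on the configured mode instead of hard-coding the MT-Bench behaviour."""
        if config.use_logprobs:
            if config.process_logprobs is None:
                raise ValueError("use_logprobs=True requires process_logprobs")
            return config.process_logprobs(response_logprobs)
        if config.process_text is None:
            raise ValueError("use_logprobs=False requires process_text")
        return config.process_text(response_text)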

Don't hesitate to tell us if you have more questions!
