[Evaluation] Consider "variance" over LLM output trials #204

Open
jonathanlastmileai opened this issue Nov 13, 2023 · 0 comments

@jonathanlastmileai (Contributor) commented:
@rben01 pointed out something very interesting: because LLM outputs are stochastic, any given metric computed over them is itself a random variable (RV), i.e. it has nonzero variance. This is undesirable because it implies low precision when evaluating your LLM. A lower-variance estimator can be implemented by brute force, analogous to bootstrapping: run the LLM N times on the same input and measure the sample's mean and variance of the metric, using an aggregation technique far cheaper than the LLM itself.

This applies whenever the LLM API in use does not guarantee reproducible outputs, or stochasticity is explicitly requested via a nonzero temperature or a related inference parameter.
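A minimal sketch of the brute-force estimator described above, assuming hypothetical `run_llm` (one model call) and `metric` (scores a single output) callables; neither name comes from this repo, both are stand-ins:

```python
import statistics


def estimate_metric_stats(prompt, run_llm, metric, n_trials=20):
    """Run the LLM n_trials times on the same prompt and return the sample
    mean and sample variance of a scalar metric over the outputs.

    run_llm and metric are hypothetical stand-ins:
        run_llm(prompt) -> str
        metric(output) -> float
    """
    scores = [metric(run_llm(prompt)) for _ in range(n_trials)]
    # statistics.variance computes the unbiased sample variance
    # (n - 1 divisor) and requires n_trials >= 2.
    return statistics.mean(scores), statistics.variance(scores)


if __name__ == "__main__":
    import random

    # Toy stand-ins for a real model call and metric, for illustration only.
    def run_llm(prompt):
        return prompt + " " + random.choice(["yes", "no", "maybe"])

    def exact_match(output):
        return 1.0 if output.endswith("yes") else 0.0

    mean, var = estimate_metric_stats("Answer yes or no:", run_llm, exact_match)
    print(f"mean={mean:.3f} variance={var:.3f}")
```

Reporting the mean over N i.i.d. trials reduces the estimator's variance by a factor of N relative to a single trial, and the sample variance itself quantifies how precise a single-run evaluation would have been.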
