# Evaluation of LLMs Should Not Ignore Non-Determinism

```{note}
We aim to compare the performance of
LLMs under different decoding configurations. We
select greedy decoding and sampling generation
for the main comparison. For sampling, we set the
temperature to 1.0 and top-p to 1.0.
```
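
The two decoding configurations can be sketched at the level of a single token choice. This is a minimal illustration in plain Python, not the evaluation code behind these notes: the function names (`softmax`, `greedy_pick`, `sample_pick`) and the toy logits are assumptions made for the example. With the temperature and top-p both at 1.0, sampling draws from the model's unmodified distribution.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, top_p=1.0, rng=random):
    """Nucleus (top-p) sampling: draw from the smallest set of tokens
    whose cumulative probability reaches top_p."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0]  # hypothetical next-token scores
print("greedy :", greedy_pick(toy_logits))
print("sampled:", [sample_pick(toy_logits) for _ in range(10)])
```

Greedy decoding returns the same index on every call, while repeated sampling calls can return different indices; that variation is the non-determinism the evaluation measures.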

## Experimental Results

![](../images/non-determin1.png)

* For most evaluated tasks and models, greedy decoding
outperforms sampling. However, AlpacaEval
serves as a notable exception, where sampling
demonstrates superior performance.

* GSM8K and HumanEval are relatively
less stable under non-deterministic
generation. The performance gap between the best
and worst sampled runs can exceed 10.0 points.
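
The stability statistics mentioned above (standard deviation, best-versus-worst gap) can be computed from repeated sampling runs as sketched here. The helper name `summarize_runs` and the scores are illustrative assumptions; the numbers are made up and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Summarize benchmark scores obtained from repeated sampling runs."""
    best, worst = max(scores), min(scores)
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "best": best,
        "worst": worst,
        "gap": best - worst,  # spread between best and worst run
    }


# Made-up scores for one model on one benchmark, one entry per sampling run.
hypothetical_scores = [70.0, 75.0, 80.0]
print(summarize_runs(hypothetical_scores))
```

Reporting the gap and standard deviation next to the mean makes it visible when a single sampled run would be a misleading point estimate.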

## How Do Various Factors Influence Non-Determinism?

### Scaling Effect on Non-Determinism

No pattern related to the number of model parameters could be identified.

### Alignment Effect on Non-Determinism

Alignment methods such as DPO enhance LLMs
by learning from preference data. We evaluate
the effects of alignment methods, including DPO and KTO, using Llama-3-8B-Instruct as
the training starting point.

![](../images/non-determin2.png)

After applying these methods,
both greedy decoding and sampling performances
are affected. In several tasks, including
AlpacaEval, MMLU, GSM8K, and HumanEval, a
decrease in standard deviation is observed, suggesting
that alignment may reduce the diversity of sampling
outputs.

### Temperature Effect on Non-Determinism

![](../images/non-determin3.png)

A high temperature
significantly impairs the reasoning and code generation
capabilities of LLMs: the model struggles
to solve questions in GSM8K and HumanEval.
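
For background, temperature divides the logits before the softmax, so a higher temperature flattens the next-token distribution. The sketch below shows this on made-up logits; the function names and the temperature values are assumptions for illustration, not the settings used in the figure.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; larger means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token scores
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: top-token prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As the temperature rises, the probability of the top token falls and the entropy grows, so lower-ranked tokens are drawn more often. For tasks where one wrong token can invalidate a solution, this is one plausible reading of the degradation, though the notes above do not state the mechanism.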

### Surface Patterns in Non-Deterministic Generation

![](../images/non-determin4.png)

We observe that the completions
generated by greedy decoding are typically
marginally shorter than those produced via sampling generation.

## What is the Full Potential of Non-Determinism?

![](../images/non-determin5.png)

We adopt a Best-of-N setting, selecting the best answer
from N sampled responses. To accomplish this,
we employ off-the-shelf reward models, such as
ArmoRM and FsfairX, to rank the responses of Llama-3-8B-Instruct,
selecting the one with the highest reward.
We also include an “oracle” baseline, which directly
picks the best response, as the upper bound of the
best-of-N strategy.
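
The selection step can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_reward` and `toy_oracle` are hypothetical stand-ins, since loading a real reward model such as ArmoRM or FsfairX is outside the scope of these notes, and the candidate strings are invented.

```python
def best_of_n(responses, score_fn):
    """Return the response with the highest score under score_fn."""
    if not responses:
        raise ValueError("need at least one response")
    return max(responses, key=score_fn)


def toy_reward(response):
    """Hypothetical reward model stand-in that simply prefers longer text."""
    return float(len(response))


def toy_oracle(response, gold="42"):
    """Hypothetical oracle score: 1.0 if the gold answer appears, else 0.0."""
    return 1.0 if gold in response else 0.0


# Invented candidates standing in for N sampled completions.
candidates = ["The answer is surely 41", "42"]

print("reward model picks:", best_of_n(candidates, toy_reward))
print("oracle picks      :", best_of_n(candidates, toy_oracle))
```

In this toy case the stand-in reward prefers the longer but wrong candidate while the oracle picks the correct one, which illustrates why the oracle acts as an upper bound on what any reward model can select from the same N samples.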

Building upon these promising findings, there
are two ways to further enhance the performance
of smaller LLMs.

1. Probability calibration
techniques can guide LLMs towards generating
superior answers with higher likelihood. Alignment
methods, specifically preference optimization, play a pivotal role
in this process.

2. Ensemble learning strategies,
or strategies that select the best answer from
multiple completions, can be applied.
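
One simple way to combine multiple completions is majority voting over the extracted final answers. The source does not name this technique; it is swapped in here as an assumed, illustrative instance of the second strategy, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled completions."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner


# Invented final answers extracted from five sampled completions.
sampled_answers = ["42", "41", "42", "42", "40"]
print(majority_vote(sampled_answers))
```

Voting needs no reward model, but it only applies when answers can be compared for equality, such as numeric answers to math questions.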