
MAUVE can vary greatly when computed with different K-Means random seeds #11

nostalgebraist opened this issue Aug 2, 2022 · 3 comments

Comments

@nostalgebraist

While using MAUVE in a real use case, I decided to compute MAUVE multiple times per comparison with different K-Means random seeds.

I've noticed that the value of the MAUVE metric varies a lot across these K-Means seeds.

In particular, MAUVE varies about as much across K-Means seeds as it does across GPT sampling seeds. Typical values for std. dev. across 5 seeds are ~0.005 to ~0.01, for either type of seed (while holding the other constant).

This is also comparable in size to the MAUVE differences reported in some model/sampler comparisons, e.g. between nucleus GPT-2-large and nucleus GPT-2-xl in Table 6 of the original paper.
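For reference, here is roughly what I am doing, as a minimal sketch using the mauve-text package. I am assuming the `seed` argument of `compute_mauve` is what controls the k-means initialization, and `p_text` / `q_text` stand in for my lists of human-written and model-generated strings:

```python
import numpy as np
import mauve  # pip install mauve-text

# p_text / q_text: lists of human-written and model-generated strings,
# loaded elsewhere (placeholders here).

# Recompute MAUVE several times, varying only the k-means seed.
scores = []
for kmeans_seed in range(5):
    out = mauve.compute_mauve(
        p_text=p_text,
        q_text=q_text,
        featurize_model_name="gpt2-large",
        max_text_length=1024,
        device_id=0,
        seed=kmeans_seed,   # assumed to control the k-means initialization
        verbose=False,
    )
    scores.append(out.mauve)

print(f"MAUVE across k-means seeds: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
```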

Do you have recommendations about what to do about this variability?

  • Am I doing something wrong?
  • Is this less of an issue with the DRMM or SPV algorithms? I haven't tried them.
  • I have an (untested) hypothesis that MAUVE would be less variable if fewer clusters were used.
    • The rule k = n/10 gives us an average of 10 members per cluster for each of p and q, with many clusters having fewer than this. The small counts mean there is high uncertainty in the individual terms of the KL-divergence sum.
    • By the same token, we are averaging over a large number of bins, and one might hope the errors would wash out in the average, but perhaps they don't as much as we would hope.
    • Using fewer clusters would tend to push MAUVE estimates closer to one another (Fig. 8 in the original paper), but maybe we could compensate for this by using a higher scaling constant (Fig. 5). What do you think about this idea? (A rough sketch of what I mean follows this list.)
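Concretely, the variant I have in mind would look something like this. This is just a sketch: I am assuming `num_buckets` and `mauve_scaling_factor` are the knobs for the number of k-means clusters and the scaling constant, and the particular values are made up:

```python
import mauve

# Fewer clusters than the default n/10, with a larger scaling constant
# to compensate for scores being pushed closer together.
out = mauve.compute_mauve(
    p_text=p_text,            # human-written texts (list of strings)
    q_text=q_text,            # model generations (list of strings)
    num_buckets=100,          # fewer clusters than the default n/10
    mauve_scaling_factor=10,  # larger than the default (5, if I recall correctly)
    max_text_length=1024,
    device_id=0,
)
print(out.mauve)
```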

Colab notebook with an example: https://colab.research.google.com/drive/1wh38JRSr5vkOqlWUxNkP4tUgkFwZwAD0?usp=sharing

@krishnap25
Owner

Hi @nostalgebraist,

The settings in your notebook look reasonable to me.

In general, MAUVE is meant to compare two or more settings. In contrast, the absolute value of MAUVE is not very meaningful (and dependent on a number of hyperparameters, as you pointed out). The key question is: how large is the standard deviation in comparison to the gaps between the models you wish to test?

Some factors to keep in mind:

  • What settings are you comparing? Is there enough signal to be gleaned? For instance, there is not much signal between nucleus sampling with p=0.95 and p=0.96.
  • Are the text generations long enough? Most modern models are capable of generating a perfectly decent first sentence, but cracks start to appear once you go longer.
  • MAUVE has a non-linear scale: a standard deviation of 0.01 can mean very different things for 0.95 +/- 0.01 and for 0.37 +/- 0.01 (see the quick numerical illustration below this list).
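One rough way to see this (a back-of-the-envelope illustration, not a formal property of MAUVE) is to compare the standard deviation to the remaining gap to a perfect score of 1:

```python
# The same absolute std dev is a very different fraction of the "gap to 1".
for mauve_score, std in [(0.95, 0.01), (0.37, 0.01)]:
    gap = 1.0 - mauve_score
    print(f"MAUVE={mauve_score:.2f}: gap to 1 = {gap:.2f}, "
          f"std is {100 * std / gap:.1f}% of that gap")
# MAUVE=0.95: gap to 1 = 0.05, std is 20.0% of that gap
# MAUVE=0.37: gap to 1 = 0.63, std is 1.6% of that gap
```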

Regarding your idea: the scaling constant also has a non-linear effect on the standard deviation (see Figure 5 of the paper), so I doubt whether it would produce meaningfully smaller deviations.

Best,
Krishna

@nostalgebraist
Author

Thanks for the reply! I basically agree with everything you say here.

I am using GPT-2 with the maximum text length (1024 tokens), so there is little room to improve on that front.

You write:

MAUVE has a non-linear scale: a standard deviation of 0.01 can mean very different things for 0.95 +/- 0.01 and for 0.37 +/- 0.01.

The core of the problem I'm having is that

  • As models and samplers get better, the gaps we want to measure get smaller.
    • For example (Table 6), moving from pure sampling to nucleus with GPT-2-small yields a MAUVE improvement of ~0.3, from 0.589 to 0.878.
    • But to demonstrate any further improvement over nucleus with GPT-2-small, we have to measure a difference of size <= (1 - 0.878) = 0.122.
  • But the variance of the MAUVE estimator does not go down as the gaps get smaller. (Cf. std devs in Table 6)

Because nucleus sampling already does so well (relative to the variance of the estimator), it is difficult to show with high confidence that any new sampler improves upon nucleus sampling.

As a real example, consider Table 5 in the paper on Typical Decoding, where the best reported MAUVE is 0.96 and a nucleus baseline has 0.95. I don't bring this up to critique that paper -- these numbers are broadly representative of the situation faced by anyone trying to compare a new sampler to nucleus sampling using MAUVE.

This difficulty increases with larger models, since their baseline MAUVE is even closer to 1. So it is especially difficult to show that any new sampler is helpful across model scales, rather than only helping for small models.

One can imagine lots of things that might improve the variance (a sketch of options 2 and 3 follows the list):

  1. Using a larger number of texts
  2. Averaging over a larger number of generation seeds
  3. Averaging over a larger number of KMeans seeds
  4. Using a smaller number of KMeans clusters (maybe?)
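
As a sketch of options 2 and 3 (again with the mauve-text package, and again assuming the `seed` argument controls the k-means initialization; `generations_by_seed` is a hypothetical dict mapping a generation seed to the list of texts sampled with that seed):

```python
import numpy as np
import mauve

# Average MAUVE over several generation seeds and several k-means seeds.
scores = []
for gen_seed, q_text in generations_by_seed.items():  # hypothetical dict
    for kmeans_seed in range(5):
        out = mauve.compute_mauve(
            p_text=p_text,      # human-written texts
            q_text=q_text,      # generations for this generation seed
            max_text_length=1024,
            device_id=0,
            seed=kmeans_seed,
        )
        scores.append(out.mauve)

scores = np.array(scores)
print(f"mean={scores.mean():.4f}  sem={scores.std(ddof=1) / np.sqrt(len(scores)):.4f}")
```

(The naive standard error at the end overstates the precision somewhat, since repeated k-means seeds on a fixed set of generations are not independent draws, but it gives a ballpark.)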

For future research that has to measure small differences -- e.g. research on new samplers, or on large models -- it would be practically useful to have an agreed-upon "recipe" for achieving a tolerable variance given the small differences we have to measure.

It would also be helpful to have a notice in the README indicating that the default settings are not sufficient for measuring differences between pairs of distributions that are each already very similar (i.e. for the case where all measured MAUVEs are near 1).


Some background on my use case:

I'm evaluating a new sampling method that augments nucleus sampling to decrease pathological repetition.

Pathological repetition is relatively rare with nucleus sampling but very noticeable when it occurs. So I am trying to measure a difference that is meaningful in human terms, but "small" in the sense that it shows up in a relatively small fraction of generated samples.

I am especially interested in measuring how the effect varies with model size, since it has been observed that large models still suffer from pathological repetition, despite improving in many other respects. (This might imply that repetition constitutes a larger fraction of the difference between these models' predictive distributions and the ground truth, so that a fix for repetition would "fix more of the remaining problem" for larger models.)

@krishnap25
Owner

Hi @nostalgebraist,

Thank you for these detailed notes (and sorry for the slow response).

Here is my intuition: in order to quantify subtle errors, the number of errors must be large enough to significantly alter the output of the clustering. There are two factors at play here: the quality of the embeddings, and the average number of samples per cluster.

Quality of the embeddings:
As the models get better, I think it might become more important to consider better embeddings. We used GPT-2 large because it provided a good trade-off between embedding quality and ease of use (publicly available in late 2020, fit on GPU memory, etc.). We observed that better embeddings do help, and it might make sense to go for embeddings from GPT-3 if you have access to it.

Alternatively, you might want to consider alternate embedding models --- we found different models to be sensitive to different properties of text. Based on anecdotal evidence, it appears to me that RoBERTa, for instance, is more critical of inconsistencies than GPT-2, while the latter cares more about repetitions.
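If you want to try a different embedding model, one route is to featurize the text yourself and pass the features to compute_mauve via `p_features` / `q_features`. A sketch (I believe the package's built-in featurization takes the last token's hidden state from a GPT-2 style model; the mean pooling below is just one illustrative choice, and `p_text` / `q_text` are placeholders for your text lists):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
import mauve

# Featurize with RoBERTa (one vector per text, mean of the final-layer
# hidden states) and hand the pre-computed features to MAUVE.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large").eval()

@torch.no_grad()
def featurize(texts):
    feats = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        hidden = model(**enc).last_hidden_state      # (1, seq_len, dim)
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

out = mauve.compute_mauve(
    p_features=featurize(p_text),   # human-written texts
    q_features=featurize(q_text),   # model generations
)
print(out.mauve)
```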

Ratio of sample size to number of clusters:
First, I assume the number of samples you have is large enough so that you observe the errors at least a few times. If your number of clusters is too small, then the clustering will not particularly change with a few errors --- although this reduces the variance, I'm not sure it'll give you any signal. If the number of clusters is too large, too many clusters will be singletons, so you get no signal. For your case, you might want to try out various cluster sizes to balance out these two factors. We have a heuristic, but you could potentially do better by searching over cluster sizes.
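For instance, a small sweep along these lines (a sketch; I am assuming `num_buckets` sets the number of k-means clusters, and the grid is arbitrary) would let you see where the seed-to-seed spread becomes tolerable while the gap you care about is still visible:

```python
import numpy as np
import mauve

# Sweep the number of clusters; for each setting, look at the spread
# across k-means seeds and at whether your samplers still separate.
for num_buckets in [50, 100, 200, 500]:
    scores = [
        mauve.compute_mauve(
            p_text=p_text, q_text=q_text,             # your text lists
            num_buckets=num_buckets, seed=kmeans_seed,
            max_text_length=1024, device_id=0,
        ).mauve
        for kmeans_seed in range(5)
    ]
    print(f"num_buckets={num_buckets}: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
```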

If the error is so rare that this approach does not help, here is a potential alternative.
You might try separating out the prompts which tend to trigger the undesirable nucleus-sampling behavior that your method fixes. That is, you split P and Q into P1, P2 and Q1, Q2 respectively. You could then establish that P1 and Q1 are very similar with high confidence, while P2 and Q2 are quite different with high confidence (with your method being better on this subset).
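In code, the split might look something like this (a sketch; `is_repetition_prone` and `prompts` are hypothetical, and I am assuming `p_text` and `q_text` are aligned by prompt):

```python
import mauve

# Partition the prompts by whether they tend to trigger the failure mode,
# then score each half separately. `is_repetition_prone` is a hypothetical
# predicate you would define, e.g. from a held-out nucleus-sampling run.
prone_set = {i for i, prompt in enumerate(prompts) if is_repetition_prone(prompt)}
splits = {
    "P2/Q2 (prone)":  sorted(prone_set),
    "P1/Q1 (benign)": [i for i in range(len(prompts)) if i not in prone_set],
}

for name, idx in splits.items():
    out = mauve.compute_mauve(
        p_text=[p_text[i] for i in idx],
        q_text=[q_text[i] for i in idx],
        max_text_length=1024,
        device_id=0,
    )
    print(name, out.mauve)
```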

Other variance reduction approaches:
After all this, if you still wish to reduce the variance, I would try averaging over a larger number of k-means seeds (the least expensive option), followed by using a larger number of generation seeds. I've added some notes on this in the README.

Hope this helps. I think it is super interesting to try and quantify subtler differences using MAUVE, but I'm afraid there is not too much we can do if the embeddings are not sensitive to this property.

Best,
Krishna
