MAUVE can vary greatly when computed with different K-Means random seeds #11
Comments
Hi @nostalgebraist, The settings in your notebook look reasonable to me. In general, MAUVE is meant to compare two or more settings. In contrast, the absolute value of MAUVE is not very meaningful (and dependent on a number of hyperparameters, as you pointed out). The key question is: how large is the standard deviation in comparison to the gaps between the models you wish to test? Some factors to keep in mind:
Regarding your idea: the scaling constant also has a non-linear effect on the standard deviation (see Figure 5 of the paper), so I doubt whether it would produce meaningfully smaller deviations. Best,
Thanks for the reply! I basically agree with everything you say here. I am using GPT-2 with the maximum text length (1024 tokens), so there is little room to improve on that front. You write:
The core of the problem I'm having is this:
Because nucleus sampling already does so well (relative to the variance of the estimator), it is difficult to show with high confidence that any new sampler improves upon nucleus sampling. As a real example, consider Table 5 in the paper on Typical Decoding, where the best reported MAUVE is 0.96 and a nucleus baseline has 0.95. I don't bring this up to critique that paper -- these numbers are broadly representative of the situation faced by anyone trying to compare a new sampler to nucleus sampling using MAUVE.

This difficulty increases with larger models, since their baseline MAUVE is even closer to 1. So it is especially difficult to show that any new sampler is helpful across model scales, rather than only helping for small models.

One can imagine lots of things that might improve the variance:
For future research that has to measure small differences -- e.g. research on new samplers, or on large models -- it would be practically useful to have an agreed-upon "recipe" for achieving a tolerable variance given the small differences we have to measure. It would also be helpful to have a notice in the README indicating that the default settings are not sufficient for measuring differences between pairs of distributions which are all individually close (i.e. for the case where all measured MAUVEs are near 1).

Some background on my use case: I'm evaluating a new sampling method that augments nucleus sampling to decrease pathological repetition. Pathological repetition is relatively rare with nucleus sampling but very noticeable when it occurs. So I am trying to measure a difference that is meaningful in human terms, but "small" in the sense that it shows up in a relatively small fraction of generated samples.

I am especially interested in measuring how the effect varies with model size, since it has been observed that large models still suffer from pathological repetition, despite improving in many other respects. (This might imply that repetition constitutes a larger fraction of the difference between these models' predictive distributions and the ground truth, so that a fix for repetition would "fix more of the remaining problem" for larger models.)
Hi @nostalgebraist, Thank you for these detailed notes (and sorry for the slow response). Here is my intuition: in order to quantify subtle errors, the number of errors must be large enough to significantly alter the output of the clustering. There are two factors at play here: the quality of the embeddings, and the average number of samples per cluster.

Quality of the embeddings: Alternatively, you might want to consider alternate embedding models --- we found different models to be sensitive to different properties of text. Based on anecdotal evidence, it appears to me that RoBERTa, for instance, is more critical of inconsistencies than GPT-2, while the latter cares more about repetitions.

Sample size to number of clusters: If the error is so rare that this approach does not help, here is a potential alternative.

Other variance reduction approaches:

Hope this helps. I think it is super interesting to try and quantify subtler differences using MAUVE, but I'm afraid there is not too much we can do if the embeddings are not sensitive to this property. Best,
While using MAUVE in a real use case, I decided to compute MAUVE multiple times per comparison with different K-Means random seeds.
I've noticed that the value of the MAUVE metric varies a lot across these K-Means seeds.
In particular, MAUVE varies about as much across K-Means seeds as it does across GPT sampling seeds. Typical values for std. dev. across 5 seeds are ~0.005 to ~0.01, for either type of seed (while holding the other constant).
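To make the measurement protocol concrete, here is a minimal, self-contained NumPy sketch of what "std. dev. across K-Means seeds" means: re-run a clustering-based divergence estimate under several clustering seeds while holding the samples fixed, and report the spread. This is a toy stand-in, not the `mauve` package's API -- the k-means, the smoothed KL on cluster histograms, and all settings here are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the MAUVE pipeline (NOT the `mauve` package's API):
# cluster the pooled samples with a seeded k-means, histogram each sample
# over the clusters, and compute a smoothed KL between the histograms.
# Re-running under several clustering seeds shows the seed-induced spread.

def kmeans_labels(X, k, seed, iters=50):
    """Plain Lloyd's algorithm with seeded random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def clustered_kl(P, Q, k, seed, eps=1e-6):
    """Smoothed KL(p || q) over a joint k-means quantization of P and Q."""
    labels = kmeans_labels(np.vstack([P, Q]), k, seed)
    p = np.bincount(labels[:len(P)], minlength=k) + eps
    q = np.bincount(labels[len(P):], minlength=k) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(500, 2))   # "human" features (toy)
Q = rng.normal(0.3, 1.0, size=(500, 2))   # "model" features, slightly shifted

vals = [clustered_kl(P, Q, k=100, seed=s) for s in range(5)]
print(f"mean={np.mean(vals):.4f}  std across 5 seeds={np.std(vals):.4f}")
```

Even though `P` and `Q` never change, the estimate moves with the clustering seed alone, which is exactly the effect described above.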
This is also comparable in size to the MAUVE differences reported in some model/sampler comparisons, e.g. between nucleus GPT-2-large and nucleus GPT-2-xl in Table 6 of the original paper.
Do you have recommendations about what to do about this variability?
`k = n/10` gives us an average of 10 members per cluster for each of `p` and `q`, with many clusters having fewer than this. The small counts mean there is high uncertainty in the individual terms of the KL-divergence sum.

Colab notebook with an example: https://colab.research.google.com/drive/1wh38JRSr5vkOqlWUxNkP4tUgkFwZwAD0?usp=sharing
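The small-count point can be checked directly with a toy simulation (an illustration with assumed settings, not the MAUVE code): draw two samples from the *same* distribution over `k` clusters at `n = 10*k`, so the true KL is exactly zero, and look at the plug-in estimate.

```python
import numpy as np

# Toy illustration (not the MAUVE code): draw TWO samples from the SAME
# distribution over k clusters at n = 10*k, so the true KL is exactly 0.
# The plug-in estimate is nonetheless positive and noisy, purely because
# ~10 counts per cluster is too few to pin down the per-cluster ratios.

rng = np.random.default_rng(0)
k = 500                  # number of clusters
n = 10 * k               # k = n/10, i.e. 10 samples per cluster on average
true_probs = np.full(k, 1.0 / k)

def plugin_kl(counts_p, counts_q, eps=1e-6):
    """Smoothed plug-in estimate of KL(p || q) from per-cluster counts."""
    p = counts_p / counts_p.sum() + eps
    q = counts_q / counts_q.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

ests = [
    plugin_kl(rng.multinomial(n, true_probs), rng.multinomial(n, true_probs))
    for _ in range(20)
]
print(f"true KL = 0; plug-in mean={np.mean(ests):.4f}  std={np.std(ests):.4f}")
```

The estimates come out strictly positive with visible spread, even though the two distributions are identical -- noise from the small per-cluster counts alone.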