# Preset Arena results

Description of the experiment: Preset Arena: 17,205 comparisons between 241 different presets.

Some numbers:

- 7215 valid votes
- 951 voting sessions
- 288 users with usernames

## Final results

- Sorted by instruct performance: instruct
- Sorted by chat performance: chat

Preset definitions: presets

## Dataset

https://huggingface.co/datasets/oobabooga/preset-arena

## Ranking the presets

### Bad voting sessions

The first step in analyzing the votes was to identify suspicious voters. Each voting session received a unique UUID string, which made it possible to analyze the frequency of left/right votes within each session.

I used the following code to calculate the probability that a voting session was biased; it was obtained by asking ChatGPT for a fair-coin test:

```python
from scipy.stats import beta

def compute_bias_probability(outcomes, prior_alpha=1, prior_beta=1, _print=False):
    # Count the number of 'left' (heads) and 'right' (tails) votes
    num_heads = outcomes.count('left')
    num_tails = outcomes.count('right')

    if _print:
        print(num_heads, num_tails)

    # Update the Beta prior with the observed outcomes
    posterior_alpha = prior_alpha + num_heads
    posterior_beta = prior_beta + num_tails

    # Posterior probability that the chance of a 'left' vote is below 0.5
    bias_probability = beta.cdf(0.5, posterior_alpha, posterior_beta)

    return bias_probability
```

A session was discarded if `bias_probability > 0.99`, which happened for 0.6% of all sessions.
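
As a quick illustration, here is how the test behaves on two made-up sessions (the vote lists below are hypothetical, not taken from the dataset):

```python
balanced = ['left', 'right'] * 10           # 10 left, 10 right
one_sided = ['left'] + ['right'] * 19       # 1 left, 19 right

print(compute_bias_probability(balanced))   # 0.5 by symmetry: not suspicious
print(compute_bias_probability(one_sided))  # > 0.99: session would be discarded
```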

### Estimating the Elo scores

The basic formula is

```python
def update_rating(rating, opponent_rating, outcome, k=32):
    # Expected score under the Elo model (logistic in the rating difference)
    expected_score = 1 / (1 + 10**((opponent_rating - rating) / 400))
    new_rating = rating + k * (outcome - expected_score)
    return new_rating
```

where the ratings are initialized at 1000 for all presets, and `outcome` is 1 for a win and 0 for a loss. For example, when two presets both rated 1000 face each other, `expected_score` is 0.5, so with `k=32` the winner gains 16 points and the loser drops 16.

To make things more robust, instead of calculating the Elo scores just once, I used the following procedure (a code sketch is given below):

- take a random subsample containing 90% of the votes
- using that sample, calculate the Elo scores for chat and instruct prompts separately
- repeat 200 times
- take the average of the Elo scores for each preset

Additionally, I did not count votes where both completions were identical.
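
A minimal sketch of that procedure, assuming the deduplicated votes for one prompt type are available as a list of `(winner, loser)` preset-name pairs (the data loading is omitted here):

```python
import random
from collections import defaultdict

def bootstrap_elo(matches, presets, n_rounds=200, sample_frac=0.9):
    # matches: list of (winner, loser) preset-name pairs for one prompt type
    totals = defaultdict(float)
    for _ in range(n_rounds):
        # Random 90% subsample of the votes
        sample = random.sample(matches, int(len(matches) * sample_frac))
        ratings = {p: 1000.0 for p in presets}
        for winner, loser in sample:
            w, l = ratings[winner], ratings[loser]
            ratings[winner] = update_rating(w, l, outcome=1)
            ratings[loser] = update_rating(l, w, outcome=0)
        for p, r in ratings.items():
            totals[p] += r
    # Average each preset's score over all bootstrap rounds
    return {p: totals[p] / n_rounds for p in presets}
```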

## Comments

1. I find that the top chat presets are all quite similar. This may be because the chat prompts were too simple and short, which would favor presets with low top_p.

2. Five variations of the Mirostat preset were included. It turned out that Mirostat-5 was a bit better than the Mirostat preset originally included in text-generation-webui:

| preset | params | Elo score (chat) | Elo score (instruct) | Elo score (all) | matches (chat) | matches (instruct) |
|---|---|---|---|---|---|---|
| Mirostat-5 | 2 | 1012.72 | 1100.02 | 1056.37 | 36 | 23 |
| Mirostat | 1 | 993.06 | 1109.17 | 1051.11 | 27 | 22 |
| Mirostat-2 | 2 | 1067.88 | 1028.21 | 1048.05 | 29 | 25 |
| Mirostat-4 | 2 | 1031.92 | 1020.20 | 1026.06 | 37 | 35 |
| Mirostat-3 | 2 | 988.17 | 1021.21 | 1004.69 | 29 | 29 |
3. Similarly, five Contrastive Search variations were included; Contrastive Search-3 ended up slightly better than the original Contrastive Search:

| preset | params | Elo score (chat) | Elo score (instruct) | Elo score (all) | matches (chat) | matches (instruct) |
|---|---|---|---|---|---|---|
| Special-Contrastive Search-3 | 3 | 1077.67 | 1115.82 | 1096.74 | 27 | 18 |
| Special-Contrastive Search | 3 | 1077.34 | 1095.47 | 1086.40 | 35 | 31 |
| Special-Contrastive Search-1 | 3 | 899.72 | 851.86 | 875.79 | 16 | 10 |
| Special-Contrastive Search-4 | 3 | 765.79 | 790.96 | 778.38 | 33 | 19 |
| Special-Contrastive Search-2 | 3 | 801.02 | 736.86 | 768.94 | 27 | 25 |
4. Eta Sampling (another special technique) did not perform very well, although its parameters are present in other top-performing presets:

| preset | params | Elo score (chat) | Elo score (instruct) | Elo score (all) | matches (chat) | matches (instruct) |
|---|---|---|---|---|---|---|
| Special-Eta Sampling | 3 | 1018.53 | 1016.45 | 1017.49 | 29 | 25 |
5. The best preset overall, considering the average of the chat and instruct Elo scores, was also perhaps the most obvious one. I originally named it simple-1, not expecting it to get anywhere:

```yaml
temperature: 0.7
top_p: 0.9
repetition_penalty: 1.15
top_k: 20
```

The StarChat preset, which is also very simple, performed well too:

```yaml
temperature: 0.2
top_p: 0.95
top_k: 50
```

This suggests that fancy samplers may not be all that necessary.
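
For illustration, the simple-1 values map directly onto the standard sampling arguments of the Hugging Face `transformers` `generate()` API; the model name below is just a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM takes the same sampling arguments
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Common sense questions and answers", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,         # simple-1 values
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.15,
    max_new_tokens=50,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```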

## Presets that I chose

For the purpose of including better presets in text-generation-webui, I removed presets with `top_p < 0.05` or `top_k < 3`, since such values seemed too low and artificial. That left me with the following (in decreasing order of Elo score):

### Instruct

| Preset | New name |
|---|---|
| random_preset_066 | Divine Intellect |
| random_preset_134 | Big O |
| simple-1 | |
| random_preset_035 | Space Alien |
| starchat | StarChat |
| random_preset_183 | Titanic |
| tfs-with-top-a | |
| random_preset_002 | Asterism |
| Special-Contrastive Search-3 | Contrastive Search |

### Chat

| Preset | New name |
|---|---|
| random_preset_101 | Midnight Enigma |
| random_preset_161 | Yara |
| random_preset_120 | Shortwave |
| Kobold-Godlike | |

I took the liberty of giving some cheesy names to the new random presets.

## Sampler frequency (outdated, see below)

Across those 13 new presets, these are the sampling parameters that are present and the number of times each appears:

```
12 temperature
11 top_p
11 top_k
11 repetition_penalty
 5 top_a
 3 tfs
 2 typical_p
 2 eta_cutoff
 1 penalty_alpha
 1 epsilon_cutoff
 1 encoder_repetition_penalty
```

## Sampler frequency (updated)

In a follow-up analysis, I tried removing samplers from the presets one at a time and checking whether the resulting logits changed.

For that, I took a random story copied and pasted from the internet, split it by spaces, and computed the logits using the first N words as input, for each N <= 200. That is, 200 logit vectors were computed for each preset. I then considered a parameter redundant if removing it left the logits identical 90% of the time or more.
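
A minimal sketch of that check, assuming a hypothetical `get_logits(prompt, preset)` helper that returns the logit vector after the preset's samplers have been applied:

```python
import numpy as np

def is_redundant(words, preset, param, get_logits, threshold=0.9):
    # Copy of the preset with one sampling parameter removed
    reduced = {k: v for k, v in preset.items() if k != param}
    n = min(len(words), 200)
    identical = 0
    for i in range(1, n + 1):
        prompt = ' '.join(words[:i])  # first i words of the story
        if np.array_equal(get_logits(prompt, preset), get_logits(prompt, reduced)):
            identical += 1
    # Redundant if the logits are unchanged at least 90% of the time
    return identical / n >= threshold
```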

The resulting parameter frequency after this clean-up was:

```
12 temperature
11 top_p
11 top_k
11 repetition_penalty
 2 typical_p
 2 tfs
 1 top_a
 1 penalty_alpha
 1 encoder_repetition_penalty
```

Note that the eta sampling parameters (`epsilon_cutoff` and `eta_cutoff`) disappeared completely.