<a href="https://colab.research.google.com/github/microprediction/winningnotebooks/blob/main/LLM_Quinellas_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install winning
!pip install pandas
!pip install scipy

Collecting winning
  Downloading winning-1.0.3-py3-none-any.whl.metadata (6.7 kB)
Downloading winning-1.0.3-py3-none-any.whl (23 kB)
Installing collected packages: winning
Successfully installed winning-1.0.3


# Luce's Choice Axiom versus the Standard Normal Race model
The methodology is as follows.





## A contest model for choice

Luce is trivial. Let's just implement the second here using the `winning` package:;

# Quinella pricing

In [2]:
import numpy as np
from winning.lattice_conventions import STD_L, STD_A
from winning.lattice import skew_normal_density, densities_from_offsets, get_the_rest, _loser_of_two_pdf,\
    beats, winner_of_many, cdf_to_pdf
from winning.lattice_calibration import state_price_implied_ability


def compute_skew_normal_quinellas(p:[float], L=551, a=0):
    """ Produce quinella table, and also return densities

    :param p:  Vector of state prices
    :param L:  500 by default, half that is probably fine
    :return: quinellas, densities
    """

    # Calibration
    unit = 1.0
    density = skew_normal_density(L=L, unit=unit, loc=0, scale=50.0, a=a)
    offsets = state_price_implied_ability(prices=p, density=density)
    densities = densities_from_offsets(density, offsets)
    densityAll, multiplicityAll = winner_of_many(densities)

    n = len(p)
    quinellas = np.zeros(shape=(n, n))
    for h0 in range(n):
        density0 = densities[h0]
        cdfRest0, multiplicityRest0 = get_the_rest(density=density0, densityAll=densityAll,
                                                   multiplicityAll=multiplicityAll, cdf=None, cdfAll=None)
        for h1 in range(n):
            if h1 > h0:
                density1 = densities[h1]
                cdfRest01, multiplicityRest01 = get_the_rest(density=density1, densityAll=None,
                                                             multiplicityAll=multiplicityRest0, cdf=None,
                                                             cdfAll=cdfRest0)
                pdfRest01 = cdf_to_pdf(cdfRest01)
                loser01, loser_multiplicity01 = _loser_of_two_pdf(density0, density1)
                quinellas[h0, h1] = beats(loser01, loser_multiplicity01, pdfRest01, multiplicityRest01)
                quinellas[h1, h0] = quinellas[h0, h1]

    return quinellas

qins = compute_skew_normal_quinellas(p=[0.5,0.3,0.1,0.1,0.1,0.001,0.001,0.001,0.001,0.001,0.001])
qins[:4,:4]

array([[0.        , 0.33139024, 0.12176597, 0.12176597],
       [0.33139024, 0.        , 0.06868425, 0.06868425],
       [0.12176597, 0.06868425, 0.        , 0.02362145],
       [0.12176597, 0.06868425, 0.02362145, 0.        ]])

# Quinella pricing (Luce / Harville)

In [3]:
def compute_harville_quinellas(p):
    """
    Compute Harville Quinellas (joint probabilities for unordered pairs) from individual probabilities assuming independence.

    Args:
        p (list of float): List of individual probabilities for each state/event. Should sum to <=1.

    Returns:
        list of lists: Matrix-like structure where the element at [i][j] is the joint probability P({i, j}),
                       with diagonal elements being 1.0 to match the format requested.
    """
    from itertools import combinations

    n = len(p)
    quinella_matrix = [[0.0] * n for _ in range(n)]

    # Compute unnormalized joint probabilities
    for i, j in combinations(range(n), 2):
        if i != j:
          joint_prob = p[i] * p[j]
          quinella_matrix[i][j] = quinella_matrix[j][i] = joint_prob

    # Normalize the joint probabilities so that their sum equals the total probability of any two events occurring
    total_joint_prob = sum(sum(row) for row in quinella_matrix) / 2  # Divide by 2 to avoid double-counting pairs
    if total_joint_prob > 0:
        quinella_matrix = [
            [cell / total_joint_prob if i != j else 0.0 for j, cell in enumerate(row)]
            for i, row in enumerate(quinella_matrix)
        ]

    return quinella_matrix

# Example usage
result = compute_harville_quinellas(p=[0.5, 0.3, 0.2])
for row in result:
    print(row)




[0.0, 0.48387096774193544, 0.32258064516129037]
[0.48387096774193544, 0.0, 0.1935483870967742]
[0.32258064516129037, 0.1935483870967742, 0.0]


In [4]:
!pip install transformers numpy pandas scipy



# Missing word utility

In [18]:
def llm_quinellas(prompt_pair, top_k=10):
    """
    Receives a prompt pair like the following:
      - "I visited the state called [MASK] last year and it is one of my favorite states in the U.S.A."
      - "I visited two states called [ANSWER] and [MASK] last year and they are my two favorite states in the U.S.A."

    First, it will ask an LLM to fill in the missing token and extract the token probabilities.
    We will take the top 10 and create a list called 'names' (lowercase) and one called 'p' where the latter holds
    renormalized probabilities adding to unity.

    Second, for each i, name in enumerate(names), we will substitute into the second prompt.
    So if the name is 'arizona', we get something like:
      - "I visited two states called arizona and [MASK] last year and they are my two favorite states in the U.S.A."

    Eliminate any responses that are not in the set NAMES / {name}.
    Renormalize the token probabilities.

    This gives a way of assigning 'exacta' probabilities ex[i, :] where diagonals are zero.
    When we have done this for all names, we also add ex to its own transpose to get qu[:, :].

    Return this quinella probability table.
    """
    from transformers import pipeline
    import numpy as np

    # Initialize the fill-mask pipeline with a model that supports mask filling
    fill_mask = pipeline('fill-mask', model='roberta-base')

    # Unpack the prompt pair
    prompt_single, prompt_double = prompt_pair

    # Adjust the mask tokens for the model
    mask_token = fill_mask.tokenizer.mask_token  # This will be '<mask>' for roberta-base

    # Step 1: Get top 10 names and their probabilities from the first prompt
    prompt_single = prompt_single.replace('[MASK]', mask_token)

    # Get predictions
    results = fill_mask(prompt_single, top_k=top_k)

    # Extract tokens and their probabilities
    names = []
    probs = []
    for res in results:
        token_str = res['token_str'].strip().lower()
        names.append(token_str)
        probs.append(res['score'])

    # Normalize probabilities to sum to 1
    total_prob = sum(probs)
    p = [prob / total_prob for prob in probs]
    print({'prob':total_prob})

    # Initialize the exacta probability matrix
    n = len(names)
    ex = np.zeros((n, n))

    # Step 2: For each name, get probabilities for the second [MASK]
    for i, name in enumerate(names):
        # Substitute the name into the second prompt
        prompt = prompt_double.replace('[ANSWER]', name)
        prompt = prompt.replace('[MASK]', mask_token)

        # Get predictions
        results = fill_mask(prompt, top_k=2*top_k)  # Increase top_k to ensure coverage

        # Extract tokens and their probabilities
        other_names = []
        probs_others = []
        for res in results:
            token_str = res['token_str'].strip().lower()
            other_names.append(token_str)
            probs_others.append(res['score'])

        # Filter out the current name and names not in the list
        allowed_names = set(names) - {name}
        filtered_probs = {}
        for other_name, prob in zip(other_names, probs_others):
            if other_name in allowed_names:
                filtered_probs[other_name] = prob

        # Renormalize probabilities
        total_prob = sum(filtered_probs.values())
        if total_prob > 0.4:
            filtered_probs = {k: v / total_prob for k, v in filtered_probs.items()}
        else:
            # If no allowed names are found, skip this iteration
            continue

        # Map other_name to index j
        name_to_index = {n: idx for idx, n in enumerate(names)}
        # Fill the exacta matrix
        for other_name, prob in filtered_probs.items():
            j = name_to_index[other_name]
            ex[i, j] = p[i]*prob

    # Zero out the diagonal
    np.fill_diagonal(ex, 0)

    # Compute the quinella probability table
    qu = ex + ex.T


    return p, qu, names


def quinella_comparison(prompt_pair, top_k=10):
    # Get LLM, normal model and harville implied quinellas
    # Compute several measures of discrepancy between LLM and estimated probabilities
    p, qu_llm, names = llm_quinellas(prompt_pair=prompt_pair,top_k=top_k)
    qu_normal = compute_skew_normal_quinellas(p)
    qu_harville = compute_harville_quinellas(p)
    srt = sorted(list(zip(p,names)), reverse=True)
    print({'probs':srt})

    # Compute RMSE between qu_llm and qu_normal
    rmse_normal = np.sqrt(np.mean((qu_llm - qu_normal) ** 2))

    # Compute RMSE between qu_llm and qu_harville
    rmse_harville = np.sqrt(np.mean((qu_llm - qu_harville) ** 2))

    # Compute RMSE between qu_llm and qu_harville
    rmse_diff = np.sqrt(np.mean((qu_harville - qu_normal) ** 2))


    # Print the RMSE values
    print(f"RMSE between LLM quinellas and Skew Normal quinellas: {rmse_normal:.6f}")
    print(f"RMSE between LLM quinellas and Harville quinellas: {rmse_harville:.6f}")
    print(f"RMSE between Skew and Harville quinellas: {rmse_diff:.6f}")
    if rmse_normal < rmse_harville:
        better_model = "Skew Normal"
    else:
        better_model = "Harville"

    print(f"The {better_model} model better predicts the actual quinella probabilities.")
    return rmse_harville, rmse_normal




# Example usage
prompt_pair = (
    "I visited the state called [MASK] last year and it is one of my favorite states in the U.S.A.",
    "I visited two states called [ANSWER] and [MASK] last year and they are my two favorite states in the U.S.A."
)
quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.43570252507925034}
{'probs': [(0.19252209207465457, 'arizona'), (0.11820234007751466, 'texas'), (0.10717900748068498, 'oregon'), (0.10114625244209448, 'indiana'), (0.10052167784050628, 'florida'), (0.0904307397306346, 'california'), (0.08755574201564277, 'georgia'), (0.06912437443572696, 'wisconsin'), (0.06699298122438921, 'arkansas'), (0.0663247926781515, 'utah')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.020204
RMSE between LLM quinellas and Harville quinellas: 0.020440
RMSE between Skew and Harville quinellas: 0.000664
The Skew Normal model better predicts the actual quinella probabilities.


(0.02043978672031971, 0.02020359508723999)

In [15]:
prompt_pair_template = (
    "I visited the country called [MASK] last year and it is one of my favorite countries in REGION",
    "I visited two countries called [ANSWER] and [MASK] last year and they are my two favorite countries in REGION"
)
for region in ['Asia','Europe','the Americas','Africa','the Southern Hemisphere','the World']:
     prompt_pair = [ pp.replace('REGION',region) for pp in prompt_pair_template]
     quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.1737887911458916, 'vietnam'), (0.16538282901836396, 'myanmar'), (0.1207179720561625, 'cambodia'), (0.11862311323457679, 'burma'), (0.09436936278318456, 'bangladesh'), (0.0715250775055774, 'thailand'), (0.06873332038425296, 'pakistan'), (0.06643431411960922, 'singapore'), (0.06336770805747138, 'india'), (0.05705751169490962, 'nepal')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.022876
RMSE between LLM quinellas and Harville quinellas: 0.023320
RMSE between Skew and Harville quinellas: 0.000913
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.19640808758081216, 'poland'), (0.17165197096800303, 'slovenia'), (0.14913808179303678, 'luxembourg'), (0.08243369708777519, 'romania'), (0.0789737171802471, 'slovakia'), (0.07434079512972087, 'hungary'), (0.06540018934582313, 'estonia'), (0.06462851383624654, 'latvia'), (0.05890455746495978, 'croatia'), (0.058120389613375394, 'malta')]}
RMSE between LLM quinellas and Skew Normal quinel

In [19]:
prompt_pair_template = (
    "I learned the sport called [MASK] last year and it is one of my favorite forms of SOMETHING now",
    "I visited two sports called [ANSWER] and [MASK] last year and they are my two favorite forms of SOMETHING now"
)
for something in ['exercise','sport','competition']:
     prompt_pair = [ pp.replace('SOMETHING',something) for pp in prompt_pair_template]
     quinella_comparison(prompt_pair=prompt_pair)

{'prob': 0.7291152961552143}
{'probs': [(0.31490120061426197, 'yoga'), (0.18065156489540535, 'boxing'), (0.14635860359510716, 'cycling'), (0.14360732162218126, 'running'), (0.05646777836854684, 'swimming'), (0.05439274170284868, 'walking'), (0.03278993195584005, 'stretching'), (0.029526465246235936, 'spinning'), (0.02180299072252546, 'squats'), (0.019501401277047276, 'spin')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.032999
RMSE between LLM quinellas and Harville quinellas: 0.033499
RMSE between Skew and Harville quinellas: 0.002477
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.6585923079401255}
{'probs': [(0.4018063366440432, 'wrestling'), (0.24638855515006994, 'boxing'), (0.0618012006497455, 'cycling'), (0.051275469647236144, 'fencing'), (0.04742978934331837, 'football'), (0.04520924773221789, 'swimming'), (0.044353682343280666, 'skiing'), (0.03578242798223683, 'soccer'), (0.033566108906344067, 'tennis'), (0.03238718160150737, 'run

In [65]:
prompt_pair_template = (
    "I was offered a choice of hearts, diamonds, spades, clubs. I chose [MASK] because SOMETHING.",
    "I was offered two choices from hearts, diamonds, spades, clubs. I chose [ANSWER] and [MASK] because SOMETHING")

something_list = ['I do not remember why', 'hoped for good luck']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=4)


{'prob': 0.9210821837186813}
{'probs': [(0.553769116253453, 'hearts'), (0.25541818787988807, 'diamonds'), (0.15441832828774774, 'clubs'), (0.03639436757891115, 'heart')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.136686
RMSE between LLM quinellas and Harville quinellas: 0.129029
RMSE between Skew and Harville quinellas: 0.011703
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.9376174062490463}
{'probs': [(0.7971310838729567, 'hearts'), (0.1188191465368976, 'diamonds'), (0.04446832949371209, 'heart'), (0.03958144009643364, 'clubs')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.146891
RMSE between LLM quinellas and Harville quinellas: 0.122783
RMSE between Skew and Harville quinellas: 0.025056
The Harville model better predicts the actual quinella probabilities.


In [66]:
prompt_pair_template = (
    "My favourite color in the rainbow is [MASK] because it SOMETHING.",
    "My favourite two colors in the rainbow are [ANSWER] and [MASK] because they SOMETHING."
)

something_list = ['is vibrant', 'makes me happy', 'stands out']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=8)

{'prob': 0.9026298671960831}
{'probs': [(0.25003826282246155, 'orange'), (0.16230844135395997, 'yellow'), (0.16105576830104895, 'blue'), (0.13220900093793647, 'green'), (0.10527440069012578, 'purple'), (0.09369320059203164, 'red'), (0.07774213111085669, 'pink'), (0.01767879419157891, 'violet')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.016413
RMSE between LLM quinellas and Harville quinellas: 0.016230
RMSE between Skew and Harville quinellas: 0.001458
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.9205932281911373}
{'probs': [(0.2253067289860063, 'green'), (0.1918231882477209, 'blue'), (0.16908910312887312, 'purple'), (0.12433202101605244, 'pink'), (0.11932492047391594, 'yellow'), (0.0952419757532805, 'orange'), (0.049928714147919384, 'red'), (0.024953348246231424, 'black')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.013450
RMSE between LLM quinellas and Harville quinellas: 0.013529
RMSE between Skew and Harville quinellas:

In [72]:
prompt_pair_template = (
    "The best continent to visit is [MASK], and I SOMETHING.",
    "The two best continents to visit is [ANSWER] and [MASK], because I SOMETHING."
)

something_list = ['love the culture', 'enjoy the scenery', 'find it fascinating']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=5)


{'prob': 0.6127169504761696}
{'probs': [(0.6949973915177383, 'africa'), (0.16043386155924994, 'india'), (0.054173105286291795, 'china'), (0.046720538982488485, 'morocco'), (0.043675102654231475, 'ethiopia')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.141360
RMSE between LLM quinellas and Harville quinellas: 0.147879
RMSE between Skew and Harville quinellas: 0.015877
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.6130431368947029}
{'probs': [(0.7164325649521859, 'africa'), (0.09304626325774289, 'antarctica'), (0.06667865441093103, 'europe'), (0.06294732468659296, 'morocco'), (0.06089519269254722, 'india')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.127769
RMSE between LLM quinellas and Harville quinellas: 0.126264
RMSE between Skew and Harville quinellas: 0.005456
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.726015742868185}
{'probs': [(0.7075188022895179, 'africa'), (0.1379553249579027,

In [75]:
prompt_pair_template = (
    "My favourite animal in the zodiac is the [MASK] because it SOMETHING.",
    "My favourite two animals in the zodiac are the [ANSWER] and the [MASK] because they SOMETHING."
)

something_list = ['represents strength', 'symbolizes wisdom', 'brings good fortune']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)

{'prob': 0.8777751573361456}
{'probs': [(0.571967976851822, 'lion'), (0.14386123624081754, 'tiger'), (0.06407799489860677, 'elephant'), (0.04373666789665804, 'wolf'), (0.028869767056990056, 'horse'), (0.02238964744186679, 'eagle'), (0.018748635446625687, 'unicorn'), (0.013610346886008132, 'fox'), (0.012717807341780454, 'dragon'), (0.009497413235368173, 'unicorn'), (0.009142959529095253, 'donkey'), (0.009141154762024574, 'ram'), (0.008064879107765304, 'sun'), (0.007261621422354254, 'cat'), (0.00724283635416708, 'bull'), (0.006736737121305392, 'monkey'), (0.006661344860209085, 'lion'), (0.005968816690257966, 'camel'), (0.005442745126215518, 'snake'), (0.004859411730061931, 'rabbit')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.012909
RMSE between LLM quinellas and Harville quinellas: 0.011952
RMSE between Skew and Harville quinellas: 0.002988
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.8333255369216204}
{'probs': [(0.4232781433341249, 'li

In [76]:
prompt_pair_template = (
    "My favourite season is [MASK] because I SOMETHING.",
    "My favourite two seasons are [ANSWER] and [MASK] because I SOMETHING."
)

something_list = ['enjoy the weather', 'love the activities', 'feel most alive']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=4)

{'prob': 0.5817121863365173}
{'probs': [(0.2795416440434362, 'fall'), (0.2502103332841164, 'spring'), (0.23776893125075632, 'spring'), (0.2324790914216911, 'summer')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.144806
RMSE between LLM quinellas and Harville quinellas: 0.144823
RMSE between Skew and Harville quinellas: 0.000339
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.5516736805438995}
{'probs': [(0.5569230716040222, 'summer'), (0.18905682629181836, 'winter'), (0.13329403168272524, 'spring'), (0.12072607042143424, 'summer')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.171216
RMSE between LLM quinellas and Harville quinellas: 0.170714
RMSE between Skew and Harville quinellas: 0.005367
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.30970796197652817}
{'probs': [(0.2882184089370604, 'winter'), (0.27731283880535984, 'spring'), (0.21975967792187023, 'summer'), (0.21470907433570952, 'spring'

In [81]:
prompt_pair_template = (
    "Of the five, I rely most on my sense of [MASK], SOMETHING.",
    "Of the five, I rely most on my sense of [ANSWER] and my sense of [MASK], SOMETHING."
)

something_list = ['to get by', 'quite heavily']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=10)

{'prob': 0.812021978199482}
{'probs': [(0.5639103001901727, 'humor'), (0.17042629884671895, 'self'), (0.11818759072407636, 'humour'), (0.03366775548939011, 'smell'), (0.02861544892042679, 'direction'), (0.026344069693340293, 'purpose'), (0.018493396895219734, 'community'), (0.017234784057602072, 'identity'), (0.012596591708544847, 'taste'), (0.010523763474508074, 'routine')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.051315
RMSE between LLM quinellas and Harville quinellas: 0.055535
RMSE between Skew and Harville quinellas: 0.005921
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.7740780469030142}
{'probs': [(0.44746994036553545, 'smell'), (0.19806599589695834, 'humour'), (0.13678020544333863, 'humor'), (0.07168639716040993, 'self'), (0.05729597849956138, 'taste'), (0.0385350678592711, 'direction'), (0.015797606249985278, 'intuition'), (0.014851945938388029, 'place'), (0.010166449617796442, 'balance'), (0.0093504129687554, 'proportion')

In [84]:
prompt_pair_template = (
    "My favourite gemstone is [MASK] because it is SOMETHING.",
    "My favourite two gemstones are [ANSWER] and [MASK] because they are SOMETHING."
)

something_list = ['beautiful', 'sparkly', 'elegant']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=40)

{'prob': 0.5764846715610474}
{'probs': [(0.10664366764862812, 'quartz'), (0.10171585314773819, 'gold'), (0.08333058173237859, 'white'), (0.07890499365829402, 'blue'), (0.07349355059799727, 'silver'), (0.057756213558786296, 'black'), (0.04481966494556573, 'pink'), (0.04300427269807892, 'yellow'), (0.03698245060231963, 'orange'), (0.02800162250399403, 'red'), (0.026825803355708354, 'green'), (0.024605004416567412, 'this'), (0.023080648115210198, 'purple'), (0.015236926456998735, 'copper'), (0.015215058776916526, 'brown'), (0.014679531634276762, 'diamond'), (0.013945395681504996, 'titanium'), (0.013203574699614504, 'violet'), (0.012749578817653593, 'platinum'), (0.012342581646460389, 'coral'), (0.011481172596880359, 'quartz'), (0.011459381654002928, 'ivory'), (0.01106482156430447, 'glass'), (0.009851801026901395, 'ruby'), (0.009408941732754219, 'limestone'), (0.009282002240846736, 'simply'), (0.009009355760022682, 'just'), (0.008992256287517498, 'chocolate'), (0.008818583030138815, 'silve

In [78]:
prompt_pair_template = (
    "I learned the programming language called [MASK] last year and it is now my favorite SOMETHING.",
    "I learned the programming languages called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)

something_list = ['language', 'coding language', 'programming language']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=10)


{'prob': 0.7597371097654104}
{'probs': [(0.22781528880910304, 'rust'), (0.16638576634977545, 'scala'), (0.14707542649321614, 'python'), (0.12362194435969405, 'ruby'), (0.09638619012559198, 'haskell'), (0.07649844621867487, 'java'), (0.07150112292010444, 'swift'), (0.03548430717208023, 'scheme'), (0.032691856226460106, 'perl'), (0.02253965132529972, 'lua')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.027727
RMSE between LLM quinellas and Harville quinellas: 0.028633
RMSE between Skew and Harville quinellas: 0.001462
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.7033234219998121}
{'probs': [(0.1766142999025829, 'python'), (0.17565917850783427, 'rust'), (0.1420977559416898, 'scala'), (0.13396954172618797, 'ruby'), (0.10482129441888978, 'java'), (0.10416637915539136, 'swift'), (0.05898481084255477, 'haskell'), (0.039083519379115375, 'scheme'), (0.038904091156539304, 'perl'), (0.025699128969214485, 'lua')]}
RMSE between LLM quinellas and Sk

In [24]:
prompt_pair_template = (
    "I learned to play the musical instrument called [MASK] last year and it is now my favorite SOMETHING.",
    "I learned to play the musical instruments called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)
something_list = ['instrument', 'musical instrument', '']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'prob': 0.7032256666570902}
{'probs': [(0.31470805169038346, 'guitar'), (0.1966230538343205, 'trumpet'), (0.1582646120628809, 'violin'), (0.10118884841253371, 'piano'), (0.07132150710728007, 'whistle'), (0.06482666450849278, 'drums'), (0.03784971346387822, 'recorder'), (0.01993355139418417, 'keyboard'), (0.019763493225332465, 'bass'), (0.015520504300713719, 'string')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.029906
RMSE between LLM quinellas and Harville quinellas: 0.030765
RMSE between Skew and Harville quinellas: 0.002491
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.7062003659084439}
{'probs': [(0.3283645778729405, 'guitar'), (0.17740532185480656, 'violin'), (0.1615663684447692, 'trumpet'), (0.11894970252613626, 'piano'), (0.07700960214836779, 'drums'), (0.054873392161986154, 'whistle'), (0.028044555846790574, 'recorder'), (0.021989932281581966, 'keyboard'), (0.017940865079981754, 'bass'), (0.013855681782639212, 'string')]}
RMSE

In [88]:
prompt_pair_template = (
    "The most useful tool is a [MASK] because it is SOMETHING.",
    "The two most useful tools are a [ANSWER] and a [MASK] because they are SOMETHING."
)

tool_adjectives = [
    'versatile',
    'essential',
    'accurate',
    'durable',
    'reliable',
    'efficient',
    'portable',
    'powerful',
    'precise',
    'sturdy',
    'compact',
    'user-friendly',
    'lightweight',
    'robust'
]

something_list = tool_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)



{'prob': 0.7001342461444438}
{'probs': [(0.31100467115370994, 'calculator'), (0.14575147010428005, 'dictionary'), (0.10376935837151614, 'debugger'), (0.10128389559948296, 'spreadsheet'), (0.07854008619314011, 'tool'), (0.038794864351091786, 'timer'), (0.028815078456414258, 'database'), (0.025747895442465477, 'compiler'), (0.023983364261115005, 'generator'), (0.022979758635436935, 'parser'), (0.022384137743812614, 'map'), (0.01508061666833757, 'library'), (0.015056100977033103, 'program'), (0.012743775120290585, 'checklist'), (0.012039723840115405, 'file'), (0.008884116066447753, 'logger'), (0.008840337001099572, 'compass'), (0.008350615781655697, 'list'), (0.008170690161504898, 'filter'), (0.007779444071050158, 'browser')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.008942
RMSE between LLM quinellas and Harville quinellas: 0.009661
RMSE between Skew and Harville quinellas: 0.001269
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.56612818

In [89]:
prompt_pair_template = (
    "The greatest invention is the [MASK] because it is SOMETHING.",
    "The greatest two inventions are the [ANSWER] and a [MASK] because they are SOMETHING."
)

something_list = tool_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.6513024647720158}
{'probs': [(0.16882963180495825, 'automobile'), (0.16036472667548635, 'computer'), (0.12676292981799633, 'transistor'), (0.12372896604289127, 'telephone'), (0.06415718375376649, 'bicycle'), (0.05454871091876917, 'smartphone'), (0.04162006922571859, 'iphone'), (0.037588731813948055, 'internet'), (0.026504922624971602, 'refrigerator'), (0.025264004695380905, 'internet'), (0.024372180551064944, 'machine'), (0.022840195747077435, 'phone'), (0.022591805392028864, 'airplane'), (0.01934531126362476, 'printer'), (0.01599601884485561, 'camera'), (0.015619571762927754, 'wheel'), (0.014187482473822741, 'calculator'), (0.012338082745664232, 'car'), (0.011810707782937996, 'invention'), (0.011528766062108644, 'radio')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.012170
RMSE between LLM quinellas and Harville quinellas: 0.012408
RMSE between Skew and Harville quinellas: 0.000878
The Skew Normal model better predicts the actual quinella probabilities.
{'prob':

In [91]:
prompt_pair_template = (
    "The best kitchen appliance is the [MASK] because it is SOMETHING.",
    "The best two kitchedn appliances are the [ANSWER] and the [MASK] because they are SOMETHING."
)
something_list = tool_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=10)


{'prob': 0.8302618805319071}
{'probs': [(0.500565925305918, 'microwave'), (0.12030848605644018, 'blender'), (0.10278995168999151, 'refrigerator'), (0.1022235451655641, 'grill'), (0.043240684457119774, 'oven'), (0.03710350626282782, 'stove'), (0.03355310077880157, 'fridge'), (0.020449175279305783, 'appliance'), (0.020266520885950063, 'kitchen'), (0.019499104118081124, 'cooker')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.035508
RMSE between LLM quinellas and Harville quinellas: 0.036692
RMSE between Skew and Harville quinellas: 0.002887
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.5574994925409555}
{'probs': [(0.21939340384665007, 'best'), (0.19159534920436966, 'microwave'), (0.12759655434188977, 'smallest'), (0.11416027865945197, 'blender'), (0.07085196541199455, 'same'), (0.07074589310342928, 'refrigerator'), (0.05962704410999311, 'oven'), (0.050098217284593606, 'kitchen'), (0.047972073568062004, 'grill'), (0.04795922046956602, 'fri

In [94]:
prompt_pair_template = (
    "The best gardening tool is the [MASK] because it is SOMETHING.",
    "The best two gardening tools are the [ANSWER] and the [MASK] because they are SOMETHING."
)
something_list = tool_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=10)

{'prob': 0.35679337941110134}
{'probs': [(0.15213267404007666, 'knife'), (0.12716604187457475, 'tool'), (0.12546488489310215, 'tractor'), (0.12111886889484472, 'shovel'), (0.11187731506016693, 'rake'), (0.08607718592864297, 'router'), (0.0832621172444429, 'pot'), (0.08284833401388836, 'brush'), (0.06297097299230835, 'vacuum'), (0.04708160505795223, 'drill')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.034519
RMSE between LLM quinellas and Harville quinellas: 0.034588
RMSE between Skew and Harville quinellas: 0.000651
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.3358518397435546}
{'probs': [(0.1863755022635895, 'soil'), (0.15198966945001416, 'knife'), (0.13411609404581107, 'tractor'), (0.1332852651319699, 'shovel'), (0.09208640340074714, 'rake'), (0.08309306280734326, 'tool'), (0.07859901301090025, 'pot'), (0.04969640922166853, 'drill'), (0.0458145360158741, 'vacuum'), (0.0449440446520821, 'garden')]}
RMSE between LLM quinellas and Ske

In [54]:
prompt_pair_template = (
    "My favourite day of the week is called [MASK] because that day of the week reminds me SOMETHING.",
    "My favourite two days of the week are called [ANSWER] and [MASK] that those two days of the week remind me SOMETHING."
)

something_list = ['of fond memories', 'of having fun']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=7)



{'prob': 0.6697142235934734}
{'probs': [(0.21454757766349136, 'friday'), (0.20121731802339216, 'monday'), (0.1746813099079609, 'mondays'), (0.14064578474332504, 'thursday'), (0.12512539908000467, 'wednesday'), (0.07534575308097688, 'sunday'), (0.06843685750084899, 'saturday')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.048927
RMSE between LLM quinellas and Harville quinellas: 0.049615
RMSE between Skew and Harville quinellas: 0.001607
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.5596927478909492}
{'probs': [(0.2973044083767655, 'friday'), (0.17955819068698903, 'saturday'), (0.14404959730867217, 'that'), (0.12261725927460543, 'monday'), (0.10628370967151272, 'thursday'), (0.07990576337764953, 'wednesday'), (0.07028107130380554, 'sunday')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.051218
RMSE between LLM quinellas and Harville quinellas: 0.052032
RMSE between Skew and Harville quinellas: 0.001761
The Skew Normal model be

In [95]:
prompt_pair_template = (
    "My favorite primary color is [MASK] because it is SOMETHING.",
    "My favorite two primary colors are [ANSWER] and [MASK] because they are SOMETHING."
)

color_adjectives = [
    'bright',
    'bold',
    'vibrant',
    'warm',
    'cool',
    'soothing',
    'strong'
]

something_list = color_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.9462765378411859}
{'probs': [(0.17273980943597358, 'blue'), (0.1458640808332078, 'green'), (0.14190312076391826, 'orange'), (0.13349799924292266, 'yellow'), (0.08360468107273925, 'red'), (0.07438365232583036, 'white'), (0.0611858573050499, 'pink'), (0.04621312983459073, 'purple'), (0.044496552396249486, 'black'), (0.023918992022448862, 'brown'), (0.019978908740929873, 'gold'), (0.01833600435677263, 'silver'), (0.00784607496151132, 'gray'), (0.005927093955672109, 'violet'), (0.004747969301495546, 'peach'), (0.003735672936241265, 'grey'), (0.0037326957401727255, 'cream'), (0.003046839820011097, 'ivory'), (0.0024949107275247244, 'orange'), (0.002345954226737844, 'tan')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.006560
RMSE between LLM quinellas and Harville quinellas: 0.006520
RMSE between Skew and Harville quinellas: 0.000781
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.8999107910785824}
{'probs': [(0.1652338307255419, 'orange

In [53]:
prompt_pair_template = (
    "My favourite month of the year is [MASK] because I SOMETHING.",
    "My favourite two months of the year are [ANSWER] and [MASK] because I SOMETHING."
)

something_list = ['enjoy the weathr', 'get a lot done', 'have fond memories']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=12)



{'prob': 0.970632191747427}
{'probs': [(0.1659103451077605, 'may'), (0.10261308395927501, 'june'), (0.09697105678616785, 'august'), (0.09602062113528582, 'march'), (0.09434877888815286, 'october'), (0.08876256397017264, 'september'), (0.08279531968453373, 'april'), (0.08235495477401554, 'july'), (0.0498270067642327, 'february'), (0.049487178370221895, 'november'), (0.04596929070940688, 'december'), (0.044939799850774564, 'january')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.013099
RMSE between LLM quinellas and Harville quinellas: 0.013188
RMSE between Skew and Harville quinellas: 0.000576
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.9766253493726254}
{'probs': [(0.13355771922195056, 'december'), (0.13070496702766907, 'january'), (0.10198007009215346, 'march'), (0.09527006034397864, 'may'), (0.08676853991685128, 'october'), (0.08064810709299176, 'june'), (0.0745621340264772, 'april'), (0.06464221733532462, 'september'), (0.063758611

In [96]:
prompt_pair_template = (
    "The best day of the week is [MASK] because it is SOMETHING.",
    "The two best days of the week are [ANSWER] and [MASK] because they are SOMETHING."
)

day_adjectives = [
    'relaxing',
    'productive',
    'fun',
    'busy',
    'exciting',
    'stressful',
    'enjoyable',
    'hectic',
    'calm',
    'rewarding'
]

something_list = day_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.9729472977342084}
{'probs': [(0.22584706473596458, 'friday'), (0.17500975781052985, 'monday'), (0.17046660252913443, 'sunday'), (0.16553290234224616, 'saturday'), (0.06415901572634942, 'wednesday'), (0.06063422160151305, 'thursday'), (0.05219614283327947, 'tuesday'), (0.02328303905333387, 'mondays'), (0.013807458599300462, 'payday'), (0.011672631023035387, 'fridays'), (0.008343788018440514, 'sundays'), (0.007754979354246201, 'saturdays'), (0.006281701345259291, 'today'), (0.004147024052205399, 'always'), (0.003104271786888771, 'thanksgiving'), (0.0026608671549073295, 'everyday'), (0.002271203129192141, 'work'), (0.0011065181736651512, 'easter'), (0.0008669983232314636, 'working'), (0.0008538124072770759, 'just')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.009370
RMSE between LLM quinellas and Harville quinellas: 0.009178
RMSE between Skew and Harville quinellas: 0.001219
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.9423919253

In [52]:
prompt_pair_template = (
    "The most interesting chemical element has the name [MASK].",
    "The two most interesting chemical elements have the names [ANSWER] and [MASK]."
)

something_list = ['like the sound']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=50)

{'prob': 0.3795658331364393}
{'probs': [(0.11579575219299466, 'lithium'), (0.09582526030613091, 'water'), (0.0720433581951287, 'hydrogen'), (0.04146957479751664, 'oxygen'), (0.03368290190315331, 'methane'), (0.03240269410039673, 'carbon'), (0.024383122984585347, 'gold'), (0.024082898998029005, 'dna'), (0.02381726660141751, 'salt'), (0.023712130054563584, 'sulfur'), (0.022951110168472427, 'phosphorus'), (0.020419077185791362, 'graphene'), (0.0197395970725082, 'lsd'), (0.01953684442306059, 'helium'), (0.017535024843597456, 'lead'), (0.017095077627546917, 'iron'), (0.016986162456170715, 'arsenic'), (0.016125699723358065, 'uranium'), (0.015409806183306817, 'chlorine'), (0.015362177111933974, 'calcium'), (0.015005941150965594, 'nitrogen'), (0.014754963192639357, 'co'), (0.014496731011326034, 'c'), (0.01409304632242792, 'rust'), (0.014007845700407437, 'silicon'), (0.013964163328738542, 'z'), (0.012893155146814161, 'itself'), (0.012880921236510218, 'iodine'), (0.012409282647518298, 'cho'), (0

In [55]:
prompt_pair_template = (
    "I engaged in the vacation activity called [MASK] last summer and it is now my favorite SOMETHING.",
    "I engaged in the vacation activities called [ANSWER] and [MASK] last summer and they are now my two favorite SOMETHINGs."
)

something_list = ['activity', 'vacation activity', '']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.6315941195935011}
{'probs': [(0.42812472787941647, 'swimming'), (0.14368691703526382, 'surfing'), (0.08261757494431823, 'camping'), (0.07949105053636614, 'hiking'), (0.06531014918578808, 'skiing'), (0.05010741407766632, 'sailing'), (0.0488411574777003, 'yoga'), (0.0430763645231793, 'fishing'), (0.033104422749643085, 'meditation'), (0.025640221590658267, 'diving')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.023784
RMSE between LLM quinellas and Harville quinellas: 0.024442
RMSE between Skew and Harville quinellas: 0.002185
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.5109076499938965}
{'probs': [(0.29392189306837174, 'swimming'), (0.18406638229373065, 'surfing'), (0.08161915914726851, 'sailing'), (0.08071128544375804, 'yoga'), (0.07826499706939469, 'camping'), (0.07196480082850265, 'hiking'), (0.06175679784936502, 'meditation'), (0.05460636367762035, 'blogging'), (0.04788634393336699, 'skiing'), (0.04520197668862132, 'fishi

In [28]:
prompt_pair_template = (
    "I switched to the smartphone brand called [MASK] this year and it is now my favorite SOMETHING.",
    "I switched to the smartphone brands called [ANSWER] and [MASK] this year and they are now my two favorite SOMETHINGs."
)

something_list = ['smartphone brand', 'brand']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.7154721049591899}
{'probs': [(0.2930844866312661, 'xiaomi'), (0.15531123649427597, 'honor'), (0.12775639485817364, 'nokia'), (0.12639666034984054, 'oneplus'), (0.07624195213665853, 'huawei'), (0.07291735367401353, 'samsung'), (0.06670145275832139, 'motorola'), (0.03532356708714012, 'lg'), (0.03278034163110838, 'essential'), (0.013486554379201772, 'lenovo')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.025941
RMSE between LLM quinellas and Harville quinellas: 0.026609
RMSE between Skew and Harville quinellas: 0.001684
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.6884101834148169}
{'probs': [(0.25390568844685435, 'xiaomi'), (0.15295738853956556, 'honor'), (0.1487764235297814, 'nokia'), (0.13147140731958334, 'oneplus'), (0.08408810405261567, 'motorola'), (0.07404694472230158, 'samsung'), (0.06828638583225738, 'huawei'), (0.03737659045700754, 'essential'), (0.034095702120020575, 'lg'), (0.014995364980012589, 'blackberry')]}
RMSE

In [97]:
prompt_pair_template = (
    "The best mode of transportation is [MASK] because it is SOMETHING.",
    "The two best modes of transportation are [ANSWER] and [MASK] because they are SOMETHING."
)

transport_adjectives = [
    'fast',
    'efficient',
    'convenient',
    'affordable',
    'reliable',
    'comfortable',
    'eco-friendly',
    'versatile',
    'essential',
    'accessible'
]

something_list = transport_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.800025041680783}
{'probs': [(0.2593603197507496, 'cycling'), (0.19216904521577705, 'bicycle'), (0.11441489166820928, 'transit'), (0.09236855940714489, 'walking'), (0.07666961579160964, 'biking'), (0.03957732561160317, 'rail'), (0.0381996865330364, 'bus'), (0.030054962114852583, 'transportation'), (0.026711560519816442, 'bicycles'), (0.01690635273339669, 'driving'), (0.014834916352199677, 'transport'), (0.013543630292180012, 'bike'), (0.013302029495872914, 'buses'), (0.012605693254547733, 'uber'), (0.012369541688882987, 'electric'), (0.012137236365055952, 'train'), (0.009928311971678878, 'car'), (0.009194401533078935, 'underground'), (0.007859425588286196, 'automobile'), (0.00779249411202102, 'cars')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.014350
RMSE between LLM quinellas and Harville quinellas: 0.015152
RMSE between Skew and Harville quinellas: 0.001322
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.6656899466179311}
{'

In [98]:
prompt_pair_template = (
    "The most challenging language to learn is [MASK] because it is SOMETHING.",
    "The two most challenging languages to learn are [ANSWER] and [MASK] because they are SOMETHING."
)

language_adjectives = [
    'complex',
    'intricate',
    'unique',
    'difficult',
    'uncommon',
    'rich',
    'nuanced',
    'ancient',
    'expressive',
    'essential'
]

something_list = language_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.8598209386691451}
{'probs': [(0.2419088975472339, 'english'), (0.15717430692026563, 'chinese'), (0.08877156662090895, 'japanese'), (0.08410655785219186, 'spanish'), (0.06219077286672986, 'korean'), (0.057407461499151956, 'russian'), (0.04744182907567945, 'french'), (0.040595789175076724, 'greek'), (0.03545365547518774, 'mandarin'), (0.034800198798191385, 'german'), (0.02520611365604495, 'latin'), (0.022287295439390512, 'italian'), (0.01616240313881296, 'vietnamese'), (0.014607419351711875, 'portuguese'), (0.01370643165466549, 'hebrew'), (0.013153104338993906, 'python'), (0.012238990924194405, 'arabic'), (0.010995082748470664, 'klingon'), (0.010951056677849014, 'swedish'), (0.010841066239248758, 'dutch')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.008457
RMSE between LLM quinellas and Harville quinellas: 0.009148
RMSE between Skew and Harville quinellas: 0.000951
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.8178384946659207

In [29]:
prompt_pair_template = (
    "I brewed the tea variety called [MASK] yesterday and it is now my favorite SOMETHING.",
    "I brewed the tea varieties called [ANSWER] and [MASK] yesterday and they are now my two favorite SOMETHINGs."
)

something_list = ['tea', 'tea variety', 'favorite tea']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.09648891072720289}
{'probs': [(0.16504693309716784, 'cinnamon'), (0.13301019450190324, 'ginger'), (0.1060471991192556, 'cascade'), (0.10083864474231295, 'rose'), (0.09705181873020778, 'tea'), (0.09551536526340312, 'ginger'), (0.07966699876073531, 'clover'), (0.07584137122499662, 'peach'), (0.07572219166950785, 'vanilla'), (0.07125928289050966, 'vanilla')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.021880
RMSE between LLM quinellas and Harville quinellas: 0.021960
RMSE between Skew and Harville quinellas: 0.000563
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.0998301962390542}
{'probs': [(0.174924761309061, 'cinnamon'), (0.12780889690156222, 'ginger'), (0.1221456961977998, 'cascade'), (0.11385447870354555, 'tea'), (0.09130866162716073, 'rose'), (0.07780159550739008, 'clover'), (0.0776424602850196, 'vanilla'), (0.07507351514583799, 'peach'), (0.07041582925464199, 'ginger'), (0.06902410506798105, 'vanilla')]}
RMSE between LLM 

In [99]:
prompt_pair_template = (
    "The most important human organ is the [MASK] because it is SOMETHING.",
    "The two most important human organs are the [ANSWER] and the [MASK] because they are SOMETHING."
)

organ_adjectives = [
    'vital',
    'complex',
    'essential',
    'sensitive',
    'strong',
    'efficient',
    'delicate',
    'versatile',
    'powerful',
    'remarkable'
]

something_list = organ_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.968490767176263}
{'probs': [(0.5136260595670225, 'heart'), (0.30944774084312004, 'brain'), (0.05036030060399719, 'liver'), (0.03874833118923415, 'kidney'), (0.022174568206704038, 'organ'), (0.014354381378135874, 'eye'), (0.009996271377024206, 'ear'), (0.007718267818702663, 'lung'), (0.0064907796948654355, 'body'), (0.0045946177050162715, 'bladder'), (0.003334466387957748, 'mind'), (0.002981648511708063, 'soul'), (0.0029556303708087417, 'cell'), (0.002870826800146264, 'penis'), (0.002474356788826669, 'skull'), (0.0020755396972343115, 'blood'), (0.001581316905807179, 'thyroid'), (0.0015286811725427848, 'stomach'), (0.0014074962950158104, 'kidneys'), (0.0012787186861301675, 'prostate')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.021030
RMSE between LLM quinellas and Harville quinellas: 0.015773
RMSE between Skew and Harville quinellas: 0.005756
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.9821859890944324}
{'probs': [(0.76626450

In [30]:
prompt_pair_template = (
    "I ate the fruit called [MASK] this afternoon and it is now my favorite SOMETHING.",
    "I ate the fruits called [ANSWER] and [MASK] this afternoon and they are now my two favorite SOMETHINGs."
)

something_list = ['fruit', 'type of fruit', 'favorite fruit']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.2631769422441721}
{'probs': [(0.16551755620778225, 'mango'), (0.14393737453513394, 'pineapple'), (0.1328730998266746, 'apple'), (0.13285384891980295, 'banana'), (0.09005343506777044, 'orange'), (0.07505096978389689, 'pumpkin'), (0.06780742680877752, 'strawberry'), (0.06537321333271098, 'pear'), (0.0639128904957105, 'lemon'), (0.0626201850217399, 'grapes')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.023441
RMSE between LLM quinellas and Harville quinellas: 0.023067
RMSE between Skew and Harville quinellas: 0.000842
The Harville model better predicts the actual quinella probabilities.
{'prob': 0.2485560579225421}
{'probs': [(0.1717720440938174, 'mango'), (0.15037793332998234, 'pineapple'), (0.14376653228233646, 'banana'), (0.1224847845065573, 'apple'), (0.07613743498207008, 'strawberry'), (0.07598010881351965, 'orange'), (0.0682770167311929, 'grapes'), (0.0673325576310414, 'pumpkin'), (0.06207267859735935, 'pear'), (0.06179890903212314, 'lemon')]}
RMSE between LL

In [None]:
prompt_pair_template = (
    "I played the board game called [MASK] last weekend and it is now my favorite SOMETHING.",
    "I played the board games called [ANSWER] and [MASK] last weekend and they are now my two favorite SOMETHINGs."
)

something_list = ['board game', 'game', 'favorite board game']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.18014872053863223, 'solitaire'), (0.14387330036504362, 'risk'), (0.141436089053964, 'chess'), (0.1251084095848629, 'labyrinth'), (0.11078781341128204, 'cthulhu'), (0.07533666897853732, 'minecraft'), (0.07213013533745527, 'magic'), (0.0520308342274773, 'survivor'), (0.051515416145592595, 'journey'), (0.04763261235715275, 'pathfinder')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.318397
RMSE between LLM quinellas and Harville quinellas: 0.318429
RMSE between Skew and Harville quinellas: 0.001007
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.21528250885499112, 'risk'), (0.21089580387718687, 'solitaire'), (0.14691972499085737, 'chess'), (0.09138884652531944, 'labyrinth'), (0.08618000077499434, 'cthulhu'), (0.057190390092902055, 'magic'), (0.04879540922956983, 'pathfinder'), (0.04834397849477075, 'chess'), (0.047902211549789665, 'dice'), (0.04710112560961856, 'survivor')]}
RMSE between LLM quinellas and Skew Normal quinella

In [108]:
prompt_pair_template = (
    "The best file compression format is [MASK] because it is SOMETHING.",
    "The two best file compression formats are [ANSWER] and [MASK] because they are SOMETHING."
)

compression_adjectives = [
    'efficient',
    'fast',
    'widely-used',
    'reliable',
    'versatile',
    'secure',
    'flexible',
    'compact',
    'popular',
    'simple'
]

something_list = compression_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING',something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.625015867408365}
{'probs': [(0.25375046242251464, 'jpeg'), (0.10849393536832357, 'png'), (0.1008718392193903, 'fat'), (0.09675858538687448, 'json'), (0.0913690735390925, 'xml'), (0.08318027081509827, 'svg'), (0.06240498539069504, 'compressed'), (0.03918844212132298, 'csv'), (0.022014863672495156, 'mov'), (0.018601012356954764, 'ascii'), (0.014829171222441342, 'aac'), (0.014059191147283953, 'python'), (0.013827476522875937, ','), (0.013110182635878535, 'this'), (0.01287630292771588, 'aes'), (0.012661184795296267, 'java'), (0.012376466596052602, 'pdf'), (0.010865105518348114, 'compression'), (0.009827447779265733, 'doc'), (0.008934000562079882, 'raw')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.015070
RMSE between LLM quinellas and Harville quinellas: 0.015109
RMSE between Skew and Harville quinellas: 0.000884
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.6032590437680483}
{'probs': [(0.20939538833512783, 'jpeg'), (0.14543552

In [111]:
prompt_pair_template = (
    "The most reliable database is [MASK] because it is SOMETHING.",
    "The two most reliable databases are [ANSWER] and [MASK] because they are SOMETHING."
)

database_adjectives = [
    'robust',
    'scalable',
    'efficient',
    'secure',
    'versatile',
    'fast',
    'flexible',
    'popular',
    'stable',
    'powerful'
]

something_list = database_adjectives
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair, top_k=20)


{'prob': 0.8615010119974613}
{'probs': [(0.8394767117890969, 'mysql'), (0.040853193478330684, 'sql'), (0.02228943428871732, 'cassandra'), (0.015289383129277103, 'oracle'), (0.010334398269267187, 'relational'), (0.0098534147771788, 'immutable'), (0.009130073652945799, 'xml'), (0.007795294842370396, 'mongo'), (0.0065209449044395405, 'jenkins'), (0.005169197989442215, 'json'), (0.004215026971437634, 'simple'), (0.003953979938727759, 'prometheus'), (0.0038357658924297335, 'known'), (0.0038039596152714965, 'python'), (0.003642486956429164, 'this'), (0.00353047649494416, 'available'), (0.002826119086161029, 'reliable'), (0.002776949311857098, 'ruby'), (0.0024175357043791593, 'amazon'), (0.0022856529072968067, 'java')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.019821
RMSE between LLM quinellas and Harville quinellas: 0.022003
RMSE between Skew and Harville quinellas: 0.003554
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.8414492784067988}
{

In [32]:
prompt_pair_template = (
    "I bought clothes from the American fashion brand called [MASK] this season and it is now my favorite SOMETHING.",
    "I bought clothes from the American fashion brands called [ANSWER] and [MASK] this season and they are now my two favorite SOMETHINGs."
)
something_list = ['fashion brand', 'brand', 'clothing option']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'prob': 0.33633195143193007}
{'probs': [(0.26993113240007577, 'guess'), (0.16116632355226748, 'coach'), (0.15070151182768515, 'gap'), (0.08649240840821819, 'mac'), (0.07487993303576726, 'supreme'), (0.06341791164353949, 'adidas'), (0.06178505370821364, 'nike'), (0.04898453628874048, 'benefit'), (0.04860889699950807, 'diesel'), (0.034032292135984445, 'versus')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.027655
RMSE between LLM quinellas and Harville quinellas: 0.028713
RMSE between Skew and Harville quinellas: 0.001660
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.3634163774549961}
{'probs': [(0.27488883805200454, 'guess'), (0.20045599229660266, 'coach'), (0.1348978150434897, 'gap'), (0.08031843154181904, 'mac'), (0.07466322622806826, 'supreme'), (0.05620257433021012, 'diesel'), (0.05129796990771486, 'adidas'), (0.04867750927408384, 'benefit'), (0.045462781931876986, 'nike'), (0.033134861394130036, 'versus')]}
RMSE between LLM quinell

In [33]:
prompt_pair_template = (
    "I planted the flower called [MASK] this spring and it is now my favorite SOMETHING.",
    "I planted the flowers called [ANSWER] and [MASK] this spring and they are now my two favorite SOMETHINGs."
)

something_list = ['flower', 'type of flower', 'thing']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)



{'prob': 0.15694523323327303}
{'probs': [(0.23674427368017123, 'iris'), (0.15368748983829156, 'rose'), (0.11892663922839976, 'lily'), (0.10107148905010585, 'orange'), (0.07791660533072951, 'roses'), (0.07302101694245007, 'willow'), (0.06770247836232214, 'violet'), (0.06718625066692169, 'ivy'), (0.05577244647107658, 'yellow'), (0.0479713104295316, 'purple')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.026688
RMSE between LLM quinellas and Harville quinellas: 0.027231
RMSE between Skew and Harville quinellas: 0.001157
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.15492503624409437}
{'probs': [(0.24355152370138602, 'iris'), (0.17195497159651663, 'rose'), (0.10517101011147668, 'lily'), (0.09947027730911422, 'orange'), (0.08081201763340533, 'roses'), (0.0698827025566312, 'violet'), (0.061748124014398095, 'willow'), (0.059502887207511804, 'yellow'), (0.055677849955068394, 'ivy'), (0.05222863591449162, 'clover')]}
RMSE between LLM quinellas a

In [34]:
prompt_pair_template = (
    "I cooked the pasta called [MASK] last night and it is now my favorite SOMETHING.",
    "I cooked the pastas called [ANSWER] and [MASK] last night and they are now my two favorite SOMETHINGs."
)

something_list = ['pasta', 'type of pasta', 'food']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'prob': 0.1100191529840231}
{'probs': [(0.13890214287614094, 'spaghetti'), (0.1247219068174878, 'salmon'), (0.12297178230017386, 'spinach'), (0.11958182498744668, 'chicken'), (0.10859841510059165, 'this'), (0.08672798736289995, 'chicken'), (0.0861117623120531, 'rice'), (0.08378329404372388, 'mushroom'), (0.06489880156772881, 'kale'), (0.06370208263175332, 'pizza')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.022127
RMSE between LLM quinellas and Harville quinellas: 0.022285
RMSE between Skew and Harville quinellas: 0.000526
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.11125759454444051}
{'probs': [(0.14851822357720523, 'spaghetti'), (0.12220236768729653, 'spinach'), (0.1211850226849461, 'salmon'), (0.11810917256005381, 'chicken'), (0.0994349392430932, 'this'), (0.08684216622730426, 'mushroom'), (0.08494064835256201, 'rice'), (0.08049416076539696, 'chicken'), (0.0711266076329378, 'roma'), (0.0671466912692041, 'kale')]}
RMSE between LL

In [35]:
# The question must evoke an obvious finite set of choices ... say 10 roughly
prompt_pair_template = (
    "My favourite planet in the solar system is called [MASK] and I hope SOMETHING.",
    "My favourite two planets in the solar system are called [ANSWER] and [MASK] and I hope SOMETHING"
)

something_list = ['to visit', 'to view tonight', 'you agree']
# when subbed in the sentence should be gramatical
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'prob': 0.95700728520751}
{'probs': [(0.3707822575559922, 'venus'), (0.3424750002712201, 'pluto'), (0.14857316220011976, 'mars'), (0.05586722001767207, 'jupiter'), (0.023066732399958118, 'earth'), (0.015660153571023443, 'mercury'), (0.012561704905948483, 'saturn'), (0.010904684455385408, 'neptune'), (0.010370272858153739, 'ceres'), (0.009738811764526683, 'pandora')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.040022
RMSE between LLM quinellas and Harville quinellas: 0.045158
RMSE between Skew and Harville quinellas: 0.006048
The Skew Normal model better predicts the actual quinella probabilities.
{'prob': 0.9431076450273395}
{'probs': [(0.3928213503160743, 'pluto'), (0.3827107629324292, 'venus'), (0.0637801422351983, 'jupiter'), (0.05764980033562753, 'mars'), (0.02550365165500837, 'mercury'), (0.02220131779387423, 'ceres'), (0.015804528551749155, 'neptune'), (0.015083749380043752, 'saturn'), (0.013661435514641163, 'europa'), (0.010783261285353993, 'io')]}
RMSE between LLM

In [6]:
prompt_pair_template = (
    "When I go bowling I like to try to hit the pin number [MASK], and I hope SOMETHING.",
    "When I go bowling I like to try to hit the two pins numbered [ANSWER] and [MASK], and I hope SOMETHING"
)

something_list = ['that helps my score', 'to win', 'I do not miss']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.47241685683155266, 'one'), (0.08008466724016236, 'three'), (0.07723511929286091, '1'), (0.07580817371996354, 'five'), (0.0589791946348988, '10'), (0.05832720952020096, 'two'), (0.052688464280762216, '3'), (0.051532524733345964, 'six'), (0.03873512647860087, 'four'), (0.03419266326765175, '20')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.060681
RMSE between LLM quinellas and Harville quinellas: 0.060886
RMSE between Skew and Harville quinellas: 0.001063
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.6717796476490402, 'one'), (0.06669171111444643, '1'), (0.04749896646535182, 'two'), (0.04747288218721718, 'three'), (0.03467246574057959, 'five'), (0.028026662630477348, 'right'), (0.027841517367278742, '3'), (0.02678417194079787, '10'), (0.026054904191682254, 'six'), (0.023177070713128557, '20')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.080839
RMSE between LLM quinellas and Harville quinellas: 0.080610
RMSE 

# All together

In [None]:
import pandas as pd

# List of categories, templates, and substitutions exactly as provided
examples = [
    {
        "category": "Country Visits by Region",
        "prompt_pair_template": (
            "I visited the country called [MASK] last year and it is one of my favorite countries in SOMETHING",
            "I visited two countries called [ANSWER] and [MASK] last year and they are my two favorite countries in SOMETHING"
        ),
        "substitutions": ['Asia', 'Europe', 'the Americas', 'Africa', 'the Southern Hemisphere', 'the World']
    },
    {
        "category": "Favorite Sports",
        "prompt_pair_template": (
            "I learned the sport called [MASK] last year and it is one of my favorite forms of SOMETHING now",
            "I visited two sports called [ANSWER] and [MASK] last year and they are my two favorite forms of SOMETHING now"
        ),
        "substitutions": ['exercise', 'sport', 'relaxation', 'competition']
    },
    {
        "category": "Favorite Hobbies",
        "prompt_pair_template": (
            "I picked up the hobby called [MASK] last year and it is one of my favorite things to do SOMETHING.",
            "I engaged in two hobbies called [ANSWER] and [MASK] last year and they are my two favorite things to do SOMETHING."
        ),
        "substitutions": ['in the evening', 'when I am bored', 'with friends', 'in the evening']
    },
    {
        "category": "Favorite Ice Cream Flavors",
        "prompt_pair_template": (
            "I tried the icecream flavor called [MASK] last year and it is now my favourite SOMETHING.",
            "I tried the icecream flavors called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['treat', 'ice cream', 'guilty pleasure']
    },
    {
        "category": "Favorite Breakfast Cereals",
        "prompt_pair_template": (
            "I ate the breakfast cereal named [MASK] this morning and it is now my favorite SOMETHING.",
            "I ate the breakfast cereals named [ANSWER] and [MASK] this morning and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['meal choice', 'cereal', 'morning staple']
    },
    {
        "category": "Favorite Dog Breeds",
        "prompt_pair_template": (
            "I adopted the dog breed called [MASK] last year and it is now my favorite SOMETHING.",
            "I adopted the dog breeds called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['pet', '', 'canine']
    },
    {
        "category": "Favorite Programming Languages",
        "prompt_pair_template": (
            "I learned the programming language called [MASK] last year and it is now my favorite SOMETHING.",
            "I learned the programming languages called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['language', 'coding language', 'programming language']
    },
    {
        "category": "Favorite Musical Instruments",
        "prompt_pair_template": (
            "I learned to play the musical instrument called [MASK] last year and it is now my favorite SOMETHING.",
            "I learned to play the musical instruments called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['instrument', 'musical instrument', '']
    },
    {
        "category": "Favorite Vacation Activities",
        "prompt_pair_template": (
            "I engaged in the vacation activity called [MASK] last summer and it is now my favorite SOMETHING.",
            "I engaged in the vacation activities called [ANSWER] and [MASK] last summer and they are now my two favorite SOMETHINGs."
        ),
        "substitutions": ['activity', 'vacation activity', '']
    },
    # Continue to add the remaining examples similarly...
]


In [None]:
import pandas as pd

TOP_K = 4

# DataFrame to store accuracy results
results = []

# Loop through each example exactly as provided
for example in examples:
    category = example["category"]
    prompt_pair_template = example["prompt_pair_template"]
    substitutions = example["substitutions"]

    # Track accuracies for each model in the current category
    harville_accuracies = []
    skew_accuracies = []

    for substitution in substitutions:
        # Replace placeholders with current substitution in both prompt templates
        prompt_pair = [pp.replace("SOMETHING", substitution).replace("REGION", substitution) for pp in prompt_pair_template]

        # Run quinella_comparison for Harville and Skew models and capture accuracy
        try:
            harville_accuracy, skew_accuracy = quinella_comparison(prompt_pair=prompt_pair, top_k=TOP_K)

            if harville_accuracy is not None and skew_accuracy is not None:
                harville_accuracies.append(harville_accuracy)
                skew_accuracies.append(skew_accuracy)
        except Exception as e:
            print(e)

    # Calculate average accuracy for each model in the category
    if harville_accuracies and skew_accuracies:
        avg_harville_accuracy = sum(harville_accuracies) / len(harville_accuracies)
        avg_skew_accuracy = sum(skew_accuracies) / len(skew_accuracies)

        # Determine the winning model for the category (1 if Skew wins, 0 if Harville wins)
        model_win = 1 if avg_skew_accuracy < avg_harville_accuracy else 0

        # Append results to the DataFrame
        results.append({
            "Category": category,
            "Harville Accuracy": avg_harville_accuracy,
            "Skew Accuracy": avg_skew_accuracy,
            "Skew Win": model_win
        })

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Check if the required columns are present in the DataFrame before calculating overall metrics
if "Harville Accuracy" in results_df.columns and "Skew Accuracy" in results_df.columns:
    # Calculate overall average accuracies
    overall_harville_accuracy = results_df["Harville Accuracy"].mean()
    overall_skew_accuracy = results_df["Skew Accuracy"].mean()
    overall_model_win = 1 if overall_skew_accuracy > overall_harville_accuracy else 0

    # Add an overall summary row
    summary_row = pd.DataFrame([{
        "Category": "Overall",
        "Harville Accuracy": overall_harville_accuracy,
        "Skew Accuracy": overall_skew_accuracy,
        "Model Win": overall_model_win
    }])

# Display the DataFrame
print(results_df)




{'probs': [(0.30040617864945307, 'vietnam'), (0.28587588044119433, 'myanmar'), (0.2086695260412975, 'cambodia'), (0.20504841486805517, 'burma')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.081318
RMSE between LLM quinellas and Harville quinellas: 0.081727
RMSE between Skew and Harville quinellas: 0.001176
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.32754779736635087, 'poland'), (0.28626227003523325, 'slovenia'), (0.24871608290902272, 'luxembourg'), (0.13747384968939316, 'romania')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.051526
RMSE between LLM quinellas and Harville quinellas: 0.053074
RMSE between Skew and Harville quinellas: 0.003235
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.3442923651911901, 'venezuela'), (0.26691868637341254, 'haiti'), (0.2020611394032034, 'honduras'), (0.18672780903219396, 'guatemala')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.0845