<a href="https://colab.research.google.com/github/microprediction/winningnotebooks/blob/main/LLM_Quinellas_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers
!pip install winning
!pip install pandas
!pip install scipy

Collecting winning
  Downloading winning-1.0.3-py3-none-any.whl.metadata (6.7 kB)
Downloading winning-1.0.3-py3-none-any.whl (23 kB)
Installing collected packages: winning
Successfully installed winning-1.0.3


# Luce's Choice Axiom versus the Standard Normal Race model
The methodology is as follows.





## A contest model for choice

Luce is trivial. Let's just implement the second here using the `winning` package:;

# Quinella pricing

In [104]:
import numpy as np
from winning.lattice_conventions import STD_L, STD_A
from winning.lattice import skew_normal_density, densities_from_offsets, get_the_rest, _loser_of_two_pdf,\
    beats, winner_of_many, cdf_to_pdf
from winning.lattice_calibration import state_price_implied_ability


def compute_skew_normal_quinellas(p:[float], L=551, a=0):
    """ Produce quinella table, and also return densities

    :param p:  Vector of state prices
    :param L:  500 by default, half that is probably fine
    :return: quinellas, densities
    """

    # Calibration
    unit = 1.0
    density = skew_normal_density(L=L, unit=unit, loc=0, scale=50.0, a=a)
    offsets = state_price_implied_ability(prices=p, density=density)
    densities = densities_from_offsets(density, offsets)
    densityAll, multiplicityAll = winner_of_many(densities)

    n = len(p)
    quinellas = np.zeros(shape=(n, n))
    for h0 in range(n):
        density0 = densities[h0]
        cdfRest0, multiplicityRest0 = get_the_rest(density=density0, densityAll=densityAll,
                                                   multiplicityAll=multiplicityAll, cdf=None, cdfAll=None)
        for h1 in range(n):
            if h1 > h0:
                density1 = densities[h1]
                cdfRest01, multiplicityRest01 = get_the_rest(density=density1, densityAll=None,
                                                             multiplicityAll=multiplicityRest0, cdf=None,
                                                             cdfAll=cdfRest0)
                pdfRest01 = cdf_to_pdf(cdfRest01)
                loser01, loser_multiplicity01 = _loser_of_two_pdf(density0, density1)
                quinellas[h0, h1] = beats(loser01, loser_multiplicity01, pdfRest01, multiplicityRest01)
                quinellas[h1, h0] = quinellas[h0, h1]

    return quinellas

qins = compute_skew_normal_quinellas(p=[0.5,0.3,0.1,0.1,0.1,0.001,0.001,0.001,0.001,0.001,0.001])
qins[:4,:4]

array([[0.        , 0.33139024, 0.12176597, 0.12176597],
       [0.33139024, 0.        , 0.06868425, 0.06868425],
       [0.12176597, 0.06868425, 0.        , 0.02362145],
       [0.12176597, 0.06868425, 0.02362145, 0.        ]])

# Quinella pricing (Luce / Harville)

In [109]:
def compute_harville_quinellas(p):
    """
    Compute Harville Quinellas (joint probabilities for unordered pairs) from individual probabilities assuming independence.

    Args:
        p (list of float): List of individual probabilities for each state/event. Should sum to <=1.

    Returns:
        list of lists: Matrix-like structure where the element at [i][j] is the joint probability P({i, j}),
                       with diagonal elements being 1.0 to match the format requested.
    """
    from itertools import combinations

    n = len(p)
    quinella_matrix = [[0.0] * n for _ in range(n)]

    # Compute unnormalized joint probabilities
    for i, j in combinations(range(n), 2):
        if i != j:
          joint_prob = p[i] * p[j]
          quinella_matrix[i][j] = quinella_matrix[j][i] = joint_prob

    # Normalize the joint probabilities so that their sum equals the total probability of any two events occurring
    total_joint_prob = sum(sum(row) for row in quinella_matrix) / 2  # Divide by 2 to avoid double-counting pairs
    if total_joint_prob > 0:
        quinella_matrix = [
            [cell / total_joint_prob if i != j else 0.0 for j, cell in enumerate(row)]
            for i, row in enumerate(quinella_matrix)
        ]

    return quinella_matrix

# Example usage
result = compute_harville_quinellas(p=[0.5, 0.3, 0.2])
for row in result:
    print(row)




[0.0, 0.48387096774193544, 0.32258064516129037]
[0.48387096774193544, 0.0, 0.1935483870967742]
[0.32258064516129037, 0.1935483870967742, 0.0]


In [7]:
!pip install transformers numpy pandas scipy



# Missing word utility

In [110]:
def llm_quinellas(prompt_pair, top_k=10):
    """
    Receives a prompt pair like the following:
      - "I visited the state called [MASK] last year and it is one of my favorite states in the U.S.A."
      - "I visited two states called [ANSWER] and [MASK] last year and they are my two favorite states in the U.S.A."

    First, it will ask an LLM to fill in the missing token and extract the token probabilities.
    We will take the top 10 and create a list called 'names' (lowercase) and one called 'p' where the latter holds
    renormalized probabilities adding to unity.

    Second, for each i, name in enumerate(names), we will substitute into the second prompt.
    So if the name is 'arizona', we get something like:
      - "I visited two states called arizona and [MASK] last year and they are my two favorite states in the U.S.A."

    Eliminate any responses that are not in the set NAMES / {name}.
    Renormalize the token probabilities.

    This gives a way of assigning 'exacta' probabilities ex[i, :] where diagonals are zero.
    When we have done this for all names, we also add ex to its own transpose to get qu[:, :].

    Return this quinella probability table.
    """
    from transformers import pipeline
    import numpy as np

    # Initialize the fill-mask pipeline with a model that supports mask filling
    fill_mask = pipeline('fill-mask', model='roberta-base')

    # Unpack the prompt pair
    prompt_single, prompt_double = prompt_pair

    # Adjust the mask tokens for the model
    mask_token = fill_mask.tokenizer.mask_token  # This will be '<mask>' for roberta-base

    # Step 1: Get top 10 names and their probabilities from the first prompt
    prompt_single = prompt_single.replace('[MASK]', mask_token)

    # Get predictions
    results = fill_mask(prompt_single, top_k=top_k)

    # Extract tokens and their probabilities
    names = []
    probs = []
    for res in results:
        token_str = res['token_str'].strip().lower()
        names.append(token_str)
        probs.append(res['score'])

    # Normalize probabilities to sum to 1
    total_prob = sum(probs)
    p = [prob / total_prob for prob in probs]

    # Initialize the exacta probability matrix
    n = len(names)
    ex = np.zeros((n, n))

    # Step 2: For each name, get probabilities for the second [MASK]
    for i, name in enumerate(names):
        # Substitute the name into the second prompt
        prompt = prompt_double.replace('[ANSWER]', name)
        prompt = prompt.replace('[MASK]', mask_token)

        # Get predictions
        results = fill_mask(prompt, top_k=50)  # Increase top_k to ensure coverage

        # Extract tokens and their probabilities
        other_names = []
        probs_others = []
        for res in results:
            token_str = res['token_str'].strip().lower()
            other_names.append(token_str)
            probs_others.append(res['score'])

        # Filter out the current name and names not in the list
        allowed_names = set(names) - {name}
        filtered_probs = {}
        for other_name, prob in zip(other_names, probs_others):
            if other_name in allowed_names:
                filtered_probs[other_name] = prob

        # Renormalize probabilities
        total_prob = sum(filtered_probs.values())
        if total_prob > 0:
            filtered_probs = {k: v / total_prob for k, v in filtered_probs.items()}
        else:
            # If no allowed names are found, skip this iteration
            continue

        # Map other_name to index j
        name_to_index = {n: idx for idx, n in enumerate(names)}
        # Fill the exacta matrix
        for other_name, prob in filtered_probs.items():
            j = name_to_index[other_name]
            ex[i, j] = p[i]*prob

    # Zero out the diagonal
    np.fill_diagonal(ex, 0)

    # Compute the quinella probability table
    qu = ex + ex.T


    return p, qu, names


def quinella_comparison(prompt_pair, top_k=10):
    # Get LLM, normal model and harville implied quinellas
    # Compute several measures of discrepancy between LLM and estimated probabilities
    p, qu_llm, names = llm_quinellas(prompt_pair=prompt_pair,top_k=top_k)
    qu_normal = compute_skew_normal_quinellas(p)
    qu_harville = compute_harville_quinellas(p)
    srt = sorted(list(zip(p,names)), reverse=True)
    print({'probs':srt})

    # Compute RMSE between qu_llm and qu_normal
    rmse_normal = np.sqrt(np.mean((qu_llm - qu_normal) ** 2))

    # Compute RMSE between qu_llm and qu_harville
    rmse_harville = np.sqrt(np.mean((qu_llm - qu_harville) ** 2))

    # Compute RMSE between qu_llm and qu_harville
    rmse_diff = np.sqrt(np.mean((qu_harville - qu_normal) ** 2))


    # Print the RMSE values
    print(f"RMSE between LLM quinellas and Skew Normal quinellas: {rmse_normal:.6f}")
    print(f"RMSE between LLM quinellas and Harville quinellas: {rmse_harville:.6f}")
    print(f"RMSE between Skew and Harville quinellas: {rmse_diff:.6f}")
    if rmse_normal < rmse_harville:
        better_model = "Skew Normal"
    else:
        better_model = "Harville"

    print(f"The {better_model} model better predicts the actual quinella probabilities.")
    return rmse_harville, rmse_normal




# Example usage
prompt_pair = (
    "I visited the state called [MASK] last year and it is one of my favorite states in the U.S.A.",
    "I visited two states called [ANSWER] and [MASK] last year and they are my two favorite states in the U.S.A."
)
quinella_comparison(prompt_pair=prompt_pair)




{'probs': [(0.19252209207465457, 'arizona'), (0.11820234007751466, 'texas'), (0.10717900748068498, 'oregon'), (0.10114625244209448, 'indiana'), (0.10052167784050628, 'florida'), (0.0904307397306346, 'california'), (0.08755574201564277, 'georgia'), (0.06912437443572696, 'wisconsin'), (0.06699298122438921, 'arkansas'), (0.0663247926781515, 'utah')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.016568
RMSE between LLM quinellas and Harville quinellas: 0.016598
RMSE between Skew and Harville quinellas: 0.000664
The Skew Normal model better predicts the actual quinella probabilities.


(0.016598399026136686, 0.016568333671103393)

In [35]:
prompt_pair_template = (
    "I visited the country called [MASK] last year and it is one of my favorite countries in REGION",
    "I visited two countries called [ANSWER] and [MASK] last year and they are my two favorite countries in REGION"
)
for region in ['Asia','Europe','the Americas','Africa','the Southern Hemisphere','the World']:
     prompt_pair = [ pp.replace('REGION',region) for pp in prompt_pair_template]
     quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.1737887911458916, 'vietnam'), (0.16538282901836396, 'myanmar'), (0.1207179720561625, 'cambodia'), (0.11862311323457679, 'burma'), (0.09436936278318456, 'bangladesh'), (0.0715250775055774, 'thailand'), (0.06873332038425296, 'pakistan'), (0.06643431411960922, 'singapore'), (0.06336770805747138, 'india'), (0.05705751169490962, 'nepal')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316677
RMSE between LLM quinellas and Harville quinellas: 0.316693
RMSE between Skew and Harville quinellas: 0.000913
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.19640808758081216, 'poland'), (0.17165197096800303, 'slovenia'), (0.14913808179303678, 'luxembourg'), (0.08243369708777519, 'romania'), (0.0789737171802471, 'slovakia'), (0.07434079512972087, 'hungary'), (0.06540018934582313, 'estonia'), (0.06462851383624654, 'latvia'), (0.05890455746495978, 'croatia'), (0.058120389613375394, 'malta')]}
RMSE between LLM quinellas and Skew Normal quinel

In [41]:
prompt_pair_template = (
    "I learned the sport called [MASK] last year and it is one of my favorite forms of SOMETHING now",
    "I visited two sports called [ANSWER] and [MASK] last year and they are my two favorite forms of SOMETHING now"
)
for something in ['exercise','sport','relaxation','competition']:
     prompt_pair = [ pp.replace('SOMETHING',something) for pp in prompt_pair_template]
     quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.31490120061426197, 'yoga'), (0.18065156489540535, 'boxing'), (0.14635860359510716, 'cycling'), (0.14360732162218126, 'running'), (0.05646777836854684, 'swimming'), (0.05439274170284868, 'walking'), (0.03278993195584005, 'stretching'), (0.029526465246235936, 'spinning'), (0.02180299072252546, 'squats'), (0.019501401277047276, 'spin')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317598
RMSE between LLM quinellas and Harville quinellas: 0.317608
RMSE between Skew and Harville quinellas: 0.002477
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.4018063366440432, 'wrestling'), (0.24638855515006994, 'boxing'), (0.0618012006497455, 'cycling'), (0.051275469647236144, 'fencing'), (0.04742978934331837, 'football'), (0.04520924773221789, 'swimming'), (0.044353682343280666, 'skiing'), (0.03578242798223683, 'soccer'), (0.033566108906344067, 'tennis'), (0.03238718160150737, 'running')]}
RMSE between LLM quinellas and Skew Normal quinel

In [42]:
prompt_pair_template = (
    "I picked up the hobby called [MASK] last year and it is one of my favorite things to do SOMETHING.",
    "I engaged in two hobbies called [ANSWER] and [MASK] last year and they are my two favorite things to do SOMETHING."
)

something_list = ['in the evening', 'when I am bored', 'with friends', 'in the evening']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.5146929159055458, 'knitting'), (0.11514992986231744, 'crochet'), (0.08424428440851646, 'reading'), (0.06389407604043783, 'mahjong'), (0.04798369194212889, 'writing'), (0.04160939473145375, 'spinning'), (0.0381807422413647, 'blogging'), (0.03316972615048867, 'gardening'), (0.0329615081370917, 'sewing'), (0.028113730580654823, 'painting')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.318814
RMSE between LLM quinellas and Harville quinellas: 0.318647
RMSE between Skew and Harville quinellas: 0.002412
The Harville model better predicts the actual quinella probabilities.
{'probs': [(0.4568313545684415, 'knitting'), (0.10144079908971584, 'reading'), (0.0821296751503417, 'writing'), (0.07907408111045944, 'crochet'), (0.06998096693982932, 'blogging'), (0.0668316732172794, 'sewing'), (0.04593238238365506, 'gardening'), (0.043453621813048626, 'painting'), (0.030617764099784502, 'photography'), (0.02370768162744463, 'crafting')]}
RMSE between LLM quinellas and Skew Norma

In [46]:
prompt_pair_template = (
    "I tried the icecream flavor [MASK] last year and it is now my favourite SOMETHING.",
    "I tried the icecream flavors [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)

something_list = ['treat','ice cream','guilty pleasure']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.2133749773068466, 'early'), (0.13383492024662264, 'myself'), (0.11598872212207405, 'only'), (0.10526544703089195, 'again'), (0.1050621550101597, 'late'), (0.09510647601445896, 'just'), (0.06854281821834551, 'first'), (0.05663668872459873, 'sometime'), (0.05566267078174887, 'here'), (0.050525124544253, 'earlier')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.320003
RMSE between LLM quinellas and Harville quinellas: 0.319989
RMSE between Skew and Harville quinellas: 0.000949
The Harville model better predicts the actual quinella probabilities.
{'probs': [(0.16506598364406855, 'early'), (0.1640938064173155, 'again'), (0.10306156704557352, 'from'), (0.10132729818608005, 'just'), (0.09499701806178064, 'only'), (0.0910325538942743, 'myself'), (0.0869534678126535, 'late'), (0.08280842500673945, 'here'), (0.06083962215744039, 'first'), (0.04982025777407413, 'earlier')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.319021
RMSE between LLM quinellas and Harvi

In [48]:
prompt_pair_template = (
    "I ate the breakfast cereal named [MASK] this morning and it is now my favorite SOMETHING.",
    "I ate the breakfast cereals names [ANSWER] and [MASK] this morning and they are now my two favorite SOMETHING."
)

something_list = ['meal choice', 'cereal', 'morning staple']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)



{'probs': [(0.17706169800291022, 'luna'), (0.17312757402888138, 'milk'), (0.10521818995674967, 'vanilla'), (0.09176602172492136, 'cereal'), (0.08727162950459705, 'ginger'), (0.0854854330764298, 'cinnamon'), (0.08139530574158636, 'breakfast'), (0.07314111607072425, 'crunch'), (0.06415594080789408, 'hazel'), (0.061377091085305834, 'honey')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317973
RMSE between LLM quinellas and Harville quinellas: 0.317982
RMSE between Skew and Harville quinellas: 0.000849
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.18637291094472544, 'milk'), (0.18027190611298588, 'luna'), (0.11898745726169162, 'crunch'), (0.09286304188047312, 'ginger'), (0.08771610856815791, 'vanilla'), (0.07709086470826751, 'cereal'), (0.06668566364495387, 'earl'), (0.06491757113147253, 'hazel'), (0.06263104472290311, 'unicorn'), (0.062463431024369, 'cinnamon')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.318264
RMSE betwee

In [51]:
prompt_pair_template = (
    "I adopted the dog breed called [MASK] last year and it is now my favorite SOMETHING.",
    "I adopted the dog breeds called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)

something_list = ['pet', '', 'canine']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.12095787553907486, 'pug'), (0.11599636839579089, 'bengal'), (0.11329505847934697, 'daisy'), (0.10031031655923951, 'angus'), (0.09889783510897704, 'lab'), (0.0975256130259706, 'lucky'), (0.09459072167531839, 'newfoundland'), (0.08755681541245336, 'dexter'), (0.08713864119954189, 'leopard'), (0.0837307546042865, 'labrador')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317136
RMSE between LLM quinellas and Harville quinellas: 0.317142
RMSE between Skew and Harville quinellas: 0.000247
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.1753027946760353, 'lab'), (0.11660857360273207, 'bengal'), (0.10415608530247529, 'angus'), (0.09802145537953076, 'chow'), (0.09550837811980555, 'newfoundland'), (0.09056447059676653, 'dexter'), (0.08961184186003865, 'pug'), (0.08258612234936483, 'hound'), (0.07531745617078227, 'leopard'), (0.07232282194246875, 'labrador')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317080
RMSE betwe

In [52]:
prompt_pair_template = (
    "I learned the programming language called [MASK] last year and it is now my favorite SOMETHING.",
    "I learned the programming languages called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)

something_list = ['language', 'coding language', 'programming language']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.22781528880910304, 'rust'), (0.16638576634977545, 'scala'), (0.14707542649321614, 'python'), (0.12362194435969405, 'ruby'), (0.09638619012559198, 'haskell'), (0.07649844621867487, 'java'), (0.07150112292010444, 'swift'), (0.03548430717208023, 'scheme'), (0.032691856226460106, 'perl'), (0.02253965132529972, 'lua')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316909
RMSE between LLM quinellas and Harville quinellas: 0.316912
RMSE between Skew and Harville quinellas: 0.001462
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.1766142999025829, 'python'), (0.17565917850783427, 'rust'), (0.1420977559416898, 'scala'), (0.13396954172618797, 'ruby'), (0.10482129441888978, 'java'), (0.10416637915539136, 'swift'), (0.05898481084255477, 'haskell'), (0.039083519379115375, 'scheme'), (0.038904091156539304, 'perl'), (0.025699128969214485, 'lua')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316961
RMSE between LLM quinellas a

In [53]:
prompt_pair_template = (
    "I learned to play the musical instrument called [MASK] last year and it is now my favorite SOMETHING.",
    "I learned to play the musical instruments called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
)
something_list = ['instrument', 'musical instrument', '']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)



{'probs': [(0.31470805169038346, 'guitar'), (0.1966230538343205, 'trumpet'), (0.1582646120628809, 'violin'), (0.10118884841253371, 'piano'), (0.07132150710728007, 'whistle'), (0.06482666450849278, 'drums'), (0.03784971346387822, 'recorder'), (0.01993355139418417, 'keyboard'), (0.019763493225332465, 'bass'), (0.015520504300713719, 'string')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317318
RMSE between LLM quinellas and Harville quinellas: 0.317425
RMSE between Skew and Harville quinellas: 0.002491
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.3283645778729405, 'guitar'), (0.17740532185480656, 'violin'), (0.1615663684447692, 'trumpet'), (0.11894970252613626, 'piano'), (0.07700960214836779, 'drums'), (0.054873392161986154, 'whistle'), (0.028044555846790574, 'recorder'), (0.021989932281581966, 'keyboard'), (0.017940865079981754, 'bass'), (0.013855681782639212, 'string')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317434

In [55]:
prompt_pair_template = (
    "I engaged in the vacation activity called [MASK] last summer and it is now my favorite SOMETHING.",
    "I engaged in the vacation activities called [ANSWER] and [MASK] last summer and they are now my two favorite SOMETHINGs."
)

something_list = ['activity', 'vacation activity', '']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.42812472787941647, 'swimming'), (0.14368691703526382, 'surfing'), (0.08261757494431823, 'camping'), (0.07949105053636614, 'hiking'), (0.06531014918578808, 'skiing'), (0.05010741407766632, 'sailing'), (0.0488411574777003, 'yoga'), (0.0430763645231793, 'fishing'), (0.033104422749643085, 'meditation'), (0.025640221590658267, 'diving')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317087
RMSE between LLM quinellas and Harville quinellas: 0.317140
RMSE between Skew and Harville quinellas: 0.002185
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.29392189306837174, 'swimming'), (0.18406638229373065, 'surfing'), (0.08161915914726851, 'sailing'), (0.08071128544375804, 'yoga'), (0.07826499706939469, 'camping'), (0.07196480082850265, 'hiking'), (0.06175679784936502, 'meditation'), (0.05460636367762035, 'blogging'), (0.04788634393336699, 'skiing'), (0.04520197668862132, 'fishing')]}
RMSE between LLM quinellas and Skew Normal quinella

In [56]:
prompt_pair_template = (
    "I enjoyed the dessert called [MASK] yesterday and it is now my favorite SOMETHING.",
    "I enjoyed the desserts called [ANSWER] and [MASK] yesterday and they are now my two favorite SOMETHINGs."
)

something_list = ['dessert', 'sweet treat', 'favorite dessert']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.13280715076454774, 'cake'), (0.1279075883264226, 'strawberries'), (0.10766821968627073, 'chocolate'), (0.10593521314532785, 'lemon'), (0.10573294805547215, 'cake'), (0.0930621201827402, 'pineapple'), (0.09306122656015185, 'strawberry'), (0.08705346147347914, 'peach'), (0.07365569290184855, 'caramel'), (0.07311637890373922, 'this')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317273
RMSE between LLM quinellas and Harville quinellas: 0.317278
RMSE between Skew and Harville quinellas: 0.000391
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.14637462854963879, 'strawberries'), (0.14298768902995632, 'chocolate'), (0.12506623960593147, 'cake'), (0.09956193989265454, 'pudding'), (0.09594338788422678, 'caramel'), (0.0882824716594701, 'pie'), (0.07798349851637075, 'lemon'), (0.07666056209619111, 'apples'), (0.07658025066561439, 'strawberry'), (0.07055933209994575, 'cake')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.

In [61]:
prompt_pair_template = (
    "I switched to the smartphone brand called [MASK] this year and it is now my favorite SOMETHING.",
    "I switched to the smartphone brands called [ANSWER] and [MASK] this year and they are now my two favorite SOMETHINGs."
)

something_list = ['smartphone brand', 'brand']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.2930844866312661, 'xiaomi'), (0.15531123649427597, 'honor'), (0.12775639485817364, 'nokia'), (0.12639666034984054, 'oneplus'), (0.07624195213665853, 'huawei'), (0.07291735367401353, 'samsung'), (0.06670145275832139, 'motorola'), (0.03532356708714012, 'lg'), (0.03278034163110838, 'essential'), (0.013486554379201772, 'lenovo')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317112
RMSE between LLM quinellas and Harville quinellas: 0.317157
RMSE between Skew and Harville quinellas: 0.001684
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.25390568844685435, 'xiaomi'), (0.15295738853956556, 'honor'), (0.1487764235297814, 'nokia'), (0.13147140731958334, 'oneplus'), (0.08408810405261567, 'motorola'), (0.07404694472230158, 'samsung'), (0.06828638583225738, 'huawei'), (0.03737659045700754, 'essential'), (0.034095702120020575, 'lg'), (0.014995364980012589, 'blackberry')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316833

In [62]:
prompt_pair_template = (
    "I brewed the tea variety called [MASK] yesterday and it is now my favorite SOMETHING.",
    "I brewed the tea varieties called [ANSWER] and [MASK] yesterday and they are now my two favorite SOMETHINGs."
)

something_list = ['tea', 'tea variety', 'favorite tea']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.16504693309716784, 'cinnamon'), (0.13301019450190324, 'ginger'), (0.1060471991192556, 'cascade'), (0.10083864474231295, 'rose'), (0.09705181873020778, 'tea'), (0.09551536526340312, 'ginger'), (0.07966699876073531, 'clover'), (0.07584137122499662, 'peach'), (0.07572219166950785, 'vanilla'), (0.07125928289050966, 'vanilla')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317094
RMSE between LLM quinellas and Harville quinellas: 0.317104
RMSE between Skew and Harville quinellas: 0.000563
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.174924761309061, 'cinnamon'), (0.12780889690156222, 'ginger'), (0.1221456961977998, 'cascade'), (0.11385447870354555, 'tea'), (0.09130866162716073, 'rose'), (0.07780159550739008, 'clover'), (0.0776424602850196, 'vanilla'), (0.07507351514583799, 'peach'), (0.07041582925464199, 'ginger'), (0.06902410506798105, 'vanilla')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317397
RMSE between 

In [63]:
prompt_pair_template = (
    "I ate the fruit called [MASK] this afternoon and it is now my favorite SOMETHING.",
    "I ate the fruits called [ANSWER] and [MASK] this afternoon and they are now my two favorite SOMETHINGs."
)

something_list = ['fruit', 'type of fruit', 'favorite fruit']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.16551755620778225, 'mango'), (0.14393737453513394, 'pineapple'), (0.1328730998266746, 'apple'), (0.13285384891980295, 'banana'), (0.09005343506777044, 'orange'), (0.07505096978389689, 'pumpkin'), (0.06780742680877752, 'strawberry'), (0.06537321333271098, 'pear'), (0.0639128904957105, 'lemon'), (0.0626201850217399, 'grapes')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317042
RMSE between LLM quinellas and Harville quinellas: 0.317016
RMSE between Skew and Harville quinellas: 0.000842
The Harville model better predicts the actual quinella probabilities.
{'probs': [(0.1717720440938174, 'mango'), (0.15037793332998234, 'pineapple'), (0.14376653228233646, 'banana'), (0.1224847845065573, 'apple'), (0.07613743498207008, 'strawberry'), (0.07598010881351965, 'orange'), (0.0682770167311929, 'grapes'), (0.0673325576310414, 'pumpkin'), (0.06207267859735935, 'pear'), (0.06179890903212314, 'lemon')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317084
RMSE betwe

In [67]:
prompt_pair_template = (
    "I played the board game called [MASK] last weekend and it is now my favorite SOMETHING.",
    "I played the board games called [ANSWER] and [MASK] last weekend and they are now my two favorite SOMETHINGs."
)

something_list = ['board game', 'game', 'favorite board game']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.18014872053863223, 'solitaire'), (0.14387330036504362, 'risk'), (0.141436089053964, 'chess'), (0.1251084095848629, 'labyrinth'), (0.11078781341128204, 'cthulhu'), (0.07533666897853732, 'minecraft'), (0.07213013533745527, 'magic'), (0.0520308342274773, 'survivor'), (0.051515416145592595, 'journey'), (0.04763261235715275, 'pathfinder')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.318397
RMSE between LLM quinellas and Harville quinellas: 0.318429
RMSE between Skew and Harville quinellas: 0.001007
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.21528250885499112, 'risk'), (0.21089580387718687, 'solitaire'), (0.14691972499085737, 'chess'), (0.09138884652531944, 'labyrinth'), (0.08618000077499434, 'cthulhu'), (0.057190390092902055, 'magic'), (0.04879540922956983, 'pathfinder'), (0.04834397849477075, 'chess'), (0.047902211549789665, 'dice'), (0.04710112560961856, 'survivor')]}
RMSE between LLM quinellas and Skew Normal quinella

In [68]:
prompt_pair_template = (
    "I played the card game called [MASK] last night and it is now my favorite SOMETHING.",
    "I played the card games called [ANSWER] and [MASK] last night and they are now my two favorite SOMETHINGs."
)

something_list = ['card game', 'card game', 'favorite card game']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.25152988263453235, 'solitaire'), (0.21770671812832698, 'risk'), (0.10639147140573055, 'magic'), (0.08627634458255548, 'dice'), (0.07774722073480218, 'chess'), (0.062304934765574416, 'poker'), (0.055881644631092714, 'dice'), (0.04953053818973358, 'fish'), (0.048008434867355385, 'werewolf'), (0.04462281006029635, 'hearts')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.319386
RMSE between LLM quinellas and Harville quinellas: 0.319511
RMSE between Skew and Harville quinellas: 0.001848
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.25152988263453235, 'solitaire'), (0.21770671812832698, 'risk'), (0.10639147140573055, 'magic'), (0.08627634458255548, 'dice'), (0.07774722073480218, 'chess'), (0.062304934765574416, 'poker'), (0.055881644631092714, 'dice'), (0.04953053818973358, 'fish'), (0.048008434867355385, 'werewolf'), (0.04462281006029635, 'hearts')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.319386
RMSE between

In [69]:
prompt_pair_template = (
    "I bought clothes from the American fashion brand called [MASK] this season and it is now my favorite SOMETHING.",
    "I bought clothes from the American fashion brands called [ANSWER] and [MASK] this season and they are now my two favorite SOMETHINGs."
)
something_list = ['fashion brand', 'brand', 'clothing option']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.26993113240007577, 'guess'), (0.16116632355226748, 'coach'), (0.15070151182768515, 'gap'), (0.08649240840821819, 'mac'), (0.07487993303576726, 'supreme'), (0.06341791164353949, 'adidas'), (0.06178505370821364, 'nike'), (0.04898453628874048, 'benefit'), (0.04860889699950807, 'diesel'), (0.034032292135984445, 'versus')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316845
RMSE between LLM quinellas and Harville quinellas: 0.316843
RMSE between Skew and Harville quinellas: 0.001660
The Harville model better predicts the actual quinella probabilities.
{'probs': [(0.27488883805200454, 'guess'), (0.20045599229660266, 'coach'), (0.1348978150434897, 'gap'), (0.08031843154181904, 'mac'), (0.07466322622806826, 'supreme'), (0.05620257433021012, 'diesel'), (0.05129796990771486, 'adidas'), (0.04867750927408384, 'benefit'), (0.045462781931876986, 'nike'), (0.033134861394130036, 'versus')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.316924
RMSE between LLM quinel

In [70]:
prompt_pair_template = (
    "I planted the flower called [MASK] this spring and it is now my favorite SOMETHING.",
    "I planted the flowers called [ANSWER] and [MASK] this spring and they are now my two favorite SOMETHINGs."
)

something_list = ['flower', 'type of flower', 'thing']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)



{'probs': [(0.23674427368017123, 'iris'), (0.15368748983829156, 'rose'), (0.11892663922839976, 'lily'), (0.10107148905010585, 'orange'), (0.07791660533072951, 'roses'), (0.07302101694245007, 'willow'), (0.06770247836232214, 'violet'), (0.06718625066692169, 'ivy'), (0.05577244647107658, 'yellow'), (0.0479713104295316, 'purple')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317216
RMSE between LLM quinellas and Harville quinellas: 0.317236
RMSE between Skew and Harville quinellas: 0.001157
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.24355152370138602, 'iris'), (0.17195497159651663, 'rose'), (0.10517101011147668, 'lily'), (0.09947027730911422, 'orange'), (0.08081201763340533, 'roses'), (0.0698827025566312, 'violet'), (0.061748124014398095, 'willow'), (0.059502887207511804, 'yellow'), (0.055677849955068394, 'ivy'), (0.05222863591449162, 'clover')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317593
RMSE between LLM quinella

In [71]:
prompt_pair_template = (
    "I cooked the pasta called [MASK] last night and it is now my favorite SOMETHING.",
    "I cooked the pastas called [ANSWER] and [MASK] last night and they are now my two favorite SOMETHINGs."
)

something_list = ['pasta', 'type of pasta', 'food']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)


{'probs': [(0.13890214287614094, 'spaghetti'), (0.1247219068174878, 'salmon'), (0.12297178230017386, 'spinach'), (0.11958182498744668, 'chicken'), (0.10859841510059165, 'this'), (0.08672798736289995, 'chicken'), (0.0861117623120531, 'rice'), (0.08378329404372388, 'mushroom'), (0.06489880156772881, 'kale'), (0.06370208263175332, 'pizza')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317555
RMSE between LLM quinellas and Harville quinellas: 0.317569
RMSE between Skew and Harville quinellas: 0.000526
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.14851822357720523, 'spaghetti'), (0.12220236768729653, 'spinach'), (0.1211850226849461, 'salmon'), (0.11810917256005381, 'chicken'), (0.0994349392430932, 'this'), (0.08684216622730426, 'mushroom'), (0.08494064835256201, 'rice'), (0.08049416076539696, 'chicken'), (0.0711266076329378, 'roma'), (0.0671466912692041, 'kale')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317557
RMSE betwee

In [72]:
prompt_pair_template = (
    "My favourite planet in the solar system is called [MASK] and I hope SOMETHING.",
    "My favourite two planets in the solar system are called [ANSWER] and [MASK] and I hope SOMETHING"
)

something_list = ['to visit', 'to view tonight', 'you agree']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)

{'probs': [(0.3707822575559922, 'venus'), (0.3424750002712201, 'pluto'), (0.14857316220011976, 'mars'), (0.05586722001767207, 'jupiter'), (0.023066732399958118, 'earth'), (0.015660153571023443, 'mercury'), (0.012561704905948483, 'saturn'), (0.010904684455385408, 'neptune'), (0.010370272858153739, 'ceres'), (0.009738811764526683, 'pandora')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.317906
RMSE between LLM quinellas and Harville quinellas: 0.318443
RMSE between Skew and Harville quinellas: 0.006048
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.3928213503160743, 'pluto'), (0.3827107629324292, 'venus'), (0.0637801422351983, 'jupiter'), (0.05764980033562753, 'mars'), (0.02550365165500837, 'mercury'), (0.02220131779387423, 'ceres'), (0.015804528551749155, 'neptune'), (0.015083749380043752, 'saturn'), (0.013661435514641163, 'europa'), (0.010783261285353993, 'io')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.319143
RMSE betw

In [73]:
prompt_pair_template = (
    "When I go bowling I like to try to hit the pin number [MASK] and I hope SOMETHING.",
    "When I go bowling I like to try to hit the two pins numbered [ANSWER] and [MASK] and I hope SOMETHING"
)

something_list = ['that helps my score', 'to win', 'I do not miss']
for something in something_list:
    prompt_pair = [pp.replace('SOMETHING', something) for pp in prompt_pair_template]
    quinella_comparison(prompt_pair=prompt_pair)



{'probs': [(0.4317579028031053, 'one'), (0.13222863479926353, ','), (0.07413252747773424, '1'), (0.06252451645518417, 'five'), (0.06237098595207574, 'three'), (0.0567578060984482, '3'), (0.053501373476173626, '10'), (0.04753793665309254, 'two'), (0.04180793757372088, 'six'), (0.037380378711201784, '5')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.321789
RMSE between LLM quinellas and Harville quinellas: 0.321930
RMSE between Skew and Harville quinellas: 0.001849
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.6641621758618294, 'one'), (0.06644786244546677, '1'), (0.06066231130618482, ','), (0.037763196138229004, 'two'), (0.035841078455722014, 'three'), (0.030898769179500744, 'right'), (0.028829563841383247, '3'), (0.028407327026274905, 'five'), (0.02352632423893945, '10'), (0.023461391506469653, '20')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.327252
RMSE between LLM quinellas and Harville quinellas: 0.327378
RMSE betwe

# All together

In [111]:
import pandas as pd

# List of categories, templates, and substitutions exactly as provided
examples = [
    {
        "category": "Country Visits by Region",
        "prompt_pair_template": (
            "I visited the country called [MASK] last year and it is one of my favorite countries in SOMETHING",
            "I visited two countries called [ANSWER] and [MASK] last year and they are my two favorite countries in SOMETHING"
        ),
        "substitutions": ['Asia', 'Europe', 'the Americas', 'Africa', 'the Southern Hemisphere', 'the World']
    },
    {
        "category": "Favorite Sports",
        "prompt_pair_template": (
            "I learned the sport called [MASK] last year and it is one of my favorite forms of SOMETHING now",
            "I visited two sports called [ANSWER] and [MASK] last year and they are my two favorite forms of SOMETHING now"
        ),
        "substitutions": ['exercise', 'sport', 'relaxation', 'competition']
    },
    {
        "category": "Favorite Hobbies",
        "prompt_pair_template": (
            "I picked up the hobby called [MASK] last year and it is one of my favorite things to do SOMETHING.",
            "I engaged in two hobbies called [ANSWER] and [MASK] last year and they are my two favorite things to do SOMETHING."
        ),
        "substitutions": ['in the evening', 'when I am bored', 'with friends', 'in the evening']
    },
    {
        "category": "Favorite Ice Cream Flavors",
        "prompt_pair_template": (
            "I tried the icecream flavor called [MASK] last year and it is now my favourite SOMETHING.",
            "I tried the icecream flavors called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['treat', 'ice cream', 'guilty pleasure']
    },
    {
        "category": "Favorite Breakfast Cereals",
        "prompt_pair_template": (
            "I ate the breakfast cereal named [MASK] this morning and it is now my favorite SOMETHING.",
            "I ate the breakfast cereals named [ANSWER] and [MASK] this morning and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['meal choice', 'cereal', 'morning staple']
    },
    {
        "category": "Favorite Dog Breeds",
        "prompt_pair_template": (
            "I adopted the dog breed called [MASK] last year and it is now my favorite SOMETHING.",
            "I adopted the dog breeds called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['pet', '', 'canine']
    },
    {
        "category": "Favorite Programming Languages",
        "prompt_pair_template": (
            "I learned the programming language called [MASK] last year and it is now my favorite SOMETHING.",
            "I learned the programming languages called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['language', 'coding language', 'programming language']
    },
    {
        "category": "Favorite Musical Instruments",
        "prompt_pair_template": (
            "I learned to play the musical instrument called [MASK] last year and it is now my favorite SOMETHING.",
            "I learned to play the musical instruments called [ANSWER] and [MASK] last year and they are now my two favorite SOMETHING."
        ),
        "substitutions": ['instrument', 'musical instrument', '']
    },
    {
        "category": "Favorite Vacation Activities",
        "prompt_pair_template": (
            "I engaged in the vacation activity called [MASK] last summer and it is now my favorite SOMETHING.",
            "I engaged in the vacation activities called [ANSWER] and [MASK] last summer and they are now my two favorite SOMETHINGs."
        ),
        "substitutions": ['activity', 'vacation activity', '']
    },
    # Continue to add the remaining examples similarly...
]


In [113]:
import pandas as pd

TOP_K = 4

# DataFrame to store accuracy results
results = []

# Loop through each example exactly as provided
for example in examples:
    category = example["category"]
    prompt_pair_template = example["prompt_pair_template"]
    substitutions = example["substitutions"]

    # Track accuracies for each model in the current category
    harville_accuracies = []
    skew_accuracies = []

    for substitution in substitutions:
        # Replace placeholders with current substitution in both prompt templates
        prompt_pair = [pp.replace("SOMETHING", substitution).replace("REGION", substitution) for pp in prompt_pair_template]

        # Run quinella_comparison for Harville and Skew models and capture accuracy
        try:
            harville_accuracy, skew_accuracy = quinella_comparison(prompt_pair=prompt_pair, top_k=TOP_K)

            if harville_accuracy is not None and skew_accuracy is not None:
                harville_accuracies.append(harville_accuracy)
                skew_accuracies.append(skew_accuracy)
        except Exception as e:
            print(e)

    # Calculate average accuracy for each model in the category
    if harville_accuracies and skew_accuracies:
        avg_harville_accuracy = sum(harville_accuracies) / len(harville_accuracies)
        avg_skew_accuracy = sum(skew_accuracies) / len(skew_accuracies)

        # Determine the winning model for the category (1 if Skew wins, 0 if Harville wins)
        model_win = 1 if avg_skew_accuracy < avg_harville_accuracy else 0

        # Append results to the DataFrame
        results.append({
            "Category": category,
            "Harville Accuracy": avg_harville_accuracy,
            "Skew Accuracy": avg_skew_accuracy,
            "Skew Win": model_win
        })

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Check if the required columns are present in the DataFrame before calculating overall metrics
if "Harville Accuracy" in results_df.columns and "Skew Accuracy" in results_df.columns:
    # Calculate overall average accuracies
    overall_harville_accuracy = results_df["Harville Accuracy"].mean()
    overall_skew_accuracy = results_df["Skew Accuracy"].mean()
    overall_model_win = 1 if overall_skew_accuracy > overall_harville_accuracy else 0

    # Add an overall summary row
    summary_row = pd.DataFrame([{
        "Category": "Overall",
        "Harville Accuracy": overall_harville_accuracy,
        "Skew Accuracy": overall_skew_accuracy,
        "Model Win": overall_model_win
    }])

# Display the DataFrame
print(results_df)




{'probs': [(0.30040617864945307, 'vietnam'), (0.28587588044119433, 'myanmar'), (0.2086695260412975, 'cambodia'), (0.20504841486805517, 'burma')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.081318
RMSE between LLM quinellas and Harville quinellas: 0.081727
RMSE between Skew and Harville quinellas: 0.001176
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.32754779736635087, 'poland'), (0.28626227003523325, 'slovenia'), (0.24871608290902272, 'luxembourg'), (0.13747384968939316, 'romania')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.051526
RMSE between LLM quinellas and Harville quinellas: 0.053074
RMSE between Skew and Harville quinellas: 0.003235
The Skew Normal model better predicts the actual quinella probabilities.
{'probs': [(0.3442923651911901, 'venezuela'), (0.26691868637341254, 'haiti'), (0.2020611394032034, 'honduras'), (0.18672780903219396, 'guatemala')]}
RMSE between LLM quinellas and Skew Normal quinellas: 0.0845