# **The Wisdom of the Crowd in LLMs**
### **Abstract** 
We analyze the performance of a group of LLMs as compared to the performance of its constituent individuals.  
We do this by performing a majority vote on the numbers that appear in the final answer.  
In this notebook, we use the DRAW-1K and ALG-514 datasets.  

This method keeps only the numbers that appear a majority of the time (that is, in more than half of the responses).  
However, it has limited real-world use, especially in cases where an explanation is needed.  

Later, we propose a method that instead elects the response whose extracted numbers have the lowest Levenshtein distance to the majority, and we compare the results.  
This method provides users with an explanation and is therefore much more useful for real-world use cases.  

**Potential caveats in this experiment:**  
- **We extract all decimals and fractions from ChatGPT's response.** It is possible that ChatGPT merely mentions the correct number somewhere in its response without giving it as its final answer.  
However, based on a short preliminary check that I made of the dataset, it appears that when ChatGPT mentions the correct answer in its response, its final answer is usually correct. Furthermore, we take steps to limit such lucky matches.  
- **The majority solution might be larger than any individual ChatGPT response.** There certainly might be places where this is the case. However, we show that, in most cases, the majority solution is smaller than the average individual response.

## **1. Data preparation**
### **Download libraries**

In [697]:
%%capture
%pip install pandas==1.3.5
%pip install scipy==1.7.3

### **Load libraries**
To make the results of the experiment reproducible, I have included the Python version and the versions of the libraries used as a comment.

In [698]:
# PYTHON VERSION ---------------- #
# - Python v3.7.8                 #
#                                 #
# LIBRARIES --------------------- #
# - pandas v1.3.5                 #
# - scipy  v1.7.3                 #
# =============================== #

import re

# ------------------------------- #
# Pandas
# ------------------------------- #
import pandas

# ------------------------------- #
# Scipy
# ------------------------------- #
from scipy.spatial.distance import pdist

### **Constants**
Various constants are specified. These change from run to run: `DATASET` selects the data directory, and `N_JOBS` is the number of sample files loaded per dataset.  

In [699]:
DATASET = 'draw'
N_JOBS = 10

### **File-load utility functions**
We provide various utility functions for loading files.  

In [700]:
def load_file(num : int) -> pandas.DataFrame:
    # Load one sample file and return its responses indexed by question number.
    file_path = f'data/{DATASET}/sample_{num}.jsonl'
    data = pandas.read_json(file_path,lines=True)
    data = data[['question_number', 'response']]
    data = data.set_index('question_number')
    data = data.rename(columns={'response': f'sample_{num}'})
    return data

def load_n_jobs_file() -> pandas.DataFrame:
    # Join the N_JOBS sample files into one dataframe with one column per sample.
    dataframes = load_file(0)
    for i in range(1, N_JOBS):
        dataframes = dataframes.join(load_file(i))
    return dataframes

### **Load files**

In [701]:
# ground is the question dataset which we will use to compare our solutions to.
ground = pandas.read_json(f'data/{DATASET}/ground.json')
ground = ground[['lSolutions']]
ground.head(1)

| | lSolutions |
|---|---|
| 0 | [2.14285714286] |


In [702]:
# responses is the dataframe of N_JOBS responses from ChatGPT
responses = load_n_jobs_file().sort_index()
responses = responses.join(ground)
responses.head(1)

| question_number | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 | sample_7 | sample_8 | sample_9 | lSolutions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Let's use the formula distance = rate × time.\... | Let's call the speed of the current 'c'.\nWhen... | Let's call the speed of the current 'x.' If Ju... | Let the speed of the current be x miles per ho... | Let's call the speed of the current 'c'. \n\nW... | Let the speed of the current be x miles per ho... | Let's call the speed of the current 'c'. \n\nW... | Let the speed of the current be x miles per ho... | Let the speed of the current be x miles per ho... | Let the speed of the current be x mph.\nThen, ... | [2.14285714286] |


## **2. Majority**
As a first step, we create a majority array consisting of the numbers that appear in a majority of the responses.

### **Extract decimals from response**

In [703]:
def number_strip(text : str):
    # Drop leading characters until the token starts with a non-zero digit or '-'.
    # Leading zeros are dropped as well, so '012' becomes '12' and a bare '0' becomes empty.
    while len(text) > 0 and (not (text[0].isdigit() or text[0] == '-') or text[0] == '0'): text = text[1:]
    # Drop trailing characters until the token ends with a digit.
    while len(text) > 0 and not text[-1].isdigit(): text = text.rstrip(text[-1])
    return text

def extract_decimals(text : str):
    text = str(text)
    # We try to limit the number of lucky responses ChatGPT can get.
    # ChatGPT sometimes names variables like t1 or 2x; digits attached directly to a letter are removed together with it.
    text = re.sub(r'[a-zA-Z]\d*|\d*[a-zA-Z]', ' ', text)

    # Then, we remove every character that cannot be part of a number. The only characters
    # kept are digits, '.', '/' and '-', so any other notation (LaTeX commands, for example) is replaced by spaces.
    text = re.sub(r'[^0-9\.\/\-]', ' ', text)

    # Split it by spaces
    split_text = text.split(' ')
    
    # Strip each token and drop the ones that end up empty
    split_text = [number_strip(text) for text in split_text]
    split_text = [text for text in split_text if len(text) > 0]
    
    # Evaluate each token so that fractions such as '3/4' become numbers
    decimals = [eval(text) for text in split_text]

    # Return each distinct number once
    return list(set(decimals))

In [704]:
decimals = responses.copy()

for i in range(N_JOBS):
    decimals[f'sample_{i}'] = decimals[f'sample_{i}'].apply(lambda row : extract_decimals(row))

decimals.head(1)

| question_number | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 | sample_7 | sample_8 | sample_9 | lSolutions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180, 21] | [0.7, 2.14, 7, 9, 12, 12.86, 15, 21, -2.14] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180, 21] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14285714286] |
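
`extract_decimals` relies on `eval` to turn tokens such as `3/4` into numbers. As a hedged alternative, the sketch below parses a token without `eval`. The name `parse_number` is hypothetical and not part of the notebook, and the sketch assumes tokens contain only digits, `.`, `/` and `-`. It is stricter than `eval` (for example, it rejects `1-2` instead of evaluating it to `-1`), so swapping it in could change the extracted numbers and the counts reported later.

```python
def parse_number(token):
    """Parse '3', '-2.5' or '3/4' into a float; return None if malformed.

    Hypothetical helper: stricter than eval, so results may differ.
    """
    parts = token.split('/')
    try:
        if len(parts) == 1:
            return float(parts[0])
        if len(parts) == 2:
            # A single '/' is read as numerator / denominator.
            return float(parts[0]) / float(parts[1])
    except (ValueError, ZeroDivisionError):
        # Malformed pieces such as '1.2.3', or a zero denominator.
        return None
    # More than one '/' is treated as malformed.
    return None


print(parse_number('3/4'))
print(parse_number('-2.5'))
print(parse_number('1/0'))
```

Returning `None` lets the caller decide whether to skip a malformed token, whereas `eval` would raise an exception on the same input.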


### **Select majority**
We select the numbers that appear in more than half of the samples.

In [705]:

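def select_majority(row, cutoff):
    # Count in how many samples each number appears (each sample list is already de-duplicated).
    majority_map = dict()

    for i in range(N_JOBS):
        nums = row[f'sample_{i}']
        for j in nums:
            if j not in majority_map: majority_map[j] = 1 
            else: majority_map[j] += 1
    
    # Keep the numbers that appear in more than `cutoff` samples.
    ret = []
    for key, value in majority_map.items():
        if value > cutoff:
            ret.append(key)
    return ret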

decimals['majority'] = decimals.apply(lambda row : select_majority(row, N_JOBS / 2), axis=1)
decimals.head(1)

| question_number | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 | sample_7 | sample_8 | sample_9 | lSolutions | majority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180, 21] | [0.7, 2.14, 7, 9, 12, 12.86, 15, 21, -2.14] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180, 21] | [2.14, 135, 9, 12, 45, 15, 180] | [2.14285714286] | [2.14, 135, 9, 12, 45, 15, 180] |
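
The majority rule can also be sketched on plain lists, independent of the dataframe. This is a minimal illustration rather than the notebook's code: `majority_numbers` is a hypothetical name, the toy samples are made up, and the cutoff defaults to half the number of samples, mirroring `N_JOBS / 2`.

```python
from collections import Counter


def majority_numbers(samples, cutoff=None):
    """Return the numbers that occur in more than `cutoff` samples."""
    if cutoff is None:
        cutoff = len(samples) / 2
    # De-duplicate each sample so a number counts at most once per sample.
    counts = Counter(n for sample in samples for n in set(sample))
    return sorted(n for n, c in counts.items() if c > cutoff)


# Toy data (not from the dataset): four samples, so the cutoff is 2.
samples = [[2.14, 9, 12], [2.14, 9, 21], [2.14, 7, 12], [0.7, 9, 12]]
print(majority_numbers(samples))
```

Because the comparison is strict, a number that appears in exactly half of the samples is not kept.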


### **Evaluate majority performance**
We evaluate the performance of the aggregate opinion as well as the opinions of the constituent individuals.

In [706]:
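def check_correct(solution, answer, transform_func):
    # solution: ground-truth values; answer: numbers extracted from a response.
    # Returns 'all' if every ground-truth value appears in answer,
    # 'some' if only part of them do, and 'none' otherwise.
    solution = set([transform_func(s) for s in solution])
    answer = set([transform_func(s) for s in answer])

    if len(answer.intersection(solution)) == len(solution): return 'all'
    elif len(answer.intersection(solution)) > 0: return 'some'
    else: return 'none'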

def base_transform_func(x): return round(x, 3)

is_correct = decimals.copy()

for i in range(N_JOBS):
    is_correct[f'sample_{i}'] = is_correct.apply(lambda row : check_correct(row['lSolutions'], row[f'sample_{i}'], base_transform_func), axis=1)

is_correct[f'majority'] = is_correct.apply(lambda row : check_correct(row['lSolutions'], row[f'majority'], base_transform_func), axis=1)

is_correct.head(1)

| question_number | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 | sample_7 | sample_8 | sample_9 | lSolutions | majority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | none | none | none | none | none | none | none | none | none | none | [2.14285714286] | none |
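
The first row is marked `none` even though every sample contains 2.14, because the ground truth 2.14285714286 rounds to 2.143 at three decimals. The sketch below reproduces `check_correct` and `base_transform_func` from the cell above on small inputs. The first call uses a shortened version of the first row, and the other two inputs are made up for illustration.

```python
def check_correct(solution, answer, transform_func):
    # Same logic as the notebook cell: compare ground truth and answer as sets.
    solution = set(transform_func(s) for s in solution)
    answer = set(transform_func(s) for s in answer)
    matched = len(answer.intersection(solution))
    if matched == len(solution):
        return 'all'
    elif matched > 0:
        return 'some'
    return 'none'


def base_transform_func(x):
    return round(x, 3)


# 2.14 != 2.143 after rounding to three decimals, so nothing matches.
print(check_correct([2.14285714286], [2.14, 135, 9], base_transform_func))
# A response that carries three decimals does match.
print(check_correct([2.14285714286], [2.143, 9], base_transform_func))
# Only one of two ground-truth values is present.
print(check_correct([3, 5], [3, 7], base_transform_func))
```

The reported counts therefore depend on how many decimals a response prints as well as on whether its reasoning is right.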


In [707]:
value_cnt = pandas.DataFrame(is_correct[f'sample_0'].value_counts())
for i in range(1, N_JOBS):
    value_cnt = value_cnt.join(pandas.DataFrame(is_correct[f'sample_{i}'].value_counts()))
value_cnt = value_cnt.join(pandas.DataFrame(is_correct['majority'].value_counts()))

### **Majority summary**
It looks like the majority performs better than the average individual sample, although a few individual samples score slightly higher (see the stats table in section 3).  
However, the majority array might simply be larger, which would give ChatGPT more chances at a lucky match.
Most of the time, it is smaller.

In [708]:
def len_difference(row, column_name):
    majority_len = len(row[column_name])
    total = 0
    for i in range(N_JOBS):
        nums = len(row[f'sample_{i}'])
        total += majority_len - nums
    return total / N_JOBS

len_diff = pandas.DataFrame()
len_diff['len_difference'] = decimals.apply(lambda row : len_difference(row, 'majority'), axis=1)
print('NUMBER OF CASES WHERE MAJORITY SIZE > AVERAGE INDIVIDUAL SIZE:', len(len_diff[len_diff['len_difference'] > 0]))
print('AVERAGE MAJORITY - INDIVIDUAL SIZE:', len_diff['len_difference'].sum()/len(len_diff))

NUMBER OF CASES WHERE MAJORITY SIZE > AVERAGE INDIVIDUAL SIZE: 218
AVERAGE MAJORITY - INDIVIDUAL SIZE: -1.0439
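
To make `len_difference` concrete, here is a minimal sketch that computes the same per-question quantity from list sizes alone. `mean_size_gap` is a hypothetical name. The sizes are read off the first row shown in section 2: the majority array has 7 numbers, and the ten samples have 7, 7, 7, 7, 7, 8, 9, 7, 8 and 7.

```python
def mean_size_gap(majority_size, sample_sizes):
    """Average of (majority size - sample size) over all samples."""
    return sum(majority_size - n for n in sample_sizes) / len(sample_sizes)


# Sizes taken from the first row of the majority table.
first_row_sizes = [7, 7, 7, 7, 7, 8, 9, 7, 8, 7]
print(mean_size_gap(7, first_row_sizes))
```

A negative value means the majority array is smaller than the average sample for that question, and a positive value means it is larger.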


## **3. Majority election**
Now, we instead elect a single response based on the Levenshtein distance between its extracted numbers and the majority array. 

In [709]:
def levenshtein_helper(initial, target):
    # Recursive edit distance between two lists: deletions, insertions and substitutions each cost 1.
    if len(target) == 0: return len(initial)
    if len(initial) == 0: return len(target)
    if initial[0] == target[0]: return levenshtein_helper(initial[1:], target[1:])
    return 1 + min(levenshtein_helper(initial[1:], target), levenshtein_helper(initial, target[1:]), levenshtein_helper(initial[1:], target[1:]))

# Pick the one with the smallest levenshtein distance
def levenshtein_distance(initial, target):
    # Compare sorted copies so that the lists stored in the dataframe are not reordered in place.
    return levenshtein_helper(initial=sorted(initial), target=sorted(target))

# Pick the one with the most majority elements
def most_majority(initial, target):
    return -len(set(initial).intersection(set(target)))

# Pick the largest array
def max_size(initial, target):
    return -len(initial)
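
The recursive helper above revisits the same sub-problems many times, so its running time grows quickly with list length. As a hedged alternative, the sketch below computes the same edit distance with the standard row-by-row dynamic programme. `levenshtein_dp` is a hypothetical name and is not used elsewhere in this notebook.

```python
def levenshtein_dp(initial, target):
    """Edit distance between two number lists, compared in sorted order."""
    initial, target = sorted(initial), sorted(target)
    # previous[j] is the distance between the processed prefix of `initial`
    # and the first j items of `target`.
    previous = list(range(len(target) + 1))
    for i, a in enumerate(initial, start=1):
        current = [i]
        for j, b in enumerate(target, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # delete a
                               current[j - 1] + 1,       # insert b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein_dp([2.14, 9, 12], [12, 9, 2.14]))
print(levenshtein_dp([1, 2, 3], [1, 3]))
```

This takes time proportional to the product of the two list lengths and does not modify its inputs.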

### **Elect representative**
We elect the response with the smallest Levenshtein distance. For comparison, we also elect the response that contains the most numbers from the majority array, and the response with the largest array. 

In [710]:
def select_representative(row):
    min_val = float('inf')
    for i in range(N_JOBS):
        min_val = min(row[f'sample_{i}'], min_val)
    
    for i in range(N_JOBS):
        if row[f'sample_{i}'] == min_val: return f'sample_{i}'
    return 'sample_0'

distance = decimals.copy()
# Note: every scoring function below receives row['lSolutions'] as its target, not row['majority'].
# Score each sample by Levenshtein distance, then elect the sample with the lowest score.
for i in range(N_JOBS):
    distance[f'sample_{i}'] = decimals.apply(lambda row : levenshtein_distance(row[f'sample_{i}'], row['lSolutions']), axis=1)
distance['levenshtein_representative'] = distance.apply(lambda row: select_representative(row), axis=1) 

# Score each sample by (negated) array size, so the largest array is elected.
for i in range(N_JOBS):
    distance[f'sample_{i}'] = decimals.apply(lambda row : max_size(row[f'sample_{i}'], row['lSolutions']), axis=1)
distance['max_size_representative'] = distance.apply(lambda row: select_representative(row), axis=1) 

# Score each sample by (negated) overlap with the target, so the largest overlap is elected.
for i in range(N_JOBS):
    distance[f'sample_{i}'] = decimals.apply(lambda row : most_majority(row[f'sample_{i}'], row['lSolutions']), axis=1)
distance['most_majority_representative'] = distance.apply(lambda row: select_representative(row), axis=1) 
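
The prose describes electing a response by comparing it with the majority array. The sketch below shows that variant on plain lists, with made-up toy data. `elect` is a hypothetical helper, while `most_majority` and `max_size` repeat the scoring functions defined earlier. It illustrates the idea only and is not the code that produced the results reported in this notebook.

```python
def most_majority(initial, target):
    # More shared numbers gives a lower (better) score.
    return -len(set(initial).intersection(set(target)))


def max_size(initial, target):
    # A larger array gives a lower (better) score.
    return -len(initial)


def elect(samples, majority, score):
    """Return the index of the sample with the lowest score against `majority`."""
    scores = [score(sample, majority) for sample in samples]
    # The first index wins ties, as in select_representative.
    return scores.index(min(scores))


# Toy data (not from the dataset).
samples = [[2.14, 9], [2.14, 9, 12, 21], [2.14, 9, 12]]
majority = [2.14, 9, 12]
print(elect(samples, majority, most_majority))
print(elect(samples, majority, max_size))
```

Any function that takes `(initial, target)` and returns a number, including a Levenshtein distance, can be passed as `score`.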

In [711]:
elected = responses.copy()
elected = elected.join(distance[['levenshtein_representative', 'most_majority_representative', 'max_size_representative']])
elected['levenshtein_sample'] = elected.apply(lambda row : row[row['levenshtein_representative']], axis=1)
elected['most_majority_sample'] = elected.apply(lambda row : row[row['most_majority_representative']], axis=1)
elected['max_size_sample'] = elected.apply(lambda row : row[row['max_size_representative']], axis=1)

elected['levenshtein_decimals'] = elected['levenshtein_sample'].apply(lambda row : extract_decimals(row))
elected['most_majority_decimals'] = elected['most_majority_sample'].apply(lambda row : extract_decimals(row))
elected['max_size_decimals'] = elected['max_size_sample'].apply(lambda row : extract_decimals(row))

elected = elected[['levenshtein_sample', 'most_majority_sample', 'levenshtein_decimals', 'most_majority_decimals', 'max_size_decimals', 'lSolutions']]
elected['levenshtein_correct'] = elected.apply(lambda row : check_correct(row['lSolutions'], row['levenshtein_decimals'], base_transform_func), axis=1)
elected['most_majority_correct'] = elected.apply(lambda row : check_correct(row['lSolutions'], row['most_majority_decimals'], base_transform_func), axis=1)
elected['max_size_correct'] = elected.apply(lambda row : check_correct(row['lSolutions'], row['max_size_decimals'], base_transform_func), axis=1)
elected.head(1)

| question_number | levenshtein_sample | most_majority_sample | levenshtein_decimals | most_majority_decimals | max_size_decimals | lSolutions | levenshtein_correct | most_majority_correct | max_size_correct |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Let's use the formula distance = rate × time.\... | Let's use the formula distance = rate × time.\... | [2.14, 135, 9, 12, 45, 15, 180] | [2.14, 135, 9, 12, 45, 15, 180] | [0.7, 2.14, 7, 9, 12, 12.86, 15, 21, -2.14] | [2.14285714286] | none | none | none |


In [712]:
def get_average(row):
    average = 0
    for i in range(N_JOBS):
        average += row[f'sample_{i}']
    return average / N_JOBS

#### **Stats**
The important columns are:  
- `majority` - Results for the arrays created by selecting the values that appear in a majority of the samples.  
- `levenshtein_correct` - Results for the arrays elected, among the samples, as having the shortest Levenshtein distance to the majority.  
- `most_majority_correct` - Results for the arrays elected, among the samples, as containing the most elements from the majority array.
- `max_size_correct` - Results for the arrays elected, among the samples, as having the maximum size.

In [713]:
value_cnt = value_cnt.join(pandas.DataFrame(elected['levenshtein_correct'].value_counts()))
value_cnt = value_cnt.join(pandas.DataFrame(elected['most_majority_correct'].value_counts()))
value_cnt = value_cnt.join(pandas.DataFrame(elected['max_size_correct'].value_counts()))
value_cnt['sample_average'] = value_cnt.apply(lambda row : get_average(row),axis=1)
value_cnt

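| | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 | sample_7 | sample_8 | sample_9 | majority | levenshtein_correct | most_majority_correct | max_size_correct | sample_average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| all | 739 | 753 | 726 | 726 | 733 | 756 | 738 | 731 | 727 | 728 | 750 | 753 | 874 | 745 | 735.7 |
| none | 174 | 167 | 180 | 187 | 170 | 165 | 177 | 184 | 178 | 184 | 165 | 169 | 86 | 178 | 176.6 |
| some | 87 | 80 | 94 | 87 | 97 | 79 | 85 | 85 | 95 | 88 | 85 | 78 | 40 | 77 | 87.7 |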


From this, we can see that electing the largest array is not helpful: most majority outperforms it by a wide margin (874 vs. 745 `all`), and Levenshtein does slightly better (753 vs. 745 `all`).
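
The counts in the stats table can be turned into rates for easier comparison. The sketch below copies the aggregate columns from that table and computes the share of questions marked `all`. `all_rate` is a hypothetical helper name.

```python
# Counts copied from the stats table above (aggregate columns only).
counts = {
    'sample_average':        {'all': 735.7, 'none': 176.6, 'some': 87.7},
    'majority':              {'all': 750, 'none': 165, 'some': 85},
    'levenshtein_correct':   {'all': 753, 'none': 169, 'some': 78},
    'most_majority_correct': {'all': 874, 'none': 86, 'some': 40},
    'max_size_correct':      {'all': 745, 'none': 178, 'some': 77},
}


def all_rate(column):
    """Percentage of questions for which every ground-truth value was found."""
    total = sum(column.values())
    return round(100 * column['all'] / total, 1)


for name, column in counts.items():
    print(f'{name}: {all_rate(column)}% all')
```

Each column sums to the same number of questions, so the rates are directly comparable across methods.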