This notebook combines all the datasets processed in [cleanDatasets.ipynb](cleanDatasets.ipynb). In the process, it does the following:

- Analyzes some properties of duplication - e.g. how consistently duplicates are correlated
- Combines the scores of duplicated responses into a single response
- Deduplicates the responses

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ocsai.data import combine_dupes, fingerprint_series
from ocsai.train import col_split, default_split

In [3]:
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import json

data_dir = Path('../data/datasets')
all_datasets = list(data_dir.glob('*.csv'))
# concat and shuffle the data
all_data = (pd.concat([pd.read_csv(f) for f in all_datasets])
            .sample(frac=1, random_state=1234)
)
# remove start and end whitespace
all_data.response = all_data.response.str.strip()
# save for easy access later
all_data.to_csv('../data/all_data.csv', index=False)
print(all_data.shape)
all_data.sample()

(162872, 12)


Unnamed: 0,type,src,question,prompt,response,id,target,participant,response_num,language,rater_count,rating_std
1229,uses,multiaut_chinese1,什么是字典的一个令人惊讶的用途？,字典,做装饰品,multiaut_chinese1_字典-e9240b,2.6,multiaut_chinese11041,,chi,4,0.816497


## Overview

In [4]:
# double-check that no prompts are missing; the source data should already be clean
assert all_data.prompt.isna().sum() == 0
print(f"Pre-dedupe data size is {len(all_data)} items")
print(f'# of unique participants is {len(all_data.participant.unique())}')
print("# of unique prompts", len(all_data.prompt.unique()))

Pre-dedupe data size is 162872 items
# of unique participants is 9982
# of unique prompts 252


In [5]:
all_data[['src', 'type']].value_counts()

src                type        
multiaut_chinese1  uses            14176
multiaut_dutch1    uses            10549
multiaut_german1   uses             8116
multiaut_german3   uses             8065
transdis           uses             8007
multiaut_polish1   uses             7415
multiaut_italian2  uses             6895
setal08            uses             5582
h18                uses             5582
dod20              uses             5490
dbc23              metaphors        4589
multiaut_italian1  uses             4269
snbmo09            uses             4099
hmsl               uses             3843
multiaut_english2  uses             3723
multiaut_german2   uses             3530
multiaut_english6  uses             3425
multiaut_english3  uses             3225
h18                consequences     3198
setal08            consequences     3198
multiaut_polish2   uses             3054
betal18            uses             2918
motesf             uses             2913
                   instan

In [6]:
s = ""
for test_type in all_data['type'].astype(str).unique():
    s += f"# {test_type.upper()}\n\n"
    s += f"**Tests with this type:{all_data[all_data.type == test_type].src.unique()}**\n\n"
    for q in all_data[all_data.type == test_type].question.unique():
        s += f"- {q}\n"

from IPython.display import Markdown
Markdown(s)

# COMPLETION

**Tests with this type:['motesf' 'motesp']**

- Complete this sentence in a surprising way: "When the kids were in the library..."
- Complete this sentence in a surprising way: "When I got on the school bus..."
- Complete this sentence in a surprising way: "At a sleepover we..."
- Complete this sentence in a surprising way: "When I was at lunch..."
- Complete this sentence in a surprising way: "When I opened my closet..."
- Complete this sentence in a surprising way: "When the teacher was talking..."
- Complete this sentence in a surprising way: "It started raining and..."
- Complete this sentence in a surprising way: "My friend called me on the phone to tell me..."
- Complete this sentence in a surprising way: "When the kids were in the library they found..."
- Complete this sentence in a surprising way: "When I got on the school bus, I saw..."
- Complete this sentence in a surprising way: "When I had a sleepover at my friend's, we played..."
- Complete this sentence in a surprising way: "When the friends met in the playground, they..."
- Complete this sentence in a surprising way: "When the teacher was talking, all the kids..."
# USES

**Tests with this type:['multiaut_dutch1' 'snb17' 'transdis' 'multiaut_russian1'
 'multiaut_english3' 'multiaut_chinese1' 'multiaut_dutch2' 'snbmo09'
 'betal18' 'h18' 'hmsl' 'multiaut_polish1' 'multiaut_dutch4' 'dod20'
 'multiaut_german1' 'multiaut_french2' 'multiaut_german3'
 'multiaut_german2' 'multiaut_polish2' 'multiaut_hebrew1'
 'multiaut_italian2' 'setal08' 'multiaut_french3' 'multiaut_spanish1'
 'multiaut_english2' 'motesf' 'multiaut_italian1' 'bs12'
 'multiaut_arabic1' 'multiaut_chinese2' 'multiaut_english6'
 'multiaut_french4' 'multiaut_dutch3' 'hass17' 'multiaut_russian2'
 'motesp']**

- Wat is een verrassend gebruik voor een FORK?
- What is a surprising use for ROPE?
- 什么是牙刷的一个令人惊讶的用途？
- Какое удивительное применение для ГАЗЕТА?
- Wat is een verrassend gebruik voor een PAPERCLIP?
- What is a surprising use for BOX?
- 什么是衣架的一个令人惊讶的用途？
- What is a surprising use for a BRICK?
- 什么是筷子的一个令人惊讶的用途？
- What is a surprising thing that is ROUND?
- What is a surprising use for a KNIFE?
- What is a surprising use for PAPERCLIP?
- Jakie jest zaskakujące zastosowanie dla CEGŁA?
- Wat is een verrassend gebruik voor een TOWEL?
- Wat is een verrassend gebruik voor een BRICK?
- What is a surprising use for TABLE?
- 什么是钳子的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein MESSER?
- Was ist eine überraschende Verwendung für ein HAARFOEHN?
- Quel est un usage surprenant pour un CHAPEAU?
- Was ist eine überraschende Verwendung für ein TROMPETE?
- Was ist eine überraschende Verwendung für ein BÜROKLAMMER?
- Jakie jest zaskakujące zastosowanie dla PUSZKA?
- מהו שימוש מפתיע למברג?
- Qual è un uso sorprendente per un LIBRO?
- What is a surprising use for BOOK?
- 什么是西瓜的一个令人惊讶的用途？
- 什么是吸管的一个令人惊讶的用途？
- 什么是红酒的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein KONSERVENDOSE?
- 什么是图钉的一个令人惊讶的用途？
- What is a surprising use for BRICK?
- Quel est un usage surprenant pour un BROUETTE?
- ¿Cuál es un uso sorprendente para un LADRILLO?
- 什么是毛笔的一个令人惊讶的用途？
- 什么是床单的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein TROMMEL?
- 什么是靴子的一个令人惊讶的用途？
- What is a surprising use for PANTS?
- What is a surprising use for a TOOTHBRUSH?
- Was ist eine überraschende Verwendung für ein BETT?
- 什么是椰子的一个令人惊讶的用途？
- Qual è un uso sorprendente per un BOTTIGLIA DI PLASTICA?
- Was ist eine überraschende Verwendung für ein SÄGE?
- 什么是梳子的一个令人惊讶的用途？
- מהו שימוש מפתיע לכיסא?
- 什么是皮带的一个令人惊讶的用途？
- Jakie jest zaskakujące zastosowanie dla SZNUREK?
- ما هو استخدام مفاجئ لـ TIN CANS؟
- What is a surprising use for FORK?
- Qual è un uso sorprendente per un BARATTOLO?
- 什么是光盘的一个令人惊讶的用途？
- 什么是拖鞋的一个令人惊讶的用途？
- 什么是狐狸的一个令人惊讶的用途？
- Qual è un uso sorprendente per un ASPIRAPOLVERE?
- 什么是易拉罐的一个令人惊讶的用途？
- 什么是船桨的一个令人惊讶的用途？
- Qual è un uso sorprendente per un MATTONE?
- What is a surprising use for TIRE?
- 什么是锅的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein SCHAUFEL?
- Какое удивительное применение для ДЕРЕВЯННАЯ ЛИНЕЙКА?
- מהו שימוש מפתיע לסכין?
- What is a surprising use for KNIFE?
- Quel est un usage surprenant pour un CEINTURE?
- 什么是头发的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein PAPRIKA?
- 什么是气球的一个令人惊讶的用途？
- What is a surprising use for SHOE?
- 什么是玉米的一个令人惊讶的用途？
- 什么是茶壶的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein SCHRANK?
- Was ist eine überraschende Verwendung für ein GEIGE?
- What is a surprising use for a HAT?
- Was ist eine überraschende Verwendung für ein SEIL?
- 什么是花瓣的一个令人惊讶的用途？
- 什么是鹅卵石的一个令人惊讶的用途？
- Qual è un uso sorprendente per un BOTTIGLIETTA?
- 什么是墨水的一个令人惊讶的用途？
- 什么是轮胎的一个令人惊讶的用途？
- 什么是积木的一个令人惊讶的用途？
- 什么是耳机的一个令人惊讶的用途？
- Какое удивительное применение для КАРТОННАЯ КОРОБКА?
- Was ist eine überraschende Verwendung für ein AXT?
- 什么是铁链的一个令人惊讶的用途？
- Qual è un uso sorprendente per un APPENDINO?
- מהו שימוש מפתיע לנעל?
- מהו שימוש מפתיע לקולב?
- What is a surprising use for a LIGHT BULB?
- Qual è un uso sorprendente per un SEDIA?
- Was ist eine überraschende Verwendung für ein MÜLLTÜTE?
- Qual è un uso sorprendente per un CESTINO?
- 什么是窗帘的一个令人惊讶的用途？
- Qual è un uso sorprendente per un ACCENDINO?
- Was ist eine überraschende Verwendung für ein GURKE?
- 什么是纽扣的一个令人惊讶的用途？
- Qual è un uso sorprendente per un BOTTE?
- 什么是硬币的一个令人惊讶的用途？
- Qual è un uso sorprendente per un COLTELLO?
- 什么是纸盒的一个令人惊讶的用途？
- 什么是纸巾的一个令人惊讶的用途？
- Qual è un uso sorprendente per un BARILE?
- 什么是马来貘的一个令人惊讶的用途？
- Qual è un uso sorprendente per un ATTACCAPANNI?
- מהו שימוש מפתיע לעיפרון?
- 什么是胶水的一个令人惊讶的用途？
- 什么是手套的一个令人惊讶的用途？
- Qual è un uso sorprendente per un CUCCHIAIO?
- 什么是扇子的一个令人惊讶的用途？
- 什么是扑克的一个令人惊讶的用途？
- Qual è un uso sorprendente per un GUANTO?
- מהו שימוש מפתיע לכרית?
- מהו שימוש מפתיע לעיתון?
- What is a surprising use for a BALL?
- 什么是台灯的一个令人惊讶的用途？
- Qual è un uso sorprendente per un GRAFFETTA?
- Qual è un uso sorprendente per un ACCETTA?
- Was ist eine überraschende Verwendung für ein FLÖTE?
- Was ist eine überraschende Verwendung für ein STUHL?
- Qual è un uso sorprendente per un CAPPELLO?
- Qual è un uso sorprendente per un MARTELLO?
- Qual è un uso sorprendente per un LAMPADINA?
- 什么是银行卡的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein ZANGE?
- 什么是报纸的一个令人惊讶的用途？
- 什么是酸奶的一个令人惊讶的用途？
- What is a surprising use for BOTTLE?
- 什么是柳树的一个令人惊讶的用途？
- 什么是蜡烛的一个令人惊讶的用途？
- 什么是花生的一个令人惊讶的用途？
- 什么是音响的一个令人惊讶的用途？
- 什么是杯子的一个令人惊讶的用途？
- 什么是牙膏的一个令人惊讶的用途？
- 什么是纸杯的一个令人惊讶的用途？
- 什么是铅笔的一个令人惊讶的用途？
- 什么是夹子的一个令人惊讶的用途？
- What is a surprising use for SHOVEL?
- 什么是橡皮擦的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein ERBSE?
- Qual è un uso sorprendente per un BANANA?
- 什么是笛子的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein TOMATE?
- 什么是耳机线的一个令人惊讶的用途？
- 什么是南瓜的一个令人惊讶的用途？
- 什么是勺子的一个令人惊讶的用途？
- 什么是柿子的一个令人惊讶的用途？
- 什么是木头的一个令人惊讶的用途？
- 什么是戒指的一个令人惊讶的用途？
- What is a surprising use for a SOCK?
- What is a surprising use for a PENCIL?
- 什么是面团的一个令人惊讶的用途？
- מהו שימוש מפתיע לצמיג?
- 什么是盘子的一个令人惊讶的用途？
- 什么是白纸的一个令人惊讶的用途？
- What is a surprising use for a BOTTLE?
- 什么是韭菜的一个令人惊讶的用途？
- 什么是镊子的一个令人惊讶的用途？
- 什么是酒瓶的一个令人惊讶的用途？
- Was ist eine überraschende Verwendung für ein TISCH?
- 什么是生姜的一个令人惊讶的用途？
- 什么是磁铁的一个令人惊讶的用途？
- 什么是灌木的一个令人惊讶的用途？
- 什么是火柴的一个令人惊讶的用途？
- 什么是卫生纸的一个令人惊讶的用途？
- 什么是黄金的一个令人惊讶的用途？
- 什么是相机的一个令人惊讶的用途？
- 什么是喇叭的一个令人惊讶的用途？
- 什么是钉子的一个令人惊讶的用途？
- 什么是荷叶的一个令人惊讶的用途？
- 什么是算盘的一个令人惊讶的用途？
- 什么是水壶的一个令人惊讶的用途？
- 什么是漏斗的一个令人惊讶的用途？
- 什么是白酒的一个令人惊讶的用途？
- 什么是地图的一个令人惊讶的用途？
- 什么是芦荟的一个令人惊讶的用途？
- 什么是画像的一个令人惊讶的用途？
- 什么是塑料袋的一个令人惊讶的用途？
- 什么是西瓜皮的一个令人惊讶的用途？
- 什么是蛋壳的一个令人惊讶的用途？
- 什么是花椒的一个令人惊讶的用途？
- 什么是铃铛的一个令人惊讶的用途？
- 什么是球拍的一个令人惊讶的用途？
- 什么是擀面杖的一个令人惊讶的用途？
- 什么是棉签的一个令人惊讶的用途？
- 什么是无花果的一个令人惊讶的用途？
- 什么是皮筋的一个令人惊讶的用途？
- 什么是曲别针的一个令人惊讶的用途？
- What is a surprising use for a SPOON?
- 什么是领带的一个令人惊讶的用途？
- 什么是贝壳的一个令人惊讶的用途？
- 什么是土豆的一个令人惊讶的用途？
- 什么是香蕉的一个令人惊讶的用途？
- 什么是鞋带的一个令人惊讶的用途？
- Qual è un uso sorprendente per un BORSA?
- Qual è un uso sorprendente per un BICICLETTA?
- 什么是蛋糕的一个令人惊讶的用途？
- 什么是冰块的一个令人惊讶的用途？
- 什么是袜子的一个令人惊讶的用途？
- 什么是温度计的一个令人惊讶的用途？
- 什么是发带的一个令人惊讶的用途？
- 什么是字典的一个令人惊讶的用途？
- 什么是牙签的一个令人惊讶的用途？
- 什么是钥匙的一个令人惊讶的用途？
- 什么是西红柿的一个令人惊讶的用途？
- מהו שימוש מפתיע למטאטא?
- 什么是围巾的一个令人惊讶的用途？
- 什么是砖头的一个令人惊讶的用途？
- 什么是橄榄油的一个令人惊讶的用途？
- 什么是咖啡的一个令人惊讶的用途？
- 什么是小米的一个令人惊讶的用途？
- 什么是发簪的一个令人惊讶的用途？
- 什么是核桃的一个令人惊讶的用途？
- 什么是蛋清的一个令人惊讶的用途？
- 什么是柳条的一个令人惊讶的用途？
- What is a surprising use for a BACKPACK?
- 什么是风车的一个令人惊讶的用途？
- 什么是浴缸的一个令人惊讶的用途？
- 什么是弹弓的一个令人惊讶的用途？
- What is a surprising use for a SHOE?
- 什么是毛巾的一个令人惊讶的用途？
- 什么是蚊帐的一个令人惊讶的用途？
- 什么是吹风机的一个令人惊讶的用途？
- Qual è un uso sorprendente per un CAPELLO?
# CONSEQUENCES

**Tests with this type:['setal08' 'h18' 'motesp']**

- What would be a surprising consequence if EVERYONE SHRANK TO 12 INCHES TALL?
- What would be a surprising consequence if PEOPLE NEEDED NO SLEEP?
- What would be a surprising consequence if YOUR TEACHER COULD READ MINDS?
- What would be a surprising consequence if ALIENS LANDED AT YOUR SCHOOL?
- What would be a surprising consequence if RAIN WAS MADE OF SODA?
- What would be a surprising consequence if A KID WAS PRESIDENT?
- What would be a surprising consequence if PEOPLE COULD TRAVEL THROUGH TIME?
# METAPHORS

**Tests with this type:['dbc23']**

- Think of the most boring high-school or college class you’ve ever had. What was it like to sit through?
- Think about the most disgusting thing you ever ate or drank. What was it like to eat or drink it?
- Think about the worst movie or TV show you have ever seen. What was it like to watch it?
- Think of the messiest room that you’ve ever had to live in. What was it like to live there?
# INSTANCES

**Tests with this type:['motesf' 'setal08' 'h18' 'motesp']**

- What is a surprising example of something HUGE?
- What is a surprising thing that makes a NOISE?
- What is a surprising example of something FROZEN?
- What is a surprising example of something RED?
- What is a surprising example of something TASTY?
- What is a surprising example of something WET?
- What is a surprising example of something SOFT?
- What is a surprising example of something FUN?
- What is a surprising example of something SMELLY?
- What is a surprising example of something BIG?
- What is a surprising example of something COLD?


## Check Correlations among Duplicates

Check the correlation among ratings for responses that were submitted more than once. Here, I compare one sampled rating against the mean of the ratings for the same duplicated response. This contextualizes the ceiling on model performance: if humans can't agree (sometimes with themselves!), a model cannot be expected to agree with them any better.

In [None]:
#@markdown Average rating-to-rating correlation on duplicates
# run multiple times with different samples
from tqdm import trange
corrs = []
for i in trange(1000):
    check_dupe_corr = all_data.sample(frac=1, random_state=i**2)
    just_duped = check_dupe_corr[check_dupe_corr[['prompt', 'response']].duplicated(keep=False)]
    first = just_duped.drop_duplicates(['prompt', 'response'], keep='first')
    last = just_duped.drop_duplicates(['prompt', 'response'], keep='last')
    merged = first.merge(last[['prompt', 'response', 'target']], how='inner', on=['prompt', 'response'])
    corr = merged[['target_x', 'target_y']].corr().values[0,1]
    corrs.append(corr)
print("\nAverage correlation among duplicates", sum(corrs)/len(corrs))

In [None]:
#@markdown Average rating-to-mean(other ratings) correlation on duplicates
corrs = []
for i in trange(1000):
    check_dupe_corr = all_data.sample(frac=1)
    means_of_dupes = check_dupe_corr[check_dupe_corr[['prompt', 'response']].duplicated()].groupby(['prompt','response'], as_index=False).target.mean().round(2)
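    # restrict to the two rating columns: DataFrame.corr() raises on non-numeric columns in pandas >= 2.0
    corr = check_dupe_corr.merge(means_of_dupes[['prompt', 'response', 'target']], on=['prompt', 'response'])[['target_x', 'target_y']].corr().loc['target_x', 'target_y']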
    corrs.append(corr)
print("\nAverage correlation among duplicates", sum(corrs)/len(corrs))

## Merge Duplicates

Merge ratings for items with duplicates, so that a response that has been rated multiple times has the average of all instances as its ground truth.

First, set a de-duplication strategy.

In [4]:
all_data['dupe_control'] = fingerprint_series(all_data.response)
# Fix Chinese: re-fingerprint Chinese responses with basic=True
all_data.loc[all_data.language == 'chi', 'dupe_control'] = fingerprint_series(all_data.loc[all_data.language == 'chi', 'response'], basic=True)

### Doublecheck Fingerprinting Algorithm

Double-check the de-duplication algorithm. This was checked *before* the fix line was added above.

See which 'fingerprint' controlled responses have the most variants of a more basic
'uncased match' version of the response. Then inspect the ones that vary the most.

They tend to be mostly Polish, so they are easy for me to inspect.

The big thing I'm checking for is false positives: does the fingerprinting produce anything that
shouldn't be combined? Generally, the answer is 1) *no, the fingerprinting doesn't surface prominent errors*, 2) *except for Chinese*.

Another check is whether the meaning gets changed. My judgement is *no*: it can be argued that 'she burped!!!!' is different from 'she burped', but the difference is likely not important enough to matter.

One thing noticed was that 'podpórka' and 'подпорка' would be combined - a reminder to control by language when merging.
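To make the idea of a fingerprint key concrete, here is a minimal sketch. It is *not* the `fingerprint_series` implementation from `ocsai.data`; `toy_fingerprint` is a hypothetical stand-in that assumes the key is built by lowercasing, stripping accents and punctuation, and sorting the unique tokens. Unlike the output shown in this notebook, it does not transliterate Cyrillic, and letters without a Unicode decomposition (such as 'ł') pass through unchanged.

```python
import string
import unicodedata


def toy_fingerprint(text: str) -> str:
    """Hypothetical fingerprint key; an illustration, not the ocsai implementation."""
    # Lowercase and trim surrounding whitespace
    text = text.strip().lower()
    # Decompose accented characters, then drop the combining marks (ó -> o, ę -> e)
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation with spaces
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Sort the unique tokens so that word order does not matter
    return " ".join(sorted(set(text.split())))


variants = ["jako wazon do kwiatów", "- jako wazon do kwiatów,", "do kwiatów jako wazon"]
keys = {toy_fingerprint(v) for v in variants}
print(keys)  # {'do jako kwiatow wazon'}
```

All three variants collapse to one key, which is the behavior the inspection in this section is checking for. Because the sketch leaves Cyrillic untouched, 'podpórka' and 'подпорка' would *not* collide here; with a transliterating fingerprint they do, which is why language belongs in the grouping columns.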


In [14]:
all_data['basic_control'] = all_data.response.str.lower()
top_variants = all_data[['dupe_control', 'basic_control']].drop_duplicates().dupe_control.value_counts()
for fp in top_variants.index[:35]:
    print(fp)
    print(all_data[all_data.dupe_control == fp].basic_control.unique())
    print()
all_data.drop(columns=['basic_control'], inplace=True)

podporka
['podpórka' 'podpórka.' 'podpórka,' 'podporka' 'подпорка']

wazon
['wazon' 'wazon,' 'wazon.' "wazon'" '- wazon,']

dlugopisy na pojemnik
['pojemnik na długopisy' 'pojemnik na dlugopisy' 'pojemnik na długopisy.'
 'pojemnik na dlugopisy,' 'pojemnik na długopisy...']

jako podstawke
['jako podstawkę' 'jako podstawkę.' 'jako podstawke' 'jako podstawkę,'
 'jako podstawke ,']

doniczka
['doniczka' 'doniczka,' '- doniczka,' 'doniczka.' 'doniczką']

polka
['pólka' 'półka' 'polka' 'półka,']

kubek
['kubek' 'kubek.' 'kubek,' '- kubek,']

jako popielniczke
['jako popielniczkę' 'jako popielniczke' 'jako popielniczkę.'
 'jako popielniczkę,']

jako narzedzie zbrodni
['jako narzędzie zbrodni.' 'jako narzędzie zbrodni,'
 'jako narzedzie zbrodni.' 'jako narzędzie zbrodni']

jako skarbonke
['jako skarbonkę' 'jako skarbonke' 'jako skarbonkę,' 'jako skarbonkę.']

do jako kwiatow wazon
['do kwiatów jako wazon' 'jako wazon do kwiatów'
 '- jako wazon do kwiatów,' 'jako wazon do kwiatów,']

swiecznik

In [111]:
all_data['dupe_control_fp'] = all_data['dupe_control'].copy()

### Combine human judgements for duplicate responses

If a response shows up multiple times in the dataset and has been rated by judges multiple times, combine the judgements. Multiple rater judgements for a specific response have already been averaged before this data was loaded, so the current step essentially averages the averages.
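As a rough illustration of "averaging the averages", the sketch below folds duplicate rows of a toy frame with a plain `groupby`. It is an assumption-laden stand-in: the real `combine_dupes` may combine other columns differently, and the toy data and the `n_merged` column are made up for the example.

```python
import pandas as pd

# Toy data: 'target' is already a per-response mean of rater judgements
toy = pd.DataFrame({
    "language": ["eng", "eng", "eng", "chi"],
    "prompt": ["brick", "brick", "brick", "牙刷"],
    "dupe_control": ["doorstop", "doorstop", "paperweight", "刷牙"],
    "target": [1.5, 2.5, 3.0, 1.0],
})

# Fold rows that share language + prompt + fingerprint into one row,
# taking the unweighted mean of the existing per-response means
folded = (
    toy.groupby(["language", "prompt", "dupe_control"], as_index=False)
       .agg(target=("target", "mean"), n_merged=("target", "count"))
)
print(folded)
```

Note that an unweighted mean treats every occurrence equally, regardless of how many raters stood behind each one; weighting by `rater_count` would be a different design choice.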

In [5]:
# Group rows by language + type + question + prompt + dupe_control and fold each group
# of duplicates into a single row; combine_dupes decides how the remaining columns are combined.
from tqdm.auto import tqdm
tqdm.pandas()
groupcols = ['language', 'type', 'question', 'prompt', 'dupe_control']
dedupe = all_data.groupby(groupcols, as_index=False).progress_apply(combine_dupes)


In [6]:
og_size = len(all_data)
deduped_size = len(dedupe)
print("Folded {} rows into {} deduped rows ({}%)".format(og_size, deduped_size, round(deduped_size/og_size*100, 2)))

Folded 162872 rows into 103773 deduped rows (63.71%)


In [7]:
# Save
dedupe.to_csv('../data/dedupe.csv', index=False)

### Caveats

This is a good faith deduplication, folding together exact duplicate responses, as well as near duplicates.

However, large language models are capable of recognizing that completely different sentences carry near-identical meaning (e.g. 'the feline leaped' vs. 'the cat jumped'). That *isn't* controlled by this deduplication.

Another thing that is not controlled by the deduplication is the same response occurring in different languages.

For example, an unoriginal response for 'toothbrush' (brush your teeth) may occur in English, with the identical use for 牙刷 (刷牙) in Chinese. However, at least after deduplication there's only one of each in the dataset, rather than 20 occurrences of the English response and 178 occurrences in the Chinese data, as in the original data.

## Split Dataset

Rather than creating a bunch of different datasets for different splits, I'll pre-process a number of sensible splits at once and save them as different columns.

Split reference:

- `default_split`: an 80/5/15 split, entirely randomized by row
- `prompt_split`: a condition where prompts are either in train or val+test, but not both. For evaluating whether a model trained on some prompts can perform well on new prompts.
- `lang_split`: a condition where train is English and evaluation is other languages. AUT only.
- `type_split`: a condition where training is only on AUT, and evaluation is on other tasks.
- `participant_split`: a condition where participants are wholly in train, val, or test. Not currently crunched, because it's complicated by deduping.
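The column-based splits above all share one idea: every row with the same value (prompt, language, or type) receives the same label. The sketch below illustrates that idea only. `toy_group_split` and its parameters are hypothetical, it is not the `col_split` from `ocsai.train`, and it uses a single 'eval' label in place of separate val and test sets.

```python
import random

import pandas as pd


def toy_group_split(values: pd.Series, train_frac: float = 0.8,
                    include_in_train=(), seed: int = 987) -> pd.Series:
    """Hypothetical group-wise split: rows sharing a value share a label."""
    forced = set(include_in_train)
    # Groups that are not forced into train get shuffled deterministically
    rest = [g for g in values.unique() if g not in forced]
    random.Random(seed).shuffle(rest)
    # Number of extra groups to add to train beyond the forced ones;
    # train_frac=0 means only the forced groups end up in train
    n_extra = max(round(train_frac * values.nunique()) - len(forced), 0)
    train_groups = forced | set(rest[:n_extra])
    return values.map(lambda g: "train" if g in train_groups else "eval")


toy = pd.DataFrame({"prompt": ["brick", "brick", "box", "rope", "rope", "fork", "hat", "hat"]})
toy["prompt_split"] = toy_group_split(toy["prompt"], train_frac=0.6, include_in_train=["brick"])

langs = pd.Series(["eng", "chi", "ger", "eng"])
lang_split = toy_group_split(langs, train_frac=0, include_in_train=["eng"])
print(toy)
print(lang_split.tolist())  # ['train', 'eval', 'eval', 'train']
```

Splitting by group rather than by row is what keeps a held-out prompt or language entirely unseen during training.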

*Unfinished below*

In [22]:
dedupe.loc[dedupe.type!='uses', 'lang_split'].isna().sum()

14781

In [23]:
random_seed = 987
dedupe['default_split'] = default_split(len(dedupe), random_seed=random_seed)
dedupe['prompt_split'] = col_split(dedupe['prompt'], random_seed=random_seed, include_in_train=['brick'])
# train_size=0 just means that we stop adding to the training set after including the 'include_in_train' items
dedupe['lang_split'] = col_split(dedupe['language'], random_seed=random_seed,
                                 train_size=0, include_in_train=['eng']) 
dedupe.loc[dedupe.type!='uses', 'lang_split'] = pd.NA
dedupe['type_split'] = col_split(dedupe['type'], random_seed=random_seed,
                                 train_size=0, include_in_train=['uses'])
dedupe.to_csv('../data/ocsai-all.csv', index=False)

# Dataset Reference

#### gt_main2

This is the main split in the first LLM paper.

- Data size is 20202 items
- seed 987
- targetsplits {'train': 80, 'val': 5, 'test': 15}
- split_by_part: False; split_by_prompt: False
- Final split sizes: [80.0, 5.0, 15.0]
- (gt_main_std *should* be identical, with stdev included, but I haven't doublechecked)

#### gt_byparticipant
- Data size is 20202 items
- seed 987
- targetsplits {'train': 80, 'val': 5, 'test': 15}
- split_by_part: True; split_by_prompt: False
- Final split sizes: [80.7, 4.7, 14.6]

#### gt_byprompt

This is the split used for having different prompts between test and train in the first LLM paper.

- Data size is 20202 items
- seed 987
- targetsplits {'train': 79, 'val': 4, 'test': 17}
- split_by_part: False; split_by_prompt: True
- train: ['brick', 'box', 'knife', 'rope', 'book', 'table', 'tire', 'fork', 'ball', 'pencil', 'lightbulb', 'shoe', 'hat', 'sock', 'toothbrush', 'backpack']
- val: []
- test: ['paperclip', 'spoon', 'bottle', 'shovel', 'pants']
- Final split sizes: [83.2, 0.0, 16.8]

#### all

This is the condition for training the final model, for use in the Open Creativity Scoring system. It was *not* deduplicated to avoid data leakage: the model is for applied use, where the performance gained from previously seen responses is desirable.

- Data size is 27217 items
- seed 987
- targetsplits {'train': 94, 'val': 1, 'test': 5}
- split_by_part: False; split_by_prompt: False
- Final split sizes: [94.0, 1.0, 5.0]

#### gt_alltests2

This is the condition that includes the consequences, instances, and sentence-completion tasks. It may grow outdated as I develop a format for training with this data.

It doesn't have a train/test split; instead it has multiple numbered groups so that an ensemble can be trained.

- Data size is 31567 items
- seed 987
- targetsplits {'group1': 32, 'group2': 32, 'group3': 32, 'val': 4}
- split_by_part: False; split_by_prompt: False
- Final split sizes: [32.0, 32.0, 32.0, 4.0]

In [None]:
print("All gt options")
print([x.stem.split('.')[0] for x in base_dir.glob('*tar.gz')])

All gt options
['gt_main', 'gt_bypart3', 'gt_byprompt4', 'gt_byparticipant', 'gt_byprompt', 'all', 'gt_main2', 'gt_main_std', 'gt_alltests1']
