In [46]:
import pandas as pd
import numpy as np
from itertools import combinations

# Pilot triads/trials
This notebook generates triads (and trials) for a pilot experiment in the multilingual semantic triads project.

## How many triads/trials are needed
With current methods (see simulation notebooks), we need to sample about 50% of all possible triads before the amount of semantic similarity structure we can recover plateaus. Because the number of possible triads grows roughly cubically with the number of concepts, we start with a set of only 50 concepts.

Below we briefly demonstrate how many participants are theoretically needed to recover a reasonable amount of semantic similarity structure for 50 concepts, given a fixed number of trials per participant.

In [47]:
concepts = list(range(50))  # numbers to use as stand-ins for concepts
max_triads = len(list(combinations(concepts, 3)))
print(f'Number of possible triads for 50 concepts: {max_triads}')
print(f'Number of triads we need to sample to recover structure for 50 concepts: {max_triads / 2:.0f}')
print(f'Number of participants needed, assuming 5 triads per trial and 100 trials per participant: '
        + f'{(max_triads / 2) / (5 * 100):.0f}')

Number of possible triads for 50 concepts: 19600
Number of triads we need to sample to recover structure for 50 concepts: 9800
Number of participants needed, assuming 5 triads per trial and 100 trials per participant: 20
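
To make the scaling concrete, the same calculation can be wrapped in a small helper and run for several concept-set sizes. This is a minimal sketch: the function name `participants_needed` and its parameters are hypothetical, and the 50% sampling fraction, 5 triads per trial, and 100 trials per participant are simply the assumptions used in the cell above.

```python
from math import ceil, comb


def participants_needed(n_concepts, fraction=0.5, triads_per_trial=5,
                        trials_per_participant=100, observations_per_triad=1):
    """Estimate participants needed to sample `fraction` of all triads.

    Hypothetical helper; the defaults mirror the assumptions in the cell above.
    """
    n_triads = comb(n_concepts, 3)  # number of unordered triads
    triads_to_sample = fraction * n_triads * observations_per_triad
    triads_per_participant = triads_per_trial * trials_per_participant
    # Round up: a fraction of a participant still means one more participant.
    return ceil(triads_to_sample / triads_per_participant)


for n in (20, 50, 100, 200):
    print(f'{n:>3} concepts: {comb(n, 3):>9} triads, '
          f'{participants_needed(n):>5} participants')
```

Unlike the cell above, this sketch rounds up rather than to the nearest integer. Setting `observations_per_triad=2` doubles the estimate, which is the cost of collecting repeated observations for every sampled triad.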


From an experimenter's perspective, it might make sense to gather two or more observations per triad, because we usually do not trust single observations very much. It's important to keep in mind, however, that the triads are not independent, since they describe a structured space. For example, if we have collected a number of triads that suggest _cat_ is similar to various animals while being dissimilar from vehicles, then a new participant rating _cat_ as being more similar to _train_ than to _dog_ won't carry much weight in the determination of the overall structure.
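
One way to see this interdependence is to count how often each pair of concepts co-occurs across all possible triads. The sketch below does this for 50 stand-in concepts; it only illustrates the combinatorics and is not part of the triad-generation procedure itself.

```python
from collections import Counter
from itertools import combinations

concepts = range(50)  # numbers as stand-ins for concepts, as in the cell above

# Count, for every pair of concepts, the number of triads containing that pair.
pair_counts = Counter()
for triad in combinations(concepts, 3):
    pair_counts.update(combinations(triad, 2))

print(f'Number of concept pairs: {len(pair_counts)}')
print(f'Triads containing any given pair: {set(pair_counts.values())}')
```

Every pair of concepts appears in 48 triads (one for each remaining concept), so any single pairwise similarity is informed by many different triads.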

## Picking pilot concepts
In order to have good materials to compare the recovered semantic similarity structure to, we want concepts that are represented in other lexical semantics datasets. There are a number of such datasets available, but some of the most used in our field are the McRae feature norms (2005), the Buchanan feature norms (?), and the Small World of Words data (?).

Ideally, we want to use words that are represented in all three of these datasets, but the Buchanan and McRae norms in particular don't have much overlap. To see what we have to work with, we compute the intersection of concepts represented in all three datasets.

In [50]:
df_mcrae = pd.read_csv('datasets/mcrae_concepts.txt', sep='\t')
df_swow = pd.read_csv('datasets/SWOW-EN.R100.csv')
df_buchanan = pd.read_csv('datasets/buchanan_words.csv')
display(df_mcrae)
display(df_swow)
display(df_buchanan)

Unnamed: 0,Concept,Pronunciation,Phon_1st,KF,ln(KF),BNC,ln(BNC),Familiarity,Length_Letters,Length_Phonemes,...,Num_Func,Num_Vis_Mot,Num_VisF&S,Num_Vis_Col,Num_Sound,Num_Taste,Num_Smell,Num_Tact,Num_Ency,Num_Tax
0,accordion,[@][kO:][dj@n],@,1,0.00,2,0.69,2.90,9,7,...,2,0,2,0,2,0,0,0,2,1
1,airplane,[E@][pleIn],E@,21,3.04,108,4.68,6.55,8,5,...,3,3,5,0,0,0,0,0,2,0
2,alligator,[&][lI][geI][t@r*],&,4,1.39,114,4.74,3.75,9,8,...,0,2,6,1,0,0,0,0,5,2
3,ambulance,[&m][bjU][l@ns],&,7,1.95,1846,7.52,6.45,9,9,...,7,1,4,3,1,0,0,0,1,2
4,anchor,[&N][k@r*],&,17,2.83,700,6.55,3.85,6,5,...,3,0,3,0,0,0,0,1,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536,wrench,[rEntS],r,2,0.69,213,5.36,4.70,6,4,...,8,0,3,0,0,0,0,1,1,1
537,yacht,[jOt],j,7,1.95,1426,7.26,3.85,5,3,...,5,0,3,0,0,0,0,0,5,1
538,yam,[j&m],j,1,0.00,51,3.93,3.30,3,3,...,2,0,0,1,0,1,0,0,2,1
539,zebra,[zE][br@],z,1,0.00,276,5.62,2.60,5,5,...,2,2,5,2,0,0,0,0,2,3


Unnamed: 0.1,Unnamed: 0,id,participantID,age,gender,nativeLanguage,country,education,created_at,cue,R1,R2,R3
0,1,29,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,although,nevertheless,yet,but
1,2,30,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,deal,no,cards,shake
2,3,31,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,music,notes,band,rhythm
3,4,32,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,inform,tell,rat on,
4,5,33,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,way,path,via,method
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228195,1228196,1530300,132506,29,Ma,Canada,Australia,5.0,2018-08-10 01:56:27,strange,mask,weird,stranger
1228196,1228197,1530290,132506,29,Ma,Canada,Australia,5.0,2018-08-10 01:56:27,sunset,sea,sky,clause
1228197,1228198,1530291,132506,29,Ma,Canada,Australia,5.0,2018-08-10 01:56:27,useless,pitty,worthless,worth
1228198,1228199,1530284,132506,29,Ma,Canada,Australia,5.0,2018-08-10 01:56:27,volume,loud,music,key


Unnamed: 0,where,cue,feature,translated,frequency_feature,Unnamed: 5
0,top,abandon,desert,desert,9,
1,top,abandon,give,give,19,
2,top,abandon,leave,leave,26,
3,top,abandon,leaving,leave,1,
4,top,abandon,left,leave,5,
...,...,...,...,...,...,...
49041,top,TRUE,rightly,right,1,
49042,top,TRUE,truth,truth,10,
49043,top,TRUE,unfaithful,faith,1,
49044,top,TRUE,unreal,real,1,


In [54]:
concepts_mcrae = set(df_mcrae['Concept'])
concepts_swow = set(df_swow['cue'])
concepts_buchanan = set(df_buchanan['cue'])
print(f'Number of concepts in McRae feature norms: {len(concepts_mcrae)}')
print(f'Number of concepts in Small World of Words: {len(concepts_swow)}')
print(f'Number of concepts in Buchanan feature norms: {len(concepts_buchanan)}')
print()
print(f'Number of concepts in both McRae and SWoW: {len(concepts_mcrae & concepts_swow)}')
print(f'Number of concepts in both McRae and Buchanan: {len(concepts_mcrae & concepts_buchanan)}')
print(f'Number of concepts in both Buchanan and SWoW: {len(concepts_buchanan & concepts_swow)}')
print()
intersect = concepts_mcrae & concepts_swow & concepts_buchanan
print(f'Number of concepts in all three datasets: {len(intersect)}')

Number of concepts in McRae feature norms: 541
Number of concepts in Small World of Words: 12282
Number of concepts in Buchanan feature norms: 3722

Number of concepts in both McRae and SWoW: 500
Number of concepts in both McRae and Buchanan: 63
Number of concepts in both Buchanan and SWoW: 3529

Number of concepts in all three datasets: 63
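
The set intersections above rely on exact string matches. The Buchanan preview includes an upper-case cue (`TRUE`), so it may be worth checking whether differences in case or stray whitespace hide any overlap. The sketch below shows one way to normalize cues before intersecting. The helper `normalize_cues` is hypothetical, the data are toy values, and whether normalization changes the real counts is not known.

```python
import pandas as pd


def normalize_cues(cues):
    """Return a set of lower-cased, whitespace-stripped cue strings."""
    cleaned = pd.Series(cues).dropna().astype(str).str.strip().str.lower()
    return set(cleaned)


# Toy data, not taken from the real datasets.
toy_a = ['Apple', 'zebra ', 'knife']
toy_b = ['apple', 'zebra', 'TRUE', None]

print(sorted(set(toy_a) & set(toy_b)))                        # exact matching
print(sorted(normalize_cues(toy_a) & normalize_cues(toy_b)))  # after normalizing
```

With the toy data, exact matching finds no shared cues, while the normalized sets share `apple` and `zebra`.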


### Categorizing and selecting concepts
From the intersection of the three datasets, we can choose our 50 concepts. It makes sense not to do this at random, but to put the concepts into rough semantic categories and ensure good coverage of a few of these categories. It's easiest to store them in a TSV file now, work on them outside this notebook, and then load them in again afterwards.

We will also insert a few verbs into the list, as the intersection concepts are all fairly basic concrete nouns.

In [76]:
df = pd.DataFrame(list(intersect), columns=['concept'])
df['category'] = ''
display(df)
df.to_csv('pilot_items_uncategorized.tsv', sep='\t', index=False)

Unnamed: 0,concept,category
0,gloves,
1,apple,
2,zebra,
3,cabin,
4,knife,
...,...,...
58,shirt,
59,brush,
60,pyramid,
61,snail,


In [75]:
df = pd.read_csv('pilot_items_categorized.tsv', sep='\t')

FileNotFoundError: [Errno 2] No such file or directory: 'pilot_items_categorized.tsv'
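
Once a categorized file is available, selecting concepts with balanced category coverage could look roughly like the sketch below. It uses toy data instead of `pilot_items_categorized.tsv`; the category labels, the helper name `pick_per_category`, and the per-category quota are all hypothetical.

```python
import pandas as pd


def pick_per_category(df, n_per_category, seed=0):
    """Pick up to `n_per_category` concepts from each category at random."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    picked = shuffled.groupby('category').head(n_per_category)
    return picked.sort_values(['category', 'concept']).reset_index(drop=True)


# Toy stand-in for the categorized TSV (the categories are made up).
toy = pd.DataFrame({
    'concept': ['apple', 'zebra', 'snail', 'knife', 'brush', 'shirt', 'gloves'],
    'category': ['food', 'animal', 'animal', 'tool', 'tool', 'clothing', 'clothing'],
})

pilot = pick_per_category(toy, n_per_category=1)
print(pilot)
```

Shuffling first and then taking the first rows of each group keeps the selection random while tolerating categories that contain fewer concepts than the quota.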