# Create Matching Study Stimuli

This notebook processes the pilot data to create stimulus files for a follow-up matching study.

## Process:
1. Randomly select one participant per stimulus-condition combination
2. Extract that participant's 9 scoring-trial sounds
3. Create one JSON file per combination, with the sounds in randomized order and all 9 animation choices attached to each sound

In [1]:
import json
import random
from collections import defaultdict
from pathlib import Path

# Set random seed for reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

print(f"Random seed set to: {RANDOM_SEED}")

Random seed set to: 42
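
Seeding the global `random` module makes the results depend on the order in which cells run, because every later `random.choice` and `random.shuffle` call advances the same shared state. A minimal sketch of one alternative, assuming a dedicated `random.Random` instance and sorted keys; the `pick_one_per_group` helper and the toy IDs are hypothetical and not part of this notebook:

```python
import random

def pick_one_per_group(groups, seed):
    """Pick one member per group, reproducibly.

    `groups` maps a group key to a list of member IDs. Keys and members are
    sorted first so the result does not depend on insertion order.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return {key: rng.choice(sorted(groups[key])) for key in sorted(groups)}

# Toy data (hypothetical IDs), inserted in two different orders
groups_a = {("musical", "A"): ["p1", "p2", "p3"], ("referential", "A"): ["p4", "p5"]}
groups_b = {("referential", "A"): ["p5", "p4"], ("musical", "A"): ["p3", "p1", "p2"]}

picks_a = pick_one_per_group(groups_a, seed=42)
picks_b = pick_one_per_group(groups_b, seed=42)
print(picks_a == picks_b)
```

With this pattern, the same seed yields the same picks regardless of how many other random calls ran earlier in the session.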


## Load Data

In [2]:
# Load the pilot data
data_path = Path('temp/image-scoring-cogsci-data.json')

with open(data_path, 'r') as f:
    data = json.load(f)

print(f"Loaded {len(data)} total trials")

# Filter to only scoring trials
scoring_trials = [t for t in data if t.get('study_phase') == 'scoring_trial']
print(f"Found {len(scoring_trials)} scoring trials")

Loaded 5276 total trials
Found 2344 scoring trials


## Organize by Condition and Stimulus

In [3]:
# Group trials by condition, stimulus, and participant
grouped_data = defaultdict(lambda: defaultdict(list))

for trial in scoring_trials:
    condition = trial['session_info']['condition']
    stimulus = trial['session_info']['stimulus']
    participant_id = trial['prolific']['prolificSessionID']
    
    key = (condition, stimulus)
    grouped_data[key][participant_id].append(trial)

# Display summary
print("Available data:")
for (condition, stimulus), participants in sorted(grouped_data.items()):
    print(f"  {stimulus}-{condition}: {len(participants)} participants, each with {[len(trials) for trials in participants.values()]} trials")

Available data:
  A-musical: 16 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  B-musical: 17 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  C-musical: 17 participants, each with [9, 9, 9, 9, 9, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  D-musical: 16 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  E-musical: 17 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 1, 9, 9, 9, 9, 9, 9, 9, 9] trials
  F-musical: 16 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  G-musical: 16 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] trials
  H-musical: 17 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 9, 9, 9, 9] trials
  A-referential: 19 participants, each with [9, 9, 9, 9, 9, 11, 9, 9, 9, 1, 1, 9, 9, 9, 9, 1, 9, 9, 9] trials
  B-referential: 16 participants, each with [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] t
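
The summary shows several participants with trial counts other than 9 (1, 8, or 11). Because the selection step draws from every participant in a group, one of these could in principle be chosen. A minimal sketch of a pre-filter, assuming the same nested `{(condition, stimulus): {participant_id: [trials]}}` shape as `grouped_data`. The `keep_complete` helper and the `EXPECTED_TRIALS` constant are hypothetical names, and applying such a filter would change which participants a given seed selects:

```python
EXPECTED_TRIALS = 9  # assumed number of scoring trials for a complete participant

def keep_complete(grouped, expected=EXPECTED_TRIALS):
    """Return a copy of `grouped` without participants whose trial count differs from `expected`."""
    return {
        key: {pid: trials for pid, trials in participants.items() if len(trials) == expected}
        for key, participants in grouped.items()
    }

# Toy example with placeholder trials (empty dicts stand in for real trial records)
toy = {
    ("musical", "C"): {"p1": [{}] * 9, "p2": [{}] * 1},
    ("referential", "A"): {"p3": [{}] * 11, "p4": [{}] * 9, "p5": [{}] * 8},
}
complete = keep_complete(toy)
print({key: sorted(participants) for key, participants in complete.items()})
```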

## Select One Participant per Condition-Stimulus Combination

In [4]:
# Randomly select one participant for each condition-stimulus combination
selected_participants = {}

for (condition, stimulus), participants in grouped_data.items():
    # Get list of participant IDs
    participant_ids = list(participants.keys())
    
    # Randomly select one
    selected_id = random.choice(participant_ids)
    selected_participants[(condition, stimulus)] = {
        'participant_id': selected_id,
        'trials': participants[selected_id]
    }
    
    print(f"{stimulus}-{condition}: Selected participant {selected_id} ({len(participants[selected_id])} trials)")

A-musical: Selected participant 696afcc968ceaae5fd8b1172 (9 trials)
A-referential: Selected participant 69700faab91841c862d732c0 (9 trials)
B-referential: Selected participant 697014129af5c95901b64166 (9 trials)
B-musical: Selected participant 69701692bc43cf04d892e464 (9 trials)
H-musical: Selected participant 6970d2d39189a65db58966ed (9 trials)
G-referential: Selected participant 6970d4ed9e61e7e7f16fb250 (9 trials)
E-referential: Selected participant 6970d7d4165d6d8181eae14a (9 trials)
F-referential: Selected participant 6970d5a4befbe4f1424ebca2 (9 trials)
D-referential: Selected participant 6970e053bf80d8d279329863 (9 trials)
H-referential: Selected participant 6970d60214567613d620853f (9 trials)
C-musical: Selected participant 6970d59b77e38462a91fdfe7 (9 trials)
G-musical: Selected participant 6970d6f1f3dd5690252ba2bb (9 trials)
E-musical: Selected participant 6970d7dd99ef0b8aa6e17f86 (9 trials)
C-referential: Selected participant 6970e1b60c3145d1f8559130 (9 trials)
D-musical: Selec

## Create JSON Files for Each Combination

Each file will contain:
- An array of 9 sounds (in random order)
- Each sound has:
  - `audio`: the audio blob from the trial
  - `choices`: an array of all 9 animation clips for the stimulus, each with `start_state` and `end_state`
    - Each choice includes a `correct_answer` boolean, `true` only for the animation that matches the sound

In [7]:
# Create output directory
output_dir = Path('matching_stimuli')
output_dir.mkdir(exist_ok=True)

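# Process each selected participant
# (the loop variable is named `selection` so it does not overwrite the pilot `data` loaded earlier)
for (condition, stimulus), selection in selected_participants.items():
    trials = selection['trials']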
    
    # Sort trials by stimulus_index to ensure correct order
    trials_sorted = sorted(trials, key=lambda t: t['stimulus_index'])
    
    # Extract all animation data (these are the 9 clips for this stimulus)
    all_animations = []
    for trial in trials_sorted:
        animation = {
            'start_state': trial['animation_data']['start_state'],
            'end_state': trial['animation_data']['end_state']
        }
        all_animations.append(animation)
    
    # Create the sound entries
    sounds = []
    for i, trial in enumerate(trials_sorted):
        # Create choices array with correct_answer flags
        choices = []
        for j, anim in enumerate(all_animations):
            choice = {
                'start_state': anim['start_state'],
                'end_state': anim['end_state'],
                'correct_answer': (i == j)  # True if this animation matches this sound
            }
            choices.append(choice)
        
        sound_entry = {
            'audio': trial['audio_blob'],
            'choices': choices
        }
        sounds.append(sound_entry)
    
    # Verify correct_answer flags before shuffling
    print(f"\n  Verifying {stimulus}-{condition}:")
    for i, sound in enumerate(sounds):
        # Find which choice is marked as correct
        correct_indices = [j for j, choice in enumerate(sound['choices']) if choice['correct_answer']]
        
        # Should be exactly one correct answer
        assert len(correct_indices) == 1, f"Sound {i} has {len(correct_indices)} correct answers (expected 1)"
        
        # The correct choice should match the original animation for this trial
        correct_idx = correct_indices[0]
        original_animation = all_animations[i]
        correct_choice = sound['choices'][correct_idx]
        
        # Verify the correct choice matches the original animation
        assert correct_choice['start_state'] == original_animation['start_state'], \
            f"Sound {i}: correct choice start_state doesn't match original"
        assert correct_choice['end_state'] == original_animation['end_state'], \
            f"Sound {i}: correct choice end_state doesn't match original"
    
    print(f"  ✓ All {len(sounds)} sounds have correct_answer properly set to match their original animations")
    
    # Randomize the order of sounds
    random.shuffle(sounds)
    
    # Save to file
    filename = output_dir / f"{stimulus}-{condition}.json"
    with open(filename, 'w') as f:
        json.dump(sounds, f, indent=2)
    
    print(f"Created {filename} with {len(sounds)} sounds, each with {len(sounds[0]['choices'])} choices")


  Verifying A-musical:
  ✓ All 9 sounds have correct_answer properly set to match their original animations
Created matching_stimuli/A-musical.json with 9 sounds, each with 9 choices

  Verifying A-referential:
  ✓ All 9 sounds have correct_answer properly set to match their original animations
Created matching_stimuli/A-referential.json with 9 sounds, each with 9 choices

  Verifying B-referential:
  ✓ All 9 sounds have correct_answer properly set to match their original animations
Created matching_stimuli/B-referential.json with 9 sounds, each with 9 choices

  Verifying B-musical:
  ✓ All 9 sounds have correct_answer properly set to match their original animations
Created matching_stimuli/B-musical.json with 9 sounds, each with 9 choices

  Verifying H-musical:
  ✓ All 9 sounds have correct_answer properly set to match their original animations
Created matching_stimuli/H-musical.json with 9 sounds, each with 9 choices

  Verifying G-referential:
  ✓ All 9 sounds have correct_answer
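
In the files written by this cell, only the sounds are shuffled. The choices stay in `stimulus_index` order, so the position of the correct answer is tied to the sound's original index. If the presentation layer does not randomize choice order itself, a per-sound shuffle could be sketched as follows. The `shuffle_choices` helper and the toy states are hypothetical:

```python
import copy
import random

def shuffle_choices(sounds, seed):
    """Return a deep copy of `sounds` with each sound's choices independently shuffled."""
    rng = random.Random(seed)
    shuffled = copy.deepcopy(sounds)  # leave the input untouched
    for sound in shuffled:
        rng.shuffle(sound["choices"])
    return shuffled

# Toy input in the same shape as the generated files (placeholder states)
toy_sounds = [
    {
        "audio": f"blob-{i}",
        "choices": [
            {"start_state": j, "end_state": j + 1, "correct_answer": i == j}
            for j in range(3)
        ],
    }
    for i in range(3)
]

result = shuffle_choices(toy_sounds, seed=42)

# Each sound still has exactly one correct choice, pointing at the same animation
for i, sound in enumerate(result):
    correct = [c for c in sound["choices"] if c["correct_answer"]]
    print(i, len(correct), correct[0]["start_state"])
```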

## Verify Output

Let's check one of the generated files to verify the structure:

In [6]:
# Load and examine one output file
output_files = list(output_dir.glob('*.json'))
print(f"Generated {len(output_files)} files:")
for f in sorted(output_files):
    print(f"  - {f.name}")

if output_files:
    # Examine the first file returned by glob (order is filesystem-dependent, not alphabetical)
    sample_file = output_files[0]
    with open(sample_file, 'r') as f:
        sample_data = json.load(f)
    
    print(f"\n{sample_file.name} structure:")
    print(f"  - Total sounds: {len(sample_data)}")
    print(f"  - Choices per sound: {len(sample_data[0]['choices'])}")
    print(f"  - Sample sound entry keys: {list(sample_data[0].keys())}")
    print(f"  - Sample choice keys: {list(sample_data[0]['choices'][0].keys())}")
    
    # Count correct answers per sound
    for i, sound in enumerate(sample_data):
        correct_count = sum(1 for choice in sound['choices'] if choice['correct_answer'])
        print(f"  - Sound {i+1}: {correct_count} correct answer(s)")

Generated 16 files:
  - A-musical.json
  - A-referential.json
  - B-musical.json
  - B-referential.json
  - C-musical.json
  - C-referential.json
  - D-musical.json
  - D-referential.json
  - E-musical.json
  - E-referential.json
  - F-musical.json
  - F-referential.json
  - G-musical.json
  - G-referential.json
  - H-musical.json
  - H-referential.json

D-referential.json structure:
  - Total sounds: 9
  - Choices per sound: 9
  - Sample sound entry keys: ['audio', 'choices']
  - Sample choice keys: ['start_state', 'end_state', 'correct_answer']
  - Sound 1: 1 correct answer(s)
  - Sound 2: 1 correct answer(s)
  - Sound 3: 1 correct answer(s)
  - Sound 4: 1 correct answer(s)
  - Sound 5: 1 correct answer(s)
  - Sound 6: 1 correct answer(s)
  - Sound 7: 1 correct answer(s)
  - Sound 8: 1 correct answer(s)
  - Sound 9: 1 correct answer(s)
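
The cell above inspects a single file. The same check can be extended to every generated file. A sketch, assuming the layout described earlier (a list of sounds, each with `audio` and `choices` keys); the `validate_stimulus_file` helper is hypothetical, and the demo writes toy files to a temporary directory rather than reading `matching_stimuli/`:

```python
import json
import tempfile
from pathlib import Path

def validate_stimulus_file(path, expected=9):
    """Return a list of problems found in one stimulus file (empty if none)."""
    with open(path, "r") as f:
        sounds = json.load(f)
    problems = []
    if len(sounds) != expected:
        problems.append(f"expected {expected} sounds, found {len(sounds)}")
    for i, sound in enumerate(sounds):
        if len(sound["choices"]) != expected:
            problems.append(f"sound {i}: {len(sound['choices'])} choices")
        n_correct = sum(1 for c in sound["choices"] if c["correct_answer"])
        if n_correct != 1:
            problems.append(f"sound {i}: {n_correct} correct answers")
    return problems

# Toy files with 3 sounds each: one well-formed, one with a duplicated correct flag
good = [
    {"audio": f"blob-{i}", "choices": [{"correct_answer": i == j} for j in range(3)]}
    for i in range(3)
]
bad = json.loads(json.dumps(good))  # independent copy via a JSON round trip
bad[0]["choices"][1]["correct_answer"] = True

with tempfile.TemporaryDirectory() as tmp:
    good_path = Path(tmp) / "good.json"
    bad_path = Path(tmp) / "bad.json"
    good_path.write_text(json.dumps(good))
    bad_path.write_text(json.dumps(bad))
    good_problems = validate_stimulus_file(good_path, expected=3)
    bad_problems = validate_stimulus_file(bad_path, expected=3)

print(good_problems)
print(bad_problems)
```

For the real output, the helper could be applied to each path from `output_dir.glob('*.json')` with the default `expected=9`.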


## Summary

The notebook has successfully:
1. ✅ Loaded the pilot data
2. ✅ Grouped by condition and stimulus
3. ✅ Randomly selected one participant per combination (seed: 42)
4. ✅ Created 16 JSON files (8 stimuli × 2 conditions), each with:
   - 9 sounds in random order
   - An audio blob and all 9 animation choices for each sound
   - A `correct_answer` boolean on each choice

Files are saved in the `matching_stimuli/` directory.