<!-- uncomment for html renders -->
<!-- ---
title: "Replication of Drawings of real-world scenes during free recall reveal detailed object and spatial information in memory by Bainbridge et al. (2019, Nature Communications)"
author: "Haoyu Du (h6du)"
date: "`r format(Sys.time(), '%B %d, %Y')`"
format:
  html:
    toc: true
    toc_depth: 3
--- -->

## Introduction

Bainbridge et al. (2019) used a free-recall drawing paradigm to investigate the content of visual memory for static, real-world scenes. Their results show that participants' drawings, produced both immediately and after a delay, reliably captured object and spatial information. The study is methodologically rigorous, combining drawing, memory, and rating tasks. Replicating the free-drawing component is too resource-intensive, but it is feasible and worthwhile to recruit new participants on Prolific to complete the drawing-to-picture matching task using the existing drawing dataset, which allows us to assess the robustness of the original main effect. If time permits, we will also conduct a model-based analysis using vision-language models (e.g., CLIP or DeepMeaning) to provide a computational benchmark for alignment with human performance, though that is not the main goal of this project.

The original stimulus set contains 90 scene images, 3 per scene category. These images were selected from validated naturalistic scene image datasets and controlled for memorability and low-level visual features. For the present replication, we select a subset of categories whose images are appropriate for future extension to developmental studies. Specifically, we have chosen [17 of the original 30 scene categories](https://github.com/haoyudu/bainbridge2019/tree/main/data/stim) based on their suitability for children (e.g., playground, bedroom, kitchen), retaining both the high-memorable and low-memorable exemplars in each category for a total of 34 images that have drawings, and using the original foil in each category to preserve the 3AFC experimental design.

The target comparison for this replication is the contrast between Delayed Recall drawings and Category Drawings. This is one of the most theoretically important effects in the original study, because it demonstrates that memory drawings contain visual information specific to the studied images rather than merely canonical representations of scene categories. In the original study, Delayed Recall drawings were correctly matched to their corresponding images significantly more often than Category Drawings, with a very large effect size (Cohen's d $\approx$ 3.7). Given this robust effect, we substantially reduce both the stimulus set and the sample size, while preserving the structure of the original design through principled subsampling.

Following the original procedure, each drawing will be presented alongside 3 scene images from the same category (1 target, 2 foils), and raters will judge which image the drawing best represents. To balance statistical power with cost efficiency, we will recruit 10-12 independent raters per drawing (as opposed to 24 in the original study), as the high inter-rater reliability suggests that fewer raters will still provide stable estimates. Fourteen Prolific participants will complete approximately 60 trials each.


### Links

[Project repository](https://github.com/haoyudu/bainbridge2019)

Original paper: Bainbridge, W.A., Hall, E.H. & Baker, C.I. [Drawings of real-world scenes during free recall reveal detailed object and spatial information in memory.](https://github.com/haoyudu/bainbridge2019/blob/main/original_paper/bainbridge2019.pdf) Nat Commun 10, 5 (2019). 



## Methods

### Power Analysis

In the original study, Delayed Recall drawings were correctly matched to their corresponding images by 84.3% of raters (SD = 10.9%), significantly outperforming Category Drawings at 30.7% (SD = 17.4%), yielding a very large effect size (Cohen's d $\approx$ 3.7). Given this robust effect, we conducted a conservative *a priori* power analysis for a two-tailed independent-samples t-test, assuming a substantially attenuated effect size of d = 2.0 and $\alpha$ = 0.05. The analysis indicated that only 7 drawings per condition (14 total) would be required to detect the effect with 90% power.
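
The reported effect size can be approximately recovered from the summary statistics above. The following Python sketch is illustrative only: it assumes an equal-weight pooled standard deviation (the original conditions had unequal numbers of drawings, so the exact value may differ slightly), and the function name `cohens_d_from_summary` is hypothetical.

```python
import math

def cohens_d_from_summary(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using an equal-weight pooled SD (assumes similar group sizes)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd

# Summary statistics quoted from the original study (percent of raters correct).
d = cohens_d_from_summary(84.3, 10.9, 30.7, 17.4)
print(round(d, 2))  # about 3.69, consistent with the reported d of roughly 3.7
```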

To maintain fidelity to the original study's design and enable exploratory analyses by category and memorability, we will collect ratings for 34 Delayed Recall drawings and 34 Category Drawings (68 total) from 17 selected scene categories. Specifically, both image exemplars per category (one high-memorable and one low-memorable) will be included, preserving the original memorability manipulation and allowing for some within-category comparisons. With this sample size, the study achieves greater than 99% power to detect an effect of d = 2.0, and remains well-powered even if the true effect size is as small as d = 1.0. Additionally, while the original study employed 24 independent raters per drawing, we will use 10-12 per drawing to balance cost efficiency with measurement reliability. Given the high accuracy and inter-rater reliability in the original study (inferred from proportion correct), 10-12 raters should provide stable estimates of drawing recognizability.


### Planned Sample

Approximately 14 participants will be recruited via [Prolific](https://www.prolific.com) to complete the drawing-to-picture matching task. Eligibility criteria are fluency in English, age 18 years or older, normal or corrected-to-normal vision, and a Prolific approval rating of 95% or higher. Ideally, participants would have no prior exposure to the Bainbridge et al. (2019) stimulus set; however, Prolific does not provide automatic screening for this criterion across different researchers. Each participant will complete 60 trials, matching the original study's average of 58.2 trials per participant. With 68 drawings requiring 12 ratings each, this yields 816 total trials distributed across 14 participants (14 participants $\times$ 60 trials = 840 possible ratings, which leaves a small margin for variation in completion rates and attention checks).
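
The rating budget described above can be checked with a few lines of arithmetic. This Python sketch only restates the numbers in this section; the constant names are illustrative, and it does not model how attention checks or exclusions reduce the usable ratings.

```python
N_DRAWINGS = 68              # 34 Delayed Recall + 34 Category Drawings
TARGET_RATINGS = 12          # upper end of the 10-12 ratings per drawing
MIN_RATINGS = 10             # stopping rule: at least 10 ratings per drawing
N_PARTICIPANTS = 14
TRIALS_PER_PARTICIPANT = 60

required_at_target = N_DRAWINGS * TARGET_RATINGS      # ratings needed for 12 per drawing
required_at_minimum = N_DRAWINGS * MIN_RATINGS        # ratings needed for 10 per drawing
capacity = N_PARTICIPANTS * TRIALS_PER_PARTICIPANT    # ratings the planned sample can supply
margin = capacity - required_at_target                # spare trials at the 12-rating target

print(required_at_target, required_at_minimum, capacity, margin)  # 816 680 840 24
```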

Data collection will continue until all 68 drawings have received at least 10 complete ratings from unique participants who pass quality control checks (detailed below). If initial recruitment yields fewer than 10 usable ratings per drawing, additional participants will be recruited in batches of 2-3 until the target is reached. Participants will be paid \$3 for approximately 12 minutes of work (\$15 per hour).


### Materials

The stimuli consist of real-world scene photographs from the SUN database (Xiao et al., 2010). From the original 30 scene categories, we selected 17 categories appropriate for potential developmental extensions: amusement park, badlands, bathroom, bedroom, dining room, farm, fountain, garden, house, kitchen, lighthouse, living room, mountain, playground, pool, street, tower. For each category, there are two image exemplars (one high-memorable and one low-memorable based on prior memorability scores from Isola et al., 2011) and one medium-memorable foil image. This gives us 34 target images (2 per category times 17 categories) and 17 foil images (1 per category). All images are 512 pixels on the longest dimension and approximately 14 degrees of visual angle when viewed on a standard monitor at typical viewing distance. 

The drawings are from the publicly available dataset associated with Bainbridge et al. (2019), accessed via [Harvard Dataverse](https://dataverse.harvard.edu/dataverse/drawingrecall). Two types of drawings will be used as stimuli for this replication experiment. First, Delayed Recall drawings (n = 34) were created by participants who studied 30 scene images for 10 seconds each, completed an 11-minute digit span distractor task, and then drew from memory as many images as they could recall. For each of the 34 target images in the selected categories, one Delayed Recall drawing will be selected from the available drawings for that image. When multiple drawings are available for an image, one will be randomly selected using R's `sample()` function to reduce researcher bias in drawing selection. Second, Category Drawings (n = 34) were created by participants who were given only the scene category name (e.g., "kitchen") and asked to draw a typical example of that category without viewing any specific image. Two Category Drawings per selected category will be randomly sampled from the 15 available Category Drawings per category in the original dataset.

All drawings are pen-and-paper sketches scanned as JPG files. Filenames in the original dataset follow the format `[condition]_[subnum]_[imnum]_[memorability]_[scene].jpg`, which identifies the target image that corresponds to each drawing.
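
As an illustration of how this naming scheme can be used, the Python sketch below splits a filename into its five fields. The example filename and the field values in it are hypothetical (the actual dataset may encode conditions and scenes differently), and the helper name `parse_drawing_filename` is ours. Splitting at most four times keeps any underscores inside the scene name intact.

```python
from typing import NamedTuple

class DrawingInfo(NamedTuple):
    condition: str
    subnum: str
    imnum: str
    memorability: str
    scene: str

def parse_drawing_filename(filename: str) -> DrawingInfo:
    """Split '[condition]_[subnum]_[imnum]_[memorability]_[scene].jpg' into fields."""
    stem = filename.rsplit(".", 1)[0]   # drop the file extension
    parts = stem.split("_", 4)          # at most 5 fields; scene may contain '_'
    if len(parts) != 5:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return DrawingInfo(*parts)

# Hypothetical filename, for illustration only.
info = parse_drawing_filename("delayed_07_12_high_living_room.jpg")
print(info.memorability, info.scene)  # high living_room
```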

### Procedure

The experimental task will be programmed using the jsPsych library (de Leeuw, 2015) and hosted via GitHub Pages. Data will be collected using the DataPipe plugin and stored directly to an Open Science Framework (OSF) repository linked to this replication project. Participants will access the experiment through Prolific's participant recruitment platform. 

The task design will closely follow the original study's "Drawing matching online experiment" (Bainbridge et al., 2019), adapted from single-trial Amazon Mechanical Turk HITs to a multi-trial jsPsych session. Participants will first read instructions explaining that they will see drawings of everyday scenes and their task is to match each drawing to one of three photographs. On each trial, they will see one drawing and three photographs from the same scene category (e.g., three different kitchens), and should select which photograph they think the drawing best represents. Participants will be told that even if the drawing is rough or incomplete, they should try their best to make a match. The instructions note that the task will take approximately 12 minutes and request that the participants complete it in a quiet environment without distractions on a desktop or laptop computer (mobile devices will be excluded). Participants will remain naive to the experimental manipulation.  

Each trial follows the structure of the original experiment. The drawing is displayed centered at the top of the screen, with three scene photographs presented in a horizontal row below. Following the original implementation, the spatial positions (left, center, right) of the three photographs are randomized on each trial using a shuffle algorithm. Participants select their response by clicking on one of the three photographs. After making a selection, participants click a "Continue" button to proceed to the next trial. 

There is no time limit for responses, allowing participants to carefully consider their choices as in the original study. However, to ensure active engagement and data quality, the entire experimental session has a 30-minute time limit. Participants who do not complete all 60 trials within this window will be excluded from analysis. This session-level timeout is consistent with standard Amazon Mechanical Turk practices (though not explicitly reported in the original study) and is generous given the expected completion time of approximately 12 minutes.

Each participant completes 60 trials in random order. Trials are sampled such that each drawing appears at most once per participant, and drawings from both conditions are intermixed. Participants remain blind to the drawing condition. For each Delayed Recall drawing, the target image is predetermined. For each Category Drawing, there is no predetermined target, as these drawings were created from category names rather than specific images. The two foil images for each trial are the other two images from the same category. For instance, if the target is the high-memorable kitchen image, the foils are the low-memorable kitchen image and the kitchen foil. The assignment of which image serves as target for Category Drawings and the presentation order of all 60 trials are randomized uniquely for each participant. 
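
The trial-construction logic can be sketched as follows. This is a language-agnostic illustration written in Python, not the jsPsych implementation; the data layout, the filenames, and the function names `build_trial` and `build_session` are assumptions. For simplicity the sketch records no target for Category Drawings and draws each participant's drawings at random, so it does not by itself balance the number of ratings per drawing.

```python
import random

def build_trial(drawing, category_images, rng):
    """Pair one drawing with its category's three photographs in shuffled positions."""
    images = [category_images[key] for key in ("high", "low", "foil")]
    rng.shuffle(images)                      # randomize left/center/right positions
    left, center, right = images
    if drawing["condition"] == "delayed_recall":
        target = category_images[drawing["memorability"]]
    else:
        target = None                        # Category Drawings have no predetermined target
    return {"drawing": drawing["file"], "condition": drawing["condition"],
            "target": target, "left": left, "center": center, "right": right}

def build_session(drawings, images_by_category, n_trials, rng):
    """Sample drawings without replacement (random order) and build one trial each."""
    chosen = rng.sample(drawings, n_trials)
    return [build_trial(d, images_by_category[d["category"]], rng) for d in chosen]

# Hypothetical stimulus lists for three of the selected categories.
categories = ["kitchen", "bedroom", "playground"]
images_by_category = {
    c: {"high": f"{c}_high.jpg", "low": f"{c}_low.jpg", "foil": f"{c}_foil.jpg"}
    for c in categories
}
drawings = []
for c in categories:
    for mem in ("high", "low"):
        drawings.append({"file": f"delayed_{c}_{mem}.jpg", "condition": "delayed_recall",
                         "category": c, "memorability": mem})
    for i in (1, 2):
        drawings.append({"file": f"category_{c}_{i}.jpg", "condition": "category",
                         "category": c, "memorability": None})

session = build_session(drawings, images_by_category, n_trials=10, rng=random.Random(2019))
print(len(session))  # 10
```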

To ensure data quality, three attention check trials are inserted at random positions within the 60-trial sequence. On these trials, participants see a clear, unambiguous instruction (e.g., "Please select the leftmost image to show you are paying attention") or a simple drawing that makes the correct answer obvious. Participants who fail 2 or more attention checks will be flagged for exclusion from analysis. 

After completing all trials, participants will answer two brief questions. First, "Did you experience any technical difficulties during the task?" with Yes/No response options and an optional text explanation. Second, "Do you have any comments about the study?" with an optional text response field. Participants are then thanked, debriefed, provided with a completion code for Prolific verification, and redirected to Prolific for compensation. 

For each trial, the following data will be recorded and saved via DataPipe to the OSF repository: (1) participant identification number (anonymized Prolific ID), (2) trial number, (3) drawing filename, (4) drawing condition (Delayed Recall or Category), (5) target image filename (for Delayed Recall drawings) or NA (for Category Drawings), (6) the three image filenames in their randomized positions (left, center, right), (7) the participant's selected image, (8) response time in milliseconds, (9) whether the trial is an attention check, and (10) for Category Drawings, which image was selected (high-memorable, low-memorable, or foil). At the end of the session, participant-level data including responses to the technical difficulties question, open-ended comments, total session duration, and completion status will also be recorded. This data structure enables computation of trial-level accuracy (correct vs. incorrect selection), drawing-level accuracy (proportion correct across raters), and application of all exclusion criteria specified in the Analysis Plan. All data will be saved in JSON format and automatically uploaded to the linked OSF repository upon completion of each participant's session.

### Analysis Plan

**Data cleaning and exclusion criteria:**

At the participant level, we will exclude participants who meet any of the following criteria. First, participants who respond incorrectly to 2 or more of the 3 attention check trials will be excluded entirely. Second, participants who do not complete all 60 experimental trials within the 30-minute session limit will be excluded. Third, participants who self-report significant technical problems (e.g., images not loading properly) will be reviewed on a case-by-case basis and excluded if the issues likely impaired task performance. Fourth, participants will be excluded for non-compliance if their median response time across all experimental trials is below 1 second, indicating rapid clicking without consideration, or if their overall accuracy across all trials is below chance (33.3%).

At the trial level, individual trials with response times less than 500 milliseconds will be excluded as anticipatory responses that do not reflect genuine matching judgments. After applying this trial-level exclusion, if more than 20% of the participant's trials are excluded, the entire participant will be excluded from analysis as their data is insufficiently reliable. 
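
A minimal Python sketch of the automatable exclusion rules is shown below. The record layout (`attention_failed`, `completed_in_time`, `rt_ms`, `correct`) is assumed for illustration, and accuracy is computed only over trials that have a defined correct answer. The case-by-case review of technical problems is not modeled, and for brevity the 20% rule is checked alongside the participant-level criteria rather than as a separate later step.

```python
from statistics import median

CHANCE = 1 / 3

def participant_exclusion_reasons(p):
    """Return the exclusion criteria a participant meets (empty list if retained)."""
    rts = [t["rt_ms"] for t in p["trials"]]
    scored = [t["correct"] for t in p["trials"] if t["correct"] is not None]
    reasons = []
    if p["attention_failed"] >= 2:                       # failed 2+ of 3 attention checks
        reasons.append("attention_checks")
    if not p["completed_in_time"]:                       # did not finish within 30 minutes
        reasons.append("incomplete")
    if median(rts) < 1000:                               # median RT below 1 second
        reasons.append("median_rt")
    if scored and sum(scored) / len(scored) < CHANCE:    # accuracy below chance
        reasons.append("below_chance")
    if sum(rt < 500 for rt in rts) / len(rts) > 0.20:    # over 20% anticipatory trials
        reasons.append("too_many_fast_trials")
    return reasons

def valid_trials(p):
    """Keep only trials with response times of at least 500 ms."""
    return [t for t in p["trials"] if t["rt_ms"] >= 500]

# Hypothetical participants, for illustration only.
careful = {"attention_failed": 0, "completed_in_time": True,
           "trials": [{"rt_ms": 2500, "correct": True}] * 9
                     + [{"rt_ms": 400, "correct": False}]}
rushed = {"attention_failed": 2, "completed_in_time": True,
          "trials": [{"rt_ms": 300, "correct": False}] * 10}

print(participant_exclusion_reasons(careful), len(valid_trials(careful)))  # [] 9
print(participant_exclusion_reasons(rushed))
```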

**Data preparation and unit of analysis:**

The primary unit of analysis is the drawing, with accuracy computed as the proportion of raters who correctly matched each drawing to its target image. Data preparation will proceed in three stages. First, at the trial level, accuracy is a binary-coded variable where 1 indicates the participant correctly selected the target image and 0 indicates the selection of a foil image. Second, at the drawing level, we will compute the proportion of raters who correctly identified the target image using the formula `(number of correct responses) / (total number of valid ratings for this drawing)`. Each of the 68 drawings will yield one accuracy score ranging from 0 to 1, aggregated across 10-12 independent raters after applying all exclusion criteria. Third, each drawing is labeled as either "Delayed Recall" or "Category" based on the dataset. 

For Delayed Recall drawings, accuracy at the trial level is straightforward: 1 if the participant selected the target image, 0 otherwise. For Category Drawings, since there is no single correct target image, trial-level responses are coded as selection of the high-memorable exemplar, low-memorable exemplar, or foil. At the drawing level, accuracy for Category Drawings is computed following the original study's approach: for each Category Drawing, we calculate the proportion of raters who selected the high-memorable exemplar and separately the proportion who selected the low-memorable exemplar, then average these two proportions. This yields a hypothetical accuracy score that represents the average match rate if either exemplar were considered the target.
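
The two scoring rules can be written compactly. In this Python sketch each rater's choice is coded as `"high"`, `"low"`, or `"foil"`; the coding scheme, function names, and example responses are assumptions made for illustration. Under random responding each exemplar is chosen a third of the time, so the averaged Category Drawing score has the same chance level (1/3) as the Delayed Recall score.

```python
def delayed_recall_accuracy(selections, target):
    """Proportion of raters who chose the drawing's target image."""
    return sum(choice == target for choice in selections) / len(selections)

def category_drawing_accuracy(selections):
    """Average of the proportions choosing the high- and low-memorable exemplars."""
    n = len(selections)
    p_high = selections.count("high") / n
    p_low = selections.count("low") / n
    return (p_high + p_low) / 2

# Hypothetical ratings from 12 raters for one drawing of each type.
delayed = ["high"] * 10 + ["low", "foil"]             # target is the high-memorable image
category = ["high"] * 4 + ["low"] * 3 + ["foil"] * 5

print(round(delayed_recall_accuracy(delayed, "high"), 3))   # 0.833
print(round(category_drawing_accuracy(category), 3))        # 0.292
```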

**Confirmatory analysis:**

The key hypothesis is that Delayed Recall drawings will be matched to their target images significantly more accurately than Category Drawings, replicating the original finding. Following Bainbridge et al. (2019), this will be tested using a two-tailed Wilcoxon rank-sum test comparing the distribution of accuracy scores between the two drawing conditions. The null hypothesis states that the distributions of matching accuracy are identical between Delayed Recall and Category Drawing conditions. The alternative hypothesis states that the distributions differ between conditions. The significance level is set at $\alpha$ = 0.05 (two-tailed). Effect size will be quantified using rank-biserial correlation, where values of 0.1, 0.3, and 0.5 conventionally correspond to small, medium, and large effects.
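
To make the test statistic and effect size concrete, the Python sketch below computes the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum statistic), the rank-biserial correlation, and a two-tailed p-value from the normal approximation. It is a teaching sketch under stated simplifications: the p-value uses no tie or continuity correction, the accuracy scores are made up, and the reported analysis would use a standard statistics package rather than this code.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """U for x: count of (x_i, y_j) pairs with x_i > y_j, ties counted as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def rank_biserial(x, y):
    """Rank-biserial correlation: 2U / (n1 * n2) - 1, ranging from -1 to 1."""
    return 2 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1

def approx_two_tailed_p(x, y):
    """Normal-approximation p-value (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (mann_whitney_u(x, y) - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical drawing-level accuracy scores, for illustration only.
delayed_scores = [0.90, 0.80, 0.75, 1.00]
category_scores = [0.30, 0.40, 0.25, 0.50]

u = mann_whitney_u(delayed_scores, category_scores)
r = rank_biserial(delayed_scores, category_scores)
p = approx_two_tailed_p(delayed_scores, category_scores)
print(u, r)  # 16.0 1.0 (every Delayed Recall score exceeds every Category score)
```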

The rationale for using a non-parametric test is to adhere to the original study and ensure direct comparability with its statistical approach. The authors noted that accuracy proportions, bounded between 0 and 1, may not follow normal distributions, particularly when values cluster near the boundaries. The Wilcoxon test does not assume that the data are normally distributed and is therefore more appropriate for proportions.

As a sensitivity analysis to assess the robustness of the findings, we will also report results from a two-tailed independent-samples t-test with Cohen's d as the effect size metric. We will assess normality using Shapiro-Wilk tests on both groups' accuracy distributions plus visual inspection of quantile-quantile plots. We will also assess homogeneity of variance using Levene's test; if it is violated, we will use Welch's t-test instead of Student's t-test. If the Wilcoxon test and the t-test converge on the same conclusion regarding significance and direction, this strengthens confidence in the finding. However, if normality is strongly violated (p < 0.01 in the Shapiro-Wilk test for either group), the Wilcoxon test will be prioritized as the primary analysis.

**Criteria for successful replication:**

The replication will be considered successful if the following two conditions are met. First, the Wilcoxon rank-sum test shows that Delayed Recall drawings have significantly higher matching accuracy than Category Drawings (p < 0.05, two-tailed). Second, the effect size is at least medium in magnitude (rank-biserial $r \geq 0.3$), though we anticipate a much larger effect given the original study's Z-score of 9.29 and p-value of $1.58 \times 10^{-20}$. A close replication would show Delayed Recall accuracy substantially above chance (original: 84.3%), Category Drawing accuracy near chance (original: 30.7%), and a very large effect size.

**Secondary exploratory analyses:**

If time permits and the primary replication is successful, we shall conduct exploratory analyses. (To be specified later.) 


### Differences from Original Study

Several aspects of the current replication differ from Bainbridge et al. (2019), though none is expected to fundamentally alter the main effect given its robustness in the original study.

First, the original study used all 60 images from 30 scene categories, whereas the current replication uses 34 images from 17 categories selected for child-appropriateness to enable potential developmental extensions. This may reduce generalizability across scene types but should not affect the core theoretical contrast if the effect is robust across the included categories. The 17 selected categories span indoor (e.g., bedroom, kitchen), outdoor natural (e.g., badlands, farm), and outdoor man-made (e.g., tower, fountain) scenes, maintaining diversity in scene types.

Second, the original study used 24 independent raters per drawing, totaling 1,101 raters across all drawings and conditions. The current replication uses 10-12 raters per drawing, drawn from a pool of approximately 14 participants. This reduction is justified by the high inter-rater agreement evident in the original results, which suggests that 10-12 raters should provide stable estimates of drawing recognizability.

Third, the original study recruited participants from Amazon Mechanical Turk, whereas the current replication uses Prolific. Both platforms recruit general adult populations, primarily from the United States.

Fourth, the original study analyzed all Delayed Recall drawings produced by the 30 participants (a variable number per image based on individual recall rates, approximately 363 total Delayed Recall drawings). The current replication selects one Delayed Recall drawing per target image (34 total). This design choice ensures balanced sample sizes across images and simplifies statistical analysis by treating drawings as independent observations, though it reduces representation of within-image drawing variability. Similarly, the original study used all 15 Category Drawings per category (450 total Category Drawings) when analyzing the drawing matching task. The current replication randomly samples 2 Category Drawings per category (34 total) to match the Delayed Recall sample size.

Finally, to ensure the quality of data collected through an online platform without introducing researcher degrees of freedom during post-hoc exclusions, the current replication implements a session-level timeout as well as more explicitly defined exclusion criteria.

Despite these differences, the fundamental experimental design, task structure, and theoretical question remain unchanged. The original effect was extremely large, and the theoretical prediction (that memory drawings contain image-specific visual information beyond category-level gist representations) should hold robustly across the methodological variations introduced here. The modifications primarily serve to adapt the study from AMT's single-trial HIT framework to a modern multi-trial jsPsych implementation while improving experimental control, data quality, and reproducibility practices. If the effect fails to replicate, the differences documented here provide a framework for understanding potential boundary conditions, though such a failure would be surprising given the magnitude of the original effect.

### Reliability and Validity

The primary construct of interest is drawing recognizability, operationalized as the proportion of independent raters who correctly match each drawing to its corresponding image in a 3AFC task. The measure ranges from 0 to 1, with chance performance at 1/3. The original study used 24 independent raters per drawing to assess recognizability, which provides a form of inter-rater reliability through aggregation across multiple judges. The relatively small standard deviations reported for both conditions suggest agreement among raters. No formal inter-rater reliability statistic was reported, so reliability is inferred from this aggregation rather than measured directly. Construct validity is also supported by multiple control conditions, such as Image Drawings and Immediate Recall in addition to Category Drawings. The task also has strong face validity, as matching drawings to photographs is a direct and intuitive assessment of visual information content. Even with a slightly different sample, the replication will assume the validity of the measure and ensure reliability by collecting multiple ratings per drawing.

### Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

#### Actual Sample
  Sample size, demographics, data exclusions based on rules spelled out in analysis plan

#### Differences from pre-data collection methods plan
  Any differences from what was described as the original plan, or “none”.


## Results


### Data preparation

*[to be completed after data collection]*

We will first identify and exclude participants who: (1) failed 2 or more of the 3 attention check trials, (2) did not complete all 60 trials within the 30-minute session limit, (3) reported significant technical problems that impaired performance, or (4) showed evidence of non-compliance (median response time below 1 second or overall accuracy below 33.3%). After applying these exclusions, we will report the final sample size, number of participants excluded by each criterion, and basic demographic characteristics (age, gender) of the retained sample.

For retained participants, we will exclude individual trials with response times below 500 milliseconds as anticipatory responses. We will then check whether any participant had more than 20% of their trials excluded by this criterion; if so, that entire participant will be removed from the dataset as insufficiently reliable.

For each of the 68 drawings, we will compute accuracy as the proportion of valid ratings that correctly identified the target image. Specifically, for each drawing, we will count the number of participants who selected the target image and divide by the total number of valid ratings for that drawing (expected to be 10-12 per drawing after exclusions). This yields one accuracy score per drawing, ranging from 0 (no raters identified the target) to 1 (all raters identified the target). Each drawing will be labeled as "Delayed Recall" (n = 34) or "Category" (n = 34) based on its filename and condition assignment.

The final dataset will contain 68 rows (one per drawing) with the following variables: `drawing_id`, `condition` (Delayed Recall or Category), `category`, `memorability` (high or low), `target_image`, `number_of_raters`, `number_correct`, and `accuracy` (proportion correct).

Descriptive statistics will be presented in a table showing the number of drawings, mean accuracy, standard deviation, median, minimum, and maximum for each condition. A histogram will display the distribution of accuracy scores for both conditions with a reference line at chance performance (33.3%). 

### Confirmatory analysis

The primary hypothesis will be tested using the two-tailed Wilcoxon rank-sum test as specified in the Analysis Plan.

**Assumption checks:** Prior to the confirmatory test, we will assess the appropriateness of the supplementary parametric analysis by checking normality and homogeneity of variance. Shapiro-Wilk tests will be conducted on both conditions' accuracy distributions to assess normality. Quantile-quantile plots will be generated for visual inspection of normality. Levene's test will assess homogeneity of variance between conditions. Results of these checks will determine whether the supplementary t-test uses standard Student's t-test (if assumptions met) or Welch's correction (if variances unequal). If either group shows strong violation of normality (Shapiro-Wilk p < 0.01), we will prioritize the Wilcoxon test as the primary analysis and report it most prominently.

**Expected reporting:** 

Delayed Recall drawings (Median = X.XX) were [significantly more / not more] recognizable than Category Drawings (Median = X.XX), W = X, Z = X, [p = X], rank-biserial r = X, 95% CI [X, X].

Delayed Recall drawings (M = X, SD = X) were [significantly more / not significantly more] recognizable than Category Drawings (M = X, SD = X), t(X) = X, [p = X], Cohen's d = X, 95% CI [X, X].

*Side-by-side graph with original graph is ideal here*

### Exploratory analyses

Any follow-up analyses desired (not required).  

## Discussion

### Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.  

### Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt.  None of these need to be long.

---

## References

Bainbridge, W. A., Hall, E. H., & Baker, C. I. (2019). Drawings of real-world scenes during free recall reveal detailed object and spatial information in memory. *Nature Communications*, *10*, 5. https://doi.org/10.1038/s41467-018-07830-6

de Leeuw, J. R. (2015). jsPsych: A JavaScript library for creating behavioral experiments in a web browser. *Behavior Research Methods*, *47*(1), 1-12. https://doi.org/10.3758/s13428-014-0458-y

Isola, P., Xiao, J., Torralba, A., & Oliva, A. (2011). What makes an image memorable? In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 145-152). IEEE. https://doi.org/10.1109/CVPR.2011.5995721

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 3485-3492). IEEE. https://doi.org/10.1109/CVPR.2010.5539970