## Sanity Check: Sample Sizes and Data Structure.   

written by **Jinwoo Lee**.    
jil527@ucsd.edu | jinwoo-lee.com

Oct 29th, 2025.    
for PSYC201A's **Week 5 lab practicum: Experiment Design Workshop II**

___
### Part I. Loading Data.    

First, I will load the preprocessed datasets shared by the authors from the `data-from-authors` folder. This folder includes the following data:     

- **Metadata** (e.g., participants' ages)

- **Psychometric data** (e.g., STAI-G_X2 scores)    

- **Imaging-derived phenotypes** ('IDPs'; e.g., vectors of streamline counts for each voxel in canvas space)

These files were obtained directly from the research team upon request and therefore **will not be shared publicly** in repositories such as GitHub or OSF. This Jupyter Notebook assumes that the folder is located in the top-level directory for the analyses.
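Because the folder location is an assumption, a quick existence check can fail early with a clear message before any loading is attempted. The following is a minimal sketch: `find_missing_files` and `EXPECTED_FILES` are hypothetical names that are not part of the original notebook, and the file names are the ones read in the loading cell.

```python
from pathlib import Path

# File names taken from the loading cell; the helper itself is illustrative.
EXPECTED_FILES = [
    "Meta.csv",
    "STAI_G_X2.csv",
    "tractfiles/yh_probmap_L.csv",
]

def find_missing_files(data_path, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under data_path."""
    root = Path(data_path)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `find_missing_files('../data-from-authors')` returns an empty list when all three files are in place; any returned names point to files that still need to be copied into the folder.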

In [1]:
# loading modules
import pandas as pd
import os

In [2]:
data_path = '../data-from-authors'
df_meta = pd.read_csv(os.path.join(data_path, 'Meta.csv'))                      # metadata including participants' demographics 
df_behav = pd.read_csv(os.path.join(data_path, 'STAI_G_X2.csv'))                # all participants' self-reported trait anxiety
df_tract = pd.read_csv(os.path.join(data_path, 'tractfiles/yh_probmap_L.csv'))  # young participants' left vPFC-amygdala streamline vectors

---
### Part II. Checking the Structure of Behavioral Data.     
According to the original paper (Kim & Kim, 2022), a total of 119 participants were included in the younger adult group. Therefore, we should obtain a dataframe of their STAI-G_X2 scores with shape **(119, 2)** (i.e., subject IDs and STAI scores).   

Since `df_behav` contains data from all 221 participants across the younger and older adult groups, we need to subsample the younger group. The original paper subsamples based on three criteria:

-  **Age**: Participants aged between 20 and 35 were included.   

-  **Missing Values**: Participants with any missing values in their STAI-G_X2 scores or IDPs were excluded.  

-  **Psychiatric Diagnosis**: Participants with a current or past psychiatric diagnosis were excluded.     

By applying these criteria, we should be able to obtain a dataframe of STAI-G_X2 scores for 119 younger adult participants.

In [4]:
### Subsampling: Younger Adults
# Step (1) - based on age (N: 221 -> 153)
young_age_groups = ['20-25', '25-30', '30-35']
young_mask = df_meta['Age'].astype(str).str.strip().isin(young_age_groups)
df_meta_young_1 = df_meta[young_mask].copy()

# Step (2) - based on missing data (N: 153 -> 150)
missing_subjects = [
    'sub-032339', 'sub-032341', 'sub-032459',
    'sub-032370', 'sub-032466', 'sub-032438', 'sub-032509'
]
df_meta_young_2 = df_meta_young_1[
    ~df_meta_young_1['Unnamed: 0'].astype(str).str.strip().isin(missing_subjects)
]

# Step (3) - based on SKID diagnosis (N: 150 -> 119)
# excluding all samples with current or past psychiatric diagnosis
df_meta_young = df_meta_young_2[df_meta_young_2['SKID_Diagnoses'].astype(str).str.strip() == 'none'].copy()

# Final Check
young_ids = df_meta_young['Unnamed: 0'].to_numpy()
print(f"After subsampling, total 119 participants were left in the young adult group: {len(young_ids) == 119}")

df_behav_young = df_behav[df_behav['Unnamed: 0'].isin(young_ids)].copy()
print(f"Now, we have (119, 2) trait anxiety data as expected: {df_behav_young.shape == (119, 2)}")

After subsampling, total 119 participants were left in the young adult group: True
Now, we have (119, 2) trait anxiety data as expected: True
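The cell above confirms only the final N; the intermediate counts (221 -> 153 -> 150 -> 119) appear in comments and are not checked. One way to make each step verifiable is to record the N after every filter. The sketch below shows this on invented toy data: the column names mirror those used in this notebook, while `subsample_young`, `toy_meta`, `toy_missing`, and the diagnosis code in the toy table are hypothetical.

```python
import pandas as pd

# Toy metadata; column names mirror this notebook, values are invented.
toy_meta = pd.DataFrame({
    "Unnamed: 0": ["sub-01", "sub-02", "sub-03", "sub-04", "sub-05"],
    "Age": ["20-25", "25-30", "60-65", "30-35", "20-25"],
    "SKID_Diagnoses": ["none", "none", "none", "T74.1", "none"],
})
toy_missing = ["sub-02"]  # hypothetical IDs with missing STAI or IDP data

def subsample_young(meta, missing_ids, age_bins=("20-25", "25-30", "30-35")):
    """Apply the age, missing-data, and diagnosis filters; return the result and per-step Ns."""
    counts = {"start": len(meta)}
    # Step (1): keep participants in the younger age bins
    step = meta[meta["Age"].astype(str).str.strip().isin(age_bins)]
    counts["age"] = len(step)
    # Step (2): drop participants flagged as having missing data
    step = step[~step["Unnamed: 0"].astype(str).str.strip().isin(missing_ids)]
    counts["missing"] = len(step)
    # Step (3): keep participants without any recorded diagnosis
    step = step[step["SKID_Diagnoses"].astype(str).str.strip() == "none"]
    counts["diagnosis"] = len(step)
    return step.copy(), counts

toy_young, toy_counts = subsample_young(toy_meta, toy_missing)
print(toy_counts)  # {'start': 5, 'age': 4, 'missing': 3, 'diagnosis': 2}
```

Applied to `df_meta` and `missing_subjects`, the returned counts could be compared against the Ns stated in the comments, so that a mismatch at any single step becomes visible rather than only a mismatch in the final total.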


---
### Part III. Checking the Structure of Tractography Data.  
According to the authors, the IDPs of the white matter tractography between the left amygdala and vPFC are saved in `yh_probmap_L.csv`, which has shape **(*N* of voxels in the canvas space, *N* of participants)**. Thus, we need to transpose the dataframe first and then check whether the resulting shape is **(119, *N* of voxels)**.

In [5]:
# transpose
df_tract_t = df_tract.T

# drop the first row, which holds the voxel index (the first column before transposing)
df_tract_t = df_tract_t.iloc[1:].reset_index(drop = True)

print(f"Now, we have 119 participants' left amygdala-vPFC IDPs: {df_tract_t.shape[0] == 119}")

Now, we have 119 participants' left amygdala-vPFC IDPs: True
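The row-count check shows that the number of participants matches, but not that they are the same participants as in the behavioral subsample. If the column headers of the CSV are subject IDs (an assumption that this notebook does not confirm), the IDs can be kept as the row index during the transpose and compared directly. A minimal sketch on invented toy data, where `toy_tract`, `toy_expected_ids`, and the `voxel` column name are hypothetical:

```python
import pandas as pd

# Toy voxel-by-participant table; the first column holds the voxel index.
# Subject-ID column names are an assumption about the file layout.
toy_tract = pd.DataFrame({
    "voxel": [0, 1, 2, 3],
    "sub-01": [0, 5, 2, 0],
    "sub-02": [1, 0, 0, 3],
    "sub-03": [0, 0, 4, 1],
})
toy_expected_ids = ["sub-01", "sub-02", "sub-03"]

# Move the voxel index out of the data before transposing,
# so that every remaining row is one participant labeled by its ID.
toy_tract_t = toy_tract.set_index("voxel").T

print(toy_tract_t.shape)                                # (3, 4)
print(set(toy_tract_t.index) == set(toy_expected_ids))  # True
```

Keeping the IDs as the index, instead of resetting it, also makes it possible to reorder the rows to match the behavioral dataframe before any brain-behavior analysis.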
