# Gather data

Moves all the appropriate onsets to a specified directory.

<a id="toc"></a>
1.) [Imports](#imports)  
2.) [Main](#main)  
3.) [Filter by log/image availability](#filter)  
4.) [Save as CSV](#save)  

<a id="imports"></a>
### [^](#toc) Imports

In [1]:
import os
import glob
import shutil
import pandas as pd
from tqdm import tqdm

<a id="main"></a>
### [^](#toc) Main

In [2]:
projects = [
    ("/Volumes/group/PANLab_Datasets/RAD/data/", "RAD"),
    ("/Volumes/group/PANLab_Datasets/CONNECTOME/conn_mdd/data/", "CONN_MDD"),
    ("/Volumes/group/PANLab_Datasets/CONNECTOME/conn_hc/data/", "CONN_HC"),
    ("/Volumes/group/PANLab_Datasets/ENGAGE/data/", "ENGAGE"),
    ("/Volumes/group/PANLab_Datasets/iSPOTD_Controls_data", "ISPOTD"),
]

tasks = [
    ("100_fMRI/101_fMRI_preproc_GO_NO_GO/02_globalremoved_func_data.nii.tar.gz", "gonogo"),
    ("100_fMRI/103_fMRI_preproc_FACES-CONSCIOUS/02_globalremoved_func_data.nii.tar.gz", "conscious"),
    ("100_fMRI/105_fMRI_preproc_FACES-NONCONSCIOUS/02_globalremoved_func_data.nii.tar.gz", "nonconscious"),
]

rows = []
for project_path, project_name in projects:
    # Each entry directly under the project directory is treated as a subject.
    for subject_path in tqdm(glob.glob(os.path.join(project_path, "*"))):
        subject = os.path.basename(subject_path)

        # Each entry under a subject is a time point / session.
        for time_session_path in glob.glob(os.path.join(subject_path, "*")):
            time_session = os.path.basename(time_session_path)

            for task_path, task_name in tasks:
                filepath = os.path.join(time_session_path, task_path)
                # Only record tasks whose preprocessed image archive exists.
                if os.path.isfile(filepath):
                    # Onsets logs sit in the same directory as the image archive.
                    task_dir = os.path.dirname(filepath)
                    onsets = glob.glob(os.path.join(task_dir, "*_Onsets.csv"))
                    row = {
                        "project": project_name,
                        "subject": subject,
                        "time_session": time_session,
                        "task": task_name,
                        "filepath": filepath,
                        "onsets": onsets,
                    }
                    rows.append(row)

100%|██████████| 542/542 [02:36<00:00,  3.46it/s]
100%|██████████| 167/167 [01:02<00:00,  2.69it/s]
100%|██████████| 153/153 [00:39<00:00,  3.91it/s]
100%|██████████| 242/242 [08:17<00:00,  2.05s/it]
100%|██████████| 124/124 [03:29<00:00,  1.69s/it]
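
The nested loops above can also be wrapped in a function, which makes the walk easy to try on a small directory tree before pointing it at the network volume. The following is a minimal sketch rather than the notebook's own code: `collect_rows` is a hypothetical helper, and it assumes the same `<project>/<subject>/<time_session>/<task path>` layout as the cell above. Unlike `glob.glob`, `Path.iterdir` also returns hidden entries, so only directories are descended into.

```python
from pathlib import Path


def collect_rows(projects, tasks):
    """Return one dict per (project, subject, session, task) whose image archive exists.

    Hypothetical helper; `projects` and `tasks` are lists of (path, name)
    tuples shaped like the ones defined in the notebook.
    """
    rows = []
    for project_path, project_name in projects:
        subject_dirs = sorted(p for p in Path(project_path).iterdir() if p.is_dir())
        for subject_dir in subject_dirs:
            session_dirs = sorted(p for p in subject_dir.iterdir() if p.is_dir())
            for session_dir in session_dirs:
                for task_path, task_name in tasks:
                    image = session_dir / task_path
                    if not image.is_file():
                        continue
                    # Onsets logs are expected next to the image archive.
                    onsets = sorted(str(p) for p in image.parent.glob("*_Onsets.csv"))
                    rows.append({
                        "project": project_name,
                        "subject": subject_dir.name,
                        "time_session": session_dir.name,
                        "task": task_name,
                        "filepath": str(image),
                        "onsets": onsets,
                    })
    return rows
```

Calling `collect_rows(projects, tasks)` should yield dictionaries with the same keys as `rows`, though without the `tqdm` progress bars.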


<a id="dataframe"></a>
### [^](#toc) Create DataFrame

In [3]:
df = pd.DataFrame(rows)
df.head()

| | filepath | onsets | project | subject | task | time_session |
|---|---|---|---|---|---|---|
| 0 | /Volumes/group/PANLab_Datasets/RAD/data/RAD_36... | [/Volumes/group/PANLab_Datasets/RAD/data/RAD_3... | RAD | RAD_36 | gonogo | 000_data_archive |
| 1 | /Volumes/group/PANLab_Datasets/RAD/data/RAD_36... | [/Volumes/group/PANLab_Datasets/RAD/data/RAD_3... | RAD | RAD_36 | conscious | 000_data_archive |
| 2 | /Volumes/group/PANLab_Datasets/RAD/data/RAD_36... | [/Volumes/group/PANLab_Datasets/RAD/data/RAD_3... | RAD | RAD_36 | nonconscious | 000_data_archive |
| 3 | /Volumes/group/PANLab_Datasets/RAD/data/RAD_41... | [/Volumes/group/PANLab_Datasets/RAD/data/RAD_4... | RAD | RAD_41 | gonogo | 000_data_archive |
| 4 | /Volumes/group/PANLab_Datasets/RAD/data/RAD_41... | [/Volumes/group/PANLab_Datasets/RAD/data/RAD_4... | RAD | RAD_41 | conscious | 000_data_archive |
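
Because each `onsets` cell holds a list of paths, it can be useful to count those lists per project and task before deciding what to drop. The sketch below is illustrative only: the rows are made up rather than taken from the dataset, and `example`, `n_onsets`, `images` and `with_onsets` are hypothetical names.

```python
import pandas as pd

# Made-up rows with the same kind of columns as `df`.
example = pd.DataFrame([
    {"project": "RAD", "task": "gonogo", "onsets": ["a_Onsets.csv"]},
    {"project": "RAD", "task": "conscious", "onsets": []},
    {"project": "ENGAGE", "task": "gonogo", "onsets": ["b_Onsets.csv", "c_Onsets.csv"]},
])

# Number of onsets files found next to each image.
example["n_onsets"] = example["onsets"].map(len)

# Images per project/task, and how many of them have at least one onsets file.
summary = example.groupby(["project", "task"])["n_onsets"].agg(
    images="size",
    with_onsets=lambda n: int((n > 0).sum()),
)
print(summary)
```

Applied to the real frame, a table like this would show which project and task combinations are missing onsets logs.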


<a id="filter"></a>
### [^](#toc) Filter by log/image availability

In [13]:
print("Shape (pre filter): ", df.shape)
filtered_df = df[df.onsets.map(len) > 0]  # keep rows with at least one onsets file
print("Shape (post filter):", filtered_df.shape)

Shape (pre filter):  (2764, 6)
Shape (post filter): (2355, 6)
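
The description at the top of the notebook mentions moving the onsets files to a specified directory, but no cell here performs that step. One way it could look is sketched below. `copy_onsets`, the destination folder, and the `project_subject_session_task_` filename prefix are all assumptions made for illustration, and the files are copied rather than moved so the source datasets stay untouched.

```python
import os
import shutil


def copy_onsets(rows, dest_dir):
    """Copy every onsets file listed in `rows` into `dest_dir`.

    `rows` is an iterable of dicts with the keys used in the notebook, for
    example `filtered_df.to_dict("records")`. The prefix added to each
    filename is a hypothetical naming scheme that keeps files from different
    subjects and sessions from overwriting each other.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for row in rows:
        prefix = "_".join(
            [row["project"], row["subject"], row["time_session"], row["task"]]
        )
        for src in row["onsets"]:
            dst = os.path.join(dest_dir, f"{prefix}_{os.path.basename(src)}")
            shutil.copy2(src, dst)  # copy2 also preserves file timestamps
            copied.append(dst)
    return copied
```

A call such as `copy_onsets(filtered_df.to_dict("records"), "data/onsets")` would then gather the files in one place; the `data/onsets` path is only a placeholder.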


<a id="save"></a>
### [^](#toc) Save as CSV

In [14]:
filtered_df.to_csv("data/full.csv")
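
Note that `to_csv` writes each `onsets` list as its string representation, and the unnamed index comes back as an extra `Unnamed: 0` column on a plain `read_csv`. The following is a minimal sketch of reading such a file back, assuming it was written exactly as above. The temporary path and the one-row `frame` are placeholders standing in for `data/full.csv` and `filtered_df`.

```python
import ast
import os
import tempfile

import pandas as pd

# Placeholder frame with a list-valued column, standing in for `filtered_df`.
frame = pd.DataFrame(
    {"task": ["gonogo"], "onsets": [["a_Onsets.csv", "b_Onsets.csv"]]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "full.csv")
    frame.to_csv(path)

    # index_col=0 restores the saved index instead of creating "Unnamed: 0";
    # literal_eval turns the stringified list back into a Python list.
    loaded = pd.read_csv(
        path, index_col=0, converters={"onsets": ast.literal_eval}
    )

print(loaded.loc[0, "onsets"])
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this kind of column.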