In [1]:
import pandas as pd
from src.pipeline.load import filter_subjects_with_two_timepoints
from src.pipeline.clean import (
    handle_missing_values,
    merge_family_ids,
    enforce_common_subjects,
    drop_siblings,
    link_with_g_scores,
    save_processed_data,
)
from src.analysis.eda import (
    list_columns_with_missing_values,
    count_columns_with_missing_values,
    percentage_missing_values,
    divide_features,
    categorical_summary_stats
)

In [2]:
# Load data
features = pd.read_csv('../data/raw/Demographics.csv')
# Load g_score
g_factor = pd.read_csv('../data/raw/ABCD_new_G_all.csv')
# Drop "Task" since it's constant (drop returns a new DataFrame by default)
features = features.drop(columns=["Task"])

# Features description
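The raw dataset consists of 13274 rows and 20 columns, counting the constant Task column that is dropped after loading. Each row describes one subject at one timepoint (eventname), so a subject seen at both baseline and follow-up contributes two rows, and the number of unique subjects is smaller than the number of rows. The features are: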
* src_subject_id - subject identifier
* eventname - timepoint of data
* site_id_l - data collection site. There are likely site effects in the data, so we may want to include site in models as a grouping factor; we sometimes model subject nested in site as random effects.
* rel_family_id - family identifier. The data include siblings, twins, etc. This is not always an issue, but for cross-validation or any similar split we make sure that members of the same family do not end up in different folds.
* interview_age - age of the subject in months
* Subject/Session - recoded versions of the subject ID and eventname that match the naming of the imaging data better
* Task - should be rest for every row
* GoodRun_5 - number of good runs in the included data
* censor_5 - number of censored timepoints in the included data
* TRs - number of included timepoints in the data
* confounds_nocensor - number of confounds in the run-level nuisance correction model (other than censored timepoints). You probably don't need to worry about most of these quality metrics.
* meanFD - the mean motion of the subject during their included scans. We usually use linear and quadratic terms of this as a confound at the group level
* race.4level - subject-reported race
* hisp - subject-reported Hispanic ethnicity (yes/no)
* demo_sex_v2 - sex at birth, 1=male, 2=female (I think any other responses were dropped; if some remain we might need to exclude them, because there are too few to get good estimates of effects)
* EdYearsHighest - parental years of education (highest among parents; this needs a double check, since we might use the average elsewhere). This variable is likely not immediately relevant for now.
* IncCombinedMidpoint - combined income of parents (midpoint of a bin, because income is only reported in 10 bins)
* Income2Needs - calculated income-to-needs metric based on parental income and number of people in the household
* Married - parents currently married or not
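
The note on rel_family_id says that family members must not land in different cross-validation folds. A minimal sketch of one way to do that is shown here, assuming folds are assigned per family rather than per subject. The helper `assign_family_folds`, the toy data, and the round-robin rule are illustrative assumptions, not part of `src.pipeline`.

```python
import pandas as pd

# Toy data: column names follow the feature list above, values are made up.
toy_cv = pd.DataFrame({
    "src_subject_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "rel_family_id": [10, 10, 11, 12, 12, 13],
})

def assign_family_folds(df, n_folds=2, family_col="rel_family_id"):
    """Hypothetical helper: give every family one fold, round-robin."""
    families = df[family_col].drop_duplicates().reset_index(drop=True)
    fold_of_family = {fam: i % n_folds for i, fam in enumerate(families)}
    # Every subject inherits the fold of its family.
    return df[family_col].map(fold_of_family)

toy_cv["fold"] = assign_family_folds(toy_cv, n_folds=2)
print(toy_cv)
```

Because the fold is looked up by family, siblings always share a fold. scikit-learn's `GroupKFold`, with the family ID passed as `groups`, gives the same guarantee and is probably the more convenient choice inside a modelling pipeline.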

In [3]:
# Keep only participants with both baseline and follow-up
df_filtered = filter_subjects_with_two_timepoints(features)

# Split by event (copy so later column assignments do not act on a view of df_filtered)
df_baseline = df_filtered[df_filtered["eventname"] == "baseline_year_1_arm_1"].copy()
df_followup = df_filtered[df_filtered["eventname"] == "2_year_follow_up_y_arm_1"].copy()
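
The body of `filter_subjects_with_two_timepoints` is not shown in this notebook. As a rough sketch of what such a filter could look like, assuming it keeps subjects that have both a baseline and a 2-year follow-up row, the hypothetical helper `keep_subjects_with_both_events` below counts distinct events per subject on made-up data.

```python
import pandas as pd

BASELINE = "baseline_year_1_arm_1"
FOLLOWUP = "2_year_follow_up_y_arm_1"

def keep_subjects_with_both_events(df, events=(BASELINE, FOLLOWUP)):
    """Hypothetical stand-in for filter_subjects_with_two_timepoints."""
    in_scope = df[df["eventname"].isin(events)]
    # Number of distinct events observed for each subject.
    n_events = in_scope.groupby("src_subject_id")["eventname"].nunique()
    complete_ids = n_events[n_events == len(events)].index
    return in_scope[in_scope["src_subject_id"].isin(complete_ids)]

# Toy data: s2 has no follow-up row and should be removed.
toy_events = pd.DataFrame({
    "src_subject_id": ["s1", "s1", "s2", "s3", "s3"],
    "eventname": [BASELINE, FOLLOWUP, BASELINE, BASELINE, FOLLOWUP],
})
toy_filtered = keep_subjects_with_both_events(toy_events)
print(toy_filtered)
```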

In [4]:
# Drop rows with missing values (except family IDs)
df_baseline, df_followup = handle_missing_values(df_baseline, df_followup)

# Fill in family IDs for follow-up from baseline
df_followup = merge_family_ids(df_baseline, df_followup)
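
The two steps in this cell can be sketched in plain pandas. The sketch assumes that rows are dropped when any column other than rel_family_id is missing, and that follow-up family IDs are looked up from baseline by subject ID. `drop_incomplete_rows` and `fill_family_ids_from_baseline` are hypothetical stand-ins; the real `handle_missing_values` and `merge_family_ids` may behave differently.

```python
import pandas as pd

def drop_incomplete_rows(df, ignore=("rel_family_id",)):
    """Drop rows with a missing value in any column not listed in `ignore`."""
    required = [c for c in df.columns if c not in ignore]
    return df.dropna(subset=required)

def fill_family_ids_from_baseline(baseline, followup):
    """Copy each subject's baseline family ID onto its follow-up row."""
    family_by_subject = (
        baseline.drop_duplicates(subset="src_subject_id")
        .set_index("src_subject_id")["rel_family_id"]
    )
    out = followup.copy()
    out["rel_family_id"] = out["src_subject_id"].map(family_by_subject)
    return out

# Toy data: follow-up has no family IDs, and s1 is missing meanFD.
toy_fam_baseline = pd.DataFrame({
    "src_subject_id": ["s1", "s2", "s3"],
    "rel_family_id": [10, 11, 12],
    "meanFD": [0.1, 0.2, 0.3],
})
toy_fam_followup = pd.DataFrame({
    "src_subject_id": ["s2", "s1", "s3"],
    "rel_family_id": [float("nan")] * 3,
    "meanFD": [0.2, float("nan"), 0.4],
})

toy_fam_followup = drop_incomplete_rows(toy_fam_followup)
toy_fam_followup = fill_family_ids_from_baseline(toy_fam_baseline, toy_fam_followup)
print(toy_fam_followup)
```

Ignoring rel_family_id during the drop matters here: the toy follow-up rows all lack a family ID, so including that column would discard every row before the IDs could be filled in.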

In [5]:
# Enforce common subjects
df_baseline, df_followup = enforce_common_subjects(df_baseline, df_followup)
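
Dropping incomplete rows separately in each frame can leave a subject in only one of them, which is presumably why the common-subject step follows. A minimal sketch, assuming the step is an intersection on src_subject_id (the helper `keep_common_subjects` is hypothetical):

```python
import pandas as pd

def keep_common_subjects(baseline, followup, id_col="src_subject_id"):
    """Keep only subjects that appear in both frames."""
    common = sorted(set(baseline[id_col]) & set(followup[id_col]))
    return (
        baseline[baseline[id_col].isin(common)],
        followup[followup[id_col].isin(common)],
    )

# Toy data: s1 is only in baseline, s4 is only in follow-up.
toy_common_baseline = pd.DataFrame({"src_subject_id": ["s1", "s2", "s3"]})
toy_common_followup = pd.DataFrame({"src_subject_id": ["s3", "s2", "s4"]})

toy_common_baseline, toy_common_followup = keep_common_subjects(
    toy_common_baseline, toy_common_followup
)
print(toy_common_baseline["src_subject_id"].tolist())
print(toy_common_followup["src_subject_id"].tolist())
```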

In [6]:
# Drop siblings
df_baseline = drop_siblings(df_baseline)
df_followup = drop_siblings(df_followup)
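
How `drop_siblings` chooses which family member to keep is not shown here. If the choice depended on row order, calling it separately on baseline and follow-up could keep different siblings in the two frames, so it may be worth checking that both still contain the same subjects afterwards. The sketch below shows one deterministic alternative under stated assumptions: pick the kept subject on baseline (lowest subject ID per family, an arbitrary rule) and reuse that selection for follow-up. `keep_one_subject_per_family` is a hypothetical helper.

```python
import pandas as pd

def keep_one_subject_per_family(df, family_col="rel_family_id",
                                id_col="src_subject_id"):
    """Keep the subject with the lowest ID in each family.

    Assumes family IDs are non-missing: drop_duplicates treats NaN values
    as equal, so subjects without a family ID would collapse to one row.
    """
    ordered = df.sort_values(id_col)
    return ordered.drop_duplicates(subset=family_col, keep="first")

# Toy data: s1 and s2 are siblings; the two frames list rows in different order.
toy_sib_baseline = pd.DataFrame({
    "src_subject_id": ["s2", "s1", "s3"],
    "rel_family_id": [10, 10, 11],
})
toy_sib_followup = pd.DataFrame({
    "src_subject_id": ["s3", "s1", "s2"],
    "rel_family_id": [11, 10, 10],
})

kept_baseline = keep_one_subject_per_family(toy_sib_baseline)
# Reuse the baseline selection so both timepoints keep the same subjects.
kept_followup = toy_sib_followup[
    toy_sib_followup["src_subject_id"].isin(kept_baseline["src_subject_id"])
]
print(kept_baseline)
print(kept_followup)
```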


In [7]:
# Merge with g-score data
merged_df_baseline, merged_df_followup = link_with_g_scores(df_baseline, df_followup, g_factor)

In [9]:
save_processed_data(
    merged_df_baseline,
    merged_df_followup,
    output_dir="../data/processed/",
    prefix="features_linked"
)

Data saved to ../data/processed/features_linked_0.csv and ../data/processed/features_linked_1.csv
