# *[ARCHIVED]* RECOVER Pediatric PIC-SURE Data Dictionary

**ARCHIVED**

This notebook was used to create the RECOVER Pediatric Cohort Data Dictionary from the RECOVER Pediatric data available via the PIC-SURE API. The data dictionary is no longer regenerated for updated versions of the RECOVER datasets. The code is kept in the GitHub repository because it contains pieces that are helpful for data QA tests.

## General testing notes
The dataset QA tests that are run below perform these main functions:
1. Get all variables from the data dictionary for a given study
2. Run an "anyof" query using all of these variables

Running the query returns errors for variables that have no participant data associated with them.

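As a rough illustration of the QA idea above, the sketch below checks a toy export for dictionary variables that are either absent from the export or have no participant values. It uses plain pandas rather than the PIC-SURE API. The column names, `dictionary_paths`, and `find_problem_variables` are hypothetical names made up for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a query result: one row per participant, one column per variable.
# The column names are hypothetical, not real PIC-SURE paths.
results = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "\\study\\age\\": [4.0, np.nan, 7.5],
    "\\study\\unused_var\\": [np.nan, np.nan, np.nan],
})

# Variables listed in the dictionary (hypothetical, including one missing from the export)
dictionary_paths = ["\\study\\age\\", "\\study\\unused_var\\", "\\study\\not_exported\\"]

def find_problem_variables(paths, df):
    """Split dictionary paths into those absent from the export and those with no values."""
    missing = [p for p in paths if p not in df.columns]
    empty = [p for p in paths if p in df.columns and df[p].count() == 0]
    return missing, empty

missing_vars, empty_vars = find_problem_variables(dictionary_paths, results)
print("Missing from export:", missing_vars)
print("No participant values:", empty_vars)
```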
In [None]:
# Import general-purpose libraries
import sys
import pandas as pd
import re
import pyarrow.feather as feather
import numpy as np

## Step 1: Gather data from PIC-SURE API (or load existing data)
In this step, we get:
1. PIC-SURE Dictionary for RECOVER Pediatric cohort
2. Participant-level data for RECOVER Pediatric cohort

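The participant-level export may be too large for a single query, so the variable list can be split into chunks and the per-chunk results merged afterwards. The following is a minimal sketch of that pattern on toy data, assuming every export shares a `patient_id` column. `split_paths`, `merge_exports`, and the paths shown are hypothetical and not part of the PIC-SURE API.

```python
from functools import reduce

import numpy as np
import pandas as pd

def split_paths(paths, n_chunks):
    """Split a list of variable paths into n_chunks roughly equal lists."""
    return [list(chunk) for chunk in np.array_split(np.array(paths, dtype=object), n_chunks)]

def merge_exports(frames, keys):
    """Outer-merge per-chunk exports on the shared identifier columns."""
    return reduce(lambda left, right: left.merge(right, on=keys, how="outer"), frames)

# Hypothetical paths and per-chunk results standing in for real query exports
paths = ["\\a\\", "\\b\\", "\\c\\", "\\d\\", "\\e\\"]
chunks = split_paths(paths, 2)

export1 = pd.DataFrame({"patient_id": [1, 2], "\\a\\": [0.1, 0.2]})
export2 = pd.DataFrame({"patient_id": [2, 3], "\\d\\": ["x", "y"]})
merged = merge_exports([export1, export2], keys=["patient_id"])
print(chunks)
print(merged)
```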
In [None]:
# Install and import the PIC-SURE API client libraries
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git
import PicSureClient
import PicSureBdcAdapter

In [None]:
# Connect to PIC-SURE API
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # Strip whitespace, such as a trailing newline, from the token
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

In [None]:
# Search for RECOVER Pediatric (phs003461) variables in PIC-SURE Dictionary
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
all_variables = dictionary.find("phs003461") # Retrieve all variables you have access to
search_vars = all_variables.dataframe()
recover_vars = search_vars[search_vars.studyId == "phs003461"]

In [None]:
# Split the variable list in half so that each query exports a smaller set of variables
halfway = len(recover_vars) // 2
first_half = recover_vars[0:halfway]
second_half = recover_vars[halfway:]

In [None]:
# Get RECOVER Pediatric data
# Note: The RECOVER data is currently too large to export in one query, which is why two queries are created and executed
authPicSure = bdc.useAuthPicSure()
test_query1 = authPicSure.query()
test_query2 = authPicSure.query()
test_query1.anyof().add(first_half.HPDS_PATH)
test_query2.anyof().add(second_half.HPDS_PATH)
recover_results1 = test_query1.getResultsDataFrame(low_memory = False)
recover_results2 = test_query2.getResultsDataFrame(low_memory = False)

In [None]:
recover_results1.head()

In [None]:
recover_results2.head()

In [None]:
# Merge the results of the two queries
recover_results = recover_results1.merge(recover_results2, on = ["patient_id", "\\_Parent Study Accession with Subject ID\\",
                                                                 "\\_Topmed Study Accession with Subject ID\\", "\\_consents\\"],
                                                            how = 'outer')

In [None]:
# Save to files so that later steps do not depend on the dev environment
feather.write_feather(recover_results, 'recover_peds_export')
feather.write_feather(recover_vars, 'recover_peds_variable_info')

In [None]:
# Load previously saved files (file names must match those written in the save step)
recover_vars = feather.read_feather('recover_peds_variable_info')
recover_results = feather.read_feather('recover_peds_export')

## Step 2: Create Data Dictionary - Define Functions
Set up the functions to review the data and create the data dictionary.

Dataframe/Data Dictionary plan:

| Variable / Field Name | Field Label | Subcohort | Dataset | Data Type | Field Attributes | Mapped Instrument | Number of participants |
| ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Encoded variable name | Decoded Variable Description | Pediatric, Pediatric Caregiver, or Pediatric Congenital | Dataset Name | Continuous or Categorical | Continuous: min/max/mean; Categorical: Date, ID, or counts for the top 6 values | RedCap Instrument mapping | Number of participants with values |
| PIC-SURE name | PIC-SURE decoded description | Based on PIC-SURE variable group | Ordered as: the enrollment dataset at the enrollment visit, then enrollment demographics, then the other enrollment datasets, then follow-up, then biospecimens, then Fitbit | PIC-SURE data type | Continuous: show only 2 digits after the decimal; Categorical: show the top 6 values ranked by number of participants with values | Based on document shared | Number of participants with values |

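The "Field Attributes" column in the plan above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `continuous_summary` and `top_categories` are hypothetical helpers, the column names and values are made up, and `value_counts` is used here as a simple way to rank categories rather than as the exact counting approach used later in this notebook.

```python
import pandas as pd

def continuous_summary(series):
    """Return min, max, and mean rounded to two decimal places."""
    return {
        "min": round(float(series.min()), 2),
        "max": round(float(series.max()), 2),
        "mean": round(float(series.mean()), 2),
    }

def top_categories(series, n=6):
    """Return the n most frequent values and the number of participants with each."""
    counts = series.value_counts().head(n)
    return {value: int(count) for value, count in counts.items()}

# Toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "age_years": [3.14159, 10.5, 7.0, None],
    "sex": ["F", "M", "F", None],
})

age_attr = continuous_summary(toy["age_years"])
sex_attr = top_categories(toy["sex"])
print(age_attr)
print(sex_attr)
```

Missing values are ignored by `min`, `max`, `mean`, and `value_counts`, so the summaries reflect only participants with values.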

In [None]:
# Keep only the columns needed for the RECOVER Pediatric data dictionary
# .copy() makes an independent dataframe so a Subcohort column can be added without a SettingWithCopyWarning
simplified = recover_vars[['columnmeta_var_id', 'columnmeta_description', 'columnmeta_var_group_id', 
             'columnmeta_var_group_description', 'columnmeta_data_type', 'columnmeta_min', 
             'columnmeta_max', 'values', 'HPDS_PATH']].copy()

In [None]:
# First identify subcohort and separate based on this, since there are different codebooks for each subcohort
subcohort = []
for path in simplified['HPDS_PATH']:
    if "recover_pediatric_caregiver" in path:
        subcohort.append("Pediatric Caregiver")
    elif "recover_pediatric_congenital" in path:
        subcohort.append("Pediatric Congenital")
    elif "recover_pediatric\\" in path:
        subcohort.append("Pediatric")
    else:
        subcohort.append("Not found")

print(set(subcohort)) # Confirm that there are no instances of "Not found"

# Add subcohort information to simplified dataframe
simplified.loc[:,'Subcohort'] = subcohort

In [None]:
# Separate the variables from the three subcohorts
simplified_peds = simplified[simplified.Subcohort == "Pediatric"]
simplified_congenital = simplified[simplified.Subcohort == "Pediatric Congenital"]
simplified_caregiver = simplified[simplified.Subcohort == "Pediatric Caregiver"]

### Starting with RedCap mapping for Pediatric subcohort
RedCap codebooks found here: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/written-documentation/nih-recover-release-notes 

In [None]:
# Start with identifying the datasets used to map to instruments from RedCap form
picsure_dt = list(simplified_peds['columnmeta_var_group_id'].unique())
picsure_dt.sort()
print(picsure_dt)

In [None]:
# Map to RedCap instruments manually with human-readable name
peds_instrument_mapping = {
    "enrollment": "Enrollment",
    "consent_tracking_tier_12": "Consent Tracking Tier 12",
    "visit_form": "Visit Form",
    "identity": "Identity",
    "demographics": "Demographics",
    "child_birth_history": "Child Birth History",
    "child_current_health_status": "Child Current Health Status",
    "special_health_care_needs_screener": "Special Health Care Needs Screener",
    "promis": "Promis",
    "current_covid_infection_history": "Current Covid Infection History",
    "first_covid_infection_history": "First Covid Infection History",
    "most_recent_covid_infection_history": "Most Recent Covid Infection History",
    "weekly_covid_infection_history": "Weekly Covid Infection History",
    "covid_infection_history_followup": "Covid Infection History Followup",
    "misc_and_pots": "Misc And Pots",
    "covid_testing_history": "Covid Testing History",
    "covid_family_infection": "Covid Family Infection",
    "covid_symptoms": "Covid Symptoms",
    "compass31": "Compass31",
    "covid_vaccine_history": "Covid Vaccine History",
    "covid_health_consequences": "Covid Health Consequences",
    "child_social_determinants_of_health": "Child Social Determinants Of Health",
    "child_wellbeing": "Child Wellbeing",
    # PART TWO
    #"Adolescent Wellbeing"
    "asq": "Asq",
    "behavior_checklist": "Behavior Checklist",
    #"End Of Surveys"
    "anthropometry": "Anthropometry",
    "active_standing_test":"Active Standing Test",
    "electrocardiogram": "Electrocardiogram",
    #"Electrocardiogram Reading Center"
    "spirometry": "Spirometry",
    #"Pft Reading Center"
    'beighton_scale_for_joint_hypermobility': "Beighton Scale For Joint Hypermobility",
    'nih_toolbox': "NIH Toolbox",
    'pulse_oximetry': "Pulse Oximetry",
    'clinical_labs': "Clinical Labs",
    'research_labs': "Research Labs",
    'biospecimens': "Biospecimens",
    'antibody_test_results': "Antibody Test Results",
    'arup_test_results': "Arup Test Results",
    "promote_to_followup": "Promote To Followup",
    #"Select For Tier 3"
    'echocardiogram': "Echocardiogram",
    #"Echocardiogram Reading Center"
    #"Cardiac Mri Reading Center"
    #"Cpet Screening"
    #"Cpet"
    #"Cpet Reading Center"
    #"Neurocognitive And Wellbeing Assessments"
    #"Tier 3 Cssrs"
    #"Brain Mri Local Site"
    #"Brain Mri Reading Center"
    #"Eeg"
    #"Eeg Reading Center"
    #"AE/PD Report"
    "end_of_participation": "End Of Participation",
    "mhp_data": "Mhp Data"
}
# Pediatric Biospecimens, Pediatric Demographics, and Pediatric Fitbit do not have RedCap instruments

### RedCap mapping for Congenital Cohort

In [None]:
# Do the same for congenital
picsure_dt = list(simplified_congenital['columnmeta_var_group_id'].unique())
picsure_dt.sort()
for i in picsure_dt:
    print(i)

In [None]:
# Map to RedCap instruments manually with human-readable name
cong_instrum_mapping = {
    "enrollment": "Enrollment",
    "consent_tracking_tier_12": "Consent Tracking Tier 1-2",
    #"Visit Form"
    #"Identity"
    "demographics": "Demographics",
    "child_birth_history": "Child Birth History",
    "child_current_health_status": "Child Current Health Status",
    "first_covid_infection_history": "First COVID Infection History",
    "most_recent_covid_infection_history": "Most Recent COVID Infection History",
    "covid_infection_history_followup": "COVID Infection History (followup)",
    "related_conditions": "Related Conditions",
    "covid_family_infection": "COVID Family Infection",
    "covid_symptoms": "COVID Symptoms",
    "covid_vaccine_history": "COVID Vaccine History",
    "covid_health_consequences": "COVID Health Consequences",
    "child_social_determinants_of_health": "Child Social Determinants Of Health",
    "child_social_determinants_of_health_followup": "Child Social Determinants Of Health Followup",
    "bisqr": "Bisqr",
    "asq": "ASQ",
    "asqse": "Asqse",
    "cbcl":"Cbcl",
    #"End Of Surveys"
    "dp4":"Dp4",
    "bayley4":"Bayley4",
    "dasii": "Dasii",
    "anthropometry": "Anthropometry",
    "biospecimens": "Biospecimens",
    "antibody_test_results":"Antibody Test Results",
    #Alert Response Log
    #AE/PD Report
    "end_of_participation": "End Of Participation",
    "mchatrf": "Mchatrf",
    "promis": "PROMIS",
    "special_health_care_needs_screener": "Special Health Care Needs Screener"
}
# Pediatric Congenital Biospecimens and Pediatric Congenital Demographics do not have RedCap instruments

### RedCap Mapping for Caregiver Cohort

In [None]:
simplified_caregiver

In [None]:
# Do the same for caregiver
picsure_dt = list(simplified_caregiver['columnmeta_var_group_id'].unique())
picsure_dt.sort()
for i in picsure_dt:
    print(i)

In [None]:
care_instrum_mapping = {
    #Enrollment
    'consent_tracking': "Consent Tracking",
    #Visit Form
    #Identity
    #Demographics
    #PROMIS
    'current_health_status': 'Current Health Status',
    # First covid infection history
    # most recent covid infection history
    # covid infection history followup
    # covid testing history
    # covid vaccine history
    # Covid symptoms
    # Covid health consequences
    "household_social_determinants_of_health": "Household Social Determinants Of Health",
    "caregiver_social_determinants_of_health": "Caregiver Social Determinants Of Health",
    "caregiver_wellbeing": "Caregiver Wellbeing"
    #End of Surveys
    #Biospecimens
    #Antibody test results
    #AE/PD report
    #Mhp Data
}
# Pediatric Caregiver Biospecimens and Pediatric Caregiver Demographics do not have RedCap instruments

### Map each instrument in new column
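For comparison, the same lookup can be written with pandas `Series.map`. The sketch below uses a shortened, hypothetical mapping and variable table, not the real RECOVER dictionaries.

```python
import pandas as pd

# Hypothetical, shortened mapping and variable table for illustration only
instrument_mapping = {"enrollment": "Enrollment", "promis": "PROMIS"}
variables = pd.DataFrame({
    "columnmeta_var_id": ["enrl_date", "promis_score", "fitbit_steps"],
    "columnmeta_var_group_id": ["enrollment", "promis", "fitbit"],
})

# Series.map returns NaN for keys missing from the dict; fillna("") gives a blank default
variables["mapped_instrument"] = (
    variables["columnmeta_var_group_id"].map(instrument_mapping).fillna("")
)
print(variables)
```

Datasets without a mapped instrument end up with an empty string in `mapped_instrument`.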

In [None]:
# Function to add instruments to a new column in the dataframe
def add_instruments(mapping, df):
    # Reset the index so the concat below lines up rows by position
    df = df.reset_index()
    # Datasets without a mapped RedCap instrument get an empty string
    for_df_instrum = [mapping.get(dt, '') for dt in df['columnmeta_var_group_id']]
    df_instrum = pd.DataFrame({'mapped_instrument': for_df_instrum})
    new_df = pd.concat([df, df_instrum], axis=1)
    mvp = new_df[['columnmeta_var_id', 'columnmeta_description', 'Subcohort', 'columnmeta_var_group_id', 'columnmeta_var_group_description', 'columnmeta_data_type', 'mapped_instrument', 'HPDS_PATH', 'values']]
    return mvp

In [None]:
# Apply functions to subcohort datasets
mvp_peds = add_instruments(peds_instrument_mapping, simplified_peds)
mvp_cong = add_instruments(cong_instrum_mapping, simplified_congenital)
mvp_care = add_instruments(care_instrum_mapping, simplified_caregiver)

In [None]:
mvp_care.shape

In [None]:
# Create the data dictionary functions

def continuous(hpds_path, recover_results):
    field_attributes = {}
    col = recover_results[hpds_path]
    if np.isscalar(col.min()):
        # Single column: min/max/mean are scalars
        field_attributes["min"] = round(col.min(), 2)
        field_attributes["max"] = round(col.max(), 2)
        field_attributes["mean"] = round(col.mean(), 2)
    else:
        # min() returned a Series (e.g. the path matched more than one column), so use the first entry
        field_attributes["min"] = round(col.min().iloc[0], 2)
        field_attributes["max"] = round(col.max().iloc[0], 2)
        field_attributes["mean"] = round(col.mean().iloc[0], 2)
    return field_attributes

def categorical(hpds_path, recover_results, data_dict):
    field_attributes = {}
    pattern1 = r'^\d{4}-\d{2}-\d{2}$' #YYYY-MM-DD
    pattern2 = r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$' #YYYY-MM-DD HH:MM:SS
    row = data_dict[data_dict.HPDS_PATH == hpds_path]
    #print(list(row["values"].iloc[0]))
    if list(row["values"]) == [[]]:
        field_attributes = "No values"
    elif "participant_id" in row["HPDS_PATH"].iloc[0] or "kit_id" in row["columnmeta_var_id"].iloc[0]:
        field_attributes = "ID"
    elif len(row["values"]) > 0 and (bool(re.match(pattern1, list(row["values"].iloc[0])[0])) or (bool(re.match(pattern2, list(row["values"].iloc[0])[0])))):
        field_attributes = "Date"
    elif len(row["values"]) > 0:
        row_values = list(row["values"].iloc[0])
        #print(row_values)
        for cat in row_values:
            if cat == "false":
                field_attributes[False] = sum(recover_results[hpds_path] == False)
            elif cat == "true":
                field_attributes[True] = sum(recover_results[hpds_path] == True)
            else:
                try:
                    field_attributes[cat] = sum(recover_results[hpds_path] == float(cat))
                except (ValueError, TypeError):
                    # Category is not numeric, so compare it as-is
                    field_attributes[cat] = sum(recover_results[hpds_path] == cat)
        if len(field_attributes) > 5:
            field_attributes = dict(sorted(field_attributes.items(), key = lambda x: x[1], reverse = True)[:6])
    return field_attributes

In [None]:
def get_field_attr(mvp):
    num_participants_final = []
    field_attr_final = []
    # Variables to skip; each is reported as "Variable Not Found"
    skip_vars = [
        "\\phs003461\\recover_pediatric_caregiver\\demographics\\deceased_date\\",
        "\\phs003461\\recover_pediatric_congenital\\demographics\\enroll_zip_code\\",
        "\\phs003461\\recover_pediatric\\enrollment\\enrollment\\enrl_bpid2\\",
        "\\phs003461\\recover_pediatric\\follow-up_month_6\\covid_health_consequences\\stimqp_talkplan\\"
    ]

    for i, row in mvp.iterrows():
        hpds_path = mvp.loc[i, "HPDS_PATH"]
        print(hpds_path)
        if hpds_path in skip_vars:
            print(hpds_path)
            print("Skipping")
            field_attr_final.append("Variable Not Found")
            num_participants_final.append("Variable Not Found")
            continue
        d_type = mvp.loc[i, "columnmeta_data_type"]
        num_participant = recover_results[hpds_path].count()
        num_participants_final.append(num_participant)
    
        if d_type == "continuous":
            field_attr = continuous(hpds_path, recover_results)
        elif d_type == "categorical":
            field_attr = categorical(hpds_path, recover_results, mvp)
        else:
            # Unexpected data type: record it rather than reusing the previous variable's attributes
            field_attr = "Unknown data type"

        print(field_attr)

        field_attr_final.append(field_attr)
    return num_participants_final, field_attr_final

In [None]:
def add_info_to_df(df, res):
    final_data_dict = pd.concat([df, 
                                 pd.DataFrame({"Number Participants": res[0]}),
                                 pd.DataFrame({"Field Attributes": res[1]})],
                                 axis=1)
    final_data_dict = final_data_dict.drop(['HPDS_PATH', 'values'], axis = 1)
    final_data_dict.columns = ["Variable Name", "Variable Description", "Subcohort",
                               "Dataset", "Dataset Description", "Data Type", "Mapped Instrument",
                                'Number Participants', 'Field Attributes']
    return(final_data_dict)

In [None]:
care_attr = get_field_attr(mvp_care)
care_final = add_info_to_df(mvp_care, care_attr)


In [None]:
care_final.head()

In [None]:
cong_attr = get_field_attr(mvp_cong)
cong_final = add_info_to_df(mvp_cong, cong_attr)

In [None]:
cong_final.head()

In [None]:
peds_attr = get_field_attr(mvp_peds)
peds_final = add_info_to_df(mvp_peds, peds_attr)

## Write output

In [None]:
peds_final.to_csv("RECOVER_Pediatrics_PIC-SURE_Data_Dictionary_Pediatric_Cohort.csv")
care_final.to_csv("RECOVER_Pediatrics_PIC-SURE_Data_Dictionary_Pediatric_Caregiver.csv")
cong_final.to_csv("RECOVER_Pediatrics_PIC-SURE_Data_Dictionary_Pediatric_Congenital.csv")

## Unexpected findings

In [None]:
testing_var = '\\phs003461\\recover_pediatric_congenital\\demographics\\enroll_protocol\\'
# Show unique values from the data dictionary
dict_vals = mvp_cong[mvp_cong.HPDS_PATH == testing_var]["values"]
print(dict_vals.iloc[0][0])  # .iloc selects by position; the filtered row's index label is not necessarily 0
# Show unique values from the data itself
data_vals = recover_results[testing_var].unique()
print(data_vals)

In [None]:
# Count matches in the data using the value from the data dictionary
print("from data dict")
print(sum(recover_results[testing_var] == dict_vals.iloc[0][0]))

print("from data")
print(sum(recover_results[testing_var] == data_vals[1]))