Import the NIH Toolbox Data Dictionary into a pandas DataFrame and interpret some of the responses

In [1]:
 
#Here is the link: https://nihtoolbox.desk.com/customer/portal/kb_article_attachments/144125/original.xlsx?1562788410

# This notebook provides a solution to the Button problem and creates the 'interpreted' NIH Toolbox data dictionary in this repository

The Button problem:
In short, the button problem is a mismatch between the published NIH Toolbox Data Dictionary and the actual output from the iPads for variables that encode 'correct' or 'incorrect' answers chosen from a set of response options.

Open the NIH Toolbox Data Dictionary, either from the link above or from the copy stored in this repo.

Look at an example, say 'NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1'.
You will notice that most items for this instrument have something like the following examples in the Responses column:

example 1:
0=BlueBall\n1=YellowTruck\n  

example 2:
1=BlueBall\n0=YellowTruck\n 

This is confusing because the ACTUAL exported data contain 1's and 2's. Meanwhile, the NDA fields are prepared to accept 'correct' and 'incorrect'.


We have been instructed (by the NIH Toolbox help desk) to interpret this notation as follows:
A '1' on the left of the '=' always means that selecting the item to the right of the '=' is 'correct'.
A '0' on the left of the '=' means that selecting the item to the right of the '=' is 'incorrect'.

The first item listed in the Responses (Blue Ball in this case) corresponds with a 1 in the export.
The second item (Yellow Truck in this case) corresponds with a 2 in the export.

So to translate from the 1's and 2's in the export (other buttons have more options), we need to build a map.
Please execute all cells in this notebook to convince yourself that the logic is being captured correctly, because it is this logic that is encoded in the 'requested_python' column of the crosswalk for >500 elements.
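
As a minimal sketch of the rule above, the helper below turns a Responses string into a map from exported button value to 'correct' or 'incorrect'. The name `button_map` is illustrative and not part of this notebook; the sketch assumes the options are separated by newlines and that a button's export value is its 1-based position in the list.

```python
def button_map(responses):
    """Map each exported button value to 'correct' or 'incorrect'."""
    mapping = {}
    for position, option in enumerate(responses.split("\n"), start=1):
        if "=" not in option:
            continue  # skip the empty piece left by the trailing newline
        flag = option.split("=", 1)[0].strip()
        mapping[position] = "correct" if flag == "1" else "incorrect"
    return mapping


example_1 = button_map("0=BlueBall\n1=YellowTruck\n")
example_2 = button_map("1=BlueBall\n0=YellowTruck\n")
print(example_1)
print(example_2)
```

For example 1, pressing the second button (YellowTruck) is the correct answer; for example 2, it is the first button (BlueBall).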
 


In [275]:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile


In [278]:
#does anyone know how to read this directly from the NIH Toolbox help desk site?
#readfromNIH='https://nihtoolbox.desk.com/customer/portal/kb_article_attachments/144125/original.xlsx?1562788410'
#in the meantime, read from a downloaded copy (replace the paths below with your own locations).
#note that this version of the Data Dictionary is no longer live; they are probably fixing it. Hopefully they will not
#change the format
fpath_downloadedNIH='/home/petra/UbWinSharedSpace1/ccf-nda-behavioral/PycharmToolbox/Ipad2NDA_withCrosswalk/NIHToolbox2NDA/'
fname='NIH_Toolbox_IPAD_DataDictionary041119_Edited_(7-9-19).xlsx'
fstring=fpath_downloadedNIH+fname

In [279]:
#read the crosswalk file (don't assume that all of the NIH Toolbox DD vars are complete; the code downstream of this is an
#iterative process)
pathin="/home/petra/UbWinSharedSpace1/ccf-nda-behavioral/PycharmToolbox/Ipad2NDA_withCrosswalk/NIHToolbox2NDA/"
crosswalkfile="Crosswalk_NIH_Toolbox_2_NDA.csv"
crosswalk=pd.read_csv(pathin+crosswalkfile,header=0,low_memory=False, encoding = "ISO-8859-1")
crosswalk.head()

#import csv
#with open(crosswalkfile, 'rb') as f:
#    reader = csv.reader(f)
#    linenumber = 1
#    try:
#        for row in reader:
#            linenumber += 1
#    except Exception as e:
#        print("Error line "+str(linenumber))
        


Unnamed: 0,Measurement System,Domain,Inst,Item ID,Stem,Context,DataType,Responses,Translation (output of IPAD),hcp_variable,...,specialty_code,hcp_variable_upload,nda_structure,nda_element,description,valueRange,notes,template,inst_short,Source
0,,,Anxiety Summary Parent Report (3-7),,,,,,,Assessment_Name,...,,Assessment_Name,tlbx_fearanx01,version_form,Form used/assessment name,,,Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD
1,,,Anxiety Summary Parent Report (3-7),,,,,,,Fully_Corrected_T_score,...,,Fully_Corrected_T_score,tlbx_fearanx01,nih_tlbx_fctsc,Fully-Corrected T-Score,0::120,,Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD
2,,,Anxiety Summary Parent Report (3-7),,,,,,,Language,...,,primary_language,tlbx_fearanx01,primary_language,Subject's Primary Language,,,Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD
3,,,Cognition Composite Scores,,,,,,,Assessment_Name,...,1.0,Assessment_Name,cogcomp01,version_form,Form used/assessment name,,,cogcomp01_template,cogcomp01,HCPD HCPA
4,,,Cognition Composite Scores,,,,,,,Language,...,1.0,Language,cogcomp01,interview_language,Language Used in the Interview,,,cogcomp01_template,cogcomp01,HCPD HCPA


In [296]:
#print(fstring)
tlbxitems = pd.read_excel(fstring, sheet_name='NIH Toolbox')

print("Column headings:")
print(tlbxitems.columns)


Column headings:
Index(['Measurement System', 'Domain', 'Instrument Title', 'Item ID', 'Stem',
       'Context', 'DataType', 'Responses'],
      dtype='object')


In [297]:
tlbxitems['Item ID'].head()

0     VOCAB_INTRO
1    VOCAB_INSTR1
2    VOCAB_PRACT1
3    VOCAB_PRACT2
4    VOCAB_INSTR2
Name: Item ID, dtype: object

In [298]:
tlbxitems['Item ID'].shape

(4778,)

In [299]:
#remove identical duplicate rows from the data dictionary
#the vast majority of duplicates are for variable type 'information', but NOT ALL.
tlbxitems=tlbxitems.drop_duplicates(subset=None, keep='first', inplace=False).copy()
tlbxitems.shape

(4390, 8)
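
Before dropping rows, it can help to look at what is being dropped. The sketch below uses a made-up three-row frame (not the real data dictionary) to show `duplicated(keep=False)`, which flags every member of a duplicated group, next to `drop_duplicates()`.

```python
import pandas as pd

# Toy stand-in for the data dictionary; the rows are invented for illustration.
toy = pd.DataFrame({
    "Item ID": ["VOCAB_INTRO", "VOCAB_INTRO", "VOCAB_PRACT1"],
    "DataType": ["information", "information", "integer"],
})

# keep=False marks all copies of a repeated row, so the repeats are easy to inspect.
repeats = toy[toy.duplicated(keep=False)]
deduped = toy.drop_duplicates()

print(repeats)
print(deduped.shape)
```

Tabulating `repeats['DataType'].value_counts()` on the real frame would be one way to check the observation that most, but not all, duplicates are 'information' rows.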

In [300]:
#to identify and create specialty button code to populate the 'Translation (output of IPAD)' column
#send this code back to the NIH Toolbox people so that they can update their data dictionary 
#and other people (who might not have yet identified this issue) don't have this problem

In [301]:
tlbxitems['hcp_variable']=tlbxitems['Item ID'].str.lower().str.replace('-','_',regex=False).str.replace('(','_',regex=False).str.replace(')','_',regex=False)
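
A small sketch of this normalization on made-up Item IDs (they are not taken from the data dictionary). Parentheses are special characters in regular expressions, and the default treatment of the pattern in `str.replace` has differed between pandas versions, so the sketch passes `regex=False` to request a literal replacement explicitly.

```python
import pandas as pd

# Invented IDs that exercise each replaced character.
item_ids = pd.Series(["DCCS_SHAPE-PRAC2", "LAVOC003", "Item(1)"])

hcp_variable = item_ids.str.lower()
for char in ["-", "(", ")"]:
    # regex=False treats the character literally
    hcp_variable = hcp_variable.str.replace(char, "_", regex=False)

print(hcp_variable.tolist())
```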

In [302]:
#of the following instruments, specialty code to resolve button press issues needs to be created for at
#least the DCCS instruments and the Picture Vocab Tests...let's start there
insts_w_buttons_issues =crosswalk.loc[crosswalk.specialty_code.str.contains('button')==True].Inst.unique().astype(list)
insts_w_buttons_issues

array(['NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1',
       'NIH Toolbox Dimensional Change Card Sort Test Ages 3-7 v2.1',
       'NIH Toolbox Dimensional Change Card Sort Test Ages 8-11 v2.1',
       'NIH Toolbox Flanker Inhibitory Control and Attention Test Age 12+ v2.1',
       'NIH Toolbox Flanker Inhibitory Control and Attention Test Ages 3-7 v2.1',
       'NIH Toolbox Flanker Inhibitory Control and Attention Test Ages 8-11 v2.1',
       'NIH Toolbox Oral Reading Recognition Test Age 3+ v2.1',
       'NIH Toolbox Pattern Comparison Processing Speed Test Age 7+ v2.1',
       'NIH Toolbox Pattern Comparison Processing Speed Test Ages 3 - 6 v2.1',
       'NIH Toolbox Picture Vocabulary Test Age 3+ v2.0',
       'NIH Toolbox Picture Vocabulary Test Age 3+ v2.1'], dtype=object)
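
The `==True` comparison above is there because `str.contains` returns missing values for rows where `specialty_code` is empty. A sketch of an alternative, on an invented crosswalk, is to pass `na=False` so the mask is strictly boolean; the instrument names here are placeholders.

```python
import pandas as pd

# Invented crosswalk rows; None stands for an empty specialty_code cell.
toy_crosswalk = pd.DataFrame({
    "Inst": ["Instrument A", "Instrument A", "Instrument B", "Instrument C"],
    "specialty_code": ["button", "button", None, "other"],
})

# na=False makes missing values count as "does not contain".
has_button = toy_crosswalk["specialty_code"].str.contains("button", na=False)
affected = toy_crosswalk.loc[has_button, "Inst"].unique().tolist()
print(affected)
```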

In [303]:
##subset for testing
#insts_w_buttons_issues =['NIH Toolbox Picture Vocabulary Test Age 3+ Practice v2.0',
#                         'NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1',
#                         'NIH Toolbox Picture Vocabulary Test Age 3+ v2.0',
#                         'NIH Toolbox Pattern Comparison Processing Speed Test Age 7+ v2.1',
#                         'NIH Toolbox Pattern Comparison Processing Speed Test Ages 3 - 6 v2.1'
#                         ]

In [304]:
#examples (look at the Responses column; only the last expression in the cell is displayed)
tlbxitems.loc[tlbxitems['Instrument Title']=='NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1']
tlbxitems.loc[tlbxitems['Instrument Title']=='NIH Toolbox Picture Vocabulary Test Age 3+ v2.0']


Unnamed: 0,Measurement System,Domain,Instrument Title,Item ID,Stem,Context,DataType,Responses,hcp_variable
5,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC003,LAVOC003,,integer,0=LAVOC003-3\n0=LAVOC003-2\n0=LAVOC003-1\n1=LA...,lavoc003
6,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC051,LAVOC051,,integer,1=LAVOC051-9\n0=LAVOC051-1\n0=LAVOC051-3\n0=LA...,lavoc051
7,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC011,LAVOC011,,integer,0=LAVOC011-3\n1=LAVOC011-9\n0=LAVOC011-1\n0=LA...,lavoc011
8,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC014,LAVOC014,,integer,0=LAVOC014-3\n0=LAVOC014-2\n0=LAVOC014-1\n1=LA...,lavoc014
9,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC013,LAVOC013,,integer,0=LAVOC013-1\n1=LAVOC013-9\n0=LAVOC013-3\n0=LA...,lavoc013
10,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC028,LAVOC028,,integer,0=LAVOC028-2\n0=LAVOC028-3\n0=LAVOC028-1\n1=LA...,lavoc028
11,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC062,LAVOC062,,integer,0=LAVOC062-2\n0=LAVOC062-1\n0=LAVOC062-3\n1=LA...,lavoc062
12,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC004,LAVOC004,,integer,1=LAVOC004-9\n0=LAVOC004-3\n0=LAVOC004-2\n0=LA...,lavoc004
13,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC002,LAVOC002,,integer,0=LAVOC002-2\n1=LAVOC002-9\n0=LAVOC002-3\n0=LA...,lavoc002
14,NIH Toolbox,Cognition,NIH Toolbox Picture Vocabulary Test Age 3+ v2.0,LAVOC008,LAVOC008,,integer,1=LAVOC008-9\n0=LAVOC008-1\n0=LAVOC008-2\n0=LA...,lavoc008


In [305]:
#How to resolve?  Define a function that will translate this into code to populate the crosswalk,
#i.e. parse the response into 'correct' and 'button' logic, and then rebuild that into a python snippet
#for any entry in the 'Responses' column for a given row in the NIH Toolbox's data dictionary.

#example:
#A=buttoncorrect(instrument='NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1',variable='dccs_shape_prac2')
#A is a tuple: ttestkeep,pythonsnp,translation
#A[0] is ttestkeep: the mini data frame that illustrates the logic of the translation
#A[1] is pythonsnp: the python code to translate NIH Toolbox to NDA
#A[2] is translation: the wording to add to the NIH Toolbox data dictionary for interpretation

def buttoncorrect(instrument='NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1',variable='dccs_shape_instr1'):
    #restrict the data dictionary to the requested instrument
    ddsub=tlbxitems.loc[tlbxitems['Instrument Title']==instrument].copy()
    #split the Responses string for this variable into one column per response option
    test=ddsub.loc[ddsub.hcp_variable==variable].Responses.str.split('\n',expand=True)
    test=test.reset_index(drop=True)
    #transpose so that each response option becomes a row, in the order listed
    ttest=test.transpose()
    ttest['dd']=ttest[0]
    #the position in the list (starting at 1) is the value the IPAD exports
    ttest['button']=ttest.index + 1
    #keep only real options (the trailing newline leaves an empty string behind)
    ttestkeep=ttest.loc[ttest.dd.str.contains('=')==True,['dd','button']].copy()
    ttestkeep['NDA_answer_as_string']=''
    ttestkeep.loc[ttest.dd.str.contains('0='),'NDA_answer_as_string']='incorrect'
    ttestkeep.loc[ttest.dd.str.contains('1='),'NDA_answer_as_string']='correct'
    ttestkeep.loc[ttest.dd.str.contains('0='),'NDA_answer_as_number']='0'
    ttestkeep.loc[ttest.dd.str.contains('1='),'NDA_answer_as_number']='1'
    #now we turn all this logic into a single string of code for the given response option 
    #that can be used to populate the 'requested_python' field of the crosswalk
    ttestkeep['codeblock']='studydata.loc[studydata.{}=='.format(variable)+ttestkeep.button.astype(str)+",'{}']=".format(variable) +ttestkeep.NDA_answer_as_number
    ttestkeep['translate']='IPAD output value of '+ttestkeep.button.astype(str)+'='+ttestkeep.NDA_answer_as_string
    pythonsnp=';'.join(ttestkeep.codeblock.tolist())
    translation=';'.join(ttestkeep.translate.tolist())
    #you'll only actually need the pythonsnp, but prove to yourself that it's working by examining the output for a few vars
    return ttestkeep,pythonsnp,translation
    


In [306]:
#see that the logic is getting mapped properly
ttestkeep,pythonsnp,translation=buttoncorrect(instrument='NIH Toolbox Dimensional Change Card Sort Test Age 12+ v2.1',variable='dccs_shape_prac2')
print(translation)
ttestkeep

IPAD output value of 1=incorrect;IPAD output value of 2=correct


Unnamed: 0,dd,button,NDA_answer_as_string,NDA_answer_as_number,codeblock,translate
0,0=WhiteRabbit,1,incorrect,0,"studydata.loc[studydata.dccs_shape_prac2==1,'d...",IPAD output value of 1=incorrect
1,1=BrownBoat,2,correct,1,"studydata.loc[studydata.dccs_shape_prac2==2,'d...",IPAD output value of 2=correct


In [307]:
#see that the logic is getting translated to extractable code
ttestkeep,pythonsnp,translation=buttoncorrect(instrument='NIH Toolbox Picture Vocabulary Test Age 3+ v2.0',variable='lavoc091')
for i in ttestkeep.codeblock:
    print(i)
print("***")
print(pythonsnp)
print(translation)
ttestkeep

studydata.loc[studydata.lavoc091==1,'lavoc091']=0
studydata.loc[studydata.lavoc091==2,'lavoc091']=1
studydata.loc[studydata.lavoc091==3,'lavoc091']=0
studydata.loc[studydata.lavoc091==4,'lavoc091']=0
***
studydata.loc[studydata.lavoc091==1,'lavoc091']=0;studydata.loc[studydata.lavoc091==2,'lavoc091']=1;studydata.loc[studydata.lavoc091==3,'lavoc091']=0;studydata.loc[studydata.lavoc091==4,'lavoc091']=0
IPAD output value of 1=incorrect;IPAD output value of 2=correct;IPAD output value of 3=incorrect;IPAD output value of 4=incorrect


Unnamed: 0,dd,button,NDA_answer_as_string,NDA_answer_as_number,codeblock,translate
0,0=LAVOC091-2,1,incorrect,0,"studydata.loc[studydata.lavoc091==1,'lavoc091']=0",IPAD output value of 1=incorrect
1,1=LAVOC091-9,2,correct,1,"studydata.loc[studydata.lavoc091==2,'lavoc091']=1",IPAD output value of 2=correct
2,0=LAVOC091-3,3,incorrect,0,"studydata.loc[studydata.lavoc091==3,'lavoc091']=0",IPAD output value of 3=incorrect
3,0=LAVOC091-1,4,incorrect,0,"studydata.loc[studydata.lavoc091==4,'lavoc091']=0",IPAD output value of 4=incorrect
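
To see what the generated snippet does to exported data, the sketch below runs the `lavoc091` string shown above against a made-up `studydata` frame. Running it with `exec` is an assumption about how the 'requested_python' column is consumed downstream; the `map` call is only there as an independent check of the expected result.

```python
import pandas as pd

# Toy export: each value is the position of the button that was pressed.
studydata = pd.DataFrame({"lavoc091": [1, 2, 3, 4, 2]})

pythonsnp = (
    "studydata.loc[studydata.lavoc091==1,'lavoc091']=0;"
    "studydata.loc[studydata.lavoc091==2,'lavoc091']=1;"
    "studydata.loc[studydata.lavoc091==3,'lavoc091']=0;"
    "studydata.loc[studydata.lavoc091==4,'lavoc091']=0"
)

# The same recode in one step with a dictionary, computed before the data are changed.
expected = studydata["lavoc091"].map({1: 0, 2: 1, 3: 0, 4: 0}).tolist()

# exec runs the statements in order; only use it on strings you generated yourself.
exec(pythonsnp, {"studydata": studydata})

recoded = studydata["lavoc091"].tolist()
print(recoded)
```

The statements are applied in button order. Because the statement for button 1 runs first and the new values are only 0 or 1, a value that has already been recoded is not picked up again by a later statement.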


Now make a wrapper around this function...

Given the name of an affected instrument, find the variables with button issues and output a dataframe
with the instrument name, 'Item ID', 'hcp_variable', pythonsnp, and translation.
Then put the translation into the NIH DD (tlbxitems) and the pythonsnp into the crosswalk.


In [309]:
#inst=insts_w_buttons_issues[2]
#print(inst)
buttoncodes=pd.DataFrame()
for inst in insts_w_buttons_issues:
    ddsub=tlbxitems.loc[tlbxitems['Instrument Title']==inst].copy()
    if ddsub.empty:
        print(inst+ ' is not in the NIH Toolbox Data Dictionary so button issue cannot be resolved')
    else:
        ddsub_buttons=ddsub.loc[ddsub.Responses.str.contains('1=')==True,['Instrument Title','Item ID','Responses','hcp_variable']] 
        #remember that the output of the buttoncorrect function is a tuple,
        #so the [2] below grabs the specific output we want (the translation) and puts it into the column of interest;
        #similarly, the [1] below grabs the python code
        ddsub_buttons['Translation (output of IPAD)']=ddsub_buttons.apply(lambda x: buttoncorrect(instrument=inst,variable=x['hcp_variable'])[2],axis=1)
        ddsub_buttons['requested_python']=ddsub_buttons.apply(lambda x: buttoncorrect(instrument=inst,variable=x['hcp_variable'])[1],axis=1)
        ddsub_buttons=ddsub_buttons[['Translation (output of IPAD)','requested_python','Instrument Title','hcp_variable']]
        buttoncodes=pd.concat([buttoncodes,ddsub_buttons],axis=0)

NIH Toolbox Oral Reading Recognition Test Age 3+ v2.1 is not in the NIH Toolbox Data Dictionary so button issue cannot be resolved
NIH Toolbox Picture Vocabulary Test Age 3+ v2.1 is not in the NIH Toolbox Data Dictionary so button issue cannot be resolved
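
The loop above calls `buttoncorrect` twice per row, once for each tuple position. A sketch of calling it once and unpacking both positions is shown below; `fake_buttoncorrect` is a hypothetical stand-in with the same tuple shape, used only so the example runs on its own.

```python
import pandas as pd


def fake_buttoncorrect(variable):
    """Hypothetical stand-in returning a 3-tuple shaped like buttoncorrect's output."""
    return None, "code_for_" + variable, "text_for_" + variable


items = pd.DataFrame({"hcp_variable": ["dccs_shape_prac1", "dccs_shape_prac2"]})

# Call the function once per row, then unpack the tuple positions into columns.
results = [fake_buttoncorrect(v) for v in items["hcp_variable"]]
items["requested_python"] = [r[1] for r in results]
items["Translation (output of IPAD)"] = [r[2] for r in results]

print(items)
```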


In [313]:
buttoncodes=buttoncodes.rename(columns={'Instrument Title':'Inst'})
buttoncodes

Unnamed: 0,Translation (output of IPAD),requested_python,Inst,hcp_variable
2605,IPAD output value of 1=incorrect;IPAD output v...,"studydata.loc[studydata.dccs_shape_instr1==1,'...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_instr1
2606,IPAD output value of 1=correct;IPAD output val...,"studydata.loc[studydata.dccs_shape_instr2==1,'...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_instr2
2608,IPAD output value of 1=correct;IPAD output val...,"studydata.loc[studydata.dccs_shape_prac1==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac1
2609,IPAD output value of 1=incorrect;IPAD output v...,"studydata.loc[studydata.dccs_shape_prac2==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac2
2610,IPAD output value of 1=incorrect;IPAD output v...,"studydata.loc[studydata.dccs_shape_prac3==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac3
2611,IPAD output value of 1=correct;IPAD output val...,"studydata.loc[studydata.dccs_shape_prac4==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac4
2612,IPAD output value of 1=incorrect;IPAD output v...,"studydata.loc[studydata.dccs_shape_instr5==1,'...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_instr5
2613,IPAD output value of 1=correct;IPAD output val...,"studydata.loc[studydata.dccs_shape_instr6==1,'...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_instr6
2615,IPAD output value of 1=correct;IPAD output val...,"studydata.loc[studydata.dccs_shape_prac5==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac5
2616,IPAD output value of 1=incorrect;IPAD output v...,"studydata.loc[studydata.dccs_shape_prac6==1,'d...",NIH Toolbox Dimensional Change Card Sort Test ...,dccs_shape_prac6


In [314]:
#now update the crosswalk with interpretations and python snippets


In [315]:
#now merge this data into the crosswalk and write the new csv
#note that you'll have to massage the crosswalk after the merge because there will now be two requested_python
#columns (_x and _y), which prevents overwriting. Open the file and update the cells by hand until you are
#satisfied that an automatic update would not introduce bugs
newcrosswalk=pd.merge(crosswalk,buttoncodes,how='left',on=['Inst','hcp_variable'])

In [316]:
newcrosswalk.columns

Index(['Measurement System', 'Domain', 'Inst', 'Item ID', 'Stem', 'Context',
       'DataType', 'Responses', 'Translation (output of IPAD)_x',
       'hcp_variable', 'action_requested', 'requested_python_x',
       'specialty_code', 'hcp_variable_upload', 'nda_structure', 'nda_element',
       'description', 'valueRange', 'notes', 'template', 'inst_short',
       'Source', 'Translation (output of IPAD)_y', 'requested_python_y'],
      dtype='object')
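
The `_x` and `_y` columns appear because both frames carry 'Translation (output of IPAD)' and 'requested_python'. This notebook deliberately leaves reconciling them to a manual step. If an automatic update were wanted later, one possible sketch on invented data is below; preferring the newly generated value over the existing one is an assumption, not a rule stated in this notebook.

```python
import pandas as pd

# Invented stand-ins for the crosswalk (left) and the generated button codes (right).
left = pd.DataFrame({
    "Inst": ["A", "A", "B"],
    "hcp_variable": ["v1", "v2", "v3"],
    "requested_python": ["old_v1", None, "old_v3"],
})
right = pd.DataFrame({
    "Inst": ["A", "A"],
    "hcp_variable": ["v1", "v2"],
    "requested_python": ["new_v1", "new_v2"],
})

# validate raises an error if a key appears more than once in the right-hand frame.
merged = pd.merge(left, right, how="left", on=["Inst", "hcp_variable"],
                  suffixes=("_x", "_y"), validate="many_to_one")

# Prefer the newly generated value (_y) and fall back to the existing one (_x).
merged["requested_python"] = merged["requested_python_y"].combine_first(
    merged["requested_python_x"]
)
merged = merged.drop(columns=["requested_python_x", "requested_python_y"])
print(merged)
```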

In [317]:
newcrosswalk.to_csv(pathin+'test'+crosswalkfile,index=False)