# This Jupyter Notebook is written to convert Data and Scores files from NIH Toolbox IPAD exports into NDA data structures using linking information from a 'crosswalk' and extra NDA-required subject data from a csv.

Some notes:
This notebook uses a dedicated Python 3 virtual environment (named PycharmToolbox) as its kernel.
The kernel was installed by running the following commands in my terminal and then selecting it from the kernel dropdown menu above:
> source /home/petra/.virtualenvs/PycharmToolbox/bin/activate
> pip install ipykernel
> ipython kernel install --user --name=PycharmToolbox
> jupyter-notebook

The requirements file was generated from within the activated virtual environment by:
> pip freeze > requirements.txt


In [1]:
import os, datetime
import pandas as pd
import numpy as np
import subprocess

snapshotdate = datetime.datetime.today().strftime('%m_%d_%Y')


Specify the input and output data and paths for NIH toolbox.
To run the cells of this notebook, you will need four files.

Two are in the .csv format of the iPad Toolbox application export: a raw Data file containing item-level responses, and a Scores file containing the summary statistics for the collection of item-level data. We don't need the registration file. These two files are linked by the PIN and Inst variables and must be cleaned a priori to remove subjects that are in one file but not the other, i.e. the list of unique PINs (e.g. HCP0211999_V1) in one file should be exactly the same as the list of unique PINs in the other. For HCP data, we concatenate the exports of all subjects' Scores data into a single file, and the exports of all subjects' raw Data into a second file. Because all other sources of HCP data use 'subject' and 'visit' rather than a PIN, which is a concatenation of both, we also create these variables (subject and visit) from PIN prior to running this program.
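
The two preparation steps described above (deriving subject and visit from PIN, and confirming that both files hold the same PINs) could be sketched roughly as follows. This is only an illustration on made-up PINs, assuming every PIN has the form `<subject>_<visit>`; the helper names `split_pin` and `pin_mismatches` are hypothetical and are not part of this notebook.

```python
def split_pin(pin):
    """Split a PIN such as 'HCP0211999_V1' into (subject, visit)."""
    subject, sep, visit = pin.rpartition('_')
    if not sep:
        raise ValueError(f"PIN has no '_' separator: {pin!r}")
    return subject, visit


def pin_mismatches(score_pins, raw_pins):
    """Return (PINs only in the Scores file, PINs only in the Data file)."""
    score_set, raw_set = set(score_pins), set(raw_pins)
    return sorted(score_set - raw_set), sorted(raw_set - score_set)


# Made-up PINs for illustration only
score_pins = ['HCP0211999_V1', 'HCP0211999_V1', 'HCP0300000_V1']
raw_pins = ['HCP0211999_V1', 'HCP0400000_V2']

print(split_pin('HCP0211999_V1'))
only_scores, only_raw = pin_mismatches(score_pins, raw_pins)
print('Only in Scores:', only_scores)
print('Only in Data:', only_raw)
```

With pandas, the same helper could be applied to a PIN column with `.map(split_pin)`; both lists being empty is the condition the cleaning step is meant to guarantee.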

The third necessary file is a csv containing the fields that NDA requires in all of their structures 
e.g. subjectkey (GUID or pseudo-GUID), src_subject_id (e.g. HCP0211999), interview_age (in months), and gender (misnomer for sex assigned at birth).  In HCP data, we link the two sources of information via 'subject' and 'visit.'  

Lastly, read in the crosswalk file - which will map your vars to NDA after transpose is complete.  I have placed the crosswalk from HCPA as an example.  Any instruments in this crosswalk that are the same as yours (look at 'Inst' column) will work for you.  You will have to add any instruments not present, after obtaining variable maps and templates from the NDA for your particular set of NIH Toolbox Data.  

Note that subject and visit are variables we created locally to merge with the data coming from a different local source (REDCap). They are not variables that are output by the NIH Toolbox app on the iPads, but they are necessary for the merge with the NDA-required fields stored elsewhere.


In [2]:
#path for formatted structures
pathout="/home/petra/UbWinSharedSpace1/ccf-nda-behavioral/PycharmToolbox/Ipad2NDA_withCrosswalk/NIHToolbox2NDA/prepped_structures/"

In [3]:
#csv scores and raw files for transformation - 
scoresD='/home/petra/UbWinSharedSpace1/boxtemp/HCAorBoth_Toolbox_Scored_Combined_12_17_2019.csv'
rawD='/home/petra/UbWinSharedSpace1/boxtemp/HCAorBoth_Toolbox_Raw_Combined_12_17_2019.csv'

In [4]:
#read into dataframe and take a peek
scordata=pd.read_csv(scoresD,header=0,low_memory=False)
scordata.head()

Unnamed: 0.1,Unnamed: 0,Age,Age-Corrected Standard Score,Age-Corrected Standard Scores Dominant,Age-Corrected Standard Scores Non-Dominant,AgeCorrCrystal,AgeCorrDCCS,AgeCorrEarly,AgeCorrEngRead,AgeCorrEngVocab,...,gender,iPad Version,pin,raw_cat_date,site,source,study,subject,v1_interview_date,visit
0,0,,101.0,,,,,,,,...,,iPad Air 2 (WiFi),,12_12_2019,,,,HCA6058970,,V1
1,1,,97.0,,,,,,,,...,,iPad Air 2 (WiFi),,12_12_2019,,,,HCA6058970,,V1
2,2,,97.0,,,,,,,,...,,iPad Air 2 (WiFi),,12_12_2019,,,,HCA6058970,,V1
3,3,,120.0,,,,,,,,...,,iPad Air 2 (WiFi),,12_12_2019,,,,HCA6058970,,V1
4,4,,78.0,,,,,,,,...,,iPad Air 2 (WiFi),,12_12_2019,,,,HCA6058970,,V1


In [5]:
rawdata=pd.read_csv(rawD,header=0,low_memory=False)
rawdata.head()

Unnamed: 0.1,Unnamed: 0,Age,App Version,Assessment Name,DataType,DateCreated,DateCreatedDatetime,DeviceID,Education,Ethnicity,...,index,level_0,parent,raw_cat_date,site,source,study,subject,v1_interview_date,visit
0,0,,1.19.2160,Assessment 1,informational,1/29/19 11:37,,6D3999FF-2614-43C3-BB7E-82729622914B,,,...,,,,12_12_2019,,,,HCA6058970,,V1
1,1,,1.19.2160,Assessment 1,informational,1/29/19 11:37,,6D3999FF-2614-43C3-BB7E-82729622914B,,,...,,,,12_12_2019,,,,HCA6058970,,V1
2,2,,1.19.2160,Assessment 1,integer,1/29/19 11:37,,6D3999FF-2614-43C3-BB7E-82729622914B,,,...,,,,12_12_2019,,,,HCA6058970,,V1
3,3,,1.19.2160,Assessment 1,integer,1/29/19 11:38,,6D3999FF-2614-43C3-BB7E-82729622914B,,,...,,,,12_12_2019,,,,HCA6058970,,V1
4,4,,1.19.2160,Assessment 1,informational,1/29/19 11:38,,6D3999FF-2614-43C3-BB7E-82729622914B,,,...,,,,12_12_2019,,,,HCA6058970,,V1


In [6]:
rawdata.columns #don't use gender from rawdata ... use gender from NDA specialty fields - ROSETTA STONE ALERT

Index(['Unnamed: 0', 'Age', 'App Version', 'Assessment Name', 'DataType',
       'DateCreated', 'DateCreatedDatetime', 'DeviceID', 'Education',
       'Ethnicity', 'FathersEducation', 'Firmware Version', 'FirstDate4PIN',
       'Gender', 'GuardiansEducation', 'Handedness', 'Inst', 'InstEnded',
       'InstEndedDatetime', 'InstOrdr', 'InstSctn', 'InstStarted',
       'InstStartedDatetime', 'ItemID', 'ItmOrdr', 'Locale',
       'MothersEducation', 'Name', 'PIN', 'Position', 'Race', 'Response',
       'ResponseTime', 'SE', 'Score', 'StartingLevelOverride', 'TScore',
       'Theta', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'datediff', 'file_id',
       'filename', 'flagged', 'gender', 'iPad Version', 'index', 'level_0',
       'parent', 'raw_cat_date', 'site', 'source', 'study', 'subject',
       'v1_interview_date', 'visit'],
      dtype='object')

In [7]:
#prep the fields that NDA requires in all of their structures - we did this in another program, since output is required elsewhere
#here, just subsetting ROSETTA STONE to particular study, renaming a few vars, and changing the date format
subjectlist='/home/petra/UbWinSharedSpace1/redcap2nda_Lifespan2019/Dev_pedigrees/UnrelatedHCAHCD_w_STG_Image_and_pseudo_GUID09_27_2019.csv'
subjects=pd.read_csv(subjectlist)[['subjectped','nda_gender', 'nda_guid', 'nda_interview_age', 'nda_interview_date']]
ndar=subjects.loc[subjects.subjectped.str.contains('HCA')].rename(
    columns={'nda_guid':'subjectkey','subjectped':'src_subject_id','nda_interview_age':'interview_age',
             'nda_interview_date':'interview_date','nda_gender':'gender'}).copy()
ndar['interview_date'] = pd.to_datetime(ndar['interview_date']).dt.strftime('%m/%d/%Y')
ndarlist=['subjectkey','src_subject_id','interview_age','interview_date','gender']


In [8]:
#this is the list of variables in the scored and raw data files that you might need...
#creating list in case your scored data is merged with other files for other reasons (ours was)
scorlist=['Age-Corrected Standard Score', 'Age-Corrected Standard Scores Dominant',
 'Age-Corrected Standard Scores Non-Dominant', 'AgeCorrCrystal', 'AgeCorrDCCS', 'AgeCorrEarly',
 'AgeCorrEngRead', 'AgeCorrEngVocab', 'AgeCorrFlanker', 'AgeCorrFluid', 'AgeCorrListSort',
 'AgeCorrPSM', 'AgeCorrPatternComp', 'AgeCorrTotal', 'Assessment Name', 'Computed Score',
 'ComputedDCCS', 'ComputedEngRead', 'ComputedEngVocab', 'ComputedFlanker', 'ComputedPSM',
 'ComputedPatternComp', 'DCCSaccuracy', 'DCCSreactiontime',  'Dominant Score', 'FlankerAccuracy',
 'FlankerReactionTime', 'FullTCrystal', 'FullTDCCS', 'FullTEarly', 'FullTEngRead', 'FullTEngVocab',
 'FullTFlanker', 'FullTFluid', 'FullTListSort', 'FullTPSM', 'FullTPatternComp', 'FullTTotal',
 'Fully-Corrected T-score', 'Fully-Corrected T-scores Dominant', 'Fully-Corrected T-scores Non-Dominant',
 'FullyCorrectedTscore', 'Group', 'Inst', 'InstrumentBreakoff', 'InstrumentRCReason', 'InstrumentRCReasonOther',
 'InstrumentStatus2', 'ItmCnt', 'Language', 'Male', 'National Percentile (age adjusted)',
 'National Percentile (age adjusted) Dominant', 'National Percentile (age adjusted) Non-Dominant',
 'Non-Dominant Score', 'PIN', 'Raw Score Left Ear', 'Raw Score Right Ear', 'RawDCCS',
 'RawFlanker', 'RawListSort', 'RawPSM', 'RawPatternComp', 'RawScore', 'SE', 'Static Visual Acuity Snellen',
 'Static Visual Acuity logMAR', 'TScore', 'Theta', 'ThetaEngRead', 'ThetaEngVocab', 'ThetaPSM', 'Threshold Left Ear',
 'Threshold Right Ear', 'UncorrCrystal', 'UncorrDCCS', 'UncorrEarly', 'UncorrEngRead', 'UncorrEngVocab',
 'UncorrFlanker', 'UncorrFluid', 'UncorrListSort', 'UncorrPSM', 'UncorrPatternComp', 'UncorrTotal',
 'Uncorrected Standard Score', 'Uncorrected Standard Scores Dominant', 'Uncorrected Standard Scores Non-Dominant',
 'UncorrectedStandardScore']
rawlist=['App Version', 'Assessment Name', 'DataType','DateCreated', 'DeviceID',  'Firmware Version',  
 'Inst', 'InstEnded','InstEndedDatetime', 'InstOrdr', 'InstSctn', 'InstStarted','InstStartedDatetime',
 'ItemID', 'ItmOrdr', 'Locale','PIN', 'Position', 'Response', 'ResponseTime', 'SE', 'Score', 'TScore',
 'Theta','iPad Version']

In [34]:
#merge the score and raw data with the required fields for the NDA
#Note that subject and visit are HCP specific variables that we use to subset the records being sent to the NDA
#create dummy vars if you don't have them...
#scordata['subject']=scordata.PIN #or some other variable in scordata that can be used to merge with ndarfields data
#scordata['visit']='V1' #we keep this around because eventually we'll be releasing V2, V3, and FU data
#rawdata['subject']=rawdata.PIN
#rawdata['visit']='V1'

scordata=pd.merge(scordata[scorlist+['subject','visit']],ndar,how='inner',left_on='subject', right_on='src_subject_id')
rawdata=pd.merge(rawdata[rawlist+['subject','visit']],ndar,how='inner',left_on='subject', right_on='src_subject_id')
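
Because the merge above uses how='inner', any record whose subject has no match in the NDA-required fields is dropped silently. One possible way to see which subjects would be lost is a left merge with pandas' `indicator=True` option, sketched here on small made-up frames (the IDs and GUIDs are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the scores data and the NDA-required fields
scores = pd.DataFrame({'subject': ['HCA001', 'HCA002', 'HCA003'],
                       'visit': ['V1', 'V1', 'V1']})
required = pd.DataFrame({'src_subject_id': ['HCA001', 'HCA003'],
                         'subjectkey': ['GUID_A', 'GUID_C']})

# A left merge keeps every scores row; '_merge' records where each row matched
check = pd.merge(scores, required, how='left', left_on='subject',
                 right_on='src_subject_id', indicator=True)
dropped = check.loc[check['_merge'] == 'left_only', 'subject'].tolist()
print('Subjects an inner merge would drop:', dropped)
```

An empty list here would mean the inner merge loses nothing.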


In [39]:
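#scordata.loc[scordata.Inst.str.contains('Fluid', na=False)].Inst

#Inst contains missing values (NaN), so str.contains returns NaN for those rows. Without na=False
#the boolean indexing fails with: ValueError: cannot index with vector containing NA / NaN values
rawdata.loc[rawdata.Inst.str.contains('Fluid', na=False)]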

In [28]:
#specify your crosswalk and take a peek - use the latest crosswalk from https://github.com/humanconnectome/NIHToolbox2NDA/
#e.g. NIH_Toolbox_crosswalk_HCP.csv
crosswalkfile="/home/petra/UbWinSharedSpace1/ccf-nda-behavioral/PycharmToolbox/Ipad2NDA_withCrosswalk/NIHToolbox2NDA/NIH_Toolbox_crosswalk_HCP.csv"
crosswalk=pd.read_csv(crosswalkfile,header=0,low_memory=False, encoding = "ISO-8859-1")
crosswalk.head()

Unnamed: 0,Inst,template,inst_short,Source,nda_structure,nda_element,hcp_variable,action_requested,hcp_variable_upload,specialty_code,requested_python
0,Anxiety Summary Parent Report (3-7),Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD,tlbx_fearanx01,version_form,Assessment_Name,,Assessment_Name,,
1,Anxiety Summary Parent Report (3-7),Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD,tlbx_fearanx01,nih_tlbx_fctsc,Fully_Corrected_T_score,,Fully_Corrected_T_score,,
2,Anxiety Summary Parent Report (3-7),Anxiety_Summary_3-7.tlbx_fearanx01_template,Anxiety_Summary_3-7,HCPD,tlbx_fearanx01,primary_language,Language,Please rename 'Language' to 'primary_language'...,primary_language,,studydata['primary_language']=studydata['Langu...
3,Cognition Composite Scores,cogcomp01_template,cogcomp01,HCPD HCPA,cogcomp01,version_form,Assessment_Name,,Assessment_Name,1.0,
4,Cognition Composite Scores,cogcomp01_template,cogcomp01,HCPD HCPA,cogcomp01,interview_language,Language,,Language,1.0,


Do a little QC and data exploration

In [42]:
#check that your instruments are in both raw data and scores files. 
#For HCP, all but the NIH Toolbox Pain Intensity FF Age 18+ v2.0 Instrument are practices
#So only the Pain Intensity instrument needed special coding attention (to be dealt with later)
#check your data and adjust if needed 
print('Instruments in Raw data but not Scores:')
for i in rawdata.Inst.unique():
    if i not in scordata.Inst.unique():
        print(i)
print('Instruments in Scored data but not Raw:')
for i in scordata.Inst.unique():
    if i not in rawdata.Inst.unique():
        print(i)


Instruments in Raw data but not Scores:
NIH Toolbox Pattern Comparison Processing Speed Test Age 7+ Practice v2.1
NIH Toolbox Pain Intensity FF Age 18+ v2.0
nan
Instruments in Scored data but not Raw:
Cognition Fluid Composite v1.1
Cognition Crystallized Composite v1.1
Cognition Total Composite Score v1.1
Cognition Early Childhood Composite v1.1
NIH Toolbox Visual Acuity Practice Age 8+ v2.0
Negative Affect Summary (18+)
Social Satisfaction Summary (18+)
Psychological Well Being Summary (18+)
NIH Toolbox Emotion Instructions (Adult/Child) v1.0


In [12]:
#check that lengths are the same...indicating one to one PIN match between scores and raw
#(equal counts are necessary but not sufficient; the PIN sets themselves should also match)
print(len(rawdata.PIN.unique()))
print(len(scordata.PIN.unique()))
#check that shape is same before and after removing duplicates (there should not be any)
print(rawdata.shape)
print(scordata.shape)
testraw=rawdata.drop_duplicates(subset=['PIN','Inst','ItemID','Position'],keep='first')
testscore=scordata.drop_duplicates(subset=['PIN','Inst'])
print(testraw.shape)
print(testscore.shape)


669
669
(235573, 32)
(25264, 96)
(235573, 32)
(25264, 96)


In [13]:
#define the function that will turn a dataframe into a csv structure 
#- use the definition to send the pain data (which doesn't have entries in the scored data)
def data2struct(patho,dout,crosssub,study='HCPD'):
    """
    Convert dout, a prepared pandas dataframe, into a csv structure that NDA can import
    
    parameters: 
    patho - full path to the folder where you want to store structures (there will be many)
    dout - data frame that contains all the variables to be exported
    crosssub - a dataframe which is the subset of the crosswalk for the instrument to be exported as structure
               (its index must have been reset, because the first row is looked up by label 0)
    study - a string to put in the name of the csv file along with the structure name and the short name of the instrument
    
    note that snapshotdate is defined external to this function near the import statements...
    
    """
    #the last two characters of nda_structure are the version number; the rest is the structure root
    strucroot=crosssub['nda_structure'].str.strip().str[:-2][0]
    strucnum=crosssub['nda_structure'].str.strip().str[-2:][0]
    instshort=crosssub['inst_short'].str.strip()[0]
    #write to patho (the argument), not the global pathout
    filePath=os.path.join(patho,study+'_'+instshort+'_'+strucroot+strucnum+'_'+snapshotdate+'.csv')
    #'w' mode overwrites any existing file of the same name
    with open(filePath,'w') as f:
        f.write(strucroot+","+str(int(strucnum))+"\n")
        dout.to_csv(f,index=False)
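
The file that data2struct produces has two parts: a first line holding the structure root and version, followed by an ordinary csv. A standard-library sketch of that layout is shown below. The helper names `split_structure` and `write_structure` and the placeholder values are hypothetical, and the sketch assumes, as data2struct does, that the last two characters of the structure name are the version.

```python
import csv
import io


def split_structure(name):
    """Split a structure name such as 'tlbx_motor01' into ('tlbx_motor', 1)."""
    name = name.strip()
    return name[:-2], int(name[-2:])


def write_structure(handle, structure, fieldnames, rows):
    """Write a 'root,version' line, then a regular csv with a header row."""
    root, version = split_structure(structure)
    handle.write(f"{root},{version}\n")
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


# Placeholder values for illustration only
buffer = io.StringIO()
write_structure(buffer, 'tlbx_motor01',
                ['subjectkey', 'interview_age'],
                [{'subjectkey': 'EXAMPLE_GUID', 'interview_age': 540}])
lines = buffer.getvalue().splitlines()
print(lines)
```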


In [14]:
#define the function that can be used for the instruments that follow a more generalizable pattern
#This function will alert you to any instruments that were successfully transformed but that might
#warrant a closer look.
def sendthroughcrosswalk(pathout,instreshapedfull,inst_i,crosswalk,studystr='HCPD'):
    # replace special characters in column names (regex=False so '(' and ')' are treated literally)
    instreshapedfull.columns = (instreshapedfull.columns.str.replace(' ', '_', regex=False)
                                .str.replace('-', '_', regex=False)
                                .str.replace('(', '_', regex=False)
                                .str.replace(')', '_', regex=False))
    # reset the index on the subset itself so that data2struct can look up row 0
    crosswalk_subset = crosswalk.loc[crosswalk['Inst'] == inst_i].reset_index()
    # crosswalk_subset.loc[crosswalk_subset['hcp_variable_upload'].isnull()==False,'hcp_variable']
    cwlistbef = list(crosswalk_subset['hcp_variable'])
    before = len(cwlistbef)
    cwlist = list(set(cwlistbef) & set(
        instreshapedfull.columns))  # drop the handful of vars in larger instruments that got mapped but that we don't have
    after = len(cwlist)
    if before != after:
        print("WARNING!!! " + inst_i + ": Crosswalk expects " + str(before) + " elements, but only found " + str(after))
        notfound=list(np.setdiff1d(cwlistbef,cwlist))
        print("Not Found:"+ str(notfound))
    studydata = instreshapedfull[ndarlist + cwlist].copy()
    # execute any python one-liners stored in the crosswalk; they are expected to modify studydata in place
    # (e.g. studydata['new']=studydata['old']). exec runs arbitrary code, so only use a crosswalk you trust.
    for index, row in crosswalk_subset.iterrows():
        if pd.notna(row['requested_python']):
            exec(row['requested_python'])
    uploadlist = list(crosswalk_subset['hcp_variable_upload'])
    uploadlist = list(set(uploadlist) & set(studydata.columns))
    data2struct(patho=pathout, dout=studydata[ndarlist + uploadlist], crosssub=crosswalk_subset, study=studystr)
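
The column-name cleanup at the top of sendthroughcrosswalk is what lets an export column such as 'Fully-Corrected T-score' line up with the crosswalk's hcp_variable 'Fully_Corrected_T_score'. A minimal standard-library sketch of the same literal substitution follows; the helper name `sanitize` is hypothetical.

```python
# Each listed character is replaced literally by an underscore
_TABLE = str.maketrans({' ': '_', '-': '_', '(': '_', ')': '_'})


def sanitize(name):
    """Replace spaces, hyphens, and parentheses in a column name with '_'."""
    return name.translate(_TABLE)


print(sanitize('Fully-Corrected T-score'))
print(sanitize('National Percentile (age adjusted)'))
```

Note that adjacent special characters each become their own underscore, so a crosswalk entry has to use the doubled and trailing underscores exactly as they are produced.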



Do special cases first

In [24]:
#Within the rawdata structure (for HCP), all but the NIH Toolbox Pain Intensity FF Age 18+ v2.0 Instrument are practices
#So only the Pain Intensity instrument needed special coding attention
#check your data and adjust if needed - note that subject and visit are variables we created locally 
#to merge with the data coming from a different local source (REDCap)
#create the NDA structure for this special case
inst_i='NIH Toolbox Pain Intensity FF Age 18+ v2.0'
#most of the rows contain duplicated information...only need to know the PIN once, for example, not once for each item response
# so values in the response column need to be pivoted and then merged with the rest of the data, 
paindata=rawdata.loc[rawdata.Inst==inst_i][['PIN','subject','Inst','visit','ItemID','Position',
        'subjectkey','src_subject_id','interview_age','interview_date','gender',
        'Response','ResponseTime', 'SE', 'Score', 'TScore','Theta']]
#regex=False so that '(' and ')' are treated as literal characters
paindata.ItemID = paindata.ItemID.str.lower().str.replace('-','_',regex=False).str.replace('(','_',regex=False).str.replace(')','_',regex=False)
inst = paindata.pivot(index='PIN', columns='ItemID', values='Response').reset_index()
meta = paindata.drop_duplicates(subset=['PIN', 'visit'])
painreshaped = pd.merge(meta, inst, on='PIN', how='inner').drop(columns=['subject','visit','PIN'])
#reset the index on the subset itself so that data2struct can look up row 0
crosswalk_subset=crosswalk.loc[crosswalk['Inst']==inst_i].reset_index()
cwlist=list(crosswalk_subset['hcp_variable_upload'])

#several dummy vars for required vars
painreshaped['nih_tlbx_agegencsc']=999
painreshaped['nih_tlbx_rawscore']=999
painreshaped['nih_tlbx_tscore']=999
painreshaped['nih_tlbx_se']=999
painreshaped['nih_tlbx_theta']=999

reshapedslim=painreshaped[ndarlist+cwlist]
data2struct(patho=pathout,dout=reshapedslim,crosssub=crosswalk_subset,study='HCPA')
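
The reshaping used in this cell (pivot the item responses to one row per PIN, then merge back the per-PIN metadata) may be easier to follow on a toy example. The data below are made up, and the column names only mirror the ones used above.

```python
import pandas as pd

# Long format: one row per PIN and item, with metadata repeated on each row
long_df = pd.DataFrame({
    'PIN': ['S1_V1', 'S1_V1', 'S2_V1', 'S2_V1'],
    'ItemID': ['item_a', 'item_b', 'item_a', 'item_b'],
    'Response': [3, 1, 0, 2],
    'interview_age': [540, 540, 612, 612],
})

# Wide format: one row per PIN, one column per item
wide = long_df.pivot(index='PIN', columns='ItemID', values='Response').reset_index()

# Keep the repeated metadata once per PIN and attach it to the wide table
meta = long_df.drop_duplicates(subset=['PIN'])[['PIN', 'interview_age']]
reshaped = pd.merge(meta, wide, on='PIN', how='inner')
print(reshaped)
```

`pivot` raises an error if a PIN has the same ItemID more than once, which is one way the general loop later in this notebook can fail for an instrument.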

In [29]:
# Another special case is for Cognition Composite scores - going to cogcomp01 structure at the NDA- 
# This was mapped before Leo agreed to accept data by NIH Toolbox Instrument name (pivot by Inst)
# keeping this special case coding in for posterity and to shed light on one type of merge he must do on his end, 
# when it comes to NIH toolbox data
cogcompdata=scordata.loc[scordata.Inst.str.contains('Cognition')==True][['PIN','Language',
    'Assessment Name','Inst',  'Uncorrected Standard Score', 'Age-Corrected Standard Score',
    'National Percentile (age adjusted)', 'Fully-Corrected T-score']+ndarlist]

#initialize prefix
cogcompdata['varprefix']='test'
cogcompdata.loc[cogcompdata.Inst=='Cognition Crystallized Composite v1.1','varprefix']='nih_crystalcogcomp_'
cogcompdata.loc[cogcompdata.Inst=='Cognition Early Childhood Composite v1.1','varprefix']='nih_eccogcomp_'
cogcompdata.loc[cogcompdata.Inst=='Cognition Fluid Composite v1.1','varprefix']='nih_fluidcogcomp_'
cogcompdata.loc[cogcompdata.Inst=='Cognition Total Composite Score v1.1','varprefix']='nih_totalcogcomp_'
#pivot the vars of interest by varprefix and rename
uncorr=cogcompdata.pivot(index='PIN',columns='varprefix',values='Uncorrected Standard Score')
for col in uncorr.columns.values:
    uncorr=uncorr.rename(columns={col:col+"unadjusted"})
ageadj=cogcompdata.pivot(index='PIN',columns='varprefix',values='Age-Corrected Standard Score')
for col in ageadj.columns.values:
    ageadj=ageadj.rename(columns={col:col+"ageadj"})
npage=cogcompdata.pivot(index='PIN',columns='varprefix',values='National Percentile (age adjusted)')
for col in npage.columns.values:
    npage=npage.rename(columns={col:col+"np_ageadj"})
#put them together
cogcompreshape=pd.concat([uncorr,ageadj,npage],axis=1)
meta=cogcompdata[['PIN','Language','Assessment Name']+ndarlist].drop_duplicates(subset=['PIN'])
#these variables were intended to capture the version...they got mapped to raw scores, though, which isn't right.
#ultimately decided to leave these out for now, but know that they exist in the iPad data but not in the
#NDA, which wasn't thinking of different versions of subcomponents when they built this structure
meta['nih_crystalcogcomp']='Cognition Crystallized Composite v1.1'
meta['nih_eccogcomp']='Cognition Early Childhood Composite v1.1'
meta['nih_fluidcogcomp']='Cognition Fluid Composite v1.1'
meta['nih_totalcogcomp']='Cognition Total Composite Score v1.1'
cogcompreshape=pd.merge(meta,cogcompreshape,on='PIN',how='inner')

inst_i='Cognition Composite Scores'
sendthroughcrosswalk(pathout,cogcompreshape,inst_i,crosswalk,studystr='HCPA')


In [27]:
#cogcompreshape.head()
#cogcompdata.columns
cogcompreshape.columns
#cogcompreshape.nih_crystalcogcomp

Index(['PIN', 'Language', 'Assessment_Name', 'subjectkey', 'src_subject_id',
       'interview_age', 'interview_date', 'gender', 'nih_crystalcogcomp',
       'nih_eccogcomp', 'nih_fluidcogcomp', 'nih_totalcogcomp',
       'nih_crystalcogcomp_unadjusted', 'nih_eccogcomp_unadjusted',
       'nih_fluidcogcomp_unadjusted', 'nih_totalcogcomp_unadjusted',
       'nih_crystalcogcomp_ageadj', 'nih_eccogcomp_ageadj',
       'nih_fluidcogcomp_ageadj', 'nih_totalcogcomp_ageadj',
       'nih_crystalcogcomp_np_ageadj', 'nih_eccogcomp_np_ageadj',
       'nih_fluidcogcomp_np_ageadj', 'nih_totalcogcomp_np_ageadj'],
      dtype='object')

In [41]:
#Last special Case is for Visual Acuity, which needs double pivot because of repeat items at different positions
#This special case not yet mapped by NDA - so don't run, but will look something like this
#special case for instruments with "Visual Acuity" in their titles, which have dup inst/itemid at diff positions
#for i in scordata.Inst.unique():
#    if i in rawdata.Inst.unique():
#        inst_i=i
#        if "Visual Acuity" in inst_i:
#            print('Processing ' + inst_i + '...')
#            items=rawdata.loc[rawdata.Inst.str.contains('Visual Acuity', na=False)][['PIN','subject','Inst',
#               'gender','visit','ItemID','Position','Response']]
#            items.ItemID = items.ItemID.str.lower()
#            items['dup_number']=items.groupby(['PIN','ItemID']).cumcount()+1
#            items['ItemID_Dup']=items.ItemID.str.replace('|', '_', regex=False) + '_P'+items.dup_number.astype(str)
#            inst=items.pivot(index='PIN',columns='ItemID_Dup',values='Response')
#            meta = items.drop_duplicates(subset=['PIN', 'visit'])[['Inst', 'PIN',
#                                                           'subject', 'visit']]
#            instreshaped = pd.merge(meta, inst, on='PIN', how='inner')
#            items2 = scordata.loc[scordata.Inst == inst_i]
#            instreshapedfull = pd.merge(instreshaped, items2, on='PIN', how='inner')
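
The key step in the commented-out code above is numbering repeated items within each PIN so that the pivot has unique column names. A small sketch of that idea on made-up data is given below; it assumes the rows are already in presentation order (otherwise sort by Position first).

```python
import pandas as pd

# Made-up responses in which the same ItemID repeats within a PIN
items = pd.DataFrame({
    'PIN': ['S1_V1'] * 3 + ['S2_V1'] * 2,
    'ItemID': ['va_1', 'va_1', 'va_2', 'va_1', 'va_1'],
    'Response': [1, 0, 1, 1, 1],
})

# Number each repeat of an item within a PIN: 1, 2, ...
items['dup_number'] = items.groupby(['PIN', 'ItemID']).cumcount() + 1
items['ItemID_Dup'] = items['ItemID'] + '_P' + items['dup_number'].astype(str)

# The suffixed names are unique per PIN, so the pivot no longer fails
wide = items.pivot(index='PIN', columns='ItemID_Dup', values='Response')
print(wide)
```

PINs that never saw a given repeat end up with a missing value in that column.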

In [30]:
#for non-special instruments in both scores and raw data types
for i in scordata.Inst.unique():
    if i in rawdata.Inst.unique():
        inst_i=i
        if "Visual Acuity" in inst_i:
            pass  #special case--see the Visual Acuity cell above
        elif "Practice" in inst_i:
            print("Note:  Omitting practice instrument, "+inst_i)
        else:
            try:  #this will fail if there are duplicates or if no-one has the data of interest (e.g. idlist too small), or if only V2 instrument
                #print('Processing '+inst_i+'...')
                items=rawdata.loc[rawdata.Inst==inst_i][['PIN','subject','Inst','visit','ItemID','Position',
                   'subjectkey','src_subject_id','interview_age','interview_date','gender',
                   'Response','ResponseTime']]# not these..., 'SE', 'Score', 'TScore','Theta']]
                #regex=False so that '(' and ')' are treated as literal characters
                items.ItemID = (items.ItemID.str.lower().str.replace('-','_',regex=False).str.replace('(','_',regex=False)
                                .str.replace(')','_',regex=False).str.replace(' ','_',regex=False))
                inst=items.pivot(index='PIN',columns='ItemID',values='Response').reset_index()
                meta=items.drop_duplicates(subset=['PIN','visit'])
                instreshaped = pd.merge(meta, inst, on='PIN', how='inner').drop(columns=['subject', 'visit','Inst'])
                items2=scordata.loc[scordata.Inst==inst_i][scorlist]
                instreshapedfull=pd.merge(instreshaped,items2,on='PIN',how='inner')
                sendthroughcrosswalk(pathout,instreshapedfull, inst_i, crosswalk,studystr='HCPA')
            except Exception as err:  #report the reason rather than hiding it
                print("Couldn't process "+inst_i+'... ('+str(err)+')')



Not Found:['lavoc048', 'lavoc068', 'lavoc081', 'lavoc084', 'lavoc091', 'lavoc096', 'lavoc100', 'lavoc103', 'lavoc112', 'lavoc118', 'lavoc119', 'lavoc120', 'lavoc126', 'lavoc127', 'lavoc128', 'lavoc132', 'lavoc134', 'lavoc136', 'lavoc137', 'lavoc143', 'lavoc145', 'lavoc148', 'lavoc152', 'lavoc155', 'lavoc157', 'lavoc158', 'lavoc160', 'lavoc161', 'lavoc165', 'lavoc169', 'lavoc176', 'lavoc177', 'lavoc181', 'lavoc182', 'lavoc183', 'lavoc184', 'lavoc187', 'lavoc188', 'lavoc189', 'lavoc190', 'lavoc191', 'lavoc193', 'lavoc194', 'lavoc197', 'lavoc199', 'lavoc203', 'lavoc205', 'lavoc206', 'lavoc214', 'lavoc215', 'lavoc216', 'lavoc224', 'lavoc225', 'lavoc229', 'lavoc231', 'lavoc233', 'lavoc236', 'lavoc237', 'lavoc240', 'lavoc242', 'lavoc259', 'lavoc303', 'lavoc306', 'lavoc315', 'lavoc317', 'lavoc329', 'lavoc337', 'lavoc346', 'lavoc364', 'lavoc422', 'lavoc452', 'lavoc473']
Not Found:[]
Not Found:['lare019', 'lare020', 'lare021', 'lare022', 'lare023', 'lare026', 'lare027', 'lare028', 'lare030', 'l

Now validate all these files by calling the OS from within this notebook (assuming you are using Linux) to run the NDA validator on your command line. Alternatively, you could just close this notebook now and run the following for loop in your terminal, replacing pathout with the path to your prepped structures:

for var in pathout/*.csv; do vtcmd "$var"; done
(e.g. for var in prepped_structures/*.csv; do vtcmd "$var"; done)

Either option requires that you have downloaded and installed the https://github.com/NDAR/nda-tools Python package per its instructions. I installed vtcmd in my home directory, which set a couple of defaults in place, such as the location of validation results. To have the output of the validation sent to a more meaningful location than the default, open the settings.cfg file (mine is /home/petra/.NDATools/settings.cfg; use wherever it resides on your system), and
change the line under [Files] that says 'validation_results = NDAValidationResults' to a better place (perhaps next to 'pathout'). For example, mine now says
validation_results = /home/petra/UbWinSharedSpace1/ccf-nda-behavioral/PycharmToolbox/Ipad2NDA_withCrosswalk/NIHToolbox2NDA/NDAValidationResults

so that the prepped structures directory and the NDAValidationResults directory are right next to one another.
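
The shell loop above could also be written in Python, roughly as sketched below. The function names `find_structures` and `validate_all` are hypothetical. The sketch passes only the file path to vtcmd, as the loop does, and adds a timeout (an assumption on my part) so that a validator run that stalls does not block the notebook indefinitely.

```python
import glob
import os
import shutil
import subprocess


def find_structures(folder):
    """Return the sorted list of .csv files in folder."""
    return sorted(glob.glob(os.path.join(folder, '*.csv')))


def validate_all(folder, tool='vtcmd', timeout=600):
    """Run the validator on each structure; return {path: combined output}."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH; install nda-tools first")
    logs = {}
    for path in find_structures(folder):
        # stderr is folded into stdout so that each log is a single text
        done = subprocess.run([tool, path], stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, timeout=timeout)
        logs[path] = done.stdout.decode(errors='replace')
    return logs


# Example call, left commented out so the sketch has no side effects:
# logs = validate_all(pathout)
```

If a run exceeds the timeout, `subprocess.run` raises `subprocess.TimeoutExpired`, which can be caught per file if you prefer to continue with the remaining structures.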


In [33]:
#example process for one structure 
#run your command and capture the log (which will probably report a bug) for one file to see how it works
out=subprocess.Popen(['vtcmd',pathout+"HCPA_Grip_Strength_tlbx_motor01_03_11_2020.csv"], stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
stdout,stderr=out.communicate()
print(stdout.decode())

KeyboardInterrupt: 

If you had an error in the validation, your likely course of action is to add some extra Python code to the crosswalk. For example, to collect every 'notInteger' message from the validation results into a single file:

grep notInteger /home/petra/NDAValidationResults/* > Notintegerwarnings
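
If you would rather stay inside the notebook, the same search can be approximated in Python. This is only a sketch of a plain substring search over the files in a folder; the function name `find_lines` is hypothetical, and nothing here assumes a particular layout for the validation result files.

```python
import glob
import os


def find_lines(folder, needle='notInteger'):
    """Return (filename, line) pairs for every line that contains needle."""
    hits = []
    for path in sorted(glob.glob(os.path.join(folder, '*'))):
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, errors='replace') as handle:
            for line in handle:
                if needle in line:
                    hits.append((os.path.basename(path), line.rstrip('\n')))
    return hits


# Example call, left commented out because the folder is specific to my setup:
# for name, line in find_lines('/home/petra/NDAValidationResults'):
#     print(name, line)
```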