# 02/02/2018 

I just noticed that some questionnaires that should be applied to all members of the family, like the CAARS and FES, are only attached to one of the kids. That leaves holes in queries for the kids who don't have the questionnaires assigned. Let's first identify which kids need the forms copied.

We're going to use the results of the Max Form query Labmatrix created for us, augmented by the nuclear family ID information. This tells us which kids need the forms, but every form on record for the kid who has them will need to be copied to the other kid, not just the one at the date shown, since that is only the latest date.
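To make the "latest date only" caveat concrete, here is a toy illustration. The long-format table, its column names (`MRN`, `Form`, `Date`) and the MRNs `'A'` and `'B'` are all made up for the example; this is not how Labmatrix stores or copies forms, just a sketch of why the max-date spreadsheet alone is not enough.

```python
import pandas as pd

# Hypothetical long-format table: one row per (MRN, form, date).
# Sibling 'A' has three CAARS on record; sibling 'B' has none.
forms = pd.DataFrame({
    'MRN': ['A', 'A', 'A'],
    'Nuclear ID': [1, 1, 1],
    'Form': ['CAARS Self Report'] * 3,
    'Date': pd.to_datetime(['2012-03-01', '2014-06-15', '2017-11-20']),
})

# A max-date query keeps only the latest date per person and form...
max_dates = forms.groupby(['MRN', 'Form'])['Date'].max()

# ...but sibling 'B' needs every one of A's records, not just that one.
to_copy = forms[forms['MRN'] == 'A'].assign(MRN='B')
```

Here `max_dates` holds a single date for `'A'`, while `to_copy` has three rows, which is the set that would actually need to be replicated.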

In [33]:
import pandas as pd
import numpy as np

cnt = 0
df = pd.read_excel('/Users/sudregp/data_plugging/max_dates_02012018.xlsx')
for fam in np.unique(df['Nuclear ID']):
    fam_df = df[df['Nuclear ID'] == fam]
    num_nans = np.sum(pd.isnull(fam_df['CAARS Self Report']))
    if num_nans > 0 and num_nans< fam_df.shape[0]:
        print 'Family %d has less CAARS than total number of members ' % fam + \
              '(%d / %d)' % (num_nans, fam_df.shape[0])
        cnt += 1
print 'Total: %d families' % cnt

Family 8 has less CAARS than total number of members (3 / 4)
Family 36 has less CAARS than total number of members (4 / 5)
Family 53 has less CAARS than total number of members (1 / 2)
Family 288 has less CAARS than total number of members (3 / 4)
Family 374 has less CAARS than total number of members (1 / 2)
Family 385 has less CAARS than total number of members (1 / 7)
Family 471 has less CAARS than total number of members (2 / 5)
Family 515 has less CAARS than total number of members (1 / 2)
Family 528 has less CAARS than total number of members (1 / 2)
Family 698 has less CAARS than total number of members (1 / 7)
Family 727 has less CAARS than total number of members (1 / 2)
Family 735 has less CAARS than total number of members (1 / 2)
Family 855 has less CAARS than total number of members (1 / 4)
Family 871 has less CAARS than total number of members (1 / 6)
Family 1334 has less CAARS than total number of members (2 / 3)
Family 1673 has less CAARS than total number of members (2

OK, now we need to check which ones involve kids:

In [65]:
fam_cnt = 0
kid_cnt = 0
df = pd.read_excel('/Users/sudregp/data_plugging/max_dates_02012018.xlsx')
for fam in np.unique(df['Nuclear ID']):
    fam_df = df[df['Nuclear ID'] == fam]
    num_nans = np.sum(pd.isnull(fam_df['CAARS Self Report']))
    num_caars = np.sum(~pd.isnull(fam_df['CAARS Self Report']))
    num_fam = fam_df.shape[0]
    # fewer questionnaire dates than family members?
    if num_nans > 0 and num_nans < num_fam:
        # check everybody's age at the time of the first tests
        q_date = pd.Timestamp(fam_df['CAARS Self Report'].dropna().min())
        fam_ages = [q_date - pd.Timestamp(d) for d in fam_df['Date of Birth']]
        num_kids = 0
        for a in fam_ages:
            if a.days/365 < 18:
                num_kids += 1
        if num_kids > 1 and num_kids > num_caars:
            print 'Family %d has %d kids but only %d CAARS.' % (fam,
                                                                num_kids,
                                                                num_caars)
            fam_cnt += 1
            kid_cnt += (num_kids - num_caars)
print 'Total: %d families, %d kids' % (fam_cnt, kid_cnt)

Family 288 has 2 kids but only 1 CAARS.
Family 1334 has 3 kids but only 1 CAARS.
Family 1673 has 3 kids but only 1 CAARS.
Family 1683 has 2 kids but only 1 CAARS.
Family 1697 has 3 kids but only 2 CAARS.
Family 1731 has 2 kids but only 1 CAARS.
Family 1743 has 2 kids but only 1 CAARS.
Family 1747 has 2 kids but only 1 CAARS.
Family 1753 has 2 kids but only 1 CAARS.
Family 1756 has 2 kids but only 1 CAARS.
Family 1774 has 3 kids but only 2 CAARS.
Family 1855 has 2 kids but only 1 CAARS.
Family 1894 has 3 kids but only 1 CAARS.
Family 1976 has 3 kids but only 1 CAARS.
Family 1988 has 2 kids but only 1 CAARS.
Family 10009 has 4 kids but only 1 CAARS.
Family 10018 has 3 kids but only 1 CAARS.
Family 10047 has 3 kids but only 1 CAARS.
Family 10049 has 2 kids but only 1 CAARS.
Family 10077 has 2 kids but only 1 CAARS.
Family 10084 has 2 kids but only 1 CAARS.
Family 10090 has 2 kids but only 1 CAARS.
Family 10109 has 2 kids but only 1 CAARS.
Family 10131 has 4 kids but only 1 CAARS.
Family 1
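A side note on the age check: `a.days/365 < 18` ignores leap days, so someone within a few days of turning 18 gets counted as an adult. I have not checked whether this changes any of the counts above. A calendar-based alternative could look like the sketch below; `age_in_years` is a hypothetical helper, not something used in the cells here, and it is written to run under Python 3 as well.

```python
import pandas as pd


def age_in_years(dob, on_date):
    """Completed years between dob and on_date, using calendar birthdays."""
    dob, on_date = pd.Timestamp(dob), pd.Timestamp(on_date)
    # Subtract one year if the birthday has not yet happened in on_date's year.
    had_birthday = (on_date.month, on_date.day) >= (dob.month, dob.day)
    return on_date.year - dob.year - (0 if had_birthday else 1)


# One day before the 18th birthday: still 17 by the calendar,
# but 6574 days, which the days/365 rule treats as 18.
dob, q_date = '2000-01-02', '2018-01-01'
days = (pd.Timestamp(q_date) - pd.Timestamp(dob)).days
calendar_age = age_in_years(dob, q_date)
```

The difference only matters for kids right at the 18-year boundary, so the simpler rule is probably fine for a first pass.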

OK, this could give us some more data. How does it look for FES?

In [66]:
fam_cnt = 0
kid_cnt = 0
df = pd.read_excel('/Users/sudregp/data_plugging/max_dates_02012018.xlsx')
for fam in np.unique(df['Nuclear ID']):
    fam_df = df[df['Nuclear ID'] == fam]
    num_nans = np.sum(pd.isnull(fam_df['Family Enviro Scale']))
    num_fes = np.sum(~pd.isnull(fam_df['Family Enviro Scale']))
    num_fam = fam_df.shape[0]
    # fewer questionnaire dates than family members?
    if num_nans > 0 and num_nans < num_fam:
        # check everybody's age at the time of the first tests
        q_date = pd.Timestamp(fam_df['Family Enviro Scale'].dropna().min())
        fam_ages = [q_date - pd.Timestamp(d) for d in fam_df['Date of Birth']]
        num_kids = 0
        for a in fam_ages:
            if a.days/365 < 18:
                num_kids += 1
        if num_kids > 1 and num_kids > num_fes:
            print 'Family %d has %d kids but only %d FES.' % (
                fam, num_kids, num_fes)
            fam_cnt += 1
            kid_cnt += (num_kids - num_fes)
print 'Total: %d families, %d kids' % (fam_cnt, kid_cnt)

Family 1010 has 2 kids but only 1 FES.
Family 1334 has 3 kids but only 1 FES.
Family 1673 has 3 kids but only 2 FES.
Family 1683 has 2 kids but only 1 FES.
Family 1687 has 2 kids but only 1 FES.
Family 1697 has 2 kids but only 1 FES.
Family 1735 has 2 kids but only 1 FES.
Family 1743 has 2 kids but only 1 FES.
Family 1747 has 2 kids but only 1 FES.
Family 1753 has 2 kids but only 1 FES.
Family 1764 has 2 kids but only 1 FES.
Family 1774 has 3 kids but only 2 FES.
Family 1854 has 2 kids but only 1 FES.
Family 1855 has 2 kids but only 1 FES.
Family 1892 has 2 kids but only 1 FES.
Family 1895 has 2 kids but only 1 FES.
Family 1899 has 2 kids but only 1 FES.
Family 10002 has 2 kids but only 1 FES.
Family 10004 has 2 kids but only 1 FES.
Family 10018 has 3 kids but only 2 FES.
Family 10020 has 2 kids but only 1 FES.
Family 10041 has 2 kids but only 1 FES.
Family 10049 has 2 kids but only 1 FES.
Family 10059 has 2 kids but only 1 FES.
Family 10077 has 2 kids but only 1 FES.
Family 10094 has 
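The CAARS and FES cells differ only in the column name, so the check could be folded into one function. Below is a minimal sketch on toy data; `kids_missing_form` and the toy families are made up for illustration, and the column names are assumed to match the spreadsheet. It mirrors the logic above, including the fact that forms are counted across the whole family, not just the kids.

```python
import pandas as pd


def kids_missing_form(df, q, adult_age=18):
    """Map family ID -> (num_kids, num_forms) for families that have more
    than one minor at the first form date and fewer forms than kids."""
    flagged = {}
    for fam, fam_df in df.groupby('Nuclear ID'):
        num_forms = int(fam_df[q].notnull().sum())
        # Skip families where nobody or everybody has the form.
        if num_forms == 0 or num_forms == len(fam_df):
            continue
        # Age of each member at the family's earliest form date.
        q_date = pd.to_datetime(fam_df[q]).min()
        age_days = (q_date - pd.to_datetime(fam_df['Date of Birth'])).dt.days
        num_kids = int((age_days / 365.0 < adult_age).sum())
        if num_kids > 1 and num_kids > num_forms:
            flagged[fam] = (num_kids, num_forms)
    return flagged


# Toy data: family 1 has two kids and one form, family 2 has a single kid,
# family 3 is complete.
toy = pd.DataFrame({
    'Nuclear ID': [1, 1, 1, 2, 2, 3, 3],
    'Date of Birth': pd.to_datetime(['1975-03-01', '2005-06-15', '2008-09-30',
                                     '1980-01-01', '2010-02-02',
                                     '2004-04-04', '2006-05-05']),
    'CAARS Self Report': pd.to_datetime([None, '2016-01-10', None,
                                         None, '2017-05-05',
                                         '2017-07-07', '2017-07-07']),
})
result = kids_missing_form(toy, 'CAARS Self Report')
```

With the real spreadsheet loaded into `df`, the same call with `'Family Enviro Scale'` as `q` would presumably cover the FES case.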

Sent Wendy a secure e-mail to add nuclear IDs to people that have none in the spreadsheet, and asked Labmatrix how to best proceed to populate those kids.

## Philip 190 data

So, let's see how many of the set of 190 that Philip is using for the family paper would be affected here. Of those, 161 already have data, which I found and sent in caars_for_190_clean.csv. So, can we help salvage some of the remaining 30?

In [82]:
q = 'CAARS Self Report'

with open('/Users/sudregp/tmp/missing.txt') as fid:
    mrns30 = [line.rstrip() for line in fid]

new_kid = []
df = pd.read_excel('/Users/sudregp/data_plugging/max_dates_02012018.xlsx')
for fam in np.unique(df['Nuclear ID']):
    fam_df = df[df['Nuclear ID'] == fam]
    num_nans = np.sum(pd.isnull(fam_df[q]))
    num_caars = np.sum(~pd.isnull(fam_df[q]))
    num_fam = fam_df.shape[0]
    # fewer questionnaire dates than family members?
    if num_nans > 0 and num_nans < num_fam:
        # check everybody's age at the time of the first tests
        q_date = pd.Timestamp(fam_df[q].dropna().min())
        fam_ages = [q_date - pd.Timestamp(d) for d in fam_df['Date of Birth']]
        num_kids = 0
        for a in fam_ages:
            if a.days/365 < 18:
                num_kids += 1
        if num_kids > 1 and num_kids > num_caars:
            for index, pp in fam_df.iterrows():
                if ((q_date - pd.Timestamp(pp['Date of Birth'])).days / 365 < 18 and
                    pd.isnull(pp[q])):
                    new_kid.append(pp['Medical Record - MRN'])

[k for k in mrns30 if k in new_kid]

['7366930', '7237613', '7036619', '7034040', '4977464', '4943715', '4545898']

Of those 7, 4 have unclean CAARS data.