# Clinical Profile Calculations on JHU EDS Sample
### Steph Howson, JHU/APL, Data Scientist

This notebook calculates the fields to be generated for the Clinical Profiles model. Once the values are calculated, the results will be inserted dynamically into the model using the fhir.resources implementation. The Clinical Profiles Python specification was built using fhir-parser. The forked GitHub repositories are linked below. So far, little has been done to add features specific to Clinical Profiles, but the templating captures much of the functionality needed.

https://github.com/stephanie-howson/fhir-parser

https://github.com/stephanie-howson/fhir.resources

The Clinical Profile Python FHIR Class definition can be found at:

https://github.com/stephanie-howson/fhir.resources/blob/master/fhir/resources/clinicalprofile.py

### Imports

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import itertools
import math

### Reading in data from SAFE

In [2]:
df_labs = pd.read_csv(r'S:\NCATS\Clinical_Profiles\clean_data\EDS\jh_eds_labs.txt', sep='|')
df_diagnoses_hpo = pd.read_csv(r'S:\NCATS\Clinical_Profiles\clean_data\EDS\jh_eds_diagnoses_hpo.txt', sep='|')
df_encounter = pd.read_csv(r'S:\NCATS\Clinical_Profiles\clean_data\EDS\jh_eds_encounter.txt', sep='|')
df_meds = pd.read_csv(r'S:\NCATS\Clinical_Profiles\clean_data\EDS\jh_eds_meds.txt', sep='|')

### Calculating Lab Information

In [12]:
%%time
code = df_labs.groupby(['LONG_COMMON_NAME']).Loinc_Code.unique()

count = df_labs.LONG_COMMON_NAME.value_counts()

df_labs['orderYear'] = pd.to_datetime(df_labs.Ordering_datetime).dt.year

frequencyPerYear = df_labs.groupby(['LONG_COMMON_NAME','orderYear','PatientID']).PatientID.count().groupby(['LONG_COMMON_NAME','orderYear']).mean()

correlatedLabsCoefficients = df_labs.groupby('LONG_COMMON_NAME').Result_numeric.apply(lambda x: pd.Series(x.values)).unstack().transpose().corr()

abscorrelation = correlatedLabsCoefficients.abs()

fractionOfSubjects = df_labs.groupby(['LONG_COMMON_NAME']).PatientID.nunique()/df_labs.PatientID.nunique()

units = df_labs.groupby(['LONG_COMMON_NAME']).unit.unique()

minimum = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.min()
maximum = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.max()
mean = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.mean()
median = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.median()
stdDev = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.std()
nthDecile = df_labs.groupby('LONG_COMMON_NAME').Result_numeric.quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
                                                                         0.7, 0.8, 0.9])

Wall time: 518 ms


**NOTE: Less than a second to calculate all necessary lab information**

In [13]:
# Same calculations rerun without %%time, since the timed cell did not keep its variables
code = df_labs.groupby(['LONG_COMMON_NAME']).Loinc_Code.unique()

count = df_labs.LONG_COMMON_NAME.value_counts()

df_labs['orderYear'] = pd.to_datetime(df_labs.Ordering_datetime).dt.year

frequencyPerYear = df_labs.groupby(['LONG_COMMON_NAME','orderYear','PatientID']).PatientID.count().groupby(['LONG_COMMON_NAME','orderYear']).mean()

correlatedLabsCoefficients = df_labs.groupby('LONG_COMMON_NAME').Result_numeric.apply(lambda x: pd.Series(x.values)).unstack().transpose().corr()

abscorrelation = correlatedLabsCoefficients.abs()

fractionOfSubjects = df_labs.groupby(['LONG_COMMON_NAME']).PatientID.nunique()/df_labs.PatientID.nunique()

units = df_labs.groupby(['LONG_COMMON_NAME']).unit.unique()

minimum = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.min()
maximum = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.max()
mean = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.mean()
median = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.median()
stdDev = df_labs.groupby(['LONG_COMMON_NAME']).Result_numeric.std()
nthDecile = df_labs.groupby('LONG_COMMON_NAME').Result_numeric.quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
                                                                         0.7, 0.8, 0.9])
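To make the `frequencyPerYear` chain easier to follow, here is a minimal sketch on made-up data. The `toy` frame, its lab names, and its values are assumptions for illustration only and are not taken from the EDS sample.

```python
import pandas as pd

# Toy stand-in for df_labs; every value here is made up.
toy = pd.DataFrame({
    "LONG_COMMON_NAME": ["Sodium"] * 5 + ["Calcium"] * 2,
    "orderYear": [2016, 2016, 2016, 2016, 2017, 2016, 2016],
    "PatientID": ["A", "A", "A", "B", "A", "A", "B"],
})

# Step 1: number of orders per (lab, year, patient).
per_patient = toy.groupby(["LONG_COMMON_NAME", "orderYear", "PatientID"]).size()

# Step 2: average those counts over the patients who had that lab in that year.
toy_frequency = per_patient.groupby(["LONG_COMMON_NAME", "orderYear"]).mean()
print(toy_frequency)
```

In this sketch patient A has three Sodium orders in 2016 and patient B has one, so the 2016 value for Sodium is 2.0. Patients with no order for a lab in a given year are not part of the average, so the result reads as "orders per patient among patients who were tested at least once".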

#### Printing out first 10 results from each calculated field as an example
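*If you copy this file, feel free to remove `.head(10)` to see all results. By default, pandas groupby sorts results alphanumerically by group key; `count` comes from `value_counts`, so it is sorted by descending frequency instead.*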

In [14]:
code.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    [20636-7]
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma       [1742-6]
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma           [6768-6]
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma     [1920-8]
Basophils [#/volume] in Blood by Automated count                               [704-7]
Basophils/100 leukocytes in Blood by Automated count                           [706-2]
Bilirubin.total [Mass/volume] in Serum or Plasma                              [1975-2]
Buprenorphine [Presence] in Urine                                             [3414-0]
Calcium [Mass/volume] in Serum or Plasma                                     [17861-6]
Calcium [Moles/volume] in Serum or Plasma                                     [2000-8]
Name: Loinc_Code, dtype: object

In [15]:
count.head(10)

Sodium [Moles/volume] in Serum or Plasma                                                  2478
Hemoglobin [Mass/volume] in Blood                                                         2469
Calcium [Mass/volume] in Serum or Plasma                                                  2467
Hematocrit [Volume Fraction] of Blood by Automated count                                  2466
Erythrocytes [#/volume] in Blood by Automated count                                       2465
Erythrocyte mean corpuscular volume [Entitic volume] by Automated count                   2464
Platelets [#/volume] in Blood by Automated count                                          2461
Erythrocyte mean corpuscular hemoglobin [Entitic mass] by Automated count                 2457
Erythrocyte mean corpuscular hemoglobin concentration [Mass/volume] by Automated count    2456
Erythrocyte distribution width [Ratio] by Automated count                                 2435
Name: LONG_COMMON_NAME, dtype: int64

In [16]:
frequencyPerYear.head(10)

LONG_COMMON_NAME                                                           orderYear
Alanine [Moles/volume] in Serum or Plasma                                  2016         1.000000
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma    2015         1.166667
                                                                           2016         2.385417
                                                                           2017         2.811550
                                                                           2018         3.135417
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma        2015         1.166667
                                                                           2016         2.395833
                                                                           2017         2.817629
                                                                           2018         3.145833
Aspartate aminotransferase [Enzymatic acti

In [17]:
correlatedLabsCoefficients.head(10)

LONG_COMMON_NAME,Alanine [Moles/volume] in Serum or Plasma,Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma,Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma,Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma,Basophils [#/volume] in Blood by Automated count,Basophils/100 leukocytes in Blood by Automated count,Bilirubin.total [Mass/volume] in Serum or Plasma,Buprenorphine [Presence] in Urine,Calcium [Mass/volume] in Serum or Plasma,Calcium [Moles/volume] in Serum or Plasma,...,Protein [Mass/volume] in Serum or Plasma,Prothrombin time (PT),Prothrombin time (PT) in Blood by Coagulation assay,Sodium [Moles/volume] in Serum or Plasma,Tissue transglutaminase IgA Ab [Units/volume] in Serum,Triglyceride [Mass/volume] in Serum or Plasma,Tryptase [Mass/volume] in Serum or Plasma,Urea nitrogen [Mass/volume] in Blood,Urea nitrogen [Mass/volume] in Serum or Plasma,Urea nitrogen [Mass/volume] in Venous blood
Alanine [Moles/volume] in Serum or Plasma,,,,,,,,,,,...,,,,,,,,,,
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma,,1.0,-0.055443,0.412508,,-0.009873,-0.048056,,-0.021874,0.231193,...,-0.186058,-0.089089,-0.019848,-0.057534,,-0.081144,0.064373,,0.00043,-0.019745
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma,,-0.055443,1.0,-0.033946,,0.056035,0.028484,,0.009639,0.199466,...,0.063551,-0.052525,0.005521,0.006587,,0.00785,-0.030504,,0.073533,-0.01024
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma,,0.412508,-0.033946,1.0,,-0.004013,-0.032171,,-0.028781,0.240443,...,-0.17884,-0.076304,-0.007656,0.031657,,-0.047406,0.373397,,-0.036357,-0.009408
Basophils [#/volume] in Blood by Automated count,,,,,,,,,,,...,,,,,,,,,,
Basophils/100 leukocytes in Blood by Automated count,,-0.009873,0.056035,-0.004013,,1.0,-0.000222,,-0.030338,-0.172233,...,0.001766,0.077588,-0.070155,-0.008356,,0.018652,-0.069179,,0.061525,0.021938
Bilirubin.total [Mass/volume] in Serum or Plasma,,-0.048056,0.028484,-0.032171,,-0.000222,1.0,,-0.074946,0.422649,...,-0.00274,-0.190106,0.053342,0.018888,,-0.087301,0.111795,,0.106218,-0.045631
Buprenorphine [Presence] in Urine,,,,,,,,,,,...,,,,,,,,,,
Calcium [Mass/volume] in Serum or Plasma,,-0.021874,0.009639,-0.028781,,-0.030338,-0.074946,,1.0,0.074787,...,0.031181,0.231302,0.023751,0.086612,,-0.256088,-0.029405,,0.072451,-0.020347
Calcium [Moles/volume] in Serum or Plasma,,0.231193,0.199466,0.240443,,-0.172233,0.422649,,0.074787,1.0,...,0.273437,-0.021036,0.104629,-0.43621,,-0.428899,0.479242,,0.375903,0.275933
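
A note on how these coefficients are paired: the `apply(lambda x: pd.Series(x.values)).unstack()` step lines results up by their position within each lab, so the k-th result of one lab is compared with the k-th result of another, whichever patients they belong to. If pairing by patient is what is wanted, one possible alternative (an assumption about intent, not the notebook's method) is to pivot to one row per patient first. The lab labels and values below are made up.

```python
import pandas as pd

# Made-up results for two labs across four patients.
toy = pd.DataFrame({
    "PatientID": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "LONG_COMMON_NAME": ["ALT", "AST"] * 4,
    "Result_numeric": [10.0, 12.0, 20.0, 19.0, 30.0, 33.0, 40.0, 41.0],
})

# One row per patient, one column per lab, each cell the patient's mean result.
wide = toy.pivot_table(index="PatientID", columns="LONG_COMMON_NAME",
                       values="Result_numeric", aggfunc="mean")

# Pearson correlation between labs; each pair of columns uses only the
# patients who have a value for both labs.
paired_corr = wide.corr()
print(paired_corr)
```

Averaging per patient is only one choice here. Pairing by encounter or by order date would be equally reasonable, depending on what the model field is meant to express.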


In [18]:
abscorrelation.head(10)

LONG_COMMON_NAME,Alanine [Moles/volume] in Serum or Plasma,Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma,Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma,Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma,Basophils [#/volume] in Blood by Automated count,Basophils/100 leukocytes in Blood by Automated count,Bilirubin.total [Mass/volume] in Serum or Plasma,Buprenorphine [Presence] in Urine,Calcium [Mass/volume] in Serum or Plasma,Calcium [Moles/volume] in Serum or Plasma,...,Protein [Mass/volume] in Serum or Plasma,Prothrombin time (PT),Prothrombin time (PT) in Blood by Coagulation assay,Sodium [Moles/volume] in Serum or Plasma,Tissue transglutaminase IgA Ab [Units/volume] in Serum,Triglyceride [Mass/volume] in Serum or Plasma,Tryptase [Mass/volume] in Serum or Plasma,Urea nitrogen [Mass/volume] in Blood,Urea nitrogen [Mass/volume] in Serum or Plasma,Urea nitrogen [Mass/volume] in Venous blood
Alanine [Moles/volume] in Serum or Plasma,,,,,,,,,,,...,,,,,,,,,,
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma,,1.0,0.055443,0.412508,,0.009873,0.048056,,0.021874,0.231193,...,0.186058,0.089089,0.019848,0.057534,,0.081144,0.064373,,0.00043,0.019745
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma,,0.055443,1.0,0.033946,,0.056035,0.028484,,0.009639,0.199466,...,0.063551,0.052525,0.005521,0.006587,,0.00785,0.030504,,0.073533,0.01024
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma,,0.412508,0.033946,1.0,,0.004013,0.032171,,0.028781,0.240443,...,0.17884,0.076304,0.007656,0.031657,,0.047406,0.373397,,0.036357,0.009408
Basophils [#/volume] in Blood by Automated count,,,,,,,,,,,...,,,,,,,,,,
Basophils/100 leukocytes in Blood by Automated count,,0.009873,0.056035,0.004013,,1.0,0.000222,,0.030338,0.172233,...,0.001766,0.077588,0.070155,0.008356,,0.018652,0.069179,,0.061525,0.021938
Bilirubin.total [Mass/volume] in Serum or Plasma,,0.048056,0.028484,0.032171,,0.000222,1.0,,0.074946,0.422649,...,0.00274,0.190106,0.053342,0.018888,,0.087301,0.111795,,0.106218,0.045631
Buprenorphine [Presence] in Urine,,,,,,,,,,,...,,,,,,,,,,
Calcium [Mass/volume] in Serum or Plasma,,0.021874,0.009639,0.028781,,0.030338,0.074946,,1.0,0.074787,...,0.031181,0.231302,0.023751,0.086612,,0.256088,0.029405,,0.072451,0.020347
Calcium [Moles/volume] in Serum or Plasma,,0.231193,0.199466,0.240443,,0.172233,0.422649,,0.074787,1.0,...,0.273437,0.021036,0.104629,0.43621,,0.428899,0.479242,,0.375903,0.275933


In [19]:
fractionOfSubjects.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    0.001534
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma      0.815951
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma          0.815951
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma    0.815951
Basophils [#/volume] in Blood by Automated count                             0.001534
Basophils/100 leukocytes in Blood by Automated count                         0.731595
Bilirubin.total [Mass/volume] in Serum or Plasma                             0.814417
Buprenorphine [Presence] in Urine                                            0.007669
Calcium [Mass/volume] in Serum or Plasma                                     0.880368
Calcium [Moles/volume] in Serum or Plasma                                    0.027607
Name: PatientID, dtype: float64

In [20]:
units.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    [umol/L]
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma         [U/L]
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma             [U/L]
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma      [IU/L]
Basophils [#/volume] in Blood by Automated count                               [K/uL]
Basophils/100 leukocytes in Blood by Automated count                              [%]
Bilirubin.total [Mass/volume] in Serum or Plasma                              [mg/dL]
Buprenorphine [Presence] in Urine                                               [nan]
Calcium [Mass/volume] in Serum or Plasma                                      [mg/dL]
Calcium [Moles/volume] in Serum or Plasma                                     [mg/dL]
Name: unit, dtype: object

In [21]:
minimum.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    401.8
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma        4.0
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma           16.0
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma      6.0
Basophils [#/volume] in Blood by Automated count                               0.0
Basophils/100 leukocytes in Blood by Automated count                           0.0
Bilirubin.total [Mass/volume] in Serum or Plasma                               0.1
Buprenorphine [Presence] in Urine                                              NaN
Calcium [Mass/volume] in Serum or Plasma                                       4.9
Calcium [Moles/volume] in Serum or Plasma                                      7.6
Name: Result_numeric, dtype: float64

In [22]:
maximum.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                     401.8
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma       896.0
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma           855.0
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma    1203.0
Basophils [#/volume] in Blood by Automated count                                0.0
Basophils/100 leukocytes in Blood by Automated count                            2.8
Bilirubin.total [Mass/volume] in Serum or Plasma                                4.2
Buprenorphine [Presence] in Urine                                               NaN
Calcium [Mass/volume] in Serum or Plasma                                       11.4
Calcium [Moles/volume] in Serum or Plasma                                      10.5
Name: Result_numeric, dtype: float64

In [23]:
mean.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    401.800000
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma       29.804916
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma           75.764341
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma     28.586710
Basophils [#/volume] in Blood by Automated count                               0.000000
Basophils/100 leukocytes in Blood by Automated count                           0.542388
Bilirubin.total [Mass/volume] in Serum or Plasma                               0.470937
Buprenorphine [Presence] in Urine                                                   NaN
Calcium [Mass/volume] in Serum or Plasma                                       9.026775
Calcium [Moles/volume] in Serum or Plasma                                      8.839394
Name: Result_numeric, dtype: float64

In [24]:
median.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                    401.8
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma       17.0
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma           66.0
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma     19.0
Basophils [#/volume] in Blood by Automated count                               0.0
Basophils/100 leukocytes in Blood by Automated count                           0.5
Bilirubin.total [Mass/volume] in Serum or Plasma                               0.4
Buprenorphine [Presence] in Urine                                              NaN
Calcium [Mass/volume] in Serum or Plasma                                       9.1
Calcium [Moles/volume] in Serum or Plasma                                      8.8
Name: Result_numeric, dtype: float64

In [25]:
stdDev.head(10)

LONG_COMMON_NAME
Alanine [Moles/volume] in Serum or Plasma                                          NaN
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma      55.465530
Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma          47.040064
Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma    56.039758
Basophils [#/volume] in Blood by Automated count                                   NaN
Basophils/100 leukocytes in Blood by Automated count                          0.368155
Bilirubin.total [Mass/volume] in Serum or Plasma                              0.326424
Buprenorphine [Presence] in Urine                                                  NaN
Calcium [Mass/volume] in Serum or Plasma                                      0.650902
Calcium [Moles/volume] in Serum or Plasma                                     0.925317
Name: Result_numeric, dtype: float64
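
The NaN values above are consistent with how pandas computes the standard deviation: `std()` uses the sample formula (`ddof=1`) by default, which is undefined for a single observation. Alanine, for example, has the same minimum, maximum, mean, and median (401.8), which fits a lab with only one numeric result. A minimal sketch, using that one value as a stand-in series:

```python
import numpy as np
import pandas as pd

single = pd.Series([401.8])

sample_std = single.std()            # sample standard deviation (ddof=1)
population_std = single.std(ddof=0)  # population standard deviation

print(sample_std)      # nan
print(population_std)  # 0.0
```

Buprenorphine is a different case: its minimum, maximum, mean, and median are also NaN, which points to no numeric results at all rather than a single one.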

In [27]:
nthDecile.head(20)

LONG_COMMON_NAME                                                            
Alanine [Moles/volume] in Serum or Plasma                                0.1    401.8
                                                                         0.2    401.8
                                                                         0.3    401.8
                                                                         0.4    401.8
                                                                         0.5    401.8
                                                                         0.6    401.8
                                                                         0.7    401.8
                                                                         0.8    401.8
                                                                         0.9    401.8
Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma  0.1     10.0
                                                               

### Define Correlation Functions Needed for Categorical Data

In [28]:
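def cramers_v(df, x, y):
    # Bias-corrected Cramér's V between two categorical columns x and y
    # Contingency table of counts: rows are values of x, columns are values of y
    confusion_matrix = (df.groupby([x,y])[y].size().unstack().fillna(0).astype(int))
    # Chi-squared statistic (SciPy applies Yates' correction to 2x2 tables by default)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    # Bias correction for the number of rows, columns, and observations
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1), (rcorr-1)))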

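def uncertainty_coefficient(df, x, y):
    # Theil's U(x|y): fraction of the entropy of x explained by knowing y
    df2 = df[[x,y]].dropna()
    total = len(df2)
    # Joint distribution p(x,y) over rows where both columns are present
    p_xy = df2.groupby([x,y], sort=False)[y].size()/total
    # Marginal p(y), aligned to the (x,y) index of p_xy
    p_y = p_xy.groupby(level=1).transform('sum')

    # Conditional entropy H(x|y)
    s_xy = sum(p_xy * (p_y/p_xy).apply(math.log))

    p_x = df2.groupby([x], sort=False)[x].size()/total
    s_x = ss.entropy(p_x)
    if s_x == 0:
        return 1
    else:
        return ((s_x - s_xy) / s_x)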

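def correlation_ratio(df, x, y):
    # Correlation ratio (eta) between categorical column x and numeric column y
    # Per-category row count and mean of y
    df2 = df.groupby([x],sort=False)[y].agg(['size','mean'])
    ybar = df[y].mean()
    # Between-category sum of squares, weighted by category size
    numerator = np.nansum(np.multiply(df2['size'],np.square(df2['mean']-ybar)))
    # Total sum of squares around the overall mean
    ssd = np.square(df[y]-ybar)
    #ssd = df.groupby([x,y],sort=False)[y].apply(lambda y: np.nansum(np.square(y-ybar)))
    denominator = np.nansum(ssd)
    if numerator == 0:
        return 0.0
    else:
        return np.sqrt(numerator/denominator)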
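
As a sanity check on what the three measures should return, here is a standalone sketch on made-up data where the answers are known in advance. The helper names (`cramers_v_from_table`, `theils_u`, `eta`) and the `code`, `group`, and `value` columns are hypothetical. They re-derive the same formulas with NumPy and pandas only and are not the notebook's functions. One known difference: the chi-squared statistic here has no continuity correction, whereas `ss.chi2_contingency` applies Yates' correction to 2x2 tables by default.

```python
import numpy as np
import pandas as pd

def cramers_v_from_table(table):
    """Bias-corrected Cramér's V from a contingency table of counts."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, k = observed.shape
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))

def theils_u(df, x, y):
    """U(x|y): share of the entropy of x that is removed by knowing y."""
    pairs = df[[x, y]].dropna()
    p_xy = pairs.groupby([x, y]).size() / len(pairs)
    p_y = p_xy.groupby(level=y).transform("sum")
    h_x_given_y = -(p_xy * np.log(p_xy / p_y)).sum()
    p_x = p_xy.groupby(level=x).sum()
    h_x = -(p_x * np.log(p_x)).sum()
    return 1.0 if h_x == 0 else float((h_x - h_x_given_y) / h_x)

def eta(df, x, y):
    """Correlation ratio of numeric column y across the categories of x."""
    data = df[[x, y]].dropna()
    grand_mean = data[y].mean()
    groups = data.groupby(x)[y].agg(["count", "mean"])
    between = (groups["count"] * (groups["mean"] - grand_mean) ** 2).sum()
    total = ((data[y] - grand_mean) ** 2).sum()
    return 0.0 if total == 0 else float(np.sqrt(between / total))

# Made-up data: "code" fully determines "group", but not the other way round,
# and "value" is constant within each code.
toy = pd.DataFrame({
    "code": ["a", "a", "b", "b", "c", "c"],
    "group": ["g1", "g1", "g1", "g1", "g2", "g2"],
    "value": [1.0, 1.0, 5.0, 5.0, 9.0, 9.0],
})

v_perfect = cramers_v_from_table(np.diag([30, 30, 30]))  # one-to-one categories
v_none = cramers_v_from_table(np.full((3, 3), 10))       # independent categories
u_group_given_code = theils_u(toy, "group", "code")      # code determines group
u_code_given_group = theils_u(toy, "code", "group")      # group only narrows code down
eta_code_value = eta(toy, "code", "value")               # value fixed within each code

print(v_perfect, v_none, u_group_given_code, u_code_given_group, eta_code_value)
```

Cramér's V is symmetric, so it cannot show that `code` predicts `group` better than `group` predicts `code`. Theil's U is asymmetric and does show it, which may matter when deciding which measure to report for each pair of fields.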

### Join All DataFrames to "Correlate Everything to Everything"

In [29]:
df = (df_labs.merge(df_diagnoses_hpo, on='PatientID')
             .merge(df_encounter, on=['PatientID','EncounterID'], how='outer')
             .merge(df_meds, on=['PatientID','EncounterID'], how='outer'))
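
Because the first merge joins labs to diagnoses on `PatientID` alone, every lab row for a patient is paired with every diagnosis row for that patient. Whether that is intended is not stated here, so the sketch below only shows the effect on made-up frames and one way to detect it with the `validate` argument of `merge`. The frame names and values are assumptions.

```python
import pandas as pd

# Made-up frames: patient 1 has two labs and two diagnoses.
labs = pd.DataFrame({"PatientID": [1, 1, 2], "lab": ["Na", "Ca", "Na"]})
dx = pd.DataFrame({"PatientID": [1, 1, 2], "icd_10": ["X1", "X2", "X1"]})

# Many-to-many join: patient 1 contributes 2 x 2 rows, patient 2 contributes 1.
merged = labs.merge(dx, on="PatientID")
print(len(labs), len(dx), len(merged))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    labs.merge(dx, on="PatientID", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```

Row multiplication like this changes the counts that feed any frequency-based measure, so it is worth checking row counts before and after each merge.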

### Define Categorical Fields

In [30]:
categoricals = ['Lab_Name','Base_Name','Loinc_Code','LONG_COMMON_NAME','Category','GroupId','icd_10','icd_name',
                'hpo','hpo_term','Encounter_type','Medication_Name','Dose','Route','Frequency','RXNorm',
               'Therapeutic_Class','Pharmaceutical_Class','Pharmaceutical_Subclass']
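
With the categorical fields listed, one possible way to decide which measure applies to each pair of columns is sketched below. This is an assumption about how the pieces could fit together, not a method defined in this notebook. `measure_for` is a hypothetical helper, and the three column names are just a small sample.

```python
import itertools

def measure_for(col_a, col_b, categorical_columns):
    """Pick a correlation measure based on which columns are categorical."""
    a_cat = col_a in categorical_columns
    b_cat = col_b in categorical_columns
    if a_cat and b_cat:
        return "cramers_v or uncertainty_coefficient"
    if a_cat or b_cat:
        return "correlation_ratio"
    return "pearson"

# Small sample of columns; Result_numeric is treated as numeric.
columns = ["Route", "Encounter_type", "Result_numeric"]
categorical_columns = {"Route", "Encounter_type"}

# Every unordered pair of columns, mapped to the measure that would apply.
plan = {
    pair: measure_for(pair[0], pair[1], categorical_columns)
    for pair in itertools.combinations(columns, 2)
}
for pair, measure in plan.items():
    print(pair, "->", measure)
```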

## Work in Progress...
#### Need to Define Correlations More Precisely

## Will Add in Other Fields & Their Calculated Results Shortly.....