# Labs Filtering

This notebook covers the creation of references for the following Stage 1 filtering criteria:
1. `Labs` `List` Age is 18 and over with valid low-density lipoprotein cholesterol (LDL-C) test result
2.  `Labs` `List` Existing triglycerides (TG) measurements and thresholded at 5.645 mmol/L
3. `Labs` `Dictionary` Date of highest LDL-C measurement (Index Date)
5. `Labs` `List` Those with secondary causes as proxied by lab results thresholded using Table 3.

## Move to llt notebook
6. `Drugs` `List` Those with lipid lowering treatments (LLT) within 1 year to 6 weeks of their Index Date
7. `Labs` `Dictionary` `Modified` Create LDL-C measurement dictionary with actual and scaled values (i.e., based on LLT)

## To-do:
4. `Diagnosis` `List` Those with secondary causes within 1 year (+/-) of Index Date

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os, sys
from dotenv import load_dotenv
load_dotenv("../.env")  # take environment variables
PROJECT_ROOT = os.environ.get("PROJECT_ROOT")
sys.path.append(PROJECT_ROOT)

import numpy as np
import pandas as pd
from glob import glob
from tqdm import tqdm
from datetime import datetime

import utils.PATHS as PATHS
import utils.utils as utils
# import utils.emr_utils as emr_utils
# import utils.load_utils as load_utils

## 1. Valid LDL-C and Age >= 18 patient list

### Notes:
* LDL-C measurements are the primary constraint of this referral algorithm, so this criterion and its reference dataset (i.e., general labs) define the maximum scope.
* Patients not in the general labs dataset will not be accounted for by the referral algorithm.
* We are using the following as test names for LDL-C as they appear in the dataset:
  - `LDL-CHOLESTEROL,CALCULATED`; Units: MMOL/L
  - `LDL-CHOLESTEROL,DIRECT`; Units: MMOL/L
* Both CALCULATED and DIRECT measurements are handled the same way.
* A few cases have an LDL-C result value of '> 10.30'; these were pegged to **10.30**. This risks spurious results when selecting the Index Date for these particular cases; we are not handling this issue for now.
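
Pegging '> 10.30' to 10.30 discards the fact that the true value is only known to be above that bound. The following is a minimal sketch of one way to keep that information, assuming censored results always look like `> x` or `< x`. The helper name `parse_censored` and the `bound` column are hypothetical and not part of this notebook.

```python
import re

import numpy as np
import pandas as pd

# Matches strings such as "> 10.30" or "< 0.10" (assumed format).
_CENSORED = re.compile(r"^\s*([<>])\s*(\d+(?:\.\d+)?)\s*$")


def parse_censored(value):
    """Return (number, bound); bound is '>', '<' or '' for plain values."""
    if isinstance(value, str):
        match = _CENSORED.match(value)
        if match:
            return float(match.group(2)), match.group(1)
    try:
        return float(value), ""
    except (TypeError, ValueError):
        return np.nan, ""  # e.g. 'ND' or a missing value


raw = pd.Series(["2.58", "> 10.30", "ND", None])
parsed = pd.DataFrame(raw.map(parse_censored).tolist(), columns=["value", "bound"])
```

With a `bound` flag available, patients whose maximum LDL-C is a censored value could later be reviewed separately instead of being treated as exact measurements.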

In [3]:
lab_fp_list = glob(os.path.join(PATHS.LABS, "*.csv"))

# general labs dataframe
df_list = [pd.read_csv(path, low_memory=False) for path in tqdm(lab_fp_list)]
df = pd.concat(df_list, ignore_index=True)

100%|███████████████████████████████████████████████████████| 48/48 [01:14<00:00,  1.55s/it]


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30943083 entries, 0 to 30943082
Data columns (total 25 columns):
 #   Column                                   Dtype  
---  ------                                   -----  
 0   Institution Code                         object 
 1   Patient ID                               object 
 2   Gender                                   object 
 3   Race                                     object 
 4   Date of Birth                            object 
 5   Nationality                              object 
 6   Country of Residence                     object 
 7   Resident Indicator                       object 
 8   Case No                                  object 
 9   Result Comment Date                      float64
 10  Specimen Received Date                   object 
 11  Lab Test Code                            object 
 12  Lab Test Description                     object 
 13  Lab Test Type                            object 
 14  Lab Test Type (K

In [5]:
df['Lab Resulted Order Test Display'].unique()

array([nan])

In [6]:
df['Lab Resulted Order Test Type'].unique()

array(['T', 'P', 'C', 'G'], dtype=object)

In [7]:
df['Result Comment Date'].unique()

array([nan])

In [8]:
# Total unique patients
df["Patient ID"].unique().size

162694

In [9]:
# test results reference column
pid_col = "Patient ID"
result_col = "Result Value"
test_name_col = "Lab Resulted Order Test Description"
query_str = "LDL" # lipoprotein, 

# check potential LDL-C tests naming
(
    df.loc[df[test_name_col].str.contains(query_str, case=False)]
    [[pid_col, test_name_col, result_col]]
)[test_name_col].unique()

array(['LDL-CHOLESTEROL,CALCULATED', 'LDL-CHOLESTEROL,DIRECT',
       'CHOLESTEROL,TG,HDL,LDL'], dtype=object)

In [10]:
query_str = "lipoprotein"
(
    df.loc[df[test_name_col].str.contains(query_str, case=False)]
    [[pid_col, test_name_col, result_col]]
)[test_name_col].unique()

array(['APOLIPOPROTEIN A-1', 'APOLIPOPROTEIN B'], dtype=object)

In [11]:
# most are not relevant; stick with LDL-CHOLESTEROL,CALCULATED and
# LDL-CHOLESTEROL,DIRECT
query_str = "LD"
(
    df.loc[df[test_name_col].str.contains(query_str, case=False)]
    [[pid_col, test_name_col, result_col]]
)[test_name_col].unique()

array(['LDL-CHOLESTEROL,CALCULATED', 'ALDOSTERONE', 'ALDOLASE',
       'LDL-CHOLESTEROL,DIRECT', 'LDH,FLUID', 'BLD FILM REPORT',
       'CHOLESTEROL,TG,HDL,LDL', 'KARYOTYPE NEONATE BLD',
       'BLD FOR MICROFILARIA', 'BURKHOLDERIA PSEUDOMALLEI PCR',
       '24HR URINARY ALDOSTERONE', 'KARYOTYPE NEONATE BLD PROFILE',
       'KARY HEM DISORDERS BLD PRO'], dtype=object)

In [12]:
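def not_startswithdigit(x):
    # True for strings whose first character is not a digit;
    # NaN, other non-strings and empty strings return False
    try:
        return not x[0].isdigit()
    except (TypeError, IndexError, AttributeError):
        return False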
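
A first-character check like `not_startswithdigit` also flags negative numbers (for example '-5.5') as non-numeric, because '-' is not a digit. As an alternative, here is a minimal sketch that asks pandas which entries fail numeric parsing. The helper name `find_non_numeric` is hypothetical.

```python
import pandas as pd


def find_non_numeric(values: pd.Series) -> pd.Series:
    """Return the non-missing entries that cannot be parsed as numbers."""
    as_number = pd.to_numeric(values, errors="coerce")
    return values[as_number.isna() & values.notna()]


sample = pd.Series(["> 50", "-5.5", "3.2", "NIL", None, "."])
non_numeric = find_non_numeric(sample).unique().tolist()
```

Here '-5.5' parses as a number and is therefore not reported, while censored strings and text results still are.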

In [13]:
# full data
nonfloat = df.loc[df[result_col].apply(lambda x: not_startswithdigit(x))][result_col].unique()
nonfloat[0:20]

array(['> 50', 'FEMALE', 'NOT DETECTED', '.', '-5.5', 'NIL', '< 1.0',
       'DETECTED', '+', '< 13', '-1.0', '< 0.06', 'NEGATIVE',
       'NON-REACTIVE', 'POSITIVE', 'TRACE', 'Indeterminate', 'Positive',
       '< 0.023', 'Negative'], dtype=object)

In [14]:
test_names = [
    'LDL-CHOLESTEROL,CALCULATED',
    'LDL-CHOLESTEROL,DIRECT'
]
# copy so later assignments do not target a slice of df
ldlc_all = df.loc[df[test_name_col].isin(test_names)].copy()

In [15]:
nonfloat_ldlc = ldlc_all.loc[ldlc_all[result_col].apply(lambda x: not_startswithdigit(x))][result_col].unique()
nonfloat_ldlc

array(['> 10.30', 'ND'], dtype=object)

In [16]:
# address '> 10.30' string
ldlc_all.loc[ldlc_all[result_col] == '> 10.30', result_col] = 10.3

# Validity filtering: exclude ND and NaN values
# drop those with 'ND' results (invalid); copy so the float cast below
# does not target a slice of ldlc_all
ldlc_valid = ldlc_all.loc[ldlc_all[result_col] != 'ND'].copy()

# coerce results to float
# (plain column assignment is used because
#  ldlc_valid.loc[:, result_col] = ... casts back to object dtype)
ldlc_valid[result_col] = ldlc_valid[result_col].astype(float)

# remove nans
ldlc_valid = ldlc_valid.dropna(subset=[result_col])


In [17]:
# pre filtering of valid values
display(ldlc_all.shape)

# post filtering
display(ldlc_valid.shape)

(30509, 25)

(30499, 25)

In [18]:
# overall summary statistics
ldlc_valid[[test_name_col, result_col]].describe()

| | Result Value |
|---|---|
| count | 30499.0 |
| mean | 2.715251 |
| std | 1.389682 |
| min | 0.0 |
| 25% | 1.86 |
| 50% | 2.58 |
| 75% | 3.42 |
| max | 71.42 |


In [19]:
# stratified
grouper = ldlc_valid[[test_name_col, result_col]].groupby(test_name_col)
for name, group_data in grouper:
    print(name)
    display(group_data.set_index(test_name_col).describe())

LDL-CHOLESTEROL,CALCULATED


| | Result Value |
|---|---|
| count | 30392.0 |
| mean | 2.712738 |
| std | 1.387382 |
| min | 0.0 |
| 25% | 1.86 |
| 50% | 2.58 |
| 75% | 3.41 |
| max | 71.42 |


LDL-CHOLESTEROL,DIRECT


| | Result Value |
|---|---|
| count | 107.0 |
| mean | 3.429065 |
| std | 1.806929 |
| min | 0.22 |
| 25% | 2.235 |
| 50% | 3.13 |
| 75% | 3.99 |
| max | 10.3 |


In [20]:
# sanity checks : nans and ambiguities per patient
tqdm.pandas()

# nan DOBs
display(ldlc_valid["Date of Birth"].isna().sum())

# multiple dob per patient (including nan) : NONE, all pids have 1 unique DOB
(ldlc_valid.groupby("Patient ID")["Date of Birth"]
 .progress_apply(set)
 .progress_apply(list)
 .progress_apply(lambda x: len(x))
).sort_values()



0

100%|██████████████████████████████████████████████| 24170/24170 [00:00<00:00, 95103.13it/s]
100%|█████████████████████████████████████████████| 24170/24170 [00:00<00:00, 474832.80it/s]
100%|████████████████████████████████████████████| 24170/24170 [00:00<00:00, 1660002.09it/s]


Patient ID
ffd757c66b4e6b225b20    1
ffd9a0062d28db766d29    1
ffe013830194c57ebf6b    1
ffe0d168ca0628f977eb    1
ffe29b60cc135a865749    1
                       ..
fffc7834b1be0778220d    1
fffdc69ec9e4e8d95613    1
ffff3d1e06e7dbc39130    1
ffffbd6a9ee98468a416    1
00031c3262ee7a0c2981    1
Name: Date of Birth, Length: 24170, dtype: int64

In [21]:
# create age col
ref_date = "2024-01-01"
ldlc_valid["Date of Birth"] = pd.to_datetime(ldlc_valid["Date of Birth"])
ldlc_valid["Age"] = ldlc_valid["Date of Birth"].apply(utils.get_age, args=("%Y-%m-%d", ref_date))
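
`utils.get_age` is a project helper whose implementation is not shown in this notebook. For readers without access to it, the following is a minimal sketch of how age in completed years at a reference date might be computed in a vectorized way. The function name `age_in_years` is hypothetical and may differ from what `utils.get_age` actually does.

```python
import pandas as pd


def age_in_years(dob: pd.Series, ref_date: str) -> pd.Series:
    """Completed years between each date of birth and ref_date."""
    ref = pd.Timestamp(ref_date)
    # True when the birthday has already occurred in the reference year.
    birthday_passed = (dob.dt.month < ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day <= ref.day)
    )
    return ref.year - dob.dt.year - (~birthday_passed).astype(int)


dob = pd.to_datetime(pd.Series(["2006-01-01", "2006-01-02", "1970-06-15"]))
ages = age_in_years(dob, "2024-01-01")
```

A patient born on 2006-01-02 has not yet turned 18 on the reference date of 2024-01-01, which is the boundary the `Age >= 18` filter depends on.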

In [22]:
ldlc_valid_18 = ldlc_valid.loc[ldlc_valid["Age"] >= 18]

# pre filtering of valid values
display(ldlc_all.shape)

# post filtering
display(ldlc_valid.shape)

# post filtering age
display(ldlc_valid_18.shape)

(30509, 25)

(30499, 26)

(30499, 26)

In [23]:
ldlc_valid_18["Age"]

1405        53
1664        46
3369        67
3905        95
4104        42
            ..
30904658    28
30911917    22
30919951    34
30920394    69
30927159    54
Name: Age, Length: 30499, dtype: int64

In [24]:
ldlc_valid_18_plist = ldlc_valid_18[pid_col].unique().tolist()
len(ldlc_valid_18_plist)

24170

## 2. Existing triglycerides (TG) measurements and thresholded at 5.645 mmol/L

NUHS Reference: "We also exclude those patients who have had a triglycerides test with a value greater than or equal to 5.645 mmol/L **more than once.**"

### Notes:
* Count occurrences of TG >= 5.645 mmol/L. If there is more than one, exclude the patient.
* Only 'TRIGLYCERIDES' is used; 'TRIGLYCERIDES,FLUID' and 'TRIGLYCERIDES,URINE' are excluded. To be verified.
* NO TIME HORIZON for this criterion. All TG measurements are accounted for regardless of date.
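
The counting rule in these notes can be sketched without an explicit per-patient loop. This is a minimal illustration on toy data, assuming fully identical rows count as one measurement. The function name `tg_exclusion_list` and the toy values are hypothetical.

```python
import pandas as pd

TG_THRESHOLD = 5.645  # mmol/L, from the NUHS reference quoted above


def tg_exclusion_list(tg_df, pid_col, result_col, threshold=TG_THRESHOLD):
    """Patient IDs with more than one distinct record at or above threshold."""
    deduped = tg_df.drop_duplicates()  # identical rows count once
    high_counts = (deduped[result_col] >= threshold).groupby(deduped[pid_col]).sum()
    return high_counts[high_counts > 1].index.tolist()


toy = pd.DataFrame({
    "Patient ID": ["a", "a", "b", "b", "c", "c"],
    "Specimen Received Date": [
        "2020-01-01", "2021-01-01",  # a: two high results on different dates
        "2020-01-01", "2020-01-01",  # b: the same high result recorded twice
        "2020-01-01", "2021-01-01",  # c: one high, one normal
    ],
    "Result Value": [6.0, 5.645, 7.0, 7.0, 6.0, 1.2],
})
excluded = tg_exclusion_list(toy, "Patient ID", "Result Value")
```

Only patient `a` is excluded: `b` has a single measurement after de-duplication and `c` exceeds the threshold once.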

In [25]:
# triglycerides
query_str = "tri"
(
    df.loc[df[test_name_col].str.contains(query_str, case=False)]
    [[pid_col, test_name_col, result_col]]
)[test_name_col].unique()

array(['Anti-Intrinsic Factor', 'TRIGLYCERIDES', 'TRICHOMONAS',
       'NITRITE DIPSTIX', 'TRIGLYCERIDES,FLUID',
       'Clostridium difficile Tox A&B', 'TRICHOPHYTON MENTAGROPHYTES',
       'BILIRUBIN,PAEDIATRIC', 'Trichophyton Mentagrophytes',
       'TRIGLYCERIDES,URINE'], dtype=object)

In [26]:
# prep tg df (copy so later assignments do not target a slice of df)
tg = df.loc[df[test_name_col].isin(["TRIGLYCERIDES"])].copy()
tg[result_col][tg[result_col].apply(lambda x: not_startswithdigit(x))].unique()

array(['> 50.00', '< 0.10', 'ND'], dtype=object)

In [27]:
# address string values
tg.loc[tg[result_col] == '< 0.10', result_col] = 0.099 # arbitrary
tg.loc[tg[result_col] == '> 50.00', result_col] = 50.0099 # arbitrary

# Validity filtering: exclude ND and NaN values
# drop those with 'ND' results (invalid); copy so the float cast below
# does not target a slice of tg
tg_cleaned = tg.loc[tg[result_col] != 'ND'].copy()

# coerce results to float
# (plain column assignment is used because
#  tg_cleaned.loc[:, result_col] = ... casts back to object dtype)
tg_cleaned[result_col] = tg_cleaned[result_col].astype(float)

# remove nans
tg_cleaned = tg_cleaned.dropna(subset=[result_col])


In [28]:
tg_cleaned[result_col].describe()

count    37032.000000
mean         1.774926
std          2.187615
min          0.099000
25%          0.970000
50%          1.360000
75%          2.000000
max         50.009900
Name: Result Value, dtype: float64

In [29]:
# create patient list with TG >= 5.645 more than once
# NO TIME HORIZON for this criterion
tg_exclude_plist = []
for pid, p_data in tg_cleaned.groupby(pid_col):
    # identical rows count as a single measurement
    temp = p_data.drop_duplicates()
    occurrences = (temp[result_col] >= 5.645).sum()
    if occurrences > 1:
        tg_exclude_plist.append(pid)

In [30]:
print(len(ldlc_valid_18_plist))
print(len(tg_exclude_plist))
running_plist = list(set(ldlc_valid_18_plist).difference(set(tg_exclude_plist)))
print("Remaining: ", len(running_plist))

print("Removed: ", len(ldlc_valid_18_plist) - len(running_plist))

24170
88
Remaining:  24122
Removed:  48


## 3. Date of highest LDL-C measurement (Index Date)

This step creates a dictionary holding, for each patient still included at this stage, the value of their highest LDL-C measurement and the date it was recorded (the Index Date).
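
The loop used in this section sorts each patient's records by result value and takes the last row, so when the highest value occurs on several dates, the chosen Index Date depends on the sort order. The following minimal sketch makes the tie-break explicit. It assumes the earliest date should win, which is an illustrative choice and not a rule stated in this notebook; the function name `index_dates` is hypothetical.

```python
import pandas as pd


def index_dates(ldlc, pid_col, result_col, date_col):
    """One row per patient: highest result and, on ties, its earliest date."""
    ordered = ldlc.sort_values(
        [pid_col, result_col, date_col], ascending=[True, False, True]
    )
    top = ordered.drop_duplicates(subset=pid_col, keep="first")
    return top.set_index(pid_col)[[result_col, date_col]].rename(
        columns={result_col: "LDL-C Max", date_col: "Index Date"}
    )


toy = pd.DataFrame({
    "Patient ID": ["a", "a", "a", "b"],
    "Result Value": [4.1, 5.2, 5.2, 3.0],
    "Specimen Received Date": pd.to_datetime(
        ["2019-03-01", "2021-06-01", "2020-02-01", "2022-01-01"]
    ),
})
result = index_dates(toy, "Patient ID", "Result Value", "Specimen Received Date")
```

For patient `a`, the maximum of 5.2 appears twice and the earlier date, 2020-02-01, is selected.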

In [31]:
ldlc_valid_18_nontg = ldlc_valid_18.loc[ldlc_valid_18[pid_col].isin(running_plist)]
display(ldlc_valid_18_nontg[pid_col].unique().size)
# verified, can save

24122

In [32]:
# prep date col
date_col = "Specimen Received Date"
# work on a copy so the column assignment does not target a slice of ldlc_valid_18
ldlc_valid_18_nontg = ldlc_valid_18_nontg.copy()
ldlc_valid_18_nontg[date_col] = pd.to_datetime(ldlc_valid_18_nontg[date_col])

ldlc_index_dict = {} # can save this dict
for pid, p_data in ldlc_valid_18_nontg.groupby(pid_col):
    temp = p_data.drop_duplicates().sort_values(by=result_col)
    # the last row holds the max (sort is ascending by default)
    max_ldlc = temp.iloc[-1][result_col] # max ldlc val
    index_date = temp.iloc[-1][date_col] # index date
    ldlc_index_dict[pid] = {
        "LDL-C Max" : max_ldlc,
        "Index Date" : index_date,
    }


In [34]:
# either use this or the dict
ldlc_valid_18_nontg_index = pd.DataFrame.from_dict(ldlc_index_dict, orient='index')
ldlc_valid_18_nontg_index.index.name = 'Patient ID'
ldlc_valid_18_nontg_index.to_csv(
    os.path.join(PROJECT_ROOT, "results", "ldlc_valid_18_nontg_index.csv"),
)

## 4. With lab-based secondary causes

### Notes:

* Thresholds used:
  - The threshold for `HBA1C` was converted from mmol/L to a percentage to suit the dataset units. We used the 9% threshold from the Mayo Clinic document instead of 9.6%, the value obtained by converting the NUHS threshold.
  - Thresholds for `TSH RECEPTOR ANTIBODY` and `TSH RECEPTOR ANTIBODIES (TRAb)` were converted from mU/L to IU/L to suit the dataset.
  - The rest of the thresholds were kept as is.
* All test records with `ND` values were removed.
* Specific cleaning done:
  - `BILIRUBIN,TOTAL`
    * "< 3" changed to 2.99
    * "< 2" changed to 1.99
  - `URINE TOTAL PROTEIN`
    * "> 20.0" and "> 20" changed to 20.1
    * "< 0.023" changed to 0.0229
  - `CREATININE`
    * "< 15" changed to 14.99
    * "> 2200" changed to 2200.1
  - `HBA1C`
    * "> 20.1" changed to 20.2
    * "< 4.2" changed to 4.19
  - `TSH RECEPTOR ANTIBODY`
    * "> 40.0" changed to 40.1
  - `TSH RECEPTOR ANTIBODIES (TRAb)`
    * "< 0.90" changed to 0.89 # will cause issues with the threshold for 92 patients
    * "< 1.10" changed to 1.09 # will cause issues with the threshold for 92 patients
    * "> 40.00" changed to 40.1
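
The 9.6% figure quoted for `HBA1C` is consistent with applying the IFCC-to-NGSP master equation to 81.861, under the assumption that the NUHS value is expressed in IFCC units (mmol/mol). A minimal sketch of that conversion follows; the function name `ifcc_to_ngsp` is hypothetical.

```python
def ifcc_to_ngsp(ifcc_mmol_per_mol: float) -> float:
    """Convert HbA1c from IFCC units (mmol/mol) to NGSP units (%)."""
    return 0.09148 * ifcc_mmol_per_mol + 2.152


# Assumed input: the NUHS threshold, taken here to be in mmol/mol.
nuhs_percent = ifcc_to_ngsp(81.861)  # roughly 9.6
```

Because the dataset reports `HBA1C` as a percentage, any threshold compared against it needs to be on the percentage scale as well.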


### 4.1 Thresholds

In [35]:
# thresholds dictionary: test name -> predicate that flags a secondary-cause result
secondary_thresholds = {
    "TSH RECEPTOR ANTIBODY": (lambda x: x >= 0.01), # tsh,  mU/L (milli-Units per litre); IU/L (CONVERTED)
    "TSH RECEPTOR ANTIBODIES (TRAb)": (lambda x: x >= 0.01), # tsh,  mU/L (milli-Units per litre); IU/L (CONVERTED)
    
    "BILIRUBIN,TOTAL" : (lambda x: x > 34.208), # micro-mol/L; UMOL/L (KEEP THRESHOLD)
    
    # "PROTEIN,TOTAL" : (lambda x: x > 3), # grams (DISCARD due to dist of values)
    "URINE TOTAL PROTEIN" : (lambda x: x > 3), # grams, G/L; KEEP
    "PROTEIN,24HR URINE" : (lambda x: x > 3), # grams, G/day; KEEP
    
    "CREATININE" : (lambda x: x > 229.84), # micro-mol/L; UMOL/L; KEEP
    
    # TODO: 81.861 is still the unconverted NUHS value, while the dataset reports
    # percentages; change to the percentage threshold (9% from the Mayo Clinic
    # document; 9.6% by converting the NUHS threshold)
    "HBA1C" : (lambda x: x > 81.861), # milli-mol/L; percentage
    
    "GLUCOSE FASTING" : (lambda x: x > 12.22), # milli-mol/L; mmol/L; KEEP
    
    "MDRD eGFR" : (lambda x: x < 15), # eGFR, < 15 ml/min; NO UNIT; KEEP
    "CORRECTED CALC CG eGFR" : (lambda x: x < 15), # eGFR, < 15 ml/min; NO UNIT; KEEP
    "CKD-EPI eGFR": (lambda x: x < 15), # eGFR, < 15 ml/min; NO UNIT; KEEP
}

### 4.2 Data cleaning

In [36]:
# value-cleaning dictionary: per test, maps censored strings to numeric stand-ins
value_cleaner = {
    "TSH RECEPTOR ANTIBODY": {
        "> 40.0" : 40.1,
    },
    "TSH RECEPTOR ANTIBODIES (TRAb)": {
        "< 0.90" : 0.89, # will cause issues with the threshold for 92 patients
        "< 1.10" : 1.09, # will cause issues with the threshold for 92 patients
        "> 40.00": 40.1,
    },
    "BILIRUBIN,TOTAL" : {
        "< 3": 2.99,
        "< 2": 1.99,
    },
    "URINE TOTAL PROTEIN" : {
        "> 20.0": 20.1,
        "> 20" : 20.1,
        "< 0.023": 0.0229,
    },
    "CREATININE" : {
        "< 15" : 14.99,
        "> 2200" : 2200.1,
    },
    "HBA1C" : {
        "> 20.1" : 20.2,
        "< 4.2" : 4.19,
    },
}

In [37]:
def clean_values(row):
    # replace known censored strings (e.g. "< 15") with their numeric stand-ins
    mapper = value_cleaner.get(row[test_name_col])
    if mapper:
        row[result_col] = mapper.get(row[result_col], row[result_col])
    return row
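
Applying `clean_values` row by row touches every record individually. As a possible alternative, this minimal sketch applies each test's mapping to a boolean mask of that test's rows. The function name `clean_values_by_test` and the toy data are hypothetical.

```python
import pandas as pd


def clean_values_by_test(labs, cleaner, test_col, value_col):
    """Apply per-test string-to-number replacements, one mask per test."""
    out = labs.copy()
    out[value_col] = out[value_col].astype(object)  # allow mixed str/float
    for test, mapping in cleaner.items():
        mask = out[test_col] == test
        # Values without an entry in the mapping are left unchanged.
        out.loc[mask, value_col] = out.loc[mask, value_col].map(
            lambda v: mapping.get(v, v)
        )
    return out


toy_cleaner = {"CREATININE": {"< 15": 14.99}}
toy = pd.DataFrame({
    "Lab Resulted Order Test Description": ["CREATININE", "CREATININE", "HBA1C"],
    "Result Value": ["< 15", "88", "< 15"],
})
cleaned = clean_values_by_test(
    toy, toy_cleaner, "Lab Resulted Order Test Description", "Result Value"
)
```

The `HBA1C` row keeps its "< 15" string because the toy mapping only covers `CREATININE`, mirroring the per-test scoping of `value_cleaner`.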

In [38]:
tqdm.pandas()
secondary_labs = df.loc[df[test_name_col].isin(secondary_thresholds.keys())]

# update values
secondary_labs_cleaned = secondary_labs.progress_apply(lambda row: clean_values(row), axis=1)

# sanity check : will show only ND values are left
display(secondary_labs_cleaned[result_col][secondary_labs_cleaned[result_col].apply(lambda x: not_startswithdigit(x))].unique())

# address ND and NaN values; cast to float
secondary_labs_cleaned = secondary_labs_cleaned.loc[secondary_labs_cleaned[result_col] != 'ND']
secondary_labs_cleaned[result_col] = secondary_labs_cleaned[result_col].astype(float) 
secondary_labs_cleaned = secondary_labs_cleaned.dropna(subset=result_col)

# sanity checks for ND and NaNs
display(secondary_labs_cleaned[result_col][secondary_labs_cleaned[result_col].apply(lambda x: not_startswithdigit(x))].unique())
display(secondary_labs_cleaned[result_col].isna().any())

100%|██████████████████████████████████████████| 1082833/1082833 [00:41<00:00, 25884.23it/s]


array(['ND'], dtype=object)

array([], dtype=float64)

False

In [39]:
# # tests need to be preprocessed according to Result Values seen for each tests
# # WHAT UNIT IS THIS? LIKELY SAME AS NUHS, use NUHS threshold
# for test in secondary_thresholds.keys():
#     print(test)
#     temp = df.loc[df[test_name_col].isin([test])]
#     display(temp[result_col][temp[result_col].apply(lambda x: not_startswithdigit(x))].unique())
#     # describe
#     display(temp[result_col][temp[result_col].apply(lambda x: not not_startswithdigit(x))].astype(float).describe())
#     print("\n\n")

In [40]:
# patient count checker
# temp = df.loc[df[test_name_col].isin(["TSH RECEPTOR ANTIBODY", "TSH RECEPTOR ANTIBODIES (TRAb)"])]
# temp[temp[result_col].isin(["< 0.90", "< 1.10"])]["Patient ID"].unique().size

In [41]:
secondary_labs_cleaned.loc[secondary_labs_cleaned[test_name_col]=="ABC"]

Unnamed: 0,Institution Code,Patient ID,Gender,Race,Date of Birth,Nationality,Country of Residence,Resident Indicator,Case No,Result Comment Date,...,Lab Resulted Order Test Description,Lab Test Display,Lab Resulted Order Test Type,Lab Resulted Order Test Type (KKH Only),Lab Resulted Order Test Display,Result Value,Specimen Received Date.1,Admit Date,Specimen Collection Date,Visit Date


### 4.3 Thresholding

In [42]:
def threshold_labs(row):
    thresholder = secondary_thresholds.get(row[test_name_col])
    if thresholder:
        return thresholder(row[result_col])
    else:
        return False
    # else:
    #     raise ValueError(f"{row[test_name_col]} not in secondary_thresholds.")
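
The same thresholding can be sketched without a row-wise apply by storing each rule as a comparison and a cutoff instead of a lambda. This is an illustration only: the names `toy_thresholds` and `flag_abnormal` are hypothetical, the two rules mirror entries of `secondary_thresholds`, and the sketch assumes the frame has a unique index and float result values.

```python
import operator

import pandas as pd

# Illustrative subset: (comparison, cutoff) pairs for two tests.
toy_thresholds = {
    "CREATININE": (operator.gt, 229.84),
    "MDRD eGFR": (operator.lt, 15),
}


def flag_abnormal(labs, thresholds, test_col, value_col):
    """Boolean Series: True where a result breaches its test's threshold."""
    flags = pd.Series(False, index=labs.index)
    for test, (compare, cutoff) in thresholds.items():
        mask = labs[test_col] == test
        flags.loc[mask] = compare(labs.loc[mask, value_col], cutoff)
    return flags  # tests without a rule stay False


toy = pd.DataFrame({
    "Lab Resulted Order Test Description": [
        "CREATININE", "CREATININE", "MDRD eGFR", "ALBUMIN",
    ],
    "Result Value": [300.0, 100.0, 10.0, 1.0],
})
flags = flag_abnormal(
    toy, toy_thresholds, "Lab Resulted Order Test Description", "Result Value"
)
```

As in `threshold_labs`, a test with no entry in the rules is never flagged.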

In [43]:
secondary_labs_final = secondary_labs_cleaned.loc[secondary_labs_cleaned.progress_apply(lambda row: threshold_labs(row), axis=1)]

100%|█████████████████████████████████████████| 1079203/1079203 [00:05<00:00, 183332.69it/s]


In [44]:
secondary_labs_final["Patient ID"].unique().size

18824

In [45]:
secondary_labs_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 196140 entries, 73 to 30942124
Data columns (total 25 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   Institution Code                         196140 non-null  object 
 1   Patient ID                               196140 non-null  object 
 2   Gender                                   196139 non-null  object 
 3   Race                                     196138 non-null  object 
 4   Date of Birth                            196139 non-null  object 
 5   Nationality                              196138 non-null  object 
 6   Country of Residence                     9472 non-null    object 
 7   Resident Indicator                       196139 non-null  object 
 8   Case No                                  196140 non-null  object 
 9   Result Comment Date                      0 non-null       float64
 10  Specimen Received Date            

## Filtering for secondary labs based on index date

In [46]:
def check_secondary_labs(
    row,
    date_col,
    ref_df,
):
    """Return True if the patient has a thresholded secondary-cause lab
    result dated within the 365 days up to and including the Index Date."""
    pid = row["Patient ID"]
    index_date = row["Index Date"]

    # NOTE: this conversion is repeated for every row; converting ref_df once
    # before calling progress_apply would be faster
    ref_df[date_col] = pd.to_datetime(ref_df[date_col])
    pid_secondary_labs = ref_df[ref_df["Patient ID"] == pid]
    
    # keep only records dated 0 to 365 days before the index date
    pid_secondary_labs = pid_secondary_labs[
        (index_date - pid_secondary_labs[date_col] >= pd.Timedelta(days=0)) & 
        (index_date - pid_secondary_labs[date_col] <= pd.Timedelta(days=365))
    ]
    return not pid_secondary_labs.empty
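
`check_secondary_labs` filters the full reference frame once per patient. A merge-based variant of the same window test is sketched below as a possible alternative. The function name `patients_with_labs_in_window` and its parameters are hypothetical; the window (0 to 365 days before the Index Date) follows the comparison used in `check_secondary_labs`.

```python
import pandas as pd


def patients_with_labs_in_window(index_df, labs_df, date_col, days=365):
    """Patient IDs with a lab record 0..days days before their Index Date."""
    labs = labs_df[["Patient ID", date_col]].copy()
    labs[date_col] = pd.to_datetime(labs[date_col])  # convert once
    merged = index_df.merge(labs, on="Patient ID", how="inner")
    delta = merged["Index Date"] - merged[date_col]
    in_window = (delta >= pd.Timedelta(days=0)) & (delta <= pd.Timedelta(days=days))
    return set(merged.loc[in_window, "Patient ID"])


toy_index = pd.DataFrame({
    "Patient ID": ["a", "b", "c"],
    "Index Date": pd.to_datetime(["2021-01-01"] * 3),
})
toy_labs = pd.DataFrame({
    "Patient ID": ["a", "b", "b"],
    # a: inside the window; b: one record too old, one after the index date
    "Specimen Received Date": ["2020-06-01", "2019-01-01", "2021-02-01"],
})
flagged = patients_with_labs_in_window(toy_index, toy_labs, "Specimen Received Date")
```

The returned set can then be used with `isin` to split the index table into patients with and without secondary-cause labs.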

In [47]:
tqdm.pandas()
ldlc_valid_18_nontg_index = ldlc_valid_18_nontg_index.reset_index()


date_col = "Specimen Received Date"
with_secondary_labs = ldlc_valid_18_nontg_index[
    ldlc_valid_18_nontg_index.progress_apply(
        check_secondary_labs, args=(date_col, secondary_labs_final), axis=1)
]

100%|█████████████████████████████████████████████████| 24122/24122 [07:55<00:00, 50.70it/s]


In [48]:
with_secondary_labs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3310 entries, 4 to 24120
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Patient ID  3310 non-null   object        
 1   LDL-C Max   3310 non-null   float64       
 2   Index Date  3310 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 103.4+ KB


In [49]:
with_secondary_labs["Patient ID"].unique().size

3310

In [50]:
with_secondary_labs.to_csv(
    os.path.join(PROJECT_ROOT, "results", "with_secondary_labs.csv"), index=False,
)

In [51]:
ldlc_valid_18_nontg_index_nosecondarylab = ldlc_valid_18_nontg_index.loc[~ldlc_valid_18_nontg_index["Patient ID"].isin(with_secondary_labs["Patient ID"].unique().tolist())]
ldlc_valid_18_nontg_index_nosecondarylab["Patient ID"].unique().size

20812

In [52]:
ldlc_valid_18_nontg_index_nosecondarylab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20812 entries, 0 to 24121
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Patient ID  20812 non-null  object        
 1   LDL-C Max   20812 non-null  float64       
 2   Index Date  20812 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 650.4+ KB


In [53]:
ldlc_valid_18_nontg_index_nosecondarylab.to_csv(
    os.path.join(PROJECT_ROOT, "results", "ldlc_valid_18_nontg_index_nosecondarylab.csv"), index=False,
)

## End