# Space

In [1]:
import os
import logging
import pandas as pd 
from IPython.display import display, HTML
KEY = 'WorkSpace'
WORKSPACE_PATH = os.getcwd().split(KEY)[0] + KEY
# print(WORKSPACE_PATH)
os.chdir(WORKSPACE_PATH)
import sys
from proj_space import PROJECT, TaskName, SPACE
sys.path.append(SPACE['CODE_FN'])
SPACE['WORKSPACE_PATH'] = WORKSPACE_PATH
recfldtkn_config_path = os.path.join(SPACE['CODE_RFT'], 'config_recfldtkn')

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format='[%(levelname)s:%(asctime)s:(%(filename)s@%(lineno)d %(name)s)]: %(message)s')
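The setup cell derives the workspace root by splitting the current directory at the folder name `WorkSpace`. One pitfall: if that folder is not on the path, `os.getcwd().split(KEY)[0] + KEY` still returns a string, just not a real directory. The sketch below shows the same idea with an explicit check; `find_workspace_root` is a hypothetical helper for illustration, not part of `recfldtkn`.

```python
def find_workspace_root(cwd: str, key: str = 'WorkSpace') -> str:
    """Return the part of `cwd` up to and including the first occurrence of `key`."""
    head, sep, _tail = cwd.partition(key)
    if not sep:
        # Without this check, cwd.split(key)[0] + key would silently
        # produce a path that does not exist.
        raise ValueError(f'{key!r} not found in {cwd!r}')
    return head + key

# Example path (made up for illustration).
print(find_workspace_root('/home/user/WorkSpace/notebooks/demo'))
```

In the notebook itself this would be called as `find_workspace_root(os.getcwd())` before `os.chdir`.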


# [Part 1] Prepare Record Yaml

Expected outcome:

You will understand the raw data

You will get a record yaml file. 

## [Step 1] Assign RecName

Motivation: We use a yaml file to store the information about each record ('Rec'). To link a 'Rec' to its yaml file, assign it a descriptive RecName. This name is the identifier used to find and load the yaml file, so choose one that clearly describes the record.

Aim: assign RecName

Input: yaml file names 

Output: RecName

Instruction: Set RecName for your specific Rec, for example ```RecName = 'P'  # <-------- select your yaml file name```

In [2]:
###########################
RecName = 'MedPres'# <-------- select your yaml file name
###########################

## [Step 2] Get Necessary Args
Motivation: Prepare necessary Args for future development.

Aim: get cohort_args and record_args

Input: ```recfldtkn_config_path, SPACE, RecName, cohort_args```

Output: ```cohort_args,record_args```

Instruction: Run the following code.



In [3]:
from recfldtkn.configfn import load_cohort_args
from recfldtkn.configfn import load_record_args

cohort_args = load_cohort_args(recfldtkn_config_path, SPACE)
record_args = load_record_args(RecName, cohort_args)

## [Step 3] Create and Update Record Yaml
Motivation: To store configuration and information.

Aim: create Yaml file for rec

Input: information about raw_data_path, RawRootID, RecNumColumn and raw_columns

Output: Yaml file

Instruction: 
1. change COHORT_NAME_XXXXXX
2. change raw_data_path
3. change RawRootID
4. change raw_columns



**template**

```yaml
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% TableBase sources from different cohorts.
CohortInfo: # Cohort
  COHORT_NAME_XXXXXX: # <---- change this.
    TABLE1: 
      raw_data_path: $DATA_RAW$/Cohort_Folder_XXXXXXX/raw_table_file_name1_XXXXXXXXX.csv
      RawRootID: XXXXXXXXX
      RecNumColumn: XXXXXXXXX # in Human2RecNum, the related raw table name
      raw_columns: 
        - XXXXX # <--- to update during RecAttr 
        - XXXXX

    TABLE2:  # <-------- MOST OF THE TIME, WE DON'T NEED TABLE2.
      raw_data_path: $DATA_RAW$/Cohort_Folder_XXXXXXX/raw_table_file_name2_XXXXXXXXX.csv
      RawRootID: XXXXXXXXX
      RecNumColumn: XXXXXXXXX # in Human2RecNum, the related raw table name
      raw_columns: 
        - XXXXX # <--- to update during RecAttr 
        - XXXXX
```
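The template paths start with the placeholder `$DATA_RAW$`, while the paths printed later in this notebook start with `../_Data/0-Data_Raw`. How `recfldtkn` performs that substitution is not shown here, so the following is only a minimal sketch of the idea. The helper `expand_placeholders` and the mapping `space` are hypothetical names.

```python
import re

def expand_placeholders(path: str, space: dict) -> str:
    """Replace every $NAME$ token in `path` with space['NAME']."""
    def lookup(match):
        name = match.group(1)
        if name not in space:
            raise KeyError(f'no value for placeholder ${name}$')
        return space[name]
    return re.sub(r'\$([A-Za-z_][A-Za-z0-9_]*)\$', lookup, path)

# Hypothetical mapping; the real values come from the project's SPACE settings.
space = {'DATA_RAW': '../_Data/0-Data_Raw'}
template = '$DATA_RAW$/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv'
print(expand_placeholders(template, space))
```

Raising on an unknown placeholder makes a typo in the yaml visible immediately instead of producing a path that fails later.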

In [4]:
# Create a HTML link and display it
path = record_args['yaml_file_path']
full_path = os.path.join(WORKSPACE_PATH, path)
display(HTML(f'{path} <a href="{full_path}" target="_blank">Open File</a>'))

## [Step 4] Update Yaml for record's Meta


In [5]:
# Create a HTML link and display it
path = record_args['yaml_file_path']
full_path = os.path.join(WORKSPACE_PATH, path)
display(HTML(f'{path} <a href="{full_path}" target="_blank">Open File</a>'))

**template**

```yaml
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
RecName: XXX    # name of the record.
RecID: XXX   # ID of the record. It does not have to follow the pattern RecID = RecName + 'ID'.
RawRecID: 
  - XXX
RecIDChain: 
  - XXX
ParentRecName:  # if no parent record, set it to empty. 
RecDT:          # if no RecDT, set it to empty. 
```

**your yaml**

```yaml
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
RecName: P    # name of the record.
RecID: PID   # ID of the record. It does not have to follow the pattern RecID = RecName + 'ID'.
RawRecID: 
  - PatientID
RecIDChain: 
  - P
  - PatientID
ParentRecName:  # if no parent record, set it to empty. 
RecDT:          # if no RecDT, set it to empty. 
```
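Before moving on, it can help to check that the meta block is complete. The sketch below works on a plain dict with the same keys as the template. The rule that only `ParentRecName` and `RecDT` may be empty follows the template comments; treating every other key as required and non-empty is an assumption, and `check_record_meta` is a hypothetical helper, not part of `recfldtkn`.

```python
REQUIRED_KEYS = ['RecName', 'RecID', 'RawRecID', 'RecIDChain', 'ParentRecName', 'RecDT']
MAY_BE_EMPTY = {'ParentRecName', 'RecDT'}  # per the comments in the template

def check_record_meta(meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in meta:
            problems.append(f'missing key: {key}')
        elif meta[key] in (None, '', []) and key not in MAY_BE_EMPTY:
            problems.append(f'empty value: {key}')
    for key in ('RawRecID', 'RecIDChain'):
        # Both keys are written as yaml lists in the template.
        if meta.get(key) is not None and not isinstance(meta[key], list):
            problems.append(f'{key} should be a list')
    return problems

# The "your yaml" example above, written as a dict.
meta = {
    'RecName': 'P',
    'RecID': 'PID',
    'RawRecID': ['PatientID'],
    'RecIDChain': ['P', 'PatientID'],
    'ParentRecName': None,
    'RecDT': None,
}
print(check_record_meta(meta))
```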

## [Step 5] Select One Cohort

Motivation: We want to choose one cohort and test our code in this one cohort.

Aim: Specify a cohort

Input: Cohort Yaml

Output: Cohort name and Cohort label of the cohort we want to choose.

Instruction: Change ```args_information = ['--cohort_label', '1'] ```

In [6]:
################### in notebook ###################
args_information = ['--cohort_label', '2']
###################################################

import argparse
my_parser = argparse.ArgumentParser(description='Process Input.')

# Add the arguments
my_parser.add_argument('--cohort_name',
                    metavar='cohort_name',
                    default = None, 
                    type=str,
                    help='the cohort_name to process')

my_parser.add_argument('--cohort_label',
                    metavar='cohort_label',
                    default = None, 
                    type=str,
                    help='the label for cohort_name to process')



args = my_parser.parse_args(args_information)
cohort_label = int(args.cohort_label)
cohort_config = [v for k, v in cohort_args['CohortInfo'].items() if v['cohort_label'] == cohort_label][0]
cohort_name = cohort_config['cohort_name']
print('\n========== cohort_config ==========')
# print(cohort_config)
print(cohort_label, cohort_name)


2 RawData2023_CVSTDCAug
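The selection above takes element `[0]` of a list comprehension, which raises a bare `IndexError` when no cohort carries the requested label. A minimal sketch of the same lookup with a clearer error follows. The `cohort_info` dict is a toy stand-in for `cohort_args['CohortInfo']`, and `select_cohort` is a hypothetical helper.

```python
import argparse

def select_cohort(cohort_info: dict, cohort_label: int) -> dict:
    """Return the config of the single cohort whose 'cohort_label' matches."""
    matches = [cfg for cfg in cohort_info.values() if cfg['cohort_label'] == cohort_label]
    if len(matches) != 1:
        raise ValueError(
            f'expected exactly one cohort with label {cohort_label}, found {len(matches)}'
        )
    return matches[0]

parser = argparse.ArgumentParser(description='Process Input.')
# type=int lets argparse do the conversion and report bad input itself.
parser.add_argument('--cohort_label', type=int, default=None,
                    help='the label for cohort_name to process')

# Toy stand-in for cohort_args['CohortInfo'] (assumed structure).
cohort_info = {
    'RawData2022_CGM': {'cohort_label': 1, 'cohort_name': 'RawData2022_CGM'},
    'RawData2023_CVSTDCAug': {'cohort_label': 2, 'cohort_name': 'RawData2023_CVSTDCAug'},
}

args = parser.parse_args(['--cohort_label', '2'])
cohort_config = select_cohort(cohort_info, args.cohort_label)
print(args.cohort_label, cohort_config['cohort_name'])
```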


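## [Step 6] Get df_Human and df_Prt, and Save Them in record_args for the Selected OneCohort
Motivation: Later steps merge the raw records with the human-level table (df_Human) and the parent record table (df_Prt), so both must be loaded and restricted to the selected cohort first.

Aim: Update record_args

Input: ```cohort_args, record_args, cohort_label```

Output: ```record_args['df_Prt'], record_args['prt_record_args']```

Instruction:
1. Remember to restart the notebook to fully load the updated yaml files.
2. Run the following code.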

In [7]:
#######################
cohort_label_list = [cohort_label]
#######################

In [8]:
from recfldtkn.loadtools import filter_with_cohort_label, load_ds_rec_and_info
from recfldtkn.pipeline_record import get_parentRecord_info

RootID = cohort_args['RootID']

ds_Human, _ = load_ds_rec_and_info(cohort_args['RecName'], cohort_args, cohort_label_list = cohort_label_list)
df_Human = ds_Human.to_pandas()

df_Human

[INFO:2024-04-19 01:00:19,491:(config.py@58 datasets)]: PyTorch version 2.1.2+cu121 available.


Unnamed: 0,PID,PatientID,BPMeter,CurriculumCourseBadgeDetails,CurriculumCourseProgressDetails,CurriculumLessonProgressDetails,CurriculumQuizResponse,CurriculumQuizResult,CurriculumSurveyResponse,CurriculumTopicProgressDetails,...,PatientTargetSegment,RecentElogBGEntry,SleepEntry,StepEntry,TDC,UserDetail,WeightGoal,WeightMeter,TotalRecNum,CohortLabel
0,2000001,66,,,,,,,,,...,1.0,,,,5.0,1.0,,,1,2
1,2000002,67,,,,,,,,,...,1.0,,,,5.0,1.0,1.0,,4,2
2,2000003,70,,,1.0,3.0,,,,1.0,...,1.0,,,97.0,14.0,1.0,1.0,,254,2
3,2000004,88,,,,,,,,,...,1.0,,9.0,161.0,5.0,1.0,,,301,2
4,2000005,90,,,,,,,,,...,1.0,,,,5.0,1.0,,,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4251,2004252,48963,,,1.0,1.0,,,,1.0,...,1.0,,,,5.0,1.0,,,3,2
4252,2004253,48966,,,1.0,1.0,,,,1.0,...,1.0,,,,5.0,1.0,,,3,2
4253,2004254,48967,,2.0,2.0,4.0,,,,1.0,...,1.0,,,,5.0,1.0,,,9,2
4254,2004255,48970,,,1.0,1.0,,,,1.0,...,1.0,,,,5.0,1.0,,,17,2


In [9]:

#########--------
try:
    ds_P, _ = load_ds_rec_and_info('P', cohort_args, cohort_label_list = cohort_label_list)
    print(ds_P)
    df_P = ds_P.to_pandas()[[RootID, 'UserTimeZoneOffset']].rename(columns = {'UserTimeZoneOffset': 'user_tz'})
    df_Human = pd.merge(df_Human, df_P, how = 'left', on = RootID)
    print('SUCCESS ------> user_tz is available')
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("No user_timezone available")
#########--------


Dataset({
    features: ['PID', 'PatientID', 'YearOfBirth', 'ActivationDate', 'MRSegmentModifiedDateTime', 'UserTimeZone', 'UserTimeZoneOffset', 'Gender', 'MRSegmentID', 'DiseaseType'],
    num_rows: 4256
})
SUCCESS ------> user_tz is available
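The cell above attaches `user_tz` to `df_Human` with a left merge. A left merge keeps the row count only when the key is unique on the right side, and pandas can check this through the `validate` argument of `pd.merge`. A small sketch on toy data (the values below are made up):

```python
import pandas as pd

# Toy stand-ins for df_Human and the 'P' record table.
df_human = pd.DataFrame({'PID': [2000001, 2000002, 2000003]})
df_p = pd.DataFrame({'PID': [2000001, 2000002], 'UserTimeZoneOffset': [-300, -240]})

df_tz = df_p[['PID', 'UserTimeZoneOffset']].rename(columns={'UserTimeZoneOffset': 'user_tz'})

# validate='one_to_one' raises pandas.errors.MergeError if PID is duplicated
# on either side, so the merge cannot silently multiply rows.
merged = pd.merge(df_human, df_tz, how='left', on='PID', validate='one_to_one')

print(len(merged) == len(df_human))   # row count is preserved
print(merged['user_tz'].isna().sum()) # humans without a time-zone offset
```

Humans that are missing from the right table keep their row and get a missing `user_tz`, which is worth counting before any step that asserts there are no missing datetimes.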


In [10]:
from recfldtkn.pipeline_record import get_parentRecord_info

In [11]:
record_args['ParentRecName']

'P'

In [12]:

################
rft_config = {'base_config': cohort_args}

parentResult = get_parentRecord_info(record_args, rft_config, df_Human)
prt_record_args = parentResult['prt_record_args']
df_Prt = parentResult['df_Prt']
df_Human = parentResult['df_Human']
##########################

df_Prt

Unnamed: 0,PID,PatientID
0,1000001,6
1,1000002,10
2,1000003,11
3,1000004,13
4,1000005,14
...,...,...
11616,3000065,3549
11617,3000066,3552
11618,3000067,3625
11619,3000068,3686


In [13]:

print(df_Prt.shape)
df_Prt = filter_with_cohort_label(df_Prt, cohort_label, cohort_args)
print(df_Prt.shape)

(11621, 2)
(4256, 2)


In [14]:

df_Human = df_Human[df_Human[RootID].isin(df_Prt[RootID].to_list())].reset_index(drop = True)
record_args['df_Prt'] = df_Prt
record_args['prt_record_args'] = prt_record_args

In [15]:
df_Human.head()

Unnamed: 0,PID,PatientID,BPMeter,CurriculumCourseBadgeDetails,CurriculumCourseProgressDetails,CurriculumLessonProgressDetails,CurriculumQuizResponse,CurriculumQuizResult,CurriculumSurveyResponse,CurriculumTopicProgressDetails,...,RecentElogBGEntry,SleepEntry,StepEntry,TDC,UserDetail,WeightGoal,WeightMeter,TotalRecNum,CohortLabel,user_tz
0,2000001,66,,,,,,,,,...,,,,5.0,1.0,,,1,2,-300
1,2000002,67,,,,,,,,,...,,,,5.0,1.0,1.0,,4,2,-300
2,2000003,70,,,1.0,3.0,,,,1.0,...,,,97.0,14.0,1.0,1.0,,254,2,-240
3,2000004,88,,,,,,,,,...,,9.0,161.0,5.0,1.0,,,301,2,-360
4,2000005,90,,,,,,,,,...,,,,5.0,1.0,,,1,2,-300


In [16]:
df_Prt.head()

Unnamed: 0,PID,PatientID
0,2000001,66
1,2000002,67
2,2000003,70
3,2000004,88
4,2000005,90


## [Step 7] OneCohortRec_args

In [17]:
print(cohort_name)
OneCohortRec_args = record_args['CohortInfo'][cohort_name]
OneCohortRec_args['cohort_name'] = cohort_name  
OneCohortRec_args['cohort_label'] = cohort_label
OneCohortRec_args

RawData2023_CVSTDCAug


{'MedPrescription': {'raw_data_path': '../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv',
  'RawRootID': 'PatientID',
  'RawName': 'MedPrescription'},
 'cohort_name': 'RawData2023_CVSTDCAug',
 'cohort_label': 2}

In [18]:
source_path_not_existence_flag = 0
for tablename, tableinfo in OneCohortRec_args.items():
    if tablename in ['cohort_name', 'cohort_label']: continue
    filename = tableinfo['raw_data_path']
    print(filename)
    if not os.path.exists(filename): 
        source_path_not_existence_flag += 1
    else:
        if filename.endswith('.csv'):
            df = pd.read_csv(filename, nrows=0)
            raw_tables_columns = list(df.columns)
            print('\n=======================')
            print(filename)
            for i in raw_tables_columns:
                print('-', i)
            print('=======================\n\n')
        elif filename.endswith('.p'):
            df = pd.read_pickle(filename)
            raw_tables_columns = list(df.columns)
            print('\n=======================')
            print(filename)
            for i in raw_tables_columns:
                print('-', i)
            print('=======================\n\n')

if source_path_not_existence_flag > 0:
    print(f'=== source_path_not_existence_flag: {source_path_not_existence_flag}')

OneCohortRec_args = record_args['CohortInfo'][cohort_name]
print('\n========== OneCohortRec_args ==========')
print(OneCohortRec_args)

../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv

../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv
- MedPrescriptionID
- PatientID
- MedicationID
- FrequencyType
- FrequencyValue
- StartDate
- StatusID
- ValidationStatusID
- DiscontinuedReasonID
- DiscontinuedReason
- DiscontinuedDate
- DiscontinuedBy
- RowVersionID
- RejectReasonID
- CreatedDate
- ModifiedDate
- StartDateTimeZoneOffset
- StartDateTimeZone
- DiscontinuedDateTimeZoneOffset
- DiscontinuedDateTimeZone
- PrescriptionGUID
- ReasonID
- ReasonText
- RefillDate
- RefillReminder
- DIA
- DIAOffset
- DIAValueUpdatedDate
- Notes



{'MedPrescription': {'raw_data_path': '../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv', 'RawRootID': 'PatientID', 'RawName': 'MedPrescription'}, 'cohort_name': 'RawData2023_CVSTDCAug', 'cohort_label': 2}
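The loop above only needs the column names, which is why it reads the CSV with `nrows=0`. The same check can be sketched with the standard library alone, which also removes the duplicated printing branches. `list_csv_columns` and `report_tables` are hypothetical helpers, and the file below is a temporary toy file.

```python
import csv
import os
import tempfile

def list_csv_columns(path: str) -> list:
    """Return the header row of a CSV file without reading the data rows."""
    with open(path, newline='', encoding='utf-8') as f:
        return next(csv.reader(f), [])

def report_tables(table_paths: dict) -> dict:
    """Map each table name to its columns, or to None when the file is missing."""
    report = {}
    for name, path in table_paths.items():
        report[name] = list_csv_columns(path) if os.path.exists(path) else None
    return report

# Toy file standing in for a raw table.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'MedPrescription.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        f.write('MedPrescriptionID,PatientID,MedicationID\n64,88,545999\n')
    result = report_tables({
        'MedPrescription': path,
        'Missing': os.path.join(tmp, 'nope.csv'),
    })

print(result)
```

Tables that map to `None` are the ones counted by `source_path_not_existence_flag` in the cell above.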


## [Step 8] **Important** Select useful Raw Columns

Motivation: Based on your understanding of the data, choose the useful raw columns.

Aim: Select useful raw columns

Input: ```raw_data_path```

Output: ```raw_columns```

Instruction: Run the following code and choose raw_columns based on your specific project.

In [19]:
for tablename, tableinfo in OneCohortRec_args.items():
    if tablename in ['cohort_name', 'cohort_label']: continue
    print(tablename)
    print(tableinfo)
    print('\n')

MedPrescription
{'raw_data_path': '../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv', 'RawRootID': 'PatientID', 'RawName': 'MedPrescription'}




In [20]:
tablename_list =  [i for i in OneCohortRec_args if i not in ['cohort_name', 'cohort_label']]
tablename_list

['MedPrescription']
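Several cells skip the keys `cohort_name` and `cohort_label` when looping over `OneCohortRec_args`, because Step 7 stores them in the same dict as the table entries. A small generator, sketched below with the hypothetical name `iter_tables`, keeps that rule in one place. The dict is copied from the Step 7 output.

```python
METADATA_KEYS = ('cohort_name', 'cohort_label')

def iter_tables(one_cohort_args: dict):
    """Yield (table name, table info) pairs, skipping the cohort metadata keys."""
    for name, info in one_cohort_args.items():
        if name in METADATA_KEYS:
            continue
        yield name, info

one_cohort_args = {
    'MedPrescription': {
        'raw_data_path': '../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv',
        'RawRootID': 'PatientID',
        'RawName': 'MedPrescription',
    },
    'cohort_name': 'RawData2023_CVSTDCAug',
    'cohort_label': 2,
}

tablename_list = [name for name, _ in iter_tables(one_cohort_args)]
print(tablename_list)
```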

In [21]:
tablename = tablename_list[0]
tableinfo = OneCohortRec_args[tablename]
raw_data_path = tableinfo['raw_data_path']
print(raw_data_path)

../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv


In [22]:
# After checking the columns, you will find some useful raw columns
###########################################################
df = pd.read_csv(raw_data_path, low_memory=False)

######################## <- you need to test this
raw_columns = []
########################


for i in raw_columns:
    print('-', i)
###########################################################

## [Step 9] Update Yaml: OneCohort's Table raw_columns
Motivation: Record the selected raw columns in the yaml file.

Instruction: Copy the raw_columns printed above into the `raw_columns` attribute of the corresponding table in the yaml file.



In [23]:
# Create a HTML link and display it
path = record_args['yaml_file_path']
full_path = os.path.join(WORKSPACE_PATH, path)
display(HTML(f'{path} <a href="{full_path}" target="_blank">Open File</a>'))

**example**

```yaml
RawData2022_CGM:
    TableFile1: 
      raw_data_path: '$DATA_RAW$/RawData2022_CGM/05_02_2022_Patient.csv'
      RawRootID: 'PatientID'  # for merging purpose
      RecNumColumn: 'Patient' # Column in PRawRecNum
      raw_columns: 
        - PatientID
        - MaritalStatusID
        - RaceID
        - EthinicityID
        - LevelOfEducationID
        - MRSegmentID
        - MRSegmentModifiedDateTime
        - DiseaseType
        - DiseaseCombinationID
        - PAPEligibility
        - PAPStatus
        - PAPStatusReason
```
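Copying column names by hand is easy to get wrong. A minimal sketch, assuming the selected names are in a Python list, checks them against the columns of the raw table and prints them in the list layout used in the yaml above. `check_and_format_columns` is a hypothetical helper, and the column lists are a shortened toy example.

```python
def check_and_format_columns(raw_columns, available_columns, indent=8):
    """Verify the selected columns exist, then format them as yaml list items."""
    missing = [c for c in raw_columns if c not in available_columns]
    if missing:
        raise ValueError(f'columns not found in the raw table: {missing}')
    return '\n'.join(' ' * indent + f'- {c}' for c in raw_columns)

# Shortened toy example; in the notebook, `available` would be list(df.columns).
available = ['MedPrescriptionID', 'PatientID', 'MedicationID', 'FrequencyType', 'StartDate']
raw_columns = ['MedPrescriptionID', 'PatientID', 'MedicationID']

yaml_lines = check_and_format_columns(raw_columns, available)
print('      raw_columns:')
print(yaml_lines)
```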

# [Part 2] Load HumanRecRaw


A pipeline function handles this step.

If you are interested in how the pipeline works, see the functions `get_df_HumanSelected_from_OneCohortRecArgs` and `get_HumanRawRec_for_HumanGroup`.

They take the configuration from `record.yaml` and load the data as `dfHumanRecRaw`.

Before running this part, your yaml file must be ready.

In [24]:
print([tablename for tablename in OneCohortRec_args])

['MedPrescription', 'cohort_name', 'cohort_label']


## [Step 1] Load the df_HumanRawRec
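Motivation: Load the raw records of the selected cohort for the humans that have at least one record.

Input: ```record_args, OneCohortRec_args, df_Human, cohort_args```

Output: ```df_HumanSelected, df_HumanRawRec```

Instruction: Run the following code.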

In [25]:
from recfldtkn.pipeline_record import get_df_HumanSelected_from_OneCohortRecArgs
from recfldtkn.pipeline_record import get_HumanRawRec_for_HumanGroup

rec_config = record_args
RawName_to_RawConfig = record_args['RawInfo']
RawName_to_dfRaw = {RawName: i['raw_data_path'] for RawName, i in OneCohortRec_args.items()
                    if RawName not in ['cohort_name', 'cohort_label']}

RawName_to_dfRaw

{'MedPrescription': '../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv'}

In [26]:
RawName_to_RawConfig

{'MedPrescription': {'RawRootID': 'PatientID',
  'RawName': 'MedPrescription',
  'raw_columns': ['MedPrescriptionID',
   'PatientID',
   'MedicationID',
   'FrequencyType',
   'FrequencyValue',
   'DiscontinuedReasonID',
   'DiscontinuedReason',
   'DIA',
   'DIAOffset',
   'DiscontinuedBy',
   'ReasonID',
   'ReasonText',
   'RefillDate',
   'RefillReminder',
   'StatusID',
   'ValidationStatusID',
   'CreatedDate',
   'StartDate',
   'ModifiedDate',
   'DIAValueUpdatedDate',
   'DiscontinuedDate',
   'StartDateTimeZoneOffset']}}

In [27]:

OneCohort_config = OneCohortRec_args
base_config = cohort_args
df_HumanSelected = get_df_HumanSelected_from_OneCohortRecArgs(rec_config,
                                                                RawName_to_RawConfig,
                                                                OneCohort_config,
                                                                df_Human,
                                                                base_config)

logger.info(f'{df_HumanSelected.shape} === df_HumanSelected <-- df_Human: selected from in CohortLabel {cohort_label}: {cohort_name} and with RecordNum > 0')

df_HumanSelected.head()

[INFO:2024-04-19 01:00:20,415:(1217961253.py@9 __main__)]: (1987, 5) === df_HumanSelected <-- df_Human: selected from in CohortLabel 2: RawData2023_CVSTDCAug and with RecordNum > 0


Unnamed: 0,PID,PatientID,CohortLabel,MedPrescription,index_group
0,2000004,88,2,21.0,0
1,2000005,90,2,1.0,0
2,2000006,93,2,9.0,0
3,2000007,94,2,5.0,0
4,2000008,95,2,1.0,0


In [28]:

RawRootID = cohort_args['RawRootID']
for index_group, df_HumanGroup in df_HumanSelected.groupby('index_group'):
    logger.info(f'current index_group: {index_group} ...')

    # ---------------------- this is the core part of the pipeline ----------------------
    # 7.1 get the df_HumanRawRec
    #     this function can be used independently to get the raw df_HumanRawRec.
    df_HumanRawRec = get_HumanRawRec_for_HumanGroup(df_HumanGroup,
                                                    RawName_to_RawConfig,
                                                    RawName_to_dfRaw,
                                                    base_config)
    index = df_HumanRawRec[RawRootID].isin(df_HumanSelected[RawRootID].to_list())
    df_HumanRawRec = df_HumanRawRec[index].reset_index(drop = True)
    logger.info(f'current df_HumanRawRec: {df_HumanRawRec.shape} ...')

    break

[INFO:2024-04-19 01:00:20,427:(1457373213.py@3 __main__)]: current index_group: 0 ...
[INFO:2024-04-19 01:00:20,429:(pipeline_record.py@307 recfldtkn.pipeline_record)]: RawName "MedPrescription" from file: ../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv
[INFO:2024-04-19 01:00:20,474:(1457373213.py@14 __main__)]: current df_HumanRawRec: (10183, 22) ...


## [Step 2] Display df_HumanRawRec

In [29]:
df_HumanRawRec

Unnamed: 0,MedPrescriptionID,PatientID,MedicationID,FrequencyType,FrequencyValue,DiscontinuedReasonID,DiscontinuedReason,DIA,DIAOffset,DiscontinuedBy,...,RefillDate,RefillReminder,StatusID,ValidationStatusID,CreatedDate,StartDate,ModifiedDate,DIAValueUpdatedDate,DiscontinuedDate,StartDateTimeZoneOffset
0,64,88,545999,1,1,0.0,,,,88.0,...,,False,3,2,2/11/2022 5:00:24 PM,2/11/2022 5:00:23 PM,1/29/2023 7:10:30 PM,,1/29/2023 7:10:29 PM,
1,65,88,547330,1,1,0.0,,,,88.0,...,,False,3,2,2/11/2022 5:02:18 PM,2/11/2022 5:02:18 PM,1/29/2023 7:10:02 PM,,1/29/2023 7:10:01 PM,
2,66,88,451442,1,1,,,,,,...,,False,1,4,2/11/2022 5:03:18 PM,2/11/2022 5:03:17 PM,2/11/2022 5:03:18 PM,,,
3,67,88,226954,1,1,,,,,,...,,False,1,4,2/11/2022 5:04:12 PM,2/11/2022 5:04:11 PM,2/11/2022 5:04:12 PM,,,
4,68,88,273426,1,1,,,,,,...,,False,1,2,2/11/2022 5:04:48 PM,2/11/2022 5:04:47 PM,2/11/2022 5:04:48 PM,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10178,20737,48970,155744,1,1,,,,,,...,,False,1,2,8/22/2023 3:05:29 AM,8/22/2023 3:05:28 AM,8/22/2023 3:05:29 AM,,,
10179,20738,48970,175684,1,1,,,,,,...,,False,1,2,8/22/2023 3:05:50 AM,8/22/2023 3:05:50 AM,8/22/2023 3:05:50 AM,,,
10180,20739,38830,453598,1,1,,,,,,...,,False,1,2,8/22/2023 3:10:54 AM,8/22/2023 3:10:54 AM,8/22/2023 3:10:54 AM,,,
10181,20740,48972,241223,1,1,,,,,,...,,False,1,2,8/22/2023 4:11:47 AM,8/22/2023 4:11:47 AM,8/22/2023 4:11:47 AM,,,


# [Part 3] HumanRecAttr

In [30]:
# Create a HTML link and display it
path = record_args['pypath']
full_path = os.path.join(WORKSPACE_PATH, path)
link = f'{path} <a href="{full_path}" target="_blank">Open File</a>'
display(HTML(link))

## [Step 1] **Important** RawRec_to_RecAttr Code

Motivation: To prepare and organize the raw data into a structured format that is suitable for further analysis or processing.

Aim: Determine how to map the raw_columns to clean attribute columns.

Input: df_Prt and df_HumanRawRec from the last step

Output: df with clean attributes

Instruction: Depending on the specific project, we usually need the last three steps.
Refer to the Welldoc example below.




In [31]:
#-------------------
from recfldtkn.pipeline_record import post_record_process

df = df_HumanRawRec

# 1. filter out the records we don't need (optional) 
#df = df[df['StartDateTimeZoneOffset'].abs() < 1000].reset_index(drop = True)

# 2. entry type
df['EntryType'] = 'Manu'
    
# 3. update datetime columns 
DTCol_list = ['CreatedDate',
               'StartDate',
               'ModifiedDate']

for DTCol in DTCol_list: 
    df[DTCol] = pd.to_datetime(df[DTCol], format = 'mixed')

# x1. localize the datetime columns based on the time zone.
a = len(df)
df = pd.merge(df, df_Human[['PatientID', 'user_tz']], how = 'left', on = 'PatientID')
b = len(df)
assert a == b  # the left merge must not duplicate records
df['DT_tz'] = df['StartDateTimeZoneOffset'].replace(0, None).fillna(df['user_tz'])
DTCol = 'DT_r'
DTCol_source = 'CreatedDate'
df[DTCol] = df[DTCol_source]
df[DTCol] = pd.to_datetime(df[DTCol]) + pd.to_timedelta(df['DT_tz'], 'm')
assert df[DTCol].isna().sum() == 0

DTCol = 'DT_s'
DTCol_source = 'StartDate'
df[DTCol] = df[DTCol_source]
df[DTCol] = pd.to_datetime(df[DTCol]).apply(lambda x: None if x <= pd.to_datetime('2010-01-01') else x)
df[DTCol] = pd.to_datetime(df[DTCol]) + pd.to_timedelta(df['DT_tz'], 'm')
df[DTCol] = df[DTCol].fillna(df['DT_r'])
assert df[DTCol].isna().sum() == 0


# x3. drop duplicates
df = df.drop_duplicates()

# ------------------ before merge, no RootID in df ------------------ # 
df = post_record_process(df, record_args)
#-------------------

df.head()

Unnamed: 0,PID,PatientID,MedPrescriptionID,MedicationID,FrequencyType,FrequencyValue,DiscontinuedReasonID,DiscontinuedReason,DIA,DIAOffset,...,ModifiedDate,DIAValueUpdatedDate,DiscontinuedDate,StartDateTimeZoneOffset,EntryType,user_tz,DT_tz,DT_r,DT_s,MedPresID
0,2000004,88,64,545999,1,1,0.0,,,,...,2023-01-29 19:10:30,,1/29/2023 7:10:29 PM,,Manu,-360,-360.0,2022-02-11 11:00:24,2022-02-11 11:00:23,2000004-0
1,2000004,88,65,547330,1,1,0.0,,,,...,2023-01-29 19:10:02,,1/29/2023 7:10:01 PM,,Manu,-360,-360.0,2022-02-11 11:02:18,2022-02-11 11:02:18,2000004-1
2,2000004,88,66,451442,1,1,,,,,...,2022-02-11 17:03:18,,,,Manu,-360,-360.0,2022-02-11 11:03:18,2022-02-11 11:03:17,2000004-2
3,2000004,88,67,226954,1,1,,,,,...,2022-02-11 17:04:12,,,,Manu,-360,-360.0,2022-02-11 11:04:12,2022-02-11 11:04:11,2000004-3
4,2000004,88,68,273426,1,1,,,,,...,2022-02-11 17:04:48,,,,Manu,-360,-360.0,2022-02-11 11:04:48,2022-02-11 11:04:47,2000004-4


## [Step 2] Pin Down Attr Cols and Update Them in Yaml

Motivation: Choose the final attr cols.

Aim: Update the final attr cols in the Yaml file

Input: attr_cols

Output: Yaml file

Instruction: Adapt the following code to your specific project.

**example**

```yaml
attr_cols:
  - PID
  - PatientID
  - YearOfBirth
  - ActivationDate
  - MRSegmentModifiedDateTime
  - UserTimeZone
  - UserTimeZoneOffset
  - Gender
  - MRSegmentID
  - DiseaseType
```

In [32]:
# Create a HTML link and display it
path = record_args['yaml_file_path']
full_path = os.path.join(WORKSPACE_PATH, path)
link = f'{path} <a href="{full_path}" target="_blank">Open File</a>'
display(HTML(link))

In [33]:
RecID = rec_config['RecID']
attr_cols = ['PID', 'PatientID', RecID, 
        'DT_tz',
        'DT_r',
        'DT_s',
        'EntryType',
        'DIA', 
        'FrequencyType',
        'FrequencyValue',
        'MedicationID',
        'DiscontinuedReason']

for i in attr_cols: print('-', i)

df[attr_cols].head()

- PID
- PatientID
- MedPresID
- DT_tz
- DT_r
- DT_s
- EntryType
- DIA
- FrequencyType
- FrequencyValue
- MedicationID
- DiscontinuedReason


Unnamed: 0,PID,PatientID,MedPresID,DT_tz,DT_r,DT_s,EntryType,DIA,FrequencyType,FrequencyValue,MedicationID,DiscontinuedReason
0,2000004,88,2000004-0,-360.0,2022-02-11 11:00:24,2022-02-11 11:00:23,Manu,,1,1,545999,
1,2000004,88,2000004-1,-360.0,2022-02-11 11:02:18,2022-02-11 11:02:18,Manu,,1,1,547330,
2,2000004,88,2000004-2,-360.0,2022-02-11 11:03:18,2022-02-11 11:03:17,Manu,,1,1,451442,
3,2000004,88,2000004-3,-360.0,2022-02-11 11:04:12,2022-02-11 11:04:11,Manu,,1,1,226954,
4,2000004,88,2000004-4,-360.0,2022-02-11 11:04:48,2022-02-11 11:04:47,Manu,,1,1,273426,


## [Step 3] Write down RawRec_to_RecAttr_fn

Motivation: Saving the steps above as a single function, RawRec_to_RecAttr_fn, gives clean and maintainable code that can be easily shared and reused.

Aim: Save RawRec_to_RecAttr_fn

Input: df_HumanRawRec, df_Human, cohort_args, record_args, attr_cols

Output: RawRec_to_RecAttr_fn

Instruction: Copy the code from above into the function body and run it.

In [34]:
from recfldtkn.loadtools import convert_variables_to_pystirng, load_module_variables
import inspect

###########################
def RawRec_to_RecAttr_fn(df_HumanRawRec, df_Human, cohort_args, record_args, attr_cols):
    #-------------------
    from recfldtkn.pipeline_record import post_record_process

    df = df_HumanRawRec

    # 1. filter out the records we don't need (optional) 
    #df = df[df['StartDateTimeZoneOffset'].abs() < 1000].reset_index(drop = True)

    # 2. entry type
    df['EntryType'] = 'Manu'
        
    # 3. update datetime columns 
    DTCol_list = ['CreatedDate',
                'StartDate',
                'ModifiedDate']

    for DTCol in DTCol_list: 
        df[DTCol] = pd.to_datetime(df[DTCol], format = 'mixed')

    # x1. localize the datetime columns based on the time zone.
    a = len(df)
    df = pd.merge(df, df_Human[['PatientID', 'user_tz']], how = 'left', on = 'PatientID')
    b = len(df)
    assert a == b  # the left merge must not duplicate records
    df['DT_tz'] = df['StartDateTimeZoneOffset'].replace(0, None).fillna(df['user_tz'])
    DTCol = 'DT_r'
    DTCol_source = 'CreatedDate'
    df[DTCol] = df[DTCol_source]
    df[DTCol] = pd.to_datetime(df[DTCol]) + pd.to_timedelta(df['DT_tz'], 'm')
    assert df[DTCol].isna().sum() == 0

    DTCol = 'DT_s'
    DTCol_source = 'StartDate'
    df[DTCol] = df[DTCol_source]
    df[DTCol] = pd.to_datetime(df[DTCol]).apply(lambda x: None if x <= pd.to_datetime('2010-01-01') else x)
    df[DTCol] = pd.to_datetime(df[DTCol]) + pd.to_timedelta(df['DT_tz'], 'm')
    df[DTCol] = df[DTCol].fillna(df['DT_r'])
    assert df[DTCol].isna().sum() == 0


    # x3. drop duplicates
    df = df.drop_duplicates()


    # xyz: merge parent, sort records, and generate RecID. 
    df = post_record_process(df, record_args)
    #-------------------
    
    df_HumanRecAttr = df[attr_cols].reset_index(drop = True)
    return df_HumanRecAttr
###########################

RawRec_to_RecAttr_fn.fn_string = inspect.getsource(RawRec_to_RecAttr_fn)

## [Step 4] Save as the pipeline fn

Instruction:  Run the following code.


In [35]:
prefix = ['import pandas as pd', 'import numpy as np']
pycode = convert_variables_to_pystirng(fn_variables = [RawRec_to_RecAttr_fn], prefix = prefix)
RecName = record_args['RecName']
pypath = record_args['pypath']
print(pypath)
os.makedirs(os.path.dirname(pypath), exist_ok=True)
with open(pypath, 'w') as file:
    file.write(pycode)

# Reload the function from the saved file to confirm that it can be imported.
module = load_module_variables(pypath)
RawRec_to_RecAttr_fn = module.MetaDict['RawRec_to_RecAttr_fn']

../pipeline/fn_recattr/MedPres.py
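The cell above writes the function source to a `.py` file and loads it back. The internals of `convert_variables_to_pystirng` and `load_module_variables` are not shown in this notebook, so the sketch below only illustrates the general round trip with the standard library. `save_and_load` and the toy function `select_attrs` are hypothetical names.

```python
import importlib.util
import os
import tempfile

# Toy function source; in the notebook the source comes from inspect.getsource.
FN_SOURCE = '''
def select_attrs(records, attr_cols):
    # Keep only the requested attribute columns of each record.
    return [{col: rec[col] for col in attr_cols} for rec in records]
'''

def save_and_load(source: str, pypath: str, fn_name: str):
    """Write `source` to `pypath`, import that file as a module, and return `fn_name` from it."""
    os.makedirs(os.path.dirname(pypath), exist_ok=True)
    with open(pypath, 'w', encoding='utf-8') as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location('saved_pipeline_fn', pypath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, fn_name)

with tempfile.TemporaryDirectory() as tmp:
    pypath = os.path.join(tmp, 'fn_recattr', 'MedPres.py')
    fn = save_and_load(FN_SOURCE, pypath, 'select_attrs')
    out = fn([{'PID': 2000004, 'PatientID': 88, 'Notes': 'x'}], ['PID', 'PatientID'])

print(out)
```

Because the saved file is executed on its own, everything the function uses must be imported inside the file, which is presumably why the cell above passes `prefix = ['import pandas as pd', 'import numpy as np']`.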


## [Step 5] Test the saved pipeline fn

In [36]:

df_HumanRecAttr = RawRec_to_RecAttr_fn(df_HumanRawRec, df_Human, cohort_args, record_args, attr_cols)
df_HumanRecAttr

Unnamed: 0,PID,PatientID,MedPresID,DT_tz,DT_r,DT_s,EntryType,DIA,FrequencyType,FrequencyValue,MedicationID,DiscontinuedReason
0,2000004,88,2000004-0,-360.0,2022-02-11 11:00:24,2022-02-11 11:00:23,Manu,,1,1,545999,
1,2000004,88,2000004-1,-360.0,2022-02-11 11:02:18,2022-02-11 11:02:18,Manu,,1,1,547330,
2,2000004,88,2000004-2,-360.0,2022-02-11 11:03:18,2022-02-11 11:03:17,Manu,,1,1,451442,
3,2000004,88,2000004-3,-360.0,2022-02-11 11:04:12,2022-02-11 11:04:11,Manu,,1,1,226954,
4,2000004,88,2000004-4,-360.0,2022-02-11 11:04:48,2022-02-11 11:04:47,Manu,,1,1,273426,
...,...,...,...,...,...,...,...,...,...,...,...,...
10178,2004251,48962,2004251-3,-240.0,2023-08-21 20:59:04,2023-08-21 20:59:04,Manu,,1,1,181235,
10179,2004255,48970,2004255-0,-300.0,2023-08-21 22:05:29,2023-08-21 22:05:28,Manu,,1,1,155744,
10180,2004255,48970,2004255-1,-300.0,2023-08-21 22:05:50,2023-08-21 22:05:50,Manu,,1,1,175684,
10181,2004256,48972,2004256-0,-240.0,2023-08-22 00:11:47,2023-08-22 00:11:47,Manu,,1,1,241223,


# Save to RFT

In [37]:
import shutil
from recfldtkn.pipeline_record import pipeline_for_FldTkn
from recfldtkn.configfn import load_cohort_args
from recfldtkn.configfn import load_record_args
from recfldtkn.pipeline_record import pipeline_record
from recfldtkn.configfn import load_rft_config
from recfldtkn.loadtools import filter_with_cohort_label, load_ds_rec_and_info
from recfldtkn.pipeline_record import get_parentRecord_info
from recfldtkn.pipeline_record import get_RawName_to_dfRawPath

cohort_args = load_cohort_args(recfldtkn_config_path, SPACE)
record_args = load_record_args(RecName, cohort_args)


############################
cohort_label_list = [1, 2, 3]
############################

cohort_label_to_cohort_name = {str(v['cohort_label']): k for k, v in cohort_args['CohortInfo'].items()}

for cohort_label in cohort_label_list:
    

    cohort_name = cohort_label_to_cohort_name[str(cohort_label)]
    cohort_full_name = f'{cohort_label}-{cohort_name}'
    logger.info(f'\n\n================={cohort_full_name}=================\n')
    
    RawName_to_dfRawPath = get_RawName_to_dfRawPath(cohort_args['CohortInfo'][cohort_name], rft_config)
    RawName_to_dfRaw = RawName_to_dfRawPath
    
    # RecordName = RecName
    
    OneCohort_config = record_args['CohortInfo'][cohort_name]
    OneCohort_config['cohort_name'] = cohort_name  
    OneCohort_config['cohort_label'] = cohort_label
    


    RootID = cohort_args['RootID']
    ds_Human, _ = load_ds_rec_and_info(cohort_args['RecName'], cohort_args, cohort_label_list = [cohort_label])
    df_Human = ds_Human.to_pandas()
    #########--------
    try:
        ds_P, _ = load_ds_rec_and_info('P', cohort_args, cohort_label_list = [cohort_label])
        logger.info(ds_P)
        df_P = ds_P.to_pandas()[[RootID, 'UserTimeZoneOffset']].rename(columns = {'UserTimeZoneOffset': 'user_tz'})
        df_Human = pd.merge(df_Human, df_P, how = 'left', on = RootID)
        logger.info('SUCCESS ------> user_tz is available')
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        logger.info("No user_timezone available")
    #########--------
    
    
    NoRecordFlag = False
    for RawName in record_args['RawInfo']:
        # The cohort has no records for RawName if its count column is absent
        # or sums to zero. The `or` short-circuits, so a missing column no
        # longer raises a KeyError on the second check.
        if RawName not in df_Human.columns or df_Human[RawName].sum() == 0:
            NoRecordFlag = True

    if NoRecordFlag:
        logger.info(f'no record for this cohort: {cohort_full_name}')
        continue 
    
    
    # ----------------------------------------------------- # 
    record_to_recfldtkn_list = {
        RecName: []
    }

    RecName_list = [RecName]
    FldTknName_list = None  
    rft_config = load_rft_config(recfldtkn_config_path, RecName_list, 
                                FldTknName_list, SPACE, use_inference = False)

    results = pipeline_record(record_to_recfldtkn_list, 
                                OneCohort_config,
                                rft_config, 
                                df_Human, 
                                RawName_to_dfRaw, 
                                load_from_disk = False, 
                                reuse_old_rft = False, 
                                save_to_disk = True)

    # print([i for i in results])


[INFO:2024-04-19 01:00:25,280:(3212988979.py@26 __main__)]: 


[INFO:2024-04-19 01:00:25,294:(pipeline_record.py@39 recfldtkn.pipeline_record)]: ../_Data/0-Data_Raw/RawData2022_CGM/ <-- FolderPath
[INFO:2024-04-19 01:00:25,295:(pipeline_record.py@40 recfldtkn.pipeline_record)]: 32 <--- fullfile_list
[INFO:2024-04-19 01:00:25,432:(3212988979.py@45 __main__)]: Dataset({
    features: ['PID', 'PatientID', 'YearOfBirth', 'ActivationDate', 'MRSegmentModifiedDateTime', 'UserTimeZone', 'UserTimeZoneOffset', 'Gender', 'MRSegmentID', 'DiseaseType'],
    num_rows: 7296
})
[INFO:2024-04-19 01:00:25,436:(3212988979.py@48 __main__)]: SUCCESS ------> user_tz is available
[INFO:2024-04-19 01:00:25,457:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:25,475:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:25,517:(pipel

Saving the dataset (0/1 shards):   0%|          | 0/8637 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2184 [00:00<?, ? examples/s]

[INFO:2024-04-19 01:00:28,608:(3212988979.py@26 __main__)]: 


[INFO:2024-04-19 01:00:28,610:(pipeline_record.py@39 recfldtkn.pipeline_record)]: ../_Data/0-Data_Raw/RawData2023_CVSTDCAug/ <-- FolderPath
[INFO:2024-04-19 01:00:28,610:(pipeline_record.py@40 recfldtkn.pipeline_record)]: 41 <--- fullfile_list
[INFO:2024-04-19 01:00:28,690:(3212988979.py@45 __main__)]: Dataset({
    features: ['PID', 'PatientID', 'YearOfBirth', 'ActivationDate', 'MRSegmentModifiedDateTime', 'UserTimeZone', 'UserTimeZoneOffset', 'Gender', 'MRSegmentID', 'DiseaseType'],
    num_rows: 4256
})
[INFO:2024-04-19 01:00:28,693:(3212988979.py@48 __main__)]: SUCCESS ------> user_tz is available
[INFO:2024-04-19 01:00:28,704:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:28,710:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:28,716:

['RecName_to_RecConfig', 'RecName_to_dsRec', 'RecName_to_dsRecInfo']


[INFO:2024-04-19 01:00:28,818:(pipeline_record.py@404 recfldtkn.pipeline_record)]: df_Prt shape: (11621, 2)
[INFO:2024-04-19 01:00:28,822:(pipeline_record.py@408 recfldtkn.pipeline_record)]: df_Prt shape: (4256, 2)
[INFO:2024-04-19 01:00:28,825:(pipeline_record.py@425 recfldtkn.pipeline_record)]: RawName "MedPrescription" from file: ../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv
[INFO:2024-04-19 01:00:28,827:(pipeline_record.py@440 recfldtkn.pipeline_record)]: df_HumanSelected shape: (1987, 5)
[INFO:2024-04-19 01:00:28,828:(pipeline_record.py@446 recfldtkn.pipeline_record)]: current index_group: 0 ...
[INFO:2024-04-19 01:00:28,829:(pipeline_record.py@307 recfldtkn.pipeline_record)]: RawName "MedPrescription" from file: ../_Data/0-Data_Raw/RawData2023_CVSTDCAug/08_23_2023_MedPrescription.csv
[INFO:2024-04-19 01:00:28,870:(pipeline_record.py@455 recfldtkn.pipeline_record)]: current df_HumanRawRec: (10183, 22) ...
[INFO:2024-04-19 01:00:31,674:(pipeline_record.py

Saving the dataset (0/1 shards):   0%|          | 0/10183 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1987 [00:00<?, ? examples/s]

[INFO:2024-04-19 01:00:31,855:(3212988979.py@26 __main__)]: 


[INFO:2024-04-19 01:00:31,859:(pipeline_record.py@39 recfldtkn.pipeline_record)]: ../_Data/0-Data_Raw/RawData2023_CVSDeRxAug/ <-- FolderPath
[INFO:2024-04-19 01:00:31,859:(pipeline_record.py@40 recfldtkn.pipeline_record)]: 41 <--- fullfile_list
[INFO:2024-04-19 01:00:31,938:(3212988979.py@45 __main__)]: Dataset({
    features: ['PID', 'PatientID', 'YearOfBirth', 'ActivationDate', 'MRSegmentModifiedDateTime', 'UserTimeZone', 'UserTimeZoneOffset', 'Gender', 'MRSegmentID', 'DiseaseType'],
    num_rows: 69
})
[INFO:2024-04-19 01:00:31,941:(3212988979.py@48 __main__)]: SUCCESS ------> user_tz is available
[INFO:2024-04-19 01:00:31,955:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:31,962:(configfn.py@110 recfldtkn.configfn)]: file_path in load_fldtkn_args: ../pipeline\config_recfldtkn\Record\MedPres.yaml
[INFO:2024-04-19 01:00:31,969:(

['RecName_to_RecConfig', 'RecName_to_dsRec', 'RecName_to_dsRecInfo']


[INFO:2024-04-19 01:00:32,138:(pipeline_record.py@404 recfldtkn.pipeline_record)]: df_Prt shape: (11621, 2)
[INFO:2024-04-19 01:00:32,142:(pipeline_record.py@408 recfldtkn.pipeline_record)]: df_Prt shape: (69, 2)
[INFO:2024-04-19 01:00:32,144:(pipeline_record.py@425 recfldtkn.pipeline_record)]: RawName "MedPrescription" from file: ../_Data/0-Data_Raw/RawData2023_CVSDeRxAug/08_23_2023_MedPrescription.csv
[INFO:2024-04-19 01:00:32,146:(pipeline_record.py@440 recfldtkn.pipeline_record)]: df_HumanSelected shape: (41, 5)
[INFO:2024-04-19 01:00:32,146:(pipeline_record.py@446 recfldtkn.pipeline_record)]: current index_group: 0 ...
[INFO:2024-04-19 01:00:32,148:(pipeline_record.py@307 recfldtkn.pipeline_record)]: RawName "MedPrescription" from file: ../_Data/0-Data_Raw/RawData2023_CVSDeRxAug/08_23_2023_MedPrescription.csv
[INFO:2024-04-19 01:00:32,164:(pipeline_record.py@455 recfldtkn.pipeline_record)]: current df_HumanRawRec: (231, 22) ...
[INFO:2024-04-19 01:00:32,238:(pipeline_record.py@461

Saving the dataset (0/1 shards):   0%|          | 0/231 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/41 [00:00<?, ? examples/s]

['RecName_to_RecConfig', 'RecName_to_dsRec', 'RecName_to_dsRecInfo']
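
The record-availability check in the loop above can be tried in isolation. The sketch below is a minimal, self-contained version under one assumption: that `df_Human` carries one column per RawName holding each patient's record count (inferred from the `.sum() == 0` test, not confirmed by the library). The helper name `cohort_has_records` and the toy data are hypothetical and are not part of `recfldtkn`.

```python
import pandas as pd


def cohort_has_records(df_human: pd.DataFrame, raw_names) -> bool:
    """Return True only if every raw table has a column with a non-zero total.

    Hypothetical helper: assumes each RawName column stores per-patient record counts.
    """
    for raw_name in raw_names:
        # Check for the column first, so a missing table never raises KeyError.
        if raw_name not in df_human.columns:
            return False
        if df_human[raw_name].sum() == 0:
            return False
    return True


# Toy stand-in for df_Human: three patients, two of them with prescriptions.
df_human = pd.DataFrame({"PID": [1, 2, 3], "MedPrescription": [0, 4, 2]})

print(cohort_has_records(df_human, ["MedPrescription"]))  # True
print(cohort_has_records(df_human, ["CGM"]))              # False: column is missing
```

Returning early keeps the missing-column case and the all-zero case separate, which is the same reason the loop should not index `df_Human[RawName]` before confirming the column exists.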
