<a href="https://colab.research.google.com/github/jhajagos/PHR2OHDSI/blob/main/Map_CDAs_to_OHDSI_CDM_in_a_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Convert CDAs to OHDSI-Compatible Parquet Files with PySpark

This notebook converts CDA XML documents to the OHDSI Common Data Model (CDM). The pipeline has been tested with several CDAs (Epic, Oracle/Cerner) extracted from patient portals and Apple HealthKit.

More detail on mapping CDAs to the OHDSI CDM is available at: https://github.com/jhajagos/PreparedSource2OHDSI/tree/main/map/prepared_source/cda

The conversion assumes that all of the CDAs being converted belong to a single person.


In [1]:
import json

In [2]:
OHSDI_VOCABULARY_PATH = "/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/"

CDA_FILE_PATH = "/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/"

SALT = "salty salt"
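
The `SALT` value is passed to the fragment-extraction script later in this notebook through its `--salt` option. How the script applies the salt is not shown here. As a rough illustration of the general technique, assuming a salted SHA-256 digest (an assumption, not confirmed by the source), a sketch could look like the following. The helper `salted_hash` and the identifier are hypothetical.

```python
import hashlib

SALT = "salty salt"  # same placeholder value as in the cell above


def salted_hash(value: str, salt: str) -> str:
    """Return the SHA-256 hex digest of salt + value.

    Hypothetical helper: the real script may combine salt and value differently.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical identifier, used only for illustration.
digest = salted_hash("example-record-id", SALT)
print(len(digest))  # a SHA-256 hex digest is 64 characters long
```

The same input and salt always give the same digest, while a different salt gives a different digest, which is why the salt should be kept private and reused across runs if identifiers need to stay stable.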

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!ls {OHSDI_VOCABULARY_PATH}

concept_ancestor.parquet  concept.parquet		domain.parquet	       vocabulary.parquet
concept_class.parquet	  concept_relationship.parquet	drug_strength.parquet
concept_map.parquet	  concept_synonym.parquet	relationship.parquet


In [5]:
%pip install pyspark==3.5.5



In [6]:
%pip install build



In [7]:
%pip install pypdf



In [8]:
!git clone https://github.com/jhajagos/PreparedSource2OHDSI.git

fatal: destination path 'PreparedSource2OHDSI' already exists and is not an empty directory.


In [9]:
%%sh
cd ./PreparedSource2OHDSI
git pull
python -m build --wheel
cd ./dist/
pip install preparedsource2ohdsi-0.1.3-py3-none-any.whl

Already up to date.
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools
  - wheel
* Getting build dependencies for wheel...
running egg_info
writing src/preparedsource2ohdsi.egg-info/PKG-INFO
writing dependency_links to src/preparedsource2ohdsi.egg-info/dependency_links.txt
writing requirements to src/preparedsource2ohdsi.egg-info/requires.txt
writing top-level names to src/preparedsource2ohdsi.egg-info/top_level.txt
reading manifest file 'src/preparedsource2ohdsi.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'src/preparedsource2ohdsi.egg-info/SOURCES.txt'
* Building wheel...
running bdist_wheel
running build
running build_py
copying src/preparedsource2ohdsi/mapping_utilities.py -> build/lib/preparedsource2ohdsi
copying src/preparedsource2ohdsi/ohdsi_cdm_5_4.py -> build/lib/preparedsource2ohdsi
copying src/preparedsource2ohdsi/prepared_source.py -> build/lib/preparedsource2ohdsi
copying src/preparedsourc
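
The install step above names the wheel file explicitly (`preparedsource2ohdsi-0.1.3-py3-none-any.whl`), so it would need editing whenever the package version changes. One possible alternative, sketched here under the assumption that the wheel is built into `/content/PreparedSource2OHDSI/dist`, is to look the file up by pattern. `latest_wheel` is a hypothetical helper and not part of PreparedSource2OHDSI.

```python
from pathlib import Path
from typing import Optional


def latest_wheel(dist_dir: str, pattern: str = "preparedsource2ohdsi-*.whl") -> Optional[Path]:
    """Return the most recently modified wheel in dist_dir, or None if there is none."""
    directory = Path(dist_dir)
    if not directory.is_dir():
        return None
    wheels = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    return wheels[-1] if wheels else None


# Assumed location of the build output in this Colab session.
wheel = latest_wheel("/content/PreparedSource2OHDSI/dist")
if wheel is None:
    print("No wheel found; run the build step first.")
else:
    print(f"Install with: pip install {wheel}")
```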

In [10]:
%%time
!python /content/PreparedSource2OHDSI/map/prepared_source/cda/cda_to_prepared_source_fragments.py -d {CDA_FILE_PATH} --salt "{SALT}"

Parsing: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml'
Writing: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml.csv'
Writing 1 rows in  '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_result.lab.mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml.csv
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_medication.mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml.csv'
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_condition.mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml.csv'
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_procedure.mn_958b67c7fd1048bb249f9ac6225ef040
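
The log above shows one CSV fragment per table and source document, written to `output/ps_frags/`. A quick way to confirm that every document produced fragments is to count files per table. The sketch below assumes that each fragment name begins with the table name followed by a dot, as in the log lines above. `count_fragments` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_fragments(frag_dir: str) -> Counter:
    """Count CSV fragments per prepared-source table.

    Assumes names such as 'source_person.<document>.xml.csv', where the text
    before the first dot is the table name.
    """
    directory = Path(frag_dir)
    if not directory.is_dir():
        return Counter()
    return Counter(p.name.split(".", 1)[0] for p in directory.glob("*.csv"))


# Fragment directory taken from the log output above.
frag_dir = "/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags"
for table, n in sorted(count_fragments(frag_dir).items()):
    print(f"{table}: {n} fragment(s)")
```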

In [11]:
%%time
!python /content/PreparedSource2OHDSI/map/prepared_source/cda/clean_combine_prepared_source_fragments.py -d {CDA_FILE_PATH}

Consolidating 'source_person.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_958b67c7fd1048bb249f9ac6225ef0401f63fb07c7a49719a63b751650caaf71.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_a53fc7a65bcfe71db703647fb67a9956c137f0c2723083e78d621ae37d7a5d93.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_eee38f2fe80f9b4194b9979ff6f6f99a23494487962616a9b8dba2d27aa74363.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_e0d2df2786c79f5c6947e3c83bb1b9be701fa7b5277963602df5ce4ee30648bb.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_71ed66206695b85fa00e188ee04208306ea3cc2070345e86cc4c5830e83f9ff5.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_ab78e6b1abfe18bb92ab

In [12]:
ps_configuration = {
  "concept_base_path": OHSDI_VOCABULARY_PATH,
  "export_concept_mapping_table_path": OHSDI_VOCABULARY_PATH,
  "concept_csv_file_extension": ".csv.bz2",
  "prepared_source_table_path": CDA_FILE_PATH + "output/prepared_source/",
  "prepared_source_csv_extension": ".csv",
  "staging_table_prefix": "",
  "ohdsi_output_location": CDA_FILE_PATH + "output/ohdsi/",
  "check_pointing": "NONE",
  "ohdsi_version": "5.4",
  "prepared_source_csv_table_path": CDA_FILE_PATH + "output/ps/",
  "prepared_source_tables_to_exclude": [
    "source_device", "source_provider", "source_encounter_map",
    "source_person_map", "source_location", "source_care_site",
    "source_payer", "source_encounter", "source_encounter_detail"
  ],
  "jdbc": {
    "connection_string": "",
    "properties": {"username": "", "password": ""}
  }
}
with open("/content/ps_configuration.json", "w") as f:
  json.dump(ps_configuration, f)
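
Every path in the configuration above is built by string concatenation and ends with `/`. Before the long-running Spark steps, it can be worth checking the written file for empty or malformed path entries. The sketch below is one way to do that. Whether the mapping scripts strictly require a trailing slash is an assumption, and `check_configuration` is a hypothetical helper.

```python
import json
from pathlib import Path

# Path-valued keys taken from the configuration cell above.
PATH_KEYS = [
    "concept_base_path",
    "export_concept_mapping_table_path",
    "prepared_source_table_path",
    "prepared_source_csv_table_path",
    "ohdsi_output_location",
]


def check_configuration(config: dict) -> list:
    """Return a list of human-readable problems found in the path entries."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: missing or empty")
        elif not value.endswith("/"):
            # Assumption: directory paths are expected to end with a slash.
            problems.append(f"{key}: no trailing slash")
    return problems


config_file = Path("/content/ps_configuration.json")
if config_file.exists():
    for problem in check_configuration(json.loads(config_file.read_text())):
        print("WARNING:", problem)
```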

In [13]:
%%time
%%python /content/PreparedSource2OHDSI/map/prepared_source/csv/stage_streamlined_prepared_source.py -c /content/ps_configuration.json -l

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 02:34:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_person.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_observation_period.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_condition.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_procedure.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_medication.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_result.csv'
[Stage 11:>                                                                           

CPU times: user 167 ms, sys: 36.7 ms, total: 204 ms
Wall time: 40.1 s


In [14]:
%%time
%%python /content/PreparedSource2OHDSI/map/ohdsi/map_prepared_source_to_ohdsi_cdm.py -c /content/ps_configuration.json -l

Configuration:
{'check_pointing': 'NONE',
 'concept_base_path': '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/',
 'concept_csv_file_extension': '.csv.bz2',
 'export_concept_mapping_table_path': '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/',
 'jdbc': {'connection_string': '',
          'properties': {'password': '', 'username': ''}},
 'ohdsi_output_location': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/',
 'ohdsi_version': '5.4',
 'prepared_source_csv_extension': '.csv',
 'prepared_source_csv_table_path': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/',
 'prepared_source_table_path': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/',
 'prepared_source_tables_to_exclude': ['source_device',
                                       'source_provider',
                                       'source_encounter_map',
                                       'source_person_map',
                         

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 02:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:root:Check pointing mode: NONE
INFO:root:Loading 'concept' from '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/concept.parquet'
INFO:root:Loading 'vocabulary' from '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/vocabulary.parquet'
INFO:root:Total time to load concept and concept_map: 5.641484260559082 seconds
INFO:root:Loading 'source_care_site' from '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/source_care_site.parquet'
INFO:root:Loading 'source_condition' from '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/source_condition.parquet'
INFO:root:Loading 'source_encounter' from '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/ou

CPU times: user 3.6 s, sys: 557 ms, total: 4.16 s
Wall time: 13min 3s


In [15]:
!ls /content/

drive		      ps_configuration.json			    sample_data
PreparedSource2OHDSI  ps_configuration.json.generated.parquet.json


In [16]:
!cp ps_configuration.json.generated.parquet.json {CDA_FILE_PATH}output/ohdsi/

In [17]:
!du -h {CDA_FILE_PATH}output/ohdsi/

6.0K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/visit_source_link.parquet
21K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Condition
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Measurement
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Observation
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Procedure
85K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet
17K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_procedure_matched.parquet/mapped_domain_id=__HIVE_DEFAULT_PARTITION__
18K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/o
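
The `du -h` listing above gives one line per directory, including every partition subdirectory. For a shorter summary with one line per Parquet dataset, the sizes can be totaled in Python. The sketch below assumes that each dataset is a directory whose name ends in `.parquet`, as in the listing. It sums file sizes, so the numbers can differ slightly from the disk usage that `du` reports. `parquet_dataset_sizes` is a hypothetical helper.

```python
from pathlib import Path


def parquet_dataset_sizes(root: str) -> dict:
    """Map each '*.parquet' directory under root to the total bytes of its files."""
    base = Path(root)
    sizes = {}
    if not base.is_dir():
        return sizes
    for entry in sorted(base.rglob("*.parquet")):
        # Part files inside a dataset may also end in .parquet; keep directories only.
        if entry.is_dir():
            sizes[str(entry.relative_to(base))] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes


# Output location taken from the configuration used in this notebook.
root = "/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/"
for name, size in parquet_dataset_sizes(root).items():
    print(f"{size / 1024:8.1f} KiB  {name}")
```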