<a href="https://colab.research.google.com/github/jhajagos/PHR2OHDSI/blob/main/Map_CDAs_to_OHDSI_CDM_in_a_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Convert CDAs to OHDSI Compatible Parquet Files with PySpark

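This notebook converts CDA (Clinical Document Architecture) XML documents into OHDSI CDM (Common Data Model) tables stored as Parquet files, using PySpark. The pipeline has been tested with several CDAs (Epic, Oracle/Cerner) extracted from patient portals and Apple Health Kit.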

More detail on how to map CDAs to OHDSI CDMs can be found at: https://github.com/jhajagos/PreparedSource2OHDSI/tree/main/map/prepared_source/cda

The conversion assumes that all of the CDAs being converted belong to a single person.
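
Because the pipeline expects one person's documents, it can help to check a folder before converting it. The following is a minimal sketch, not part of PreparedSource2OHDSI: it assumes each CDA carries the usual `recordTarget/patientRole/patient` element in the `urn:hl7-org:v3` namespace, and the helper names `patient_key` and `distinct_patients` are hypothetical. Matching on family name and birth date is a rough heuristic chosen for illustration.

```python
import xml.etree.ElementTree as ET

# HL7 v3 namespace used by CDA documents.
NS = {"hl7": "urn:hl7-org:v3"}


def patient_key(cda_xml: str) -> tuple:
    """Return (family name, birth date) taken from a CDA document's recordTarget."""
    root = ET.fromstring(cda_xml)
    patient = root.find("hl7:recordTarget/hl7:patientRole/hl7:patient", NS)
    if patient is None:
        raise ValueError("No recordTarget/patientRole/patient element found")
    family = patient.findtext("hl7:name/hl7:family", default="", namespaces=NS)
    birth = patient.find("hl7:birthTime", NS)
    birth_value = birth.get("value", "") if birth is not None else ""
    # Compare on the date part only (YYYYMMDD); some systems append a time.
    return (family.strip().lower(), birth_value[:8])


def distinct_patients(cda_documents: list) -> set:
    """Return the set of distinct patient keys found across the documents."""
    return {patient_key(doc) for doc in cda_documents}
```

If `distinct_patients` returns more than one key for the documents in a folder, the folder probably mixes people and should be split before running the conversion.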


In [None]:
import json

In [None]:
OHSDI_VOCABULARY_PATH = "/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/"

CDA_FILE_PATH = "/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/"

SALT = "salty salt"
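
The notebook passes `SALT` to the fragment script through `--salt`, but it does not show how the script uses the value. One common use of a salt is to derive stable, non-reversible identifiers from source identifiers. The sketch below illustrates that general idea only; it is an assumption, not the repository's implementation, and `salted_hash` is a hypothetical helper.

```python
import hashlib


def salted_hash(value: str, salt: str) -> str:
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# The same salt and value always give the same digest, so identifiers stay
# consistent across runs; changing the salt changes every digest.
example_id = salted_hash("patient-123", "salty salt")
```

Under this assumption, the salt should be kept private and reused across runs so that identifiers remain stable.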

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls {OHSDI_VOCABULARY_PATH}

concept_ancestor.parquet  concept.parquet		domain.parquet	       vocabulary.parquet
concept_class.parquet	  concept_relationship.parquet	drug_strength.parquet
concept_map.parquet	  concept_synonym.parquet	relationship.parquet
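
A quick check that the vocabulary export is complete can save a long failed run. This is a minimal sketch: the table list is copied from the directory listing above, and the helper name `missing_vocabulary_tables` is hypothetical. It uses `os.path.exists` because a Parquet dataset written by Spark is usually a directory rather than a single file.

```python
import os

# Table names taken from the directory listing above.
EXPECTED_VOCABULARY_TABLES = [
    "concept", "concept_ancestor", "concept_class", "concept_map",
    "concept_relationship", "concept_synonym", "domain", "drug_strength",
    "relationship", "vocabulary",
]


def missing_vocabulary_tables(vocabulary_path, tables=EXPECTED_VOCABULARY_TABLES):
    """Return the expected '<table>.parquet' names that are absent from vocabulary_path."""
    return [
        name for name in tables
        if not os.path.exists(os.path.join(vocabulary_path, name + ".parquet"))
    ]
```

Calling `missing_vocabulary_tables(OHSDI_VOCABULARY_PATH)` should return an empty list when every table in the listing is present.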


In [None]:
%pip install pyspark==3.5.5



In [None]:
%pip install build

Collecting build
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting pyproject_hooks (from build)
  Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB)
Downloading build-1.2.2.post1-py3-none-any.whl (22 kB)
Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: pyproject_hooks, build
Successfully installed build-1.2.2.post1 pyproject_hooks-1.2.0


In [None]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0


In [None]:
!git clone https://github.com/jhajagos/PreparedSource2OHDSI.git

Cloning into 'PreparedSource2OHDSI'...
remote: Enumerating objects: 1303, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 1303 (delta 35), reused 68 (delta 23), pack-reused 1212 (from 1)
Receiving objects: 100% (1303/1303), 259.50 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (716/716), done.


In [None]:
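%%sh
# Update the checkout, build a wheel, and install it.
cd ./PreparedSource2OHDSI
git pull
python -m build --wheel
cd ./dist/
# The wheel file name includes the package version; adjust it if the version changes.
pip install preparedsource2ohdsi-0.1.3-py3-none-any.whl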

Already up to date.
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools
  - wheel
* Getting build dependencies for wheel...
running egg_info
creating src/preparedsource2ohdsi.egg-info
writing src/preparedsource2ohdsi.egg-info/PKG-INFO
writing dependency_links to src/preparedsource2ohdsi.egg-info/dependency_links.txt
writing requirements to src/preparedsource2ohdsi.egg-info/requires.txt
writing top-level names to src/preparedsource2ohdsi.egg-info/top_level.txt
writing manifest file 'src/preparedsource2ohdsi.egg-info/SOURCES.txt'
reading manifest file 'src/preparedsource2ohdsi.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'src/preparedsource2ohdsi.egg-info/SOURCES.txt'
* Building wheel...
running bdist_wheel
running build
running build_py
creating build/lib/preparedsource2ohdsi
copying src/preparedsource2ohdsi/mapping_utilities.py -> build/lib/preparedsource2ohdsi
copying src/preparedsource2ohdsi/ohdsi_cdm
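
The install step hard-codes the wheel's version number, which stops matching once the package version changes. As a hedged alternative, the built wheel could be located by pattern instead. `find_single_wheel` is a hypothetical helper, and the `dist` location is assumed from the build cell above.

```python
import glob
import os


def find_single_wheel(dist_dir, package_prefix="preparedsource2ohdsi"):
    """Return the path of the only '<package_prefix>-*.whl' file in dist_dir."""
    matches = sorted(glob.glob(os.path.join(dist_dir, package_prefix + "-*.whl")))
    if len(matches) != 1:
        # Zero wheels means the build failed; several means stale builds remain.
        raise FileNotFoundError(
            f"Expected exactly one wheel in {dist_dir!r}, found {len(matches)}"
        )
    return matches[0]
```

In the notebook, the returned path could then be passed to `%pip install`. Requiring exactly one match avoids silently installing a stale wheel left over from an earlier build.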

In [None]:
%%time
!python /content/PreparedSource2OHDSI/map/prepared_source/cda/cda_to_prepared_source_fragments.py -d {CDA_FILE_PATH} --salt "{SALT}"

Parsing: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/mn_690eee739266146b55137e18d024bd8d.xml'
Writing: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_690eee739266146b55137e18d024bd8d.xml.csv'
Writing 32 rows in  '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_result.lab.mn_690eee739266146b55137e18d024bd8d.xml.csv
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_medication.mn_690eee739266146b55137e18d024bd8d.xml.csv'
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_condition.mn_690eee739266146b55137e18d024bd8d.xml.csv'
Writing 1 rows in '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_procedure.mn_690eee739266146b55137e18d024bd8d.xml.csv'
Writing 1 rows in  '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_result.vital.mn_690eee739266146b55137e18d024bd8d.xml

In [None]:
%%time
!python /content/PreparedSource2OHDSI/map/prepared_source/cda/clean_combine_prepared_source_fragments.py -d {CDA_FILE_PATH}

Consolidating 'source_person.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_690eee739266146b55137e18d024bd8d.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_a2628ac4f3a9fb0243d0d9dfbe225656.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_832f18645bcc13b27600207cd93606b7.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_a186826f2db3a0e3a49ea66b448dd231.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_b0a6b93c261e53f053a128e070bdc217.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_2c59c0cb0af5beea07b9549cae482913.xml.csv'
Reading '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps_frags/source_person.mn_a65018ddb1c54ea51a45841441890077.xml.csv'
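
The log above shows per-document fragments being consolidated into one CSV per table. The sketch below illustrates that idea in plain Python and is not the script's actual implementation. It assumes the table name is the part of the fragment file name before the first `.`, as the file names in the log suggest, and that all fragments for a table share a header. Dropping exact duplicate rows is an illustrative assumption, and both helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_fragments(file_names):
    """Group fragment file names by the table name before the first '.'."""
    groups = defaultdict(list)
    for name in sorted(file_names):
        groups[name.split(".", 1)[0]].append(name)
    return dict(groups)


def combine_csv_texts(csv_texts):
    """Concatenate CSV fragments sharing a header, dropping exact duplicate rows."""
    header, rows, seen = None, [], set()
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        fragment_header = next(reader, None)
        if fragment_header is None:
            continue  # skip empty fragments
        if header is None:
            header = fragment_header
        elif fragment_header != header:
            raise ValueError("Fragments have different headers")
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                rows.append(row)
    return header, rows
```

With this grouping, `source_result.lab.*` and `source_result.vital.*` fragments fall into the same `source_result` group.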


In [None]:
ps_configuration = {
  "concept_base_path": OHSDI_VOCABULARY_PATH,
  "export_concept_mapping_table_path": OHSDI_VOCABULARY_PATH,
  "concept_csv_file_extension": ".csv.bz2",
  "prepared_source_table_path": CDA_FILE_PATH + "output/prepared_source/",
  "prepared_source_csv_extension": ".csv",
  "staging_table_prefix": "",
  "ohdsi_output_location": CDA_FILE_PATH + "output/ohdsi/",
  "check_pointing": "NONE",
  "ohdsi_version": "5.4",
  "prepared_source_csv_table_path": CDA_FILE_PATH + "output/ps/",
  "prepared_source_tables_to_exclude": [
    "source_device", "source_provider", "source_encounter_map",
    "source_person_map", "source_location", "source_care_site",
    "source_payer", "source_encounter", "source_encounter_detail"
  ],
  "jdbc": {
    "connection_string": "",
    "properties": {"username": "", "password": ""}
  }
}
with open("/content/ps_configuration.json", "w") as f:
  json.dump(ps_configuration, f)
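
The configuration builds its paths by string concatenation, which only works while `CDA_FILE_PATH` ends with a slash. A small helper can remove that dependency. This is a sketch with a hypothetical name, `as_directory`; it keeps the trailing slash used by the paths above, although whether the scripts require one is not confirmed here.

```python
import posixpath


def as_directory(*parts):
    """Join path parts and guarantee exactly one trailing slash."""
    # Note: posixpath.join discards earlier parts if a later part starts with '/'.
    return posixpath.join(*parts).rstrip("/") + "/"


# Both spellings of the base path give the same result.
with_slash = as_directory("/content/cda/", "output/ohdsi/")
without_slash = as_directory("/content/cda", "output", "ohdsi")
```

For example, `as_directory(CDA_FILE_PATH, "output", "ohdsi")` would yield the same value as `CDA_FILE_PATH + "output/ohdsi/"` does in the cell above.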

In [None]:
%%time
%%python /content/PreparedSource2OHDSI/map/prepared_source/csv/stage_streamlined_prepared_source.py -c /content/ps_configuration.json -l

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 13:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_person.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_observation_period.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_condition.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_procedure.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_medication.csv'
INFO:root:Loading: '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/source_result.csv'

CPU times: user 133 ms, sys: 25.5 ms, total: 158 ms
Wall time: 38.3 s


In [None]:
%%time
%%python /content/PreparedSource2OHDSI/map/ohdsi/map_prepared_source_to_ohdsi_cdm.py -c /content/ps_configuration.json -l

Configuration:
{'check_pointing': 'NONE',
 'concept_base_path': '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/',
 'concept_csv_file_extension': '.csv.bz2',
 'export_concept_mapping_table_path': '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/',
 'jdbc': {'connection_string': '',
          'properties': {'password': '', 'username': ''}},
 'ohdsi_output_location': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/',
 'ohdsi_version': '5.4',
 'prepared_source_csv_extension': '.csv',
 'prepared_source_csv_table_path': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ps/',
 'prepared_source_table_path': '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/',
 'prepared_source_tables_to_exclude': ['source_device',
                                       'source_provider',
                                       'source_encounter_map',
                                       'source_person_map',
                         

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 13:30:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO:root:Check pointing mode: NONE
INFO:root:Loading 'concept' from '/content/drive/MyDrive/OHDSI/vocabulary/20250317/export/concept.parquet'
INFO:root:Total time to load concept and concept_map: 16.732534885406494 seconds
INFO:root:Loading 'source_care_site' from '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/source_care_site.parquet'
INFO:root:Loading 'source_condition' from '/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/prepared_source/source_condition.parquet'
INFO:ro

CPU times: user 3.26 s, sys: 514 ms, total: 3.78 s
Wall time: 13min 33s


In [None]:
!ls /content/

drive		      ps_configuration.json			    sample_data
PreparedSource2OHDSI  ps_configuration.json.generated.parquet.json


In [None]:
!cp /content/ps_configuration.json.generated.parquet.json {CDA_FILE_PATH}output/ohdsi/

In [None]:
!du -h {CDA_FILE_PATH}output/ohdsi/

6.0K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/visit_source_link.parquet
21K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Condition
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Measurement
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Observation
20K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet/mapped_domain_id=Procedure
84K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_condition_matched.parquet
17K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/ohdsi/support/source_procedure_matched.parquet/mapped_domain_id=__HIVE_DEFAULT_PARTITION__
18K	/content/drive/MyDrive/phr_ohdsi/source/jgh_documents/output/o
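
The directory names in the listing, such as `mapped_domain_id=Condition`, follow the Hive-style partition layout that Spark uses when writing partitioned Parquet. Spark writes `__HIVE_DEFAULT_PARTITION__` when the partition column is null. The sketch below reads those values back from a path; `parse_partition_path` is a hypothetical helper, and it does not decode the escaping Spark applies to special characters in partition values.

```python
def parse_partition_path(path):
    """Extract Hive-style 'column=value' partition segments from a path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            # Spark writes this placeholder when the partition value is null.
            partitions[column] = (
                None if value == "__HIVE_DEFAULT_PARTITION__" else value
            )
    return partitions
```

Applied to the listing above, the `source_procedure_matched.parquet` entry with `__HIVE_DEFAULT_PARTITION__` corresponds to rows whose `mapped_domain_id` is null.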