# DICOM to OMOP: create custom vocabularies

This notebook extracts DICOM attributes and their values and adds them to the OMOP CDM vocabulary as custom concepts. It has two parts:
1. Restructure the data: DICOM harvest to OMOP structure
2. Extract information for `CONCEPT_RELATIONSHIP` from DICOM Standard Part 3

Links to OMOP CDM
- [Create a custom vocabulary](https://forums.ohdsi.org/t/how-to-add-a-custom-vocabulary-to-the-omop-vocabulary-table/12440/3)
- [OMOP CDM v.5.4 VOCABULARY](https://ohdsi.github.io/CommonDataModel/cdm54.html#vocabulary)
- [OMOP CDM v.5.4 CONCEPT_CLASS](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_class)
- [OMOP CDM v.5.4 CONCEPT](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept)

Links to DICOM Standards
- [Value Representation (VR)](https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html)
- [Part 3](https://dicom.nema.org/medical/dicom/current/output/html/part03.html)

## 1. Restructure data: DICOM harvest to OMOP Structure

<blockquote>
<strong>This section is for reference. You can skip it and use the flat files in the `files` directory of this repository. Instructions for updating your OMOP database are in a separate notebook, "upload_dicom_to_omop.ipynb".</strong>
</blockquote>

In [9]:
import pandas as pd

attributes = pd.read_csv("./files/DICOM Standard/part6_attributes.csv")
valuesets = pd.read_csv("./files/DICOM Standard/part16_fhir_valuesets.csv")
part3 = pd.read_pickle('./files/DICOM Standard/part3_mapping.pkl')

In [10]:
part3_att = part3[part3['CID']!=''].merge(attributes, left_on = 'Tag', right_on = 'Tag_cleaned', how = 'left')

In [11]:
# part 3 includes 1590 attributes
part3['Tag'].nunique()

1590

In [60]:
# part 6 includes 5190 attributes
attributes['Tag'].nunique()

5190

In [61]:
attributes[attributes['Name'].isna()]

Unnamed: 0,Tag,Name,Keyword,VR,VM,Remarks,Tag_cleaned
86,"(0008,0202)",,,,,RET (2020c),00080202
810,"(0018,0061)",,,DS,1.0,RET (2015c),00180061
1461,"(0018,9445)",,,,,RET (2004) - See Note,00189445
2089,"(0028,0020)",,,,,RET (2007) - See Note,00280020
3751,"(0400,0315)",,,FL,1.0,RET (2015c),04000315
4337,"(300A,0135)",,,,,RET,300A0135
4743,"(300A,0782)",,,US,1.0,RET,300A0782


In [12]:
# columns for the imaging extension tables
mi_cdm = ["0020000D", "0020000E", "00080020", "00100020", "00080060", "00180015"]
attributes[(attributes['Tag_cleaned'].isin(mi_cdm))][['Tag', 'Name', 'VR', 'VM']]

Unnamed: 0,Tag,Name,VR,VM
16,"(0008,0020)",Study Date,DA,1
40,"(0008,0060)",Modality,CS,1
267,"(0010,0020)",Patient ID,LO,1
783,"(0018,0015)",Body Part Examined,CS,1
1676,"(0020,000D)",Study Instance UID,UI,1
1677,"(0020,000E)",Series Instance UID,UI,1


In [None]:
# DICOM attributes
# concept_id: 2128000000 + sequential number in range of 1-5999
# concept_name: 'Name'
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Attributes'
# standard_concept: NULL
# concept_code: Tag
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL

In [33]:
# Drop retired attributes that no longer have a name
attributes = attributes[~attributes['Name'].isna()].copy()

# Assign concept_ids sequentially, starting at 2128000001
sequential_numbers = range(1, len(attributes)+1)
attributes.loc[:, 'concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
attribute_table_omop = pd.DataFrame(columns = columns)

# Columns not assigned here (standard_concept, invalid_reason) stay empty
attribute_table_omop['concept_id'] = attributes['concept_id']
attribute_table_omop['concept_name'] = attributes['Name']
attribute_table_omop['domain_id'] = 'Measurement'
attribute_table_omop['vocabulary_id'] = 'DICOM'
attribute_table_omop['concept_class_id'] = 'DICOM Attributes'
attribute_table_omop['concept_code'] = attributes['Tag_cleaned']
attribute_table_omop['valid_start_date'] = 19930101
attribute_table_omop['valid_end_date'] = 20991231

attribute_table_omop = attribute_table_omop.reset_index(drop=True)

In [34]:
attribute_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000001,Length to End,Measurement,DICOM,DICOM Attributes,,00080001,19930101,20991231,
1,2128000002,Specific Character Set,Measurement,DICOM,DICOM Attributes,,00080005,19930101,20991231,
2,2128000003,Language Code Sequence,Measurement,DICOM,DICOM Attributes,,00080006,19930101,20991231,
3,2128000004,Image Type,Measurement,DICOM,DICOM Attributes,,00080008,19930101,20991231,
4,2128000005,Recognition Code,Measurement,DICOM,DICOM Attributes,,00080010,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
5178,2128005179,Digital Signatures Sequence,Measurement,DICOM,DICOM Attributes,,FFFAFFFA,19930101,20991231,
5179,2128005180,Data Set Trailing Padding,Measurement,DICOM,DICOM Attributes,,FFFCFFFC,19930101,20991231,
5180,2128005181,Item,Measurement,DICOM,DICOM Attributes,,FFFEE000,19930101,20991231,
5181,2128005182,Item Delimitation Item,Measurement,DICOM,DICOM Attributes,,FFFEE00D,19930101,20991231,
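
The same column mapping is repeated for each concept table in this notebook. As a minimal sketch, assuming the fixed values used above (`vocabulary_id` 'DICOM', `domain_id` 'Measurement', and the 19930101/20991231 validity dates), the pattern could be wrapped in one function. `build_concept_table` and `OMOP_CONCEPT_COLUMNS` are hypothetical names that do not exist in this repository, and the two-row frame is toy data.

```python
import pandas as pd

OMOP_CONCEPT_COLUMNS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id",
    "concept_class_id", "standard_concept", "concept_code",
    "valid_start_date", "valid_end_date", "invalid_reason",
]


def build_concept_table(source, name_col, code_col, concept_class_id,
                        first_concept_id, domain_id="Measurement"):
    """Hypothetical helper: map a source frame onto the OMOP CONCEPT columns."""
    source = source.reset_index(drop=True)
    table = pd.DataFrame({
        # Sequential ids starting at first_concept_id
        "concept_id": list(range(first_concept_id, first_concept_id + len(source))),
        "concept_name": source[name_col],
        "domain_id": domain_id,
        "vocabulary_id": "DICOM",
        "concept_class_id": concept_class_id,
        "standard_concept": None,
        "concept_code": source[code_col],
        "valid_start_date": 19930101,
        "valid_end_date": 20991231,
        "invalid_reason": None,
    })
    return table[OMOP_CONCEPT_COLUMNS]


# Toy input with the same column names as the Part 6 attribute file
toy = pd.DataFrame({"Name": ["Study Date", "Modality"],
                    "Tag_cleaned": ["00080020", "00080060"]})
toy_concepts = build_concept_table(toy, "Name", "Tag_cleaned",
                                   "DICOM Attributes", 2128000001)
print(toy_concepts[["concept_id", "concept_name", "concept_code"]])
```

Passing the first `concept_id` explicitly keeps the numbering of each block visible at the call site.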


In [None]:
# DICOM Value Sets
# concept_id: 2128000000 + sequential number in range of 6000-999999
# concept_name: display
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept', 'Condition', 'Observation'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Value Sets'
# standard_concept: NULL
# concept_code: code
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL

In [65]:
# part 16 shape (this is only CID portions)
valuesets.shape

(26825, 8)

In [14]:
valuesets_dicom = valuesets[valuesets['system']=='http://dicom.nema.org/resources/ontology/DCM'].copy()
valuesets_dicom['display'] = valuesets_dicom['display'].str.capitalize()
valuesets_dicom #5223

Unnamed: 0,code,display,system,id,version,status,description,cid
31,110504,Patient died,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
32,110515,Patient condition prevented continuing,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
33,110503,Patient allergic to media/contrast,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
34,110514,Incorrect worklist entry selected,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
35,110502,Incorrect procedure ordered,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
...,...,...,...,...,...,...,...,...
26820,128129,Plane through posterior extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26821,128128,Plane through anterior extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26822,128130,Plane through center,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26823,128121,Plane through inferior extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010


In [15]:
valuesets_dicom_agg = valuesets_dicom.groupby(['code']).agg(counts = ('id', 'count'), latest_version = ('version', 'max')).reset_index()
valuesets_dicom_agg.shape

(3281, 3)

In [16]:
valuesets_dicom_agg.merge(valuesets_dicom[['code', 'version', 'display']], how = 'left', left_on = ['code', 'latest_version'], right_on = ['code', 'version']).drop_duplicates()

Unnamed: 0,code,counts,latest_version,version,display
0,109001,1,20020904,20020904,Digital timecode (nos)
1,109002,1,20020904,20020904,"Ecg-based gating signal, processed"
2,109003,1,20020904,20020904,Irig-b timecode
3,109004,1,20020904,20020904,X-ray fluoroscopy on signal
4,109005,1,20020904,20020904,X-ray on trigger
...,...,...,...,...,...
3714,VA,3,20231115,20231115,Visual acuity
3715,VIDD,1,20130617,20130617,Video tape digitizer equipment
3716,WSD,1,20190327,20190327,Workstation
3717,XA,3,20231115,20231115,X-ray angiography


In [69]:
# Number of Part 16 codes that appear in more than one CID
valuesets_dicom_agg[valuesets_dicom_agg['counts']>1].shape

(1063, 3)

In [17]:
# Number of Part 16 elements after deduplication
valuesets_unique = valuesets_dicom_agg.merge(valuesets_dicom[['code', 'version', 'display']], how = 'left', left_on = ['code', 'latest_version'], right_on = ['code', 'version']).drop_duplicates().reset_index(drop=True)
valuesets_unique.shape

(3281, 5)
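
The merge above keeps, for each code, the display text from its latest version. An alternative sketch of the same idea sorts by version and keeps the last row per code, which guarantees exactly one row per code even if two CIDs carry different display strings for the same code and version. The three rows are toy data, not values from Part 16.

```python
import pandas as pd

# Toy rows: the code "XA" appears in two CIDs with different versions
toy_valuesets = pd.DataFrame({
    "code": ["XA", "XA", "VA"],
    "version": [20190327, 20231115, 20231115],
    "display": ["X-ray angiography (old)", "X-ray angiography", "Visual acuity"],
})

# Sort so the newest version of each code comes last, then keep that row
latest = (
    toy_valuesets.sort_values(["code", "version"])
    .drop_duplicates(subset="code", keep="last")
    .reset_index(drop=True)
)
print(latest)
```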

In [18]:
# Value-set concept_ids start at 2128006000, after the block reserved for attributes
sequential_numbers = range(6000, len(valuesets_unique)+6000)
valuesets_unique.loc[:, 'concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
valuesets_table_omop = pd.DataFrame(columns = columns)

valuesets_table_omop['concept_id'] = valuesets_unique['concept_id']
valuesets_table_omop['concept_name'] = valuesets_unique['display']
valuesets_table_omop['domain_id'] = 'Measurement'
valuesets_table_omop['vocabulary_id'] = 'DICOM'
valuesets_table_omop['concept_class_id'] = 'DICOM Value Sets'
valuesets_table_omop['concept_code'] = valuesets_unique['code']
valuesets_table_omop['valid_start_date'] = 19930101
valuesets_table_omop['valid_end_date'] = 20991231

valuesets_table_omop = valuesets_table_omop.reset_index(drop=True)

In [72]:
valuesets_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128006000,Digital timecode (nos),Measurement,DICOM,DICOM Value Sets,,109001,19930101,20991231,
1,2128006001,"Ecg-based gating signal, processed",Measurement,DICOM,DICOM Value Sets,,109002,19930101,20991231,
2,2128006002,Irig-b timecode,Measurement,DICOM,DICOM Value Sets,,109003,19930101,20991231,
3,2128006003,X-ray fluoroscopy on signal,Measurement,DICOM,DICOM Value Sets,,109004,19930101,20991231,
4,2128006004,X-ray on trigger,Measurement,DICOM,DICOM Value Sets,,109005,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
3276,2128009276,Visual acuity,Measurement,DICOM,DICOM Value Sets,,VA,19930101,20991231,
3277,2128009277,Video tape digitizer equipment,Measurement,DICOM,DICOM Value Sets,,VIDD,19930101,20991231,
3278,2128009278,Workstation,Measurement,DICOM,DICOM Value Sets,,WSD,19930101,20991231,
3279,2128009279,X-ray angiography,Measurement,DICOM,DICOM Value Sets,,XA,19930101,20991231,


### Code String values: DICOM Defined Terms and Enumerated Values

In [6]:
import pandas as pd

modality = pd.read_csv('./files/DICOM Standard/part3_modality.csv')
patient_position = pd.read_csv('./files/DICOM Standard/part3_patient_position.csv')
lossy_image_comp_methods = pd.read_csv('./files/DICOM Standard/part3_lossy_image_comp_methods.csv')
body_part = pd.read_pickle('./files/DICOM Standard/part16_body_part_examined.pkl')

In [7]:
combined_values = pd.concat([modality, patient_position, lossy_image_comp_methods])
combined_values

Unnamed: 0,code,description
0,ANN,Annotation
1,AR,Autorefraction
2,ASMT,Content Assessment Results
3,AU,Audio
4,BDUS,Bone Densitometry (ultrasound)
...,...,...
3,ISO_15444_15,High-Throughput JPEG 2000 Irreversible Compres...
4,ISO_18181_1,JPEG XL Image Coding System - Part 1 Core Codi...
5,ISO_13818_2,MPEG2 Compression[ISO/IEC 13818-2]
6,ISO_14496_10,MPEG-4 AVC/H.264 Compression[ISO/IEC 14496-10]


In [19]:
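# Continue numbering after the last value-set concept_id.
# Note: the 6000 offset is added on top of `index`, so these IDs start at
# max + 1 + 6000 and leave a gap of unused IDs after the value sets.
index = valuesets_table_omop['concept_id'].max() + 1
sequential_numbers = range(6000, len(combined_values)+6000)
combined_values.loc[:, 'concept_id'] = [index + num for num in sequential_numbers]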

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
cs_value_table_omop = pd.DataFrame(columns = columns)

cs_value_table_omop['concept_id'] = combined_values['concept_id']
cs_value_table_omop['concept_name'] = combined_values['description']
cs_value_table_omop['domain_id'] = 'Measurement'
cs_value_table_omop['vocabulary_id'] = 'DICOM'
cs_value_table_omop['concept_class_id'] = 'DICOM Value Sets'
cs_value_table_omop['concept_code'] = combined_values['code']
cs_value_table_omop['valid_start_date'] = 19930101
cs_value_table_omop['valid_end_date'] = 20991231

cs_value_table_omop = cs_value_table_omop.reset_index(drop=True)

In [20]:
cs_value_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128015281,Annotation,Measurement,DICOM,DICOM Value Sets,,ANN,19930101,20991231,
1,2128015282,Autorefraction,Measurement,DICOM,DICOM Value Sets,,AR,19930101,20991231,
2,2128015283,Content Assessment Results,Measurement,DICOM,DICOM Value Sets,,ASMT,19930101,20991231,
3,2128015284,Audio,Measurement,DICOM,DICOM Value Sets,,AU,19930101,20991231,
4,2128015285,Bone Densitometry (ultrasound),Measurement,DICOM,DICOM Value Sets,,BDUS,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
98,2128015379,High-Throughput JPEG 2000 Irreversible Compres...,Measurement,DICOM,DICOM Value Sets,,ISO_15444_15,19930101,20991231,
99,2128015380,JPEG XL Image Coding System - Part 1 Core Codi...,Measurement,DICOM,DICOM Value Sets,,ISO_18181_1,19930101,20991231,
100,2128015381,MPEG2 Compression[ISO/IEC 13818-2],Measurement,DICOM,DICOM Value Sets,,ISO_13818_2,19930101,20991231,
101,2128015382,MPEG-4 AVC/H.264 Compression[ISO/IEC 14496-10],Measurement,DICOM,DICOM Value Sets,,ISO_14496_10,19930101,20991231,


In [21]:
# concept_ids of Code String values whose code already exists among the Part 16 value sets
dicom_code_duplicates = cs_value_table_omop[cs_value_table_omop['concept_code'].isin(valuesets_table_omop['concept_code'])]['concept_id']
len(dicom_code_duplicates)

74

In [22]:
overlap = cs_value_table_omop[['concept_id', 'concept_name', 'concept_code']].merge(valuesets_table_omop[['concept_id', 'concept_name', 'concept_code']], on = 'concept_code')
print(overlap.shape)
overlap.head()

(74, 5)


Unnamed: 0,concept_id_x,concept_name_x,concept_code,concept_id_y,concept_name_y
0,2128015282,Autorefraction,AR,2128009178,Autorefraction
1,2128015283,Content Assessment Results,ASMT,2128009180,Content assessment result
2,2128015284,Audio,AU,2128009181,Basic voice audio
3,2128015285,Bone Densitometry (ultrasound),BDUS,2128009182,Ultrasound bone densitometry
4,2128015286,Biomagnetic imaging,BI,2128009183,Biomagnetic imaging


In [23]:
# Modality codes with no matching Part 16 value-set concept
print(modality.shape)
modality.merge(valuesets_table_omop[['concept_id', 'concept_name', 'concept_code']], how = 'left', left_on = 'code', right_on = 'concept_code').loc[lambda x: x['concept_id'].isna()]

(79, 2)


Unnamed: 0,code,description,concept_id,concept_name,concept_code
0,ANN,Annotation,,,
59,RTINTENT,Radiotherapy Intent,,,
61,RTRAD,RT Radiation,,,
63,RTSEGANN,Radiotherapy Segment Annotation,,,
77,XAPROTOCOL,XA Protocol (Performed),,,


In [28]:
cs_value_table_omop = cs_value_table_omop[~cs_value_table_omop['concept_id'].isin(dicom_code_duplicates)].copy()
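
The overlap count and the removal above can also be done in one pass with a left merge and `indicator=True`, which labels each row by where its key was found. This is only a sketch on toy frames; `toy_cs` and `toy_vs` stand in for `cs_value_table_omop` and `valuesets_table_omop`.

```python
import pandas as pd

# Toy stand-ins for the Code String table and the Part 16 value-set table
toy_cs = pd.DataFrame({
    "concept_code": ["ANN", "AR", "AU"],
    "concept_name": ["Annotation", "Autorefraction", "Audio"],
})
toy_vs = pd.DataFrame({"concept_code": ["AR", "AU"]})

# indicator=True adds a `_merge` column: 'left_only' or 'both'
flagged = toy_cs.merge(toy_vs.drop_duplicates(), on="concept_code",
                       how="left", indicator=True)
n_overlap = int((flagged["_merge"] == "both").sum())

# Anti-join: keep only codes that are not already in the value sets
only_new = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(n_overlap, list(only_new["concept_code"]))
```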

In [29]:
# Remove elements without DICOM Code Strings
body_part = body_part[body_part['Body Part Examined']!=""].copy()

In [30]:
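# Continue numbering after the last Code String concept_id.
# Note: as above, the 6000 offset is added on top of `index`, leaving a gap of unused IDs.
index = cs_value_table_omop['concept_id'].max() + 1
sequential_numbers = range(6000, len(body_part)+6000)
body_part.loc[:, 'concept_id'] = [index + num for num in sequential_numbers]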

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
body_part_table_omop = pd.DataFrame(columns = columns)

body_part_table_omop['concept_id'] = body_part['concept_id']
body_part_table_omop['concept_name'] = body_part['Code Meaning']
body_part_table_omop['domain_id'] = 'Measurement'
body_part_table_omop['vocabulary_id'] = 'DICOM'
body_part_table_omop['concept_class_id'] = 'DICOM Value Sets'
body_part_table_omop['concept_code'] = body_part['Body Part Examined']
body_part_table_omop['valid_start_date'] = 19930101
body_part_table_omop['valid_end_date'] = 20991231

body_part_table_omop = body_part_table_omop.reset_index(drop=True)
body_part_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128021384,Abdomen,Measurement,DICOM,DICOM Value Sets,,ABDOMEN,19930101,20991231,
1,2128021385,Abdomen and Pelvis,Measurement,DICOM,DICOM Value Sets,,ABDOMENPELVIS,19930101,20991231,
2,2128021386,Abdominal aorta,Measurement,DICOM,DICOM Value Sets,,ABDOMINALAORTA,19930101,20991231,
3,2128021387,Acromioclavicular joint,Measurement,DICOM,DICOM Value Sets,,ACJOINT,19930101,20991231,
4,2128021388,Adrenal gland,Measurement,DICOM,DICOM Value Sets,,ADRENAL,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
313,2128021697,Vein,Measurement,DICOM,DICOM Value Sets,,VEIN,19930101,20991231,
314,2128021698,Vertebral artery,Measurement,DICOM,DICOM Value Sets,,VERTEBRALA,19930101,20991231,
315,2128021699,Vulva,Measurement,DICOM,DICOM Value Sets,,VULVA,19930101,20991231,
316,2128021700,Wrist joint,Measurement,DICOM,DICOM Value Sets,,WRIST,19930101,20991231,


### Combine DICOM concepts

In [35]:
print(attribute_table_omop.shape, valuesets_table_omop.shape, cs_value_table_omop.shape, body_part_table_omop.shape)

(5183, 10) (3281, 10) (29, 10) (318, 10)


In [36]:
omop_table_staging = pd.concat([attribute_table_omop, valuesets_table_omop, cs_value_table_omop, body_part_table_omop], ignore_index=True)
omop_table_staging['standard_concept'] = ''
print(omop_table_staging.shape)

(8811, 10)
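
Before exporting, it may be worth checking the combined table for a few basic problems. The sketch below counts duplicate `concept_id` values, IDs that fall outside the range above 2,000,000,000 that OHDSI convention reserves for local concepts, and missing names. `check_concept_staging` is a hypothetical helper and the frame is toy data.

```python
import pandas as pd


def check_concept_staging(table, min_local_id=2_000_000_000):
    """Hypothetical checks on a staging CONCEPT table; returns problem counts."""
    return {
        "duplicate_concept_ids": int(table["concept_id"].duplicated().sum()),
        "ids_below_local_range": int((table["concept_id"] <= min_local_id).sum()),
        "missing_names": int(table["concept_name"].isna().sum()),
    }


# Toy table with one repeated id and one id outside the local range
toy_staging = pd.DataFrame({
    "concept_id": [2128000001, 2128000001, 5],
    "concept_name": ["Length to End", "Specific Character Set", "Image Type"],
})
report = check_concept_staging(toy_staging)
print(report)
```

On the real table, the call would be `check_concept_staging(omop_table_staging)`, and every count would be expected to be zero.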


In [38]:
omop_table_staging.to_csv('./files/OMOP CDM Staging/omop_table_staging_v3.csv', index=False)
omop_table_staging

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000001,Length to End,Measurement,DICOM,DICOM Attributes,,00080001,19930101,20991231,
1,2128000002,Specific Character Set,Measurement,DICOM,DICOM Attributes,,00080005,19930101,20991231,
2,2128000003,Language Code Sequence,Measurement,DICOM,DICOM Attributes,,00080006,19930101,20991231,
3,2128000004,Image Type,Measurement,DICOM,DICOM Attributes,,00080008,19930101,20991231,
4,2128000005,Recognition Code,Measurement,DICOM,DICOM Attributes,,00080010,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
8806,2128021697,Vein,Measurement,DICOM,DICOM Value Sets,,VEIN,19930101,20991231,
8807,2128021698,Vertebral artery,Measurement,DICOM,DICOM Value Sets,,VERTEBRALA,19930101,20991231,
8808,2128021699,Vulva,Measurement,DICOM,DICOM Value Sets,,VULVA,19930101,20991231,
8809,2128021700,Wrist joint,Measurement,DICOM,DICOM Value Sets,,WRIST,19930101,20991231,


In [39]:
omop_table_staging.groupby('concept_class_id')['concept_id'].count()

concept_class_id
DICOM Attributes    5183
DICOM Value Sets    3628
Name: concept_id, dtype: int64

## 2. Set Concept_relationship from Part 3

### Attribute & Value Sets with CIDs: Attributes = concept_id_1, Value Sets = concept_id_2
The coded values in these value sets come from several coding systems, so this relationship also links DICOM attributes to concepts in standard vocabularies such as SNOMED and LOINC.

In [40]:
# Join Attribute table with the Part 3 Mapping table where CID is present
part3_cid = part3[part3['CID']!=''].merge(attributes[['Tag_cleaned', 'concept_id']], 
                                          how = 'inner', left_on = 'Tag', right_on = 'Tag_cleaned')
part3_cid['cid'] = pd.to_numeric(part3_cid['CID'], errors='coerce').astype('Int64')
part3_cid = part3_cid.rename(columns={'concept_id':'concept_id_1'})
part3_cid.head()

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,Attribute Name,Tag,Type,Attribute Description,CID,SOP Class UID,Tag_cleaned,concept_id_1,cid
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Ethnic Group Code Sequence,102161,3,{},6099,1.2.840.10008.5.1.4.1.1.1,102161,2128000322,6099
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient Species Code Sequence,102202,1C,{},7454,1.2.840.10008.5.1.4.1.1.1,102202,2128000330,7454
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient Breed Code Sequence,102293,2C,{},7480,1.2.840.10008.5.1.4.1.1.1,102293,2128000334,7480
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,De-identification Method Code Sequence,120064,1C,{},7050,1.2.840.10008.5.1.4.1.1.1,120064,2128000364,7050
4,table_A.2-1,Computed Radiography Image IOD Modules,Study,General Study,sect_C.7.2.1,M,M,sect_C.7.2.1,Requesting Service Code Sequence,321034,3,{},7030,1.2.840.10008.5.1.4.1.1.1,321034,2128002339,7030


In [41]:
# Add the value sets from Part 16 to the table
part3_cid_val = part3_cid[['iod', 'Module', 'Attribute Name', 'Tag', 'cid', 'concept_id_1']].merge(
    valuesets[['code', 'cid', 'system']], how = 'left', on = 'cid')
part3_cid_val = part3_cid_val.rename(columns={'code': 'concept_code'})
part3_cid_val.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,C41219,http://ncit.nci.nih.gov
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413490006,http://snomed.info/sct
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413581001,http://snomed.info/sct
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413773004,http://snomed.info/sct
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413600007,http://snomed.info/sct


In [42]:
# Compute number of IODs and CIDs per attribute
part3_tag_agg = part3.groupby(['Tag', 'Attribute Name']).agg(
    unique_iod_count = ('iod', 'nunique'),
    unique_cid_count = ('CID', 'nunique')
    ).reset_index()
# Count of total rows, count of rows with multiple CIDs
print(part3_tag_agg.shape, part3_tag_agg[part3_tag_agg['unique_cid_count']>1].shape)
# Inspect rows with multiple CIDs
part3_tag_agg[part3_tag_agg['unique_cid_count']>1]

(1594, 4) (14, 4)


Unnamed: 0,Tag,Attribute Name,unique_iod_count,unique_cid_count
98,00082218,Anatomic Region Sequence,15,3
99,00082228,Primary Anatomic Structure Sequence,5,2
108,00089215,Derivation Code Sequence,71,2
773,00220015,Acquisition Device Type Code Sequence,5,2
774,00220016,Illumination Type Code Sequence,5,2
806,00221423,Acquisition Method Algorithm Sequence,2,2
823,00221612,Derivation Algorithm Sequence,4,2
994,00400275,Request Attributes Sequence,139,3
1018,00409096,Real World Value Mapping Sequence,43,2
1021,0040A043,Concept Name Code Sequence,29,2


In [43]:
# total mappings IOD x CID x Coded Value
part3_cid_val.shape

(609019, 8)

In [44]:
# Count by coding systems for the coded values
part3_cid_val['system'].value_counts()

system
http://snomed.info/sct                                                 590255
http://dicom.nema.org/resources/ontology/DCM                            12195
http://sig.biostr.washington.edu/projects/fm/AboutFM.html                2928
http://www.nlm.nih.gov/research/umls                                     1170
http://braininfo.rprc.washington.edu/aboutBrainInfo.aspx#NeuroNames       748
doi:10.1016/S0735-1097(99)00126-6                                         544
http://www.itis.gov                                                       513
http://ncit.nci.nih.gov                                                   303
http://www.radlex.org                                                      80
http://unitsofmeasure.org                                                  61
http://www.nlm.nih.gov/research/umls/rxnorm                                40
http://loinc.org                                                           10
Name: count, dtype: int64

In [45]:
# Map each DICOM coding system URI to its OMOP vocabulary_id
# Systems not in this dictionary get NaN; those rows are dropped later,
# when rows without a concept_id_2 are filtered out
mapping = {
    'http://snomed.info/sct': 'SNOMED',
    'http://dicom.nema.org/resources/ontology/DCM': 'DICOM',
    'http://unitsofmeasure.org': 'UCUM',
    'http://loinc.org': 'LOINC',
}

part3_cid_val['vocabulary_id'] = part3_cid_val['system'].map(mapping)
part3_cid_val.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,C41219,http://ncit.nci.nih.gov,
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413490006,http://snomed.info/sct,SNOMED
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413581001,http://snomed.info/sct,SNOMED
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413773004,http://snomed.info/sct,SNOMED
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000322,413600007,http://snomed.info/sct,SNOMED


In [46]:
# Import the concept table from the SQL database
# *** This was run after uploading the DICOM custom concepts ***

import psycopg2

# Connect to your database (fill in your own connection details)
conn = psycopg2.connect(
    database="",
    user="",
    password="",
    host="",
    port="",
    connect_timeout = 6000
)

sql = "select * from dbo.concept"
concept_df = pd.read_sql_query(sql, conn)
concept_df.head()

# Close the connection
conn.close()

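
The cell above loads the whole `concept` table, although the join that follows only uses four vocabularies. One option is to filter in SQL with a parameterized query. The sketch below uses an in-memory SQLite database as a stand-in so that it runs without credentials. The table contents are toy data, and the `?` placeholder is SQLite syntax (psycopg2 uses `%s`).

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the OMOP database, with a few toy rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table concept (concept_id integer, concept_code text, vocabulary_id text)"
)
conn.executemany(
    "insert into concept values (?, ?, ?)",
    [(1, "413490006", "SNOMED"), (2, "mm", "UCUM"), (3, "1000", "RxNorm")],
)

# Only the vocabularies used in the system-to-vocabulary mapping
wanted = ["SNOMED", "DICOM", "UCUM", "LOINC"]
placeholders = ", ".join("?" for _ in wanted)
sql = (
    "select concept_id, concept_code, vocabulary_id "
    f"from concept where vocabulary_id in ({placeholders})"
)
concept_subset = pd.read_sql_query(sql, conn, params=wanted)
conn.close()
print(concept_subset)
```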


In [47]:
# Using the vocabulary ID, find the concept IDs for the coded values
part3_cid_val_concept = part3_cid_val.merge(concept_df[['concept_id', 'concept_name', 'concept_code', 'vocabulary_id']], how = 'left', on = ['concept_code', 'vocabulary_id'])
part3_cid_val_concept = part3_cid_val_concept.rename(columns={'concept_id': 'concept_id_2'})
part3_cid_val_concept['concept_id_2'] = part3_cid_val_concept['concept_id_2'].astype('Int64')

In [95]:
# Check that the join did not add rows (the count should still be 609019)
part3_cid_val_concept.shape

(609019, 11)
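
Comparing row counts before and after the join is one way to detect duplicated keys. pandas can also enforce this during the merge with the `validate` argument, which raises `pandas.errors.MergeError` when the right-hand keys are not unique. The frames below are toy data for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "concept_code": ["A", "A", "B"],
    "vocabulary_id": ["SNOMED", "SNOMED", "SNOMED"],
})
right = pd.DataFrame({
    "concept_code": ["A", "B"],
    "vocabulary_id": ["SNOMED", "SNOMED"],
    "concept_id": [1, 2],
})
keys = ["concept_code", "vocabulary_id"]

# Passes: each (concept_code, vocabulary_id) pair occurs once on the right
merged = left.merge(right, how="left", on=keys, validate="many_to_one")

# Fails: the right side now contains the same key twice
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, how="left", on=keys, validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```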

In [96]:
part3_cid_val_concept.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id,concept_id_2,concept_name
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000323,C41219,http://ncit.nci.nih.gov,,,
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000323,413490006,http://snomed.info/sct,SNOMED,4184966.0,American Indian or Alaska native
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000323,413581001,http://snomed.info/sct,SNOMED,4184984.0,Asian or Pacific islander
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000323,413773004,http://snomed.info/sct,SNOMED,4185154.0,Caucasian
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,102161,6099,2128000323,413600007,http://snomed.info/sct,SNOMED,4186705.0,Australian aborigine


In [48]:
# Inspect mapping results
part3_val_agg = part3_cid_val_concept.groupby(['concept_id_1', 'concept_code'])['concept_id_2'].nunique().reset_index()
print('No concept_id:', part3_val_agg[part3_val_agg['concept_id_2']==0].shape)
print('Exactly one concept_id:', part3_val_agg[part3_val_agg['concept_id_2']==1].shape)
print('Multiple concept_ids:', part3_val_agg[part3_val_agg['concept_id_2']>1].shape)

No concept_id: (219, 3)
Exactly one concept_id: (7101, 3)
Multiple concept_ids: (0, 3)


In [49]:
# Inspect rows without OMOP concept ID 
part3_cid_val_concept[part3_cid_val_concept['concept_id_2'].isna()].groupby('system')['concept_code'].nunique()

system
doi:10.1016/S0735-1097(99)00126-6                                       8
http://braininfo.rprc.washington.edu/aboutBrainInfo.aspx#NeuroNames    11
http://ncit.nci.nih.gov                                                 6
http://sig.biostr.washington.edu/projects/fm/AboutFM.html              43
http://snomed.info/sct                                                 10
http://unitsofmeasure.org                                              21
http://www.itis.gov                                                     3
http://www.nlm.nih.gov/research/umls                                    7
http://www.nlm.nih.gov/research/umls/rxnorm                             1
http://www.radlex.org                                                   2
Name: concept_code, dtype: int64

In [50]:
# Save the list with OMOP concept ID mapped
print(part3_cid_val_concept[~part3_cid_val_concept['concept_id_2'].isna()].shape) #601825
part3_cid_val_concept_omop = part3_cid_val_concept[~part3_cid_val_concept['concept_id_2'].isna()]

(601825, 11)


In [51]:
# Prepare the dataset for Concept_relationship table
columns = ['concept_id_1', 'concept_id_2', 'relationship_id', 'valid_start_date', 'valid_end_date']
concept_relationship_staging = pd.DataFrame(columns = columns)

concept_relationship_staging['concept_id_1'] = part3_cid_val_concept_omop['concept_id_1']
concept_relationship_staging['concept_id_2'] = part3_cid_val_concept_omop['concept_id_2']
concept_relationship_staging['relationship_id'] = 'Maps to value'
concept_relationship_staging['valid_start_date'] = 19930101
concept_relationship_staging['valid_end_date'] = 20991231

concept_relationship_staging = concept_relationship_staging.reset_index(drop=True)
concept_relationship_staging.head()

Unnamed: 0,concept_id_1,concept_id_2,relationship_id,valid_start_date,valid_end_date
0,2128000322,4184966,Maps to value,19930101,20991231
1,2128000322,4184984,Maps to value,19930101,20991231
2,2128000322,4185154,Maps to value,19930101,20991231
3,2128000322,4186705,Maps to value,19930101,20991231
4,2128000322,4185920,Maps to value,19930101,20991231


In [52]:
concept_relationship_staging.shape

(601825, 5)

In [53]:
# Drop duplicated rows
# The Part 3 table repeats attributes because it is organized by IOD and many IODs share modules
# (e.g. attributes in the Patient, General Study, and General Series modules)
concept_relationship_staging.drop_duplicates().shape

(7101, 5)

In [54]:
# Export the relationship table
concept_relationship_staging = concept_relationship_staging.drop_duplicates()
concept_relationship_staging.to_pickle('./files/OMOP CDM Staging/part3_to_part16_relationship_via_CID.pkl')
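
The staging table stores the validity dates as integers in YYYYMMDD form, while the OMOP CDM defines `valid_start_date` and `valid_end_date` as date fields and `CONCEPT_RELATIONSHIP` also has an `invalid_reason` column. Whether a conversion is needed depends on the upload step, which is covered in the other notebook. The following is a sketch on a single toy row of how it could be done if required.

```python
import pandas as pd

# One toy relationship row in the same layout as the staging table
toy_rel = pd.DataFrame({
    "concept_id_1": [2128000322],
    "concept_id_2": [4184966],
    "relationship_id": ["Maps to value"],
    "valid_start_date": [19930101],
    "valid_end_date": [20991231],
})

# Convert YYYYMMDD integers to Python date objects
for col in ["valid_start_date", "valid_end_date"]:
    toy_rel[col] = pd.to_datetime(toy_rel[col].astype(str), format="%Y%m%d").dt.date

# Add the remaining CONCEPT_RELATIONSHIP column, left empty (NULL)
toy_rel["invalid_reason"] = None
print(toy_rel)
```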

### Collect Value Sets other than CIDs

In [55]:
body_part.head()

Unnamed: 0,Coding Scheme Designator,Code Value,Code Meaning,Body Part Examined,SNOMED-RT ID (Retired),FMA Code Value,UMLS Concept UniqueID,concept_id
0,SCT,818981001,Abdomen,ABDOMEN,,,,2128021384
1,SCT,818982008,Abdomen and Pelvis,ABDOMENPELVIS,,,,2128021385
2,SCT,7832008,Abdominal aorta,ABDOMINALAORTA,T-42500,,,2128021386
3,SCT,85856004,Acromioclavicular joint,ACJOINT,T-15420,,,2128021387
4,SCT,23451007,Adrenal gland,ADRENAL,T-B3000,,,2128021388


In [56]:
body_part['Coding Scheme Designator'].value_counts()

Coding Scheme Designator
SCT    313
         5
Name: count, dtype: int64

In [57]:
# Show the body part values that have no standard coding system
body_part[body_part['Coding Scheme Designator']==""]

Unnamed: 0,Coding Scheme Designator,Code Value,Code Meaning,Body Part Examined,SNOMED-RT ID (Retired),FMA Code Value,UMLS Concept UniqueID,concept_id
129,,,Fetal arm,FETALARM,,,,2128021486
130,,,Fetal digit,FETALDIGIT,,,,2128021487
131,,,Fetal heart,FETALHEART,,63931.0,,2128021488
132,,,Fetal leg,FETALLEG,,,,2128021489
133,,,Fetal pole,FETALPOLE,,,,2128021490


In [58]:
body_part = body_part.rename(columns={'concept_id': 'concept_id_1'})

In [59]:
# Join the SCT-coded body parts with the SNOMED concepts in the Concept table
body_part_maps_to = body_part[body_part['Coding Scheme Designator']=="SCT"].merge(
    concept_df[concept_df['vocabulary_id']=="SNOMED"][['concept_code', 'concept_id']]
    , how = 'left', left_on = 'Code Value', right_on = 'concept_code')
body_part_maps_to['concept_id'] = body_part_maps_to['concept_id'].astype('Int64')
body_part_maps_to = body_part_maps_to.rename(columns={'concept_id':'concept_id_2'})
body_part_maps_to['relationship_id'] = 'Maps to'
body_part_maps_to = body_part_maps_to[~body_part_maps_to['concept_id_2'].isna()].copy().reset_index(drop=True)
body_part_maps_to

Unnamed: 0,Coding Scheme Designator,Code Value,Code Meaning,Body Part Examined,SNOMED-RT ID (Retired),FMA Code Value,UMLS Concept UniqueID,concept_id_1,concept_code,concept_id_2,relationship_id
0,SCT,818981001,Abdomen,ABDOMEN,,,,2128021384,818981001,37303869,Maps to
1,SCT,818982008,Abdomen and Pelvis,ABDOMENPELVIS,,,,2128021385,818982008,37303868,Maps to
2,SCT,7832008,Abdominal aorta,ABDOMINALAORTA,T-42500,,,2128021386,7832008,4301737,Maps to
3,SCT,85856004,Acromioclavicular joint,ACJOINT,T-15420,,,2128021387,85856004,4311928,Maps to
4,SCT,23451007,Adrenal gland,ADRENAL,T-B3000,,,2128021388,23451007,4051774,Maps to
...,...,...,...,...,...,...,...,...,...,...,...
302,SCT,29092000,Vein,VEIN,T-48000,,,2128021697,29092000,4104340,Maps to
303,SCT,85234005,Vertebral artery,VERTEBRALA,T-45700,,,2128021698,85234005,4310816,Maps to
304,SCT,45292006,Vulva,VULVA,T-81000,,,2128021699,45292006,4166066,Maps to
305,SCT,74670003,Wrist joint,WRIST,T-15460,,,2128021700,74670003,4254083,Maps to


In [None]:
# Six SNOMED codes are not found in the OMOP CDM Concept table
test = body_part[body_part['Coding Scheme Designator']=="SCT"].merge(
    concept_df[concept_df['vocabulary_id']=="SNOMED"][['concept_code', 'concept_id']]
    , how = 'left', left_on = 'Code Value', right_on = 'concept_code')
test[test['concept_id'].isna()]

Unnamed: 0,Coding Scheme Designator,Code Value,Code Meaning,Body Part Examined,SNOMED-RT ID (Retired),FMA Code Value,UMLS Concept UniqueID,concept_id_1,concept_code,concept_id
46,SCT,1217257000,Cervico-thoracic spine,CTSPINE,,,,2128021430,,
155,SCT,1017210004,Left lumbar region,LLUMBAR,,,,2128021544,,
167,SCT,1217253001,Lumbo-sacral spine,LSSPINE,,,,2128021556,,
200,SCT,1231522001,Pelvis and lower extremities,PELVISLOWEXTREMT,,,,2128021589,,
236,SCT,1017211000,Right lumbar region,RLUMBAR,,,,2128021625,,
284,SCT,1217256009,Thoraco-lumbar spine,TLSPINE,,,,2128021673,,
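
The two cells above run the same left merge twice, once to keep the matched rows and once to list the unmatched ones. As a possible alternative, a single merge with `indicator=True` can yield both subsets. This is only a sketch: `body_part_toy` and `snomed_toy` are hypothetical stand-ins for `body_part` and the SNOMED slice of `concept_df`, and the codes are kept as strings on the assumption that both join keys share one dtype.

```python
import pandas as pd

# Toy stand-ins for body_part and the SNOMED rows of concept_df (hypothetical names).
body_part_toy = pd.DataFrame({
    "Code Value": ["818981001", "1217257000"],
    "Code Meaning": ["Abdomen", "Cervico-thoracic spine"],
    "concept_id_1": [2128021384, 2128021430],
})
snomed_toy = pd.DataFrame({
    "concept_code": ["818981001"],
    "concept_id": [37303869],
})

# One left merge; the _merge column records whether a match was found.
merged = body_part_toy.merge(
    snomed_toy, how="left", left_on="Code Value", right_on="concept_code", indicator=True
)
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(len(matched), len(unmatched))
```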


In [94]:
# Create Concept_relationship table for the Body Part Examined Attribute
body_part_maps_to_value = body_part[['concept_id_1']].copy()
body_part_maps_to_value = body_part_maps_to_value.rename(columns = {'concept_id_1': 'concept_id_2'})
body_part_maps_to_value['concept_id_1'] = attribute_table_omop.loc[attribute_table_omop['concept_name']=="Body Part Examined",'concept_id'].iloc[0] 
body_part_maps_to_value['relationship_id'] = 'Maps to value'
body_part_maps_to_value

Unnamed: 0,concept_id_2,concept_id_1,relationship_id
0,2128021384,2128000783,Maps to value
1,2128021385,2128000783,Maps to value
2,2128021386,2128000783,Maps to value
3,2128021387,2128000783,Maps to value
4,2128021388,2128000783,Maps to value
...,...,...,...
390,2128021697,2128000783,Maps to value
393,2128021698,2128000783,Maps to value
395,2128021699,2128000783,Maps to value
396,2128021700,2128000783,Maps to value


In [95]:
# Create Concept_relationship table for the Anatomic Region Sequence Attribute
body_part_maps_to_value_2 = body_part_maps_to_value[['concept_id_2', 'relationship_id']].copy()
body_part_maps_to_value_2['concept_id_1'] = attribute_table_omop.loc[attribute_table_omop['concept_name']=="Anatomic Region Sequence",'concept_id'].iloc[0] 
body_part_maps_to_value_2

Unnamed: 0,concept_id_2,relationship_id,concept_id_1
0,2128021384,Maps to value,2128000225
1,2128021385,Maps to value,2128000225
2,2128021386,Maps to value,2128000225
3,2128021387,Maps to value,2128000225
4,2128021388,Maps to value,2128000225
...,...,...,...
390,2128021697,Maps to value,2128000225
393,2128021698,Maps to value,2128000225
395,2128021699,Maps to value,2128000225
396,2128021700,Maps to value,2128000225


In [79]:
# Create Concept_relationship table for the Modality Attribute
modality_maps_to_value = combined_values[(combined_values['code'].isin(modality['code']))][['code', 'description', 'concept_id']].copy().reset_index(drop=True)
modality_maps_to_value['relationship_id'] = 'Maps to value'
modality_maps_to_value['concept_id_1'] = attribute_table_omop.loc[attribute_table_omop['concept_name']=="Modality",'concept_id'].iloc[0]
modality_maps_to_value = modality_maps_to_value.rename(columns={'concept_id':'concept_id_2'})
print(modality_maps_to_value.shape)
modality_maps_to_value.head()

(79, 5)


Unnamed: 0,code,description,concept_id_2,relationship_id,concept_id_1
0,ANN,Annotation,2128015281,Maps to value,2128000041
1,AR,Autorefraction,2128015282,Maps to value,2128000041
2,ASMT,Content Assessment Results,2128015283,Maps to value,2128000041
3,AU,Audio,2128015284,Maps to value,2128000041
4,BDUS,Bone Densitometry (ultrasound),2128015285,Maps to value,2128000041


In [80]:
# Create Concept_relationship table for the Patient Position Attribute
patient_position_maps_to_value = combined_values[(combined_values['code'].isin(patient_position['code']))][['code', 'description', 'concept_id']].copy().reset_index(drop=True)
patient_position_maps_to_value['relationship_id'] = 'Maps to value'
patient_position_maps_to_value['concept_id_1'] = attribute_table_omop.loc[attribute_table_omop['concept_name']=="Patient Position",'concept_id'].iloc[0]
patient_position_maps_to_value = patient_position_maps_to_value.rename(columns={'concept_id':'concept_id_2'})
print(patient_position_maps_to_value.shape)
patient_position_maps_to_value.head()

(16, 5)


Unnamed: 0,code,description,concept_id_2,relationship_id,concept_id_1
0,HFP,Head First-Prone,2128015360,Maps to value,2128001088
1,HFS,Head First-Supine,2128015361,Maps to value,2128001088
2,HFDR,Head First-Decubitus Right,2128015362,Maps to value,2128001088
3,HFDL,Head First-Decubitus Left,2128015363,Maps to value,2128001088
4,FFDR,Feet First-Decubitus Right,2128015364,Maps to value,2128001088


In [85]:
# Create Concept_relationship table for the Lossy Image Compression Method Attribute
lossy_image_comp_methods_maps_to_value = combined_values[(combined_values['code'].isin(lossy_image_comp_methods['code']))][['code', 'description', 'concept_id']].copy().reset_index(drop=True)
lossy_image_comp_methods_maps_to_value['relationship_id'] = 'Maps to value'
lossy_image_comp_methods_maps_to_value['concept_id_1'] = attribute_table_omop.loc[attribute_table_omop['concept_name']=="Lossy Image Compression Method",'concept_id'].iloc[0]
lossy_image_comp_methods_maps_to_value = lossy_image_comp_methods_maps_to_value.rename(columns={'concept_id':'concept_id_2'})
print(lossy_image_comp_methods_maps_to_value.shape)
lossy_image_comp_methods_maps_to_value.head()

(8, 5)


Unnamed: 0,code,description,concept_id_2,relationship_id,concept_id_1
0,ISO_10918_1,JPEG Lossy Compression[ISO/IEC 10918-1],2128015376,Maps to value,2128002221
1,ISO_14495_1,JPEG-LS Near-lossless Compression[ISO/IEC 1449...,2128015377,Maps to value,2128002221
2,ISO_15444_1,JPEG 2000 Irreversible Compression[ISO/IEC 154...,2128015378,Maps to value,2128002221
3,ISO_15444_15,High-Throughput JPEG 2000 Irreversible Compres...,2128015379,Maps to value,2128002221
4,ISO_18181_1,JPEG XL Image Coding System - Part 1 Core Codi...,2128015380,Maps to value,2128002221
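
The Modality, Patient Position, and Lossy Image Compression Method cells follow the same pattern: take the value concepts, attach the attribute's concept ID as `concept_id_1`, and label the row `Maps to value`. That pattern could be factored into a small helper. The sketch below is illustrative only; `build_maps_to_value` is a hypothetical function that does not exist in this notebook, and the toy values are copied from the Patient Position output above.

```python
import pandas as pd

def build_maps_to_value(values, attribute_concept_id):
    """Link each value concept to one attribute concept via 'Maps to value'."""
    out = values[["concept_id"]].rename(columns={"concept_id": "concept_id_2"}).copy()
    out["concept_id_1"] = attribute_concept_id
    out["relationship_id"] = "Maps to value"
    # Return the columns in the order used by the combined export.
    return out[["concept_id_1", "concept_id_2", "relationship_id"]].reset_index(drop=True)

# Toy values copied from the Patient Position output above.
patient_position_toy = pd.DataFrame({
    "code": ["HFP", "HFS"],
    "concept_id": [2128015360, 2128015361],
})
result = build_maps_to_value(patient_position_toy, 2128001088)
print(result)
```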


In [86]:
cs_values_maps_to = body_part_maps_to[['concept_id_1', 'concept_id_2', 'relationship_id']].copy()
cs_values_maps_to

Unnamed: 0,concept_id_1,concept_id_2,relationship_id
0,2128021384,37303869,Maps to
1,2128021385,37303868,Maps to
2,2128021386,4301737,Maps to
3,2128021387,4311928,Maps to
4,2128021388,4051774,Maps to
...,...,...,...
302,2128021697,4104340,Maps to
303,2128021698,4310816,Maps to
304,2128021699,4166066,Maps to
305,2128021700,4254083,Maps to


In [87]:
# Export Maps to relationship
cs_values_maps_to.to_csv('./files/OMOP CDM Staging/cs_values_maps_to.csv')

In [96]:
# Combine all maps to value rows
cs_values_maps_to_value = pd.concat([body_part_maps_to_value[['concept_id_1', 'concept_id_2', 'relationship_id']], 
                                     body_part_maps_to_value_2[['concept_id_1', 'concept_id_2', 'relationship_id']],
                                     modality_maps_to_value[['concept_id_1', 'concept_id_2', 'relationship_id']], 
                                     patient_position_maps_to_value[['concept_id_1', 'concept_id_2', 'relationship_id']],
                                     lossy_image_comp_methods_maps_to_value[['concept_id_1', 'concept_id_2', 'relationship_id']]])
cs_values_maps_to_value

Unnamed: 0,concept_id_1,concept_id_2,relationship_id
0,2128000783,2128021384,Maps to value
1,2128000783,2128021385,Maps to value
2,2128000783,2128021386,Maps to value
3,2128000783,2128021387,Maps to value
4,2128000783,2128021388,Maps to value
...,...,...,...
3,2128002221,2128015379,Maps to value
4,2128002221,2128015380,Maps to value
5,2128002221,2128015381,Maps to value
6,2128002221,2128015382,Maps to value


In [97]:
cs_values_maps_to_value['relationship_id'].value_counts()

relationship_id
Maps to value    739
Name: count, dtype: int64
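
Before exporting, it may be worth checking the combined table for missing concept IDs and for repeated (`concept_id_1`, `concept_id_2`, `relationship_id`) triples. The following is a minimal sketch under that assumption; `check_relationships` is a hypothetical helper, and the toy rows reuse IDs from the outputs above with one row deliberately duplicated.

```python
import pandas as pd

def check_relationships(df):
    """Return simple counts describing a concept-relationship frame."""
    keys = ["concept_id_1", "concept_id_2", "relationship_id"]
    return {
        "rows": len(df),
        "null_ids": int(df[["concept_id_1", "concept_id_2"]].isna().any(axis=1).sum()),
        "duplicate_rows": int(df.duplicated(subset=keys).sum()),
    }

# Toy frame: the second row repeats the first on purpose.
toy = pd.DataFrame({
    "concept_id_1": [2128000783, 2128000783, 2128000225],
    "concept_id_2": [2128021384, 2128021384, 2128021384],
    "relationship_id": ["Maps to value"] * 3,
})
report = check_relationships(toy)
print(report)
```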

In [98]:
# Export Maps to value relationship
cs_values_maps_to_value.to_csv('./files/OMOP CDM Staging/cs_values_maps_to_value.csv')