# DICOM to OMOP: create custom vocabularies

This notebook creates a custom OMOP vocabulary, `DICOM`, covering DICOM attributes and value sets (DICOM terminology, Enumerated Values, Defined Terms). It has two parts:

1. Restructure data: DICOM harvest to OMOP structure
2. DML Scripts using Python & SQL
   1. Update `VOCABULARY` table
   2. Update `CONCEPT_CLASS` table
   3. Update `CONCEPT` table

Links to OMOP CDM
- [Create a custom vocabulary](https://forums.ohdsi.org/t/how-to-add-a-custom-vocabulary-to-the-omop-vocabulary-table/12440/3)
- [OMOP CDM v.5.4 VOCABULARY](https://ohdsi.github.io/CommonDataModel/cdm54.html#vocabulary)
- [OMOP CDM v.5.4 CONCEPT_CLASS](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_class)
- [OMOP CDM v.5.4 CONCEPT](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept)

Links to DICOM Standards
- [Value Representation (VR)](https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html)

## 1. Restructure data: DICOM harvest to OMOP Structure

<blockquote>
<strong>This section is for reference only. You can skip it and use the flat files in the `files` directory of this repository. The instructions for updating your OMOP database are in section 2, at the end of this notebook.</strong>
</blockquote>

In [None]:
import pandas as pd

attributes = pd.read_csv("./files/part6_attributes.csv")
valuesets = pd.read_csv("./files/part16_fhir_valuesets.csv")

In [6]:
# DICOM attributes
# concept_id: 2128000000 + sequential number in range of 10-5999
# concept_name: 'Name'
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Attributes'
# standard_concept: NULL
# concept_code: Tag
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL


In [29]:
included_VR = ['AT', 'CS', 'DA', 'DT', 'DS', 'FL', 'FD', 'IS', 'SL', 'SS', 'SV', 'TM', 'UL', 'US', 'UV']
attributes_included = attributes[attributes['VR'].isin(included_VR)]
attributes_included #2824

Unnamed: 0,Tag,Name,Keyword,VR,VM
0,"(0008,0001)",Length to End,Length​To​End,UL,1
1,"(0008,0005)",Specific Character Set,Specific​Character​Set,CS,1-n
3,"(0008,0008)",Image Type,Image​Type,CS,2-n
5,"(0008,0012)",Instance Creation Date,Instance​Creation​Date,DA,1
6,"(0008,0013)",Instance Creation Time,Instance​Creation​Time,TM,1
...,...,...,...,...,...
5051,"(60xx,1301)",ROI Area,ROI​Area,IS,1
5052,"(60xx,1302)",ROI Mean,ROI​Mean,DS,1
5053,"(60xx,1303)",ROI Standard Deviation,ROI​Standard​Deviation,DS,1
5059,"(7FE0,0003)",Encapsulated Pixel Data Value Total Length,Encapsulated​Pixel​Data​Value​Total​Length,UV,1


In [31]:
import pandas as pd

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
attributes_included = attributes_included.copy()

sequential_numbers = range(10, len(attributes_included)+10)
attributes_included['concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
attribute_table_omop = pd.DataFrame(columns = columns)

attribute_table_omop['concept_id'] = attributes_included['concept_id']
attribute_table_omop['concept_name'] = attributes_included['Name']
attribute_table_omop['domain_id'] = 'Measurement'
attribute_table_omop['vocabulary_id'] = 'DICOM'
attribute_table_omop['concept_class_id'] = 'DICOM Attributes'
attribute_table_omop['concept_code'] = attributes_included['Tag']
attribute_table_omop['valid_start_date'] = 19930101
attribute_table_omop['valid_end_date'] = 20991231

# standard_concept and invalid_reason are left empty (NULL)
attribute_table_omop = attribute_table_omop.reset_index(drop=True)


In [39]:
attribute_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000010,Length to End,Measurement,DICOM,DICOM Attributes,,"(0008,0001)",19930101,20991231,
1,2128000011,Specific Character Set,Measurement,DICOM,DICOM Attributes,,"(0008,0005)",19930101,20991231,
2,2128000012,Image Type,Measurement,DICOM,DICOM Attributes,,"(0008,0008)",19930101,20991231,
3,2128000013,Instance Creation Date,Measurement,DICOM,DICOM Attributes,,"(0008,0012)",19930101,20991231,
4,2128000014,Instance Creation Time,Measurement,DICOM,DICOM Attributes,,"(0008,0013)",19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
2819,2128002829,ROI Area,Measurement,DICOM,DICOM Attributes,,"(60xx,1301)",19930101,20991231,
2820,2128002830,ROI Mean,Measurement,DICOM,DICOM Attributes,,"(60xx,1302)",19930101,20991231,
2821,2128002831,ROI Standard Deviation,Measurement,DICOM,DICOM Attributes,,"(60xx,1303)",19930101,20991231,
2822,2128002832,Encapsulated Pixel Data Value Total Length,Measurement,DICOM,DICOM Attributes,,"(7FE0,0003)",19930101,20991231,


In [None]:
# DICOM Value Sets
# concept_id: 2128000000 + sequential number in range of 6000-999999
# concept_name: display
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept', 'Condition', 'Observation'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Value Sets'
# standard_concept: NULL
# concept_code: code
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL
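
The two numbering schemes described in the comments (offsets 10-5999 for attributes, 6000-999999 for value sets) can be sketched as a small helper that also guards against a block overflowing its reserved range. The names `BASE_CONCEPT_ID`, `RESERVED_OFFSETS`, and `allocate_concept_ids` are hypothetical and are not used elsewhere in this notebook; the row counts are the ones noted in the filter cells (2824 and 5223).

```python
BASE_CONCEPT_ID = 2128000000  # base value used throughout this notebook

# Offsets reserved per concept class, copied from the comments above.
RESERVED_OFFSETS = {
    'DICOM Attributes': (10, 5999),
    'DICOM Value Sets': (6000, 999999),
}


def allocate_concept_ids(concept_class_id, n_rows):
    """Return n_rows consecutive concept_ids for the class.

    Raises ValueError if n_rows does not fit in the reserved block.
    """
    first, last = RESERVED_OFFSETS[concept_class_id]
    capacity = last - first + 1
    if n_rows > capacity:
        raise ValueError(
            f'{concept_class_id}: {n_rows} rows exceed the {capacity} reserved ids'
        )
    return [BASE_CONCEPT_ID + offset for offset in range(first, first + n_rows)]


attribute_ids = allocate_concept_ids('DICOM Attributes', 2824)
valueset_ids = allocate_concept_ids('DICOM Value Sets', 5223)

print(attribute_ids[0], attribute_ids[-1])  # 2128000010 2128002833
print(valueset_ids[0], valueset_ids[-1])    # 2128006000 2128011222
```

Keeping the ranges in one place makes it easier to notice when a future DICOM release adds enough attributes to spill into the value-set block.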

In [8]:
valuesets_dicom = valuesets[valuesets['system']=='http://dicom.nema.org/resources/ontology/DCM']
valuesets_dicom #5223

Unnamed: 0,code,display,system,id,version,status,description
31,110504,Patient died,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...
32,110515,Patient condition prevented continuing,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...
33,110503,Patient allergic to media/contrast,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...
34,110514,Incorrect worklist entry selected,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...
35,110502,Incorrect procedure ordered,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...
...,...,...,...,...,...,...,...
26820,128129,Plane through Posterior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...
26821,128128,Plane through Anterior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...
26822,128130,Plane through Center,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...
26823,128121,Plane through Inferior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...


In [34]:
import pandas as pd

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
valuesets_dicom = valuesets_dicom.copy()

sequential_numbers = range(6000, len(valuesets_dicom)+6000)
valuesets_dicom['concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
valuesets_table_omop = pd.DataFrame(columns = columns)

valuesets_table_omop['concept_id'] = valuesets_dicom['concept_id']
valuesets_table_omop['concept_name'] = valuesets_dicom['display']
valuesets_table_omop['domain_id'] = 'Measurement'
valuesets_table_omop['vocabulary_id'] = 'DICOM'
valuesets_table_omop['concept_class_id'] = 'DICOM Value Sets'
valuesets_table_omop['concept_code'] = valuesets_dicom['code']
valuesets_table_omop['valid_start_date'] = 19930101
valuesets_table_omop['valid_end_date'] = 20991231

# standard_concept and invalid_reason are left empty (NULL)
valuesets_table_omop = valuesets_table_omop.reset_index(drop=True)


In [40]:
valuesets_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128006000,Patient died,Measurement,DICOM,DICOM Value Sets,,110504,19930101,20991231,
1,2128006001,Patient condition prevented continuing,Measurement,DICOM,DICOM Value Sets,,110515,19930101,20991231,
2,2128006002,Patient allergic to media/contrast,Measurement,DICOM,DICOM Value Sets,,110503,19930101,20991231,
3,2128006003,Incorrect worklist entry selected,Measurement,DICOM,DICOM Value Sets,,110514,19930101,20991231,
4,2128006004,Incorrect procedure ordered,Measurement,DICOM,DICOM Value Sets,,110502,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
5218,2128011218,Plane through Posterior Extent,Measurement,DICOM,DICOM Value Sets,,128129,19930101,20991231,
5219,2128011219,Plane through Anterior Extent,Measurement,DICOM,DICOM Value Sets,,128128,19930101,20991231,
5220,2128011220,Plane through Center,Measurement,DICOM,DICOM Value Sets,,128130,19930101,20991231,
5221,2128011221,Plane through Inferior Extent,Measurement,DICOM,DICOM Value Sets,,128121,19930101,20991231,


In [41]:
omop_table_staging = pd.concat([attribute_table_omop, valuesets_table_omop], ignore_index=True)
omop_table_staging.to_csv('./files/omop_table_staging.csv', index=False)
omop_table_staging

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000010,Length to End,Measurement,DICOM,DICOM Attributes,,"(0008,0001)",19930101,20991231,
1,2128000011,Specific Character Set,Measurement,DICOM,DICOM Attributes,,"(0008,0005)",19930101,20991231,
2,2128000012,Image Type,Measurement,DICOM,DICOM Attributes,,"(0008,0008)",19930101,20991231,
3,2128000013,Instance Creation Date,Measurement,DICOM,DICOM Attributes,,"(0008,0012)",19930101,20991231,
4,2128000014,Instance Creation Time,Measurement,DICOM,DICOM Attributes,,"(0008,0013)",19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
8042,2128011218,Plane through Posterior Extent,Measurement,DICOM,DICOM Value Sets,,128129,19930101,20991231,
8043,2128011219,Plane through Anterior Extent,Measurement,DICOM,DICOM Value Sets,,128128,19930101,20991231,
8044,2128011220,Plane through Center,Measurement,DICOM,DICOM Value Sets,,128130,19930101,20991231,
8045,2128011221,Plane through Inferior Extent,Measurement,DICOM,DICOM Value Sets,,128121,19930101,20991231,
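
Before loading the staging file into the database, it may be worth running a few sanity checks on it. The sketch below is one possible set of checks, not part of the original workflow: it uses only the standard library, reads three rows copied from the output above from an in-memory string (a real run would open `./files/omop_table_staging.csv`), and the function name `find_problems` is hypothetical. It assumes the OMOP convention that local concepts use `concept_id` values above 2,000,000,000 and the CDM v5.4 column sizes of 255 characters for `concept_name` and 50 for `concept_code`. Whether duplicate codes actually occur in the value-set file is not verified here.

```python
import csv
import io
from collections import Counter

# Three rows copied from the staging output; stands in for the real CSV file.
sample_csv = io.StringIO(
    'concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,'
    'standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason\n'
    '2128000010,Length to End,Measurement,DICOM,DICOM Attributes,,"(0008,0001)",19930101,20991231,\n'
    '2128000011,Specific Character Set,Measurement,DICOM,DICOM Attributes,,"(0008,0005)",19930101,20991231,\n'
    '2128006000,Patient died,Measurement,DICOM,DICOM Value Sets,,110504,19930101,20991231,\n'
)
rows = list(csv.DictReader(sample_csv))


def find_problems(rows):
    """Return a list of readable problems; an empty list means no check failed."""
    problems = []

    # Each concept_id must be unique.
    id_counts = Counter(row['concept_id'] for row in rows)
    for concept_id, count in id_counts.items():
        if count > 1:
            problems.append(f'concept_id {concept_id} appears {count} times')

    # The same code should not be listed twice within one concept class.
    code_counts = Counter((row['concept_class_id'], row['concept_code']) for row in rows)
    for (concept_class_id, concept_code), count in code_counts.items():
        if count > 1:
            problems.append(f'concept_code {concept_code} appears {count} times in {concept_class_id}')

    for row in rows:
        if int(row['concept_id']) <= 2_000_000_000:
            problems.append(f"concept_id {row['concept_id']} is outside the local range")
        if len(row['concept_name']) > 255:
            problems.append(f"concept_name too long for concept_id {row['concept_id']}")
        if len(row['concept_code']) > 50:
            problems.append(f"concept_code too long for concept_id {row['concept_id']}")
    return problems


print(find_problems(rows))  # [] for the three sample rows
```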


## 2. DML Scripts using Python & SQL

In [None]:
import pyodbc

# Connect to your database
server = 'server_name'
database = 'database_name'
# This example uses a trusted connection (Windows authentication). To log in with a user name
# and password instead, replace ';Trusted_Connection=yes;' with ';UID=<user_name>;PWD=<password>;Encrypt=no'
conn = pyodbc.connect('Driver={SQL Server};Server=' + server + ';Database=' + database + ';Trusted_Connection=yes;')
cursor = conn.cursor()

# Update VOCABULARY
sql = '''
    INSERT INTO dbo.VOCABULARY (vocabulary_id, vocabulary_name, vocabulary_reference, vocabulary_version, vocabulary_concept_id)
    VALUES ('DICOM', 'Digital Imaging and Communications in Medicine (National Electrical Manufacturers Association)',  'https://www.dicomstandard.org/current', 'NEMA Standard PS3', 2128000000)
    '''
cursor.execute(sql)
conn.commit()
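
Running the insert above a second time would either fail on a key constraint or, where constraints have not been applied, leave a duplicate `DICOM` row. One way to make the step repeatable is to insert only when the row is missing. The sketch below illustrates the pattern with an in-memory SQLite database standing in for SQL Server and a cut-down `vocabulary` table; the names `insert_if_missing` and `dicom_row` are hypothetical, and the statement has not been tested against SQL Server through pyodbc.

```python
import sqlite3

# In-memory SQLite stands in for the SQL Server connection; the table holds
# only the columns used by the VOCABULARY insert in this notebook.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE vocabulary (
        vocabulary_id TEXT,
        vocabulary_name TEXT,
        vocabulary_reference TEXT,
        vocabulary_version TEXT,
        vocabulary_concept_id INTEGER
    )
''')

# Insert the row only if no row with the same vocabulary_id exists yet.
insert_if_missing = '''
    INSERT INTO vocabulary (vocabulary_id, vocabulary_name, vocabulary_reference,
                            vocabulary_version, vocabulary_concept_id)
    SELECT ?, ?, ?, ?, ?
    WHERE NOT EXISTS (SELECT 1 FROM vocabulary WHERE vocabulary_id = ?)
'''
dicom_row = (
    'DICOM',
    'Digital Imaging and Communications in Medicine (National Electrical Manufacturers Association)',
    'https://www.dicomstandard.org/current',
    'NEMA Standard PS3',
    2128000000,
)

# Execute twice: the second execution finds the row and inserts nothing.
for _ in range(2):
    conn.execute(insert_if_missing, dicom_row + (dicom_row[0],))
conn.commit()

row_count = conn.execute('SELECT COUNT(*) FROM vocabulary').fetchone()[0]
print(row_count)  # 1
conn.close()
```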

In [None]:
# Update CONCEPT_CLASS
sql = '''
    INSERT INTO dbo.CONCEPT_CLASS (concept_class_id, concept_class_name, concept_class_concept_id)
    VALUES ('DICOM Attributes', 'DICOM Attributes', 2128000001),
           ('DICOM Value Sets', 'DICOM Value Sets', 2128000002)
    '''
cursor.execute(sql)
conn.commit()

In [None]:
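# Load the file for DICOM attributes and value sets
import pandas as pd  # needed when section 1 was skipped

# Read concept_code as text so that tags and numeric codes share one type
omop_table_staging = pd.read_csv('./files/omop_table_staging.csv', dtype={'concept_code': str})

# Convert the YYYYMMDD integers to Python dates for the date columns of CONCEPT
for date_column in ['valid_start_date', 'valid_end_date']:
    omop_table_staging[date_column] = pd.to_datetime(omop_table_staging[date_column].astype(str), format='%Y%m%d').dt.date

# Replace NaN with None so that empty fields (standard_concept, invalid_reason) are sent as NULL
omop_table_staging = omop_table_staging.astype(object).where(omop_table_staging.notna(), None)

# Update CONCEPT
sql = '''
    INSERT INTO dbo.concept (concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason) 
    VALUES (?,?,?,?,?,?,?,?,?,?)
    '''
# Select the columns explicitly so that their order matches the INSERT column list
concept_columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id',
                   'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date', 'invalid_reason']
rows = list(omop_table_staging[concept_columns].itertuples(index=False, name=None))
cursor.executemany(sql, rows)

conn.commit()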
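
After the load, a grouped count is a quick way to confirm that both concept classes arrived. The query below is a sketch under assumptions: SQLite stands in for SQL Server, the `concept` table is reduced to five columns, and only three sample rows are loaded. Against the real database the same `SELECT` would be run through the pyodbc cursor, and the counts should match the row counts noted in section 1 (2824 attributes and 5223 value-set entries).

```python
import sqlite3

# In-memory stand-in for the OMOP database with a reduced CONCEPT table.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE concept (
        concept_id INTEGER,
        concept_name TEXT,
        vocabulary_id TEXT,
        concept_class_id TEXT,
        concept_code TEXT
    )
''')
conn.executemany(
    'INSERT INTO concept VALUES (?, ?, ?, ?, ?)',
    [
        (2128000010, 'Length to End', 'DICOM', 'DICOM Attributes', '(0008,0001)'),
        (2128000011, 'Specific Character Set', 'DICOM', 'DICOM Attributes', '(0008,0005)'),
        (2128006000, 'Patient died', 'DICOM', 'DICOM Value Sets', '110504'),
    ],
)

# Count the loaded concepts per class and report the concept_id range of each.
check_sql = '''
    SELECT concept_class_id, COUNT(*), MIN(concept_id), MAX(concept_id)
    FROM concept
    WHERE vocabulary_id = 'DICOM'
    GROUP BY concept_class_id
    ORDER BY concept_class_id
'''
summary = conn.execute(check_sql).fetchall()
for row in summary:
    print(row)
# ('DICOM Attributes', 2, 2128000010, 2128000011)
# ('DICOM Value Sets', 1, 2128006000, 2128006000)
conn.close()
```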

In [None]:
# close the cursor and connection
cursor.close()
conn.close()