# DICOM to OMOP: create custom vocabularies

This notebook creates custom OMOP vocabularies for DICOM Attributes and Value Sets in three steps:

1. Restructure the data: DICOM harvest to OMOP structure
2. DML Scripts using Python & SQL
   1. Update `VOCABULARY` table
   2. Update `CONCEPT_CLASS` table
   3. Update `CONCEPT` table
3. Update `CONCEPT_RELATIONSHIP` table from DICOM Standard Part 3

Links to OMOP CDM
- [Create a custom vocabulary](https://forums.ohdsi.org/t/how-to-add-a-custom-vocabulary-to-the-omop-vocabulary-table/12440/3)
- [OMOP CDM v.5.4 VOCABULARY](https://ohdsi.github.io/CommonDataModel/cdm54.html#vocabulary)
- [OMOP CDM v.5.4 CONCEPT_CLASS](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_class)
- [OMOP CDM v.5.4 CONCEPT](https://ohdsi.github.io/CommonDataModel/cdm54.html#concept)

Links to DICOM Standards
- [Value Representation (VR)](https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html)

## 1. Restructure data: DICOM harvest to OMOP Structure

<blockquote>
<strong>This section is for reference only. You can skip it and use the flat files in the <code>files</code> directory of this repository. The instructions for updating your OMOP database are at the end of this notebook.</strong>
</blockquote>

In [3]:
import pandas as pd

attributes = pd.read_csv("./files/DICOM Standard/part6_attributes.csv")
valuesets = pd.read_csv("./files/DICOM Standard/part16_fhir_valuesets.csv")
part3 = pd.read_pickle('./files/DICOM Standard/part3_mapping.pkl')

In [6]:
# DICOM attributes
# concept_id: 2128000000 + sequential number in range of 10-5999
# concept_name: 'Name'
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Attributes'
# standard_concept: NULL
# concept_code: Tag
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL
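
The numbering scheme described in the comments above can be sketched as a small helper. This is only an illustration: the function name `allocate_concept_ids`, the `RANGES` dictionary, and the overflow check are assumptions and do not appear in the notebook itself.

```python
BASE_ID = 2_128_000_000  # base concept_id used for the custom DICOM concepts

# Sequential-number ranges taken from the comments in this notebook.
RANGES = {
    "DICOM Attributes": (10, 5_999),
    "DICOM Value Sets": (6_000, 999_999),
}


def allocate_concept_ids(n, concept_class_id, start=None):
    """Return n consecutive concept_ids for a class; fail if they leave the range."""
    low, high = RANGES[concept_class_id]
    first = low if start is None else start
    last = first + n - 1
    if first < low or last > high:
        raise ValueError(f"{n} ids starting at {first} do not fit in {low}-{high}")
    return [BASE_ID + num for num in range(first, last + 1)]


attribute_ids = allocate_concept_ids(3, "DICOM Attributes")
value_set_ids = allocate_concept_ids(2, "DICOM Value Sets")
print(attribute_ids, value_set_ids)
```

Failing early on an overflow keeps attribute IDs from spilling into the block reserved for value sets.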

In [4]:
part3_att = part3[part3['CID']!=''].merge(attributes, left_on = 'Tag', right_on = 'Tag_cleaned', how = 'left')

In [5]:
attributes_cid = part3_att['Tag_x'].unique()

In [6]:
included_VR = ['AT', 'CS', 'DA', 'DT', 'DS', 'FL', 'FD', 'IS', 'SL', 'SS', 'SV', 'TM', 'UL', 'US', 'UV']
attributes_included = attributes[(attributes['VR'].isin(included_VR)) | (attributes['Tag_cleaned'].isin(attributes_cid))]
attributes_included #2915 -> 2983

Unnamed: 0,Tag,Name,Keyword,VR,VM,Unnamed: 5,Tag_cleaned
0,"(0008,0001)",Length to End,Length​To​End,UL,1,RET,00080001
1,"(0008,0005)",Specific Character Set,Specific​Character​Set,CS,1-n,,00080005
3,"(0008,0008)",Image Type,Image​Type,CS,2-n,,00080008
5,"(0008,0012)",Instance Creation Date,Instance​Creation​Date,DA,1,,00080012
6,"(0008,0013)",Instance Creation Time,Instance​Creation​Time,TM,1,,00080013
...,...,...,...,...,...,...,...
5165,"(60xx,1301)",ROI Area,ROI​Area,IS,1,,60xx1301
5166,"(60xx,1302)",ROI Mean,ROI​Mean,DS,1,,60xx1302
5167,"(60xx,1303)",ROI Standard Deviation,ROI​Standard​Deviation,DS,1,,60xx1303
5173,"(7FE0,0003)",Encapsulated Pixel Data Value Total Length,Encapsulated​Pixel​Data​Value​Total​Length,UV,1,,7FE00003


In [7]:
import pandas as pd

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
attributes_included = attributes_included.copy()

sequential_numbers = range(10, len(attributes_included)+10)
attributes_included['concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
attribute_table_omop = pd.DataFrame(columns = columns)

attribute_table_omop['concept_id'] = attributes_included['concept_id']
attribute_table_omop['concept_name'] = attributes_included['Name']
attribute_table_omop['domain_id'] = 'Measurement'
attribute_table_omop['vocabulary_id'] = 'DICOM'
attribute_table_omop['concept_class_id'] = 'DICOM Attributes'
attribute_table_omop['concept_code'] = attributes_included['Tag_cleaned']
attribute_table_omop['valid_start_date'] = 19930101
attribute_table_omop['valid_end_date'] = 20991231

attribute_table_omop = attribute_table_omop.reset_index(drop=True)


In [8]:
attribute_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000010,Length to End,Measurement,DICOM,DICOM Attributes,,00080001,19930101,20991231,
1,2128000011,Specific Character Set,Measurement,DICOM,DICOM Attributes,,00080005,19930101,20991231,
2,2128000012,Image Type,Measurement,DICOM,DICOM Attributes,,00080008,19930101,20991231,
3,2128000013,Instance Creation Date,Measurement,DICOM,DICOM Attributes,,00080012,19930101,20991231,
4,2128000014,Instance Creation Time,Measurement,DICOM,DICOM Attributes,,00080013,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
2978,2128002988,ROI Area,Measurement,DICOM,DICOM Attributes,,60xx1301,19930101,20991231,
2979,2128002989,ROI Mean,Measurement,DICOM,DICOM Attributes,,60xx1302,19930101,20991231,
2980,2128002990,ROI Standard Deviation,Measurement,DICOM,DICOM Attributes,,60xx1303,19930101,20991231,
2981,2128002991,Encapsulated Pixel Data Value Total Length,Measurement,DICOM,DICOM Attributes,,7FE00003,19930101,20991231,
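
The same column-by-column construction is repeated for each source table in this notebook. A minimal sketch of how it could be factored into one function is shown here; the name `to_omop_concepts` and its parameters are hypothetical, and the default `domain_id` simply mirrors the value used in the cells of this notebook.

```python
import pandas as pd

OMOP_CONCEPT_COLUMNS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id", "concept_class_id",
    "standard_concept", "concept_code", "valid_start_date", "valid_end_date",
    "invalid_reason",
]


def to_omop_concepts(df, name_col, code_col, concept_class_id, first_concept_id,
                     domain_id="Measurement"):
    """Build a CONCEPT-shaped DataFrame with consecutive concept_ids."""
    out = pd.DataFrame({
        "concept_id": range(first_concept_id, first_concept_id + len(df)),
        "concept_name": df[name_col].to_numpy(),
        "domain_id": domain_id,
        "vocabulary_id": "DICOM",
        "concept_class_id": concept_class_id,
        "standard_concept": None,
        "concept_code": df[code_col].to_numpy(),
        "valid_start_date": 19930101,
        "valid_end_date": 20991231,
        "invalid_reason": None,
    })
    return out[OMOP_CONCEPT_COLUMNS]


# Two rows copied from the attribute table shown above
demo_attributes = pd.DataFrame({
    "Name": ["Length to End", "Specific Character Set"],
    "Tag_cleaned": ["00080001", "00080005"],
})
demo_concepts = to_omop_concepts(
    demo_attributes, "Name", "Tag_cleaned", "DICOM Attributes", 2128000010
)
print(demo_concepts)
```

Building the frame from a dictionary avoids index alignment between the source and target frames, so no `reset_index` call is needed afterwards.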


In [None]:
# DICOM Value Sets
# concept_id: 2128000000 + sequential number in range of 6000-999999
# concept_name: display
# domain_id: Candidates - 'Measurement', 'Meas Value', 'Meas/Procedure', 'Type Concept', 'Condition', 'Observation'
# vocabulary_id: 'DICOM'
# concept_class_id: 'DICOM Value Sets'
# standard_concept: NULL
# concept_code: code
# valid_start_date: 19930101
# valid_end_date: 20991231
# invalid_reason: NULL

In [9]:
valuesets.shape

(26825, 8)

In [10]:
valuesets[["code", "display", "system"]].drop_duplicates().shape

(13966, 3)

In [11]:
valuesets_dicom = valuesets[valuesets['system']=='http://dicom.nema.org/resources/ontology/DCM']
valuesets_dicom #5223

Unnamed: 0,code,display,system,id,version,status,description,cid
31,110504,Patient died,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
32,110515,Patient condition prevented continuing,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
33,110503,Patient allergic to media/contrast,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
34,110514,Incorrect worklist entry selected,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
35,110502,Incorrect procedure ordered,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-9301-ModalityPPSDiscontinuationReason,20140419,active,Transitive closure of CID 9301 ModalityPPSDisc...,9301
...,...,...,...,...,...,...,...,...
26820,128129,Plane through Posterior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26821,128128,Plane through Anterior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26822,128130,Plane through Center,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010
26823,128121,Plane through Inferior Extent,http://dicom.nema.org/resources/ontology/DCM,dicom-cid-1010-ReferenceGeometryPlane,20160905,active,Transitive closure of CID 1010 ReferenceGeomet...,1010


In [12]:
valuesets_unique = valuesets_dicom[["code", "display", "system"]].drop_duplicates().reset_index(drop=True)
valuesets_unique.shape #3295

(3295, 3)

In [13]:
import numpy as np
import pandas as pd

sequential_numbers = range(6000, len(valuesets_unique)+6000)
valuesets_unique.loc[:, 'concept_id'] = [2128000000 + num for num in sequential_numbers]

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
valuesets_table_omop = pd.DataFrame(columns = columns)

valuesets_table_omop['concept_id'] = valuesets_unique['concept_id']
valuesets_table_omop['concept_name'] = valuesets_unique['display']
valuesets_table_omop['domain_id'] = 'Measurement'
valuesets_table_omop['vocabulary_id'] = 'DICOM'
valuesets_table_omop['concept_class_id'] = 'DICOM Value Sets'
valuesets_table_omop['concept_code'] = valuesets_unique['code']
valuesets_table_omop['valid_start_date'] = 19930101
valuesets_table_omop['valid_end_date'] = 20991231

valuesets_table_omop = valuesets_table_omop.reset_index(drop=True)

In [14]:
valuesets_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128006000,Patient died,Measurement,DICOM,DICOM Value Sets,,110504,19930101,20991231,
1,2128006001,Patient condition prevented continuing,Measurement,DICOM,DICOM Value Sets,,110515,19930101,20991231,
2,2128006002,Patient allergic to media/contrast,Measurement,DICOM,DICOM Value Sets,,110503,19930101,20991231,
3,2128006003,Incorrect worklist entry selected,Measurement,DICOM,DICOM Value Sets,,110514,19930101,20991231,
4,2128006004,Incorrect procedure ordered,Measurement,DICOM,DICOM Value Sets,,110502,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
3290,2128009290,Plane through Posterior Extent,Measurement,DICOM,DICOM Value Sets,,128129,19930101,20991231,
3291,2128009291,Plane through Anterior Extent,Measurement,DICOM,DICOM Value Sets,,128128,19930101,20991231,
3292,2128009292,Plane through Center,Measurement,DICOM,DICOM Value Sets,,128130,19930101,20991231,
3293,2128009293,Plane through Inferior Extent,Measurement,DICOM,DICOM Value Sets,,128121,19930101,20991231,


In [15]:
import pandas as pd

modality = pd.read_csv('./files/DICOM Standard/part3_modality.csv')
patient_position = pd.read_csv('./files/DICOM Standard/part3_patient_position.csv')
lossy_image_comp_methods = pd.read_csv('./files/DICOM Standard/part3_lossy_image_comp_methods.csv')
other_values = pd.read_csv('./files/DICOM Standard/part3_other_values.csv')
body_part = pd.read_pickle('./files/DICOM Standard/part16_body_part_examined.pkl')

In [16]:
other_values['concept_id'] = pd.to_numeric(other_values['concept_id'], errors='coerce').astype('Int64')
modality['concept_id'] = pd.to_numeric(modality['concept_id'], errors='coerce').astype('Int64')

In [17]:
combined_values = pd.concat([modality, patient_position, lossy_image_comp_methods, other_values[['code', 'description', 'concept_id']]])
combined_values = combined_values.rename(columns={'concept_id': 'syn_concept_id'})
combined_values

Unnamed: 0,code,description,syn_concept_id
0,ANN,Annotation,
1,AR,Autorefraction,
2,ASMT,Content Assessment Results,
3,AU,Audio,
4,BDUS,Bone Densitometry (ultrasound),
...,...,...,...
8,INVERSE,Inverse,4114662
9,U,Unpaired,
10,B,Both,45883500
11,00,image has not been subjected to lossy compression,


In [18]:
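# Start after the last value-set concept_id. Because the range below also starts at 6000,
# the new IDs begin at max + 1 + 6000 (2128015295), which leaves a gap of 6000 unused IDs.
index = valuesets_table_omop['concept_id'].max() + 1
sequential_numbers = range(6000, len(combined_values)+6000)
combined_values.loc[:, 'concept_id'] = [index + num for num in sequential_numbers]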

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
cs_value_table_omop = pd.DataFrame(columns = columns)

cs_value_table_omop['concept_id'] = combined_values['concept_id']
cs_value_table_omop['concept_name'] = combined_values['description']
cs_value_table_omop['domain_id'] = 'Measurement'
cs_value_table_omop['vocabulary_id'] = 'DICOM'
cs_value_table_omop['concept_class_id'] = 'DICOM Value Sets'
cs_value_table_omop['concept_code'] = combined_values['code']
cs_value_table_omop['valid_start_date'] = 19930101
cs_value_table_omop['valid_end_date'] = 20991231

cs_value_table_omop = cs_value_table_omop.reset_index(drop=True)

In [19]:
cs_value_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128015295,Annotation,Measurement,DICOM,DICOM Value Sets,,ANN,19930101,20991231,
1,2128015296,Autorefraction,Measurement,DICOM,DICOM Value Sets,,AR,19930101,20991231,
2,2128015297,Content Assessment Results,Measurement,DICOM,DICOM Value Sets,,ASMT,19930101,20991231,
3,2128015298,Audio,Measurement,DICOM,DICOM Value Sets,,AU,19930101,20991231,
4,2128015299,Bone Densitometry (ultrasound),Measurement,DICOM,DICOM Value Sets,,BDUS,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
111,2128015406,Inverse,Measurement,DICOM,DICOM Value Sets,,INVERSE,19930101,20991231,
112,2128015407,Unpaired,Measurement,DICOM,DICOM Value Sets,,U,19930101,20991231,
113,2128015408,Both,Measurement,DICOM,DICOM Value Sets,,B,19930101,20991231,
114,2128015409,image has not been subjected to lossy compression,Measurement,DICOM,DICOM Value Sets,,00,19930101,20991231,


### For Concept_relationship

In [27]:
# Pivot (melt) the DataFrame to go from wide to long format
other_values_long = pd.melt(other_values, id_vars=['code', 'description', 'concept_id'], 
                  value_vars=['tag_1', 'tag_2', 'tag_3'], 
                  var_name='tag_type', value_name='tags')

# Drop the 'tag_type' column as it is not needed in the final format
other_values_long = other_values_long.drop('tag_type', axis=1)
other_values_long = other_values_long[~other_values_long['tags'].isna()]
other_values_long['Tag'] = other_values_long['tags'].str.replace(r'[(),]', '', regex = True)
other_values_long = other_values_long.rename(columns={'concept_id': 'syn_concept_id'})
other_values_long.head()

Unnamed: 0,code,description,syn_concept_id,tags,Tag
0,R,Right,4080761.0,"(0020,0060)",00200060
1,L,Left,4300877.0,"(0020,0060)",00200060
2,BIPED,BIPED,,"(0010,2210)",00102210
3,QUADRUPED,QUADRUPED,,"(0010,2210)",00102210
4,YES,YES,4188539.0,"(0028,0300)",00280300
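
To make the wide-to-long step concrete, here is a small self-contained sketch of the same `melt` and tag-cleaning pattern. The toy rows are illustrative only: in particular, the second tag attached to `YES` is made up to show a code with more than one tag.

```python
import pandas as pd

# Toy input: one row per code, with up to two tag columns (values are illustrative)
toy = pd.DataFrame({
    "code": ["R", "YES"],
    "description": ["Right", "YES"],
    "tag_1": ["(0020,0060)", "(0028,0300)"],
    "tag_2": [None, "(0028,0301)"],
})

# One output row per (code, tag) pair; codes without a second tag drop out
toy_long = toy.melt(
    id_vars=["code", "description"],
    value_vars=["tag_1", "tag_2"],
    var_name="tag_type",
    value_name="tags",
)
toy_long = toy_long.drop(columns="tag_type").dropna(subset=["tags"]).reset_index(drop=True)

# "(0020,0060)" becomes "00200060", matching the Tag_cleaned format of the attributes
toy_long["Tag"] = toy_long["tags"].str.replace(r"[(),]", "", regex=True)
print(toy_long)
```

Keeping `Tag` as a string matters here: converting it to a number would drop the leading zeros and break the later merge on `concept_code`.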


In [28]:
other_values_long = other_values_long.merge(cs_value_table_omop[['concept_id', 'concept_code']], how = 'left', right_on = 'concept_code', left_on= 'code')
other_values_long = other_values_long.drop(columns=['tags', 'concept_code'])
other_values_long = other_values_long.rename(columns={'concept_id': 'concept_id_2'}) #value sets' concept ID is concept_id_2 for Concept_relationship table
other_values_long.head()

Unnamed: 0,code,description,syn_concept_id,Tag,concept_id_2
0,R,Right,4080761.0,00200060,2128015398
1,L,Left,4300877.0,00200060,2128015399
2,BIPED,BIPED,,00102210,2128015400
3,QUADRUPED,QUADRUPED,,00102210,2128015401
4,YES,YES,4188539.0,00280300,2128015402


In [29]:
other_values_long = other_values_long.merge(attribute_table_omop[['concept_id', 'concept_code']], how='left', left_on = 'Tag', right_on = 'concept_code')
other_values_long = other_values_long.drop(columns=['concept_code'])
other_values_long = other_values_long.rename(columns={'concept_id': 'concept_id_1'}) #attributes' concept ID is concept_id_1 for Concept_relationship table
other_values_long

Unnamed: 0,code,description,syn_concept_id,Tag,concept_id_2,concept_id_1
0,R,Right,4080761.0,00200060,2128015398,2128001088
1,L,Left,4300877.0,00200060,2128015399,2128001088
2,BIPED,BIPED,,00102210,2128015400,2128000124
3,QUADRUPED,QUADRUPED,,00102210,2128015401,2128000124
4,YES,YES,4188539.0,00280300,2128015402,2128001367
5,NO,NO,4188540.0,00280300,2128015403,2128001367
6,BOTH,BOTH,45883500.0,00280300,2128015404,2128001367
7,IDENTITY,Identity,,20500020,2128015405,2128002254
8,INVERSE,Inverse,4114662.0,20500020,2128015406,2128002254
9,U,Unpaired,,00200062,2128015407,2128001089
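
Once `concept_id_1` and `concept_id_2` are paired as above, the pairs can be shaped into rows for the `CONCEPT_RELATIONSHIP` table. The sketch below is a hedged illustration: the function name and the `relationship_id` value `'Has DICOM value'` are hypothetical placeholders, and any real `relationship_id` would first have to exist in the OMOP `RELATIONSHIP` table. The validity dates reuse the values applied to the concepts in this notebook.

```python
import pandas as pd


def to_concept_relationship(pairs, relationship_id="Has DICOM value"):
    """Shape (concept_id_1, concept_id_2) pairs like CONCEPT_RELATIONSHIP rows.

    The default relationship_id is a hypothetical placeholder.
    """
    rel = pairs[["concept_id_1", "concept_id_2"]].drop_duplicates().copy()
    rel["relationship_id"] = relationship_id
    rel["valid_start_date"] = "1993-01-01"
    rel["valid_end_date"] = "2099-12-31"
    rel["invalid_reason"] = None
    return rel.reset_index(drop=True)


# Pairs copied from the first rows of the table above, with one repeated on purpose
demo_pairs = pd.DataFrame({
    "concept_id_1": [2128001088, 2128001088, 2128001088],
    "concept_id_2": [2128015398, 2128015399, 2128015398],
})
demo_relationships = to_concept_relationship(demo_pairs)
print(demo_relationships)
```

Dropping duplicates first matters because the same attribute and value pair can appear once per IOD or module in the Part 3 data.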


In [30]:
body_part.head()

Unnamed: 0,Coding Scheme Designator,Code Value,Code Meaning,Body Part Examined,SNOMED-RT ID (Retired),FMA Code Value,UMLS Concept UniqueID
0,SCT,818981001,Abdomen,ABDOMEN,,,
1,SCT,818982008,Abdomen and Pelvis,ABDOMENPELVIS,,,
2,SCT,7832008,Abdominal aorta,ABDOMINALAORTA,T-42500,,
3,SCT,85856004,Acromioclavicular joint,ACJOINT,T-15420,,
4,SCT,23451007,Adrenal gland,ADRENAL,T-B3000,,


In [31]:
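# Start after the last coded-value concept_id. As in the previous step, the range starts at 6000,
# so the body part IDs begin at max + 1 + 6000 (2128021411).
index = cs_value_table_omop['concept_id'].max() + 1
sequential_numbers = range(6000, len(body_part)+6000)
body_part.loc[:, 'concept_id'] = [index + num for num in sequential_numbers]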

columns = ['concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date','invalid_reason']
body_part_table_omop = pd.DataFrame(columns = columns)

body_part_table_omop['concept_id'] = body_part['concept_id']
body_part_table_omop['concept_name'] = body_part['Code Meaning']
body_part_table_omop['domain_id'] = 'Measurement'
body_part_table_omop['vocabulary_id'] = 'DICOM'
body_part_table_omop['concept_class_id'] = 'DICOM Value Sets'
body_part_table_omop['concept_code'] = body_part['Body Part Examined']
body_part_table_omop['valid_start_date'] = 19930101
body_part_table_omop['valid_end_date'] = 20991231

body_part_table_omop = body_part_table_omop.reset_index(drop=True)
body_part_table_omop

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128021411,Abdomen,Measurement,DICOM,DICOM Value Sets,,ABDOMEN,19930101,20991231,
1,2128021412,Abdomen and Pelvis,Measurement,DICOM,DICOM Value Sets,,ABDOMENPELVIS,19930101,20991231,
2,2128021413,Abdominal aorta,Measurement,DICOM,DICOM Value Sets,,ABDOMINALAORTA,19930101,20991231,
3,2128021414,Acromioclavicular joint,Measurement,DICOM,DICOM Value Sets,,ACJOINT,19930101,20991231,
4,2128021415,Adrenal gland,Measurement,DICOM,DICOM Value Sets,,ADRENAL,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
393,2128021804,Vertebral artery,Measurement,DICOM,DICOM Value Sets,,VERTEBRALA,19930101,20991231,
394,2128021805,Vertebral column and cranium,Measurement,DICOM,DICOM Value Sets,,,19930101,20991231,
395,2128021806,Vulva,Measurement,DICOM,DICOM Value Sets,,VULVA,19930101,20991231,
396,2128021807,Wrist joint,Measurement,DICOM,DICOM Value Sets,,WRIST,19930101,20991231,


### Combine DICOM concepts

In [32]:
omop_table_staging = pd.concat([attribute_table_omop, valuesets_table_omop, cs_value_table_omop, body_part_table_omop], ignore_index=True)
omop_table_staging['standard_concept'] = ''
omop_table_staging.to_csv('./files/OMOP CDM Staging/omop_table_staging_new.csv', index=False)
omop_table_staging #6792

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128000010,Length to End,Measurement,DICOM,DICOM Attributes,,00080001,19930101,20991231,
1,2128000011,Specific Character Set,Measurement,DICOM,DICOM Attributes,,00080005,19930101,20991231,
2,2128000012,Image Type,Measurement,DICOM,DICOM Attributes,,00080008,19930101,20991231,
3,2128000013,Instance Creation Date,Measurement,DICOM,DICOM Attributes,,00080012,19930101,20991231,
4,2128000014,Instance Creation Time,Measurement,DICOM,DICOM Attributes,,00080013,19930101,20991231,
...,...,...,...,...,...,...,...,...,...,...
6787,2128021804,Vertebral artery,Measurement,DICOM,DICOM Value Sets,,VERTEBRALA,19930101,20991231,
6788,2128021805,Vertebral column and cranium,Measurement,DICOM,DICOM Value Sets,,,19930101,20991231,
6789,2128021806,Vulva,Measurement,DICOM,DICOM Value Sets,,VULVA,19930101,20991231,
6790,2128021807,Wrist joint,Measurement,DICOM,DICOM Value Sets,,WRIST,19930101,20991231,
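
Before loading the staging table, a few sanity checks can catch problems such as the empty `concept_code` visible above for "Vertebral column and cranium" (`concept_code` is a required field in the OMOP `CONCEPT` table). The function below is a sketch only; `check_staging` and the choice of checks are assumptions, not part of the original workflow.

```python
import pandas as pd


def check_staging(staging):
    """Return counts of common problems in a CONCEPT-shaped staging table."""
    code = staging["concept_code"]
    missing_code = code.isna() | (code.astype(str).str.strip() == "")
    return {
        "duplicate_concept_ids": int(staging["concept_id"].duplicated().sum()),
        "missing_concept_codes": int(missing_code.sum()),
        "duplicate_codes_within_class": int(
            staging.duplicated(subset=["concept_class_id", "concept_code"]).sum()
        ),
    }


# Toy staging rows: one repeated concept_id and one missing concept_code
demo_staging = pd.DataFrame({
    "concept_id": [2128000010, 2128000010, 2128021805],
    "concept_class_id": ["DICOM Attributes", "DICOM Attributes", "DICOM Value Sets"],
    "concept_code": ["00080001", "00080005", None],
})
report = check_staging(demo_staging)
print(report)
```

Rows flagged by any of these counts can be reviewed or excluded before the insert runs.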


## 2. DML Scripts using Python & SQL

In [141]:
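import os
import psycopg2

# The password is read from an environment variable (PGPASSWORD is assumed here)
# rather than being stored in the notebook.
conn = psycopg2.connect(
    database="adni",
    user="dbadmin",
    password=os.environ["PGPASSWORD"],
    host="ohdsicdmdb.postgres.database.azure.com",
    port="5432",
    connect_timeout=6000  # seconds
)

cursor = conn.cursor()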

In [None]:
import pyodbc

# Alternative: connect to a SQL Server database with pyodbc
server = 'server_name'
database = 'database_name'
# This example uses a trusted connection (your Windows account). To log in with a user name and
# password instead, replace 'Trusted_Connection=yes;' with 'UID=<user_name>;PWD=<password>;Encrypt=no;'
conn = pyodbc.connect('Driver={SQL Server};Server=' + server + ';Database=' + database + ';Trusted_Connection=yes;')
cursor = conn.cursor()

# Update VOCABULARY
sql = '''
    INSERT INTO dbo.VOCABULARY (vocabulary_id, vocabulary_name, vocabulary_reference, vocabulary_version, vocabulary_concept_id)
    VALUES ('DICOM', 'Digital Imaging and Communications in Medicine (National Electrical Manufacturers Association)',  'https://www.dicomstandard.org/current', 'NEMA Standard PS3', 2128000000)
    '''
cursor.execute(sql)
conn.commit()

In [None]:
# Update CONCEPT_CLASS
sql = '''
    INSERT INTO dbo.CONCEPT_CLASS (concept_class_id, concept_class_name, concept_class_concept_id)
    VALUES ('DICOM Attributes', 'DICOM Attributes', 2128000001),
           ('DICOM Value Sets', 'DICOM Value Sets', 2128000002)
    '''
cursor.execute(sql)
conn.commit()

In [61]:
from datetime import datetime

# Load the staged DICOM attributes and value sets.
# Read concept_code as text so that tags such as 00080001 keep their leading zeros.
omop_table_staging = pd.read_csv('./files/OMOP CDM Staging/omop_table_staging_new.csv', dtype={'concept_code': str})

# '%s' is the psycopg2 placeholder style; pyodbc expects '?' instead.
sql = '''
    INSERT INTO dbo.concept (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id, concept_code, valid_start_date, valid_end_date) 
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s)
    '''

def to_date(value):
    # Convert an integer date (YYYYMMDD) to a date object; missing values become None
    return datetime.strptime(str(value), '%Y%m%d').date() if pd.notnull(value) else None

for index, row in omop_table_staging.iterrows():
    cursor.execute(sql, (
        row['concept_id'], 
        row['concept_name'], 
        row['domain_id'], 
        row['vocabulary_id'], 
        row['concept_class_id'], 
        row['concept_code'], 
        to_date(row['valid_start_date']), 
        to_date(row['valid_end_date'])
    ))

conn.commit()
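
For larger staging tables, the row-by-row loop above can be replaced by a single batched call. The sketch below shows the pattern with `executemany` inside a transaction. To stay self-contained it uses an in-memory SQLite database with a reduced `concept` table as a stand-in, which is an assumption for illustration only; with psycopg2 the placeholders would be `%s` rather than `?`.

```python
import sqlite3
from datetime import datetime

import pandas as pd


def to_iso_date(yyyymmdd):
    """Convert an integer date such as 19930101 to the string '1993-01-01'."""
    return datetime.strptime(str(yyyymmdd), "%Y%m%d").date().isoformat()


# Two rows copied from the staging table shown earlier
demo_rows = pd.DataFrame({
    "concept_id": [2128000010, 2128000011],
    "concept_name": ["Length to End", "Specific Character Set"],
    "concept_code": ["00080001", "00080005"],
    "valid_start_date": [19930101, 19930101],
    "valid_end_date": [20991231, 20991231],
})

# Convert to plain Python values so the database driver can adapt them
params = [
    (int(r.concept_id), r.concept_name, r.concept_code,
     to_iso_date(r.valid_start_date), to_iso_date(r.valid_end_date))
    for r in demo_rows.itertuples(index=False)
]

demo_conn = sqlite3.connect(":memory:")  # stand-in for the real OMOP database
demo_conn.execute(
    "CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT, "
    "concept_code TEXT, valid_start_date TEXT, valid_end_date TEXT)"
)
with demo_conn:  # commits on success, rolls back if any insert fails
    demo_conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?)", params)

inserted = demo_conn.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
first_start = demo_conn.execute(
    "SELECT valid_start_date FROM concept ORDER BY concept_id"
).fetchone()[0]
print(inserted, first_start)
```

Wrapping the batch in one transaction means a failure partway through leaves the table unchanged instead of partially loaded.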


In [62]:
# check the upload
sql = "select * from dbo.concept where vocabulary_id = 'DICOM'"
omop_dicom_test = pd.read_sql_query(sql, conn)
omop_dicom_test.head()



Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,2128006112,TG18-GA10 Pattern,Measurement,DICOM,DICOM Value Sets,,109873,1993-01-01,2099-12-31,
1,2128006113,TG18-GA25 Pattern,Measurement,DICOM,DICOM Value Sets,,109876,1993-01-01,2099-12-31,
2,2128006114,TG18-GA20 Pattern,Measurement,DICOM,DICOM Value Sets,,109875,1993-01-01,2099-12-31,
3,2128006115,TG18-GA03 Pattern,Measurement,DICOM,DICOM Value Sets,,109870,1993-01-01,2099-12-31,
4,2128006116,TG18-GA08 Pattern,Measurement,DICOM,DICOM Value Sets,,109872,1993-01-01,2099-12-31,


In [67]:
omop_dicom_test.groupby('concept_class_id')['concept_id'].agg(['count', 'min', 'max']) #2128000010-2128002992, 2128006000-2128021808

Unnamed: 0_level_0,count,min,max
concept_class_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DICOM Attributes,2983,2128000010,2128002992
DICOM Value Sets,3809,2128006000,2128021808


In [142]:
sql_query = "SELECT * FROM dbo.concept"
concept_df = pd.read_sql_query(sql_query, conn)
concept_df.head()



Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,41886036,Kyogle,Geography,OSM,10th level,S,6069960,1970-01-01,2099-12-31,
1,41886037,Kyvalley,Geography,OSM,10th level,S,2504902,1970-01-01,2099-12-31,
2,41886038,Laang,Geography,OSM,10th level,S,3155973,1970-01-01,2099-12-31,
3,41886039,Lake Bolac,Geography,OSM,10th level,S,3153764,1970-01-01,2099-12-31,
4,41886040,Lake Brewster,Geography,OSM,10th level,S,5816798,1970-01-01,2099-12-31,


In [144]:
# close the cursor and connection
cursor.close()
conn.close()

## 3. Set Concept_relationship from Part 3

### Attribute & Value Sets with CIDs: Attributes = concept_id_1, Value Sets = concept_id_2
These relationships include mappings to standard coding systems such as SNOMED and LOINC.

In [75]:
part3_cid = part3[part3['CID']!=''].merge(attributes_included[['Tag_cleaned', 'concept_id']], how = 'inner', left_on = 'Tag', right_on = 'Tag_cleaned')
part3_cid['cid'] = pd.to_numeric(part3_cid['CID'], errors='coerce').astype('Int64')
part3_cid = part3_cid.rename(columns={'concept_id':'concept_id_1'})
part3_cid.head()

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,Attribute Name,Tag,Type,Attribute Description,CID,SOP Class UID,Tag_cleaned,concept_id_1,cid
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Ethnic Group Code Sequence,00102161,3,{},6099,1.2.840.10008.5.1.4.1.1.1,00102161,2128000118,6099
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient Species Code Sequence,00102202,1C,{},7454,1.2.840.10008.5.1.4.1.1.1,00102202,2128000122,7454
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient Breed Code Sequence,00102293,2C,{},7480,1.2.840.10008.5.1.4.1.1.1,00102293,2128000125,7480
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,De-identification Method Code Sequence,00120064,1C,{},7050,1.2.840.10008.5.1.4.1.1.1,00120064,2128000132,7050
4,table_A.2-1,Computed Radiography Image IOD Modules,Study,General Study,sect_C.7.2.1,M,M,sect_C.7.2.1,Requesting Service Code Sequence,00321034,3,{},7030,1.2.840.10008.5.1.4.1.1.1,00321034,2128001481,7030


In [78]:
part3_cid_val = part3_cid[['iod', 'Module', 'Attribute Name', 'Tag', 'cid', 'concept_id_1']].merge(valuesets[['code', 'cid', 'system']], how = 'left', on = 'cid')
part3_cid_val = part3_cid_val.rename(columns={'code': 'concept_code'})
part3_cid_val.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,C41219,http://ncit.nci.nih.gov
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413581001,http://snomed.info/sct
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413773004,http://snomed.info/sct
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413600007,http://snomed.info/sct


In [80]:
part3_cid_val.shape

(609019, 8)

In [125]:
part3_cid_val['system'].value_counts()

system
http://snomed.info/sct                                                 590255
http://dicom.nema.org/resources/ontology/DCM                            12195
http://sig.biostr.washington.edu/projects/fm/AboutFM.html                2928
http://www.nlm.nih.gov/research/umls                                     1170
http://braininfo.rprc.washington.edu/aboutBrainInfo.aspx#NeuroNames       748
doi:10.1016/S0735-1097(99)00126-6                                         544
http://www.itis.gov                                                       513
http://ncit.nci.nih.gov                                                   303
http://www.radlex.org                                                      80
http://unitsofmeasure.org                                                  61
http://www.nlm.nih.gov/research/umls/rxnorm                                40
http://loinc.org                                                           10
Name: count, dtype: int64

In [127]:
mapping = {
    'http://snomed.info/sct': 'SNOMED',
    'http://dicom.nema.org/resources/ontology/DCM': 'DICOM',
    'http://unitsofmeasure.org': 'UCUM',
    'http://loinc.org': 'LOINC',
}

part3_cid_val['vocabulary_id'] = part3_cid_val['system'].map(mapping)
part3_cid_val.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,C41219,http://ncit.nci.nih.gov,
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413581001,http://snomed.info/sct,SNOMED
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413773004,http://snomed.info/sct,SNOMED
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413600007,http://snomed.info/sct,SNOMED


In [143]:
part3_cid_val_concept = part3_cid_val.merge(concept_df[['concept_id', 'concept_name', 'concept_code', 'vocabulary_id']], how = 'left', on = ['concept_code', 'vocabulary_id'])
part3_cid_val_concept = part3_cid_val_concept.rename(columns={'concept_id': 'concept_id_2'})

In [129]:
part3_cid_val_concept.shape

(609019, 11)

In [130]:
part3_cid_val_concept.head()

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id,concept_id_2,concept_name
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,C41219,http://ncit.nci.nih.gov,,,
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413581001,http://snomed.info/sct,SNOMED,,
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413773004,http://snomed.info/sct,SNOMED,,
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413600007,http://snomed.info/sct,SNOMED,,


In [131]:
part3_val_agg = part3_cid_val_concept.groupby(['concept_id_1', 'concept_code'])['concept_id_2'].nunique().reset_index()
print('No concept_id:', part3_val_agg[part3_val_agg['concept_id_2']==0].shape)
print('Exactly one concept_id:', part3_val_agg[part3_val_agg['concept_id_2']==1].shape)
print('Multiple concept_ids:', part3_val_agg[part3_val_agg['concept_id_2']>1].shape)

No concept_id: (737, 3)
Exactly one concept_id: (6583, 3)
Multiple concept_ids: (0, 3)
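
The "No concept_id" group can have two different causes: the code's system has no entry in the `mapping` dictionary, or the vocabulary is mapped but the code is absent from the `CONCEPT` table that was loaded. A minimal sketch of how to separate the two with the `indicator` option of `merge` is given here; the toy frames and the label `'(unmapped system)'` are illustrative assumptions.

```python
import pandas as pd

# Toy codes: one DICOM code, one SNOMED code, one code from an unmapped system
demo_codes = pd.DataFrame({
    "concept_code": ["110504", "413490006", "C41219"],
    "vocabulary_id": ["DICOM", "SNOMED", None],
})

# Toy CONCEPT extract that only contains the DICOM code
demo_concept = pd.DataFrame({
    "concept_id": [2128006000],
    "concept_code": ["110504"],
    "vocabulary_id": ["DICOM"],
})

merged = demo_codes.merge(
    demo_concept, how="left", on=["concept_code", "vocabulary_id"], indicator=True
)

# Rows marked 'left_only' found no matching concept
unmatched = merged[merged["_merge"] == "left_only"]
unmatched_by_vocabulary = (
    unmatched["vocabulary_id"].fillna("(unmapped system)").value_counts()
)
print(unmatched_by_vocabulary)
```

A count under a real vocabulary name points to concepts missing from the database, while the unmapped bucket points to coding systems that still need an OMOP vocabulary assignment.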


In [135]:
no_concept_ids = part3_val_agg[part3_val_agg['concept_id_2']==0]['concept_code'].unique()
part3_cid_val_concept[part3_cid_val_concept['concept_code'].isin(no_concept_ids)]

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id,concept_id_2,concept_name
0,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,C41219,http://ncit.nci.nih.gov,,,
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
2,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413581001,http://snomed.info/sct,SNOMED,,
3,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413773004,http://snomed.info/sct,SNOMED,,
4,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413600007,http://snomed.info/sct,SNOMED,,
...,...,...,...,...,...,...,...,...,...,...,...
608913,Confocal Microscopy Tiled Pyramidal Image IOD ...,Specimen,Specimen Description Sequence,00400560,8134,2128001563,116010006,http://snomed.info/sct,SNOMED,,
608940,Confocal Microscopy Tiled Pyramidal Image IOD ...,Specimen,Specimen Description Sequence,00400560,8134,2128001563,1231522001,http://snomed.info/sct,SNOMED,,
609004,Confocal Microscopy Tiled Pyramidal Image IOD ...,Specimen,Specimen Description Sequence,00400560,8134,2128001563,244251006,http://snomed.info/sct,SNOMED,,
609017,Confocal Microscopy Tiled Pyramidal Image IOD ...,Specimen,Specimen Description Sequence,00400560,8134,2128001563,C0221297,http://www.nlm.nih.gov/research/umls,,,


In [138]:
part3_cid_val_concept[part3_cid_val_concept['concept_code']=="413490006"]

Unnamed: 0,iod,Module,Attribute Name,Tag,cid,concept_id_1,concept_code,system,vocabulary_id,concept_id_2,concept_name
1,Computed Radiography Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
4447,CT Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
8910,MR Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
13356,Nuclear Medicine Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
17757,Ultrasound Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
...,...,...,...,...,...,...,...,...,...,...,...
588520,RT Patient Position Acquisition Instruction IO...,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
591498,Microscopy Bulk Simple Annotations IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
595892,Photoacoustic Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,
600336,Confocal Microscopy Image IOD Modules,Patient,Ethnic Group Code Sequence,00102161,6099,2128000118,413490006,http://snomed.info/sct,SNOMED,,


In [140]:
concept_df[concept_df['concept_code']=="413490006"]

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason


In [106]:
omop_table_staging[omop_table_staging['concept_code']=="00180082"]

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
437,2128000447,Inversion Time,Measurement,DICOM,DICOM Attributes,,00180082,19930101,20991231,


In [105]:
concept_df[concept_df['concept_name']=="Inversion Time"]

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason


In [107]:
concept_df[concept_df['concept_id']==2128000447]

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
