# Volume Dataset Analysis

In this exercise you will analyze a DICOM dataset. It is not as conveniently organized on the filesystem as some of the datasets we have just seen in the lesson. Rather, it looks like something you are likely to get as a raw dump from a clinical data archive.

Your task is to use the skills you have acquired in this lesson to go through this dataset and answer (or give your best guess at) the following questions:
1. What imaging modality was used to produce this dataset?
1. How many patients are represented in the dataset?
1. How many studies are in the dataset?
1. What are the oldest and the most recent studies in the dataset?
1. How many series are in the dataset?

In addition, do the following tasks:

1. List the voxel dimensions of all 3D volumes (i.e., series) in the dataset as WxHxD.
1. The dataset contains two outliers. Can you find them? Explain why you think these volumes are outliers.

<TYPE YOUR ANSWERS HERE>

In [6]:
import pydicom
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy.ma as ma
import numpy as np
import os

In [8]:
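# Walk the whole directory tree, since the files are not organized by patient or study.
instances = []
for path, dirs, files in os.walk('data'):
    for f in files:
        # Read metadata only; skipping pixel data keeps loading fast and memory use low.
        dcm = pydicom.dcmread(os.path.join(path, f), stop_before_pixels=True)
        instances.append(dcm)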

In [9]:
len(instances)

864

In [11]:
instances[0]

(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'OTHER']
(0008, 0016) SOP Class UID                       UI: MR Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.4429.7055.229098847496270476126665028704
(0008, 0020) Study Date                          DA: '19940112'
(0008, 0021) Series Date                         DA: '19940112'
(0008, 0023) Content Date                        DA: '19940112'
(0008, 0030) Study Time                          TM: '085518'
(0008, 0031) Series Time                         TM: '090803'
(0008, 0033) Content Time                        TM: '090803'
(0008, 0050) Accession Number                    SH: '4468773825686010'
(0008, 0060) Modality                            CS: 'MR'
(0008, 0070) Manufacturer                        LO: 'Imaging Biometrics LLC'
(0008, 0090) Referring Physician's Name          PN: ''
(0008, 1030) Study

### Modality

In [13]:
modalities = [dcm.Modality for dcm in instances]
np.unique(modalities).tolist()

['CT', 'MR']
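
Two modalities in one dataset is a useful lead when looking for outliers, so it can help to count instances per patient and modality rather than only listing the unique values. The following is a minimal sketch, not part of the original solution: `count_by` is a hypothetical helper, and the `demo` records are made-up stand-ins so the snippet runs on its own. In the notebook you would pass `instances` instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_by(records, *attrs):
    """Count records by the tuple of the named attributes (hypothetical helper)."""
    return Counter(tuple(getattr(r, a) for a in attrs) for r in records)


# Made-up stand-in records; in the notebook, use `instances` instead.
demo = [
    SimpleNamespace(PatientID="A", Modality="MR"),
    SimpleNamespace(PatientID="A", Modality="MR"),
    SimpleNamespace(PatientID="B", Modality="CT"),
]

counts = count_by(demo, "PatientID", "Modality")
for key, n in sorted(counts.items()):
    print(key, n)
```

A patient whose modality differs from the rest of the dataset is a natural candidate for closer inspection.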

### Patients

In [18]:
patientIDs = [dcm.PatientID for dcm in instances]
patientIDs = np.unique(patientIDs).tolist()
patientIDs

['123456', 'OPA135179', 'PGBM-003', 'PGBM-004', 'PGBM-005', 'PGBM-009']

In [19]:
len(patientIDs)

6

### Studies

In [16]:
studyIDs = [dcm.StudyInstanceUID for dcm in instances]
studyIDs = np.unique(studyIDs).tolist()
studyIDs

['1.2.752.24.7.1550985044.2753616',
 '1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.142872696423641709332868254917',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.173425268180092806279003900097',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.192304828189026101657054875952',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.205075845250444634584024148845',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.231796058972489013039499925050',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.873522078320106668263470477052']

In [17]:
len(studyIDs)

10

In [27]:
studiesPatients = [(dcm.PatientID, dcm.StudyInstanceUID) for dcm in instances]
studiesPatients = set(studiesPatients)
studiesPatients

{('123456', '1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396'),
 ('OPA135179', '1.2.752.24.7.1550985044.2753616'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.173425268180092806279003900097'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.873522078320106668263470477052'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.192304828189026101657054875952'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.205075845250444634584024148845'),
 ('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.142872696423641709332868254917'),
 ('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.231796058972489013039499925050')}

### Study Dates

In [29]:
studyDates = [(dcm.PatientID, dcm.StudyInstanceUID, dcm.StudyDate) for dcm in instances]
studyDates = set(studyDates)
studyDates

{('123456',
  '1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396',
  '20190101'),
 ('OPA135179', '1.2.752.24.7.1550985044.2753616', '20150116'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198',
  '19951017'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184',
  '19950329'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.173425268180092806279003900097',
  '19930622'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.873522078320106668263470477052',
  '19940112'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.192304828189026101657054875952',
  '19920505'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.205075845250444634584024148845',
  '19910702'),
 ('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.142872696423641709332868254917',
  '19880512'),
 ('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.231796058972489013039499925050',
  '19910103')}

In [30]:
studyDates = sorted(studyDates, key=lambda x: x[2])
studyDates

[('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.142872696423641709332868254917',
  '19880512'),
 ('PGBM-009',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.231796058972489013039499925050',
  '19910103'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.205075845250444634584024148845',
  '19910702'),
 ('PGBM-005',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.192304828189026101657054875952',
  '19920505'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.173425268180092806279003900097',
  '19930622'),
 ('PGBM-004',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.873522078320106668263470477052',
  '19940112'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184',
  '19950329'),
 ('PGBM-003',
  '1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198',
  '19951017'),
 ('OPA135179', '1.2.752.24.7.1550985044.2753616', '20150116'),
 ('123456',
  '1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396',
  '20190101')]
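
Sorting the raw strings works here because DICOM dates (VR `DA`) are fixed-width `YYYYMMDD`, so lexicographic order matches chronological order. If you also want date arithmetic, the strings can be parsed first. The sketch below uses the study dates from the output above; `parse_da` is a hypothetical helper name.

```python
from datetime import datetime


def parse_da(value):
    """Parse a DICOM DA string (YYYYMMDD) into a datetime.date (hypothetical helper)."""
    return datetime.strptime(str(value), "%Y%m%d").date()


# Study dates copied from the notebook output above.
dates = ["20190101", "20150116", "19951017", "19950329", "19930622",
         "19940112", "19920505", "19910702", "19880512", "19910103"]

parsed = [parse_da(d) for d in dates]
oldest, newest = min(parsed), max(parsed)
print(oldest.isoformat(), newest.isoformat())  # 1988-05-12 2019-01-01

# Gap between the oldest and the most recent study, in days.
span_days = (newest - oldest).days
```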

### Series

In [32]:
series = [dcm.SeriesInstanceUID for dcm in instances]
series = set(series)
series

{'1.2.826.0.1.3680043.2.1125.1.45859137663006505718300393375464286',
 '1.3.12.2.1107.5.2.33.37105.2015011616025092819028166.0.0.0',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.105521800202421035670670758706',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.108708982094690934070899838243',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.111299569371716382165219422799',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.131971402732874033229609248302',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.137802635701410656176169562528',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.148342356080268980546237840587',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.149455479236394071679725178532',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.158869091666854803918782490935',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.165843183220097757648432257390',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.167613564536106399232524912048',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.170099978014836890431312652906',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.201921745402287812448748810010',
 '1.3.6.1.4.1.14519.5.2.1.4429.7055.2111

In [33]:
len(series)

32
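
Once the series are identified, one possible way to approach the voxel-dimension task is to group instances by `SeriesInstanceUID`, take W and H from the `Columns` and `Rows` attributes, and use the number of instances as D. This is a sketch under two assumptions that you should verify for your data: each instance holds exactly one slice (no multi-frame files), and all slices in a series share the same in-plane size. `volume_dimensions` is a hypothetical helper, and the `demo` records are made-up stand-ins; in the notebook you would pass `instances`.

```python
from collections import defaultdict
from types import SimpleNamespace


def volume_dimensions(records):
    """Return {SeriesInstanceUID: (W, H, D)}, or None for series with mixed slice sizes."""
    sizes = defaultdict(set)  # series UID -> set of (Columns, Rows) seen
    depth = defaultdict(int)  # series UID -> number of instances (assumed one slice each)
    for r in records:
        sizes[r.SeriesInstanceUID].add((int(r.Columns), int(r.Rows)))
        depth[r.SeriesInstanceUID] += 1

    dims = {}
    for uid, in_plane in sizes.items():
        if len(in_plane) == 1:
            w, h = next(iter(in_plane))
            dims[uid] = (w, h, depth[uid])
        else:
            dims[uid] = None  # inconsistent slice sizes; inspect this series manually
    return dims


# Made-up stand-in records; in the notebook, use `instances` instead.
demo = [SimpleNamespace(SeriesInstanceUID="s1", Columns=256, Rows=256) for _ in range(3)]

for uid, d in volume_dimensions(demo).items():
    print(uid, "x".join(map(str, d)) if d else "mixed slice sizes")
```

Series whose dimensions stand apart from the rest are worth a second look when searching for outliers.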

### Instances per Series per Study per Patient

In [34]:
patientStudySeries = [(dcm.PatientID, dcm.StudyInstanceUID, dcm.SeriesInstanceUID, dcm.InstanceNumber) for dcm in instances]
patientStudySeries = set(patientStudySeries)
patients = {}

for t in patientStudySeries:
    if t[0] not in patients:
        patients[t[0]] = {}
    
    if t[1] not in patients[t[0]]:
        patients[t[0]][t[1]] = {}
        
    if t[2] not in patients[t[0]][t[1]]:
        patients[t[0]][t[1]][t[2]] = {}
    
    if t[3] not in patients[t[0]][t[1]][t[2]]:
        patients[t[0]][t[1]][t[2]][t[3]] = {}
        
    patients[t[0]][t[1]][t[2]][t[3]]['test'] = 1
    

In [35]:
len(patients)

6
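
The nested `if ... not in ...` checks above can also be expressed with `collections.defaultdict`, which creates missing levels on first access. The following is an alternative sketch, not the original solution: it stores a set of instance numbers at the leaf instead of a placeholder dictionary, which gives the same counts with `len`. `build_hierarchy` and the `demo` records are hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_hierarchy(records):
    """Group records as patient -> study -> series -> set of instance numbers."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for r in records:
        tree[r.PatientID][r.StudyInstanceUID][r.SeriesInstanceUID].add(r.InstanceNumber)
    return tree


# Made-up stand-in records; in the notebook, use `instances` instead.
demo = [
    SimpleNamespace(PatientID="P1", StudyInstanceUID="st1", SeriesInstanceUID="se1", InstanceNumber=1),
    SimpleNamespace(PatientID="P1", StudyInstanceUID="st1", SeriesInstanceUID="se1", InstanceNumber=2),
    SimpleNamespace(PatientID="P1", StudyInstanceUID="st2", SeriesInstanceUID="se2", InstanceNumber=1),
]

tree = build_hierarchy(demo)
for patient, studies in sorted(tree.items()):
    print(f"{patient}: studies: {len(studies)}")
    for study, series_map in sorted(studies.items()):
        for series_uid, numbers in sorted(series_map.items()):
            print(f"\t{study} / {series_uid}: instances: {len(numbers)}")
```

Note that, as in the original cell, instances sharing an `InstanceNumber` within a series are counted once.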

In [37]:
sorted(patients.keys())

['123456', 'OPA135179', 'PGBM-003', 'PGBM-004', 'PGBM-005', 'PGBM-009']

In [41]:
for p in sorted(patients.keys()):
    print(f'{p}:\t studies: {len(patients[p])}')

123456:	 studies: 1
OPA135179:	 studies: 1
PGBM-003:	 studies: 2
PGBM-004:	 studies: 2
PGBM-005:	 studies: 2
PGBM-009:	 studies: 2


In [45]:
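for p in sorted(patients.keys()):
    print(f'{p}:\t studies: {len(patients[p])}')
    for study in sorted(patients[p].keys()):
        print(f'\t {study}:\t series: {len(patients[p][study])}')
        # Note: the loop variable below rebinds the name `series` defined in an earlier cell.
        for series in sorted(patients[p][study].keys()):
            # Note: this label repeats the study UID; use {series} to show each SeriesInstanceUID.
            print(f'\t\t {study}:\t instances: {len(patients[p][study][series])}')
    print('')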


123456:	 studies: 1
	 1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396:	 series: 1
		 1.2.826.0.1.3680043.2.1125.1.9839520250291993940949520616565396:	 instances: 139

OPA135179:	 studies: 1
	 1.2.752.24.7.1550985044.2753616:	 series: 1
		 1.2.752.24.7.1550985044.2753616:	 instances: 36

PGBM-003:	 studies: 2
	 1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198:	 series: 4
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198:	 instances: 24
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198:	 instances: 24
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198:	 instances: 24
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.729192279796218307950472057198:	 instances: 24
	 1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184:	 series: 4
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184:	 instances: 22
		 1.3.6.1.4.1.14519.5.2.1.4429.7055.812059520220877654137991744184:	 instances: 22
		 1.3.6.1.4