In [31]:
!pip install pydicom
import pandas as pd
import numpy as np
from pathlib import Path
import random
import pickle
import json
from pydicom import dcmread
from pydicom.data import get_testdata_file



In [2]:
train = pd.read_csv('data/train.csv')
train_localizers = pd.read_csv('data/train_localizers.csv')

The provided data consists of the following files and folders:

1. [data/train.csv](data/train.csv)

A table consisting of 4,348 rows. Each row corresponds to a single patient. The most important columns are `SeriesInstanceUID`, which names the folder holding the images for that patient (`series/SeriesInstanceUID`), and `Aneurysm Present`, which is 1 if an aneurysm is present and 0 otherwise. There are also columns for the patient's age, sex, and imaging modality (one of `MRA`, `CTA`, `MRI T1post`, or `MRI T2`). The remaining 13 columns indicate the presence or absence of an aneurysm in each of 13 specific regions of the brain.

2. [data/train_localizers.csv](data/train_localizers.csv)

This table provides localization data for each aneurysm in the training set. That is, for each image with a visible aneurysm, it indicates where in the image and where in the brain that aneurysm is located. The table has 2,254 rows, one per annotated aneurysm location. These rows cover only 1,863 distinct `SeriesInstanceUID` values, because a patient can have more than one annotated aneurysm; the remaining 2,485 of the 4,348 patients have no rows in this table. The table has 4 columns: `SeriesInstanceUID`, which identifies the patient and matches the value in `train.csv`; `SOPInstanceUID`, which identifies the specific image within the directory for that patient; `coordinates`, which gives the position of the aneurysm within that image; and `location`, which names the region of the brain in which the aneurysm is located, using the same 13 locations as in `train.csv`.

Right now, I am not sure how to use the location and coordinate data. For the evaluation criteria for the competition, most of the score just comes down to identifying whether an aneurysm is present; identifying exactly where the aneurysm(s) are located is just a small bonus. Perhaps later on, I will consider in more detail how to incorporate the location and coordinate data into my model.

3. series (directory) *Note: not present in GitHub repository due to size.*

Contains the images. Each subdirectory corresponds to 1 row in `train.csv`. Each of these subdirectories contains many images; even for patients who have one or more aneurysms, the aneurysm will only be visible in a small number of the images.

4. segmentations (directory) *Note: not present in GitHub repository due to size.*

This is a supplementary folder that contains NIfTI files, which appear to store a 3D composite of the medical images, for some but not all of the patients. Right now I do not plan to use these, since they are only available for some of the patients, but I will return to them later to investigate how they could be used.

In [3]:
train.head()

Unnamed: 0,SeriesInstanceUID,PatientAge,PatientSex,Modality,Left Infraclinoid Internal Carotid Artery,Right Infraclinoid Internal Carotid Artery,Left Supraclinoid Internal Carotid Artery,Right Supraclinoid Internal Carotid Artery,Left Middle Cerebral Artery,Right Middle Cerebral Artery,Anterior Communicating Artery,Left Anterior Cerebral Artery,Right Anterior Cerebral Artery,Left Posterior Communicating Artery,Right Posterior Communicating Artery,Basilar Tip,Other Posterior Circulation,Aneurysm Present
0,1.2.826.0.1.3680043.8.498.10004044428023505108...,64,Female,MRA,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.2.826.0.1.3680043.8.498.10004684224894397679...,76,Female,MRA,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.2.826.0.1.3680043.8.498.10005158603912009425...,58,Male,CTA,0,0,0,0,0,0,0,0,0,0,0,0,1,1
3,1.2.826.0.1.3680043.8.498.10009383108068795488...,71,Male,MRA,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.2.826.0.1.3680043.8.498.10012790035410518400...,48,Female,MRA,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
train_localizers.head()

Unnamed: 0,SeriesInstanceUID,SOPInstanceUID,coordinates,location
0,1.2.826.0.1.3680043.8.498.10005158603912009425...,1.2.826.0.1.3680043.8.498.10775329348174902199...,"{'x': 258.3621186176837, 'y': 261.359900373599}",Other Posterior Circulation
1,1.2.826.0.1.3680043.8.498.10022796280698534221...,1.2.826.0.1.3680043.8.498.53868409774237283281...,"{'x': 194.87253141831238, 'y': 178.32675044883...",Right Middle Cerebral Artery
2,1.2.826.0.1.3680043.8.498.10023411164590664678...,1.2.826.0.1.3680043.8.498.24186535344744886473...,"{'x': 189.23979878597123, 'y': 209.19184886465...",Right Middle Cerebral Artery
3,1.2.826.0.1.3680043.8.498.10030095840917973694...,1.2.826.0.1.3680043.8.498.75217084841854214544...,"{'x': 208.2805049088359, 'y': 229.78962131837307}",Right Infraclinoid Internal Carotid Artery
4,1.2.826.0.1.3680043.8.498.10034081836061566510...,1.2.826.0.1.3680043.8.498.71237104731452368587...,"{'x': 249.86745590416498, 'y': 220.623044646393}",Anterior Communicating Artery
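
The `coordinates` values shown above are stored as strings that look like Python dict literals (single quotes, so they are not valid JSON). Here is a minimal sketch of one way to turn them into numbers, assuming every value has the form `{'x': ..., 'y': ...}` as in the rows above. The helper name `parse_coordinates` is hypothetical and not part of the original notebook.

```python
import ast

def parse_coordinates(raw):
    """Parse a string such as "{'x': 258.36, 'y': 261.36}" into an (x, y) tuple of floats."""
    # literal_eval only evaluates Python literals, so it is safer than eval()
    point = ast.literal_eval(raw)
    return float(point['x']), float(point['y'])

# Value copied from row 0 of train_localizers.head()
x, y = parse_coordinates("{'x': 258.3621186176837, 'y': 261.359900373599}")
print(x, y)
```

On the full table this could be applied with `train_localizers['coordinates'].apply(parse_coordinates)`.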


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4348 entries, 0 to 4347
Data columns (total 18 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   SeriesInstanceUID                           4348 non-null   object
 1   PatientAge                                  4348 non-null   int64 
 2   PatientSex                                  4348 non-null   object
 3   Modality                                    4348 non-null   object
 4   Left Infraclinoid Internal Carotid Artery   4348 non-null   int64 
 5   Right Infraclinoid Internal Carotid Artery  4348 non-null   int64 
 6   Left Supraclinoid Internal Carotid Artery   4348 non-null   int64 
 7   Right Supraclinoid Internal Carotid Artery  4348 non-null   int64 
 8   Left Middle Cerebral Artery                 4348 non-null   int64 
 9   Right Middle Cerebral Artery                4348 non-null   int64 
 10  Anterior Communicating A

In [6]:
train_localizers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2254 entries, 0 to 2253
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   SeriesInstanceUID  2254 non-null   object
 1   SOPInstanceUID     2254 non-null   object
 2   coordinates        2254 non-null   object
 3   location           2254 non-null   object
dtypes: object(4)
memory usage: 70.6+ KB


In [7]:
train_localizers.SOPInstanceUID.value_counts()

SOPInstanceUID
1.2.826.0.1.3680043.8.498.31624213251003577891438927087161881149    3
1.2.826.0.1.3680043.8.498.81825809496237506946898886010593439013    3
1.2.826.0.1.3680043.8.498.74508280112609225659548860717187732465    2
1.2.826.0.1.3680043.8.498.12451889826654706934051560968094172570    2
1.2.826.0.1.3680043.8.498.10330800208132441897498011905367111919    2
                                                                   ..
1.2.826.0.1.3680043.8.498.44185937258859556516668591199094598570    1
1.2.826.0.1.3680043.8.498.35987680962070754560160756829496799559    1
1.2.826.0.1.3680043.8.498.11937074874094187336286047915868824604    1
1.2.826.0.1.3680043.8.498.10433249525306561584387515837440191252    1
1.2.826.0.1.3680043.8.498.53868409774237283281776807176852774246    1
Name: count, Length: 2214, dtype: int64
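
The counts above show 2,214 distinct images across 2,254 rows, so some images carry more than one annotation. Here is a small sketch of the same tally using only the standard library. The IDs are shortened placeholders, not real UIDs.

```python
from collections import Counter

# Placeholder image IDs standing in for the SOPInstanceUID column
sop_ids = ['img_a', 'img_b', 'img_a', 'img_c', 'img_a', 'img_b']

counts = Counter(sop_ids)
# Keep only images that appear in more than one localizer row
repeated = {image_id: n for image_id, n in counts.items() if n > 1}
print(repeated)
```

This matters for the selection step later on: an image that appears in several rows should still be picked only once.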

In [8]:
def ls_value_counts(df):
    for column in df.columns:
        print(df[column].value_counts())
        print('\n---\n')

In [9]:
ls_value_counts(train.drop(columns=['SeriesInstanceUID', 'PatientAge']))

PatientSex
Female    3005
Male      1343
Name: count, dtype: int64

---

Modality
CTA           1808
MRA           1252
MRI T2         983
MRI T1post     305
Name: count, dtype: int64

---

Left Infraclinoid Internal Carotid Artery
0    4270
1      78
Name: count, dtype: int64

---

Right Infraclinoid Internal Carotid Artery
0    4250
1      98
Name: count, dtype: int64

---

Left Supraclinoid Internal Carotid Artery
0    4018
1     330
Name: count, dtype: int64

---

Right Supraclinoid Internal Carotid Artery
0    4070
1     278
Name: count, dtype: int64

---

Left Middle Cerebral Artery
0    4129
1     219
Name: count, dtype: int64

---

Right Middle Cerebral Artery
0    4054
1     294
Name: count, dtype: int64

---

Anterior Communicating Artery
0    3985
1     363
Name: count, dtype: int64

---

Left Anterior Cerebral Artery
0    4302
1      46
Name: count, dtype: int64

---

Right Anterior Cerebral Artery
0    4292
1      56
Name: count, dtype: int64

---

Left Posterior Communica

Plan:

Create an algorithm that selects 5 images at random from each folder. For patients who do have an aneurysm, however, those 5 images will include every image in which an aneurysm was annotated for that patient. The number of annotated rows per patient is shown here:

In [10]:
train_localizers.SeriesInstanceUID.value_counts()

SeriesInstanceUID
1.2.826.0.1.3680043.8.498.31629979420404800139928339434297456334    5
1.2.826.0.1.3680043.8.498.11527986509512933171256788651291467752    5
1.2.826.0.1.3680043.8.498.11292203154407642658894712229998766945    5
1.2.826.0.1.3680043.8.498.99028068919105186302294079606577228686    5
1.2.826.0.1.3680043.8.498.76928456732082261565048056589908832861    5
                                                                   ..
1.2.826.0.1.3680043.8.498.97970165518053195797247488050816887286    1
1.2.826.0.1.3680043.8.498.97975645720920888704056258456447231054    1
1.2.826.0.1.3680043.8.498.98066774276620948484052227331467077834    1
1.2.826.0.1.3680043.8.498.11079102674589284483149404820469555321    1
1.2.826.0.1.3680043.8.498.85592547875146602878105706110456654773    1
Name: count, Length: 1863, dtype: int64

Each patient has at most 5 rows in this table, and therefore at most 5 images with an annotated aneurysm.
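
The plan can be sketched as a pure function that works on plain lists, which makes the selection logic easy to check without the image folder. This is an illustration under assumptions: the name `select_images`, its parameters, and the toy IDs are hypothetical, and the random fill is drawn without replacement so that no image is chosen twice.

```python
import random

def select_images(all_image_ids, flagged_image_ids, k=5, rng=random):
    """Return every flagged image, then a random fill up to k images in total."""
    available = set(all_image_ids)
    # dict.fromkeys removes repeated flagged IDs while keeping their order
    flagged = [i for i in dict.fromkeys(flagged_image_ids) if i in available]
    flagged_set = set(flagged)
    remaining = [i for i in all_image_ids if i not in flagged_set]
    n_fill = max(0, min(k - len(flagged), len(remaining)))
    # sample() draws without replacement, so the fill contains no duplicates
    return flagged + rng.sample(remaining, n_fill)

rng = random.Random(42)  # local generator with a fixed seed for reproducibility
images = [f'img_{n}' for n in range(20)]
chosen = select_images(images, ['img_3', 'img_7', 'img_3'], rng=rng)
print(chosen)
```

Flagged images always come first in the result, and a folder with fewer than `k` images simply returns everything it has.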

In [27]:
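random.seed(42)  # fixed seed so the selection is reproducible
training_images_dict = {}
for series_id in train['SeriesInstanceUID']:
    series_directory = Path(f'series/{series_id}')
    # File names are '<SOPInstanceUID>.dcm', so the stem is the image ID
    all_image_ids = [file.stem for file in series_directory.iterdir()]
    flagged_image_ids = train_localizers.query('SeriesInstanceUID == @series_id')['SOPInstanceUID']
    five_images = []
    for flagged_image in flagged_image_ids:
        # An image can be annotated more than once; after removal below it is skipped here
        if flagged_image in all_image_ids:
            five_images.append(f'#{flagged_image}')  # '#' marks an image with an annotated aneurysm
            all_image_ids.remove(flagged_image)

    # Fill the remaining slots without replacement (random.choices could repeat an image)
    n_missing = 5 - len(five_images)
    if n_missing > 0:
        five_images += random.sample(all_image_ids, k=min(n_missing, len(all_image_ids)))

    training_images_dict[series_id] = five_images

training_images_dict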

{'1.2.826.0.1.3680043.8.498.10004044428023505108375152878107656647': ['1.2.826.0.1.3680043.8.498.59538439921532583151641435410183569222',
  '1.2.826.0.1.3680043.8.498.10229246287448303586334999931645148833',
  '1.2.826.0.1.3680043.8.498.12783047262572052393652814834001467952',
  '1.2.826.0.1.3680043.8.498.12396711188070994245238798082430967707',
  '1.2.826.0.1.3680043.8.498.68792808057051152605669810403318403420'],
 '1.2.826.0.1.3680043.8.498.10004684224894397679901841656954650085': ['1.2.826.0.1.3680043.8.498.53155513795891847770717428417028406753',
  '1.2.826.0.1.3680043.8.498.84993672078661833540733117968313713860',
  '1.2.826.0.1.3680043.8.498.10705592637578021800582403207600385126',
  '1.2.826.0.1.3680043.8.498.26477353149312747292324593505542813040',
  '1.2.826.0.1.3680043.8.498.10493925767419543265807841904021073845'],
 '1.2.826.0.1.3680043.8.498.10005158603912009425635473100344077317': ['#1.2.826.0.1.3680043.8.498.10775329348174902199350466348663848346',
  '1.2.826.0.1.3680043.

In [25]:
training_images_dict_file_path = 'data_gen/training_images_dict.txt'
# Make sure the output folder exists, then store the training images dict as JSON in a text file
Path(training_images_dict_file_path).parent.mkdir(parents=True, exist_ok=True)
with open(training_images_dict_file_path, 'w', encoding='utf-8') as file:
    json.dump(training_images_dict, file, indent=4, ensure_ascii=False)

In [28]:
# Read the saved selection back from disk
with open(training_images_dict_file_path, 'r', encoding='utf-8') as file:
    training_image_dict = json.load(file)
training_image_dict

{'1.2.826.0.1.3680043.8.498.10004044428023505108375152878107656647': ['1.2.826.0.1.3680043.8.498.59538439921532583151641435410183569222',
  '1.2.826.0.1.3680043.8.498.10229246287448303586334999931645148833',
  '1.2.826.0.1.3680043.8.498.12783047262572052393652814834001467952',
  '1.2.826.0.1.3680043.8.498.12396711188070994245238798082430967707',
  '1.2.826.0.1.3680043.8.498.68792808057051152605669810403318403420'],
 '1.2.826.0.1.3680043.8.498.10004684224894397679901841656954650085': ['1.2.826.0.1.3680043.8.498.53155513795891847770717428417028406753',
  '1.2.826.0.1.3680043.8.498.84993672078661833540733117968313713860',
  '1.2.826.0.1.3680043.8.498.10705592637578021800582403207600385126',
  '1.2.826.0.1.3680043.8.498.26477353149312747292324593505542813040',
  '1.2.826.0.1.3680043.8.498.10493925767419543265807841904021073845'],
 '1.2.826.0.1.3680043.8.498.10005158603912009425635473100344077317': ['#1.2.826.0.1.3680043.8.498.10775329348174902199350466348663848346',
  '1.2.826.0.1.3680043.

Next, we want to create the actual training data that we will feed directly into the neural network. To do so, we take the image paths in `training_image_dict` and read the DICOM images using a DICOM image library. Then, we convert those DICOM images to a 3D array of pixel values. Each of these 3D arrays is labeled 1 (aneurysm present) or 0 (aneurysm absent). Note that flagged image IDs carry a `#` prefix in `training_image_dict`, which is not part of the file name.
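
One way to turn the saved dictionary into labeled examples is to strip the `#` marker and keep it as a per-image flag. This is a sketch under assumptions: the helper name `to_records` is hypothetical, the IDs are placeholders, and labeling each image by its `#` marker (rather than labeling the whole series from `Aneurysm Present`) is only one possible reading of the plan.

```python
def to_records(image_dict):
    """Flatten {series_id: [image_id, ...]} into (series_id, image_id, label) tuples."""
    records = []
    for series_id, image_ids in image_dict.items():
        for image_id in image_ids:
            flagged = image_id.startswith('#')
            # Drop the marker so the ID matches the file name on disk
            clean_id = image_id[1:] if flagged else image_id
            records.append((series_id, clean_id, int(flagged)))
    return records

example = {'series_1': ['#img_a', 'img_b'], 'series_2': ['img_c']}
records = to_records(example)
print(records)
```

Each `(series_id, image_id)` pair from these records can then be used to build the path of the corresponding `.dcm` file.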

In [44]:
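def dicom_to_3d_pixel_array(series_id, image_id):
    # Read one DICOM file and return its pixel data as a NumPy array
    image_id = image_id.lstrip('#')  # flagged IDs are stored with a '#' prefix
    filepath = f'series/{series_id}/{image_id}.dcm'
    return dcmread(filepath).pixel_array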

In [46]:
series_id_example = '1.2.826.0.1.3680043.8.498.10004044428023505108375152878107656647'
image_id_example = '1.2.826.0.1.3680043.8.498.59538439921532583151641435410183569222'
ex_dcm = dicom_to_3d_pixel_array(series_id_example, image_id_example)
pd.DataFrame(ex_dcm)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,8,8,8,9,9,11,12,11,10,...,10,11,10,9,10,12,12,9,9,9
2,0,8,8,8,8,9,11,12,12,11,...,9,11,12,10,10,12,12,9,8,9
3,0,8,8,8,8,8,10,11,12,12,...,8,11,12,11,10,11,11,9,8,9
4,0,9,8,7,8,8,8,10,12,13,...,9,10,11,11,10,9,10,9,8,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
507,0,10,11,9,8,7,7,10,12,12,...,6,6,7,8,7,6,6,6,7,7
508,0,10,11,10,9,8,8,10,12,11,...,7,6,7,8,7,7,7,6,6,7
509,0,10,10,10,10,10,10,10,10,9,...,8,7,7,8,8,8,7,6,5,7
510,0,9,9,9,10,11,11,9,7,7,...,8,6,7,8,7,7,7,7,7,8
