## Getting Started

- Kaggle competition: [siim-acr-pneumothorax-segmentation](https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation/overview)
- Download the data.
- Create a PyTorch Dataset and DataLoader.


### Prepare data

1. Download Data
2. Extract Data


In [1]:
!kaggle datasets download -d seesee/siim-train-test

Downloading siim-train-test.zip to /home/siim-acr
100%|██████████████████████████████████████| 1.92G/1.92G [02:15<00:00, 24.3MB/s]
100%|██████████████████████████████████████| 1.92G/1.92G [02:15<00:00, 15.2MB/s]


In [2]:
!unzip -q siim-train-test.zip

In [3]:
ls siim/

[0m[01;34mdicom-images-test[0m/  [01;34mdicom-images-train[0m/  train-rle.csv


In [1]:
from fastai.vision.all import *
from seg_utils import run_length_decode
import pydicom

In [2]:
data_path = Path('siim/')

In [3]:
data_path.ls()

(#4) [Path('siim/dicom-images-test'),Path('siim/dicom-images-train'),Path('siim/train-rle.csv'),Path('siim/train')]

In [4]:
train_meta = pd.read_csv(data_path/'train-rle.csv')
train_meta.columns = [o.strip() for o in train_meta.columns]

In [5]:
train_meta.head(2)

Unnamed: 0,ImageId,EncodedPixels
0,1.2.276.0.7230010.3.1.4.8323329.6904.1517875201.850819,-1
1,1.2.276.0.7230010.3.1.4.8323329.13666.1517875247.117800,557374 2 1015 8 1009 14 1002 20 997 26 990 32 985 38 980 42 981 42 979 43 978 45 976 47 964 59 956 66 925 98 922 101 917 106 916 106 916 107 914 109 909 113 907 116 904 118 903 120 902 120 902 121 900 122 899 124 898 124 898 125 897 125 898 125 896 126 895 127 895 128 895 128 895 128 894 128 895 128 895 128 895 128 895 128 895 128 894 130 893 130 893 130 893 130 893 129 894 129 894 129 894 129 895 127 897 126 898 126 898 125 898 126 898 125 899 125 899 125 899 124 900 124 900 125 899 125 899 125 899 125 899 126 898 127 897 128 897 128 896 129 895 130 894 132 892 133 891 134 890 136 888 137...
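
`EncodedPixels` holds a run-length encoding of the mask, and `-1` marks an image without one. The notebook decodes it with `run_length_decode` from a local `seg_utils` module that is not shown here, so the sketch below only illustrates how such a decoder might work; it is not that module's code. It assumes the relative format the sample row suggests, where each pair is (gap since the end of the previous run, run length), and the name `decode_relative_rle` is hypothetical. Plain lists are used to keep it dependency-free.

```python
def decode_relative_rle(rle, width=1024, height=1024):
    """Decode a relative run-length string into a flat 0/1 list of width*height pixels.

    Assumption: each (gap, length) pair gives the number of background pixels
    since the end of the previous run, followed by the number of mask pixels.
    """
    flat = [0] * (width * height)
    values = [int(v) for v in str(rle).split()]
    position = 0
    for gap, length in zip(values[0::2], values[1::2]):
        position += gap                                   # skip background pixels
        flat[position:position + length] = [1] * length   # mark the run
        position += length                                # continue from the end of the run
    return flat


# Tiny 4x4 example: runs cover flat indices 1-2 and 6.
flat = decode_relative_rle("1 2 3 1", width=4, height=4)
rows = [flat[i * 4:(i + 1) * 4] for i in range(4)]
print(rows)
```

The notebook applies `.T` to the decoded array, which suggests the runs are laid out column by column; a sketch like this one would need the same reshape and transpose before saving.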


In [6]:
train_path = (data_path/'dicom-images-train/')
train_files = list(train_path.glob('*/*/*.dcm'))

In [7]:
img_path = (data_path/'train/images')
mask_path = (data_path/'train/mask')

img_path.mkdir(parents=True, exist_ok=True)
mask_path.mkdir(parents=True, exist_ok=True)

In [8]:
def save_dcm_png(file,dst_path):
    """Extracts image from DCM file and saves it as a PNG file."""
    fname = file.stem
    pyds = pydicom.dcmread(file)
    img = Image.fromarray(pyds.pixel_array)
    img.save(f'{dst_path}/{fname}.png')
    

In [62]:
save_dcm_png(train_files[1],data_path/'train/images/')

In [9]:
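def save_rle_mask(df,fname,dst_path):
    """Converts the mask of one image from RLE format to a PIL Image and saves it as a PNG file."""
    mask = np.zeros((1024,1024))
    rles = df[df['ImageId']==fname]['EncodedPixels']
    for rle in rles:
        # '-1' means "no mask"; compare as a stripped string so both -1 and ' -1' are skipped
        if str(rle).strip() != '-1':
            mask += run_length_decode(rle).T
    # overlapping masks can sum above 1, so clip before scaling to 0/255
    mask = mask.clip(0,1).astype(np.uint8) * 255
    mask = Image.fromarray(mask)
    mask.save(f'{dst_path}/{fname}.png')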
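
`save_rle_mask` filters the whole DataFrame once per file, and an image may have several `EncodedPixels` rows (hence the loop and the `clip`). A possible alternative is to group the encodings by `ImageId` once and look them up afterwards. The sketch below shows the idea on plain tuples; `group_rles` and the sample IDs are hypothetical and not part of the notebook.

```python
def group_rles(rows):
    """Map each image ID to the list of its RLE strings, skipping the '-1' sentinel."""
    grouped = {}
    for image_id, rle in rows:
        runs = grouped.setdefault(image_id, [])  # every image gets an entry, even without a mask
        rle = str(rle).strip()
        if rle != "-1":
            runs.append(rle)
    return grouped


# Made-up rows: one image without a mask, one image with two encodings.
rows = [("img_a", " -1"), ("img_b", "5 2 10 3"), ("img_b", "40 1")]
lookup = group_rles(rows)
print(lookup)
```

With the DataFrame used here, the rows could come from `zip(train_meta['ImageId'], train_meta['EncodedPixels'])`.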
    
    

The sequential approach below could take close to an hour:


```
for file in progress_bar(train_files):
    save_dcm_png(file,data_path/'train/images/')
    save_rle_mask(train_meta,file.stem,data_path/'train/mask/')
    
```

In [15]:
def setup(file):
    save_dcm_png(file,data_path/'train/images/')
    save_rle_mask(train_meta,file.stem,data_path/'train/mask/')

In [17]:
from fastcore.parallel import parallel

In [21]:
# Let's make it faster by running on multiple cores.

parallel(setup, train_files, n_workers=7, progress=True)

(#12089) [None,None,None,None,None,None,None,None,None,None...]
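
The worker count in the call above is hard-coded to 7. One way to derive it from the machine instead is the standard library's `os.cpu_count()`. In the sketch below, the helper name `pick_n_workers` and the choice to keep one core free are assumptions made for illustration.

```python
import os


def pick_n_workers(reserve=1):
    """Return a worker count that leaves `reserve` cores free, but never less than 1."""
    total = os.cpu_count() or 1  # cpu_count() can return None on some platforms
    return max(1, total - reserve)


n_workers = pick_n_workers()
print(n_workers)
```

The result could then be passed as `n_workers=pick_n_workers()` in the `parallel` call.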

In [22]:
len(train_files)

12089
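
The length of the file list only tells us how many files were submitted for conversion. A quick way to check that the outputs actually landed on disk is to count the PNGs in each target folder and compare the counts against `len(train_files)`. This is a minimal sketch using only `pathlib`; `count_pngs` is a hypothetical helper, demonstrated on a temporary folder rather than the real data.

```python
import tempfile
from pathlib import Path


def count_pngs(folder):
    """Count the PNG files directly inside `folder`."""
    return sum(1 for p in Path(folder).glob("*.png") if p.is_file())


# Demonstration on a throwaway directory with two PNGs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.png", "notes.txt"):
        (Path(tmp) / name).touch()
    n_found = count_pngs(tmp)

print(n_found)
```

In the notebook, the check would be `count_pngs(img_path) == len(train_files)`, and likewise for `mask_path`.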