# **IMC data processing pipeline**
 
This pipeline has been specifically developed for the Imaging Mass Cytometry (IMC) - Type 1 Diabetes (T1D) project.
 
## **Introduction**
 
This pipeline extracts image data from Imaging Mass Cytometry acquisitions, performs islet- and cell-level image segmentation, and extracts measurements from the segmented objects.  
This pipeline is designed to work with two antibody panels applied to two consecutive tissue sections.

As input, the user should provide zipped folders, each containing one IMC acquisition session (one `.mcd` file with the associated `.txt` files), and a panel file (`panel.csv`) for each antibody panel that indicates which channels were measured and which channels should be used for segmentation. Detailed information about zipped folders and panel files can be found below.
 
This pipeline is based on functions from the [steinbock package](https://github.com/BodenmillerGroup/steinbock). The full steinbock documentation can be found here: https://bodenmillergroup.github.io/steinbock.


# **Preprocessing**
 
## **Configuration**
 
### **Import packages**

In [15]:
import pandas as pd
import pickle
import re
import sys

from pathlib import Path

from steinbock import io
from steinbock.preprocessing import imc
print(sys.path)
print(sys.executable)

['/home/T1D_preprocessing', '/opt/conda/lib/python39.zip', '/opt/conda/lib/python3.9', '/opt/conda/lib/python3.9/lib-dynload', '', '/opt/conda/lib/python3.9/site-packages']
/opt/conda/bin/python


### **Helper functions**

In [16]:
# Helper function: return the elements of list1 that have no match in list2.
# Elements of list1 are expected to end in ".tiff" (or " - split.tiff");
# this suffix is stripped and the rest is searched as a substring in each item of list2.
def get_unique_elements(list1, list2):
    return [element for element in list1 if not any(element.replace(" - split.tiff", "").replace(".tiff", "") in item for item in list2)]

# Helper function: return every repeated occurrence of an item in a list.
def get_duplicates(lst):
    unique_elements = set()
    duplicates = []
    for item in lst:
        if item in unique_elements:
            duplicates.append(item)
        else:
            unique_elements.add(item)
    
    return duplicates
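
As a quick illustration of how the two helpers behave, the sketch below calls them on made-up file names (the names are hypothetical and not taken from the dataset). The helper definitions are repeated so that the example runs on its own. Note that `get_unique_elements` relies on substring matching, and that `get_duplicates` reports an item once for every repeated occurrence.

```python
# Copies of the two helpers above, repeated so that this example is self-contained.
def get_unique_elements(list1, list2):
    return [
        element for element in list1
        if not any(
            element.replace(" - split.tiff", "").replace(".tiff", "") in item
            for item in list2
        )
    ]


def get_duplicates(lst):
    seen = set()
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


# Hypothetical file names, for illustration only.
images = ["6227_ROI_001.tiff", "6227_ROI_002 - split.tiff", "6227_ROI_003.tiff"]
masks = ["6227_ROI_001_mask.tiff", "6227_ROI_002.tiff"]

unmatched = get_unique_elements(images, masks)
repeated = get_duplicates(["a", "b", "a", "c", "b", "a"])

print(unmatched)  # ['6227_ROI_003.tiff']
print(repeated)   # ['a', 'b', 'a']
```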

### **Define input and output directories**
 
*Manual step:* enter the path to the directory where the data will be saved (named `folder_data` from here on).


In [17]:
# Data folder
folder_data = Path("/home/processing/")
Path(folder_data).mkdir(parents=True, exist_ok=True)
assert Path.exists(folder_data), f"{folder_data} does not exist"
print("Data folder:", folder_data)

# Git folder (folder containing the current notebook)
folder_git = Path.cwd()
assert Path.exists(folder_git), f"{folder_git} does not exist"
print("Git folder:", folder_git)

Data folder: /home/processing
Git folder: /home/T1D_preprocessing


#### **Create folders for intermediate processing steps**
- `raw`: should contain user-provided zipped `.mcd` and `.txt` acquisitions.
- `img`: store extracted images in `.tiff` format.
- `seg_cells`: store image stacks for cell segmentation.
- `masks_cells`: store cell segmentation masks.
- `data_cells`: store generated single-cell data.
- `variables`: store variables (folder paths, panels) exported for use in downstream notebooks.

In [18]:
folders = {
    "raw": folder_data / "raw",
    "img": folder_data / "img",
    "seg_cells": folder_data / "seg_cells",
    "masks_cells": folder_data / "masks_cells",
    "data_cells": folder_data / "data_cells",
    "variables": folder_data / "variables"
}

In [19]:
# Make directories (if they do not exist)
for folder in folders.values():
    folder.mkdir(exist_ok=True)
    
# Add the previously defined data and git folders
folders["data"] = folder_data
folders["git"] = folder_git

# Export folder names for use in downstream notebooks
with open(folder_data / "variables" / "folders.txt", "wb") as handle:
    pickle.dump(folders, handle)
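
A downstream notebook can then reload the exported dictionary with `pickle.load`. The sketch below is a round trip in a temporary directory, so the paths are placeholders rather than the real `folder_data`. Note that, despite the `.txt` extension, the file is a binary pickle and has to be opened in `rb` mode.

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder data folder standing in for `folder_data`.
    demo_data = Path(tmp)
    (demo_data / "variables").mkdir()
    demo_folders = {"raw": demo_data / "raw", "img": demo_data / "img"}

    # Export, as in the cell above.
    with open(demo_data / "variables" / "folders.txt", "wb") as handle:
        pickle.dump(demo_folders, handle)

    # Reload, as a downstream notebook might do.
    with open(demo_data / "variables" / "folders.txt", "rb") as handle:
        reloaded = pickle.load(handle)

# The reloaded dictionary holds the same Path objects as the exported one.
print(reloaded == demo_folders)
```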

### **Antibody panels**
 
Panel files are user-provided `.csv` files that should be located in `folder_data` and contain the following columns:
- `channel`: unique channel ID.
- `metal`: metal isotope to which the antibody is conjugated (format: `Nd144`, `Ir191`).
- `name`: name of the target marker.
- `keep`: should the channel be retained for processing and analysis? (1 = yes, 0 = no).
- `deepcell`: should the channel be used for cell segmentation and, if so, in which compartment is the marker expressed? (1 = nucleus, 2 = membrane, empty if the channel should not be used).
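
To make the expected layout concrete, here is a minimal sketch that writes a small panel file and checks it for the required columns. The rows are illustrative only: the last channel (`Ar80`, `background`) is a hypothetical entry that is not kept, and the file is written to a temporary directory rather than to `folder_data`.

```python
import tempfile
from pathlib import Path

import pandas as pd

required_cols = ["channel", "metal", "name", "keep", "deepcell"]

# Illustrative rows only; the last row is a hypothetical channel that is not kept.
demo_panel = pd.DataFrame({
    "channel": [0, 1, 2, 3],
    "metal": ["In113", "In115", "Pr141", "Ar80"],
    "name": ["Histone H3", "SMA", "insulin", "background"],
    "keep": [1, 1, 1, 0],
    "deepcell": [1, 2, None, None],  # 1 = nucleus, 2 = membrane, empty = unused
})

with tempfile.TemporaryDirectory() as tmp:
    panel_path = Path(tmp) / "panel_demo.csv"
    demo_panel.to_csv(panel_path, index=False)
    loaded = pd.read_csv(panel_path)

# Check the required columns and keep only the channels flagged with keep == 1.
missing_cols = [col for col in required_cols if col not in loaded.columns]
kept = loaded[loaded["keep"] == 1]

print("Missing columns:", missing_cols)
print("Channels kept:", list(kept["metal"]))
```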

In [20]:
# Columns required in the panel file(s)
panel_cols = {
    "col_channel": "channel",
    "col_metal": "metal",
    "col_name": "name",
    "col_keep": "keep",
    "col_deepcell": "deepcell",
}

*Manual step:* adapt the panel names and panel file names if needed. 

In [21]:
# List panel files
panels = {
    "Uncompressed": (folder_data / 'panel_Uncompressed.csv'),
    "Compressed": (folder_data / 'panel_Compressed.csv')
}

#### **Load and display the panels**

In [22]:
# Loop through the panels
for panel_name, panel_path in panels.items():
    print("Panel:", panel_name)
    
    # Load the panel file
    assert Path.exists(panel_path), f"{panel_path} does not exist"
    cur_panel = pd.read_csv(panel_path, sep = ',', index_col = False)

    # Make sure that the required columns exist
    for col in panel_cols.values():
        assert(col in cur_panel.columns), f"Column {col} missing from panel"
    
    # Subset the panel
    cur_panel = cur_panel[cur_panel[panel_cols["col_keep"]]==1]
    panels[panel_name] = cur_panel
    
    # Display the panel
    print(panels[panel_name].head())
    
# Export the panels for use in downstream scripts
with open(folder_data / "variables" / "panels.txt", "wb") as handle:
    pickle.dump(panels, handle)

Panel: Uncompressed
   channel  metal                     name    antibody_clone  keep  deepcell  \
0        0  In113               Histone H3              D1H2     1       1.0   
1        1  In115                      SMA               1A4     1       2.0   
5        5  Pr141                  insulin             C27C9     1       0.0   
6        6  Nd143                     CD44               IM7     1       2.0   
7        7  Nd144  Prohormone Convertase 2  Polyclonal _ PC2     1       0.0   

   clustering  dimred shortname  
0           0       0        H3  
1           1       1       SMA  
5           1       1       INS  
6           1       1      CD44  
7           1       1     PCSK2  
Panel: Compressed
   channel  metal                     name  antibody_clone  keep  deepcell  \
0        0  In113               Histone H3            D1H2     1       1.0   
1        1  In115                      SMA             1A4     1       2.0   
5        5  Nd143                 CD44_GCG 


## **Process zipped folders**
 
IMC acquisitions generate `.mcd` and `.txt` files. Each acquisition session (corresponding to one `.mcd` file) should be zipped in a folder containing:
- The `.mcd` file.
- All the associated `.txt` files generated during the acquisition (do not change any of the file names).

The `.txt` files are used as a backup in case the data cannot be extracted from the `.mcd` file. 
All the zipped folders should be stored in subfolders of the `raw` folder (in the `folder_data` directory). The subfolders should be named exactly like the panels in `panels` (see "List panel files" above).   

For the current dataset, the folder structure is the following, with zipped MCD and TXT files stored in `raw/Uncompressed` and `raw/Compressed`:

folder_data
|_ data_cells
|_ data_islets
|_ img
|_ masks_cells
|_ masks_islets
|_ raw
    |_ Uncompressed <- ZIP files from the Uncompressed panel stored here
    |_ Compressed   <- ZIP files from the Compressed panel stored here
|_ seg_cells
|_ seg_islets
|
|_ panel_Uncompressed.csv <- Panel file (Uncompressed panel)
|_ panel_Compressed.csv   <- Panel file (Compressed panel)
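
One way to create such an archive is with Python's `zipfile` module. The sketch below builds a dummy acquisition folder in a temporary directory and zips it into a panel subfolder of `raw`. The file names are placeholders and the empty files stand in for real acquisition data. Whether the archive should hold the files at its top level or inside a subfolder is not specified here, so treat the layout as an assumption.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Dummy acquisition session: one .mcd file plus its .txt files (placeholders).
    session = root / "6227_Uncompressed"
    session.mkdir()
    (session / "6227_Uncompressed.mcd").touch()
    (session / "6227_Uncompressed_ROI_001.txt").touch()

    # Target subfolder named after the panel.
    raw_subdir = root / "raw" / "Uncompressed"
    raw_subdir.mkdir(parents=True)

    # Zip the session files without renaming them.
    zip_path = raw_subdir / f"{session.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(session.iterdir()):
            archive.write(file, arcname=file.name)

    # Read the archive back to list its content.
    with zipfile.ZipFile(zip_path) as archive:
        zipped_names = sorted(archive.namelist())

print(zipped_names)
```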

### **List `.zip` folders**
*Manual step:* define a regular expression to identify the naming scheme of `.zip` files.


The regular expression should describe the part that all zipped file names have in common.
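
For illustration, the sketch below applies the regular expression used in the next cell to a few hypothetical file names and shows which parts are captured as `caseid` and `panel`. `re.match` only anchors at the start of the name, so the file extension is ignored.

```python
import re

file_regex = '(?P<caseid>[0-9]{3,4})_(?P<panel>[a-zA-Z0-9]+)'
re_fn = re.compile(file_regex)

# Hypothetical file names, for illustration only.
names = ["6227_Compressed.zip", "6396_Uncompressed.zip", "notes.txt"]

parsed = {}
for name in names:
    match = re_fn.match(name)
    # Names that do not follow the naming scheme are mapped to None.
    parsed[name] = match.groupdict() if match else None

for name, groups in parsed.items():
    print(name, "->", groups)
```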

In [23]:
file_regex = '(?P<caseid>[0-9]{3,4})_(?P<panel>[a-zA-Z0-9]+)'

# List all zip folders that match the regular expression.
# The printed count should equal the number of folders expected for the dataset.
re_fn = re.compile(file_regex)
zip_folders = [f for f in folders['raw'].rglob("*") if
               re_fn.match(f.name)]
print("\nTotal number of '.zip' folders:", len(zip_folders))

# List all case IDs and panels
case_list = []
panel_list = []

for file in zip_folders:
    case_list.append(re_fn.search(file.name).group("caseid"))
    panel_list.append(re_fn.search(file.name).group("panel"))

case_list = set(case_list)
panel_list = set(panel_list)

# Generate a table with case IDs as indexes and panels as columns
zip_table = pd.DataFrame(dtype=str, columns=panel_list, index=case_list)

for file in zip_folders:
    cur_case = re_fn.search(file.name).group("caseid")
    cur_panel = re_fn.search(file.name).group("panel")
    zip_table.loc[cur_case, cur_panel] = file.name
zip_table


Total number of '.zip' folders: 16


| | Compressed | Uncompressed |
|---|---|---|
| 6227 | 6227_Compressed.mcd | 6227_Uncompressed.mcd |
| 6396 | 6396_Compressed.mcd | 6396_Uncompressed.mcd |
| 6238 | 6238_Compressed.mcd | 6238_Uncompressed.mcd |
| 6399 | 6399_Compressed.mcd | 6399_Uncompressed.mcd |
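
A table like this makes it easy to spot cases for which a panel is missing, since the corresponding cell stays empty (`NaN`). The sketch below rebuilds a small table by hand and lists the incomplete cases. Case `6238` is deliberately left without a `Compressed` file for the sake of the example, which is not the situation in the real data.

```python
import pandas as pd

# Toy table: case IDs as index, panels as columns.
demo_table = pd.DataFrame(
    dtype=object, columns=["Compressed", "Uncompressed"], index=["6227", "6238"]
)
demo_table.loc["6227", "Compressed"] = "6227_Compressed.mcd"
demo_table.loc["6227", "Uncompressed"] = "6227_Uncompressed.mcd"
demo_table.loc["6238", "Uncompressed"] = "6238_Uncompressed.mcd"

# Rows with at least one empty cell are cases with a missing panel.
incomplete = demo_table[demo_table.isna().any(axis=1)]
incomplete_cases = list(incomplete.index)
print(incomplete_cases)
```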


## **Extract images from IMC acquisitions**
 
Here, images are extracted from raw IMC files and saved in the `img` folder. Each image corresponds to one acquisition in one file, with the image channels filtered (`keep` column in the antibody panel) and sorted according to the panel file.  
 
In case an `.mcd` file is corrupted, the steinbock function tries to extract missing acquisitions from matching `.txt` files. In a second step, images from unmatched `.txt` files are extracted as well.  

See the full documentation here: https://bodenmillergroup.github.io/steinbock/latest/cli/preprocessing/#image-conversion 

### **Settings**
After image extraction, hot pixel filtering is performed using the threshold defined by the `hot_pixel_filtering` variable.

In [24]:
hot_pixel_filtering = 50
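
For intuition, hot pixel filtering is commonly implemented by comparing each pixel with its brightest neighbor: if the pixel exceeds that neighbor by more than the threshold, it is replaced by the neighbor value. The NumPy sketch below illustrates this idea on a single channel. It is an assumption about the general approach, not a copy of the steinbock implementation, and `filter_hot_pixels_sketch` is a hypothetical name.

```python
import numpy as np


def filter_hot_pixels_sketch(img, threshold):
    """Replace pixels exceeding their brightest 8-neighbor by more than `threshold`."""
    height, width = img.shape
    # Mirror the borders so that edge pixels also have eight neighbors.
    padded = np.pad(img, 1, mode="reflect")
    # Collect the eight shifted views of the image (the center pixel is excluded).
    neighbors = [
        padded[1 + dy:1 + dy + height, 1 + dx:1 + dx + width]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    ]
    max_neighbor = np.max(neighbors, axis=0)
    return np.where(img - max_neighbor > threshold, max_neighbor, img)


# Toy channel: uniform background of 1 with a single hot pixel of 100.
channel = np.ones((5, 5), dtype=float)
channel[2, 2] = 100.0

filtered = filter_hot_pixels_sketch(channel, threshold=50)
print(filtered[2, 2])
```

A floating-point image is used here on purpose: with unsigned integer data, the subtraction would wrap around instead of producing negative values.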

Here: fix potential mismatched regions!


### **Image conversion**
Extract image stacks from IMC acquisitions. 
Image and acquisition metadata are exported to `folder_data` as one file per panel, named `images_<panel name>.csv`.


In [25]:
for panel_name, panel in panels.items():
    print("Processing", panel_name, "panel")
    
    # Input and output folders
    image_info = []
    raw_subdir = folders["raw"] / panel_name
    img_subdir = folders["img"] / panel_name
    img_subdir.mkdir(exist_ok = True)  
    
    # List zipped files
    cur_mcd_files = imc.list_mcd_files(raw_subdir, unzip=True)
    cur_txt_files = imc.list_txt_files(raw_subdir, unzip=True)
    
    # Process files
    for (mcd_file, acquisition, img, matched_txt, recovered) in \
    imc.try_preprocess_images_from_disk(
        cur_mcd_files, cur_txt_files,
        hpf = hot_pixel_filtering,
        channel_names = panels[panel_name]["metal"],
        unzip = True
    ):
        cur_desc = acquisition.description
        cur_case = re_fn.search(mcd_file.name).group("caseid")
        
        img_file = f"{mcd_file.stem}_{cur_desc}.tiff"
        io.write_image(img, img_subdir / img_file)

        # Save acquisition metadata
        image_info_row = imc.create_image_info(
            mcd_file, acquisition, img, matched_txt, recovered, img_file
        )
    
        image_info_row["panel"] = panel_name
        image_info.append(image_info_row)

    image_info = pd.DataFrame(image_info)
    image_meta_file = f"images_{panel_name}.csv"
    image_info.to_csv(folders["data"] / image_meta_file, index = False)

Processing Uncompressed panel


Error reading acquisition 33 from file /home/processing/raw/Uncompressed/6399_Uncompressed/6399_Uncompressed.mcd: MCD file '6399_Uncompressed.mcd' corrupted: inconsistent acquisition image data size
Error reading acquisition 35 from file /home/processing/raw/Uncompressed/6399_Uncompressed/6399_Uncompressed.mcd: MCD file '6399_Uncompressed.mcd' corrupted: inconsistent acquisition image data size
Error reading acquisition 2 from file /home/processing/raw/Uncompressed/6238_Uncompressed/6238_Uncompressed.mcd: MCD file '6238_Uncompressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 5 from file /home/processing/raw/Uncompressed/6227_Uncompressed/6227_Uncompressed.mcd: MCD file '6227_Uncompressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 23 from file /home/processing/raw/Uncompressed/6227_Uncompressed/6227_Uncompressed.mcd: MCD file '6227_Uncompressed.mcd' corrupted: invalid acquisition image data offsets
Error reading

Processing Compressed panel


Error reading acquisition 19 from file /home/processing/raw/Compressed/6396_Compressed/6396_Compressed.mcd: MCD file '6396_Compressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 27 from file /home/processing/raw/Compressed/6227_Compressed/6227_Compressed.mcd: MCD file '6227_Compressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 29 from file /home/processing/raw/Compressed/6227_Compressed/6227_Compressed.mcd: MCD file '6227_Compressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 30 from file /home/processing/raw/Compressed/6227_Compressed/6227_Compressed.mcd: MCD file '6227_Compressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 32 from file /home/processing/raw/Compressed/6227_Compressed/6227_Compressed.mcd: MCD file '6227_Compressed.mcd' corrupted: invalid acquisition image data offsets
Error reading acquisition 33 from file /home/processing

## **Catch unmatched images** 

### **Flag missing images**
The ablated regions should be the same on all consecutive sections. Here, we attempt to match images from different panels, based on the ROI number. Images from one panel that do not have a corresponding image in the other panel(s) are flagged.
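
The matching logic boils down to removing the panel name from each file name and taking the symmetric difference between the resulting sets. Here is a minimal sketch on hypothetical file names:

```python
# Hypothetical image names for two panels.
panel_images = {
    "Uncompressed": ["6227_Uncompressed_ROI_001.tiff", "6227_Uncompressed_ROI_005.tiff"],
    "Compressed": ["6227_Compressed_ROI_001.tiff", "6227_Compressed_ROI_002.tiff"],
}

# Remove the panel name so that names from different panels become comparable.
stripped = {
    panel: {name.replace(panel, "") for name in names}
    for panel, names in panel_images.items()
}

# Images present in exactly one of the two panels.
unmatched = sorted(stripped["Uncompressed"] ^ stripped["Compressed"])
print(unmatched)  # ['6227__ROI_002.tiff', '6227__ROI_005.tiff']
```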


In [26]:
panel_names = list(panels.keys())
missing = set()

# List files for the first panel
images_panel0 = sorted([img.name.replace(panel_names[0], "") \
                        for img in Path.iterdir(folders["img"] / panel_names[0])])
images_panel0 = frozenset(images_panel0)

# Find matched images in the other panels
for panel_name in panel_names[1:]:
    cur_images = [img.name for img in Path.iterdir(folders["img"] / panel_name)]
    cur_list = set([img.replace(panel_name, "") for \
                    img in cur_images])
    
    missing.add(frozenset(images_panel0.difference(cur_list)))
    missing.add(frozenset(cur_list.difference(images_panel0)))

# Print out all missing images
missing = [list(x) for x in missing]
missing = sorted([x for xs in missing for x in xs])
print("Images with missing corresponding images (", len(missing),
      "missing images ):\n", missing)


Images with missing corresponding images ( 9 missing images ):
 ['6227__ROI_005.tiff', '6227__ROI_025.tiff', '6227__ROI_028.tiff', '6227__ROI_035_test.tiff', '6238__ROI_002.tiff', '6238__ROI_031.tiff', '6396__ROI_019.tiff', '6399__Test-2.tiff', '6399__Test-4.tiff']


*Note:* the script was adjusted in this section; keep this in mind if the next part fails.


### **Delete unmatched images**

Images that do not have a matching image in all the other panels are deleted.
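
Because deletion is irreversible, it can be worth previewing the affected files first. The sketch below reproduces the file-name reconstruction on dummy files in a temporary directory and adds a hypothetical `dry_run` flag (not part of the pipeline) that only reports which existing files would be removed.

```python
import tempfile
from pathlib import Path


def delete_unmatched(img_dir, panel_names, missing, dry_run=True):
    """Return the unmatched image files found on disk; delete them unless dry_run."""
    found = []
    for panel_name in panel_names:
        for name in missing:
            # Re-insert the panel name that was stripped during matching.
            image = img_dir / panel_name / name.replace("__", f"_{panel_name}_")
            if image.exists():
                found.append(image.name)
                if not dry_run:
                    image.unlink()
    return found


with tempfile.TemporaryDirectory() as tmp:
    img_dir = Path(tmp)
    # Dummy images: ROI 005 exists only in the Uncompressed panel.
    for panel_name, rois in {"Uncompressed": ["001", "005"], "Compressed": ["001"]}.items():
        (img_dir / panel_name).mkdir()
        for roi in rois:
            (img_dir / panel_name / f"6227_{panel_name}_ROI_{roi}.tiff").touch()

    panel_names = ["Uncompressed", "Compressed"]
    missing = ["6227__ROI_005.tiff"]

    previewed = delete_unmatched(img_dir, panel_names, missing, dry_run=True)
    deleted = delete_unmatched(img_dir, panel_names, missing, dry_run=False)
    remaining = sorted(p.name for p in img_dir.rglob("*.tiff"))

print("Previewed:", previewed)
print("Deleted:", deleted)
print("Remaining:", remaining)
```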

In [27]:
delete_unmatched_images = True

if missing and delete_unmatched_images:
    for panel_name in panel_names:
        cur_dir = folders["img"] / panel_name
        unmatched_images = [
            cur_dir / im.replace("__", ("_" + panel_name + "_")) \
            for im in missing]
        
        for image in unmatched_images:
            print(f"Deleting {image}")
            Path.unlink(image, missing_ok=True)

Deleting /home/processing/img/Uncompressed/6227_Uncompressed_ROI_005.tiff
Deleting /home/processing/img/Uncompressed/6227_Uncompressed_ROI_025.tiff
Deleting /home/processing/img/Uncompressed/6227_Uncompressed_ROI_028.tiff
Deleting /home/processing/img/Uncompressed/6227_Uncompressed_ROI_035_test.tiff
Deleting /home/processing/img/Uncompressed/6238_Uncompressed_ROI_002.tiff
Deleting /home/processing/img/Uncompressed/6238_Uncompressed_ROI_031.tiff
Deleting /home/processing/img/Uncompressed/6396_Uncompressed_ROI_019.tiff
Deleting /home/processing/img/Uncompressed/6399_Uncompressed_Test-2.tiff
Deleting /home/processing/img/Uncompressed/6399_Uncompressed_Test-4.tiff
Deleting /home/processing/img/Compressed/6227_Compressed_ROI_005.tiff
Deleting /home/processing/img/Compressed/6227_Compressed_ROI_025.tiff
Deleting /home/processing/img/Compressed/6227_Compressed_ROI_028.tiff
Deleting /home/processing/img/Compressed/6227_Compressed_ROI_035_test.tiff
Deleting /home/processing/img/Compressed/6238_

## **Next steps**

The next step in this pipeline is cell segmentation, which is performed with the `02_CellSegmentation.ipynb` notebook.