# How to Get Marker Metadata 
### What is Marker Metadata
Marker metadata provides the essential information that the KRONOS model requires for inference alongside the input images. Unlike models based on RGB images—where the order and type of each channel are predefined and only the mean and standard deviation of the red, green, and blue channels are required as metadata (often hardcoded in the inference code)—spatial proteomics (SP) multiplex datasets vary in the number and type of image channels from one dataset to another.

Furthermore, marker names in SP datasets are not standardized; they often differ in capitalization, spacing, or hyphenation (e.g., the KI67 marker might appear as "Ki-67" or "KI-67"). As a result, a separate CSV file is used to store all the metadata for the markers in the KRONOS pretraining dataset.
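
The naming variation described above can be illustrated with a small normalization helper. This is only a sketch: `normalize_marker_name` is a hypothetical function, not part of the KRONOS utilities, and the key it produces is meant for comparing names, not for replacing the names stored in the metadata file (some of which contain hyphens).

```python
import re


def normalize_marker_name(name: str) -> str:
    """Build a comparison key: uppercase, with spaces, hyphens, and underscores removed.

    Hypothetical helper for illustration; not part of the KRONOS code base.
    """
    return re.sub(r"[\s\-_]+", "", name.strip().upper())


# Different spellings of the same marker collapse to one comparison key.
for raw in ["Ki-67", "KI-67", "ki 67"]:
    print(raw, "->", normalize_marker_name(raw))
```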

### What is Included in the Marker Metadata File
The marker metadata file contains four columns: `marker_name`, `marker_id`, `marker_mean`, and `marker_std`. Each column represents the following:
- **marker_name**: The name of the marker, presented in uppercase.
- **marker_id**: A unique identifier assigned to the marker in the pretraining dataset.
- **marker_mean**: The mean intensity value of the marker, calculated from all images of that marker in the pretraining dataset.
- **marker_std**: The standard deviation of the marker's intensity values, calculated from all images of that marker in the pretraining dataset.

<br/>
The complete list of marker names, IDs, and their corresponding mean and standard deviation values is available in the CSV file [marker_metadata.csv](https://huggingface.co/MahmoodLab/KRONOS/blob/main/marker_metadata.csv) in our Hugging Face repository.
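
As a rough illustration of the four-column layout, the sketch below parses a metadata file with Python's standard `csv` module and checks that the expected columns are present. `load_marker_metadata` is a hypothetical helper, and the three sample rows are taken from the values shown later in this tutorial, under the assumption that `BCL-2` maps to the `BCL2` entry.

```python
import csv
import io

EXPECTED_COLUMNS = {"marker_name", "marker_id", "marker_mean", "marker_std"}

# Small in-memory stand-in for marker_metadata.csv (three rows only).
SAMPLE = """marker_name,marker_id,marker_mean,marker_std
BCL2,150,0.047104,0.060276
CCR6,166,0.044867,0.042833
CD11B,180,0.032169,0.052366
"""


def load_marker_metadata(handle):
    """Return {marker_name: {"marker_id", "marker_mean", "marker_std"}} from an open CSV handle."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {
        row["marker_name"]: {
            "marker_id": int(row["marker_id"]),
            "marker_mean": float(row["marker_mean"]),
            "marker_std": float(row["marker_std"]),
        }
        for row in reader
    }


metadata = load_marker_metadata(io.StringIO(SAMPLE))
print(metadata["CCR6"])
```

For a file on disk, the same function can be called inside `with open(path, newline="") as handle:`.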

### How Marker IDs Are Assigned to Markers
Marker IDs range from 1 to 512 and distinguish different biological markers. In the pretraining dataset, nuclear markers are assigned IDs from 1 to 127, while non-nuclear markers receive IDs from 128 to 512. This grouping helps capture high-level similarities between markers of the same type. Within each category, markers are arranged alphabetically, and only even-numbered IDs are assigned to markers included in the pretraining dataset. The odd-numbered IDs are intentionally left unassigned and reserved for biologically similar markers that were not part of the pretraining dataset. End users can therefore give a new marker an odd-numbered ID next to a biologically similar marker, so that newly added markers stay close to the existing structure while preserving biological relevance.
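
The ID scheme above suggests a simple way to pick an ID for a new marker: look for unassigned IDs closest to a biologically similar marker. The sketch below is one possible way to do that; `nearest_free_ids` is a hypothetical helper, and the set of used IDs is a small example rather than the full content of `marker_metadata.csv`.

```python
def nearest_free_ids(used_ids, anchor_id, count=3, low=1, high=512):
    """Return up to `count` unassigned IDs in [low, high], closest to `anchor_id` first.

    Hypothetical helper for illustration; ties are broken in favor of the lower ID.
    """
    used = set(used_ids)
    candidates = [i for i in range(low, high + 1) if i not in used]
    candidates.sort(key=lambda i: (abs(i - anchor_id), i))
    return candidates[:count]


# Example: IDs already taken around a similar marker with ID 196.
used = {194, 196, 198, 200}
print(nearest_free_ids(used, anchor_id=196))  # [195, 197, 193]
```

Passing `low=128` would restrict the search to the non-nuclear range, and `high=127` to the nuclear range.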

## Step 1: Create Marker Info File and Download Marker Metadata File
Create a CSV file (`marker_info.csv`) with two columns, `channel_id` and `marker_name`, for a given multiplex dataset. The `channel_id` should indicate the index of the corresponding marker in the multiplex images of that dataset.

Save the `marker_info.csv` file in the `project_dir/dataset/` folder, where `project_dir` is your project directory.
Download the `marker_metadata.csv` file from Hugging Face and copy it to the `project_dir/dataset/` directory as well.
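
If you are preparing your own dataset, a `marker_info.csv` with the two columns described above could be written as in the sketch below. `write_marker_info` is a hypothetical helper, the marker list is an example, and a temporary directory stands in for `project_dir`. Zero-based channel indices are assumed here, matching the `channel_id` values shown in the outputs later in this tutorial.

```python
import csv
import tempfile
from pathlib import Path


def write_marker_info(marker_names, csv_path):
    """Write channel_id,marker_name rows, using the list position as the channel index."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["channel_id", "marker_name"])
        for channel_id, name in enumerate(marker_names):
            writer.writerow([channel_id, name])
    return csv_path


# A temporary directory stands in for the real project directory.
project_dir = Path(tempfile.mkdtemp())
path = write_marker_info(["BCL-2", "CCR6", "CD11B"], project_dir / "dataset" / "marker_info.csv")
print(path.read_text())
```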

In [None]:
# In this tutorial, we will use a public CRC dataset
from utils.crc_dataset_prep import download_and_prepare_dataset

project_dir = "/path/to/project/directory"
# Download and prepare the dataset
download_and_prepare_dataset(project_dir)

## Step 2: Name-Based Marker Matching
The following script maps the markers in the `marker_info.csv` file to those used in the KRONOS pretraining dataset (`marker_metadata.csv`) based on marker names. <br />
It also displays a list of unmatched markers along with suggestions derived from marker name similarity with entries in `marker_metadata.csv`.

In [None]:
from utils import MarkerMetadata
# Define the project directory
project_dir = "/path/to/project/directory"  # Replace with your actual project directory
# Define paths for the dataset-specific marker info and the pretrained marker metadata files.
marker_info_csv_path = f"{project_dir}/dataset/marker_info.csv"        # Path to the dataset-specific marker info file.
marker_metadata_csv_path = f"{project_dir}/dataset/marker_metadata.csv"  # Path to the pretrained marker metadata file.
top_suggestions = 5  # Number of top suggestions to display for each unmatched marker.

# Create an instance of MarkerMetadata and retrieve the marker metadata.
obj = MarkerMetadata(marker_info_csv_path, marker_metadata_csv_path, top_suggestions)
obj.get_marker_metadata()

# Display the number of markers that do not match the pretrained dataset.
print(f"There are {len(obj.missing_marker_dict)} markers that do not match with the markers in the pretrained dataset.")

# Show the top suggestions based on marker name similarity for each unmatched marker.
print(f"Below are the top {top_suggestions} marker name similarity suggestions for each missing marker:")
display(obj.missing_marker_df)

# Display the dictionary for missing markers, which needs to be manually mapped to a biologically similar marker in marker_metadata.csv.
print("The following dictionary contains missing markers that need to be manually mapped:")
display(obj.missing_marker_dict)

There are 17 markers that do not match with the markers in the pretrained dataset.
Below are the top 5 marker name similarity suggestions for each missing marker:


Missing Marker,Suggestion 1,Suggestion 2,Suggestion 3,Suggestion 4,Suggestion 5
BCL-2,BCL2,BDCA-2,BCL6,B2M,CD28
CD46,CD61,CD54,CD45,CD40,CD36
COLLAGEN 4,COLLAGEN,LANGERIN,CTLA4,LAG3,CLUSTERIN
CYTOKERITIN,CYTOKERATIN,CLUSTERIN,E-CADHERIN,CATHEPSIN-L,LANGERIN
DAPI-01,DAPI,VDAC1,IL-1B,HLA-1,BDCA-2
GRANZYME B,GZMB,LYSOZYME,RELB,RB,LANGERIN
IDO-1,IDO1,IL-1B,ARID1A,PD1,IGD
LAG-3,LAG3,HLA-1,HLA-DRA,SIGELC-3,LANGERIN
MMP-9,MMP9,MPO,LMP1,TMPRSS2,SIGLEC-9
MUC-1,MUC1,MUC5AC,C1Q,TCF1,LMP1


The following dictionary contains missing markers that need to be manually mapped:


{'BCL-2': '',
 'CD46': '',
 'COLLAGEN 4': '',
 'CYTOKERITIN': '',
 'DAPI-01': '',
 'GRANZYME B': '',
 'IDO-1': '',
 'LAG-3': '',
 'MMP-9': '',
 'MUC-1': '',
 'PD-1': '',
 'PD-L1': '',
 'T-BET': '',
 'TCR-G-D': '',
 'TCRB': '',
 'TIM-3': '',
 'VISA': ''}
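
To give a feel for how name-similarity suggestions like the ones above can be produced, here is a minimal sketch using `difflib` from the Python standard library. This is an assumption for illustration only: the source does not state which similarity measure `MarkerMetadata` uses, and `suggest_markers` is a hypothetical function.

```python
import difflib


def suggest_markers(missing, known_names, top_n=5):
    """Return up to `top_n` known marker names, most similar to `missing` first.

    Hypothetical helper; uses difflib's ratio-based similarity with no cutoff.
    """
    return difflib.get_close_matches(missing.upper(), known_names, n=top_n, cutoff=0.0)


# A short example list, not the full set of names in marker_metadata.csv.
known = ["BCL2", "BCL6", "CD45", "DAPI", "LAG3", "MMP9"]
print(suggest_markers("LAG-3", known, top_n=2))
```

A purely string-based score can rank unrelated markers highly (for example `CD46` against other `CD` markers), which is why the suggestions still need to be checked for biological similarity.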

## Step 3: Manual Marker Mapping 
If some markers do not match based on their names, you can manually adjust the mapping. Use the provided suggestions and/or the list of marker names in the `marker_metadata.csv` file. <br/>
Simply copy the dictionary syntax from the previous step and update the values for the unmatched markers with a valid, biologically similar marker from the suggestions or the `marker_metadata.csv` file.
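
Before applying a hand-edited mapping, it can help to check that every non-empty value really is a name from `marker_metadata.csv`, since a typo would otherwise go unnoticed. The sketch below shows one way to do this; `split_mapping` is a hypothetical helper and the list of known names is a small example.

```python
def split_mapping(mapping, known_names):
    """Split a manual mapping into valid entries and keys that are still unresolved.

    An entry is valid when its value is one of `known_names`; empty or misspelled
    values are reported as unresolved. Hypothetical helper for illustration.
    """
    known = set(known_names)
    valid = {key: value for key, value in mapping.items() if value in known}
    unresolved = [key for key, value in mapping.items() if value not in known]
    return valid, unresolved


known_names = ["BCL2", "VISTA"]
mapping = {"BCL-2": "BCL2", "CD46": "", "VISA": "VISTAA"}  # last value contains a typo
valid, unresolved = split_mapping(mapping, known_names)
print(valid)       # {'BCL-2': 'BCL2'}
print(unresolved)  # ['CD46', 'VISA']
```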

In [2]:
obj.missing_marker_dict = {
    'BCL-2': 'BCL2',
    'CD46': '',
    'COLLAGEN 4': 'COLLAGEN',
    'CYTOKERITIN': 'CYTOKERATIN',
    'DAPI-01': 'DAPI',
    'GRANZYME B': 'GZMB',
    'IDO-1': 'IDO1',
    'LAG-3': 'LAG3',
    'MMP-9': 'MMP9',
    'MUC-1': 'MUC1',
    'PD-1': 'PD1',
    'PD-L1': 'PDL1',
    'T-BET': 'TBET',
    'TCR-G-D': 'TCR-GD',
    'TCRB': 'TCR-B',
    'TIM-3': 'TIM3',
    'VISA': ''
}

# Retrieve marker metadata using the updated mapping.
obj.get_marker_metadata_with_mapping()

if len(obj.missing_marker_dict) > 0:
    # Display the count of markers that still do not match the pretrained dataset.
    print(f"There are {len(obj.missing_marker_dict)} markers that still do not match the markers in the pretrained dataset.")

    # Display the dataframe of unmatched markers.
    display(obj.missing_marker_df)

    # Display the dictionary of unmatched markers that require manual mapping.
    display(obj.missing_marker_dict)
else:
    print("All markers have been successfully mapped to the pretrained dataset.")

There are 2 markers that still do not match the markers in the pretrained dataset.


Missing Marker,Suggestion 1,Suggestion 2,Suggestion 3,Suggestion 4,Suggestion 5
CD46,CD61,CD54,CD45,CD40,CD36
VISA,VISTA,INOS,IGA2,IGA1,ICOS


{'CD46': '', 'VISA': ''}

## Step 4 (Optional): Manually Set Metadata
If some markers still do not match the pretraining dataset and you cannot ignore them, you can manually assign their marker ID, mean, and standard deviation values:

- **Marker ID**: Choose an unassigned ID from the range 1–512 in `marker_metadata.csv`. Ideally, select an ID close to that of a biologically similar marker.
- **Mean & Std Values**: Calculate these from your dataset for the corresponding markers. Convert marker intensities to a float type and scale them to the range 0–1 before computing the mean and standard deviation.
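
The mean and standard deviation for one channel could be computed along the lines of the sketch below. It rests on assumptions that should be checked against your data: images are integer arrays of shape `(channels, height, width)`, and scaling to 0–1 is done by dividing by the maximum value of the integer dtype. `channel_stats` is a hypothetical helper, and the source does not specify the exact scaling used for the pretraining statistics.

```python
import numpy as np


def channel_stats(images, channel):
    """Return (mean, std) of one channel over all images, after scaling to 0-1.

    Assumes integer-typed (C, H, W) arrays and scales by the dtype's maximum value.
    Sums are accumulated so that images are processed one at a time.
    """
    total, total_sq, count = 0.0, 0.0, 0
    for image in images:
        plane = image[channel]
        if not np.issubdtype(plane.dtype, np.integer):
            raise TypeError("expected an integer image; scale float images explicitly")
        scaled = plane.astype(np.float64) / np.iinfo(plane.dtype).max
        total += float(scaled.sum())
        total_sq += float(np.square(scaled).sum())
        count += scaled.size
    mean = total / count
    variance = max(total_sq / count - mean**2, 0.0)  # guard against tiny negative values
    return mean, float(np.sqrt(variance))


# Tiny synthetic example: one single-channel 2x2 uint8 image.
images = [np.array([[[0, 255], [255, 0]]], dtype=np.uint8)]
mean, std = channel_stats(images, channel=0)
print(mean, std)  # 0.5 0.5
```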

In [3]:
marker_metadata_dict = {
    'CD46': {'marker_id': 295, 'marker_mean': 0.051, 'marker_std': 0.085},
    'VISA': {'marker_id': 45, 'marker_mean': 0.015, 'marker_std': 0.014},
}

obj.set_marker_metadata(marker_metadata_dict)
# Display the count of markers that still do not match the pretrained dataset.
if len(obj.missing_marker_dict) > 0:
    print(f"There are {len(obj.missing_marker_dict)} markers that still do not match the markers in the pretrained dataset.")
    
    # Display the dataframe of unmatched markers.
    display(obj.missing_marker_df)

    # Display the dictionary of unmatched markers that require manual mapping.
    display(obj.missing_marker_dict)
else:
    print("All markers now have valid metadata.")
display(obj.marker_info)

All markers now have valid metadata.


,channel_id,marker_name,marker_id,marker_mean,marker_std
0,0,BCL-2,150,0.047104,0.060276
1,1,CCR6,166,0.044867,0.042833
2,2,CD11B,180,0.032169,0.052366
3,3,CD11C,182,0.019039,0.044336
4,4,CD15,194,0.016322,0.040416
5,5,CD16,196,0.041869,0.055626
6,6,CD162,198,0.012217,0.040094
7,7,CD163,200,0.014384,0.033087
8,8,CD2,212,0.161256,0.110404
9,9,CD20,214,0.045192,0.057727


## Step 5: Save the Final Dataset-Specific Metadata File

In [4]:
output_csv_path = f"{project_dir}/dataset/marker_info_with_metadata.csv"
obj.export_marker_metadata(output_csv_path)
display(obj.marker_info)

Exported marker metadata to /media/shaban/hd1/Projects_HD1/Multiplex_Image_Analysis/Github/cHL_new//dataset/marker_info_with_metadata.csv


,channel_id,marker_name,marker_id,marker_mean,marker_std
0,0,BCL-2,150,0.047104,0.060276
1,1,CCR6,166,0.044867,0.042833
2,2,CD11B,180,0.032169,0.052366
3,3,CD11C,182,0.019039,0.044336
4,4,CD15,194,0.016322,0.040416
5,5,CD16,196,0.041869,0.055626
6,6,CD162,198,0.012217,0.040094
7,7,CD163,200,0.014384,0.033087
8,8,CD2,212,0.161256,0.110404
9,9,CD20,214,0.045192,0.057727
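
As an optional last step, the exported file can be sanity-checked before it is used for inference. The sketch below is one possible set of checks, not an official validation routine: `check_exported_metadata` is a hypothetical helper, the column names are assumed to match the table above, and the sample rows are the first three rows of that table.

```python
import csv
import io

# In-memory stand-in for marker_info_with_metadata.csv (first three rows only).
EXPORTED_SAMPLE = """channel_id,marker_name,marker_id,marker_mean,marker_std
0,BCL-2,150,0.047104,0.060276
1,CCR6,166,0.044867,0.042833
2,CD11B,180,0.032169,0.052366
"""


def check_exported_metadata(handle):
    """Return a list of human-readable problems found in the exported metadata."""
    problems = []
    seen_channels = set()
    for row in csv.DictReader(handle):
        name = row["marker_name"]
        if not 1 <= int(row["marker_id"]) <= 512:
            problems.append(f"{name}: marker_id outside 1-512")
        if not 0.0 <= float(row["marker_mean"]) <= 1.0:
            problems.append(f"{name}: marker_mean outside 0-1")
        if float(row["marker_std"]) <= 0.0:
            problems.append(f"{name}: marker_std is not positive")
        if row["channel_id"] in seen_channels:
            problems.append(f"{name}: duplicate channel_id {row['channel_id']}")
        seen_channels.add(row["channel_id"])
    return problems


print(check_exported_metadata(io.StringIO(EXPORTED_SAMPLE)))  # []
```

For the real file, call the function inside `with open(output_csv_path, newline="") as handle:`.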
