# FBMN Annotation & Visualization in Cytoscape

<details>
    <summary>Click to view the narrative</summary>

## Overall narrative of the workshop

- import a network (GNPS or MSQuery) in Cytoscape

- import the MZmine table for node metadata

- import the IIMN edge table.

- show the styling with ion identity network.

- duplicate the views and rename the views for each tool.

- explain the requirements for further imports (a unique feature ID, placed in the first column; `name` and `smiles` columns), and the opportunity for information-rich labeling for each tool.

- import the prepared output of each tool into the network.

- bring-in styles (Get the Cytoscape style for each tool).

- explore with each tool
</details>

## Let's go

**Installation**: install Cytoscape [https://cytoscape.org/](https://cytoscape.org/) and its ChemViz2 plugin [https://apps.cytoscape.org/apps/chemviz2](https://apps.cytoscape.org/apps/chemviz2).

**Notebook**: the notebook is available at [https://github.com/lfnothias/FBMN_annotation_fusion_visualization_Cytoscape](https://github.com/lfnothias/FBMN_annotation_fusion_visualization_Cytoscape). It can run as a Binder instance (see link).

**Input files**: the FBMN network and annotation files were uploaded to the GitHub repo as of April 15th, 2024. If needed, update them.

## Step 1 - Import a network

- Open Cytoscape and import an FBMN-derived network file (GraphML file). This can be done by drag & drop or with `File / Import / Network from File`.

- Demo about Cytoscape and key definitions (node table, edge table).

- We observe the limited amount of information available in the file.
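
To see which attributes a network file carries before any table is added, the GraphML `<key>` declarations can be listed. The sketch below uses only the standard library and a toy GraphML string; the helper name `list_graphml_attributes` and the attribute names in the toy file are illustrative assumptions, not the content of the workshop file.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

def list_graphml_attributes(graphml_text):
    """Return the declared attribute names, grouped by 'node' and 'edge'."""
    root = ET.fromstring(graphml_text)
    attributes = {"node": [], "edge": []}
    for key in root.iter(f"{GRAPHML_NS}key"):
        target = key.get("for")
        if target in attributes:
            attributes[target].append(key.get("attr.name"))
    return attributes

# Toy file for illustration only (attribute names are made up).
toy_graphml = """
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="precursor_mz" attr.type="double"/>
  <key id="d1" for="node" attr.name="rt" attr.type="double"/>
  <key id="d2" for="edge" attr.name="cosine" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="36"><data key="d0">299.18</data></node>
    <node id="37"/>
    <edge source="36" target="37"><data key="d2">0.9</data></edge>
  </graph>
</graphml>
""".strip()

attributes = list_graphml_attributes(toy_graphml)
print(attributes)
```

For a file on disk, `ET.parse(path).getroot()` gives the same root element to iterate over.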

## Step 2 - Import a Node Table

The node table contains metadata about nodes (each node is a feature with its MS1/MS2 spectra). It is mapped onto the network by matching a column containing the feature ID (named `Feature_ID` or a variant such as `scan`, `featureID`, `feature_id`, ...) against the 'shared name' column of the network. The feature ID should preferably be the first column of the node table (otherwise, select the mapping key accordingly), and its values must be unique. If columns with the same name are already present, they will be overwritten.
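
Before importing, it can help to check how many key values would actually find a node. The sketch below is one way to do that with pandas on toy data; the helper name `key_coverage`, the toy table, and the list of node names are assumptions for illustration. Keys are compared as text, on the assumption that an integer `36` in the table should match a node named `"36"`.

```python
import pandas as pd

def key_coverage(table, key_column, node_names):
    """Summarise how well a table's key column maps onto network node names."""
    keys = table[key_column].astype(str)        # compare as text
    names = {str(name) for name in node_names}
    matched = keys.isin(names)
    return {
        "matched": int(matched.sum()),
        "unmatched_keys": keys[~matched].tolist(),
        "duplicated_keys": keys[keys.duplicated()].unique().tolist(),
    }

# Toy data for illustration only.
toy_table = pd.DataFrame({"feature_id": [36, 37, 99], "area": [1.0e6, 2.5e5, 3.0e4]})
toy_node_names = ["36", "37", "38"]

report = key_coverage(toy_table, "feature_id", toy_node_names)
print(report)
```

Note that a key column read as floats would stringify as `36.0`, so the key column may need to be cast to integers first.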

- We will import the `mzmine_results_iimn_gnps_quant.csv` into Cytoscape. First, let's take a look at its content and comment on it.

In [None]:
import pandas as pd 
mzmine = 'results_download/mzmine/mzmine_results_iimn_gnps_quant.csv'
pd.read_csv(mzmine).head(5)

### Let's import it into Cytoscape

- The node table import can be done with `File / Import / Table from File`. \
Make sure you select the right Network for import (Network Collection) and the correct Data Type (Node or Edge Table).

- Start basic Node styling with the Style panel.

## Step 3 - Import an Edge Table

The edge table contains metadata about edges (the relationships between two features). It is mapped onto the network through a column named 'shared name' that contains the interaction ID, written as `NodeID1 (-) NodeID2`. This column should preferably be the first column of the edge table (otherwise, select the mapping key accordingly), and its values must be unique. If columns with the same name are already present, they will be overwritten.


### Prepare IIMN edges for Cytoscape

We need to create a 'shared name' column in the edge table to streamline Cytoscape import.

Original table:
| ID1 | ID2 | EdgeType | Score | Annotation |
|-----|-----|----------|-------|------------|
| 36  | 37  | MS1 annotation | 2      | [M+K]+ [2M+Na]+ dm/z=299.18058           |

Prepared table:
| shared name | ID1 | ID2 | EdgeType | Score | Annotation |
|-------------|-----|-----|----------|-------|------------|
| 36 (-) 37   | 36  | 37  |MS1 annotation| 2      | [M+K]+ [2M+Na]+ dm/z=299.18058           |


In [None]:
def detect_separator(filename):
    with open(filename, 'r') as file:
        first_line = file.readline()
        if '\t' in first_line:
            return '\t'
        else:
            return ','  # Default to comma if no tab found

def prepare_fbmn_iimn_edge_annotation_cytoscape(input_file, output_suffix='_prep'):
    # Load the CSV file into a DataFrame
    separator = detect_separator(input_file)
    df = pd.read_csv(input_file, sep=separator)

    # Check if ID1 and ID2 are in the DataFrame
    if 'ID1' not in df.columns or 'ID2' not in df.columns:
        raise ValueError("ID1 or ID2 column is missing in the DataFrame.")

    # Create the 'shared name' column by concatenating 'ID1' and 'ID2' with " (-) " in between
    df['shared name'] = df['ID1'].astype(str) + ' (-) ' + df['ID2'].astype(str)

    # Move the new column to the first position
    cols = df.columns.tolist()
    cols = ['shared name'] + [col for col in cols if col != 'shared name']
    df = df[cols]

    # Save the modified DataFrame to a new TSV file with the specified suffix
    output_file = f"{input_file.rsplit('.', 1)[0]}{output_suffix}.tsv"
    df.to_csv(output_file, index=False, sep='\t')

    # Print the new output file name
    print(f"File saved as: {output_file}")
    return df.head(5)

In [None]:
iimn_edge_table_path = 'results_download/mzmine/mzmine_results_iimn_gnps_edges_msannotation.csv'
prepare_fbmn_iimn_edge_annotation_cytoscape(iimn_edge_table_path)

### Import the prepared IIMN Edge Table into Cytoscape

- The edge table import can be done with `File / Import / Table from File`.\
Make sure you select the right Network for import (Network Collection) and the correct Data Type (here, Edge Table).

- Start styling Edges with the Style panel. 
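
Since the import maps rows by matching the key column against 'shared name', a row whose two IDs are written in the opposite order from the network's edge name would not map. Whether this occurs in the workshop data is not stated here; the sketch below only shows how such cases could be spotted. The helper name `classify_edge_names` and the toy edge names are assumptions for illustration.

```python
def classify_edge_names(prepared_names, network_names, separator=" (-) "):
    """Sort prepared edge names into direct, reversed, or missing matches."""
    network = set(network_names)
    result = {"direct": [], "reversed": [], "missing": []}
    for name in prepared_names:
        source, _, target = name.partition(separator)
        flipped = f"{target}{separator}{source}"
        if name in network:
            result["direct"].append(name)
        elif flipped in network:
            result["reversed"].append(name)
        else:
            result["missing"].append(name)
    return result

# Toy data for illustration only.
toy_prepared = ["36 (-) 37", "40 (-) 38", "50 (-) 51"]
toy_network = ["36 (-) 37", "38 (-) 40"]

edge_report = classify_edge_names(toy_prepared, toy_network)
print(edge_report)
```

The network's edge names can be exported from the Cytoscape edge table to obtain the list to compare against.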

## Step 4 - Importing other annotations

### MS2Query annotations

Let's take a look at the MS2Query table and import it.

In [None]:
ms2query_annotation_path = 'results_download/matchms/results_for_cytoscape/ms2query_results_for_cytoscape.csv'
pd.read_csv(ms2query_annotation_path).head(5)

It meets the requirements for Cytoscape import into the FBMN network!

**Requirements**:
- a `feature_id` column from the FBMN. The exact column naming is flexible. 
- the `feature_id` entries must be unique and consistent.

    
**Bonus to streamline Cytoscape**:
- The `feature_id` column (or equivalent) should be the first column.
- If there are structural annotations, the annotation name should be in a `name` column, the smiles should be in a `smiles` column.
- We introduce a prefix for all the annotation columns.
- We will add an `annotation_tool` column for visualization.
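
The requirements and bonus points above can be turned into a quick checklist. This is a minimal sketch on toy data, not part of the original notebook; the helper name `check_import_requirements` and the toy table are assumptions for illustration.

```python
import pandas as pd

def check_import_requirements(table, key_column):
    """Report which import requirements and bonus points a table satisfies."""
    columns = list(table.columns)
    has_key = key_column in columns
    return {
        "key_present": has_key,
        "key_unique": bool(table[key_column].is_unique) if has_key else False,
        "key_is_first_column": has_key and columns[0] == key_column,
        "has_name_column": "name" in columns,
        "has_smiles_column": "smiles" in columns,
    }

# Toy table for illustration only: the key is not in the first column
# and there is no 'name' column yet.
toy_annotations = pd.DataFrame({
    "analog_compound_name": ["Nicotinic acid", "Isonicotinic acid"],
    "feature_id": [57, 58],
    "smiles": ["OC(=O)C1=CC=CN=C1", "c1cnccc1C(=O)O"],
})

checklist = check_import_requirements(toy_annotations, "feature_id")
print(checklist)
```

A `False` on the bonus points does not block the import; it only means the mapping key or the label columns have to be selected by hand in Cytoscape.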


Let's check the minimum requirements.

In [None]:
ms2query_annotation = pd.read_csv(ms2query_annotation_path)
print(ms2query_annotation.columns)
ms2query_annotation.head(2)

#### We check for uniqueness of feature_id

In [None]:
# Check whether feature_id is unique.
print('Is feature_id unique ? = ' + str(ms2query_annotation['feature_id'].is_unique))

# Copy the analog compound name into a 'name' column, then save the table as TSV
ms2query_annotation['name'] = ms2query_annotation['analog_compound_name']
ms2query_annotation.to_csv(ms2query_annotation_path[:-4]+'_prep.tsv', sep='\t', index=False)

### Let's import it into Cytoscape

- The node table import can be done with `File / Import / Table from File`. \
Make sure you select the right Network for import (Network Collection) and the correct Data Type (Node or Edge Table).

Let's customize the node style, then search with the filter function.

### Let's do the same with another table

#### MassQL annotation



In [None]:
# Load the CSV file into a DataFrame
tool = 'MassQL'
massql_annotations = 'results_download/GNPS2/ea4293bedd5440148267cb201ef7edbc-merged_query_results_MassQL.tsv'
df = pd.read_csv(massql_annotations, sep='\t')
print(df.columns)
df.head(3)

#### MassQL table

Let's open the MassQL table.

Original table
| charge | filename            | i     | i_norm | scan | ... |
|---------------|----------------------------|--------------|---------------|---------------|---------------|
| 1        | mzmine_results_iimn_gnps.mgf                       |  125140000.0 | 1.0 | 5051| ... |

Prepared table
| scan | MassQL_charge | MassQL_filename     | MassQL_i     | MassQL_i_norm | ... |
|---------------|----------------------------|--------------|---------------|---------------|--------------|
| 5051 | 1         | mzmine_results_iimn_gnps.mgf                         | 125140000.0 | 1.0 | ... |

### Helper functions for preparing the annotation table

In [None]:
def detect_separator(filename):
    with open(filename, 'r') as file:
        first_line = file.readline()
        if '\t' in first_line:
            return '\t'
        else:
            return ','  # Default to comma if no tab found


def duplicate_column_if_string_found(df, substring, new_column_name):
    # Track if a column was found and duplicated
    found_and_duplicated = False

    # Loop through all column names in the DataFrame
    for col in df.columns:
        # Check if the substring matches part of the column name (case-insensitive)
        if substring.lower() in col.lower():
            # Copy the column under the new name, stripping the 'Spectral Match to ' prefix
            df[new_column_name] = df[col].astype(str).replace('Spectral Match to ', '', regex=True)
            print(f"Column '{col}' duplicated into '{new_column_name}'.")
            found_and_duplicated = True

    # If no column matches the substring, print a message
    if not found_and_duplicated:
        print(f"No columns found containing the substring '{substring}'.")


def aggregate_columns(series):
    # Drop NaN values, convert the rest to strings, sort them, and join with commas.
    # Duplicated values are kept, and each column is sorted independently.
    sorted_values = sorted(series.dropna().astype(str))
    aggregated_string = ','.join(sorted_values)
    return aggregated_string


def prepare_fbmn_annotation_for_cytoscape(input_file, feature_id_column, tool_prefix, output_suffix='_prep'):
    # Load the CSV file into a DataFrame
    separator = detect_separator(input_file)
    df = pd.read_csv(input_file, sep=separator)

    # Drop columns where name contains 'Unnamed'
    df = df.loc[:, ~df.columns.str.contains('Unnamed')]

    # Identify any column containing 'smiles' in its name, case-insensitively
    smiles_columns = [col for col in df.columns if 'smiles' in col.lower()]

    # Case insensitive check for 'compound_name' or 'name'
    lower_columns = {col.lower(): col for col in df.columns}  # Create a dict with lower case keys and original column names as values
    compound_col = lower_columns.get('compound_name', lower_columns.get('name'))


    # Check for and remove duplicates based on feature_id_column with either compound_name or smiles
    if compound_col:
        # Create a combined duplicate check list
        for smiles_col in smiles_columns:
            # Remove duplicates where the feature ID and either the compound name or one of the smiles columns are the same
            initial_row_count = len(df)
            df.drop_duplicates(subset=[feature_id_column, compound_col], keep='first', inplace=True)
            df.drop_duplicates(subset=[feature_id_column, smiles_col], keep='first', inplace=True)
            final_row_count = len(df)
            print(f"Removed {initial_row_count - final_row_count} duplicates based on {feature_id_column}, {compound_col}, and {smiles_col}")

    else:
        print("Neither 'compound_name' nor 'name' column is present. Checking for duplicates based on SMILES only.")
        for smiles_col in smiles_columns:
            initial_row_count = len(df)
            df.drop_duplicates(subset=[feature_id_column, smiles_col], keep='first', inplace=True)
            final_row_count = len(df)
            print(f"Removed {initial_row_count - final_row_count} duplicates based on {feature_id_column} and {smiles_col}")

    # Check existence of expected columns
    expected_cols = ['score', 'adduct', 'mol_formula', 'inchi', 'inchi_key', 'compound_name']
    smiles_like_cols = [col for col in df.columns if 'smiles' in col.lower()]
    cols_to_aggregate = expected_cols + smiles_like_cols
    cols_to_aggregate = [col for col in cols_to_aggregate if col in df.columns]

    # If feature_id_column is present and not unique, handle aggregation
    if feature_id_column in df.columns:
            if not df[feature_id_column].is_unique:
                grouped = df.groupby(feature_id_column)[cols_to_aggregate].agg(aggregate_columns).reset_index()
                # Drop the original aggregated columns from main DataFrame and merge with aggregated data
                df = df.drop(columns=cols_to_aggregate).drop_duplicates(subset=feature_id_column)
                df = pd.merge(df, grouped, on=feature_id_column, how='left')
                print('Aggregation completed. FeatureID had duplicates.')
            else:
                print('FeatureID is unique. No aggregation needed.')
    else:
        raise ValueError(f"{feature_id_column} is not a column in the DataFrame.")

    # Copy 'smiles' columns with prefixes and keep original
    for smiles_column in smiles_columns:
        prefixed_smiles_column = f"{tool_prefix}_{smiles_column}"
        if prefixed_smiles_column not in df.columns:  # Check if prefixed column already exists
            df[prefixed_smiles_column] = df[smiles_column]

    # Prepare to add prefix to all columns except feature_id_column and original smiles_columns
    rename_dict = {}
    for col in df.columns:
        if col not in smiles_columns and col != feature_id_column and not col.startswith(tool_prefix):
            rename_dict[col] = f"{tool_prefix}_{col}"

    df.rename(columns=rename_dict, inplace=True)

    # Add an extra column 'annotation_tool' with the value of the tool prefix
    df['annotation_tool'] = tool_prefix

    duplicate_column_if_string_found(df, 'Compound_name', 'name')

    # Renaming and moving feature_id_column to the first position
    df = df[[feature_id_column] + [col for col in df.columns if col != feature_id_column]]  # This moves the feature_id_column to the first position

    # Special handling for SIRIUS (class and structure) and TIMA outputs
    if tool_prefix.lower() == 'sir_class':
        # Check if necessary columns exist
        if 'sir_class_NPC#class' in df.columns and 'sir_class_molecularFormula' in df.columns and 'sir_class_adduct' in df.columns:
            df['name'] = df['sir_class_NPC#class'] + ' | ' + df['sir_class_molecularFormula'] + ' | ' + df['sir_class_adduct']
        else:
            print("Required Canopus columns are not all present.")

    elif tool_prefix.lower() == 'sir_struct':
        # Check if necessary columns exist
        if 'sir_struct_name' in df.columns and 'sir_struct_ConfidenceScore' in df.columns:
            df['name'] = df['sir_struct_name'] + ' (' + df['sir_struct_ConfidenceScore'].astype(str) + ')'
        else:
            print("Required Sirius columns are not all present.")

    elif tool_prefix.lower() == 'tima':
        # Check if necessary columns exist
        if 'tima_candidate_structure_name' in df.columns and 'tima_candidate_structure_tax_npc_03cla' in df.columns:
            df['name'] = df['tima_candidate_structure_name'] + ' | ' +df['tima_candidate_structure_tax_npc_03cla']

        else:
            print("Required TIMA columns are not all present.")

        smiles_col = 'tima_candidate_structure_smiles_no_stereo'
        if smiles_col in df.columns:
            df['smiles'] = df[smiles_col].str.replace('|', ',', regex=False)

    # Save the modified DataFrame to a new TSV file with the specified suffix
    output_file = f"{input_file.rsplit('.', 1)[0]}{output_suffix}.tsv"
    df.to_csv(output_file, index=False, sep='\t')

    # Print the new output file name
    print(f"File saved as: {output_file}")
    return df.head(5)


Let's process the MassQL table.

In [None]:
prepare_fbmn_annotation_for_cytoscape(massql_annotations, 'scan', tool)

#### Let's import the MassQL Node Table into Cytoscape

### Let's prepare the GNPS table

In [None]:
# Load the CSV file into a DataFrame
tool = 'GNPS'
gnps_annotations = 'results_download/GNPS2/861f707d5a4f42e88486c77a4693a38d-merged_results_with_gnps.tsv'
df = pd.read_csv(gnps_annotations, sep='\t')
df.columns

We will add the tool prefix to the columns and move the `#Scan#` column to the first position.

In [None]:
prepare_fbmn_annotation_for_cytoscape(gnps_annotations, '#Scan#', tool)

### Let's prepare the MZmine spectral library annotation table

- There is an `id` column, but its values are not unique, so we concatenate the annotations that share the same `id`.

In [None]:
# Load the CSV file into a DataFrame
tool = 'MZmine'
mzmine_annotations = 'results_download/mzmine/mzmine_results_annotations.csv'
df = pd.read_csv(mzmine_annotations, sep=',')
print(df.columns)
df.head(2)

Original table
| id | compound_name            | adduct    | score | scan | ... |smiles |... |
|---------------|----------------------------|--------------|---------------|---------------|---------------|---------------|---------------|
| 57        | Nicotinic acid, pyridine-3-carboxylic acid                      | [M+H]+ | 1.0 | 0.975| ... |	OC(=O)C1=CC=CN=C1 | ... | 
| 57        | Nicotinic acid, pyridine-3-carboxylic acid                      |  [M+H]+ | 1.0 | 0.975| ... | OC(=O)C1=CC=CN=C1 |... |
| 57        | Isonicotinic acid                     |  [M+H]+ | 1.0 | 0.874| ... | c1cnccc1C(=O)O | ... |

Prepared table
| id | compound_name            | adduct    | score | scan | ... |smiles |... |
|---------------|----------------------------|--------------|---------------|---------------|---------------|---------------|---------------|
| 57        | Nicotinic acid, pyridine-3-carboxylic acid,  Isonicotinic acid | [M+H]+,[M+H]+ | 1.0, 1.0 | 0.975, 0.874| ... |	OC(=O)C1=CC=CN=C1,c1cnccc1C(=O)O | ... | 
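
The collapse from several rows per `id` to one row can be sketched on toy data with a pandas `groupby`. This is a simplified illustration rather than the notebook's helper: the name `join_values`, the toy scores, and the extra feature `60` are assumptions, and this version keeps the original row order instead of sorting the values.

```python
import pandas as pd

# Toy table for illustration only: feature 57 has two candidate annotations.
toy = pd.DataFrame({
    "id": [57, 57, 60],
    "compound_name": ["Nicotinic acid", "Isonicotinic acid", "Benzoic acid"],
    "score": [0.975, 0.874, 0.9],
    "smiles": ["OC(=O)C1=CC=CN=C1", "c1cnccc1C(=O)O", "OC(=O)c1ccccc1"],
})

def join_values(series, separator=","):
    """Join the non-missing values of one group into a single string."""
    return separator.join(series.dropna().astype(str))

# One row per id; every other column becomes a comma-joined string.
collapsed = toy.groupby("id", as_index=False).agg(join_values)
print(collapsed)
```

Keeping row order means the n-th name, score, and SMILES in a joined cell still refer to the same candidate, provided no value is missing. Compound names that themselves contain commas make the joined string ambiguous, so a different separator may be preferable for such tables.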

In [None]:
prepare_fbmn_annotation_for_cytoscape(mzmine_annotations, 'id', tool)

### SIRIUS annotations

Let's look at the SIRIUS class and structure annotations.

#### SIRIUS class annotation

There is a `featureId` column.

Let's do some bonus formatting.

In [None]:
tool = 'sir_class'
class_annotations_path = 'results_download/sirius/summary-files/canopus_compound_summary.tsv'
pd.read_csv(class_annotations_path, sep= '\t').head(5)

In [None]:
prepare_fbmn_annotation_for_cytoscape(class_annotations_path, 'featureId', tool)

#### SIRIUS structure annotations

There is a `featureId` column.

Let's do some bonus formatting.

In [None]:
tool = 'sir_struct'
sirius_annotations_path = 'results_download/sirius/summary-files/compound_identifications.tsv'
pd.read_csv(sirius_annotations_path, sep= '\t').head(5)

In [None]:
prepare_fbmn_annotation_for_cytoscape(sirius_annotations_path, 'featureId', tool)

### TIMA annotations

There is a `feature_id` column. 

Let's do some bonus formatting.

In [None]:
tool = 'tima'
tima_annotations_path = 'results_download/tima/240414_114832_comp_ms_prague/comp_ms_prague_results.tsv'
tima_annotations = pd.read_csv(tima_annotations_path, sep='\t')
tima_annotations.head(5)

In [None]:
prepare_fbmn_annotation_for_cytoscape(tima_annotations_path, 'feature_id', tool)

## We continue with Cytoscape

Exploration and style

## We download all the files for Cytoscape

In [None]:
import os
import zipfile

def zip_prep_files(directory, zip_name, depth=5):
    """
    Search for all files ending in _prep.tsv within the given depth of the directory
    and make a zip archive out of them.

    :param directory: The directory to search for files in.
    :param zip_name: The name of the output zip file.
    :param depth: The depth to search for files. If -1, search all levels.
                  Depth of 0 means the current directory only,
                  1 means the current directory and its immediate subdirectories, and so on.
    """
    def should_include_dir(root_depth, current_depth):
        # If depth is negative, no limit is applied.
        if depth < 0:
            return True
        # Include directories within the desired depth.
        return (current_depth - root_depth) <= depth

    root_depth = directory.count(os.sep)
    with zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(directory):
            current_depth = root.count(os.sep)
            if should_include_dir(root_depth, current_depth):
                for file in files:
                    if file.endswith('_prep.tsv'):
                        print(file)
                        filepath = os.path.join(root, file)
                        zipf.write(filepath, os.path.relpath(filepath, start=directory))
                # Modify the dirs in place to avoid unnecessary recursion into subdirectories beyond the depth
                if not should_include_dir(root_depth, current_depth + 1):
                    dirs.clear()  # This prevents os.walk from going into deeper directories
    print(f"Created zip archive: {zip_name}")

# Example usage:
# Provide the directory to search in, the desired zip file name, and the depth
# zip_prep_files('/path/to/directory', 'prep_files_archive.zip', depth=1)  # Adjust the depth as needed



In [None]:
zip_prep_files('results_download', 'cytoscape_input.zip')
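
To confirm what ended up in an archive before moving it to the machine running Cytoscape, its members can be listed with the standard library. The sketch below builds a small toy archive in a temporary directory so that it runs on its own; the helper name `list_archive` and the toy file names are assumptions for illustration.

```python
import os
import tempfile
import zipfile

def list_archive(zip_path):
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())

# Build a toy archive for illustration only.
with tempfile.TemporaryDirectory() as workdir:
    toy_zip = os.path.join(workdir, "toy_input.zip")
    with zipfile.ZipFile(toy_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mzmine/edges_prep.tsv", "shared name\tID1\tID2\n")
        archive.writestr("tima/results_prep.tsv", "feature_id\tname\n")
    members = list_archive(toy_zip)

print(members)
```

Calling `list_archive('cytoscape_input.zip')` after the cell above would list the `_prep.tsv` files collected from `results_download`.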
