# Mass Spectrometry Analysis Notebook

## Purpose
This notebook is designed to facilitate the analysis of mass spectrometry (MS) data for triplicate experiments. It supports:
- Loading and preprocessing MS data.
- Annotating proteins of interest (POIs) based on a user-provided database.
- Mapping proteins to specific classes or families with color codes.
- Generating volcano plots to visualize differential abundance.
- Searching the plot for specific proteins.
- Creating heatmaps for selected protein classes or families.

## Input File Formats
To successfully run this notebook, you must provide the following files in the specified formats:

### 1. **Mass Spec Data File (`ms_data_path`)**
- **File Type**: Tab-delimited `.txt`
- **Contents**: Quantitative MS data with a column labeled `Accession` (for protein identifiers) and abundance columns prefixed by `apQuant Area`.
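As a minimal sketch of the expected layout, the snippet below reads a small tab-delimited table with pandas and picks out the abundance columns by their `apQuant Area` prefix. The sample rows and the column suffixes (`Ctrl_1`, `IP_1`) are made up for illustration, and the notebook's own `DataLoader` may process the file differently.

```python
import io

import pandas as pd

# Hypothetical two-protein example in the expected tab-delimited layout.
sample_txt = (
    "Accession\tDescription\tapQuant Area Ctrl_1\tapQuant Area IP_1\n"
    "P12345\tExample protein A\t1000.0\t4000.0\n"
    "Q67890\tExample protein B\t2500.0\t2400.0\n"
)

prefix = "apQuant Area"
ms_df = pd.read_csv(io.StringIO(sample_txt), sep="\t")

# Abundance columns are identified purely by their prefix.
abundance_cols = [col for col in ms_df.columns if col.startswith(prefix)]
print(abundance_cols)
```

If this list comes back empty for a real file, the column prefix in the export probably differs from `apQuant Area`.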

### 2. **Proteins of Interest (POI) File (`poi_file_path`)**
- **File Type**: Excel `.xlsx`
- **Contents**: A database of proteins of interest (POIs) containing:
  - A column named `Other UniProt Accessions` listing protein accessions.
  - A column named `Class / family` specifying the classification/group of the proteins.
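The annotation step can be pictured as a lookup from accession to class. The sketch below uses made-up accessions and class names, and it assumes one accession per POI row; how the notebook handles rows with several accessions is not specified here.

```python
# Hypothetical accessions from the mass spec file.
ms_accessions = ["P12345", "Q67890", "O11111"]

# Hypothetical POI table reduced to a mapping: accession -> "Class / family".
poi_classes = {"P12345": "Sm proteins", "A00000": "SR proteins"}

# Annotate every measured protein; proteins that are not POIs get None.
annotated = {acc: poi_classes.get(acc) for acc in ms_accessions}

# Count how many POIs were actually detected in the mass spec data.
n_found = sum(cls is not None for cls in annotated.values())
print(f"Number of POIs found in the mass spec file: {n_found}")
```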

### 3. **Color Code File (`color_code_file_path`)**
- **File Type**: CSV
- **Contents**: Class-specific color codes, containing:
  - A column named `Class / family` corresponding to the POI file.
  - A column named `Color Hex` with hexadecimal color codes for visualization.
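For illustration, a color code file in this format can be turned into a class-to-color lookup as sketched below. The class names and hex values are invented, and the grey fallback for unknown classes is an assumption rather than documented notebook behavior.

```python
import csv
import io

# Hypothetical color code file contents.
sample_csv = (
    "Class / family,Color Hex\n"
    "Sm proteins,#1f77b4\n"
    "SR proteins,#ff7f0e\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
color_mapping = {row["Class / family"]: row["Color Hex"] for row in reader}

# Look up the color for a class, falling back to grey if the class is missing.
print(color_mapping.get("Sm proteins", "#808080"))
print(color_mapping.get("Unknown class", "#808080"))
```

The class names must match the POI file exactly, including spacing and capitalization, for the lookup to succeed.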

## Configurable Parameters
The following settings allow users to customize plot appearance and analysis behavior:

### Volcano Plot Settings
- **Plot Dimensions**:
  - `plot_width`: Width of the plot (default: 800).
  - `plot_height`: Height of the plot (default: 600).
- **Thresholds**:
  - `show_threshold_lines`: Show or hide threshold lines (default: `True`).
  - `log2_fc_positive`: Positive log2 fold-change threshold (default: `2`).
  - `log2_fc_negative`: Negative log2 fold-change threshold (default: `-2`).
  - `p_value_threshold`: P-value threshold (default: `0.05`).
- **Dot Appearance**:
  - `unlabeled_dot_size`: Dot size for unlabeled proteins (default: `5`).
  - `labeled_dot_size`: Dot size for labeled proteins (default: `10`).
  - `unlabeled_opacity`: Opacity for unlabeled protein dots (default: `0.5`).
  - `labeled_opacity`: Opacity for labeled protein dots (default: `1`).
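To illustrate how the three threshold settings interact, the sketch below sorts proteins into "up", "down", and "ns" (not significant). `classify` is a hypothetical helper, not part of the notebook's utilities, and treating values exactly on a fold-change threshold as significant is an assumption.

```python
log2_fc_positive = 2
log2_fc_negative = -2
p_value_threshold = 0.05


def classify(log2_fc, p_value):
    """Return 'up', 'down', or 'ns' for a single protein."""
    if p_value >= p_value_threshold:
        return "ns"  # fails the p-value cut regardless of fold change
    if log2_fc >= log2_fc_positive:
        return "up"
    if log2_fc <= log2_fc_negative:
        return "down"
    return "ns"  # significant p-value but small fold change


print(classify(3.1, 0.001))
print(classify(-2.5, 0.01))
print(classify(3.1, 0.2))
```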

### Directory for Outputs
- **Output Directory**: `output_directory` (default: `./plots`)

>Note: In the code block below, Windows paths must be given as absolute paths in raw-string form (note the `r` prefix), like this: `path = r"/path/to/your/file"`
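The short example below shows why the `r` prefix matters for Windows paths: without it, Python interprets sequences such as `\t` and `\n` as escape characters. The paths and the server name are placeholders.

```python
from pathlib import PureWindowsPath

# Without the r prefix, "\t" becomes a tab and "\n" a newline.
plain = "C:\temp\new.txt"
raw = r"C:\temp\new.txt"

print("\t" in plain)  # the path has been silently corrupted
print("\t" in raw)    # the raw string keeps the backslashes

# Raw strings also work for network (UNC) paths; "server" and "share" are placeholders.
unc = PureWindowsPath(r"\\server\share\folder\file.txt")
print(PureWindowsPath(raw).name, unc.name)
```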

In [1]:

# Paths for your data and annotation files
ms_data_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\spliceosome\volcano_20230630_E3_RSLC6_Vorlaender_Plaschka_IMP_ID1242_onbead_dig_20per_6x_step2_normsum_limma_Proteins.txt"  # Mass spec data from the facility as a .txt file
poi_file_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\protein_annotations\POIs_c_elegens_spliceosome_database_entries.xlsx"  # POI file containing the UniProt accessions of your proteins of interest and a "Class / family" column used to identify groups of proteins
color_code_file_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\protein_annotations\COLORS_PolII_Spliceosome_250225_MV.csv"  # Color code file with the same "Class / family" column as the POI file and a column labelled "Color" containing the color codes for the proteins of interest

# User-configurable settings for VOLCANO PLOT appearance
plot_width = 800  # Width of the plot
plot_height = 600  # Height of the plot

# Threshold line settings
show_threshold_lines = True  # Set to False to hide threshold lines
log2_fc_positive = 4  # Position of the positive threshold line
log2_fc_negative = -2  # Position of the negative threshold line
p_value_threshold = 0.05  # P-value threshold

# Dot size and opacity settings
unlabeled_dot_size = 5  # Size of dots for unlabeled proteins
labeled_dot_size = 10  # Size of dots for labeled proteins
unlabeled_opacity = 0.5  # Opacity of dots for unlabeled proteins
labeled_opacity = 1  # Opacity of dots for labeled proteins


# Directory in which outputs will be saved
import sys
from pathlib import Path

output_directory = Path(r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\MV_ce_ILS_release")

#set the prefix of the data columns that contain protein abundance (NOT RECOMMENDED TO CHANGE)
prefix = "apQuant Area"

# Initialize and load data
sys.path.append(r"\\storage.imp.ac.at\groups\plaschka\shared\software\scripts\python\ms_analysis_triplicates_src")
from ms_analysis_triplicates_utils import DataLoader

data_loader = DataLoader(ms_data_path, poi_file_path, color_code_file_path, prefix)
merged_data = data_loader.load_and_process_data(debug=False)  # This will set color_mapping_dict


Number of proteins in the mass spec file: 768
Number of POIs found in the mass spec file: 81
Number of proteins in the POI file: 572
Number of Classes / families in the POI file: 18
Number of Classes / families found in the mass spec file: 18


## Volcano plot
The following cell sets up an interactive interface for generating volcano plots. Dropdown menus allow you to select the columns for **Log2 Fold Change** (`log2 FC`) and **p-values**.
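In the conventional volcano plot layout, each protein is placed at x = log2 fold change and y = -log10(p-value), so strongly changed and highly significant proteins end up in the upper corners. The sketch below computes these coordinates for three made-up proteins; whether `VolcanoPlotManager` applies exactly this transform internally is an assumption.

```python
import math

# Hypothetical (log2 FC, p-value) pairs for three proteins.
proteins = {
    "P12345": (3.0, 0.001),
    "Q67890": (-0.2, 0.6),
    "O11111": (-2.5, 0.01),
}

# Conventional volcano coordinates: x = log2 FC, y = -log10(p-value).
points = {acc: (fc, -math.log10(p)) for acc, (fc, p) in proteins.items()}

for acc, (x, y) in points.items():
    print(f"{acc}: x={x:+.2f}, y={y:.2f}")
```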

#### Adjusting Keywords for Column Filtering ####

If the drop-down menus for selecting **Log2 FC Columns** or **p-value Columns** do not show the expected options, look for these lines in the code cell below and modify the filtering keywords:

1. **Understand the Filtering Logic**:

   The variables below define the keywords used to identify relevant columns in the `merged_data` DataFrame.
   Files from the MS facility don't always use exactly the same column names, so you might need to adjust the keywords, but try the defaults first:
   ```python
   log2_fc_keywords = ['log2']  # Keywords for identifying Log2 FC columns
   pval_keywords = ['pval', 'p-value']  # Keywords for identifying p-value columns
   required_keyword = 'imputed'  # Common keyword required in all selected columns
   ```




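To see which columns a given set of keywords would select, the filter can be tried on a plain list of column names first. The names below are hypothetical, so replace them with `list(merged_data.columns)` to test real data. `matching` is an illustrative helper, not part of the notebook's utilities.

```python
log2_fc_keywords = ["log2"]
pval_keywords = ["pval", "p-value"]
required_keyword = "imputed"

# Hypothetical column names; real facility exports may be named differently.
columns = [
    "Accession",
    "log2FC imputed IP vs Ctrl",
    "pval imputed IP vs Ctrl",
    "log2FC raw IP vs Ctrl",
]


def matching(columns, keywords, required):
    """Columns containing the required keyword and at least one of the keywords."""
    return [
        col for col in columns
        if required in col.lower() and any(k in col.lower() for k in keywords)
    ]


log2_fc_cols = matching(columns, log2_fc_keywords, required_keyword)
pval_cols = matching(columns, pval_keywords, required_keyword)
print(log2_fc_cols)
print(pval_cols)
```

Column names are lowercased before comparison, so the keywords themselves must be written in lowercase.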
In [2]:
from pathlib import Path
import sys
import pandas as pd
from ipywidgets import Dropdown, VBox, Output, Button, Layout
from IPython.display import display
import plotly.graph_objects as go  # Added import

# Define keywords for identifying the columns for plotting
log2_fc_keywords = ['log2']  # Keywords for log2 FC columns
pval_keywords = ['pval', 'p-value']  # Keywords for p-value columns
required_keyword = 'imputed'  # Common required keyword for both


from ms_analysis_triplicates_utils_dev import VolcanoPlotManager, InteractiveFieldSelector

# 1) Create the volcano plot manager with the settings from the first cell
plot_manager = VolcanoPlotManager(
    plot_width=plot_width,
    plot_height=plot_height,
    show_threshold_lines=show_threshold_lines,
    log2_fc_positive=log2_fc_positive,
    log2_fc_negative=log2_fc_negative,
    p_value_threshold=p_value_threshold,
    unlabeled_dot_size=unlabeled_dot_size,
    labeled_dot_size=labeled_dot_size,
    unlabeled_opacity=unlabeled_opacity,
    labeled_opacity=labeled_opacity,
    color_mapping_dict=data_loader.color_mapping_dict,
    output_directory=output_directory,
    debug=False
)



# 2) Use InteractiveFieldSelector for hover fields
field_selector = InteractiveFieldSelector(
    data=merged_data,
    default_hover_fields=["Accession", "Description"]
)
plot_manager.set_field_selector(field_selector)

# 3) Filter columns using keywords
log2_fc_cols = [
    col for col in merged_data.columns 
    if required_keyword in col.lower() and any(k in col.lower() for k in log2_fc_keywords)
]
pval_cols = [
    col for col in merged_data.columns 
    if required_keyword in col.lower() and any(k in col.lower() for k in pval_keywords)
]

# 4) Adjust dropdown widgets
fc_dropdown = Dropdown(
    options=log2_fc_cols,
    description="Log2 FC Column:",
    layout=Layout(width="400px"),
    style={'description_width': '200px'}
)
pval_dropdown = Dropdown(
    options=pval_cols,
    description="p-Value Column:",
    layout=Layout(width="400px"),
    style={'description_width': '200px'}
)

plot_output = Output()

# 5) Button to generate the volcano plot
def generate_plot(_):
    with plot_output:
        plot_output.clear_output()
        fc_col, pv_col = fc_dropdown.value, pval_dropdown.value
        if fc_col and pv_col:
            plot_manager.result = merged_data
            fig = plot_manager.create_volcano_plot(
                data=merged_data,
                log2fc_col=fc_col,
                pval_col=pv_col,
                title=f"Volcano Plot: {fc_col} vs {pv_col}",
                hover_fields=field_selector.get_selected_hover_fields()
            )
            plot_manager.current_figure = fig
            # plot_manager._update_layer_checkboxes()
            plot_manager._update_static_label_dropdown()
            print("Plot generated successfully!")

generate_button = Button(description="Generate Plot", button_style="success")
generate_button.on_click(generate_plot)

# 6) Simple UI for generating plot
ui = VBox([
    fc_dropdown,
    pval_dropdown,
    field_selector.hover_box,
    generate_button,
    plot_output
])
display(ui)

# 7) Finally, display the advanced manager interface (which includes the search buttons)
plot_manager.display_interface()


VBox(children=(Dropdown(description='Log2 FC Column:', layout=Layout(width='400px'), options=('log2Fold change…

VBox(children=(Output(), VBox(children=(VBox(children=(HBox(children=(Text(value='volcano_plot', description='…

## Heatmap plots

The next cell initializes the `HeatmapManager` and sets up an interactive interface for generating heatmaps. Here's what happens in this cell:


1. **Interactive Field Selector**:
   - The `InteractiveFieldSelector` is used to select additional fields for hover text in the heatmaps (e.g., `Accession` and `Description`).

2. **Displaying the Interface**:
   - The interface allows you to:
     - Select coloring methods for the heatmaps (`auto` or `manual`).
     - Choose specific classes, label fields, and log2 fold-change columns for plotting.
     - Generate heatmaps and save them interactively.
     - One plot is generated for each selected Class / family, with the option to save it as HTML and SVG.

Run this cell to activate the interface and start customizing your heatmap generation.
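The "one plot per class" idea can be sketched as splitting the merged table by `Class / family` and keeping the selected log2 fold-change columns as a protein-by-comparison matrix. The data and column names below are made up, and `HeatmapManager` may organize its data differently.

```python
import pandas as pd

# Hypothetical merged table: one row per protein, with its class and two log2 FC columns.
merged = pd.DataFrame({
    "Accession": ["P1", "P2", "P3"],
    "Class / family": ["Sm proteins", "Sm proteins", "SR proteins"],
    "log2FC A vs Ctrl": [1.5, 2.0, -0.5],
    "log2FC B vs Ctrl": [0.2, 1.1, -1.0],
})
fc_cols = ["log2FC A vs Ctrl", "log2FC B vs Ctrl"]

# One matrix per class: rows are proteins, columns are the selected log2 FC columns.
matrices = {
    cls: group.set_index("Accession")[fc_cols]
    for cls, group in merged.groupby("Class / family")
}

print(matrices["Sm proteins"])
```

Each matrix in `matrices` holds the values that a single per-class heatmap would display.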


In [3]:
# Initialize HeatmapManager
from ms_analysis_triplicates_utils import (
    HeatmapManager,
    InteractiveFieldSelector
)
heatmap_manager = HeatmapManager(result=merged_data, output_directory=output_directory / "heatmaps" )  # Use pathlib for path joining

# Use InteractiveFieldSelector for hover fields
field_selector = InteractiveFieldSelector(
    data=merged_data,
    default_hover_fields=["Accession", "Description"]
)
# Set the field selector
heatmap_manager.set_field_selector(field_selector)

# Display the interface
heatmap_manager.display_interface()


VBox(children=(VBox(children=(Dropdown(description='Heatmap coloring method:', layout=Layout(width='450px'), o…