# **Mass Spectrometry Analysis Notebook**

### **Overview**
The idea behind this notebook is that you provide groups of proteins you are interested in (for example, all subunits of RNA Polymerase II would be one protein class or family) and visualise their absolute and relative abundance in your samples. Some of the plots are searchable, and hits (e.g. the top enriched proteins) can be quickly exported to CSV files.

### **Outputs** 

This notebook creates interactive plots from your data:

**Overview** plots: 
   - Visualise the overall composition of each sample by summing up all proteins belonging to your classes of interest
   - Ranked abundance plots: For each sample, protein abundances are sorted from highest to lowest and proteins of interest are highlighted (either using color code from your POI file, or using a search function)

**Log2-fold change plots**:
   - **Scatter plot** of Log2 fold changes between selected samples for your proteins of interest (POI)
   - **Violin plots** for your Protein classes, showing *individual* proteins
   - **Heatmaps**  for Log2 fold changes, showing changes *averaged* over all proteins belonging to a Class/ family

### **Required Files**

In this section, you will specify the file paths required for processing mass spectrometry data. These paths are crucial for the script to load and analyze the data correctly. Below is a list of files you need to provide and instructions for optional files.

 1.  **Mass Spectrometry Data File**
This is the Excel file obtained from the facility, which includes protein quantifications and associated metadata.  
Because the merged header format of this Excel file is hard to parse, you need to specify a few values:
      -   The column letters of the columns that contain the UniProt accession values (e.g. 'E'), the protein descriptions (e.g. 'G'), and the "Genes" column, which is used for labelling data points in plots (e.g. 'H')
      -   The label of the columns that contain the protein quantifications. They are usually called 'Area (Norm.)' or 'Norm. Area', but if yours are named differently, add the column label to the list of strings

 2.  **Protein of Interest (POI) Data File**
      - **Purpose**: This Excel file lists your proteins of interest together with their annotation
      - **Format**: The file needs to contain a column called "Class / family" and a column called "Other UniProt Accessions". The latter may contain multiple comma-separated UniProt accession IDs
 3. **Color Code File**
      - **Purpose**: This CSV file contains a mapping between protein classes/families and their respective color codes, which are used to visually distinguish these classes in plots.
      - **Format**: The file needs to contain a column called "Class / family" and a color column that contains the desired color as a **HEX code**
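
As a rough sketch of what these two formats imply, the snippet below parses a POI-style table and a colour table using only the standard library. The sample rows, the `color` column name, the six-digit HEX rule, and the helpers `split_accessions` and `is_hex_color` are illustrative assumptions and not part of the notebook's utility module. The real POI file is an Excel file; inline CSV text is used here only to keep the example self-contained.

```python
import csv
import io
import re

# Hypothetical POI rows (the class names appear in this notebook, the layout is assumed)
poi_csv = io.StringIO(
    'Class / family,Other UniProt Accessions\n'
    'Pol II Core,"P24928, P30876"\n'
    'PRP19 related,Q9UMS4\n'
)

def split_accessions(cell):
    """Split a comma-separated accession cell into a clean list."""
    return [acc.strip() for acc in cell.split(",") if acc.strip()]

# Map every accession to the class it was annotated with
accession_to_class = {}
for row in csv.DictReader(poi_csv):
    for acc in split_accessions(row["Other UniProt Accessions"]):
        accession_to_class[acc] = row["Class / family"]

# Assumption: colours are given as '#' followed by six hexadecimal digits
HEX_PATTERN = re.compile(r"^#[0-9A-Fa-f]{6}$")

def is_hex_color(value):
    """Return True if the value looks like a six-digit HEX colour code."""
    return bool(HEX_PATTERN.match(value.strip()))

# Hypothetical colour table with one deliberately invalid entry
color_csv = io.StringIO(
    "Class / family,color\n"
    "Pol II Core,#1F77B4\n"
    "PRP19 related,red\n"
)
invalid = [
    row["Class / family"]
    for row in csv.DictReader(color_csv)
    if not is_hex_color(row["color"])
]

print(accession_to_class)
print("Classes with invalid colours:", invalid)
```

A check like this can catch a misspelled column name or a colour given as a word instead of a HEX code before any plots are drawn.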


>Note: In the code block below, Windows paths must be given as absolute paths in raw strings (note the `r` prefix), like this: `path = r"/path/to/your/file"`
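
To illustrate why the `r` prefix matters for Windows paths, here is a small sketch using `pathlib.PureWindowsPath`, which parses Windows-style paths on any operating system. The server and folder names are placeholders, not real locations.

```python
from pathlib import PureWindowsPath

# Raw string: every backslash is kept literally
raw_path = r"\\server\share\new_data\results.xlsx"

# Without the r prefix, "\n" becomes a newline and "\r" a carriage return,
# so the path silently turns into something else
broken_path = "\\\\server\\share\new_data\results.xlsx"

path = PureWindowsPath(raw_path)
print(path.name)             # file name component
print(path.is_absolute())    # UNC paths count as absolute
print("\n" in broken_path)   # the non-raw string contains a newline
```

Raw strings avoid this class of error without having to double every backslash.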

In [None]:
# BLOCK 1: User Input Files

# Path to the mass spectrometry data file
mass_spec_file_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\nascent_transcription_complexes\20250207_E3_NEO7_Vorlaender_Plaschka_IMP_ID1952_coom_gel_dig_HLB_desalted_40per_9x_direct_nombr_nonorm.xlsm"

#################### Mass spec data file parameters ####################

# Specify the column letters of the 'Accession', 'Genes', and 'Description' columns in the Excel file
accession_col_letter = 'E'  # Column containing the UniProt accessions
genes_col_letter = 'H'  # Column containing the gene names
description_col_letter = 'G'  # Column containing the protein descriptions

# Text strings used to identify the abundance columns.
# The defaults should be fine; if your quantification column has a different label, add it to this list.
abundance_column_label = ['Area (Norm.)', 'Norm. Area']

#################### Paths to the POI and color code files ####################

poi_file_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\protein_annotations\POIs_PolII_Spliceosome_250225_MV.xlsx"
color_code_file_path = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\protein_annotations\COLORS_PolII_Spliceosome_250225_MV.csv"

# Set this to the folder where results should be saved. If left empty, a default directory is created in the same location as the mass spec file.
# Note that you need write permissions in the folder as plaschka.lab, so it must be a shared location.
output_directory_input = r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec\MS_analysis\YOUR_FOLDER"
#output_directory_input = r"\\storage.imp.ac.at\groups\plaschka\shared\other\file_sharing\your_folder"


## Pre-process data, merge with annotations, and rename samples for plotting

The next code block:
- Loads your MS Excel file and merges it with your POI annotation files.
- After the data is loaded, you can rename the samples to shorter names used for plotting. Enter the new names, click the 'Save Sample Names' button, and then the 'Apply New Names & Rebuild Data' button.
- The information is saved in a JSON file for later reuse.
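
The save-and-reuse step can be pictured as a simple JSON round trip. The sketch below is an assumption about the general idea only: the mapping layout, the file name `sample_names.json`, and the sample labels are hypothetical and may differ from what `MassSpecPreprocessing` actually writes.

```python
import json
import os
import tempfile

# Hypothetical mapping from original column labels to short plot names
sample_names = {
    "Area (Norm.) Sample_01_long_label": "WT",
    "Area (Norm.) Sample_02_long_label": "Mutant",
}

# Write to a temporary folder so the example does not touch real data
json_path = os.path.join(tempfile.mkdtemp(), "sample_names.json")

with open(json_path, "w", encoding="utf-8") as handle:
    json.dump(sample_names, handle, indent=2)

# Later session: reload the mapping instead of typing the names again
with open(json_path, encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored == sample_names)
```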

In [23]:
import sys
sys.path.append(r"\\storage.imp.ac.at\groups\plaschka\shared\software\scripts\python\ms_analysis_singlicates_src")
from utils_mass_spec_analysis_singlicates import MassSpecPreprocessing

required_path=r"\\storage.imp.ac.at\groups\plaschka\shared\data\mass-spec"

# Instantiate the preprocessing class with your parameters
preprocessor = MassSpecPreprocessing(
    mass_spec_file_path=mass_spec_file_path,
    poi_file_path=poi_file_path,
    color_code_file_path=color_code_file_path,
    abundance_column_label=abundance_column_label,
    genes_col_letter=genes_col_letter,
    accession_col_letter=accession_col_letter,
    description_col_letter=description_col_letter,
    required_path=required_path
)

# Run the preprocessing workflow (this will print a reminder if the file is not in the correct folder)
preprocessor.run_preprocessing()
if not mass_spec_file_path.startswith(required_path):
    print("YOUR MASS SPEC DATA IS NOT IN THE SHARED FOLDER! COPY IT THERE AND UPDATE THE EXCEL FILE /plaschka/shared/data/mass-spec/Overview_of_Experiments.xlsx")

print("DON'T FORGET TO HIT THE 'APPLY NEW NAMES' BUTTON!")

Loading data, please be patient...

Preview of the specified columns:
  Unnamed: 7_level_0   Unnamed: 4_level_0  \
  Unnamed: 7_level_1   Unnamed: 4_level_1   
0                NaN                MPBAG   
1               Igkc        cont_P01837.1   
2                NaN        cont_P01864.1   
3                NaN          cont_P00761   
4               KRT1  cont_P04264; P04264   
5              KRT10  cont_P13645; P13645   
6               KRT9  cont_P35527; P35527   
7               KRT2  cont_P35908; P35908   
8           IGHV3-48               P01763   
9             PRPF19               Q9UMS4   

                                  Unnamed: 6_level_0  
                                  Unnamed: 6_level_1  
0                              MBP-ProteinA-ProteinG  
1                            Ig kappa chain C region  
2           Ig gamma-2A chain C region secreted form  
3                    Trypsin OS=Sus scrofa PE=1 SV=1  
4  Keratin, type II cytoskeletal 1 OS=Homo sapien...  
5  K

VBox(children=(Label(value="Edit sample names below or remove common prefix, then click 'Save Sample Names':")…

Button(button_style='success', description='Apply New Names & Rebuild Data', style=ButtonStyle())

Output()

Widget-based renaming logic initialized. Please rename samples (if needed),
then click 'Save Sample Names', and finally 'Apply New Names & Rebuild Data'.

DON'T FORGET TO HIT THE 'APPLY NEW NAMES' BUTTON!
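
The `Unnamed: 7_level_0` and `Unnamed: 4_level_0` labels in the preview are consistent with zero-based column positions: column `H` is position 7 and column `E` is position 4. A minimal sketch of that letter-to-index conversion follows. `column_letter_to_index` is a hypothetical helper written for illustration, and the utility module may resolve the letters differently.

```python
def column_letter_to_index(letter):
    """Convert an Excel column letter ('A', 'H', 'AA', ...) to a zero-based index."""
    cleaned = letter.strip().upper()
    if not cleaned:
        raise ValueError("Column letter must not be empty")
    index = 0
    for char in cleaned:
        if not "A" <= char <= "Z":
            raise ValueError(f"Invalid column letter: {letter!r}")
        # Treat the letters as base-26 digits where A=1 ... Z=26
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1

for letter in ("E", "G", "H", "AA"):
    print(letter, column_letter_to_index(letter))
```

If the preview shows unexpected columns, comparing the `Unnamed: N` labels with the letters entered in the first code block is a quick way to spot an off-by-one column.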


## Overall sample composition

The next code block is optional. When executed, it sums up all proteins belonging to each class to estimate the overall sample composition based on your MS results.
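
As a toy illustration of what summing per class means, the sketch below adds up abundances by class for a single sample and converts them to fractions. The abundance values are invented, and `plot_overall_sample_composition` may compute or display the composition differently.

```python
from collections import defaultdict

# Hypothetical (class, abundance) pairs for one sample
proteins = [
    ("Pol II Core", 4.0e6),
    ("Pol II Core", 1.0e6),
    ("PRP19 related", 3.0e6),
    ("Non-POI", 2.0e6),
]

# Sum the abundance of all proteins that share a class
class_totals = defaultdict(float)
for protein_class, abundance in proteins:
    class_totals[protein_class] += abundance

# Express each class as a fraction of the total signal in the sample
grand_total = sum(class_totals.values())
composition = {name: total / grand_total for name, total in class_totals.items()}

for name, fraction in composition.items():
    print(f"{name}: {fraction:.0%}")
```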

In [None]:
# BLOCK 3 - Overall Sample Composition and Interactive Class/Condition Selection

# 1. Import the utility functions:
from utils_mass_spec_analysis_singlicates import plot_overall_sample_composition

# 2. First cell: Generate the overall sample abundance plots:
plot_overall_sample_composition(preprocessor.merged_data, preprocessor.abundance_cols, preprocessor.color_map)



TypeError: 'NoneType' object is not subscriptable

## Select Classes / proteins of interest and Conditions / MS samples to analyze
In the next section, select one or more Classes and Conditions to analyze, and click the button to confirm.

**Note**: You can Control+Click (Windows) or Command+Click (Mac) to select multiple items!

In [None]:
# 3. Second cell: Interactive selection of classes and conditions:
from utils_mass_spec_analysis_singlicates import display_class_condition_selection_widgets

selection_results = display_class_condition_selection_widgets(preprocessor.merged_data, preprocessor.abundance_cols)


SelectMultiple(description='Select Class(es)', layout=Layout(width='50%'), options=('Non-POI (241 proteins)', …

SelectMultiple(description='Select Conditions', layout=Layout(width='50%'), options=('1', '2', '3', '4', '5', …

Button(description='Add classes/conditions to analysis', style=ButtonStyle())

Output()

## Select fields to display upon mouse hovering

In the next block, you can configure which info you see when you hover the mouse over a data point in the subsequent plots 
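
Judging from the hover strings that appear in the plot outputs of this notebook (for example `cont_P00761<br>Trypsin OS=Sus scrofa PE=1 SV=1`), the selected fields seem to be joined with an HTML line break. The sketch below shows that idea; `build_hover_text` and the example row are hypothetical and not the module's actual code.

```python
def build_hover_text(row, hover_fields):
    """Join the selected fields of one protein into a Plotly-style hover string."""
    # Fields that are missing from the row are skipped rather than raising an error
    return "<br>".join(str(row[field]) for field in hover_fields if field in row)

row = {
    "Accession": "Q9UMS4",
    "Genes": "PRPF19",
    "Description": "Example description",
}

print(build_hover_text(row, ["Accession", "Description"]))
print(build_hover_text(row, ["Genes", "Accession", "Description"]))
```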

In [None]:
from utils_mass_spec_analysis_singlicates import InteractiveFieldSelector

# Ensure preprocessor.merged_data is built
field_selector = InteractiveFieldSelector(
    data=preprocessor.merged_data,
    default_hover_fields=["Accession", "Description"],
    exclude_fields=preprocessor.abundance_cols  # Exclude abundance columns
)

# Display only the hover configuration (or the full interface)
field_selector.display_hover_only()



VBox(children=(Label(value='Select fields to include in hover text:'), VBox(children=(Checkbox(value=True, des…

## Ranked abundance plots

The next plot displays the ranked abundance of your POIs.

**Notes**
- You can search for proteins using the search box. All proteins that partially match the text string in either the **UniProt ID** or the **gene name** will be highlighted.
- You can toggle which proteins are highlighted by clicking on the legend, and use the buttons below the plot to configure whether they are included in the saved files.
- You can also save all proteins above a certain cutoff to a CSV file.
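
The ranking, partial-match search, and cutoff export described in these notes can be sketched in a few lines. The accessions and gene names below occur in this notebook's outputs, but the abundance values, the `>=` cutoff rule, and the helpers `search` and `export_above_cutoff` are assumptions made for illustration.

```python
import csv
import io

# (accession, gene, abundance) records for one sample; abundances are invented
records = [
    ("Q9UMS4", "PRPF19", 8.0e6),
    ("P04264", "KRT1", 2.5e7),
    ("P13645", "KRT10", 1.2e7),
    ("P46776", "RPL27A", 3.0e5),
]

# Rank from highest to lowest abundance (first entry = most abundant)
ranked = sorted(records, key=lambda rec: rec[2], reverse=True)

def search(records, query):
    """Case-insensitive partial match against accession or gene name."""
    query = query.lower()
    return [rec for rec in records if query in rec[0].lower() or query in rec[1].lower()]

def export_above_cutoff(records, cutoff, handle):
    """Write all records with abundance >= cutoff as CSV rows and return them."""
    writer = csv.writer(handle)
    writer.writerow(["Accession", "Gene", "Abundance"])
    hits = [rec for rec in records if rec[2] >= cutoff]
    writer.writerows(hits)
    return hits

highlighted = search(ranked, "krt")
buffer = io.StringIO()  # stands in for a real CSV file
hits = export_above_cutoff(ranked, 1.0e7, buffer)

print([rec[1] for rec in ranked])
print([rec[1] for rec in highlighted])
print(buffer.getvalue())
```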

In [None]:
from utils_mass_spec_analysis_singlicates import run_ranked_abundance_analysis

# Retrieve the hover fields from the field selector
hover_fields = field_selector.get_selected_hover_fields()

# Call run_ranked_abundance_analysis as before, now including hover_fields.
run_ranked_abundance_analysis(
    merged_data=preprocessor.merged_data,
    conditions_to_plot=selection_results["conditions_to_plot"],
    classes_to_plot=selection_results["classes_to_plot"],
    color_map=preprocessor.color_map,
    mass_spec_file_path=mass_spec_file_path,
    output_directory_input=output_directory_input,  # or "" for default
    hover_fields=hover_fields
)


FigureWidget({
    'data': [{'hoverinfo': 'text',
              'hovertext': array(['cont_P00761<br>Trypsin OS=Sus scrofa PE=1 SV=1',
                                  'cont_P04264; P04264<br>Keratin, type II cytoskeletal 1 OS=Homo sapiens GN=KRT1 PE=1 SV=6<|>Keratin, type II cytoskeletal 1 OS=Homo sapiens OX=9606 GN=KRT1 PE=1 SV=6',
                                  'cont_P13645; P13645<br>Keratin, type I cytoskeletal 10 OS=Homo sapiens GN=KRT10 PE=1 SV=6<|>Keratin, type I cytoskeletal 10 OS=Homo sapiens OX=9606 GN=KRT10 PE=1 SV=6',
                                  ...,
                                  'A0A0B4J1V6<br>Immunoglobulin heavy variable 3-73 OS=Homo sapiens OX=9606 GN=IGHV3-73 PE=1 SV=1',
                                  'P46776<br>Large ribosomal subunit protein uL15 OS=Homo sapiens OX=9606 GN=RPL27A PE=1 SV=2',
                                  'Q9Y388<br>RNA-binding motif protein, X-linked 2 OS=Homo sapiens OX=9606 GN=RBMX2 PE=1 SV=2'],
                                

Label(value='Click checkboxes below to include/exclude those traces in the saved figure:')

VBox(children=(HBox(children=(Button(description='Toggle All Traces', style=ButtonStyle()),)), HBox(children=(…



All done!


## Protein abundance changes between conditions
**Notes:** 
- You have the option to normalise the abundance of proteins in the selected conditions/samples to a reference protein, specified by its UniProt accession. To enable this, edit the first two lines of the code block.
- You can export hits that are enriched above or below a certain threshold using the Export Hits button. A negative value exports all hits below that fold change value; a positive value exports all hits above that fold change value.
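
To make the threshold rule concrete, here is a minimal sketch of a log2 fold change calculation and a signed-threshold hit selection. The direction of the ratio (first condition over second), the handling of zero values, the inclusive comparison, and the helper names are all assumptions; `prepare_log2_fold_changes` and the Export Hits button may behave differently in these details.

```python
import math

# Hypothetical abundances for two samples, keyed by a placeholder accession
cond1 = {"A": 800.0, "B": 100.0, "C": 50.0, "D": 0.0}
cond2 = {"A": 100.0, "B": 100.0, "C": 400.0, "D": 20.0}

def log2_fold_change(a, b):
    """Return log2(a / b), or None when either value is missing or not positive."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None
    return math.log2(a / b)

log2_fc = {acc: log2_fold_change(cond1[acc], cond2[acc]) for acc in cond1}

def select_hits(log2_fc, threshold):
    """Positive threshold keeps values at or above it, negative keeps values at or below it."""
    valid = {acc: fc for acc, fc in log2_fc.items() if fc is not None}
    if threshold >= 0:
        return sorted(acc for acc, fc in valid.items() if fc >= threshold)
    return sorted(acc for acc, fc in valid.items() if fc <= threshold)

print(log2_fc)
print("Enriched:", select_hits(log2_fc, 1.0))
print("Depleted:", select_hits(log2_fc, -1.0))
```

Note that a log2 fold change of 1.0 corresponds to a two-fold difference, so a threshold of 1.0 selects proteins that are at least twice as abundant in one condition.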

In [None]:
# BLOCK 6: Reference Protein Input (Notebook cell)
plot_normalized_data = False  # Set True to plot normalized data, False for raw data
reference_protein_accession = 'Q9UMS4'  # Set to None if you don't want normalization


from utils_mass_spec_analysis_singlicates import (
    normalize_data,
    prepare_log2_fold_changes,
    create_log2_fc_figure_widget,
    show_log2_fc_plot,
    set_default_output_directory
)
from itertools import combinations
from datetime import datetime
import os


# Retrieve hover fields selected by the user:
hover_fields = field_selector.get_selected_hover_fields()
conditions_to_plot = selection_results["conditions_to_plot"]
if plot_normalized_data and reference_protein_accession:
    preprocessor.merged_data = normalize_data(preprocessor.merged_data, reference_protein_accession, conditions_to_plot)

preprocessor.merged_data = prepare_log2_fold_changes(
    preprocessor.merged_data,
    conditions_to_plot,
    reference_protein=reference_protein_accession,
    use_normalized=plot_normalized_data
)

output_directory = output_directory_input or set_default_output_directory(mass_spec_file_path)
os.makedirs(output_directory, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

for cond1, cond2 in combinations(conditions_to_plot, 2):
    fig_widget, sorted_data, col_name = create_log2_fc_figure_widget(
        data=preprocessor.merged_data,
        cond1=cond1,
        cond2=cond2,
        classes_to_highlight=selection_results["classes_to_plot"],
        color_map=preprocessor.color_map,
        reference_protein=reference_protein_accession,
        use_normalized=plot_normalized_data,
        add_labels_for_pdf=False,
        hover_fields=hover_fields   # Pass the selected hover fields here
    )
    show_log2_fc_plot(
        fig_widget=fig_widget,
        sorted_data=sorted_data,
        col_name=col_name,
        cond1=cond1,
        cond2=cond2,
        output_dir=output_directory,
        timestamp=timestamp,
        default_threshold=1.0
    )

print("\nAll log2-FC plots displayed with interactive UI.\nSearch, highlight, clear, toggle, save, or export hits as desired!")


FigureWidget({
    'data': [{'hoverinfo': 'text',
              'hovertext': array(['O75223<br>Gamma-glutamylcyclotransferase OS=Homo sapiens OX=9606 GN=GGCT PE=1 SV=1',
                                  'P52272<br>Heterogeneous nuclear ribonucleoprotein M OS=Homo sapiens OX=9606 GN=HNRNPM PE=1 SV=3',
                                  'P62316<br>Small nuclear ribonucleoprotein Sm D2 OS=Homo sapiens OX=9606 GN=SNRPD2 PE=1 SV=1',
                                  ...,
                                  'A0A0B4J1V6<br>Immunoglobulin heavy variable 3-73 OS=Homo sapiens OX=9606 GN=IGHV3-73 PE=1 SV=1',
                                  'P46776<br>Large ribosomal subunit protein uL15 OS=Homo sapiens OX=9606 GN=RPL27A PE=1 SV=2',
                                  'Q9Y388<br>RNA-binding motif protein, X-linked 2 OS=Homo sapiens OX=9606 GN=RBMX2 PE=1 SV=2'],
                                 dtype=object),
              'marker': {'color': '#D3D3D3', 'opacity': 0.4, 'size': 6},
              'mode

Label(value='Toggle traces below for saving (PDF/HTML):')

VBox(children=(HBox(children=(Button(description='Toggle All Traces', style=ButtonStyle()),)), HBox(children=(…


All log2-FC plots displayed with interactive UI.
Search, highlight, clear, toggle, save, or export hits as desired!


## Violin plots of proteins belonging to a Class/Family for all conditions

The next block generates violin plots showing the individual proteins belonging to the selected Classes/Families and their abundance in the different samples.

**Note:** You can normalise the data to a reference protein as before!
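
Normalising to a reference protein can be understood as dividing every abundance in a sample by the abundance of the reference in that same sample. The sketch below shows this under that assumption; `normalize_to_reference` and the abundance values are hypothetical, and `ensure_normalization` in the utility module may handle missing values or scaling differently.

```python
def normalize_to_reference(abundances, reference_accession):
    """Divide every abundance in one sample by the reference protein's abundance."""
    reference_value = abundances.get(reference_accession)
    if not reference_value:
        # Covers both a missing reference and a reference measured as zero
        raise ValueError(
            f"Reference {reference_accession!r} is missing or zero in this sample"
        )
    return {acc: value / reference_value for acc, value in abundances.items()}

# Invented abundances for a single sample
sample = {"Q6P2Q9": 2.0e6, "Q9UMS4": 4.0e6, "P46776": 5.0e5}
normalized = normalize_to_reference(sample, "Q6P2Q9")

print(normalized)
```

After this step the reference protein has the value 1 in every sample, so differences in the other proteins are expressed relative to it.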

In [None]:
plot_normalized_data = True  # Set True to plot normalized data, False for raw data
reference_protein_accession = 'Q6P2Q9'  # Set to None if you don't want normalization


from utils_mass_spec_analysis_singlicates import (
    ensure_normalization,
    generate_color_palette,
    plot_class_abundance_plotly,
    show_violin_plot_with_save_button
)
from datetime import datetime
import os

# Retrieve hover fields from the field selector
hover_fields = field_selector.get_selected_hover_fields()

# Use selection_results to set conditions and classes:
conditions_to_plot = selection_results["conditions_to_plot"]
classes_to_plot = selection_results["classes_to_plot"]

# Ensure normalization if required
if plot_normalized_data and reference_protein_accession:
    normalized_data = ensure_normalization(preprocessor.merged_data, reference_protein_accession, conditions_to_plot)
else:
    normalized_data = preprocessor.merged_data

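# Create output directory for violin plots (fall back to the default location if no folder was given)
from utils_mass_spec_analysis_singlicates import set_default_output_directory
base_output_directory = output_directory_input or set_default_output_directory(mass_spec_file_path)
violin_plot_dir = os.path.join(base_output_directory, "violin_plots")
os.makedirs(violin_plot_dir, exist_ok=True)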

# Create a color map for the conditions (or use your preferred mapping)
color_map = generate_color_palette(conditions_to_plot)

# Timestamp for file names
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Loop over each class to plot
for class_to_plot in classes_to_plot:
    sanitized_class_name = class_to_plot.replace(' ', '_').replace('/', '_')
    fig = plot_class_abundance_plotly(
        class_to_plot,
        normalized_data,
        conditions_to_plot,
        color_map,
        reference_protein_accession=reference_protein_accession,
        plot_normalized_data=plot_normalized_data,
        label_genes=True,        # or False if you do not want gene labels
        hover_fields=hover_fields  # Use the hover fields from the InteractiveFieldSelector
    )
    show_violin_plot_with_save_button(fig, sanitized_class_name, violin_plot_dir, timestamp)

print("All figures displayed. Use the on-screen widgets to save plots as needed.")


Normalized 2 using Q6P2Q9.
Normalized 4 using Q6P2Q9.


Label(value='Select traces to include in the saved plot:')

HBox(children=(Text(value='', description='File Name Suffix:', layout=Layout(width='400px'), style=TextStyle(d…

Output()

Label(value='Select traces to include in the saved plot:')

HBox(children=(Text(value='', description='File Name Suffix:', layout=Layout(width='400px'), style=TextStyle(d…

Output()

All figures displayed. Use the on-screen widgets to save plots as needed.


## Heatmap visualisation of average log2 fold changes of Class/Family between conditions
The next block generates a heatmap of **averaged** log2 fold changes of the selected classes in the selected conditions. You can adjust the height of the heatmap squares using the parameters at the top of the code block.
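
A simple way to picture the averaging is to group per-protein log2 fold changes by class and take the mean of each group. The values below are invented, and whether `run_heatmap_analysis_with_widgets` averages the log2 values (as here) or the underlying ratios is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-protein log2 fold changes with their class annotation
protein_log2_fc = [
    ("Pol II Core", 0.5),
    ("Pol II Core", 0.3),
    ("PRP19 related", -2.0),
    ("PRP19 related", -3.0),
]

# Collect the values of all proteins belonging to the same class
grouped = defaultdict(list)
for protein_class, value in protein_log2_fc:
    grouped[protein_class].append(value)

# One averaged value per class, i.e. one heatmap cell per comparison
class_means = {name: mean(values) for name, values in grouped.items()}

print(class_means)
```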


In [None]:
from utils_mass_spec_analysis_singlicates import run_heatmap_analysis_with_widgets

# Set your plot layout parameters in the notebook:
BAR_HEIGHT = 30      # Vertical space per class
MARGIN = 200         # Extra margin for titles/labels
FIG_WIDTH = 750      # Fixed width for the figure

plot_normalized_data = False  # Set True to plot normalized data, False for raw data
reference_protein_accession = 'Q6P2Q9'  # Set to None if you don't want normalization


# Retrieve interactive selections from your previous cell:
conditions_to_plot = selection_results["conditions_to_plot"]
classes_to_plot = selection_results["classes_to_plot"]


# Call the heatmap analysis function with widget-based saving:
run_heatmap_analysis_with_widgets(
    merged_data=preprocessor.merged_data,
    conditions_to_plot=conditions_to_plot,
    classes_to_plot=classes_to_plot,
    reference_protein_accession=reference_protein_accession,
    plot_normalized_data=plot_normalized_data,
    mass_spec_file_path=mass_spec_file_path,
    output_directory_input=output_directory_input,  # or "" to use default
    bar_height_per_class=BAR_HEIGHT,
    additional_margin=MARGIN,
    fig_width=FIG_WIDTH
)


Normalization turned off or no reference provided; using non-normalized data.
Interactive saving for comparison: 2 vs 4


VBox(children=(FigureWidget({
    'data': [{'hoverinfo': 'text',
              'hovertext': [0.46, -2.23],
   …


Done! Files (if saved) will appear under: analysis\heatmap


## Z-score heatmap

The next heatmap calculates Z-scores for each Class/Family. A Z-score indicates how many standard deviations above or below the class's mean a particular abundance value lies. This transformation is applied row-wise, and the resulting standardized values are then used to generate the heatmap. This is useful when your experiment contains many conditions. The code also clusters the Classes/Families and conditions by similarity!

**Note**: You can turn off clustering by setting `perform_clustering=False` in the code block!
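
The row-wise transformation can be written as z = (x - mean) / standard deviation, computed separately for each class across its conditions. The sketch below uses the population standard deviation (`statistics.pstdev`) and returns zeros for constant rows; both choices, as well as the example values, are assumptions and may not match the utility module.

```python
from statistics import mean, pstdev

def row_z_scores(values):
    """Standardise one row: (value - row mean) / row standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A class with identical values in all conditions carries no contrast
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]

# Hypothetical class-level abundances across three conditions
class_abundance = {
    "Pol II Core": [2.0, 4.0, 6.0],
    "PRP19 related": [5.0, 5.0, 5.0],
}

z_scores = {name: row_z_scores(values) for name, values in class_abundance.items()}

for name, scores in z_scores.items():
    print(name, [round(score, 3) for score in scores])
```

Because each row is centred on its own mean, the heatmap compares patterns across conditions rather than absolute abundance between classes.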

In [None]:
from utils_mass_spec_analysis_singlicates import run_class_level_heatmap_analysis_with_dendrogram_and_widgets


plot_normalized_data = False  # Set True to plot normalized data, False for raw data
reference_protein_accession = 'Q6P2Q9'  # Set to None if you don't want normalization

# Retrieve interactive selections from your earlier cell:
conditions_to_plot = selection_results["conditions_to_plot"]
classes_to_plot = selection_results["classes_to_plot"]

# Call the heatmap analysis function with dendrograms and widget-based saving.
run_class_level_heatmap_analysis_with_dendrogram_and_widgets(
    merged_data=preprocessor.merged_data,
    conditions_to_plot=conditions_to_plot,
    classes_to_plot=classes_to_plot,
    reference_protein_accession=reference_protein_accession,
    plot_normalized_data=plot_normalized_data,
    mass_spec_file_path=mass_spec_file_path,
    output_directory_input=output_directory_input,  # or "" to use default
    perform_clustering=True,
    comparison_label="Class_Level_Heatmap",
    hover_fields_for_heatmap=["Genes", "Description"],
    bar_height_per_class=200,      # Vertical space per class
    additional_margin=200,         # Extra margin for titles/labels
    fig_width=1000                 # Figure width (adjust as desired)
)


Significant changes: 2, Non-significant changes: 0


VBox(children=(FigureWidget({
    'data': [{'line': {'color': 'black', 'width': 1},
              'mode': 'lin…

| Class / family | 2 vs 4 |
|---|---|
| PRP19 related | * p=2.19e-02, FC=-3.66 |
| Pol II Core | *** p=4.66e-05, FC=0.46 |



Done! Heatmap and any saved files are located under: analysis\class_level_heatmaps
