## notebook identifies consistent neutral loss for carnitine compounds

- carnitines determined by annotation description
- output is a boxplot describing the characteristic neutral loss and their stats

v_boxplot_carnitines_M+H_neutral_loss_for_manuscript.ipynb

## This notebook identifies **consistent neutral loss** for **carnitine compounds**
(original notebook content: v_boxplot_carnitines_M+H_neutral_loss_for_manuscript.ipynb)

---

- Carnitines were determined by annotation described in **annotation_search_identify_carnitine.ipynb**
- **Outputs are dataframes** describing the characteristic neutral losses and their statistics

---

### About the process
The process for determining consistent neutral loss is as follows:

1. Neutral loss m/z values are grouped into 0.01 m/z bins
2. Peaks attributed to noise are removed by keeping only peaks with a normalized intensity of at least 0.05
3. A neutral loss is kept only if, after noise removal, it occurs in at least 20% of the spectra

The final output summarizes the consistent neutral losses, i.e. those occurring in at least 20% of the spectra.
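
The three steps above can be sketched on toy data as follows. This is a minimal illustration in plain Python, not the notebook's pandas implementation: the function name `consistent_neutral_losses`, its parameter names, and the toy spectra are hypothetical. It also assumes that "percent occurrence" counts each spectrum at most once per bin.

```python
from collections import defaultdict


def consistent_neutral_losses(spectra, min_i_norm=0.05, min_fraction=0.20):
    """Return {binned neutral loss: fraction of spectra containing it}.

    `spectra` maps a spectrum id to a list of (mz_nl, i_norm) peaks.
    """
    spectra_per_bin = defaultdict(set)
    for spectrum_id, peaks in spectra.items():
        for mz_nl, i_norm in peaks:
            # Step 2: drop peaks below the normalized intensity threshold (noise)
            if i_norm < min_i_norm:
                continue
            # Step 1: rounding to 2 decimals assigns the peak to a 0.01 m/z bin
            spectra_per_bin[round(mz_nl, 2)].add(spectrum_id)

    total = len(spectra)
    # Step 3: keep bins seen in at least `min_fraction` of all spectra
    return {
        nl: len(ids) / total
        for nl, ids in spectra_per_bin.items()
        if len(ids) / total >= min_fraction
    }


# Hypothetical toy spectra: id -> [(mz_nl, i_norm), ...]
toy_spectra = {
    "A": [(59.0718, 0.47), (126.1182, 0.01)],
    "B": [(59.0701, 0.14), (98.0761, 0.57)],
    "C": [(59.0733, 0.90)],
    "D": [(41.6449, 0.30)],
    "E": [(59.0699, 0.02)],
    "F": [(119.0941, 1.00)],
}
result = consistent_neutral_losses(toy_spectra)
```

In this toy set only the 59.07 bin survives: it appears above the intensity threshold in three of the six spectra, while every other bin appears in a single spectrum.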

---
## Notebook organization

### Section 1: Reading input data
- MS/MS peak data
- Carnitine-specific metadata (carnitines identified by annotation) in the GNPS Library

### Section 2: Matching peak data with metadata

### Section 3: Investigate neutral loss
- Make 0.01 M/Z neutral loss bins
- Define minimum intensity and percent occurrence parameters
- Identify peaks that satisfy parameters

## Input files needed for the Notebook
1. MS/MS peak data from **v_get_peaks_files.ipynb**
2. Dataframe output of carnitine metadata in the GNPS Library from **annotation_search_identify_carnitine.ipynb**

In [1]:
import pandas as pd
import plotly
import plotly.express as px

### Section 1: Read input data

#### Read peak data
- from v_get_peaks_files.ipynb

(The peak data needed to be split into 5 parts due to file size)

In [2]:
all_file_peaks_part_1 = pd.read_parquet('/home/jovyan/work/notebooks/outputs/all_file_peaks_part_1.gzip')

In [3]:
all_file_peaks_part_2 = pd.read_parquet('/home/jovyan/work/notebooks/outputs/all_file_peaks_part_2.gzip')

In [4]:
all_file_peaks_part_3 = pd.read_parquet('/home/jovyan/work/notebooks/outputs/all_file_peaks_part_3.gzip')

In [5]:
all_file_peaks_part_4 = pd.read_parquet('/home/jovyan/work/notebooks/outputs/all_file_peaks_part_4.gzip')

In [6]:
all_file_peaks_part_5 = pd.read_parquet('/home/jovyan/work/notebooks/outputs/all_file_peaks_part_5.gzip')
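
The five near-identical read cells could also be written as a loop. The sketch below is one possible arrangement, not the notebook's code: `part_paths` and `load_parts` are hypothetical helpers, and the directory and file name pattern are copied from the cells above. The reader is passed in as an argument so that the sketch runs without the files being present.

```python
from pathlib import Path


def part_paths(output_dir, n_parts=5, pattern="all_file_peaks_part_{}.gzip"):
    """Build the paths of the numbered peak files (parts are numbered from 1)."""
    return [Path(output_dir) / pattern.format(k) for k in range(1, n_parts + 1)]


def load_parts(paths, reader):
    """Apply `reader` (for example pd.read_parquet) to every path, in order."""
    return [reader(path) for path in paths]


paths = part_paths("/home/jovyan/work/notebooks/outputs")

# In the notebook this would become (assuming the files exist):
#   import pandas as pd
#   all_parts = load_parts(paths, pd.read_parquet)
```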

#### Read carnitine metadata from GNPS Library
- from annotation_search_identify_carnitine.ipynb

In [7]:
carnitine_table = pd.read_csv('/home/jovyan/work/notebooks/outputs/library_df_carnitine_case_insen_M+H.csv',sep=',', index_col='spectrum_id')

In [8]:
# list of spectrum_id
carnitine_table_ID = carnitine_table.index.to_list()

### Section 2: Matching peak data with metadata
- Need to identify MS/MS peak data for spectra identified as carnitines in the GNPS Library by annotation search

In [11]:
all_file_peaks_part_1_carnitine = all_file_peaks_part_1[all_file_peaks_part_1.index.isin(carnitine_table_ID)]

In [12]:
all_file_peaks_part_2_carnitine = all_file_peaks_part_2[all_file_peaks_part_2.index.isin(carnitine_table_ID)]

In [13]:
all_file_peaks_part_3_carnitine = all_file_peaks_part_3[all_file_peaks_part_3.index.isin(carnitine_table_ID)]

In [14]:
all_file_peaks_part_4_carnitine = all_file_peaks_part_4[all_file_peaks_part_4.index.isin(carnitine_table_ID)]

In [15]:
all_file_peaks_part_5_carnitine = all_file_peaks_part_5[all_file_peaks_part_5.index.isin(carnitine_table_ID)]

In [16]:
# Combine individual dataframes to make complete dataframe of all peak data associated with carnitines
all_file_peaks_carnitine = pd.concat([all_file_peaks_part_1_carnitine, all_file_peaks_part_2_carnitine, all_file_peaks_part_3_carnitine,
                                      all_file_peaks_part_4_carnitine, all_file_peaks_part_5_carnitine], axis=0)
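
The filter-then-concatenate pattern used in the cells above can be condensed into a single expression. The following is a small sketch on toy data: the frames, the ids, and the names `parts`, `wanted_ids` and `subset` are hypothetical stand-ins for the five peak tables and `carnitine_table_ID`.

```python
import pandas as pd

# Toy stand-ins for the peak tables, indexed by spectrum id like the real ones
parts = [
    pd.DataFrame({"mz": [85.03, 60.08]}, index=pd.Index(["ID1", "ID2"], name="scan")),
    pd.DataFrame({"mz": [103.04, 144.10]}, index=pd.Index(["ID3", "ID1"], name="scan")),
]
wanted_ids = ["ID1", "ID3"]

# Keep only rows whose index is a wanted spectrum id, then stack the pieces
subset = pd.concat(
    [part[part.index.isin(wanted_ids)] for part in parts],
    axis=0,
)
```

Rows keep their original order within each part, and a spectrum id that occurs in more than one part appears once per matching row.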

In [17]:
all_file_peaks_carnitine

| scan | level_0 | index | i | i_norm | i_tic_norm | mz | mz_nl | precmz |
|---|---|---|---|---|---|---|---|---|
| CCMSLIB00000216197 | 399442 | 0 | 2177.529053 | 0.056863 | 0.034625 | 73.961777 | 126.118223 | 200.080 |
| CCMSLIB00000216197 | 399443 | 1 | 25.951593 | 0.000678 | 0.000413 | 74.977188 | 125.102812 | 200.080 |
| CCMSLIB00000216197 | 399444 | 2 | 503.461060 | 0.013147 | 0.008006 | 84.048637 | 116.031363 | 200.080 |
| CCMSLIB00000216197 | 399445 | 3 | 21725.027344 | 0.567317 | 0.345450 | 102.003937 | 98.076063 | 200.080 |
| CCMSLIB00000216197 | 399446 | 4 | 37.872719 | 0.000989 | 0.000602 | 158.435135 | 41.644865 | 200.080 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CCMSLIB00000221337 | 35834 | 3 | 473.000000 | 0.473473 | 0.229500 | 103.041199 | 59.071801 | 162.113 |
| CCMSLIB00000221337 | 35835 | 4 | 999.000000 | 1.000000 | 0.484716 | 162.113007 | -0.000007 | 162.113 |
| CCMSLIB00000221529 | 36687 | 0 | 999.000000 | 1.000000 | 0.676371 | 85.029900 | 119.094100 | 204.124 |
| CCMSLIB00000221529 | 36688 | 1 | 141.000000 | 0.141141 | 0.095464 | 145.053894 | 59.070106 | 204.124 |
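
The rows shown are consistent with `mz_nl` being the precursor m/z (`precmz`) minus the fragment m/z (`mz`). That relation is inferred from the displayed values rather than stated in the notebook, so the check below is only a sketch; the helper name `neutral_loss` is hypothetical.

```python
import math


def neutral_loss(precmz, mz):
    """Neutral loss as precursor m/z minus fragment m/z (inferred relation)."""
    return precmz - mz


# (precmz, mz, mz_nl) copied from rows of the output above
rows = [
    (200.080, 73.961777, 126.118223),
    (162.113, 103.041199, 59.071801),
    (162.113, 162.113007, -0.000007),
    (204.124, 145.053894, 59.070106),
]

# Compare the recomputed neutral loss with the tabulated one, row by row
matches = [math.isclose(neutral_loss(p, m), nl, abs_tol=1e-6) for p, m, nl in rows]
```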


In [18]:
# check that all spectrum_id are accounted for
len(all_file_peaks_carnitine.index.unique().to_list())

387
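
Counting the unique ids gives the number of spectra found, but not which expected ids, if any, are absent. One possible check is a set difference, sketched below. The helper `missing_ids` is hypothetical; the ids are taken from the output above, and the `found` list is made up for illustration.

```python
def missing_ids(expected_ids, found_ids):
    """Return the expected spectrum ids that do not occur among the found ids."""
    return sorted(set(expected_ids) - set(found_ids))


expected = ["CCMSLIB00000216197", "CCMSLIB00000221337", "CCMSLIB00000221529"]
# Peak tables repeat an id once per peak, so duplicates are expected here
found = ["CCMSLIB00000216197", "CCMSLIB00000216197", "CCMSLIB00000221529"]

missing = missing_ids(expected, found)
```

In the notebook the equivalent call would be `missing_ids(carnitine_table_ID, all_file_peaks_carnitine.index)`, where an empty list means every annotated spectrum has peak data.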

### Section 3: Investigate neutral loss

In [19]:
# refer to the combined dataframe by a shorter name (this is an alias, not a copy)
peak_df = all_file_peaks_carnitine

In [20]:
# Identifying small bins of neutral loss values
peak_df['mz_nl_binned_small'] = peak_df['mz_nl'].round(decimals = 2)
unique_mz_nl_binned = peak_df['mz_nl_binned_small'].unique()

In [21]:
peak_df.head(5)

| scan | level_0 | index | i | i_norm | i_tic_norm | mz | mz_nl | precmz | mz_nl_binned_small |
|---|---|---|---|---|---|---|---|---|---|
| CCMSLIB00000216197 | 399442 | 0 | 2177.529053 | 0.056863 | 0.034625 | 73.961777 | 126.118223 | 200.08 | 126.12 |
| CCMSLIB00000216197 | 399443 | 1 | 25.951593 | 0.000678 | 0.000413 | 74.977188 | 125.102812 | 200.08 | 125.1 |
| CCMSLIB00000216197 | 399444 | 2 | 503.46106 | 0.013147 | 0.008006 | 84.048637 | 116.031363 | 200.08 | 116.03 |
| CCMSLIB00000216197 | 399445 | 3 | 21725.027344 | 0.567317 | 0.34545 | 102.003937 | 98.076063 | 200.08 | 98.08 |
| CCMSLIB00000216197 | 399446 | 4 | 37.872719 | 0.000989 | 0.000602 | 158.435135 | 41.644865 | 200.08 | 41.64 |


In [22]:
# Defining parameters --> can be modified by user

intensitynormmin  = 0.05
percentoccurmin = 20

In [23]:
# identifying peaks that satisfy minimum normalized intensity parameter

filtered_peak_df_i_norm = peak_df[peak_df["i_norm"] >= intensitynormmin]

In [24]:
# For counting the percent occurrence of peaks above the minimum intensity
occurs_above_intensitynormmin = {}

# Total number of spectral IDs
total_ids = len(peak_df.index.unique())

for peak in unique_mz_nl_binned:
    mz_nl_df_above_intensitynormmin = filtered_peak_df_i_norm.loc[(filtered_peak_df_i_norm['mz_nl_binned_small'] == peak)]

    # Number of peaks (rows) in this bin above the minimum intensity; this equals the number
    # of spectra only if each spectrum contributes at most one such peak per bin
    peak_occurs_above_intensitynormmin = len(mz_nl_df_above_intensitynormmin)

    if peak_occurs_above_intensitynormmin/total_ids >= (percentoccurmin/100):
        occurs_above_intensitynormmin[peak] = peak_occurs_above_intensitynormmin/total_ids
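
The loop above can also be expressed with a `groupby`. The sketch below is an alternative under stated assumptions, not the notebook's method: it counts unique spectrum ids per bin, whereas the loop counts rows, and the two differ whenever one spectrum has several peaks in the same bin. The toy frame and the names `toy`, `above`, `rows_per_bin`, `spectra_per_bin`, `fraction` and `consistent` are hypothetical.

```python
import pandas as pd

# Toy peak table indexed by spectrum id; spectrum "A" has two peaks in the 59.07 bin
toy = pd.DataFrame(
    {
        "mz_nl_binned_small": [59.07, 59.07, 59.07, 59.07, 0.0, 59.07, 98.08, 41.64],
        "i_norm": [0.47, 0.08, 0.14, 0.90, 1.00, 0.02, 0.57, 0.30],
    },
    index=pd.Index(["A", "A", "B", "C", "C", "D", "E", "F"], name="scan"),
)
intensitynormmin = 0.05
percentoccurmin = 20

total_ids = toy.index.nunique()
above = toy[toy["i_norm"] >= intensitynormmin]

# Row counts per bin (what the loop computes) versus unique spectra per bin
rows_per_bin = above.groupby("mz_nl_binned_small").size()
spectra_per_bin = above.reset_index().groupby("mz_nl_binned_small")["scan"].nunique()

# Fraction of all spectra containing each bin, filtered by the minimum occurrence
fraction = spectra_per_bin / total_ids
consistent = fraction[fraction >= percentoccurmin / 100]
```

Here the 59.07 bin has four qualifying rows but only three distinct spectra, so the row-based fraction would be 4/6 while the spectrum-based fraction is 3/6.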

In [25]:
# Filtering to only include neutral losses that are present in at least 20% of the scans
filtered_peak_df = filtered_peak_df_i_norm[filtered_peak_df_i_norm["mz_nl_binned_small"].isin(occurs_above_intensitynormmin.keys())]

In [26]:
# reshaping and renaming
peak_ratio_df = pd.DataFrame.from_dict(occurs_above_intensitynormmin, orient='index')
peak_ratio_df.index.name = 'mz_nl_binned_small'
peak_ratio_df = peak_ratio_df.rename(columns={0: "percent_occurrence"})

In [27]:
# view neutral losses with their occurrence (stored as a fraction of spectra, e.g. 0.52 = 52%)
# NOTE: the neutral loss of 0 corresponds to the precursor ion itself (mz equal to precmz)
peak_ratio_df

| mz_nl_binned_small | percent_occurrence |
|---|---|
| 0.0 | 0.516796 |
| 59.07 | 0.521964 |
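
Despite its name, the `percent_occurrence` column holds fractions rather than percentages. As a rough sanity check, the fractions can be converted back to counts using the 387 unique spectrum ids reported earlier in the notebook. The variable names below are hypothetical, and the counts refer to qualifying peaks, which match spectra only if no spectrum has two such peaks in one bin.

```python
total_spectra = 387  # unique spectrum ids reported earlier in the notebook
fractions = {0.0: 0.516796, 59.07: 0.521964}  # values from the output above

# Counts implied by each fraction, rounded to the nearest whole number
implied_counts = {nl: round(frac * total_spectra) for nl, frac in fractions.items()}

# The same values expressed as percentages with one decimal place
percentages = {nl: round(100 * frac, 1) for nl, frac in fractions.items()}
```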
