## This notebook is used for **cleaning the carnitine MassQL query output**
(original notebook content: v_carnitine_massql_query_for_manuscript.ipynb)

- The MassQL query is based on the diagnostic peaks described in diagnostic_peaks_carnitine_M+H.ipynb and diagnostic_neutral_loss_carnitine_M+H.ipynb
- MassQL query data is specific to identifying carnitines

---
#### Shape of MassQL query output
The query output is indexed by scan/spectrum_id (e.g. CCMSLIB00000221013) and includes the columns precmz, ms1scan, rt, charge, i, i_norm, mslevel, and i_norm_ms1. The file loaded in this notebook also carries Compound_Name, Adduct, and library_membership.

The query was run against the GNPS Library to test how well compounds can be identified by their diagnostic peaks, so every scan in the MassQL query output corresponds to a spectrum in the GNPS Libraries.

---
### The MassQL query output is cleaned by:
- removing scans/spectrum_ids associated with SUSPECT LIST data
- isolating scans/spectrum_ids associated with the M+H adduct

Cleaning is done by comparing the metadata available in the GNPS Library with the MassQL query output, then keeping or discarding MassQL scans according to the criteria above.
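The two criteria above could be applied to library metadata roughly as follows. This is only a sketch on toy data: the actual filtering is done in a separate notebook and may differ. The `Adduct` and `library_membership` column names are taken from the tables shown in this notebook, while the `normalize_adduct` helper, the toy spectrum_ids, and the assumption that suspect-list entries carry "SUSPECT" in `library_membership` are illustrative.

```python
import pandas as pd

# Toy metadata; the spectrum_ids and values are made up for illustration.
library = pd.DataFrame({
    "spectrum_id": ["A1", "A2", "A3", "A4"],
    "Adduct": ["M+H", "[M+H]+", "[M+Na]+", "[M+H]"],
    "library_membership": ["GNPS-LIBRARY", "MASSBANK", "MONA", "GNPS-SUSPECTLIST"],
})

def normalize_adduct(adduct: str) -> str:
    """Hypothetical helper: drop brackets and the trailing charge so that
    'M+H', '[M+H]' and '[M+H]+' all compare equal."""
    text = adduct.strip()
    if text.startswith("[") and "]" in text:
        text = text[1:text.index("]")]
    return text

# Criterion 1: keep only M+H adducts, whatever notation the library used.
is_m_plus_h = library["Adduct"].map(normalize_adduct) == "M+H"

# Criterion 2 (assumption): suspect-list spectra are labelled in library_membership.
is_suspect = library["library_membership"].str.contains("SUSPECT", case=False)

cleaned = library[is_m_plus_h & ~is_suspect]
print(cleaned["spectrum_id"].tolist())
```

Normalizing the adduct string first matters because the library output in this notebook spells the same adduct as `M+H`, `[M+H]`, and `[M+H]+`.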

---
## Cleaning MassQL data with cleaned GNPS Library data
### Section 1: Read clean GNPS Library data
- already removed spectrum_ids associated with SUSPECT LIST data
- only includes spectrum_ids associated with M+H adduct

### Section 2: Clean MassQL data
- Compare MassQL 'scan' column with cleaned GNPS Library 'spectrum_id' column to clean MassQL data
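
The comparison in Section 2 comes down to a membership test. A minimal sketch on toy data (the spectrum_ids and values below are placeholders, not real GNPS entries):

```python
import pandas as pd

# Toy MassQL output, indexed by scan as in this notebook.
massql = pd.DataFrame(
    {"precmz": [162.113, 162.112, 344.280]},
    index=pd.Index(["ID1", "ID2", "ID3"], name="scan"),
)

# Toy cleaned library: ID2 is absent, as if it had been removed during cleaning.
library_clean = pd.DataFrame({
    "spectrum_id": ["ID1", "ID3", "ID9"],
    "Adduct": ["M+H", "M+H", "M+H"],
})

# Keep only library rows whose spectrum_id was returned by the query.
query_ids = massql.index.to_list()
matched = library_clean[library_clean["spectrum_id"].isin(query_ids)]
print(matched["spectrum_id"].tolist())
```

Note that the result holds the library metadata columns for the matched spectra, not the MassQL columns such as precmz or i_norm.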

## Input files needed for the Notebook
1. Data exported from https://msql.ucsd.edu/ related to **specific MassQL query**:
`QUERY scaninfo(MS2DATA) WHERE MS2PROD=85.03:TOLERANCEMZ=0.01:INTENSITYPERCENT=5 AND MS2PROD=60.08:TOLERANCEMZ=0.01:INTENSITYPERCENT=5 AND MS2NL=59.07:TOLERANCEMZ=0.01:INTENSITYPERCENT=5`
2. **Cleaned** GNPS Library metadata from clean_GNPS_Library_data.ipynb
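
To make the query conditions concrete, the sketch below checks a single MS2 peak list against them. It is a simplified reading of the query, not the MassQL implementation: it assumes each condition means "a peak within 0.01 m/z of the target with intensity of at least 5% of the base peak", and that the neutral loss is measured from the precursor m/z. The function names and the toy spectrum are hypothetical.

```python
def has_peak(peaks, target_mz, tol=0.01, min_percent=5.0):
    """True if any (mz, intensity) pair lies within tol of target_mz and
    reaches min_percent of the most intense peak."""
    if not peaks:
        return False
    base = max(intensity for _, intensity in peaks)
    return any(
        abs(mz - target_mz) <= tol and 100.0 * intensity / base >= min_percent
        for mz, intensity in peaks
    )

def matches_carnitine_query(peaks, precursor_mz):
    """Two product ions (85.03, 60.08) plus a neutral loss of 59.07."""
    return (
        has_peak(peaks, 85.03)
        and has_peak(peaks, 60.08)
        and has_peak(peaks, precursor_mz - 59.07)
    )

# Toy spectrum with illustrative intensities; precursor m/z as for L-carnitine [M+H].
toy_peaks = [(60.081, 30.0), (85.029, 100.0), (103.039, 50.0)]
print(matches_carnitine_query(toy_peaks, precursor_mz=162.113))
```

All three conditions are joined with AND, so a spectrum missing any one of the diagnostic features is not returned.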

In [2]:
import pandas as pd

#### Read MassQL data

In [3]:
# load the MassQL export; 'scan' holds the GNPS spectrum_id and is used as the index
massql_query_output = pd.read_csv('/home/jovyan/work/notebooks/outputs/massql_carnitine_query_peaks_nl.csv', sep=',', index_col='scan')

In [4]:
massql_query_output

scan,precmz,ms1scan,rt,charge,i,i_norm,mslevel,i_norm_ms1,Compound_Name,Adduct,library_membership
CCMSLIB00000221013,162.113,0,0,1,2.058000e+03,1,2,,ReSpect:PT102690 L-Carnitine h,[M+H],RESPECT
CCMSLIB00000221015,162.113,0,0,1,3.170000e+03,1,2,,ReSpect:PT102693 L-Carnitine h,[M+H],RESPECT
CCMSLIB00000221337,162.113,0,0,1,2.061000e+03,1,2,,ReSpect:PT107170 L-Carnitine|V,[M+H],RESPECT
CCMSLIB00000222989,162.113,0,0,1,5.995100e+03,1,2,,Massbank:PR100159 L-Carnitine|,[M+H]+,MASSBANK
CCMSLIB00000222990,162.113,0,0,1,3.706970e+02,1,2,,Massbank:PR100160 L-Carnitine|,[M+H]+,MASSBANK
...,...,...,...,...,...,...,...,...,...,...,...
CCMSLIB00006683989,162.112,0,0,1,5.146000e+02,1,2,,,,
CCMSLIB00006684275,344.280,0,0,1,1.509000e+02,1,2,,,,
CCMSLIB00006686014,162.112,0,0,1,4.952127e+08,1,2,,,,
CCMSLIB00006686484,344.280,0,0,1,1.642600e+09,1,2,,,,


In [5]:
# need massql scans as list to compare with GNPS Library data
massql_query_ids = massql_query_output.index.to_list()

### Section 1: Read clean GNPS Library data

In [6]:
# cleaned GNPS Library metadata, from shape_GNPS_Library_data.ipynb
input_library_cleaned = pd.read_csv('/home/jovyan/work/notebooks/outputs/CLEANED_GNPS_input_library.csv', sep=',', low_memory=False)

In [7]:
len(input_library_cleaned)

245648

### Section 2: Clean MassQL data

##### Generate cleaned MassQL data
- comparing MassQL query 'scan' column with 'spectrum_id' column of cleaned GNPS Library data

In [9]:
# keep only cleaned-library rows whose spectrum_id was returned by the MassQL query
massql_query_output_matched = input_library_cleaned[input_library_cleaned["spectrum_id"].isin(massql_query_ids)]

In [10]:
massql_query_output_matched

Unnamed: 0,index,spectrum_id,source_file,task,scan,ms_level,library_membership,spectrum_status,peaks_json,splash,...,Ion_Mode,create_time,task_id,user_id,InChIKey_smiles,InChIKey_inchi,Formula_smiles,Formula_inchi,url,annotation_history
5802,9475,CCMSLIB00005884289,madeleine.mgf,f92bb49bd6e64337b6cc3a154bb87d6b,1136,2,GNPS-LIBRARY,1,"[[50.167000,175.009995],[51.291000,1927.599976...",null-null-null-null,...,Positive,2021-03-18 10:55:10.0,f92bb49bd6e64337b6cc3a154bb87d6b,,PHIQHXFUZVPYII-ZCFIWIBFSA-N,,C7H15NO3,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': '6645-46-1', ..."
5803,9476,CCMSLIB00005884290,madeleine.mgf,f92bb49bd6e64337b6cc3a154bb87d6b,1137,2,GNPS-LIBRARY,1,"[[50.518299,6257.680176],[51.285301,5857.77002...",null-null-null-null,...,Positive,2021-03-18 10:55:10.0,f92bb49bd6e64337b6cc3a154bb87d6b,,PHIQHXFUZVPYII-ZCFIWIBFSA-N,,C7H15NO3,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': '6645-46-1', ..."
5804,9477,CCMSLIB00005884291,madeleine.mgf,f92bb49bd6e64337b6cc3a154bb87d6b,1138,2,GNPS-LIBRARY,1,"[[50.335499,173.570007],[52.941200,175.500000]...",null-null-null-null,...,Positive,2021-03-18 10:55:10.0,f92bb49bd6e64337b6cc3a154bb87d6b,,PHIQHXFUZVPYII-ZCFIWIBFSA-N,,C7H15NO3,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': '6645-46-1', ..."
5805,9478,CCMSLIB00005884292,madeleine.mgf,f92bb49bd6e64337b6cc3a154bb87d6b,1139,2,GNPS-LIBRARY,1,"[[51.347500,195.000000],[52.074001,2091.510010...",null-null-null-null,...,Positive,2021-03-18 10:55:10.0,f92bb49bd6e64337b6cc3a154bb87d6b,,PHIQHXFUZVPYII-ZCFIWIBFSA-N,,C7H15NO3,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': '6645-46-1', ..."
5903,9576,CCMSLIB00005884390,madeleine.mgf,f92bb49bd6e64337b6cc3a154bb87d6b,1237,2,GNPS-LIBRARY,1,"[[50.608002,74.110001],[50.686501,158.160004],...",null-null-null-null,...,Positive,2021-03-18 10:55:10.0,f92bb49bd6e64337b6cc3a154bb87d6b,,FUJLYHJROOYKRA-QGZVFWFLSA-N,,C19H37NO4,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': '25518-54-1',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229180,560686,CCMSLIB00006683989,library_mgf.mgf,cbf7a3321fc2435bb2fccc544e58a335,3836,2,MONA,1,"[[41.039059,3.800000],[43.018051,41.099998],[4...",null-null-null-null,...,positive,2021-07-20 19:45:32.0,cbf7a3321fc2435bb2fccc544e58a335,,PHIQHXFUZVPYII-ZCFIWIBFSA-N,,C7H15NO3,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': 'N/A', 'Charg..."
229366,560972,CCMSLIB00006684275,library_mgf.mgf,cbf7a3321fc2435bb2fccc544e58a335,4122,2,MONA,1,"[[57.033741,0.500000],[57.069820,1.400000],[60...",null-null-null-null,...,positive,2021-07-20 19:45:32.0,cbf7a3321fc2435bb2fccc544e58a335,,FUJLYHJROOYKRA-QGZVFWFLSA-N,,C19H37NO4,,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': 'M+H', 'CAS_Number': 'N/A', 'Charg..."
236145,568238,CCMSLIB00000221013,respect_8_1_2014_GNPS_peaks.mgf,819707e5bc284f80b9d421d106481dbb,6222,2,RESPECT,1,"[[60.082802,129.000000],[85.030502,257.000000]...",splash10-00fr-7900000000-7900000000,...,Positive,2014-08-01 16:37:39.0,819707e5bc284f80b9d421d106481dbb,,PHIQHXFUZVPYII-UHFFFAOYSA-N,PHIQHXFUZVPYII-ZCFIWIBFNA-N,C7H15NO3,C7H15NO3,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': '[M+H]', 'CAS_Number': '541-15-1',..."
236146,568239,CCMSLIB00000221015,respect_8_1_2014_GNPS_peaks.mgf,819707e5bc284f80b9d421d106481dbb,6223,2,RESPECT,1,"[[57.033699,52.000000],[58.066101,426.000000],...",splash10-00fr-7900000000-7900000000,...,Positive,2014-08-01 16:37:39.0,819707e5bc284f80b9d421d106481dbb,,PHIQHXFUZVPYII-UHFFFAOYSA-N,PHIQHXFUZVPYII-ZCFIWIBFNA-N,C7H15NO3,C7H15NO3,https://gnps.ucsd.edu/ProteoSAFe/gnpslibrarysp...,"[{'Adduct': '[M+H]', 'CAS_Number': '541-15-1',..."
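
After the match, it can be useful to record how many query hits survived cleaning and which were dropped. A small sketch with placeholder IDs; in this notebook the two sets would presumably be built from `massql_query_ids` and `massql_query_output_matched["spectrum_id"]`.

```python
# Placeholder IDs standing in for the real spectrum_ids.
query_ids = {"ID1", "ID2", "ID3"}   # scans returned by the MassQL query
matched_ids = {"ID1", "ID3"}        # scans also present in the cleaned library

# Scans removed by cleaning (suspect-list entries or other adducts).
dropped_ids = sorted(query_ids - matched_ids)

summary = {
    "queried": len(query_ids),
    "kept": len(matched_ids),
    "dropped": len(dropped_ids),
}
print(summary)
print(dropped_ids)
```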


### Save file

In [13]:
# save cleaned MassQL data for further analysis
massql_query_output_matched.reset_index().to_csv(
    '/home/jovyan/work/notebooks/outputs/massql_carnitine_query_peaks_nl_output_matched_for_venn_diagram.csv', sep=',', index=False)