## This notebook is used for **cleaning metadata from GNPS Libraries**
---
The metadata is cleaned in two steps:
- removing spectrum_ids whose `library_membership` marks them as SUSPECT LIST data
- keeping only spectrum_ids whose `Adduct` is an M+H label

## Input files needed for the notebook
1. GNPS Library metadata from https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json, loaded here from the parquet file written by `v_get_ALL_GNPS_input_library.ipynb`

In [1]:
import pandas as pd

#### Read GNPS Library data

In [2]:
# Load the library metadata written by v_get_ALL_GNPS_input_library.ipynb
# (a parquet file, despite the .gzip extension)

input_library_full_df_loaded = pd.read_parquet('/home/jovyan/work/notebooks/outputs/ALL_GNPS_input_library.gzip')

### Remove suspect list data from GNPS Library data

In [3]:
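# Remove rows whose library_membership mentions "suspect" (e.g. GNPS-SUSPECTLIST).
# The match is case-insensitive; na=False keeps rows with a missing library_membership.
is_suspect = input_library_full_df_loaded['library_membership'].str.contains(
    'suspect', case=False, na=False)
input_library_no_suspect_list = input_library_full_df_loaded[~is_suspect]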
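
The suspect list filter can be checked on a small example before running it on the full library. The sketch below uses a made-up `toy_library` frame (the ids and membership values are illustrative assumptions, not taken from the GNPS data) to show how the case-insensitive match and `na=False` behave.

```python
import pandas as pd

# Toy stand-in for the library metadata; all values are invented for illustration.
toy_library = pd.DataFrame({
    "spectrum_id": ["A1", "A2", "A3", "A4"],
    "library_membership": ["GNPS-LIBRARY", "GNPS-SUSPECTLIST", None, "gnps-suspectlist"],
})

# Case-insensitive substring match. na=False treats a missing membership
# as "not suspect", so that row is kept rather than dropped.
is_suspect = toy_library["library_membership"].str.contains(
    "suspect", case=False, na=False)
toy_no_suspect = toy_library[~is_suspect]

kept_ids = toy_no_suspect["spectrum_id"].tolist()
print(kept_ids)
```

Note that rows with no `library_membership` survive this filter; if those should be excluded as well, that would need a separate check.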

### Isolate M+H adduct data in GNPS Library data

In [4]:
# Label variants that denote the M+H adduct in the GNPS Library metadata
adduct_labels = ['M+H', '[M+H]', '[M+H]+']

##### Generate cleaned GNPS Library data
- excluding suspect list data
- restricted to M+H adduct data

In [5]:
# Cleaned GNPS Library data: keep only rows whose Adduct exactly matches an M+H label
input_library = input_library_no_suspect_list[
    input_library_no_suspect_list["Adduct"].isin(adduct_labels)]
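
Because `isin` matches labels exactly, it can help to look at which adduct values are being excluded. The following is a minimal sketch on a made-up `toy_adducts` frame (values are assumptions for illustration) showing that variants with stray whitespace or missing values do not match.

```python
import pandas as pd

adduct_labels = ["M+H", "[M+H]", "[M+H]+"]

# Toy adduct column; the values are invented to show edge cases.
toy_adducts = pd.DataFrame({
    "spectrum_id": ["B1", "B2", "B3", "B4", "B5"],
    "Adduct": ["M+H", "[M+H]+", "[M+Na]+", " [M+H]+", None],
})

# isin does exact, case-sensitive matching; missing values never match.
mask = toy_adducts["Adduct"].isin(adduct_labels)
matched_ids = toy_adducts.loc[mask, "spectrum_id"].tolist()

# Count the excluded values to spot label variants that may deserve review.
excluded_counts = toy_adducts.loc[~mask, "Adduct"].value_counts(dropna=False)

print(matched_ids)
print(excluded_counts)
```

In this toy data the label with a leading space is excluded even though it names the same adduct, which is the kind of variant an inspection of the excluded counts would surface.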

In [6]:
len(input_library)

245648

In [7]:
# Write the cleaned library to CSV
input_library.reset_index().to_csv(
    '/home/jovyan/work/notebooks/outputs/CLEANED_GNPS_input_library.csv', sep=',', index=False)
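
One detail of the export worth knowing: `reset_index()` without `drop=True` turns the existing row labels into a regular column, so the CSV carries them even though `index=False` is passed to `to_csv`. The sketch below illustrates this on a made-up frame with an unnamed index, writing to an in-memory buffer instead of the real output path.

```python
import io
import pandas as pd

# Toy frame whose index has gaps, as happens after filtering rows.
toy = pd.DataFrame({"spectrum_id": ["C1", "C2", "C3"]}, index=[0, 5, 9])

# Same call pattern as the export cell, but into an in-memory buffer.
buffer = io.StringIO()
toy.reset_index().to_csv(buffer, sep=",", index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
columns = reloaded.columns.tolist()
original_labels = reloaded["index"].tolist()

print(columns)
print(original_labels)
```

With an unnamed index the extra column is called `index`; it preserves each row's position in the original library, which can be useful for tracing rows back to the unfiltered data.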