# 1. Download Data

This notebook downloads the example data used in the other notebooks. In particular, it does the following:

- Downloads the beer and urine .mzML files used as examples in the paper.
- Downloads the HMDB database and extracts metabolites from it.
- Trains kernel density estimators (KDEs) on the mzML files.
- Extracts regions of interest (ROIs) from the mzML files.

**Please run this notebook first to make sure the data files are available for subsequent notebooks.**

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os  # used below to build base_dir
from pathlib import Path
import glob

In [4]:
import sys
sys.path.append('..')

In [5]:
from vimms.DataGenerator import download_file, extract_hmdb_metabolite, extract_zip_file, get_data_source, train_kdes
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import SimpleMs1Controller
from vimms.Common import *
from vimms.Roi import make_roi, RoiToChemicalCreator, extract_roi

## a. Download beer and urine files

Here we download the beer and urine .mzML files used as examples in the paper if they don't exist.

In [6]:
url = 'http://researchdata.gla.ac.uk/870/2/example_data.zip'
base_dir = os.path.join(os.getcwd(), 'example_data')

In [7]:
if not os.path.isdir(base_dir): # if not exist then download the example data and extract it
    print('Creating %s' % base_dir)    
    out_file = 'example_data.zip'
    download_file(url, out_file)
    extract_zip_file(out_file, delete=True)
else:
    print('Found %s' % base_dir)

Creating /home/joewandy/git/vimms/examples/example_data


  0%|          | 448/869k [00:00<05:13, 2.78kKB/s]

Downloading example_data.zip


869kKB [01:27, 9.92kKB/s]                           
  5%|▌         | 6/110 [00:00<00:03, 33.27it/s]

Extracting example_data.zip


100%|██████████| 110/110 [00:08<00:00, 12.71it/s]


Deleting example_data.zip
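
After extraction, it can be worth confirming that the folders used later in this notebook are present. The sketch below is a hypothetical helper and not part of ViMMS. The subfolder names are taken from the paths used in the cells that follow, and the temporary directory only stands in for `base_dir` so that the example runs on its own.

```python
import tempfile
from pathlib import Path

# Subfolders referenced later in this notebook (relative to base_dir).
EXPECTED_SUBDIRS = [
    Path('beers', 'fullscan', 'mzML'),
    Path('beers', 'fragmentation', 'mzML'),
    Path('urines', 'fragmentation', 'mzML'),
]


def find_missing_dirs(base_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subfolders that do not exist under base_dir."""
    base = Path(base_dir)
    return [sub.as_posix() for sub in expected if not (base / sub).is_dir()]


# Demonstration on a temporary folder that contains only one of the subfolders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'beers' / 'fullscan' / 'mzML').mkdir(parents=True)
    missing = find_missing_dirs(tmp)
    print(missing)
```

In the notebook itself, `find_missing_dirs(base_dir)` returning an empty list would indicate that all three folders were extracted.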


## b. Download metabolites from HMDB

Next we load a pre-processed pickled file of database metabolites from the `base_dir` folder. If the file is not found, we create it by downloading and extracting the metabolites from HMDB.

In [8]:
compound_file = Path(base_dir, 'hmdb_compounds.p')
hmdb_compounds = load_obj(compound_file)
if hmdb_compounds is None: # if the file does not exist or cannot be loaded

    # download the entire HMDB metabolite database, big and slow!!
    # url = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip'

    # download a smaller urine metabolite database for testing
    url = 'http://www.hmdb.ca/system/downloads/current/urine_metabolites.zip'

    out_file = download_file(url)
    # keep the result in hmdb_compounds so it is not left as None after this cell
    hmdb_compounds = extract_hmdb_metabolite(out_file, delete=True)
    save_obj(hmdb_compounds, compound_file)

else:
    print('Loaded %d DatabaseCompounds from %s' % (len(hmdb_compounds), compound_file))

Old, invalid or missing pickle in /home/joewandy/git/vimms/examples/example_data/hmdb_compounds.p. Please regenerate this file.
  0%|          | 29.0/26.7k [00:00<02:32, 174KB/s]

Downloading urine_metabolites.zip


26.7kKB [00:45, 587KB/s]                             


Extracting HMDB metabolites from urine_metabolites.zip
Loaded 4236 DatabaseCompounds from urine_metabolites.zip
Deleting urine_metabolites.zip
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/hmdb_compounds.p
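
The cell above follows a common "load the cached file, otherwise build and save it" pattern. A minimal standard-library sketch of that pattern is shown below. `load_or_create` and `build` are hypothetical names used only for illustration, and this is not how `load_obj` or `save_obj` are implemented in ViMMS. Unlike the ViMMS output above, this sketch does not handle old or invalid pickle files.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_create(cache_file, create_fn):
    """Load a pickled object from cache_file, or build it and cache it."""
    cache_file = Path(cache_file)
    if cache_file.is_file():
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    obj = create_fn()  # the expensive step, e.g. download and parse
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(obj, f)
    return obj


calls = []


def build():
    """Stand-in for the slow step; records how often it is called."""
    calls.append(1)
    return ['compound_a', 'compound_b']  # placeholder data


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, 'compounds.p')
    first = load_or_create(cache, build)   # builds and saves
    second = load_or_create(cache, build)  # loads from the pickle
```

The second call reads the pickle instead of rebuilding, which is why rerunning this notebook should skip the HMDB download once `hmdb_compounds.p` exists.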


## c. Train the KDEs

In this section we demonstrate how ViMMS can be used to train kernel density estimators (KDEs) on the example Beer mzML files. The KDEs will be used to sample the MS1/MS2 data for chemicals, scan durations and number of peaks during simulation.
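
To give some intuition for what sampling from a KDE means, here is a small pure-Python sketch of drawing samples from a Gaussian KDE. It is an illustration only and makes no claim about how `train_kdes` or the resulting sampler work internally. The function name and the toy observations are assumptions made up for this example.

```python
import random


def sample_gaussian_kde(data, bandwidth, n_samples, seed=None):
    """Draw samples from a Gaussian KDE fitted to `data`.

    A Gaussian KDE is an equal-weight mixture of Gaussians centred on the
    observations, so sampling amounts to picking an observation uniformly
    at random and adding independent Gaussian noise with standard
    deviation `bandwidth` to each coordinate.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        centre = rng.choice(data)
        samples.append(tuple(rng.gauss(x, bandwidth) for x in centre))
    return samples


# Toy observations as (m/z, RT, intensity); the values are made up.
observed = [(100.0, 60.0, 5.0e4), (250.5, 300.0, 2.0e5), (400.2, 900.0, 1.0e5)]
draws = sample_gaussian_kde(observed, bandwidth=1.0, n_samples=5, seed=0)
for draw in draws:
    print(draw)
```

A larger bandwidth spreads the samples further from the observed points, while a bandwidth of zero reproduces the observations exactly.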

The following two functions from ViMMS will be used:
- `get_data_source` loads a `DataSource` object that stores information on a set of .mzML files.
- `train_kdes` trains KDEs on the .mzML files that have been loaded into the `DataSource`.

The parameters below should work for most cases. For different data, it might be necessary to adjust the `min_rt` and `max_rt` values.

In [9]:
filename = None                    # if None, use all mzML files found
min_ms1_intensity = 0              # min MS1 intensity threshold to include a data point for density estimation
min_ms2_intensity = 0              # min MS2 intensity threshold to include a data point for density estimation
min_rt = 0                         # min RT to include a data point for density estimation
max_rt = 1440                      # max RT to include a data point for density estimation
bandwidth_mz_intensity_rt = 1.0    # kernel bandwidth parameter to sample (mz, RT, intensity) values during simulation
bandwidth_n_peaks = 1.0            # kernel bandwidth parameter to sample number of peaks per scan during simulation
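
As a rough illustration of what the intensity and RT thresholds above mean, the sketch below filters a list of data points before density estimation. The function `filter_points`, the inclusive comparisons and the toy values are assumptions for this example. The actual filtering inside ViMMS may differ in detail.

```python
def filter_points(points, min_intensity=0, min_rt=0, max_rt=1440):
    """Keep (mz, rt, intensity) points that pass the thresholds."""
    return [
        (mz, rt, intensity)
        for mz, rt, intensity in points
        if intensity >= min_intensity and min_rt <= rt <= max_rt
    ]


# Toy data points; the values are made up.
points = [(100.0, 30.0, 5e4), (200.0, 1500.0, 2e5), (300.0, 700.0, 10.0)]

kept_default = filter_points(points)                    # only the RT range applies
kept_strict = filter_points(points, min_intensity=1e3)  # also drops weak points
```

With the default thresholds only the point outside the RT range is dropped. Raising `min_intensity` additionally removes the low-intensity point.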

### Load fullscan data and train KDEs

In [10]:
mzml_path = Path(base_dir, 'beers', 'fullscan', 'mzML')
xcms_output = Path(mzml_path, 'extracted_peaks_ms1.csv')
out_file = Path(base_dir, 'peak_sampler_mz_rt_int_19_beers_fullscan.p')

In [11]:
ds_fullscan = get_data_source(mzml_path, filename, xcms_output)

/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_17_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_4_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_1_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_6_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_13_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_15_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_5_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_2_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_19_fullscan1.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fullscan/mzML/Beer_multibeers_7_fullscan1.m

In [12]:
ps = train_kdes(ds_fullscan, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to /home/joewandy/git/vimms/examples/example_data/peak_sampler_mz_rt_int_19_beers_fullscan.p


In [13]:
ps.sample(1, 10) # try to sample 10 MS1 peaks

[Peak mz=119.8217 rt=477.74 intensity=130278.16 ms_level=1,
 Peak mz=384.4174 rt=621.15 intensity=4993.29 ms_level=1,
 Peak mz=203.0931 rt=430.72 intensity=493097.54 ms_level=1,
 Peak mz=292.2981 rt=456.44 intensity=101442.36 ms_level=1,
 Peak mz=151.1022 rt=116.68 intensity=38494.69 ms_level=1,
 Peak mz=211.4234 rt=1105.94 intensity=29309.86 ms_level=1,
 Peak mz=429.7858 rt=232.60 intensity=16410.08 ms_level=1,
 Peak mz=394.7228 rt=260.06 intensity=210988.60 ms_level=1,
 Peak mz=97.0928 rt=8.10 intensity=287018.34 ms_level=1,
 Peak mz=312.1693 rt=230.70 intensity=18033.43 ms_level=1]

### Load fragmentation data and train KDEs

In [14]:
mzml_path = Path(base_dir, 'beers', 'fragmentation', 'mzML')
xcms_output = Path(mzml_path, 'extracted_peaks_ms1.csv')
out_file = Path(base_dir, 'peak_sampler_mz_rt_int_19_beers_fragmentation.p')

In [15]:
ds_fragmentation = get_data_source(mzml_path, filename, xcms_output)

/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_10_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_2_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_5_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_13_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_9_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_7_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_4_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_19_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_17_T10_POS.mzML
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mz

In [16]:
ps = train_kdes(ds_fragmentation, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to /home/joewandy/git/vimms/examples/example_data/peak_sampler_mz_rt_int_19_beers_fragmentation.p


In [17]:
ps.sample(1, 10) # sample 10 MS1 peaks from the fragmentation data

[Peak mz=428.2632 rt=262.36 intensity=23630.50 ms_level=1,
 Peak mz=388.6023 rt=624.28 intensity=58264.41 ms_level=1,
 Peak mz=242.6519 rt=355.95 intensity=115256.14 ms_level=1,
 Peak mz=415.3920 rt=271.24 intensity=848848.95 ms_level=1,
 Peak mz=542.5228 rt=216.60 intensity=34425.32 ms_level=1,
 Peak mz=171.4193 rt=245.71 intensity=1325290.26 ms_level=1,
 Peak mz=237.0404 rt=684.06 intensity=12122597.97 ms_level=1,
 Peak mz=280.6160 rt=246.55 intensity=298850.20 ms_level=1,
 Peak mz=104.7488 rt=696.48 intensity=1549023.02 ms_level=1,
 Peak mz=95.1720 rt=251.53 intensity=37004.51 ms_level=1]

In [18]:
ps.sample(2, 10) # sample 10 MS2 peaks

[Peak mz=151.2007 rt=505.30 intensity=85002.68 ms_level=2,
 Peak mz=380.4412 rt=260.39 intensity=1140237.89 ms_level=2,
 Peak mz=550.2856 rt=968.73 intensity=173314.34 ms_level=2,
 Peak mz=185.2839 rt=525.87 intensity=451504.46 ms_level=2,
 Peak mz=312.6461 rt=301.00 intensity=64348.59 ms_level=2,
 Peak mz=198.4973 rt=302.34 intensity=196700.68 ms_level=2,
 Peak mz=373.7324 rt=591.23 intensity=51343.29 ms_level=2,
 Peak mz=649.5024 rt=239.38 intensity=17729.73 ms_level=2,
 Peak mz=160.2805 rt=762.63 intensity=128448.17 ms_level=2,
 Peak mz=134.4097 rt=379.92 intensity=247078.93 ms_level=2]

## d. Extract the ROIs for DsDA Experiments

Finally we extract regions of interest (ROIs) from the beer and urine fragmentation .mzML files and save them as pickled files for the DsDA experiments.

In [19]:
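roi_mz_tol = 10             # m/z tolerance for ROI extraction
roi_min_length = 2          # min number of data points in an ROI
roi_min_intensity = 1.75E5  # min intensity threshold for an ROI
roi_start_rt = min_rt       # start of the RT range for ROI extraction
roi_stop_rt = max_rt        # end of the RT range for ROI extraction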
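
The notebook does not show how ROIs are built internally. As a rough illustration of what the parameters above could control, the following simplified sketch groups peaks from consecutive scans into ROIs. It assumes that the m/z tolerance is in ppm, that an ROI is closed as soon as a scan has no matching peak, and that the intensity threshold applies to the most intense point of an ROI. All of these are assumptions for the example, and `build_rois` is a hypothetical function, not the ViMMS implementation.

```python
def build_rois(scans, mz_tol_ppm=10, min_length=2, min_intensity=1.75e5):
    """Group peaks into ROIs.

    scans: list of (rt, [(mz, intensity), ...]) in increasing RT order.
    Returns a list of ROIs, each a list of (mz, rt, intensity) points.
    """
    open_rois, finished = [], []
    for rt, peaks in scans:
        still_open, used = [], set()
        for roi in open_rois:
            last_mz = roi[-1][0]
            tol = last_mz * mz_tol_ppm / 1e6
            # Find the closest unused peak within the tolerance.
            best = None
            for i, (mz, _) in enumerate(peaks):
                if i in used or abs(mz - last_mz) > tol:
                    continue
                if best is None or abs(mz - last_mz) < abs(peaks[best][0] - last_mz):
                    best = i
            if best is None:
                finished.append(roi)  # no match in this scan: close the ROI
            else:
                used.add(best)
                roi.append((peaks[best][0], rt, peaks[best][1]))
                still_open.append(roi)
        # Peaks that did not extend any ROI start new ones.
        for i, (mz, intensity) in enumerate(peaks):
            if i not in used:
                still_open.append([(mz, rt, intensity)])
        open_rois = still_open
    finished.extend(open_rois)
    return [
        roi for roi in finished
        if len(roi) >= min_length and max(p[2] for p in roi) >= min_intensity
    ]


# Toy scans; the values are made up.
scans = [
    (1.0, [(100.0000, 2e5), (200.0, 1e3)]),
    (2.0, [(100.0005, 3e5)]),
    (3.0, [(100.0002, 1e5), (300.0, 5e5)]),
]
rois = build_rois(scans)
```

In this toy input only the trace near m/z 100 survives: it spans three scans and its most intense point exceeds the threshold, while the peaks at m/z 200 and 300 each appear in a single scan.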

### Extract beer ROIs

In [20]:
mzml_path = Path(base_dir, 'beers', 'fragmentation', 'mzML')
file_names = mzml_path.glob('*.mzML')
out_dir = Path(base_dir, 'DsDA', 'DsDA_Beer', 'beer_t10_simulator_files')

extract_roi(list(file_names), out_dir, 'beer_%d.p', mzml_path, ps)

/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_10_T10_POS.mzML
Created /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_10.p
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_2_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_2.p
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_5_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_5.p
/home/joewandy/git/vimms/examples/example_data/beers/fragmentation/mzML/Beer_multibeers_13_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_13.p
/home

### Extract urine ROIs

In [21]:
mzml_path = Path(base_dir, 'urines', 'fragmentation', 'mzML')
file_names = mzml_path.glob('*.mzML')
out_dir = Path(base_dir, 'DsDA', 'DsDA_Urine', 'urine_t10_simulator_files')

extract_roi(list(file_names), out_dir, 'urine_%d.p', mzml_path, ps)

/home/joewandy/git/vimms/examples/example_data/urines/fragmentation/mzML/Urine_StrokeDrugs_28_T10_POS.mzML
Created /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files/urine_28.p
/home/joewandy/git/vimms/examples/example_data/urines/fragmentation/mzML/Urine_StrokeDrugs_57_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files/urine_57.p
/home/joewandy/git/vimms/examples/example_data/urines/fragmentation/mzML/Urine_StrokeDrugs_17_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files/urine_17.p
/home/joewandy/git/vimms/examples/example_data/urines/fragmentation/mzML/Urine_StrokeDrugs_18_T10_POS.mzML
Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_si
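
The log lines above suggest that the sample number in each mzML file name is substituted into the output pattern, for example `Beer_multibeers_10_T10_POS.mzML` becomes `beer_10.p`. How `extract_roi` derives that number is not shown here. The sketch below is one plausible way to do it, assuming the number is always the third underscore-separated token of the file name. `roi_output_name` is a hypothetical helper.

```python
from pathlib import Path


def roi_output_name(mzml_file, pattern):
    """Fill `pattern` with the sample number taken from the mzML file name."""
    stem = Path(mzml_file).stem          # file name without the .mzML suffix
    number = int(stem.split('_')[2])     # assumed position of the sample number
    return pattern % number


beer_name = roi_output_name('Beer_multibeers_10_T10_POS.mzML', 'beer_%d.p')
urine_name = roi_output_name('Urine_StrokeDrugs_28_T10_POS.mzML', 'urine_%d.p')
print(beer_name, urine_name)
```

File names that do not follow this layout would raise an error in this sketch, so a different naming scheme would need its own parsing rule.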