# 01. Download Data

This notebook downloads the example data used in the other notebooks. In particular, it does the following:

- Downloads the beer and urine .mzML files used as examples in the paper.
- Downloads the HMDB database and extracts metabolites.
- Trains kernel density estimators on the mzML files.
- Extracts regions of interest from the mzML files.

**Please run this notebook first to make sure the data files are available for subsequent notebooks.**

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import glob

In [4]:
import sys
sys.path.append('..')

In [37]:
from vimms.DataGenerator import download_file, extract_hmdb_metabolite, extract_zip_file, get_data_source, train_kdes
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import SimpleMs1Controller
from vimms.Common import *
from vimms.Roi import make_roi, RoiToChemicalCreator, extract_roi

## a. Download beer and urine files

Here we download the beer and urine .mzML files used as examples in the paper, unless they are already present locally.

In [6]:
url = 'https://www.dropbox.com/s/e31prr0qlr625tv/example_data.zip?dl=1'
base_dir = os.path.join(os.getcwd(), 'example_data')

In [7]:
if not os.path.isdir(base_dir): # if not exist then download the example data and extract it
    print('Creating %s' % base_dir)    
    out_file = 'example_data.zip'
    download_file(url, out_file)
    extract_zip_file(out_file, delete=True)
else:
    print('Found %s' % base_dir)

Creating C:\Users\Vinny\work\vimms\examples\example_data
Downloading example_data.zip
861kKB [01:28, 9.69kKB/s]
Extracting example_data.zip
100%|██████████| 101/101 [00:11<00:00,  9.18it/s]


Deleting example_data.zip
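For readers who want to see the download-if-missing pattern without ViMMS, here is a minimal sketch using only the Python standard library. The names `unpack` and `fetch_and_extract` are hypothetical and are not the ViMMS implementations of `download_file` and `extract_zip_file`. The sketch also assumes the archive contains a single top-level folder with the same name as the target directory.

```python
import os
import urllib.request
import zipfile


def unpack(zip_path, dest_dir, delete=True):
    """Extract every member of zip_path into dest_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    if delete:
        os.remove(zip_path)  # same idea as delete=True in the cell above


def fetch_and_extract(url, zip_path, target_dir, delete=True):
    """Download and unpack the archive only if target_dir does not exist yet.

    Returns True if a download happened, False if the data was already there.
    Assumption: the archive unpacks into a folder named like target_dir.
    """
    if os.path.isdir(target_dir):
        return False
    urllib.request.urlretrieve(url, zip_path)
    unpack(zip_path, os.path.dirname(target_dir) or '.', delete=delete)
    return True
```

Checking for the directory first makes the cell safe to re-run: a second run returns immediately instead of downloading the archive again.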


## b. Download metabolites from HMDB

Next we load a pre-processed pickled file of database metabolites from the `base_dir` folder. If the file is not found, we create it by downloading and extracting the metabolites from HMDB.

In [8]:
compound_file = os.path.join(base_dir, 'hmdb_compounds.p')
hmdb_compounds = load_obj(compound_file)
if hmdb_compounds is None: # if the pickle is missing or invalid

    # download the entire HMDB metabolite database, big and slow!!
    url = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip'

    # download a smaller urine metabolite database for testing
    # url = 'http://www.hmdb.ca/system/downloads/current/urine_metabolites.zip'

    out_file = download_file(url)
    # assign to hmdb_compounds so the variable is usable after this cell on both branches
    hmdb_compounds = extract_hmdb_metabolite(out_file, delete=True)
    save_obj(hmdb_compounds, compound_file)

else:
    print('Loaded %d DatabaseCompounds from %s' % (len(hmdb_compounds), compound_file))

Old, invalid or missing pickle in C:\Users\Vinny\work\vimms\examples\example_data\hmdb_compounds.p. Please regenerate this file.


Downloading hmdb_metabolites.zip
629kKB [05:58, 1.75kKB/s]


Extracting HMDB metabolites from hmdb_metabolites.zip
Loaded 114087 DatabaseCompounds from hmdb_metabolites.zip
Deleting hmdb_metabolites.zip
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\hmdb_compounds.p
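The cell above follows a load-or-build caching pattern: try to read a pickle, and only do the slow work when that fails. The sketch below shows the same idea with the standard library. The helper names `load_pickle_or_none` and `load_or_build` are hypothetical; they only approximate the behaviour suggested by the output above, where `load_obj` reports an invalid or missing pickle and the code then checks for `None`.

```python
import pickle


def load_pickle_or_none(path):
    """Return the unpickled object, or None if the file is missing or truncated."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        return None


def load_or_build(path, build):
    """Load a cached object from path, or call build() once and cache its result."""
    obj = load_pickle_or_none(path)
    if obj is None:
        obj = build()  # the slow step, e.g. downloading and parsing a database
        with open(path, 'wb') as f:
            pickle.dump(obj, f)
    return obj
```

With this shape the returned object is available on both branches, whether it came from the cache or was freshly built.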


## c. Train the KDEs

In this section we demonstrate how ViMMS can be used to train kernel density estimators (KDEs) on the example Beer mzML files. The KDEs will be used to sample the MS1/MS2 data for chemicals, scan durations and number of peaks during simulation.

The following two functions from ViMMS will be used:
- `get_data_source` loads a `DataSource` object that stores information on a set of .mzML files.
- `train_kdes` trains KDEs on the .mzML files that have been loaded into the `DataSource`.

The parameters below should work for most cases. For different data, it might be necessary to adjust the `min_rt` and `max_rt` values.

In [9]:
filename = None                    # if None, use all mzML files found
min_ms1_intensity = 0              # min MS1 intensity threshold to include a data point for density estimation
min_ms2_intensity = 0              # min MS2 intensity threshold to include a data point for density estimation
min_rt = 0                         # min RT to include a data point for density estimation
max_rt = 1440                      # max RT to include a data point for density estimation
bandwidth_mz_intensity_rt = 1.0    # kernel bandwidth parameter to sample (mz, RT, intensity) values during simulation
bandwidth_n_peaks = 1.0            # kernel bandwidth parameter to sample number of peaks per scan during simulation
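To give some intuition for the thresholds and the bandwidth parameters, here is a small pure-Python sketch of filtering data points and then sampling from a Gaussian KDE. It is not how `train_kdes` is implemented; the function names are hypothetical, the thresholds are assumed to be inclusive, and the KDE is fitted on raw values purely for illustration (ViMMS may transform the data first).

```python
import random


def filter_points(points, min_intensity, min_rt, max_rt):
    """Keep (mz, rt, intensity) points that pass the intensity and RT thresholds."""
    return [p for p in points
            if p[2] >= min_intensity and min_rt <= p[1] <= max_rt]


def sample_gaussian_kde(points, n, bandwidth, rng):
    """Draw n samples from a Gaussian KDE fitted to points.

    Sampling from a Gaussian KDE is equivalent to picking a training point
    uniformly at random and adding zero-mean Gaussian noise, with standard
    deviation equal to the bandwidth, to each coordinate.
    """
    samples = []
    for _ in range(n):
        centre = rng.choice(points)
        samples.append(tuple(rng.gauss(x, bandwidth) for x in centre))
    return samples


# Toy data: the third point lies outside the RT window and is dropped.
points = [(152.32, 363.7, 4.2e5), (575.37, 788.4, 7.2e5), (106.48, 1500.0, 1.6e5)]
kept = filter_points(points, min_intensity=0, min_rt=0, max_rt=1440)
samples = sample_gaussian_kde(kept, n=5, bandwidth=1.0, rng=random.Random(0))
```

A larger bandwidth spreads the samples further from the observed points, while a bandwidth of zero would simply resample the training data.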

### Load fullscan data and train KDEs

In [10]:
mzml_path = os.path.join(base_dir, 'beers', 'fullscan', 'mzML')  # separate components keep the path portable
xcms_output = os.path.join(mzml_path, 'extracted_peaks_ms1.csv')
out_file = os.path.join(base_dir, 'peak_sampler_mz_rt_int_19_beers_fullscan.p')

In [11]:
ds_fullscan = get_data_source(mzml_path, filename, xcms_output)

C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_10_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_11_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_12_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_13_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_14_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_15_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_16_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_17_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_18_fullscan1.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeer
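The listing above shows the files picked up when `filename` is `None`. As a rough illustration of that file-discovery step, the sketch below collects the .mzML files in a folder with `glob`. The helper `list_mzml_files` is hypothetical and is not part of ViMMS; how `get_data_source` actually finds its files is not shown in this notebook.

```python
import glob
import os


def list_mzml_files(mzml_dir, filename=None):
    """Return sorted .mzML paths in mzml_dir, optionally restricted to one file name."""
    paths = sorted(glob.glob(os.path.join(mzml_dir, '*.mzML')))
    if filename is not None:
        # keep only the file whose base name matches exactly
        paths = [p for p in paths if os.path.basename(p) == filename]
    return paths
```

Sorting the result makes the processing order reproducible, since `glob` itself does not guarantee any particular order.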

In [12]:
ps = train_kdes(ds_fullscan, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to C:\Users\Vinny\work\vimms\examples\example_data\peak_sampler_mz_rt_int_19_beers_fullscan.p


In [13]:
ps.sample(1, 10) # try to sample 10 MS1 peaks

[Peak mz=152.3247 rt=363.66 intensity=422007.00 ms_level=1,
 Peak mz=154.6593 rt=389.55 intensity=897237.00 ms_level=1,
 Peak mz=121.8909 rt=190.14 intensity=25077.75 ms_level=1,
 Peak mz=183.1966 rt=432.24 intensity=72298.41 ms_level=1,
 Peak mz=575.3722 rt=788.41 intensity=717232.76 ms_level=1,
 Peak mz=347.0060 rt=268.98 intensity=395210.81 ms_level=1,
 Peak mz=234.4411 rt=885.68 intensity=1143554.72 ms_level=1,
 Peak mz=503.0397 rt=278.52 intensity=391711.84 ms_level=1,
 Peak mz=184.7087 rt=239.91 intensity=98432.67 ms_level=1,
 Peak mz=106.4837 rt=819.51 intensity=162446.04 ms_level=1]

### Load fragmentation data and train KDEs

In [14]:
mzml_path = os.path.join(base_dir, 'beers', 'fragmentation', 'mzML')  # separate components keep the path portable
xcms_output = os.path.join(mzml_path, 'extracted_peaks_ms1.csv')
out_file = os.path.join(base_dir, 'peak_sampler_mz_rt_int_19_beers_fragmentation.p')

In [15]:
ds_fragmentation = get_data_source(mzml_path, filename, xcms_output)

C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_10_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_11_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_12_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_13_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_14_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_15_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_16_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_17_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_18_T10_POS.mzML
C:\Users\Vinny\work\vimms\examples\example_data\beers\f

In [16]:
ps = train_kdes(ds_fragmentation, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to C:\Users\Vinny\work\vimms\examples\example_data\peak_sampler_mz_rt_int_19_beers_fragmentation.p


In [17]:
ps.sample(1, 10)

[Peak mz=242.9349 rt=216.27 intensity=200350.80 ms_level=1,
 Peak mz=347.0537 rt=215.20 intensity=17077863.42 ms_level=1,
 Peak mz=143.2256 rt=622.98 intensity=852572.79 ms_level=1,
 Peak mz=521.2346 rt=915.40 intensity=8678.54 ms_level=1,
 Peak mz=249.1908 rt=1429.12 intensity=35968.48 ms_level=1,
 Peak mz=186.8856 rt=289.78 intensity=145998.84 ms_level=1,
 Peak mz=190.7221 rt=366.63 intensity=12419526.92 ms_level=1,
 Peak mz=327.0626 rt=454.04 intensity=81783.17 ms_level=1,
 Peak mz=353.5483 rt=1014.62 intensity=33889.83 ms_level=1,
 Peak mz=84.4429 rt=311.53 intensity=1465491.45 ms_level=1]

In [18]:
ps.sample(2, 10)

[Peak mz=340.9330 rt=250.82 intensity=653130.95 ms_level=2,
 Peak mz=83.7039 rt=798.45 intensity=280898.96 ms_level=2,
 Peak mz=394.7491 rt=239.99 intensity=965731.59 ms_level=2,
 Peak mz=332.3457 rt=253.90 intensity=37007.23 ms_level=2,
 Peak mz=233.4652 rt=533.29 intensity=238301.41 ms_level=2,
 Peak mz=449.6352 rt=253.34 intensity=26503.94 ms_level=2,
 Peak mz=211.1745 rt=439.53 intensity=41537.50 ms_level=2,
 Peak mz=340.2570 rt=91.85 intensity=110816.13 ms_level=2,
 Peak mz=148.1131 rt=357.05 intensity=3103692.52 ms_level=2,
 Peak mz=401.4385 rt=286.09 intensity=684009.30 ms_level=2]

## d. Extract the ROIs for DsDA Experiments

In [29]:
roi_mz_tol = 10
roi_min_length = 2
roi_min_intensity = 1.75E5
roi_start_rt = min_rt
roi_stop_rt = max_rt
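To illustrate what parameters like these typically control, here is a simplified sketch of ROI building: peaks in consecutive scans are chained together when their m/z values agree within a tolerance, and short or weak chains are discarded. This is not the ViMMS algorithm. It assumes the m/z tolerance is in ppm, that the minimum length counts data points, and that the intensity threshold applies to the most intense point of a ROI; all function names are hypothetical.

```python
def within_ppm(mz, ref_mz, tol_ppm):
    """True if mz is within tol_ppm parts-per-million of ref_mz."""
    return abs(mz - ref_mz) / ref_mz * 1e6 <= tol_ppm


def mean_mz(roi):
    """Mean m/z of the points collected in a ROI so far."""
    return sum(p[0] for p in roi) / len(roi)


def build_rois(scans, mz_tol_ppm, min_length, min_intensity, start_rt, stop_rt):
    """Group peaks from consecutive scans into ROIs.

    scans: list of (rt, [(mz, intensity), ...]) sorted by rt.
    Returns a list of ROIs, each a list of (mz, rt, intensity) tuples.
    """
    open_rois, finished = [], []
    for rt, peaks in scans:
        if not start_rt <= rt <= stop_rt:
            continue
        still_open = []
        for mz, intensity in peaks:
            idx = next((i for i, r in enumerate(open_rois)
                        if within_ppm(mz, mean_mz(r), mz_tol_ppm)), None)
            if idx is None:
                still_open.append([(mz, rt, intensity)])  # start a new ROI
            else:
                roi = open_rois.pop(idx)                  # extend an existing ROI
                roi.append((mz, rt, intensity))
                still_open.append(roi)
        finished.extend(open_rois)  # ROIs with no matching peak in this scan end here
        open_rois = still_open
    finished.extend(open_rois)
    return [r for r in finished
            if len(r) >= min_length and max(p[2] for p in r) >= min_intensity]


# Toy scans: only the trace near m/z 100 spans several scans and is intense enough.
scans = [
    (1.0, [(100.0000, 2e5), (200.0, 1e3)]),
    (2.0, [(100.0005, 3e5)]),
    (3.0, [(100.0004, 1e5), (300.0, 5e5)]),
]
rois = build_rois(scans, mz_tol_ppm=10, min_length=2, min_intensity=1.75e5,
                  start_rt=0, stop_rt=1440)
```

In this toy example the peaks at m/z 200 and 300 each appear in a single scan, so they fail the minimum-length filter regardless of intensity.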

### Extract beer ROIs

In [38]:
file_names = glob.glob(os.path.join(base_dir, 'beers', 'fragmentation', 'mzML', '*.mzML'))
out_dir = os.path.join(base_dir, 'DsDA', 'DsDA_Beer', 'beer_t10_simulator_files', '')  # trailing '' keeps the final separator
mzml_path = os.path.join(base_dir, 'beers', 'fragmentation', 'mzML')

extract_roi(file_names, out_dir, 'beer_%d.p', mzml_path, ps)

C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_10_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Beer\beer_t10_simulator_files\beer_10.p
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_11_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Beer\beer_t10_simulator_files\beer_11.p
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_12_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Beer\beer_t10_simulator_files\beer_12.p
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_13_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Beer\beer_t10_simulator_files\beer_13.p
C:\Users\Vinny\work\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers

### Extract urine ROIs

In [39]:
file_names = glob.glob(os.path.join(base_dir, 'urines', 'fragmentation', 'mzML', '*.mzML'))
out_dir = os.path.join(base_dir, 'DsDA', 'DsDA_Urine', 'urine_t10_simulator_files', '')  # trailing '' keeps the final separator
mzml_path = os.path.join(base_dir, 'urines', 'fragmentation', 'mzML')

extract_roi(file_names, out_dir, 'urine_%d.p', mzml_path, ps)

C:\Users\Vinny\work\vimms\examples\example_data\urines\fragmentation\mzML\Urine_StrokeDrugs_02_T10_POS.mzML
Created C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Urine\urine_t10_simulator_files
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Urine\urine_t10_simulator_files\urine_2.p
C:\Users\Vinny\work\vimms\examples\example_data\urines\fragmentation\mzML\Urine_StrokeDrugs_03_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Urine\urine_t10_simulator_files\urine_3.p
C:\Users\Vinny\work\vimms\examples\example_data\urines\fragmentation\mzML\Urine_StrokeDrugs_08_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Urine\urine_t10_simulator_files\urine_8.p
C:\Users\Vinny\work\vimms\examples\example_data\urines\fragmentation\mzML\Urine_StrokeDrugs_09_T10_POS.mzML
Saving <class 'list'> to C:\Users\Vinny\work\vimms\examples\example_data\DsDA\DsDA_Urine\urine_
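The outputs above suggest that each pickle is named by filling the `%d` pattern with the sample number found in the mzML file name, for example `Urine_StrokeDrugs_02_T10_POS.mzML` becomes `urine_2.p`. The sketch below reproduces that naming rule. The helper `roi_output_name` and its regular expression are assumptions inferred from the file names shown here, not the actual logic inside `extract_roi`.

```python
import os
import re


def roi_output_name(mzml_path, pattern):
    """Build an output file name from the first underscore-delimited number in the mzML name."""
    base = os.path.basename(mzml_path)
    match = re.search(r'_(\d+)_', base)
    if match is None:
        raise ValueError('no sample number found in %s' % base)
    # int() drops leading zeros, so '02' is written as 2
    return pattern % int(match.group(1))


beer_name = roi_output_name('Beer_multibeers_10_T10_POS.mzML', 'beer_%d.p')
urine_name = roi_output_name('Urine_StrokeDrugs_02_T10_POS.mzML', 'urine_%d.p')
```

Note that the `T10` token is not picked up because it is not a bare number between underscores.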