# 01. Download Data

This notebook downloads the example data used by the other notebooks. In particular, it does the following:

- Downloads the beer and urine .mzML files used as examples in the paper.
- Downloads the HMDB database and extracts metabolites.
- Trains kernel density estimators on the mzML files.
- Extracts regions of interest (ROIs) from the mzML files.

**Please run this notebook first to make sure the data files are available for subsequent notebooks.**

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [20]:
import os
import glob

In [4]:
import sys
sys.path.append('..')

In [33]:
from vimms.DataGenerator import download_file, extract_hmdb_metabolite, extract_zip_file, get_data_source, train_kdes
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import SimpleMs1Controller
from vimms.Common import *
from vimms.Roi import make_roi, RoiToChemicalCreator

## a. Download beer and urine files

Here we download the beer and urine .mzML files used as examples in the paper, unless the `example_data` folder already exists locally.

In [6]:
url = 'https://www.dropbox.com/s/e31prr0qlr625tv/example_data.zip?dl=1'
base_dir = os.path.join(os.getcwd(), 'example_data')

In [7]:
if not os.path.isdir(base_dir): # if not exist then download the example data and extract it
    print('Creating %s' % base_dir)    
    out_file = 'example_data.zip'
    download_file(url, out_file)
    extract_zip_file(out_file, delete=True)
else:
    print('Found %s' % base_dir)

Creating C:\Users\joewa\Work\git\vimms\examples\example_data
Downloading example_data.zip


861kKB [01:35, 8.99kKB/s]                                                                                                         


Extracting example_data.zip


100%|███████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:36<00:00,  6.02it/s]


Deleting example_data.zip
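The check-then-download guard above can be written as a small reusable helper. The following is a minimal sketch using only the standard library, not the ViMMS implementation: `ensure_extracted` and `fake_fetch` are hypothetical names, and the "download" is replaced by a function that writes a tiny local archive so the example runs offline.

```python
import os
import tempfile
import zipfile


def ensure_extracted(zip_path, target_dir, fetch):
    """Fetch and extract an archive unless target_dir already exists.

    `fetch` is a hypothetical callable that creates zip_path (e.g. a download).
    Returns True if the data was fetched, False if it was already present.
    """
    if os.path.isdir(target_dir):
        return False
    fetch(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(os.path.dirname(target_dir))
    os.remove(zip_path)  # remove the archive once extracted
    return True


def fake_fetch(zip_path):
    # Stand-in for a real download: writes a tiny archive locally.
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('example_data/readme.txt', 'placeholder')


work_dir = tempfile.mkdtemp()
target = os.path.join(work_dir, 'example_data')
archive_path = os.path.join(work_dir, 'example_data.zip')

first = ensure_extracted(archive_path, target, fake_fetch)   # fetches and extracts
second = ensure_extracted(archive_path, target, fake_fetch)  # finds the folder, does nothing
print(first, second)
```

Because the guard only tests for the folder, a partially extracted folder would be treated as complete; deleting the folder forces a fresh download.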


## b. Download metabolites from HMDB

Next we load a pre-processed pickled file of database metabolites from the `base_dir` folder. If the file is not found, we create it by downloading and extracting the metabolites from HMDB.

In [8]:
compound_file = os.path.join(base_dir, 'hmdb_compounds.p')
hmdb_compounds = load_obj(compound_file)
if hmdb_compounds is None: # if file does not exist

    # download the entire HMDB metabolite database, big and slow!!
    url = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip'

    # download a smaller urine metabolite database for testing
    # url = 'http://www.hmdb.ca/system/downloads/current/urine_metabolites.zip'

    out_file = download_file(url)
    hmdb_compounds = extract_hmdb_metabolite(out_file, delete=True)  # keep the result in hmdb_compounds
    save_obj(hmdb_compounds, compound_file)

else:
    print('Loaded %d DatabaseCompounds from %s' % (len(hmdb_compounds), compound_file))

Old, invalid or missing pickle in C:\Users\joewa\Work\git\vimms\examples\example_data\hmdb_compounds.p. Please regenerate this file.


Downloading urine_metabolites.zip


26.7kKB [00:51, 518KB/s]                                                                                                          


Extracting HMDB metabolites from urine_metabolites.zip
Loaded 4236 DatabaseCompounds from urine_metabolites.zip
Deleting urine_metabolites.zip
Saving <class 'list'> to C:\Users\joewa\Work\git\vimms\examples\example_data\hmdb_compounds.p
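The load-or-create pattern used in this cell can be sketched generically with the standard `pickle` module. This is an illustration under stated assumptions, not how `load_obj` and `save_obj` are implemented: `load_or_create` and `build_compounds` are hypothetical names, and the slow download is replaced by a stub.

```python
import os
import pickle
import tempfile


def load_or_create(path, create):
    """Return the pickled object at path, building and saving it if absent or unreadable."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        obj = create()  # the slow step, e.g. download and parse
        with open(path, 'wb') as f:
            pickle.dump(obj, f)
        return obj


calls = []


def build_compounds():
    # Hypothetical stand-in for downloading and parsing a database.
    calls.append(1)
    return ['compound_a', 'compound_b']


cache_path = os.path.join(tempfile.mkdtemp(), 'compounds.p')
first = load_or_create(cache_path, build_compounds)   # builds and saves
second = load_or_create(cache_path, build_compounds)  # loads from the cache
print(first == second, len(calls))
```

The expensive step runs only on the first call; later runs read the cached file, which is why re-running this notebook is fast once the pickle exists.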


## c. Train the KDEs

In this section we demonstrate how ViMMS can be used to train kernel density estimators (KDEs) on the example Beer mzML files. The KDEs will be used to sample the MS1/MS2 data for chemicals, scan durations and number of peaks during simulation.

The following two functions from ViMMS will be used:
- `get_data_source` loads a `DataSource` object that stores information on a set of .mzML files.
- `train_kdes` trains KDEs on the .mzML files that have been loaded into the `DataSource`.

The parameters below should work in most cases. For different data, it might be necessary to adjust the `min_rt` and `max_rt` values.

In [9]:
filename = None                    # if None, use all mzML files found
min_ms1_intensity = 0              # min MS1 intensity threshold to include a data point for density estimation
min_ms2_intensity = 0              # min MS2 intensity threshold to include a data point for density estimation
min_rt = 0                         # min RT to include a data point for density estimation
max_rt = 1440                      # max RT to include a data point for density estimation
bandwidth_mz_intensity_rt = 1.0    # kernel bandwidth parameter to sample (mz, RT, intensity) values during simulation
bandwidth_n_peaks = 1.0            # kernel bandwidth parameter to sample number of peaks per scan during simulation
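To make the role of these parameters concrete, here is a minimal standard-library sketch of filtering data points and then sampling from a Gaussian KDE. It is not the `train_kdes` implementation: the function names, the inclusive RT bounds, and the observations are all made up for illustration. It relies on the fact that drawing from a Gaussian KDE is the same as picking a training point at random and adding Gaussian noise whose standard deviation is the bandwidth.

```python
import random


def filter_points(points, min_rt, max_rt, min_intensity):
    """Keep (mz, rt, intensity) points inside the RT window and above the threshold."""
    return [p for p in points
            if min_rt <= p[1] <= max_rt and p[2] >= min_intensity]


def sample_gaussian_kde(points, bandwidth, n, rng):
    """Draw n samples from a Gaussian KDE fitted to points.

    Each sample is a randomly chosen training point plus Gaussian noise
    with standard deviation equal to the bandwidth, per dimension.
    """
    samples = []
    for _ in range(n):
        centre = rng.choice(points)
        samples.append(tuple(rng.gauss(x, bandwidth) for x in centre))
    return samples


# Made-up observations: (mz, rt, intensity)
observed = [
    (100.0, 50.0, 5.0),
    (200.0, 300.0, 6.5),
    (300.0, 2000.0, 7.0),  # outside the RT window below, so it is dropped
]
kept = filter_points(observed, min_rt=0, max_rt=1440, min_intensity=0)
samples = sample_gaussian_kde(kept, bandwidth=1.0, n=5, rng=random.Random(0))
print(len(kept), len(samples))
```

A larger bandwidth spreads the samples further from the observed points, while a smaller one keeps them close to the training data.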

### Load fullscan data and train KDEs

In [10]:
mzml_path = os.path.join(base_dir, 'beers', 'fullscan', 'mzML')  # portable across operating systems
xcms_output = os.path.join(mzml_path, 'extracted_peaks_ms1.csv')
out_file = os.path.join(base_dir, 'peak_sampler_mz_rt_int_19_beers_fullscan.p')

In [11]:
ds_fullscan = get_data_source(mzml_path, filename, xcms_output)

C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_10_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_11_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_12_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_13_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_14_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_15_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_16_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_17_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fullscan\mzML\Beer_multibeers_18_fullscan1.mzML
C:\Users\joewa\Work\git\vimms\examples\example

In [12]:
ps = train_kdes(ds_fullscan, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to C:\Users\joewa\Work\git\vimms\examples\example_data\peak_sampler_mz_rt_int_19_beers_fullscan.p


In [13]:
ps.sample(1, 10) # try to sample 10 MS1 peaks

[Peak mz=316.4197 rt=281.67 intensity=2543492.23 ms_level=1,
 Peak mz=583.4724 rt=843.11 intensity=99232.01 ms_level=1,
 Peak mz=353.0622 rt=86.82 intensity=18960.96 ms_level=1,
 Peak mz=87.0348 rt=21.23 intensity=7428.11 ms_level=1,
 Peak mz=212.6987 rt=405.33 intensity=303436.80 ms_level=1,
 Peak mz=138.9501 rt=885.20 intensity=103075.46 ms_level=1,
 Peak mz=99.4881 rt=701.33 intensity=64470.01 ms_level=1,
 Peak mz=253.6707 rt=437.19 intensity=14910756.25 ms_level=1,
 Peak mz=149.6830 rt=836.28 intensity=207035.59 ms_level=1,
 Peak mz=686.5891 rt=877.72 intensity=26729.55 ms_level=1]

### Load fragmentation data and train KDEs

In [14]:
mzml_path = os.path.join(base_dir, 'beers', 'fragmentation', 'mzML')  # portable across operating systems
xcms_output = os.path.join(mzml_path, 'extracted_peaks_ms1.csv')
out_file = os.path.join(base_dir, 'peak_sampler_mz_rt_int_19_beers_fragmentation.p')

In [15]:
ds_fragmentation = get_data_source(mzml_path, filename, xcms_output)

C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_10_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_11_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_12_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_13_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_14_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_15_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_16_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_17_T10_POS.mzML
C:\Users\joewa\Work\git\vimms\examples\example_data\beers\fragmentation\mzML\Beer_multibeers_18_T10_POS.mzML
C:\Users\joewa\Work

In [16]:
ps = train_kdes(ds_fragmentation, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

Saving <class 'vimms.DataGenerator.PeakSampler'> to C:\Users\joewa\Work\git\vimms\examples\example_data\peak_sampler_mz_rt_int_19_beers_fragmentation.p


In [17]:
ps.sample(1, 10)

[Peak mz=189.3259 rt=88.98 intensity=106524.58 ms_level=1,
 Peak mz=814.4592 rt=12.13 intensity=19607.26 ms_level=1,
 Peak mz=341.8021 rt=245.03 intensity=1315290.19 ms_level=1,
 Peak mz=348.8904 rt=1219.51 intensity=214658.35 ms_level=1,
 Peak mz=423.7101 rt=263.68 intensity=617912.08 ms_level=1,
 Peak mz=332.5372 rt=1024.80 intensity=65674.70 ms_level=1,
 Peak mz=323.4865 rt=247.99 intensity=526561.93 ms_level=1,
 Peak mz=191.1670 rt=345.05 intensity=6732591.55 ms_level=1,
 Peak mz=253.1869 rt=755.69 intensity=10215486.47 ms_level=1,
 Peak mz=323.1448 rt=257.33 intensity=40547.19 ms_level=1]

In [18]:
ps.sample(2, 10)

[Peak mz=319.4794 rt=558.47 intensity=90183.42 ms_level=2,
 Peak mz=136.3747 rt=736.58 intensity=418773.96 ms_level=2,
 Peak mz=129.1146 rt=94.13 intensity=2195271.72 ms_level=2,
 Peak mz=212.5952 rt=286.69 intensity=1024591.33 ms_level=2,
 Peak mz=110.7040 rt=671.13 intensity=472752.02 ms_level=2,
 Peak mz=515.8680 rt=280.79 intensity=91414.86 ms_level=2,
 Peak mz=188.9713 rt=1354.77 intensity=109416.12 ms_level=2,
 Peak mz=619.2535 rt=1006.72 intensity=23857.46 ms_level=2,
 Peak mz=688.8448 rt=872.93 intensity=2549969.43 ms_level=2,
 Peak mz=333.5049 rt=879.18 intensity=55417.64 ms_level=2]

## d. Extract the ROIs for DsDA Experiments

In [36]:
def extract_roi(file_names, out_dir, pattern):
    roi_mz_tol = 10              # m/z tolerance, in ppm (see mz_units below)
    roi_min_length = 2
    roi_min_intensity = 1.75E5
    roi_start_rt = min_rt        # reuse the RT window defined in section c
    roi_stop_rt = max_rt

    for mzml_file in file_names: # glob returns full paths, so use them directly
        # extract ROI
        good_roi, junk = make_roi(mzml_file, mz_tol=roi_mz_tol, mz_units='ppm', min_length=roi_min_length,
                                  min_intensity=roi_min_intensity, start_rt=roi_start_rt, stop_rt=roi_stop_rt)

        # turn ROI to chemicals, using the PeakSampler `ps` trained in section c
        rtcc = RoiToChemicalCreator(ps, good_roi)
        data = rtcc.chemicals

        # save extracted chemicals, numbered by the third '_'-separated field of the file name
        basename = os.path.basename(mzml_file)
        out_name = pattern % int(basename.split('_')[2])
        save_obj(data, os.path.join(out_dir, out_name))
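Two details of `extract_roi` are easy to check in isolation: what a tolerance of 10 ppm means, and how the output file name is derived from the input name. The sketch below uses the standard definition of parts per million and the same `split('_')[2]` rule as the function above. `within_ppm` and `output_name` are hypothetical helpers, and how `make_roi` applies its tolerance internally is not shown here.

```python
import os


def within_ppm(mz_ref, mz, tol_ppm):
    """True if mz differs from mz_ref by at most tol_ppm parts per million."""
    return abs(mz - mz_ref) / mz_ref * 1e6 <= tol_ppm


def output_name(mzml_file, pattern):
    """Build the output name from the third '_'-separated field of the file name."""
    basename = os.path.basename(mzml_file)
    return pattern % int(basename.split('_')[2])


close = within_ppm(100.0, 100.0005, 10)  # about 5 ppm apart
far = within_ppm(100.0, 100.002, 10)     # about 20 ppm apart
name = output_name('Beer_multibeers_10_T10_POS.mzML', 'beer_%d.p')
print(close, far, name)  # True False beer_10.p
```

The naming rule assumes every input file name has an integer in its third underscore-separated field. A file that does not follow this convention would raise a `ValueError` or `IndexError`.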

### Extract beer ROIs

In [None]:
file_names = glob.glob(os.path.join(base_dir, 'beers', 'fragmentation', 'mzML', '*.mzML'))
out_dir = os.path.join(base_dir, 'DsDA', 'DsDA_Beer', 'beer_t10_simulator_files')

extract_roi(file_names, out_dir, 'beer_%d.p')

### Extract urine ROIs

In [None]:
file_names = glob.glob(os.path.join(base_dir, 'urines', 'fragmentation', 'mzML', '*.mzML'))
out_dir = os.path.join(base_dir, 'DsDA', 'DsDA_Urine', 'urine_t10_simulator_files')

extract_roi(file_names, out_dir, 'urine_%d.p')