# 1. Download Data

This notebook downloads the example data used in the other notebooks. In particular, it does the following:

- Downloads the beer and urine .mzML files used as examples in the paper.
- Downloads the HMDB database and extracts metabolites.
- Trains kernel density estimators on the mzML files.
- Extracts regions of interest from the mzML files.

**Please run this notebook first to make sure the data files are available for subsequent notebooks. It might take a while, so please be patient and let the notebook run to completion.**

The data files downloaded by this notebook should contain nearly everything needed to replicate the results in the paper. If you want to run the simulation on your own data, replace the paths below so that they point to your files.

Alternatively, if you just want to quickly try running some controllers (fragmentation strategies) using our test fixtures, please take a look at the test cases instead.

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from pathlib import Path
import glob
import os

In [4]:
import sys
sys.path.append('../..')

In [5]:
from vimms.DataGenerator import extract_hmdb_metabolite, get_data_source, get_spectral_feature_database
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import SimpleMs1Controller
from vimms.Common import *
from vimms.Roi import make_roi, RoiToChemicalCreator, extract_roi

In [6]:
# set_log_level_info()
set_log_level_debug()

## a. Download beer and urine files

Here we download the beer and urine .mzML files used as examples in the paper, unless they already exist locally.

In [7]:
url = 'http://researchdata.gla.ac.uk/870/2/example_data.zip'
base_dir = os.path.join(os.getcwd(), 'example_data')

In [8]:
if not os.path.isdir(base_dir): # download and extract the example data if it does not exist yet
    print('Creating %s' % base_dir)
    out_file = 'example_data.zip'
    download_file(url, out_file)
    extract_zip_file(out_file, delete=True)
else:
    print('Found %s' % base_dir)

2019-12-12 10:26:22.763 | INFO     | vimms.Common:download_file:180 - Downloading example_data.zip
  0%|          | 579/869k [00:00<02:30, 5.79kKB/s]

Creating /home/joewandy/git/vimms/examples/example_data


869kKB [01:10, 12.3kKB/s]                           
2019-12-12 10:27:33.470 | INFO     | vimms.Common:extract_zip_file:192 - Extracting example_data.zip
100%|██████████| 110/110 [00:10<00:00, 10.88it/s]
2019-12-12 10:27:43.586 | INFO     | vimms.Common:extract_zip_file:198 - Deleting example_data.zip
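
The download-if-missing pattern used above can be sketched with the standard library alone. This is a minimal sketch and not the ViMMS implementation: `fetch_and_extract` is a hypothetical helper, and it assumes the archive unpacks into a top-level folder with the same name as `target_dir`.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, target_dir, zip_name='example_data.zip'):
    """Download and unpack a zip archive unless target_dir already exists.

    Hypothetical helper for illustration; not part of ViMMS.
    Returns True if a download happened, False if the data was already there.
    """
    target_dir = Path(target_dir)
    if target_dir.is_dir():
        print('Found %s' % target_dir)
        return False

    print('Creating %s' % target_dir)
    zip_path = target_dir.parent / zip_name
    urllib.request.urlretrieve(url, str(zip_path))   # fetch the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir.parent)             # unpack next to the archive
    zip_path.unlink()                                # remove the archive afterwards
    return True
```

Because the check is on the extracted folder rather than on the archive, re-running the notebook skips the download entirely once the data is in place.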


## b. Download metabolites from HMDB

Next we load a pre-processed pickled file of database metabolites from the `base_dir` folder. If the file is not found, we create it by downloading and extracting the metabolites from HMDB.

In [9]:
compound_file = Path(base_dir, 'hmdb_compounds.p')
hmdb_compounds = load_obj(compound_file)
if hmdb_compounds is None: # if file does not exist

    # download the entire HMDB metabolite database
    url = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip'

    out_file = download_file(url)
    # assign to hmdb_compounds so the variable is populated in both branches
    hmdb_compounds = extract_hmdb_metabolite(out_file, delete=True)
    save_obj(hmdb_compounds, compound_file)

else:
    print('Loaded %d DatabaseCompounds from %s' % (len(hmdb_compounds), compound_file))

2019-12-12 10:27:43.869 | INFO     | vimms.Common:download_file:180 - Downloading hmdb_metabolites.zip
629kKB [01:26, 7.27kKB/s]                           
2019-12-12 10:29:10.440 | DEBUG    | vimms.DataGenerator:extract_hmdb_metabolite:21 - Extracting HMDB metabolites from hmdb_metabolites.zip
2019-12-12 10:31:09.144 | INFO     | vimms.DataGenerator:extract_hmdb_metabolite:57 - Loaded 114087 DatabaseCompounds from hmdb_metabolites.zip
2019-12-12 10:31:09.145 | INFO     | vimms.DataGenerator:extract_hmdb_metabolite:64 - Deleting hmdb_metabolites.zip
2019-12-12 10:31:28.216 | INFO     | vimms.Common:save_obj:61 - Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/hmdb_compounds.p
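
The load-or-create caching used for the HMDB compounds can be written as a small generic helper. The sketch below uses only `pickle` from the standard library; `load_or_create` is a hypothetical name, and ViMMS itself relies on its own `load_obj` and `save_obj` functions.

```python
import pickle
from pathlib import Path


def load_or_create(cache_file, create_fn):
    """Return the pickled object in cache_file, building it first if needed.

    Hypothetical helper for illustration; not part of ViMMS.
    """
    cache_file = Path(cache_file)
    if cache_file.is_file():
        with open(cache_file, 'rb') as f:
            return pickle.load(f)

    obj = create_fn()   # the expensive step, e.g. download and parse
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(obj, f)
    return obj
```

Returning the object from both branches means the caller always receives the data, whether or not the cache file existed beforehand.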


## c. Generate Spectral Feature Database

In this section we demonstrate how ViMMS constructs the spectral feature database from the example beer mzML files. The database contains information such as the densities of m/z, RT and intensity values, scan durations and MS2 peaks. It is used later to sample these features during the simulation.

The following two functions from ViMMS are used:
- `get_data_source` loads a `DataSource` object that stores information on a set of .mzML files.
- `get_spectral_feature_database` extracts relevant features from the .mzML files that have been loaded into the `DataSource`.

The parameters below should work in most cases. For different data, it might be necessary to adjust the `min_rt` and `max_rt` values.

In [10]:
filename = None                    # if None, use all mzML files found
min_ms1_intensity = 0              # min MS1 intensity threshold to include a data point for density estimation
min_ms2_intensity = 0              # min MS2 intensity threshold to include a data point for density estimation
min_rt = 0                         # min RT to include a data point for density estimation
max_rt = 1440                      # max RT to include a data point for density estimation
bandwidth_mz_intensity_rt = 1.0    # kernel bandwidth parameter to sample (mz, RT, intensity) values during simulation
bandwidth_n_peaks = 1.0            # kernel bandwidth parameter to sample number of peaks per scan during simulation
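
To give some intuition for the bandwidth parameters, the sketch below samples from a one-dimensional Gaussian kernel density estimate using only the standard library. It illustrates the general technique and is not the ViMMS implementation: `sample_gaussian_kde`, `rt_lo`, `rt_hi` and the toy retention times are made up for this example, the RT filter is assumed to be an inclusive range, and ViMMS may transform or combine its variables differently.

```python
import random


def sample_gaussian_kde(data, bandwidth, n, seed=None):
    """Draw n samples from a Gaussian KDE fitted to 1-D data.

    Sampling from a Gaussian KDE is equivalent to picking a training point
    uniformly at random and adding Gaussian noise with std = bandwidth.
    """
    rng = random.Random(seed)
    return [rng.gauss(rng.choice(data), bandwidth) for _ in range(n)]


# toy retention times in seconds (illustrative values only)
rts = [12.0, 250.5, 251.0, 600.2, 1500.0]

# keep only points inside the RT window before fitting (assumed inclusive)
rt_lo, rt_hi = 0, 1440
kept = [rt for rt in rts if rt_lo <= rt <= rt_hi]   # drops 1500.0

narrow = sample_gaussian_kde(kept, bandwidth=1.0, n=5, seed=0)
wide = sample_gaussian_kde(kept, bandwidth=100.0, n=5, seed=0)
```

A smaller bandwidth keeps samples close to the observed values, while a larger one smooths the density and spreads samples further away from them.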

### Load fullscan data and train spectral feature database

In [11]:
mzml_path = Path(base_dir, 'beers', 'fullscan', 'mzML')
xcms_output = Path(mzml_path, 'extracted_peaks_ms1.csv')
out_file = Path(base_dir, 'peak_sampler_mz_rt_int_19_beers_fullscan.p')

In [12]:
ds_fullscan = get_data_source(mzml_path, filename, xcms_output)

2019-12-12 10:31:29.903 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_17_fullscan1.mzML
2019-12-12 10:31:30.748 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_4_fullscan1.mzML
2019-12-12 10:31:31.542 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_1_fullscan1.mzML
2019-12-12 10:31:32.334 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_6_fullscan1.mzML
2019-12-12 10:31:33.247 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_13_fullscan1.mzML
2019-12-12 10:31:34.065 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_15_fullscan1.mzML
2019-12-12 10:31:34.849 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_5_fullscan1.mzML
2019-12-12 10:31:35.629 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_2_fullscan1.mzML
2019-12-12 10:31:36.414 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer

In [13]:
ps = get_spectral_feature_database(ds_fullscan, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

2019-12-12 10:31:45.296 | DEBUG    | vimms.DataGenerator:__init__:436 - Extracted 0 MS2 scans
2019-12-12 10:31:45.297 | DEBUG    | vimms.DataGenerator:_compute_intensity_props:614 - Computing parent intensity proportions
2019-12-12 10:31:45.298 | DEBUG    | vimms.DataGenerator:__init__:445 - Extracting scan durations
2019-12-12 10:31:45.300 | DEBUG    | vimms.DataGenerator:_kde:626 - Training KDEs for ms_level=1
2019-12-12 10:31:45.300 | DEBUG    | vimms.DataGenerator:_kde:637 - Retrieving mz_intensity_rt values from <vimms.DataGenerator.DataSource object at 0x7f1ed650ada0>
2019-12-12 10:31:45.301 | INFO     | vimms.DataGenerator:get_data:278 - Using values from XCMS peaklist
2019-12-12 10:31:45.360 | DEBUG    | vimms.DataGenerator:_kde:637 - Retrieving n_peaks values from <vimms.DataGenerator.DataSource object at 0x7f1ed650ada0>
2019-12-12 10:32:35.025 | DEBUG    | vimms.DataGenerator:_kde:626 - Training KDEs for ms_level=2
2019-12-12 10:32:35.026 | DEBUG    | vimms.DataGenerator:_kde

In [14]:
ps.get_peak(1, 10) # try to sample 10 MS1 peaks

[Peak mz=548.7613 rt=850.36 intensity=556234.03 ms_level=1,
 Peak mz=367.6491 rt=125.38 intensity=25227.32 ms_level=1,
 Peak mz=239.1899 rt=491.71 intensity=168206.85 ms_level=1,
 Peak mz=365.8156 rt=211.44 intensity=5949036.80 ms_level=1,
 Peak mz=387.9753 rt=1342.40 intensity=95478.40 ms_level=1,
 Peak mz=257.4990 rt=437.52 intensity=965905.36 ms_level=1,
 Peak mz=251.5781 rt=967.22 intensity=532116.44 ms_level=1,
 Peak mz=347.3595 rt=293.30 intensity=154163.06 ms_level=1,
 Peak mz=75.2708 rt=255.89 intensity=157945.37 ms_level=1,
 Peak mz=171.8691 rt=1372.60 intensity=8399688.69 ms_level=1]

### Load fragmentation data and train spectral feature database

In [15]:
mzml_path = Path(base_dir, 'beers', 'fragmentation', 'mzML')
xcms_output = Path(mzml_path, 'extracted_peaks_ms1.csv')
out_file = Path(base_dir, 'peak_sampler_mz_rt_int_19_beers_fragmentation.p')

In [16]:
ds_fragmentation = get_data_source(mzml_path, filename, xcms_output)

2019-12-12 10:32:36.045 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_10_T10_POS.mzML
2019-12-12 10:32:41.635 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_2_T10_POS.mzML
2019-12-12 10:32:47.495 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_5_T10_POS.mzML
2019-12-12 10:32:52.976 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_13_T10_POS.mzML
2019-12-12 10:32:58.045 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_9_T10_POS.mzML
2019-12-12 10:33:03.562 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_7_T10_POS.mzML
2019-12-12 10:33:08.908 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_4_T10_POS.mzML
2019-12-12 10:33:14.083 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_19_T10_POS.mzML
2019-12-12 10:33:20.202 | INFO     | vimms.DataGenerator:load_data:161 - Loading Beer_multibeers_17_T

In [17]:
ps = get_spectral_feature_database(ds_fragmentation, filename, min_ms1_intensity, min_ms2_intensity, min_rt, max_rt,
               bandwidth_mz_intensity_rt, bandwidth_n_peaks, out_file)

2019-12-12 10:34:38.557 | DEBUG    | vimms.DataGenerator:__init__:436 - Extracted 138969 MS2 scans
2019-12-12 10:34:38.558 | DEBUG    | vimms.DataGenerator:_compute_intensity_props:614 - Computing parent intensity proportions
2019-12-12 10:34:53.152 | DEBUG    | vimms.DataGenerator:__init__:445 - Extracting scan durations
2019-12-12 10:34:53.159 | DEBUG    | vimms.DataGenerator:_kde:626 - Training KDEs for ms_level=1
2019-12-12 10:34:53.159 | DEBUG    | vimms.DataGenerator:_kde:637 - Retrieving mz_intensity_rt values from <vimms.DataGenerator.DataSource object at 0x7f2392a0c4a8>
2019-12-12 10:34:53.160 | INFO     | vimms.DataGenerator:get_data:278 - Using values from XCMS peaklist
2019-12-12 10:34:53.322 | DEBUG    | vimms.DataGenerator:_kde:637 - Retrieving n_peaks values from <vimms.DataGenerator.DataSource object at 0x7f2392a0c4a8>
2019-12-12 10:37:48.230 | DEBUG    | vimms.DataGenerator:_kde:626 - Training KDEs for ms_level=2
2019-12-12 10:37:48.233 | DEBUG    | vimms.DataGenerator

In [18]:
ps.get_peak(1, 10)

[Peak mz=241.1763 rt=596.45 intensity=2420390.65 ms_level=1,
 Peak mz=495.2376 rt=417.20 intensity=16536.77 ms_level=1,
 Peak mz=122.5876 rt=299.28 intensity=70720.02 ms_level=1,
 Peak mz=150.7984 rt=306.76 intensity=17147842.22 ms_level=1,
 Peak mz=430.8960 rt=237.57 intensity=187340.68 ms_level=1,
 Peak mz=240.5786 rt=1096.17 intensity=204671.49 ms_level=1,
 Peak mz=341.3785 rt=218.03 intensity=319904.91 ms_level=1,
 Peak mz=241.6549 rt=476.10 intensity=226572.62 ms_level=1,
 Peak mz=411.8167 rt=823.82 intensity=28295.69 ms_level=1,
 Peak mz=357.9952 rt=235.56 intensity=56875.11 ms_level=1]

In [19]:
ps.get_peak(2, 10)

[Peak mz=187.0235 rt=625.89 intensity=5083.84 ms_level=2,
 Peak mz=97.8833 rt=1296.09 intensity=2558.77 ms_level=2,
 Peak mz=202.2899 rt=422.44 intensity=23230.41 ms_level=2,
 Peak mz=162.0548 rt=641.48 intensity=5896.36 ms_level=2,
 Peak mz=96.3014 rt=1080.46 intensity=330694.48 ms_level=2,
 Peak mz=67.2237 rt=457.03 intensity=235.22 ms_level=2,
 Peak mz=206.5379 rt=319.08 intensity=17635.36 ms_level=2,
 Peak mz=68.2419 rt=256.39 intensity=373.43 ms_level=2,
 Peak mz=105.4112 rt=684.87 intensity=3813.46 ms_level=2,
 Peak mz=90.8609 rt=876.06 intensity=21018.38 ms_level=2]
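
A quick way to sanity-check sampled peaks is to summarise their value ranges and compare them with the thresholds chosen earlier. The sketch below uses a stand-in `SampledPeak` tuple; its attribute names mirror the fields printed above and are an assumption about the real peak class, and `summarise_peaks` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the sampled peak objects; attribute names are assumed
# from the printed representation and may differ in the real class.
SampledPeak = namedtuple('SampledPeak', ['mz', 'rt', 'intensity', 'ms_level'])


def summarise_peaks(peaks):
    """Return the (min, max) of m/z, RT and intensity for a list of peaks."""
    return {
        name: (min(getattr(p, name) for p in peaks),
               max(getattr(p, name) for p in peaks))
        for name in ('mz', 'rt', 'intensity')
    }


# three of the MS2 peaks printed above
sampled = [
    SampledPeak(187.0235, 625.89, 5083.84, 2),
    SampledPeak(97.8833, 1296.09, 2558.77, 2),
    SampledPeak(202.2899, 422.44, 23230.41, 2),
]
summary = summarise_peaks(sampled)
print(summary['rt'])   # (422.44, 1296.09)
```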

## d. Extract the ROIs for DsDA Experiments

In [20]:
roi_mz_tol = 10
roi_min_length = 2
roi_min_intensity = 1.75E5
roi_start_rt = min_rt
roi_stop_rt = max_rt
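
To illustrate what the ROI parameters above control, the sketch below groups peaks from consecutive scans into regions of interest. It is a simplified illustration and not the ViMMS algorithm: `build_rois` is a hypothetical function, the m/z tolerance is assumed to be in ppm, and the intensity threshold is assumed to apply to the most intense point of each ROI.

```python
def build_rois(scans, mz_tol_ppm=10, min_length=2, min_intensity=1.75e5):
    """Group peaks from consecutive scans into regions of interest (ROIs).

    scans: list of (rt, [(mz, intensity), ...]) in increasing RT order.
    Returns a list of ROIs, each a list of (mz, rt, intensity) tuples.
    """
    live, finished = [], []
    for rt, peaks in scans:
        extended = []
        for mz, intensity in peaks:
            for roi in live:
                mean_mz = sum(p[0] for p in roi) / len(roi)
                within_tol = abs(mz - mean_mz) / mean_mz * 1e6 <= mz_tol_ppm
                already_used = any(roi is e for e in extended)
                if within_tol and not already_used:
                    roi.append((mz, rt, intensity))   # extend a live ROI
                    extended.append(roi)
                    break
            else:
                # no live ROI matched: start a new one
                extended.append([(mz, rt, intensity)])
        # ROIs that were not extended in this scan are closed
        finished.extend(r for r in live
                        if not any(r is e for e in extended))
        live = extended
    finished.extend(live)
    return [r for r in finished
            if len(r) >= min_length
            and max(p[2] for p in r) >= min_intensity]


scans = [
    (1.0, [(100.0000, 2e5), (200.000, 1e3)]),
    (2.0, [(100.0005, 3e5), (200.001, 2e3)]),
    (3.0, [(100.0002, 1e5)]),
    (4.0, [(300.0000, 5e5)]),
]
rois = build_rois(scans)
```

With the default thresholds only the trace around m/z 100 survives: the trace around m/z 200 is too weak, and the single point at m/z 300 is too short.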

### Extract beer ROIs

In [21]:
file_names = Path(base_dir, 'beers', 'fragmentation', 'mzML').glob('*.mzML')
out_dir = Path(base_dir,'DsDA', 'DsDA_Beer', 'beer_t10_simulator_files')
mzml_path = Path(base_dir, 'beers', 'fragmentation', 'mzML')

extract_roi(list(file_names), out_dir, 'beer_%d.p', mzml_path, ps)

2019-12-12 10:40:05.581 | DEBUG    | vimms.Roi:__init__:314 -      0/ 11377
2019-12-12 10:40:26.437 | INFO     | vimms.Roi:__init__:338 - Found 11377 ROIs above thresholds
2019-12-12 10:40:26.438 | INFO     | vimms.Common:create_if_not_exist:48 - Created /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files
2019-12-12 10:40:26.440 | INFO     | vimms.Common:save_obj:61 - Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_10.p
2019-12-12 10:40:47.923 | DEBUG    | vimms.Roi:__init__:314 -      0/ 14333
2019-12-12 10:41:14.143 | INFO     | vimms.Roi:__init__:338 - Found 14333 ROIs above thresholds
2019-12-12 10:41:14.180 | INFO     | vimms.Common:save_obj:61 - Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Beer/beer_t10_simulator_files/beer_2.p
2019-12-12 10:41:35.517 | DEBUG    | vimms.Roi:__init__:314 -      0/  9803
2019-12-12 10:41:53.760 | INFO     | vi

### Extract urine ROIs

In [22]:
file_names = Path(base_dir, 'urines', 'fragmentation', 'mzML').glob('*.mzML')
out_dir = Path(base_dir,'DsDA', 'DsDA_Urine', 'urine_t10_simulator_files')
mzml_path = Path(base_dir, 'urines', 'fragmentation', 'mzML')

extract_roi(list(file_names), out_dir, 'urine_%d.p', mzml_path, ps)

2019-12-12 10:53:30.820 | DEBUG    | vimms.Roi:__init__:314 -      0/ 15929
2019-12-12 10:53:57.799 | INFO     | vimms.Roi:__init__:338 - Found 15929 ROIs above thresholds
2019-12-12 10:53:57.800 | INFO     | vimms.Common:create_if_not_exist:48 - Created /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files
2019-12-12 10:53:57.803 | INFO     | vimms.Common:save_obj:61 - Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files/urine_28.p
2019-12-12 10:54:29.542 | DEBUG    | vimms.Roi:__init__:314 -      0/ 20011
2019-12-12 10:55:03.403 | INFO     | vimms.Roi:__init__:338 - Found 20011 ROIs above thresholds
2019-12-12 10:55:03.441 | INFO     | vimms.Common:save_obj:61 - Saving <class 'list'> to /home/joewandy/git/vimms/examples/example_data/DsDA/DsDA_Urine/urine_t10_simulator_files/urine_57.p
2019-12-12 10:55:22.994 | DEBUG    | vimms.Roi:__init__:314 -      0/ 12606
2019-12-12 10:55:44.300 | INFO