# 3. Generating Multiple Samples Using the MS1 Controller

In this notebook, we demonstrate how ViMMS can be used to generate multiple samples (sets of chemicals) that represent biological and technical replicates. The MS1 controller is then used to produce mass spectral data, in the form of .mzML files, for each of these samples.

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
from collections import defaultdict
from pathlib import Path

import numpy as np  # used explicitly later for np.sum
import pandas as pd

In [4]:
import sys
sys.path.append('../..')

In [5]:
from vimms.Chemicals import ChemicalCreator, MultiSampleCreator
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import SimpleMs1Controller
from vimms.Environment import Environment
from vimms.Common import *

Load previously trained KDEs in `PeakSampler` and the list of extracted metabolites, created in **01. Download Data.ipynb**.

In [6]:
base_dir = os.path.abspath('example_data')
ps = load_obj(Path(base_dir, 'peak_sampler_mz_rt_int_19_beers_fullscan.p'))
hmdb = load_obj(Path(base_dir, 'hmdb_compounds.p'))

Set ViMMS logging level

In [7]:
set_log_level_warning()
# set_log_level_info()
# set_log_level_debug()

## Create Initial Chemical

Define an output folder containing our results

In [8]:
out_dir = Path(base_dir, 'results', 'MS1_multiple')

Here we generate multiple chemical objects that will be used across samples. The chemical objects are generated by sampling from metabolites in the HMDB database.

In [9]:
# the list of ROI sources created in the previous notebook '01. Download Data.ipynb'
ROI_Sources = [str(Path(base_dir,'DsDA', 'DsDA_Beer', 'beer_t10_simulator_files'))]

# minimum MS1 intensity of chemicals
min_ms1_intensity = 1.75E5

# m/z and RT range of chemicals
rt_range = [(400, 800)]
mz_range = [(100, 400)]

# the number of chemicals in the sample
n_chems = 1000

# maximum MS level (we do not generate fragmentation peaks when this value is 1)
ms_level = 1

# for this experiment, we restrict the sampled chromatograms to be 20 - 40s in length
# so that they are neither too long nor too short
roi_rt_range = [20, 40]

In [10]:
chems = ChemicalCreator(ps, ROI_Sources, hmdb)
dataset = chems.sample(mz_range, rt_range, min_ms1_intensity, n_chems, ms_level, roi_rt_range=roi_rt_range)
save_obj(dataset, Path(out_dir, 'BaseDataset', 'dataset.p'))

In [11]:
for chem in dataset[0:10]:
    print(chem)

KnownChemical - 'C5H12N2O' rt=707.41 max_intensity=1567498579.15
KnownChemical - 'C9H18O2' rt=732.15 max_intensity=1188980.50
KnownChemical - 'C15H12O10S' rt=421.13 max_intensity=568706.93
KnownChemical - 'C19H26' rt=556.11 max_intensity=791810.73
KnownChemical - 'C19H35NO' rt=412.62 max_intensity=3530450.33
KnownChemical - 'C2H4Cl2O2' rt=692.57 max_intensity=1991866.40
KnownChemical - 'C2H7NO2S' rt=514.64 max_intensity=802075.41
KnownChemical - 'C10H20O2S' rt=541.53 max_intensity=8164111.00
KnownChemical - 'C7H5NS' rt=734.46 max_intensity=53629606.41
KnownChemical - 'C17H34O2' rt=392.41 max_intensity=14368080.46


## Create Multiple Samples

The next section allows us to define classes of biological replicates, each having multiple technical replicates. 

Below we create two biological classes ('class0', 'class1'), each having 10 technical replicates with some noise on the chemical's intensity.

In [12]:
n_samples = [10, 10] # number of files per class
classes = ["class%d" % i for i in range(len(n_samples))] # creates default list of classes
intensity_noise_sd = [1000] # noise on max intensity
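
To make the idea of technical replicates concrete, here is a minimal NumPy sketch of adding Gaussian noise with a fixed standard deviation to each chemical's maximum intensity. It is an illustration only and does not claim to reproduce what `MultiSampleCreator` does internally; the function name `make_technical_replicates` and the clipping at zero are assumptions made for this example.

```python
import numpy as np

def make_technical_replicates(max_intensities, n_replicates, noise_sd, seed=0):
    """Return an (n_replicates, n_chems) array of noisy copies of max_intensities.

    Hypothetical helper for illustration; not part of ViMMS.
    """
    rng = np.random.default_rng(seed)
    base = np.asarray(max_intensities, dtype=float)
    # one independent noise draw per replicate and per chemical
    noise = rng.normal(loc=0.0, scale=noise_sd, size=(n_replicates, base.size))
    # intensities cannot be negative, so clip at zero (an assumption of this sketch)
    return np.clip(base + noise, 0.0, None)

# three max intensities taken from the chemicals printed earlier
base = [1188980.50, 568706.93, 791810.73]
replicates = make_technical_replicates(base, n_replicates=10, noise_sd=1000)
print(replicates.shape)  # prints (10, 3)
```

With a noise SD of 1000 and intensities in the range of 10^5 to 10^6, the replicates differ from the base sample by well under one percent.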

In [13]:
classes

['class0', 'class1']

Add intensity changes between different classes

In [14]:
change_probabilities = [0 for i in range(len(n_samples))] # probability of intensity changes between different classes
change_differences_means = [0 for i in range(len(n_samples))] # mean of those intensity changes
change_differences_sds = [0 for i in range(len(n_samples))] # SD of those intensity changes
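
The three lists above hold one value per class. One plausible reading of them, sketched below with NumPy only, is that each chemical changes with a given probability and that the size of the change is drawn from a normal distribution. Whether ViMMS applies the change additively, as done here, or on some other scale is not stated in this notebook, so the additive shift and the name `apply_class_changes` are assumptions of this sketch.

```python
import numpy as np

def apply_class_changes(base_intensities, change_probability, change_mean, change_sd, seed=0):
    """Shift a random subset of intensities; hypothetical helper, not part of ViMMS."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_intensities, dtype=float)
    # decide independently for each chemical whether it changes in this class
    changed = rng.random(base.size) < change_probability
    # draw a shift for every chemical, then keep it only where changed is True
    shifts = rng.normal(loc=change_mean, scale=change_sd, size=base.size)
    return base + np.where(changed, shifts, 0.0), changed

base = np.array([1188980.50, 568706.93, 791810.73])

# with all three settings at 0, as in the cell above, nothing changes
unchanged, changed_mask = apply_class_changes(base, 0, 0, 0)
print(changed_mask.any())  # prints False
```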

Add experimental variables (examples in comments)

In [15]:
experimental_classes = None # [["male","female"],["Positive","Negative","Unknown"]]
experimental_probabilitities = None # [[0.5,0.5],[0.33,0.33,0.34]]
experimental_sds = None # [[250],[250]]

Drop out chemicals in different classes

In [16]:
# drop-out chemicals by their probabilities
dropout_probability = 0.2
dropout_probabilities = [dropout_probability for i in range(len(n_samples))]
dropout_numbers = None # drop-out chemicals by an absolute number

# dropout_probabilities = None
# dropout_numbers = 2 # number of chemicals dropped out in each class
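
The two drop-out modes (by probability or by absolute number) can be illustrated with a small NumPy sketch. This is not the ViMMS implementation; `choose_dropouts` is a hypothetical helper that only shows the difference between the two settings.

```python
import numpy as np

def choose_dropouts(n_chems, dropout_probability=None, dropout_number=None, seed=0):
    """Return sorted indices of chemicals to remove from one class (illustrative only)."""
    rng = np.random.default_rng(seed)
    if dropout_probability is not None:
        # each chemical is dropped independently with the given probability
        mask = rng.random(n_chems) < dropout_probability
        return np.flatnonzero(mask)
    if dropout_number is not None:
        # drop a fixed number of distinct chemicals
        return np.sort(rng.choice(n_chems, size=dropout_number, replace=False))
    # neither mode requested: nothing is dropped
    return np.array([], dtype=int)

by_prob = choose_dropouts(1000, dropout_probability=0.2)
by_count = choose_dropouts(1000, dropout_number=2)
print(len(by_prob), len(by_count))
```

With a probability of 0.2 and 1000 chemicals, roughly 200 chemicals are dropped per class, whereas the absolute mode always drops exactly the requested number.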

Generate multiple samples

In [17]:
save_location = os.path.join(out_dir, 'ChemicalFiles')

In [18]:
multiple_samples = MultiSampleCreator(
    dataset, n_samples, classes, intensity_noise_sd,
    change_probabilities, change_differences_means, change_differences_sds,
    dropout_probabilities, dropout_numbers,
    experimental_classes, experimental_probabilitities, experimental_sds,
    save_location=save_location)

In [19]:
total_samples = np.sum(multiple_samples.n_samples)
total_samples

20

We can also print the chemicals that are missing (removed by drop-out) in each class.

In [20]:
save_obj(multiple_samples.missing_chemicals, Path(out_dir, 'MissingChemicals', 'missing_chemicals.p'))
multiple_samples.missing_chemicals

[[KnownChemical - 'C19H35NO' rt=412.62 max_intensity=3530450.33,
  KnownChemical - 'C8H12NO6P' rt=572.12 max_intensity=2461859.41,
  KnownChemical - 'C5H6Cl6N2O3' rt=488.60 max_intensity=372050.12,
  KnownChemical - 'C10H19NO3' rt=530.82 max_intensity=3381347.00,
  KnownChemical - 'C6H13NO8S' rt=407.75 max_intensity=20679709.08,
  KnownChemical - 'C8H9NO6S' rt=634.37 max_intensity=237635.70,
  KnownChemical - 'C17H15O7' rt=523.17 max_intensity=669669.21,
  KnownChemical - 'C17H22' rt=392.18 max_intensity=1734070.17,
  KnownChemical - 'C8H13NO5' rt=558.94 max_intensity=188779.35,
  KnownChemical - 'C20H40' rt=580.68 max_intensity=877148.50,
  KnownChemical - 'C6H15NO' rt=774.94 max_intensity=1230183.64,
  KnownChemical - 'C14H10' rt=426.45 max_intensity=1035516.57,
  KnownChemical - 'C23H30N2O4' rt=680.06 max_intensity=1035314.70,
  KnownChemical - 'C15H12O9' rt=655.88 max_intensity=2478870.53,
  KnownChemical - 'C3H3Br2IO' rt=398.65 max_intensity=949078.79,
  KnownChemical - 'C14H24' r

## Run MS1 controller on the samples and generate .mzML files

We can now take the multiple samples created above and generate mass spectral data (.mzML files) using the MS1 controller in ViMMS.

In [21]:
min_rt = rt_range[0][0]
max_rt = rt_range[0][1]
controllers = defaultdict(list)
controller_to_mzml = {}

mzml_dir = Path(out_dir, 'mzMLFiles')
num_classes = len(n_samples)
sample_idx = 0
for j in range(num_classes): # loop over classes
    num_samples = n_samples[j]
    for i in range(num_samples): # loop over samples for each class
        
        # load the sample
        fname = Path(save_location, 'sample_%d.p' % sample_idx) 
        sample = load_obj(fname)
        sample_idx += 1
        
        # define output .mzML filename
        out_file = 'number_%d_class_%d.mzML' % (i, j)
        out_path = Path(mzml_dir, out_file)

        # run it through the MS1 controller        
        mass_spec = IndependentMassSpectrometer(POSITIVE, sample, ps)
        controller = SimpleMs1Controller()
        
        # create an environment to run both the mass spec and controller
        env = Environment(mass_spec, controller, min_rt, max_rt, progress_bar=True)

        # set the log level to WARNING so we don't see too many messages when environment is running
        set_log_level_warning()

        # run the simulation
        env.run()

        set_log_level_debug()
        env.write_mzML(mzml_dir, out_file)        
        
        # save the resulting controller
        controllers[j].append(controller)
        controller_to_mzml[controller] = (j, out_file, )

(800.492s) ms_level=1: 100%|█████████▉| 399.25193999999993/400 [00:04<00:00, 86.83it/s] 
2019-12-12 11:41:42.274 | DEBUG    | vimms.Environment:write_mzML:142 - Writing mzML file to /home/joewandy/git/vimms/examples/example_data/results/MS1_multiple/mzMLFiles/number_0_class_0.mzML
2019-12-12 11:41:43.870 | DEBUG    | vimms.Environment:write_mzML:149 - mzML file successfully written!
(800.622s) ms_level=1: 100%|█████████▉| 399.3478500000002/400 [00:04<00:00, 83.88it/s]  
2019-12-12 11:41:48.652 | DEBUG    | vimms.Environment:write_mzML:142 - Writing mzML file to /home/joewandy/git/vimms/examples/example_data/results/MS1_multiple/mzMLFiles/number_1_class_0.mzML
2019-12-12 11:41:49.237 | DEBUG    | vimms.Environment:write_mzML:149 - mzML file successfully written!
(801.108s) ms_level=1: 100%|█████████▉| 399.8484800000002/400 [00:04<00:00, 84.49it/s]  
2019-12-12 11:41:53.992 | DEBUG    | vimms.Environment:write_mzML:142 - Writing mzML file to /home/joewandy/git/vimms/examples/example_data
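
The loop above relies on the sample files being numbered consecutively, class by class (`sample_0.p` to `sample_9.p` for class 0, and so on). The sketch below reproduces only that bookkeeping, without running any simulation, so the mapping from input sample file to output .mzML file can be checked up front. The helper name `sample_file_plan` is hypothetical.

```python
n_samples = [10, 10]  # number of files per class, as in the notebook

def sample_file_plan(n_samples):
    """List (sample pickle name, mzML name) pairs in the order the loop visits them."""
    plan = []
    sample_idx = 0
    for class_idx, n in enumerate(n_samples):  # loop over classes
        for rep_idx in range(n):               # loop over replicates in a class
            plan.append((
                'sample_%d.p' % sample_idx,
                'number_%d_class_%d.mzML' % (rep_idx, class_idx),
            ))
            sample_idx += 1
    return plan

plan = sample_file_plan(n_samples)
print(plan[0])   # prints ('sample_0.p', 'number_0_class_0.mzML')
print(plan[10])  # prints ('sample_10.p', 'number_0_class_1.mzML')
```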

## Print out the missing peaks

The controller object contains all the information about the state of the mass spectrometry process over time. Below we demonstrate this by generating a report of the peaks that belong to chemicals present in one class but missing from the other. Such a report can be useful as ground truth when benchmarking peak picking or alignment algorithms.

In [22]:
def get_chem_to_peaks(controller):
    chem_to_peaks = defaultdict(list)
    frag_events = controller.environment.mass_spec.fragmentation_events
    for frag_event in frag_events:
        chem = frag_event.chem
        peaks = frag_event.peaks
        chem_to_peaks[chem].extend(peaks)
    return chem_to_peaks

In [23]:
for controller, (current_class, mzml_filename) in controller_to_mzml.items():
    controller_peaks = get_chem_to_peaks(controller)
    # name the CSV report after the .mzML file
    front, _ = os.path.splitext(os.path.basename(mzml_filename))
    outfile = front + '.csv'

    missing_peaks = []            
    for other_class in range(num_classes):
        if current_class == other_class:
            continue

        # get the peaks that are present in current_class but missing in other_class
        missing_chems = multiple_samples.missing_chemicals[other_class]
        for chem in missing_chems:
            peaks = controller_peaks[chem]
            for peak in peaks:
                row = (chem.formula.formula_string, current_class, other_class, peak.mz, peak.rt, peak.intensity)
                missing_peaks.append(row)
    
    # convert to dataframe
    columns = ['formula', 'present_in', 'missing_in', 'mz', 'RT', 'intensity']
    missing_df = pd.DataFrame(missing_peaks, columns=columns)
    missing_df.to_csv(os.path.join(out_dir, 'MissingChemicals', os.path.basename(outfile)))

In [24]:
missing_df.head(20)

| | formula | present_in | missing_in | mz | RT | intensity |
|---|---|---|---|---|---|---|
| 0 | C19H35NO | 1 | 0 | 294.279134 | 413.229 | 248626.544988 |
| 1 | C19H35NO | 1 | 0 | 316.261076 | 413.229 | 59585.785134 |
| 2 | C19H35NO | 1 | 0 | 370.190898 | 413.229 | 235838.554105 |
| 3 | C19H35NO | 1 | 0 | 372.293078 | 413.229 | 192729.935184 |
| 4 | C19H35NO | 1 | 0 | 295.282489 | 413.229 | 52540.897778 |
| 5 | C19H35NO | 1 | 0 | 296.285844 | 413.229 | 5259.402305 |
| 6 | C19H35NO | 1 | 0 | 294.279117 | 414.474 | 338849.93278 |
| 7 | C19H35NO | 1 | 0 | 316.261059 | 414.474 | 81208.70315 |
| 8 | C19H35NO | 1 | 0 | 370.190881 | 414.474 | 321421.343843 |
| 9 | C19H35NO | 1 | 0 | 372.293061 | 414.474 | 262669.159421 |
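
A report with these columns can be summarised per scan using pandas. The sketch below rebuilds a small frame from four of the rows shown above (rather than reading the CSV from disk) and counts the peaks and total intensity for each formula at each retention time. The choice of summary is only one possible example.

```python
import pandas as pd

# four rows copied from the report above
rows = [
    ('C19H35NO', 1, 0, 294.279134, 413.229, 248626.544988),
    ('C19H35NO', 1, 0, 316.261076, 413.229, 59585.785134),
    ('C19H35NO', 1, 0, 294.279117, 414.474, 338849.93278),
    ('C19H35NO', 1, 0, 316.261059, 414.474, 81208.70315),
]
columns = ['formula', 'present_in', 'missing_in', 'mz', 'RT', 'intensity']
report = pd.DataFrame(rows, columns=columns)

# one row per (formula, scan time): number of peaks and summed intensity in that scan
per_scan = (
    report.groupby(['formula', 'RT'])
    .agg(n_peaks=('mz', 'size'), total_intensity=('intensity', 'sum'))
    .reset_index()
)
print(per_scan)
```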
