# Loading and Saving

## Loading

### mzML

To accommodate disparate instrument types and manufacturers (e.g. Bruker, Waters, Thermo, Agilent), DEIMoS operates under the assumption that input data are in an open, standard format.
As of this publication, the accepted file format for DEIMoS is mzML (or mzML.gz), which contains metadata, separation, and spectrometry data that reproduce the contents of vendor formats.
Conversion to mzML from several other formats can be performed using the free and open-source [ProteoWizard](https://proteowizard.sourceforge.io/) msconvert utility.
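
As a rough sketch of the conversion step, the snippet below assembles an msconvert command line from Python. The helper name `build_msconvert_command` and the input and output paths are hypothetical placeholders; `--mzML`, `--gzip`, and `-o` are standard msconvert options. Running the command assumes msconvert is installed and available on the `PATH`.

```python
import subprocess


def build_msconvert_command(raw_path, out_dir, gzip=True):
    """Assemble an msconvert call that writes mzML (optionally gzipped).

    Hypothetical helper for illustration; not part of DEIMoS.
    """
    cmd = ["msconvert", str(raw_path), "--mzML", "-o", str(out_dir)]
    if gzip:
        # Gzip the entire output file, yielding a .mzML.gz file
        cmd.append("--gzip")
    return cmd


# Placeholder paths: substitute the vendor file and output directory
cmd = build_msconvert_command("example_data.d", "converted")
print(" ".join(cmd))

# Uncomment to run the conversion (requires msconvert on the PATH):
# subprocess.run(cmd, check=True)
```

The command is built as a list rather than a single string so that paths containing spaces are passed to msconvert intact.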

By default, DEIMoS will load frame, scan, *m/z*, and intensity from the mzML, as well as precursor *m/z* for MS2, as available.
Additional "accession" fields may be specified for data of higher dimension.
To view these fields, a convenience function is provided.

In [1]:
import deimos

accessions = deimos.get_accessions('example_data.mzML.gz')
accessions

{'positive scan': 'MS:1000130',
 'ms level': 'MS:1000511',
 'MSn spectrum': 'MS:1000580',
 'profile spectrum': 'MS:1000128',
 'lowest observed m/z': 'MS:1000528',
 'highest observed m/z': 'MS:1000527',
 'no combination': 'MS:1000795',
 'scan start time': 'MS:1000016',
 'ion mobility drift time': 'MS:1002476',
 'scan window lower limit': 'MS:1000501',
 'scan window upper limit': 'MS:1000500',
 'isolation window target m/z': 'MS:1000827',
 'isolation window lower offset': 'MS:1000828',
 'isolation window upper offset': 'MS:1000829',
 'selected ion m/z': 'MS:1000744',
 'collision-induced dissociation': 'MS:1000133',
 'collision energy': 'MS:1000045',
 '32-bit float': 'MS:1000521',
 'zlib compression': 'MS:1000574',
 'm/z array': 'MS:1000514',
 'intensity array': 'MS:1000515'}

The example data referenced is from an Agilent 6560 Ion Mobility LC/Q-TOF system. Thus, we will additionally need to parse retention time and ion mobility drift times.
Consulting the list above, we are able to supply appropriate accession fields to the `load` function, renaming as convenient (here, "scan start time" becomes "retention_time" and "ion mobility drift time" becomes "drift_time").
The `load` function will infer the file type from the extension (here, .mzML or .mzML.gz).

In [2]:
%%time
data = deimos.load('example_data.mzML.gz',
                   accession={'retention_time': 'MS:1000016',
                              'drift_time': 'MS:1002476'})

CPU times: user 6min 20s, sys: 5.78 s, total: 6min 26s
Wall time: 6min 31s
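
Rather than copying accession IDs by hand, the mapping passed to `load` could be assembled from the dictionary returned by `get_accessions`. The sketch below uses a hypothetical helper, `select_accessions`, which is not part of DEIMoS; a two-entry subset of the accession dictionary is copied in so the example runs on its own.

```python
def select_accessions(accessions, wanted):
    """Map new column names to accession IDs looked up by field name.

    accessions: dict of field name -> accession ID (as from get_accessions).
    wanted: dict of desired column name -> field name in the mzML.
    Hypothetical helper for illustration; not part of DEIMoS.
    """
    missing = [name for name in wanted.values() if name not in accessions]
    if missing:
        raise KeyError(f"Fields not present in file: {missing}")
    return {column: accessions[name] for column, name in wanted.items()}


# Subset of the accession dictionary, copied here to keep the sketch self-contained
accessions = {'scan start time': 'MS:1000016',
              'ion mobility drift time': 'MS:1002476'}

accession = select_accessions(accessions,
                              {'retention_time': 'scan start time',
                               'drift_time': 'ion mobility drift time'})
print(accession)
# {'retention_time': 'MS:1000016', 'drift_time': 'MS:1002476'}
```

The resulting dictionary has the same form as the `accession` argument used in the call to `deimos.load`, and the lookup fails early if a requested field is absent from the file.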


The result is a dictionary of data frames keyed by MS level (here, `'ms1'` and `'ms2'`). The example data contains MS1 and MS2 (collected at 20 eV).

In [3]:
data['ms1']

Unnamed: 0,scanId,retention_time,drift_time,mz,intensity
0,416.0,0.07125,0.00000,71.677490,0.0
1,416.0,0.07125,0.00000,71.680420,11.0
2,416.0,0.07125,0.00000,71.683350,4.0
3,416.0,0.07125,0.00000,71.686279,0.0
4,416.0,0.07125,0.00000,71.703850,0.0
...,...,...,...,...,...
120486132,472575.0,21.98105,49.90292,1606.741089,4.0
120486133,472575.0,21.98105,49.90292,1606.755005,5.0
120486134,472575.0,21.98105,49.90292,1606.768921,0.0
120486135,472575.0,21.98105,49.90292,1608.308472,0.0


In [4]:
data['ms2']

Unnamed: 0,scanId,retention_time,drift_time,mz,intensity,precursor_mz
0,0.0,0.051783,0.00000,61.058502,0.0,813.195496
1,0.0,0.051783,0.00000,61.061207,7.0,813.195496
2,0.0,0.051783,0.00000,61.063911,29.0,813.195496
3,0.0,0.051783,0.00000,61.066612,5.0,813.195496
4,0.0,0.051783,0.00000,61.069317,6.0,813.195496
...,...,...,...,...,...,...
108650686,472159.0,21.961750,49.90292,1637.126221,6.0,843.569336
108650687,472159.0,21.961750,49.90292,1637.140259,2.0,843.569336
108650688,472159.0,21.961750,49.90292,1637.154297,5.0,843.569336
108650689,472159.0,21.961750,49.90292,1637.168213,6.0,843.569336


### HDF5

If the data have already been parsed and saved in the Hierarchical Data Format (HDF5), loading is much faster: in this example, about 14 s of wall time to read MS1 versus roughly 6.5 min to parse the mzML. The function does not change, as the loader again infers the format from the file extension. The arguments differ, however: specifying accessions is no longer required, but the relevant MS level must be selected using the `key` flag.

In [5]:
%%time
ms1 = deimos.load('example_data.h5', key='ms1')
ms1

CPU times: user 7.92 s, sys: 4.85 s, total: 12.8 s
Wall time: 14 s


Unnamed: 0,scanId,retention_time,drift_time,mz,intensity
0,416.0,0.07125,0.00000,71.677490,0
1,416.0,0.07125,0.00000,71.680420,11
2,416.0,0.07125,0.00000,71.683350,4
3,416.0,0.07125,0.00000,71.686279,0
4,416.0,0.07125,0.00000,71.703850,0
...,...,...,...,...,...
120486132,472575.0,21.98105,49.90292,1606.741089,4
120486133,472575.0,21.98105,49.90292,1606.755005,5
120486134,472575.0,21.98105,49.90292,1606.768921,0
120486135,472575.0,21.98105,49.90292,1608.308472,0
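
The extension-based dispatch described in this section can be pictured roughly as follows. This is a sketch only, not the actual DEIMoS implementation: `infer_format` is a hypothetical helper, and it recognizes only the extensions that appear in this document.

```python
from pathlib import Path


def infer_format(path):
    """Guess the file format from its extension (illustrative only)."""
    name = Path(path).name.lower()
    if name.endswith(('.mzml', '.mzml.gz')):
        return 'mzml'
    if name.endswith('.h5'):
        return 'hdf5'
    raise ValueError(f'Unsupported file extension: {path}')


for path in ['example_data.mzML.gz', 'example_data.h5']:
    print(path, '->', infer_format(path))
```

Matching on the full file name, rather than only the final suffix, is what lets the compressed `.mzML.gz` case be recognized.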


### Multi-file Loading

For certain alignment applications, the number of input files is too large to read all of them into memory simultaneously.
In these situations, [Dask](https://dask.org/) is used to load the data frames virtually, which makes them more amenable to downstream computation.
The `load` function detects whether a list of inputs is passed and reads using the appropriate backend.
The Dask chunk size (see [docs](https://docs.dask.org/en/stable/array-chunks.html)) may be specified with the `chunksize` flag, and additional metadata per input file (e.g. date, sample type, etc.) can be passed with the `meta` flag as a dictionary with keys for each path. Only the HDF5 format is supported for multi-file loading.

In [6]:
ms1 = deimos.load(['example_data.h5', 'example_data.h5'], key='ms1', chunksize=1E7, meta=None)
ms1

npartitions=26,scanId,retention_time,drift_time,mz,intensity,sample_idx,sample_id
,float64,float64,float64,float64,int64,int64,object
,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...
,...,...,...,...,...,...,...


Note that additional columns (`sample_idx` and `sample_id`) are appended to indicate the index and name of each source file.
As the data frames are loaded virtually, the output is a placeholder for would-be data.
For more on loading multiple files, see the section on [alignment](alignment.ipynb).
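
When many files are involved, the list of inputs is usually gathered programmatically rather than typed out. A minimal sketch using only the standard library is shown below; `collect_hdf5_paths` is a hypothetical helper, and the temporary directory with empty placeholder files exists only to make the example runnable.

```python
import tempfile
from pathlib import Path


def collect_hdf5_paths(directory):
    """Return the sorted paths of all .h5 files in a directory, as strings."""
    return sorted(str(p) for p in Path(directory).glob('*.h5'))


# Demonstrate on a throwaway directory containing empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.h5', 'a.h5', 'notes.txt']:
        (Path(tmp) / name).touch()

    paths = collect_hdf5_paths(tmp)
    print([Path(p).name for p in paths])
    # ['a.h5', 'b.h5']
```

A list built this way could then be passed to `deimos.load` together with `key='ms1'`, as in the multi-file call shown in this section. Sorting keeps the file order, and therefore the per-file index, reproducible between runs.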

## Saving

### HDF5

By default, DEIMoS exports a lightweight, data frame-based representation in the Hierarchical Data Format version 5 (HDF5) file format. One must specify a path, the data frame to be saved, and a key for the container. Multiple keys may be saved to the same container (e.g. MS1 and MS2). The `mode` flag indicates whether to overwrite the file (`mode='w'`) or append to it (`mode='a'`); the latter is used when saving multiple data frames to the same file.

In [7]:
# Save ms1 to new file
deimos.save('example_data.h5', data['ms1'], key='ms1', mode='w')

# Save ms2 to same file
deimos.save('example_data.h5', data['ms2'], key='ms2', mode='a')
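
Because parsing mzML is slow and reading HDF5 is fast, a common pattern is to parse once and reuse the saved container afterwards. The sketch below illustrates that idea under stated assumptions: `load_with_cache` is a hypothetical helper, and the `load` and `save` callables are passed in (for example `deimos.load` and `deimos.save`, using only the arguments shown in this document) so that the function itself does not require DEIMoS to be installed.

```python
from pathlib import Path


def load_with_cache(mzml_path, h5_path, keys, load, save, accession=None):
    """Load from an HDF5 cache if present; otherwise parse mzML and cache it.

    Hypothetical helper for illustration; not part of DEIMoS.
    """
    if Path(h5_path).exists():
        # Fast path: read each requested MS level from the HDF5 container
        return {key: load(h5_path, key=key) for key in keys}

    # Slow path: parse the mzML once, yielding a dict keyed by MS level
    kwargs = {} if accession is None else {'accession': accession}
    data = load(mzml_path, **kwargs)

    # Overwrite on the first write, then append the remaining levels
    for i, (key, frame) in enumerate(data.items()):
        save(h5_path, frame, key=key, mode='w' if i == 0 else 'a')
    return data


# Example call, assuming DEIMoS is installed:
# import deimos
# data = load_with_cache('example_data.mzML.gz', 'example_data.h5',
#                        keys=['ms1', 'ms2'],
#                        load=deimos.load, save=deimos.save,
#                        accession={'retention_time': 'MS:1000016',
#                                   'drift_time': 'MS:1002476'})
```

Note that the slow path returns every MS level found in the mzML, whereas the fast path returns only the levels named in `keys`.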

### mzML

We are currently refactoring the code to export to mzML. Check back soon!

### MGF

We are currently refactoring the code to export to MGF. Check back soon!