NOTE: This feature requires pyOpenMS >= 3.0. At the time of writing, this means using one of the nightly builds as described in the Installation Instructions.
In pyOpenMS, some data structures can be converted to a tabular format as a pandas.DataFrame. This allows convenient access to the data and meta values of spectra, features and identifications.
Required imports for the examples:
from pyopenms import *
import pandas as pd
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/'
- pyopenms.MSExperiment.get_df( long=False )
Generates a pandas DataFrame with all peaks in the MSExperiment
Parameters:
long : default False
set to True to get a long (expanded/melted) DataFrame with one row per peak. This is faster, but the RT value is replicated for every peak. If False, returns one row per spectrum in the style: rt, np.array(mz), np.array(int)
Returns:
pandas.DataFrame
peak map information stored in a DataFrame
Examples:
urlretrieve(url+'BSA1.mzML', 'BSA1.mzML')
exp = MSExperiment()
MzMLFile().load('BSA1.mzML', exp)
df = exp.get_df() # default: long = False
df.head(2)
| | RT | mzarray | intarray |
---|---|---|---|
0 | 1501.41394 | [300.0897645621494, 300.18132740129533, 300.20... | [3431.0261, 1181.809, 1516.1746, 1719.8547, 11... |
1 | 1503.03125 | [300.06577092599525, 300.08932376441896, 300.2... | [914.79034, 1842.2311, 2395.1025, 851.4738, 16... |
df = exp.get_df(long=True)
df.head(2)
| | RT | mz | inty |
---|---|---|---|
0 | 1501.41394 | 300.089752 | 3431.026123 |
1 | 1501.41394 | 300.181335 | 1181.808960 |
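The relationship between the two layouts can be reproduced with plain pandas and NumPy. The sketch below uses a small hand-made DataFrame as a stand-in for the output of exp.get_df() (the values are illustrative and not taken from BSA1.mzML) and expands it into the long layout. It is only meant to show how the columns correspond, not how pyOpenMS builds the long format internally.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for exp.get_df(): one row per spectrum (illustrative values).
wide = pd.DataFrame({
    "RT": [1501.4, 1503.0],
    "mzarray": [np.array([300.1, 300.2, 300.3]), np.array([300.0, 300.5])],
    "intarray": [np.array([10.0, 20.0, 30.0]), np.array([5.0, 15.0])],
})

# Number of peaks in each spectrum; used to repeat the RT value once per peak.
peaks_per_spectrum = wide["mzarray"].map(len).to_numpy()

# Long layout: one row per peak, same column names as exp.get_df(long=True).
long = pd.DataFrame({
    "RT": np.repeat(wide["RT"].to_numpy(), peaks_per_spectrum),
    "mz": np.concatenate(wide["mzarray"].to_list()),
    "inty": np.concatenate(wide["intarray"].to_list()),
})

# The long layout works directly with groupby, e.g. summed intensity per spectrum.
tic = long.groupby("RT")["inty"].sum()
```

The long layout is usually the more convenient one for filtering and grouping, while the wide layout keeps each spectrum together in a single row.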
- pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ''}, export_unidentified=True )
Generates a pandas DataFrame from a list of PeptideIdentification objects.
Parameters:
peps :
list of PeptideIdentification objects
decode_ontology : default True
if meta values contain CV identifiers (e.g., from PSI-MS), they are automatically decoded into the human-readable CV term name.
default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ''}
default value for missing values for each data type
export_unidentified : default True
export PeptideIdentifications without PeptideHit
Returns:
pandas.DataFrame
peptide identifications in a DataFrame
Example:
urlretrieve(url+'small.idXML', 'small.idXML')
prot_ids = []
pep_ids = []
IdXMLFile().load('small.idXML', prot_ids, pep_ids)
df = peptide_identifications_to_df(pep_ids)
df.head(2)
| | id | RT | mz | q-value | charge | protein_accession | start | end | NuXL:z2 mass | NuXL:z3 mass | ... | isotope_error | NuXL:peptide_mass_z0 | NuXL:XL_U | NuXL:sequence_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | OpenNuXL_2019-12-04T16:39:43_1021782429466859437 | 900.425415 | 414.730865 | 0.368649 | 4 | DECOY_sp\|Q86UQ0\|ZN589_HUMAN | 255 | 267 | 828.458069 | 552.641113 | ... | 0 | 1654.901611 | 0 | 0.173912 |
1 | OpenNuXL_2019-12-04T16:39:43_7293634134684008928 | 903.565186 | 506.259521 | 0.422779 | 2 | sp\|P61313\|RL15_HUMAN | 179 | 187 | 0.0 | 0.0 | ... | 0 | 1010.504639 | 0 | 0.290786 |
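Once the identifications are in a DataFrame, they can be filtered with ordinary pandas operations. The sketch below works on a hand-made DataFrame that reuses a few column names from the table above. The cutoff of 0.01 and the shortened id values are hypothetical, and treating the "DECOY_" accession prefix as the decoy marker is an assumption based on the example rows.

```python
import pandas as pd

# Hand-made stand-in for peptide_identifications_to_df(pep_ids) (illustrative values).
pep_df = pd.DataFrame({
    "id": ["run_1", "run_2", "run_3"],
    "q-value": [0.368649, 0.004, 0.009],
    "charge": [4, 2, 3],
    "protein_accession": [
        "DECOY_sp|Q86UQ0|ZN589_HUMAN",
        "sp|P61313|RL15_HUMAN",
        "DECOY_sp|P61313|RL15_HUMAN",
    ],
})

Q_VALUE_CUTOFF = 0.01  # hypothetical threshold, adjust to your analysis

# Assumption: decoy hits are recognisable by a "DECOY_" accession prefix.
is_target = ~pep_df["protein_accession"].str.startswith("DECOY_")

# Column names containing "-" need bracket access rather than attribute access.
confident = pep_df[(pep_df["q-value"] <= Q_VALUE_CUTOFF) & is_target]
```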
- pyopenms.FeatureMap.get_df( meta_values = None, export_peptide_identifications = True )
Generates a pandas DataFrame with information contained in the FeatureMap.
Optionally the feature meta values and information for the assigned PeptideHit can be exported.
Parameters:
meta_values : default None
meta values to include (None, [custom list of meta value names] or 'all')
export_peptide_identifications : default True
export sequence and score for best PeptideHit assigned to a feature. Additionally the ID_filename (file name of the corresponding ProteinIdentification) and the ID_native_id (spectrum ID of the corresponding Feature) are exported. They are also annotated as meta values when collecting all assigned PeptideIdentifications from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
Returns:
pandas.DataFrame
feature information stored in a DataFrame
Examples:
urlretrieve(url+'BSA1_F1_idmapped.featureXML', 'BSA1_F1_idmapped.featureXML')
feature_map = FeatureMap()
FeatureXMLFile().load('BSA1_F1_idmapped.featureXML', feature_map)
df = feature_map.get_df() # default: meta_values = None
df.head(2)
id | peptide_sequence | peptide_score | ID_filename | ID_native_id | charge | RT | mz | RTstart | RTend | mzstart | mzend | quality | intensity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 | LVTDLTK | 0.000000 | unknown | spectrum=1270 | 2 | 1942.600083 | 395.239277 | 1932.484009 | 1950.834351 | 395.239199 | 397.245758 | 0.808494 | 157572000.0 |
18416216708636999474 | DDSPDLPK | 0.034483 | unknown | spectrum=1167 | 2 | 1749.138335 | 443.711224 | 1735.693115 | 1763.343506 | 443.711122 | 445.717531 | 0.893553 | 54069300.0 |
df = feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
df.head(2)
id | charge | RT | mz | RTstart | RTend | mzstart | mzend | quality | intensity | FWHM | spectrum_index | spectrum_native_id | label | score_correlation | score_fit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 | 2 | 1942.600083 | 395.239277 | 1932.484009 | 1950.834351 | 395.239199 | 397.245758 | 0.808494 | 157572000.0 | 10.061090 | 259 | spectrum=1270 | 168 | 0.989969 | 0.660286 |
18416216708636999474 | 2 | 1749.138335 | 443.711224 | 1735.693115 | 1763.343506 | 443.71112 | 445.717531 | 0.893553 | 54069300.0 | 14.156094 | 156 | spectrum=1167 | 169 | 0.999002 | 0.799234 |
df = feature_map.get_df(meta_values = [b'FWHM', b'label'])
df.head(2)
id | charge | RT | mz | RTstart | RTend | mzstart | mzend | quality | intensity | FWHM | label |
---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 | 2 | 1942.600083 | 395.239277 | 1932.484009 | 1950.834351 | 395.239199 | 397.245758 | 0.808494 | 157572000.0 | 10.061090 | 168 |
18416216708636999474 | 2 | 1749.138335 | 443.711224 | 1735.693115 | 1763.343506 | 443.71112 | 445.717531 | 0.893553 | 54069300.0 | 14.156094 | 169 |
Extract assigned peptide identifications from a feature map
Peptide identifications can be mapped to their corresponding features in a FeatureMap. They can be extracted with the function pyopenms.FeatureMap.get_assigned_peptide_identifications(), which returns a list of PeptideIdentification objects.
- pyopenms.FeatureMap.get_assigned_peptide_identifications()
Generates a list with peptide identifications assigned to a feature.
Adds 'ID_native_id' (feature spectrum id), 'ID_filename' (primary MS run path of corresponding ProteinIdentification) and 'feature_id' (unique ID of corresponding Feature) as meta values to the peptide hits. A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
Returns:
[PeptideIdentification]
list of PeptideIdentification objects
A DataFrame can be created from the resulting list of PeptideIdentification objects using pyopenms.peptide_identifications_to_df(assigned_peptides).
The feature map and peptide data frames share columns on which they can be merged, so that the complete information for peptides and features is available in a single data frame.
The columns for unambiguously merging the data frames are:
feature_id : the unique feature identifier
ID_native_id : the feature spectrum native identifier
ID_filename : the filename (primary MS run path) of the corresponding ProteinIdentification
Example:
feature_df = feature_map.get_df()
assigned_peptides = feature_map.get_assigned_peptide_identifications()
assigned_peptide_df = peptide_identifications_to_df(assigned_peptides)
merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
merged_df.head(2)
feature_id | peptide_sequence | peptide_score | ID_filename | ID_native_id | charge_x | RT_x | mz_x | RTstart | RTend | ... | id | RT_y | mz_y | q-value | charge_y | protein_accession | start | end | OMSSA_score | target_decoy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 | LVTDLTK | 0.000000 | unknown | spectrum=1270 | 2 | 1942.600083 | 395.239277 | 1932.484009 | 1950.834351 | ... | OMSSA_2009-11-17T11:11:11_4731105163044641872 | 1933.405151 | 395.239349 | 0.000000 | 2 | P02769\|ALBU_BOVIN | -1 | -1 | 0.001084 | True |
18416216708636999474 | DDSPDLPK | 0.034483 | unknown | spectrum=1167 | 2 | 1749.138335 | 443.711224 | 1735.693115 | 1763.343506 | ... | OMSSA_2009-11-17T11:11:11_4731105163044641872 | 1738.033447 | 443.711243 | 0.034483 | 2 | P02769\|ALBU_BOVIN | -1 | -1 | 0.003951 | True |
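Columns that exist in both data frames but are not merge keys (such as charge, RT and mz) receive the default pandas suffixes _x and _y, as seen in the table above. The sketch below uses hand-made stand-ins for feature_df and assigned_peptide_df (illustrative values, shortened hypothetical ids) to show how explicit suffixes and a left merge could be used. Whether features without an assigned identification should be kept is an assumption about the analysis, not something the pyOpenMS API prescribes.

```python
import pandas as pd

# Hand-made stand-ins for feature_map.get_df() and
# peptide_identifications_to_df(assigned_peptides) (illustrative values).
feature_df = pd.DataFrame({
    "feature_id": ["f1", "f2", "f3"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167", ""],
    "ID_filename": ["unknown", "unknown", ""],
    "charge": [2, 2, 3],
    "RT": [1942.6, 1749.1, 1800.0],
})
assigned_peptide_df = pd.DataFrame({
    "feature_id": ["f1", "f2"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "charge": [2, 2],
    "RT": [1933.4, 1738.0],
})

keys = ["feature_id", "ID_native_id", "ID_filename"]

# how="left" keeps features without an identification (the default inner merge drops them);
# suffixes replace the default _x/_y; indicator adds a "_merge" column with the row origin.
merged = pd.merge(
    feature_df,
    assigned_peptide_df,
    on=keys,
    how="left",
    suffixes=("_feature", "_peptide"),
    indicator=True,
)
```

In this sketch the third feature has no matching identification, so its peptide columns are NaN and its "_merge" value is "left_only".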
- pyopenms.ConsensusMap.get_df()
Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.
Returns:
pandas.DataFrame
consensus map meta data and intensity stored in pandas DataFrame
- pyopenms.ConsensusMap.get_intensity_df()
Generates a pandas DataFrame with feature intensities from each sample in long format (over files).
For labelled analyses, the channel intensities will be in one row, resulting in a semi-long/block format. The resulting DataFrame can be joined with the result from get_metadata_df by their index 'id'.
Returns:
pandas.DataFrame
intensity DataFrame
- pyopenms.ConsensusMap.get_metadata_df()
Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).
The resulting DataFrame can be joined with the result from get_intensity_df by their index 'id'.
Returns:
pandas.DataFrame
DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)
Examples:
urlretrieve(url+'ProteomicsLFQ_1_out.consensusXML', 'ProteomicsLFQ_1_out.consensusXML')
consensus_map = ConsensusMap()
ConsensusXMLFile().load('ProteomicsLFQ_1_out.consensusXML', consensus_map)
df = consensus_map.get_df()
df.head(2)
df = consensus_map.get_intensity_df()
df.head(2)
id | BSA1_F1.mzML | ... | BSA1_F2.mzML |
---|---|---|---|
2935923263525422257 | 0.0 | ... | 0.0 |
10409195546240342212 | 1358151.0 | ... | 0.0 |
df = consensus_map.get_metadata_df()
df.head(2)
id | sequence | charge | RT | mz | quality |
---|---|---|---|---|---|
2935923263525422257 | DGDIEAEISR | 3 | 1523.370634 | 368.843773 | 0.000000 |
10409195546240342212 | SHC(Carbamidomethyl)IAEVEK | 3 | 1552.032973 | 358.174576 | 0.491247 |
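Because both DataFrames are indexed by 'id', they can be combined with an index join. The sketch below uses hand-made stand-ins for the results of get_metadata_df() and get_intensity_df(), with shortened hypothetical ids and a subset of the columns shown above. It assumes a label-free layout with one intensity column per file.

```python
import pandas as pd

# Hand-made stand-ins for consensus_map.get_metadata_df() and
# consensus_map.get_intensity_df(), both indexed by 'id' (illustrative values).
metadata_df = pd.DataFrame(
    {"sequence": ["DGDIEAEISR", "SHC(Carbamidomethyl)IAEVEK"], "charge": [3, 3]},
    index=pd.Index(["cf1", "cf2"], name="id"),
)
intensity_df = pd.DataFrame(
    {"BSA1_F1.mzML": [0.0, 1358151.0], "BSA1_F2.mzML": [0.0, 0.0]},
    index=pd.Index(["cf1", "cf2"], name="id"),
)

# DataFrame.join aligns the rows on the shared index.
combined = metadata_df.join(intensity_df)
```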