# Tutorial of Aligstein usage

In this tutorial we show how to use the Alignstein package by reproducing the biomarkes detection experiment from original paper.

Start with importing all needed packages

In [7]:
import gc

import numpy as np

from Alignstein import (gather_mids, precluster_mids,
                        big_clusters_to_clusters, find_consensus_features,
                        detect_features_from_file, features_to_weight)

Determination of memory status is not supported on this 
 platform, measuring for memoryleaks will never fail


We start with obtaining datasets to be analysed. Create `data` directory and download chromatograms from [PRIDE repository](https://www.ebi.ac.uk/pride/archive/projects/PXD013805) as below. It may took some time, thus it is commented.

Files enumerated from 38 to 39 represent replicates of experiment for 0 $\mu$g/L, 40-42 represent 5 $\mu$g/L, 43-45 represent 50 $\mu$g/L, 46-48 represent 100 $\mu$g/L.

In [2]:
# !mkdir data
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS37.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS38.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS39.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS40.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS41.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS42.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS43.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS44.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS42.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS43.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS44.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS45.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS46.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS47.mzML
# !wget https://ftp.ebi.ac.uk/pride-archive/2019/07/PXD013805/20171124VS48.mzML

We perform analysis for replicates of 0 $\mu$g/L experiment, but it can easily reproduced for the rest experiments.

Thus, prepare filenames, which will usable for us.

In [5]:
filenames = [
    "data/20171124VS37.mzML",
    "data/20171124VS38.mzML",
    "data/20171124VS39.mzML"
]

Detect features in chromatograms. Alignstein uses Feature Finder algorithm from pyOpenMS in centroided mode. It may take some time and logging may be not fully visible in notebook.

In [8]:
feature_sets_list = []
for fname in filenames:
    feature_sets_list.append(detect_features_from_file(fname))
    # Generating features may took significant amount of memory not longer used
    # It's better to clear cached objects before further run.
    gc.collect()

RuntimeError: the file 'data/20171124VS37.mzML' could not be found

We scale features' RT by factor proportional (by `SCALE_FACTOR`) to ratio of everage feature width and length. Aim of this scaling is to obatin M/Z axis and RT axis at the same order of magnitude. Thus, further we won't use parameter maximum RT distance below which features are matched. Instead we will talk about maximum feature distance in both dimensions expressed in M/Z order of manitude (Daltons).

We scale all datasets by the same factor. This results in different scaling of every dataset, but allows more precise matching.

In [11]:
# Scale by average weight
SCALE_FACTOR = 5 # Something between 5 and usually work fine but it highly depends on properties of your dataset.

weights = [features_to_weight(f_set) for f_set in feature_sets_list]
average_weight = np.mean(weights)

scale = average_weight * SCALE_FACTOR

print("Weights:", weights, "\n", "Average weight", average_weight)

for feature_set in feature_sets_list:
    for feature in feature_set:
        feature.scale_rt(scale)

Weights: [] 
 Average weight nan


Start with the first phase - clustering. `gather_mids` function collects feature centroids to be further clustered, then centroid are clustered into several (8-16) areas of by `precluster_mids` function. Finally the main clustering is done by `big_clusters_to_clusters`. This two-step clustering is crucial for proper memory handling.

`distance_threshold` parameters controls maximum distance of centroids in one cluster. It is expressed in M/Z order of magnitude (Daltons). The distance is expressed as $\ell_1$ distance, so it should by about 2 times maximum M/Z variability (to incorporate variability of both M/Z and RT).

In [12]:
mids = gather_mids(feature_sets_list)
gc.collect()
big_clusters = precluster_mids(mids)

clusters = big_clusters_to_clusters(mids, big_clusters, distance_threshold=0.4)

ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.

And finally we do the matching and consensus feature creation. It is done by `find_consensus_features` function. The most important parameters are:
- `centroid_upper_bound` which controls the maximum centroid distance for which GWD is computed. For efficiency reasons should be reasonably small, but should not be singnificantly smaller than the maximum distance for which we want to match features;
- `gwd_upper_bound` which controls is a parameter of GWD computing (aka. lambda parameter) and allows to omit transporting singal over distance equal to `gwd_upper_bound`. Should be big enough so that the most distant but matchable features are still comparable;
- `matching_penalty` - which is penalty for feature not matching. Can be interpreted as maximum distance so that features still should be matched. Above this threshold features are considered as to distant to be matched.
- `turns` - in one feature matching not all features may be matched, because features are limited to matched at most one to one cluster. Still, there can be more possible features to be matched which can be matched in next turns. Usually, 2-3 turns are enough, algorithm loops next turn iff there are features which can be matched.



In [13]:
consensus_features = find_consensus_features(
    clusters,
    feature_sets_list,
    centroid_upper_bound=15, gwd_upper_bound=15, matching_penalty=1, turns=10)

NameError: name 'clusters' is not defined

Finally dump obained consensus features to `consensus_fetures.out` file with regard to initial features locations.

In [14]:
dump_consensus_features(consensus_features, "consensus_fetures.out",
                        feature_sets_list)

NameError: name 'dump_consensus_features' is not defined