paper update
maxibor committed Aug 30, 2019
1 parent 89b5cdf commit 068358e
Showing 3 changed files with 57 additions and 51 deletions.
54 changes: 46 additions & 8 deletions paper/paper.bib
@@ -1,7 +1,7 @@
@article{kraken,
title={Kraken: ultrafast metagenomic sequence classification using exact alignments},
author={Wood, Derrick E and Salzberg, Steven L},
journal={Genome biology},
journal={Genome Biology},
volume={15},
number={3},
pages={R46},
Expand Down Expand Up @@ -31,9 +31,9 @@ @article{metagenomics
doi={10.1038/455481a}
}
@article{scikit-learn,
title={Scikit-learn: Machine learning in Python},
title={{Scikit-learn: Machine learning in Python}},
author={Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others},
journal={Journal of machine learning research},
journal={Journal of Machine Learning Research},
volume={12},
number={Oct},
pages={2825--2830},
@@ -42,7 +42,7 @@ @article{scikit-learn
@article{platt,
title={Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods},
author={Platt, John and others},
journal={Advances in large margin classifiers},
journal={Advances in Large Margin Classifiers},
volume={10},
number={3},
pages={61--74},
@@ -52,7 +52,7 @@ @article{platt
@article{ete3,
title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data},
author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer},
journal={Molecular biology and evolution},
journal={Molecular Biology and Evolution},
volume={33},
number={6},
pages={1635--1638},
@@ -72,9 +72,9 @@ @article{wu
doi={10.1128/AEM.01996-06}
}
@article{tsne,
title={Visualizing data using t-SNE},
title={Visualizing data using {t-SNE}},
author={Maaten, Laurens van der and Hinton, Geoffrey},
journal={Journal of machine learning research},
journal={Journal of Machine Learning Research},
volume={9},
number={Nov},
pages={2579--2605},
@@ -92,7 +92,7 @@ @article{sourcetracker
doi={10.1038/nmeth.1650}
}
@article{bray-curtis,
title={An ordination of the upland forest communities of southern Wisconsin},
title={An ordination of the upland forest communities of {southern Wisconsin}},
author={Bray, J Roger and Curtis, John T},
journal={Ecological Monographs},
volume={27},
@@ -102,6 +102,44 @@ @article{bray-curtis
publisher={Wiley Online Library},
doi={10.2307/1942268}
}
@misc{scikit-bio,
author = {Jai Ram Rideout and
Greg Caporaso and
Evan Bolyen and
Daniel McDonald and
Yoshiki Vázquez Baeza and
Jorge Cañardo Alastuey and
Anders Pitman and
Jamie Morton and
Jose Navas and
Kestrel Gorlick and
Justine Debelius and
Zech Xu and
llcooljohn and
adamrp and
Joshua Shorenstein and
Laurent Luce and
Will Van Treuren and
John Chase and
charudatta-navare and
Colin Brislawn and
Antonio Gonzalez and
Weronika Patena and
Karen Schwarzberg and
teravest and
Jens Reeder and
shiffer1 and
nbresnick and
Kevin Murray and
alexbrc and
Karan Sharma},
title = {{biocore/scikit-bio: scikit-bio 0.5.5: More
compositional methods added}},
month = dec,
year = 2018,
doi = {10.5281/zenodo.2254379},
url = {https://doi.org/10.5281/zenodo.2254379}
}



54 changes: 11 additions & 43 deletions paper/paper.md
@@ -21,12 +21,12 @@ bibliography: paper.bib

# Summary

SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python package distributed through Conda, to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.
SourcePredict is a Python package distributed through Conda, to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.

DNA shotgun sequencing of human, animal, and environmental samples has opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [@metagenomics]. One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken [@kraken].

In cases where the origin of a metagenomic sample, its source, is unknown, it is often part of the research question to predict and/or confirm the source. For example, in microbial archaeology, it is sometimes necessary to rely on metagenomics to validate the source of paleofaeces.
Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, *i.e.* the organisms identified in the samples as features, and the sources of the samples as class labels.
Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, i.e., the organisms identified in the samples as features, and the sources of the samples as class labels.

With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition.

@@ -38,60 +38,28 @@ However, the Sourcepredict results are more easily interpreted since the samples

Starting with a numerical organism count matrix (samples as columns, organisms as rows, obtained by a taxonomic classifier) of merged references and sinks datasets, samples are first normalized relative to each other, to correct for uneven sequencing depth using the geometric mean of pairwise ratios (GMPR) method (default) [@gmpr].
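The normalization step above can be illustrated with a minimal sketch of GMPR size factors in plain Python. It assumes the usual reading of the method: for each pair of samples, take the median of the count ratios over organisms present in both, then take the geometric mean of those medians across samples. The function names `gmpr_size_factors` and `gmpr_normalize` and the dict-of-lists input layout are hypothetical and chosen for illustration; this is not Sourcepredict's own code.

```python
from math import exp, log
from statistics import median


def gmpr_size_factors(counts):
    """Return one size factor per sample.

    counts: dict mapping sample name -> list of counts, with organisms
    in the same order for every sample (hypothetical layout).
    """
    names = list(counts)
    factors = {}
    for i in names:
        log_medians = []
        for j in names:
            # Ratios are only taken over organisms observed in both samples.
            shared = [a / b for a, b in zip(counts[i], counts[j]) if a > 0 and b > 0]
            if shared:
                log_medians.append(log(median(shared)))
        if not log_medians:
            raise ValueError(f"sample {i!r} has no non-zero counts")
        # Geometric mean of the pairwise median ratios.
        factors[i] = exp(sum(log_medians) / len(log_medians))
    return factors


def gmpr_normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = gmpr_size_factors(counts)
    return {name: [c / factors[name] for c in row] for name, row in counts.items()}
```

With two samples where one was sequenced twice as deeply as the other, the size factors differ by a factor of two and the normalized counts of the shared organisms coincide.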

After normalization, Sourcepredict performs a two-step prediction algorithm. First, it predicts the proportion of unknown sources, *i.e.* which are not represented in the reference dataset. Second it predicts the proportion of each known source of the reference dataset in the sink samples.
After normalization, Sourcepredict performs a two-step prediction algorithm. First, it predicts the proportion of unknown sources, i.e., which are not represented in the reference dataset. Second, it predicts the proportion of each known source of the reference dataset in the sink samples.

Organisms are represented by their taxonomic identifiers (TAXID).

### Prediction of unknown sources proportion
### Prediction of the proportion of unknown sources

Let $S_i \in \{S_1, .., S_n\}$ be a sample from the normalized sinks dataset $D_{sink}$, $o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}$ be an organism in $S_i$, and $n_o^{\ i}$ be the total number of organisms in $S_i$, with $o_{j}^{\ i} \in \mathbb{Z}+$.
Let $S_i \in \{S_1, .., S_n\}$ be a sample from the normalized sinks dataset $D_{sink}$, $o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}$ an organism in $S_i$, and $n_o^{\ i}$ the total number of organisms in $S_i$, with $o_{j}^{\ i} \in \mathbb{Z}^{+}$. Let $m$ be the mean number of samples per source in the reference dataset, such that $m = \frac{1}{O}\sum_{i=1}^{O}S_i$. For each $S_i$ sample, I define $||m||$ derivative samples $U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}$ to add to the reference dataset to account for the unknown source proportion in a test sample. Separately for each $S_i$, a proportion denoted $\alpha \in [0,1]$ (default = $0.1$) of each $o_{j}^{\ i}$ organism of $S_i$ is added to each $U_k^{S_i}$ sample such that $U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}$, where $x_{i \ j}$ is sampled from a Gaussian distribution $\mathcal{N}\big(S_i(o_j^{\ i}), 0.01\big)$. The $||m||$ $U_k^{S_i}$ samples are then added to the reference dataset $D_{ref}$, and labeled as *unknown*, to create a new reference dataset denoted ${}^{unk}D_{ref}$. To predict the proportion of unknown sources, a Bray-Curtis [@bray-curtis] pairwise dissimilarity matrix of all $S_i$ and $U_k^{S_i}$ samples is computed using scikit-bio [@scikit-bio]. This distance matrix is then embedded in two dimensions (default) with the scikit-bio implementation of PCoA. This sample embedding is divided into three subsets: ${}^{unk}D_{train}$ ($64\%$), ${}^{unk}D_{test}$ ($20\%$), and ${}^{unk}D_{validation}$ ($16\%$). The scikit-learn [@scikit-learn] implementation of the KNN algorithm is then trained on ${}^{unk}D_{train}$, and the training accuracy is computed with ${}^{unk}D_{test}$. This trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with ${}^{unk}D_{validation}$.
The proportion of unknown sources in $S_i$, $p_u \in [0,1]$, is then estimated using this trained and corrected KNN model. Ultimately, this process is repeated independently for each sink sample $S_i$ of $D_{sink}$.
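The construction of the derivative *unknown* samples and the Bray-Curtis dissimilarity described above can be sketched in plain Python. This is only an illustration under stated assumptions: the second parameter of the Gaussian ($0.01$) is treated as a standard deviation, negative draws are clipped to zero so the values stay valid abundances, and the names `make_unknown_samples` and `bray_curtis` are hypothetical. The PCoA embedding and the calibrated KNN steps, which the text attributes to scikit-bio and scikit-learn, are not reproduced here.

```python
import random


def make_unknown_samples(sink, n_derived, alpha=0.1, sd=0.01, seed=0):
    """Build n_derived 'unknown' samples from one sink sample.

    Each value is alpha * x, with x drawn from a Gaussian centred on the
    sink count. Treating 0.01 as a standard deviation and clipping at
    zero are assumptions of this sketch.
    """
    rng = random.Random(seed)
    return [
        [max(0.0, alpha * rng.gauss(count, sd)) for count in sink]
        for _ in range(n_derived)
    ]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum |u - v| / sum (u + v)."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


sink = [100.0, 50.0, 0.0, 25.0]
unknowns = make_unknown_samples(sink, n_derived=3)
distances = [bray_curtis(sink, u) for u in unknowns]
```

Because each derivative sample keeps only about a tenth of the sink's abundance with the default $\alpha$, its Bray-Curtis dissimilarity to the sink is large (close to $0.82$ here) even though the relative composition is unchanged.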

Let $m$ be the mean number of samples per source in the reference dataset, such that $m = \frac{1}{O}\sum_{i=1}^{O}S_i$.
### Prediction of the proportion of known sources

For each $S_i$ sample, I define $||m||$ derivative samples $U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}$ to add to the reference dataset to account for the unknown source proportion in a test sample.
First, only organism TAXIDs corresponding to the species taxonomic level are retained using the ETE toolkit [@ete3]. A weighted UniFrac (default) [@wu] pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree. This distance matrix is then embedded in two dimensions (default) using the scikit-learn implementation of t-SNE [@tsne]. The 2-dimensional embedding is then split back into the training dataset ${}^{tsne}D_{ref}$ and the testing dataset ${}^{tsne}D_{sink}$. The KNN algorithm is then trained on the train subset, with five-fold (default) cross-validation to look for the optimum number of K-neighbors.
The training dataset ${}^{tsne}D_{ref}$ is further divided into three subsets: ${}^{tsne}D_{train}$ ($64\%$), ${}^{tsne}D_{test}$ ($20\%$), and ${}^{tsne}D_{validation}$ ($16\%$). The training accuracy is then computed with ${}^{tsne}D_{test}$. Finally, this second trained KNN model is also corrected for source proportion estimation using the scikit-learn implementation of Platt's method with ${}^{tsne}D_{validation}$. The proportion $p_{c_s} \in [0,1]$ of each of the $n_s$ sources $c_s \in \{c_{1},\ ..,\ c_{n_s}\}$ in each sample $S_i$ is then estimated using this second trained and corrected KNN model.
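The $64\%$ / $20\%$ / $16\%$ proportions quoted above are what two successive 80/20 splits produce: $20\%$ is held out for testing, and $20\%$ of the remaining $80\%$ ($0.8 \times 0.2 = 0.16$) is held out for calibration. The text gives only the percentages, so the two-stage procedure below is an assumption, and the function name `split_64_20_16` is hypothetical.

```python
import random


def split_64_20_16(items, seed=0):
    """Split items into train, test and validation subsets.

    Assumes two successive 80/20 splits: 20% test first, then 20% of
    the remainder (16% of the total) for validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(0.2 * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_validation = round(0.2 * len(rest))
    validation, train = rest[:n_validation], rest[n_validation:]
    return train, test, validation


train, test, validation = split_64_20_16(range(100))
```

Keeping the validation subset separate from the training subset means the Platt scaling step is fitted on points the KNN model has not seen.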

Separately for each $S_i$, a proportion denoted $\alpha \in [0,1]$ (default = $0.1$) of each of the $o_{j}^{\ i}$ organism of $S_i$ is added to each $U_k^{S_i}$ samples such that $U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}$ , where $x_{i \ j}$ is sampled from a Gaussian distribution $\mathcal{N}\big(S_i(o_j^{\ i}), 0.01)$.
### Combining unknown and source proportions

The $||m||$ $U_k^{S_i}$ samples are then added to the reference dataset $D_{ref}$, and labeled as *unknown*, to create a new reference dataset denoted ${}^{unk}D_{ref}$.

To predict the proportion of unknown sources, a Bray-Curtis [@bray-curtis] pairwise dissimilarity matrix of all $S_i$ and $U_k^{S_i}$ samples is computed using scikit-bio. This distance matrix is then embedded in two dimensions (default) with the scikit-bio implementation of PCoA.

This sample embedding is divided into three subsets: ${}^{unk}D_{train}$ ($64\%$), ${}^{unk}D_{test}$ ($20\%$), and ${}^{unk}D_{validation}$($16\%$).

The scikit-learn implementation of KNN algorithm is then trained on ${}^{unk}D_{train}$, and the training accuracy is computed with ${}^{unk}D_{test}$.

This trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with ${}^{unk}D_{validation}$.

The proportion of unknown sources in $S_i$, $p_u \in [0,1]$ is then estimated using this trained and corrected KNN model.

Ultimately, this process is repeated independantly for each sink sample $S_i$ of $D_{sink}$.

### Prediction of known source proportion

First, only organism TAXIDs corresponding to the species taxonomic level are retained using the ETE toolkit [@ete3].
A weighted Unifrac (default) [@wu] pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree.

This distance matrix is then embedded in two dimensions (default) using the scikit-learn implementation of t-SNE [@tsne].

The 2-dimensional embedding is then split back to training ${}^{tsne}D_{ref}$ and testing dataset ${}^{tsne}D_{sink}$.

The KNN algorithm is then trained on the train subset, with a five (default) cross validation to look for the optimum number of K-neighbors.
The training dataset ${}^{tsne}D_{ref}$ is further divided into three subsets: ${}^{tsne}D_{train}$ ($64\%$), ${}^{tsne}D_{test}$ ($20\%$), and ${}^{tsne}D_{validation}$ ($16\%$).

The training accuracy is then computed with ${}^{tsne}D_{test}$.
Finally, this second trained KNN model is also corrected for source proportion estimation using the scikit-learn implementation of the Platt's method with ${}^{tsne}D_{validation}$.

The proportion $p_{c_s} \in [0,1]$ of each of the $n_s$ sources $c_s \in \{c_{1},\ ..,\ c_{n_s}\}$ in each sample $S_i$ is then estimated using this second trained and corrected KNN model.

### Combining unknown and source proportion

Then for each sample $S_i$ of the test dataset $D_{sink}$, the predicted unknown proportion $p_{u}$ is then combined with the predicted proportion $p_{c_s}$ for each of the $n_s$ sources $c_s$ of the training dataset such that $\sum_{c_s=1}^{n_s} s_c + p_u = 1$ where $s_c = p_{c_s} \cdot p_u$.
For each sample $S_i$ of the test dataset $D_{sink}$, the predicted unknown proportion $p_{u}$ is then combined with the predicted proportion $p_{c_s}$ for each of the $n_s$ sources $c_s$ of the training dataset such that $\sum_{c_s=1}^{n_s} s_c + p_u = 1$ where $s_c = p_{c_s} \cdot p_u$.

Finally, a summary table gathering the estimated sources proportions is returned as a `csv` file, as well as the t-SNE embedding sample coordinates.
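The combination step described above can be sketched as follows. The sketch assumes that the per-source proportions $p_{c_s}$ sum to $1$ and rescales them by the share left over after the unknown proportion, $1 - p_u$, since that is the scaling under which the stated constraint $\sum_{c_s=1}^{n_s} s_c + p_u = 1$ holds. This reading is an assumption and is not confirmed by the text, which writes the factor as $p_u$; the function name `combine_proportions` is hypothetical.

```python
def combine_proportions(p_unknown, p_sources):
    """Combine the unknown proportion with per-source proportions.

    p_unknown: predicted unknown proportion, in [0, 1].
    p_sources: dict mapping source label -> proportion, summing to 1.
    Scaling by (1 - p_unknown) is an assumption of this sketch.
    """
    known_share = 1.0 - p_unknown
    combined = {source: p * known_share for source, p in p_sources.items()}
    combined["unknown"] = p_unknown
    return combined


result = combine_proportions(0.2, {"human": 0.75, "dog": 0.25})
```

Under this assumption, a sink with $p_u = 0.2$ and source proportions of $0.75$ and $0.25$ yields final proportions of roughly $0.6$, $0.2$ and $0.2$.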

## Acknowledgements

Thanks to Dr. Christina Warinner, Dr. Alexander Herbig, Dr. AB Rohrlach, and Alexander Hübner for their valuable comments and for proofreading this manuscript.
Thanks to Dr.\ Christina Warinner, Dr.\ Alexander Herbig, Dr.\ AB Rohrlach, and Alexander Hübner for their valuable comments and for proofreading this manuscript.
This work was funded by the Max Planck Society and the Deutsche Forschungsgemeinschaft, project code: EXC 2051 #390713860.

# References
Binary file removed paper/paper.pdf
Binary file not shown.
